initial commit

RaymondWang0 2022-08-26 17:42:09 +00:00
commit c71768bb55
823 changed files with 276191 additions and 0 deletions

5
.clang-format Normal file

@@ -0,0 +1,5 @@
BasedOnStyle: Google
ColumnLimit: 120
ContinuationIndentWidth: 4
IndentWidth: 4
TabWidth: 4

4
.gitignore vendored Normal file

@@ -0,0 +1,4 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

6
.gitmodules vendored Normal file

@@ -0,0 +1,6 @@
[submodule "mcunet"]
path = mcunet
url = https://github.com/mit-han-lab/mcunet.git
[submodule "TinyEngine/third_party/CMSIS"]
path = TinyEngine/third_party/CMSIS
url = https://github.com/ARM-software/CMSIS_5.git

51
.pre-commit-config.yaml Normal file

@@ -0,0 +1,51 @@
exclude: "code_generator/tflite/.*"
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: trailing-whitespace
      - id: mixed-line-ending
        args: ["--fix=lf"]
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: requirements-txt-fixer
      - id: fix-encoding-pragma
        args: ["--remove"]
      - id: debug-statements
      - id: check-toml
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.10
    hooks:
      - id: mdformat
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
        args: ["--sp", "pyproject.toml"]
  - repo: https://github.com/pycqa/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
        additional_dependencies:
          - flake8-comprehensions==3.7.0
          - flake8-docstrings==1.6.0
  - repo: local
    hooks:
      - id: pylint
        name: pylint
        entry: pylint
        language: system
        types: [python]
        require_serial: true
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.910-1
    hooks:
      - id: mypy
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v13.0.0
    hooks:
      - id: clang-format

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2022 MIT HAN Lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

224
README.md Normal file

@@ -0,0 +1,224 @@
# TinyEngine
This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for microcontrollers.
TinyEngine is part of MCUNet, which also includes TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers; TinyEngine and TinyNAS are co-designed to fit the tight memory budgets.
**The MCUNet and TinyNAS repo is [here](https://github.com/mit-han-lab/mcunet).**
### [MCUNetV1](https://mcunet.mit.edu/#mcunetv1) | [MCUNetV2](https://mcunet.mit.edu/#mcunetv2) | [MCUNetV3](https://mcunet.mit.edu/#mcunetv3)
### [Demo (Inference)](https://www.youtube.com/watch?v=YvioBgtec4U)
![demo](assets/figures/mcunet_demo.gif)
### [Demo (Training)](https://www.youtube.com/watch?v=XaDCO8YtmBw)
![demo_v3](assets/figures/mcunetV3_demo_2images.gif)
## News
We will soon release the **Tiny Training Engine** used in [MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3). **If you are interested in getting updates, please sign up [here](https://forms.gle/UW1uUmnfk1k6UJPPA) to get notified!**
- **(2022/08)** Our **New Course on TinyML and Efficient Deep Learning** will be released in September 2022: [efficientml.ai](https://efficientml.ai/).
- **(2022/08)** We include the [demo tutorial](tutorial) for deploying a visual wake word (VWW) model onto microcontrollers.
- **(2022/08)** We open-source the TinyEngine repo.
- **(2022/07)** We include the person detection model used in the video demo above in the [MCUNet repo](https://github.com/mit-han-lab/mcunet).
- **(2022/06)** We refactor the [MCUNet repo](https://github.com/mit-han-lab/mcunet) as a standalone repo (previous repo: https://github.com/mit-han-lab/tinyml)
- **(2021/10)** **MCUNetV2** is accepted to NeurIPS 2021: https://arxiv.org/abs/2110.15352 !
- **(2020/10)** **MCUNet** is accepted to NeurIPS 2020 as **spotlight**: https://arxiv.org/abs/2007.10319 !
- Our projects are covered by: [MIT News](https://news.mit.edu/2020/iot-deep-learning-1113), [MIT News (v2)](https://news.mit.edu/2021/tiny-machine-learning-design-alleviates-bottleneck-memory-usage-iot-devices-1208), [WIRED](https://www.wired.com/story/ai-algorithms-slimming-fit-fridge/), [Morning Brew](https://www.morningbrew.com/emerging-tech/stories/2020/12/07/researchers-figured-fit-ai-ever-onto-internet-things-microchips), [Stacey on IoT](https://staceyoniot.com/researchers-take-a-3-pronged-approach-to-edge-ai/), [Analytics Insight](https://www.analyticsinsight.net/amalgamating-ml-and-iot-in-smart-home-devices/), [Techable](https://techable.jp/archives/142462), etc.
## Overview
Microcontrollers are low-cost, low-power hardware. They are widely deployed across a broad range of applications, but their tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.
MCUNet is a **system-algorithm co-design** framework for tiny deep learning on microcontrollers. It consists of **TinyNAS** and **TinyEngine**. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.
![overview](assets/figures/overview.png)
Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts memory scheduling to the overall network topology rather than optimizing layer by layer, reducing memory usage and accelerating inference. It outperforms existing inference libraries such as [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers) from Google, [CMSIS-NN](https://arxiv.org/abs/1801.06601) from Arm, and [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) from STMicroelectronics.
TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.
* [**In-place depth-wise convolution**](https://mcunet.mit.edu/#mcunetv1): A unique data placement technique for depth-wise convolution that overwrites input data by intermediate/output data to reduce peak SRAM memory.
* [**Operator fusion**](https://docs.microsoft.com/en-us/windows/ai/directml/dml-fused-activations): A method that improves performance by merging one operator into a different operator so that they are executed together without requiring a roundtrip to memory.
* [**SIMD (Single instruction, multiple data) programming**](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data): A computing method that performs the same operation on multiple data points simultaneously.
* [**HWC to CHW weight format transformation**](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html): A weight format transformation technique that increases cache hit ratio for in-place depth-wise convolution.
* [**Image to Column (Im2col) convolution**](https://iq.opengenus.org/im2col/): An implementation technique of computing convolution operation using general matrix multiplication (GEMM) operations.
* [**Loop reordering**](https://xilinx.github.io/Vitis_Accel_Examples/2019.2/html/loop_reorder.html): A loop transformation technique that attempts to optimize a program's execution speed by reordering/interchanging the sequence of loops.
* [**Loop unrolling**](https://en.wikipedia.org/wiki/Loop_unrolling): A loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as the space-time tradeoff (a minimal sketch follows this list).
* [**Loop tiling**](https://en.wikipedia.org/wiki/Loop_nest_optimization): A loop transformation technique that attempts to reduce memory access latency by partitioning a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused.
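To make the loop-level techniques above concrete, here is a minimal, self-contained C sketch (illustrative only, not TinyEngine source code): it contrasts a plain int8 dot product, the innermost building block of the Im2col + GEMM convolution above, with a version unrolled by four accumulators. The function names are made up for illustration.
```c
#include <stdint.h>
#include <stdio.h>

/* Reference: one multiply-accumulate per loop iteration. */
static int32_t dot_ref(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by 4: fewer loop-overhead instructions per multiply-accumulate,
 * and independent accumulators the compiler can schedule in parallel. */
static int32_t dot_unroll4(const int8_t *a, const int8_t *b, int n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) acc0 += (int32_t)a[i] * b[i];  /* leftover tail */
    return acc0 + acc1 + acc2 + acc3;
}

int main(void) {
    int8_t a[10] = {1, -2, 3, -4, 5, -6, 7, -8, 9, -10};
    int8_t b[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    printf("%d %d\n", (int)dot_ref(a, b, 10), (int)dot_unroll4(a, b, 10));
    return 0;
}
```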
![inplace_depthwise](assets/figures/inplace_depthwise.png)
By adopting the above-mentioned optimization techniques, TinyEngine not only enhances inference speed but also reduces peak memory, as shown in the figures below.
**MAC/s improvement breakdown:**
![mac_result](assets/figures/mac_result.png)
**Peak memory reduction:**
![peakmem_result](assets/figures/peakmem_result.png)
To sum up, our **TinyEngine** inference engine could be a useful infrastructure for MCU-based AI applications. It significantly **improves the inference speed and reduces the memory usage** compared to existing libraries like [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers), [CMSIS-NN](https://arxiv.org/abs/1801.06601), [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html), etc. It improves the inference speed by **1.1-18.6x**, and reduces the peak memory by **1.3-3.6x**.
![measured_result](assets/figures/measured_result.png)
## Code Structure
`code_generator` contains a Python library that is used to compile neural networks into low-level source code (C/C++).
`TinyEngine` contains a C/C++ library that implements operators and performs inference on microcontrollers.
`examples` contains examples of transforming TFLite models into our TinyEngine models.
`tutorial` contains the demo tutorial of deploying a visual wake word (VWW) model onto microcontrollers.
`assets` contains misc assets.
## Requirement
- Python 3.6+
- STM32CubeIDE 1.5+
## Setup for Users
First, clone this repository:
```bash
git clone --recursive https://github.com/mit-han-lab/tinyengine.git
```
(Optional) Using a virtual environment with `conda` is recommended.
```bash
conda create -n tinyengine python=3.6 pip
conda activate tinyengine
```
Install dependencies:
```bash
pip install -r requirements.txt
```
## Setup for Developers
Install pre-commit hooks to automatically format changes in your code.
```bash
pre-commit install
```
## Deployment Example
Please see [tutorial](tutorial) to learn how to deploy a visual wake word (VWW) model onto microcontrollers by using TinyEngine.
## Measured Results
- All the tflite models are from the [Model Zoo in the MCUNet repo](https://github.com/mit-han-lab/mcunet#model-zoo). Please see the MCUNet repo for instructions on building the pre-trained int8-quantized models in TF-Lite format.
- All the **latency**, **peak memory (SRAM)** and **Flash memory usage** results are profiled on STM32F746G-DISCO discovery boards.
- Note that we measure newer versions of the libraries in this repo, so the results here may differ from those in the MCUNet papers.
- Since TF-Lite Micro no longer has version numbers, we use the git commit ID to indicate the newer version.
- All the tflite models are compiled with the `-Ofast` optimization level in STM32CubeIDE.
- OOM denotes Out Of Memory.
The **latency** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 624ms | 2346ms | 269ms | 137ms | 128ms |
| mcunet-10fps-vww | 345ms | 1230ms | 143ms | 76ms | 66ms |
| mcunet-320kB-vww | OOM | OOM | OOM | 657ms | 570ms |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 149ms | 135ms |
| mcunet-10fps | OOM | OOM | OOM | 84ms | 62ms |
| mcunet-256kB | OOM | OOM | OOM | 839ms | 681ms |
| mcunet-320kB | OOM | OOM | OOM | OOM | 819ms |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 292ms |
| proxyless-320kB | OOM | OOM | OOM | 484ms | 425ms |
The **peak memory (SRAM)** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 227kB | 220kB | 248kB | 123kB | 88kB |
| mcunet-10fps-vww | 169kB | 163kB | 199kB | 98kB | 56kB |
| mcunet-320kB-vww | OOM | OOM | OOM | 259kB | 162kB |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 126kB | 90kB |
| mcunet-10fps | OOM | OOM | OOM | 76kB | 45kB |
| mcunet-256kB | OOM | OOM | OOM | 311kB | 200kB |
| mcunet-320kB | OOM | OOM | OOM | OOM | 242kB |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 284kB |
| proxyless-320kB | OOM | OOM | OOM | 312kB | 242kB |
The **Flash memory usage** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 782kB | 733kB | 743kB | 534kB | 517kB |
| mcunet-10fps-vww | 691kB | 643kB | 653kB | 463kB | 447kB |
| mcunet-320kB-vww | OOM | OOM | OOM | 773kB | 742kB |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 737kB | 720kB |
| mcunet-10fps | OOM | OOM | OOM | 856kB | 837kB |
| mcunet-256kB | OOM | OOM | OOM | 850kB | 827kB |
| mcunet-320kB | OOM | OOM | OOM | OOM | 835kB |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 828kB |
| proxyless-320kB | OOM | OOM | OOM | 866kB | 835kB |
## Citation
If you find the project helpful, please consider citing our paper:
```
@article{
lin2020mcunet,
title={Mcunet: Tiny deep learning on iot devices},
author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
@inproceedings{
lin2021mcunetv2,
title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
year={2021}
}
@inproceedings{
lin2022ondevice,
title={On-Device Training Under 256KB Memory},
author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
booktitle={ArXiv},
year={2022}
}
```
## Related Projects
[MCUNet: Tiny Deep Learning on IoT Devices](https://mcunet.mit.edu/#mcunetv1) (NeurIPS'20)
[MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning](https://mcunet.mit.edu/#mcunetv2) (NeurIPS'21)
[MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3)


@@ -0,0 +1,236 @@
/*
* Copyright (C) 2010-2022 Arm Limited or its affiliates.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nnfunctions_modified.h
* Description: Public header file for TinyEngine.
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nnfunctions.h
*
* Target Processor: Cortex-M CPUs
* -------------------------------------------------------------------- */
/**
\mainpage CMSIS NN Software Library
*
* Introduction
* ------------
*
* This user manual describes the CMSIS NN software library,
* a collection of efficient neural network kernels developed to maximize the
* performance and minimize the memory footprint of neural networks on Cortex-M processor cores.
*
* The library is divided into a number of functions each covering a specific category:
* - Convolution Functions
* - Activation Functions
* - Fully-connected Layer Functions
* - SVDF Layer Functions
* - Pooling Functions
* - Softmax Functions
* - Basic math Functions
*
* The library has separate functions for operating on different weight and activation data
* types, including 8-bit integers (q7_t) and 16-bit integers (q15_t). The description of the
* kernels is included in the function description. The implementation details are also
* described in this paper [1].
*
* Function Classification
* --------
* The functions can be classified into two segments:
* - Legacy functions supporting ARM's internal symmetric quantization (8 bits).
* - Functions that support the TensorFlow Lite framework with symmetric quantization (8 bits).
*
* The legacy functions can be identified by their _q7 or _q15 suffix, and no new development is done on them.
* The article in [2] describes in detail how to run a network using the legacy functions.
*
* The functions supporting the TensorFlow Lite framework are identified by the _s8 suffix and can be invoked from TFL
* Micro. The functions are bit-exact to TensorFlow Lite. Refer to TensorFlow's documentation in [3] on how to run
* a TensorFlow Lite model using optimized CMSIS-NN kernels.
*
* Block Diagram
* --------
* \image html CMSIS-NN-OVERVIEW.PNG
*
* Examples
* --------
*
* The library ships with a number of examples which demonstrate how to use the library functions.
*
* Pre-processor Macros
* ------------
*
* Each library project has different pre-processor macros.
*
* - ARM_MATH_DSP:
*
* Define the macro ARM_MATH_DSP if the silicon supports DSP instructions (DSP extension). A small illustrative sketch follows this comment block.
*
* - ARM_MATH_MVEI:
*
* Define the macro ARM_MATH_MVEI if the silicon supports the M-Profile Vector Extension.
* - ARM_MATH_AUTOVECTORIZE
* Used in conjunction with ARM_MATH_MVEI to let the compiler auto-vectorize the functions that use inline
* assembly. It does not affect functions that use C or intrinsics.
* - ARM_MATH_BIG_ENDIAN:
*
* Define the macro ARM_MATH_BIG_ENDIAN to build the library for big-endian targets. This is supported only for the legacy
* functions, i.e., functions targeted at TensorFlow Lite do not support big-endianness. By default the library builds for
* little-endian targets.
*
* - ARM_NN_TRUNCATE:
*
* Define macro ARM_NN_TRUNCATE to use floor instead of round-to-the-nearest-int for the computation.
*
*
* Copyright Notice
* ------------
*
* Copyright (C) 2010-2019 Arm Limited. All rights reserved.
*
* [1] CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs https://arxiv.org/abs/1801.06601
*
* [2] Converting a Neural Network for Arm Cortex-M with CMSIS-NN
*
https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page
* [3] https://www.tensorflow.org/lite/microcontrollers/library
*
* [4] https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis
*/
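/* Illustrative sketch only (not part of the original header): how the
 * ARM_MATH_DSP macro described above typically gates a DSP-extension code
 * path, with an equivalent plain-C fallback. The helper name below is made
 * up for illustration. */
static inline int32_t example_dual_mac_q15x2(int32_t packed_a, int32_t packed_b, int32_t acc)
{
#if defined(ARM_MATH_DSP)
    /* SMLAD: two signed 16-bit multiplies, both accumulated into acc. */
    return __SMLAD(packed_a, packed_b, acc);
#else
    /* Plain C equivalent: split each word into its two signed 16-bit halves. */
    int16_t a_lo = (int16_t)(packed_a & 0xFFFF), a_hi = (int16_t)(packed_a >> 16);
    int16_t b_lo = (int16_t)(packed_b & 0xFFFF), b_hi = (int16_t)(packed_b >> 16);
    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
#endif
}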
/**
* @defgroup groupNN Neural Network Functions
* A collection of functions to perform basic operations for neural network layers. Functions with a _s8 suffix support
* TensorFlow Lite framework.
*/
#ifndef _ARM_NNFUNCTIONS_H
#define _ARM_NNFUNCTIONS_H
#include "arm_nn_math_types.h"
#include "arm_nn_types.h"
#include "arm_nnsupportfunctions.h"
#define USE_INTRINSIC
//#define ARM_NN_TRUNCATE /* This config the rounding model to floor or round to the nearest int */
#ifdef __cplusplus
extern "C" {
#endif
/**
* @defgroup NNConv Convolution Functions
*
* Collection of convolution, depthwise convolution functions and their variants.
*
* The convolution is implemented in 2 steps: im2col and GEMM
*
* im2col is a process of converting each patch of image data into
* a column. After im2col, the convolution is computed as matrix-matrix
* multiplication.
*
* To reduce the memory footprint, the im2col is performed partially.
* In each iteration, only a few columns (i.e., patches) are generated and
* computed with GEMM kernels similar to CMSIS-DSP arm_mat_mult functions.
*
*/
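/* Illustrative sketch only (not part of the original header): the im2col step
 * described above, written out for a single output position. One kernel_y x
 * kernel_x patch of an HWC int8 image is copied, with the input offset added,
 * into one column of a q15 buffer; a GEMM kernel then multiplies the weight
 * matrix against such columns. The names and the HWC indexing here are
 * assumptions for illustration. */
static void example_im2col_one_patch(const int8_t *input, int input_x, int input_y, int input_ch,
                                     int out_x, int out_y, int kernel_x, int kernel_y,
                                     int pad_x, int pad_y, int stride_x, int stride_y,
                                     int32_t input_offset, int16_t *col)
{
    for (int ky = 0; ky < kernel_y; ky++) {
        for (int kx = 0; kx < kernel_x; kx++) {
            const int in_y = out_y * stride_y - pad_y + ky;
            const int in_x = out_x * stride_x - pad_x + kx;
            for (int c = 0; c < input_ch; c++) {
                if (in_y < 0 || in_y >= input_y || in_x < 0 || in_x >= input_x)
                    *col++ = 0; /* padded pixels equal the zero point, so the offset-shifted value is 0 */
                else
                    *col++ = (int16_t)(input[(in_y * input_x + in_x) * input_ch + c] + input_offset);
            }
        }
    }
}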
arm_status arm_convolve_s8_4col(const q7_t *input,
const uint16_t input_x,
const uint16_t input_y,
const uint16_t input_ch,
const uint16_t input_batches,
const q7_t *kernel,
const uint16_t output_ch,
const uint16_t kernel_x,
const uint16_t kernel_y,
const uint16_t pad_x,
const uint16_t pad_y,
const uint16_t stride_x,
const uint16_t stride_y,
const int32_t *bias,
q7_t *output,
const int32_t *output_shift,
const int32_t *output_mult,
const int32_t out_offset,
const int32_t input_offset,
const int32_t out_activation_min,
const int32_t out_activation_max,
const uint16_t output_x,
const uint16_t output_y,
q15_t *buffer_a);
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0);
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0);
q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0,
q15_t *kbuf);
#ifdef __cplusplus
}
#endif
#endif


@@ -0,0 +1,27 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: detectionUtility.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_DETECTIONUTILITY_H_
#define TINYENGINE_INCLUDE_DETECTIONUTILITY_H_
int postProcessing(signed char *input, unsigned char* runtime_buffer,
int y_zero, float y_scale, int shape_x, int shape_y, int shape_c, int resolution,
int width, int height , float conf_thresh, float out_boxes[10][6]);
#endif /* TINYENGINE_INCLUDE_DETECTIONUTILITY_H_ */


@@ -0,0 +1,99 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: fp_requantize_op.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_
#define TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_
tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_fpreq_bitmask(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
q7_t* mat_mult_kernel_s8_s16_reordered_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch8_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch16_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch24_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch48_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
#endif /* TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_ */
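/* Illustrative sketch only (not part of fp_requantize_op.h): what the
 * floating-point requantization implied by the "fpreq" kernels above boils
 * down to for a single accumulator: scale the int32 accumulator by a
 * per-channel float scale, add the output offset, and clamp to the activation
 * range. The exact rounding behaviour is an assumption for illustration. */
static inline q7_t example_fp_requantize(int32_t acc, float scale, int32_t out_offset,
                                         int32_t activation_min, int32_t activation_max)
{
    int32_t out = (int32_t)((float)acc * scale) + out_offset; /* one float multiply replaces the fixed-point multiplier/shift pair */
    if (out < activation_min) out = activation_min;           /* clamp to the quantized activation range */
    if (out > activation_max) out = activation_max;
    return (q7_t)out;
}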


@@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: genNN.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef INC_GENNN_H_
#define INC_GENNN_H_
#include <stdint.h>
signed char* getInput();
signed char* getOutput();
float* getOutput_fp();
int32_t* getOutput_int32();
void setupBuffer();
void invoke(float* labels);
void getResult(uint8_t *P, uint8_t *NP);
int* getKbuffer();
void end2endinference();
#endif /* INC_GENNN_H_ */
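/* Illustrative usage sketch only (not part of genNN.h and not generated code):
 * one way application code might drive the generated model, assuming the
 * typical pattern of filling the buffer returned by getInput() and then
 * running inference. The exact calling sequence and output interpretation are
 * assumptions, not taken from this commit. */
#include <string.h>

void example_run_inference(const signed char *image, int image_bytes)
{
    setupBuffer();                                  /* prepare the runtime/activation buffers */
    memcpy(getInput(), image, (size_t)image_bytes); /* copy the int8 input into the model's input buffer */
    end2endinference();                             /* run all layers of the generated network */
    signed char *scores = getOutput();              /* int8 output scores of the final layer */
    (void)scores;                                   /* application-specific post-processing goes here */
}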


@@ -0,0 +1,546 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: img2col_element.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
#define ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
#include "arm_nnsupportfunctions.h"
#include "arm_math_memory.h"
#define b2_q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
/*in_q7x4 = b2_nn_read_q7x4_ia((const q7_t **)&src); \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); */ \
in_q15x2_1 = ((src[0] & 0x0C) >> 2) + ((src[0] & 0xC0) << 10);\
in_q15x2_2 = (src[0] & 0x03) + ((src[0] & 0x30) << 12);\
src +=1;\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define b4_q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
/*in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src); \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); */ \
in_q15x2_1 = ((src[0] & 0xF0) >> 4) + ((src[1] & 0xF0) << 12);\
in_q15x2_2 = (src[0] & 0x0F) + ((src[1] & 0x0F) << 16);\
src +=2;\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src); \
/* Extract and sign extend each of the four q7 values to q15 */ \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); \
\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define q8_q15_offset_ele(src,dst) \
/* convert from q8 to q15 and then store the results in the destination buffer */ \
in_q7x4 = arm_nn_read_q7x4_ia((const q8_t **)&src); \
/* Extend each of the four q8 values to q15 */ \
in_q15x2_1 = __UXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __UXTB16(in_q7x4); \
\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define b4_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define b2_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = b2_nn_read_q7x4_ia(&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define q7_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define q31_assign2(src,dst) \
*dst++ = *src++; \
*dst++ = *src++;
#define q31_assign4(src,dst) \
q31_assign2(src,dst) \
q31_assign2(src,dst) \
#define q31_assign6(src,dst) \
q31_assign4(src,dst) \
q31_assign2(src,dst) \
#define q31_assign8(src,dst) \
q31_assign4(src,dst) \
q31_assign4(src,dst) \
#define q31_assign10(src,dst) \
q31_assign8(src,dst) \
q31_assign2(src,dst) \
#define q31_assign12(src,dst) \
q31_assign10(src,dst) \
q31_assign2(src,dst) \
#define q31_pad2(dst,padvalue) \
*dst++ = padvalue; \
*dst++ = padvalue; \
#define q31_pad4(dst,padvalue) \
q31_pad2(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define q31_pad6(dst,padvalue) \
q31_pad4(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define q31_pad10(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad4(dst,padvalue) \
#define q31_pad14(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define assignq31toq15()\
dst = (q15_t*)dst_31;\
dst2 = (q15_t*)dst2_31;\
dst3 = (q15_t*)dst3_31;\
dst4 = (q15_t*)dst4_31;\
dst5 = (q15_t*)dst5_31;\
dst6 = (q15_t*)dst6_31;\
dst7 = (q15_t*)dst7_31;\
#define assignq15toq31()\
dst_31 = (q31_t*)dst;\
dst2_31 = (q31_t*)dst2;\
dst3_31 = (q31_t*)dst3;\
dst4_31 = (q31_t*)dst4;\
dst5_31 = (q31_t*)dst5;\
dst6_31 = (q31_t*)dst6;\
dst7_31 = (q31_t*)dst7;\
/* ---------------------------------- Pad ---------------------------------- */
#define basic_pad_1row(col,dst_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_2row(col,dst_31,dst2_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_3row(col,dst_31,dst2_31,dst3_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_4row(col,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
q31_pad2(dst4_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_5row(col,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
q31_pad2(dst4_31,pad_out_q15x2) \
q31_pad2(dst5_31,pad_out_q15x2) \
block_cnt--; \
}
#define pad_1row_1col(dst_31,pad_out_q15x2) basic_pad_1row(1,dst_31,pad_out_q15x2)
#define pad_1row_2col(dst_31,pad_out_q15x2) basic_pad_1row(2,dst_31,pad_out_q15x2)
#define pad_1row_3col(dst_31,pad_out_q15x2) basic_pad_1row(3,dst_31,pad_out_q15x2)
#define pad_2row_1col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(1,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_2col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(2,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_3col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(3,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_4col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(4,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_5col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(5,dst_31,dst2_31,pad_out_q15x2)
#define pad_3row_1col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(1,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_3row_2col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(2,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_3row_3col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(3,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_4row_1col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(1,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_4row_2col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(2,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_4row_3col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(3,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_5row_1col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(1,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
#define pad_5row_2col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(2,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
#define pad_5row_3col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(3,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
/* ---------------------------------- Load ---------------------------------- */
#define basic_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define basic_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define basic_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define basic_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define basic_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
q7_q15_offset_ele(src4,dst4)\
q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
///////////////////////// 4bit //////////////////////////
#define b4_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define b4_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define b4_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define b4_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
b4_q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define b4_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
b4_q7_q15_offset_ele(src4,dst4)\
b4_q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
///////////////////////// 2bit //////////////////////////
#define b2_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define b2_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define b2_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define b2_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
b2_q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define b2_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
b2_q7_q15_offset_ele(src4,dst4)\
b2_q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
#define b4_load_1row_1col(src,dst) b4_load_1row(1,src,dst)
#define b4_load_1row_2col(src,dst) b4_load_1row(2,src,dst)
#define b4_load_1row_3col(src,dst) b4_load_1row(3,src,dst)
#define b4_load_1row_4col(src,dst) b4_load_1row(4,src,dst)
#define b4_load_2row_1col(src,src2,dst,dst2) b4_load_2row(1,src,src2,dst,dst2)
#define b4_load_2row_2col(src,src2,dst,dst2) b4_load_2row(2,src,src2,dst,dst2)
#define b4_load_2row_3col(src,src2,dst,dst2) b4_load_2row(3,src,src2,dst,dst2)
#define b4_load_2row_4col(src,src2,dst,dst2) b4_load_2row(4,src,src2,dst,dst2)
#define b4_load_3row_1col(src,src2,src3,dst,dst2,dst3) b4_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_2col(src,src2,src3,dst,dst2,dst3) b4_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_3col(src,src2,src3,dst,dst2,dst3) b4_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_4col(src,src2,src3,dst,dst2,dst3) b4_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define b4_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_1row_1col(src,dst) b2_load_1row(1,src,dst)
#define b2_load_1row_2col(src,dst) b2_load_1row(2,src,dst)
#define b2_load_1row_3col(src,dst) b2_load_1row(3,src,dst)
#define b2_load_1row_4col(src,dst) b2_load_1row(4,src,dst)
#define b2_load_2row_1col(src,src2,dst,dst2) b2_load_2row(1,src,src2,dst,dst2)
#define b2_load_2row_2col(src,src2,dst,dst2) b2_load_2row(2,src,src2,dst,dst2)
#define b2_load_2row_3col(src,src2,dst,dst2) b2_load_2row(3,src,src2,dst,dst2)
#define b2_load_2row_4col(src,src2,dst,dst2) b2_load_2row(4,src,src2,dst,dst2)
#define b2_load_3row_1col(src,src2,src3,dst,dst2,dst3) b2_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_2col(src,src2,src3,dst,dst2,dst3) b2_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_3col(src,src2,src3,dst,dst2,dst3) b2_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_4col(src,src2,src3,dst,dst2,dst3) b2_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define b2_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_1row_1col(src,dst) basic_load_1row(1,src,dst)
#define load_1row_2col(src,dst) basic_load_1row(2,src,dst)
#define load_1row_3col(src,dst) basic_load_1row(3,src,dst)
#define load_1row_4col(src,dst) basic_load_1row(4,src,dst)
#define load_2row_1col(src,src2,dst,dst2) basic_load_2row(1,src,src2,dst,dst2)
#define load_2row_2col(src,src2,dst,dst2) basic_load_2row(2,src,src2,dst,dst2)
#define load_2row_3col(src,src2,dst,dst2) basic_load_2row(3,src,src2,dst,dst2)
#define load_2row_4col(src,src2,dst,dst2) basic_load_2row(4,src,src2,dst,dst2)
#define load_3row_1col(src,src2,src3,dst,dst2,dst3) basic_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define load_3row_2col(src,src2,src3,dst,dst2,dst3) basic_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define load_3row_3col(src,src2,src3,dst,dst2,dst3) basic_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define load_3row_4col(src,src2,src3,dst,dst2,dst3) basic_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
/* ---------------------------------- Reuse ---------------------------------- */
#define basic_reuse_1row(col,src_31,dst_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
block_cnt--;\
}
#define basic_reuse_2row(col,src_31,src2_31,dst_31,dst2_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
block_cnt--;\
}
#define basic_reuse_3row(col,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
block_cnt--;\
}
#define basic_reuse_4row(col,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
block_cnt--;\
}
#define basic_reuse_5row(col,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
q31_assign2(src5_31,dst5_31)\
block_cnt--;\
}
#define reuse_1row_1col(src_31,dst_31) basic_reuse_1row(1,src_31,dst_31)
#define reuse_1row_2col(src_31,dst_31) basic_reuse_1row(2,src_31,dst_31)
#define reuse_1row_3col(src_31,dst_31) basic_reuse_1row(3,src_31,dst_31)
#define reuse_1row_4col(src_31,dst_31) basic_reuse_1row(4,src_31,dst_31)
#define reuse_1row_5col(src_31,dst_31) basic_reuse_1row(5,src_31,dst_31)
#define reuse_1row_6col(src_31,dst_31) basic_reuse_1row(6,src_31,dst_31)
#define reuse_2row_1col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(1,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_2col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(2,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_3col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(3,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_4col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(4,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_5col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(5,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_6col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(6,src_31,src2_31,dst_31,dst2_31)
#define reuse_3row_1col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(1,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_2col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(2,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_3col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(3,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_4col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(4,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_5col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(5,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_6col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(6,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_4row_3col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(3,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_4col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(4,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_5col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(5,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_6col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(6,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_5row_3col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(3,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_4col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(4,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_5col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(5,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_6col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(6,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#endif /* ARMNN_INCLUDE_IMG2COL_ELEMENT_H_ */


@@ -0,0 +1,421 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: kernel_element.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef ARMNN_INCLUDE_KERNEL_ELEMENT_H_
#define ARMNN_INCLUDE_KERNEL_ELEMENT_H_
#include "mutable_function.h"
#include "precision_cnt.h"
#define loop_ele_ext() \
sum = __SMLAD(col32[0], k_buf1[0], sum); \
sum_2 = __SMLAD(col32[1], k_buf1[1], sum_2); \
sum_3 = __SMLAD(col32[2], k_buf1[2], sum_3); \
sum_4 = __SMLAD(col32[3], k_buf1[3], sum_4); \
col32 += 4;\
k_buf1 += 4; \
#define loop_ele() \
op_a = arm_nn_read_q15x2(col_pos); \
op_b = arm_nn_read_q15x2(col_pos + input_ch); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum = __SMLAD(op_c, k_buf1[0], sum); \
sum_2 = __SMLAD(op_a, k_buf1[q32_elements], sum_2); \
\
op_a = arm_nn_read_q15x2(col_pos + 2); \
op_b = arm_nn_read_q15x2(col_pos + input_ch + 2); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum_3 = __SMLAD(op_c, k_buf1[q32_elements*2], sum_3); \
sum_4 = __SMLAD(op_a, k_buf1[q32_elements*3], sum_4); \
\
col_pos += two_inch; \
k_buf1++;
/* end of loop_ele() */
#define prepare_loops()\
q7_t *out_1 = out + output_ch / output_scaler;\
const int32_t *out_shift = output_shift;\
const int32_t *out_mult = output_mult;\
const int32_t *obias = bias;\
uint16_t row_count = output_ch / 2;\
q31_t *ksrc = &kbuf[0];\
/* end of prepare_loops() */
#define conv_1stloop_ele()\
q31_t ch_0_out_0 = *obias;\
q31_t ch_0_out_1 = *obias++;\
q31_t ch_1_out_0 = *obias;\
q31_t ch_1_out_1 = *obias++;\
q31_t b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
q31_t b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_1stloop_ele */
#define conv_lastloop_ele()\
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
\
ksrc = ksrc2;\
/* end of conv_lastloop_ele */
#define conv_midloop_ele(k_index) \
b1 = arm_nn_read_q15x2_ia(&ip_b1);\
ch_0_out_0 = __SMLAD(ksrc[k_index], b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(ksrc[k_index], b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(ksrc2[k_index], b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia(&ip_b0);\
ch_1_out_1 = __SMLAD(ksrc2[k_index], b1, ch_1_out_1);\
/* end of conv_midloop_ele */
#define conv_midloop_ptrele() \
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_midloop_ptrele */
/* Specialized Loop Unrolling */
//this can be selected for different models
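/*
 * Note on the pattern below (added for clarity): unroll_<N>inch() fully
 * unrolls the per-column multiply-accumulate loop for a layer with N input
 * channels. Each macro emits one conv_1stloop_ele(), (N/2 - 2)
 * conv_midloop_ptrele() and one conv_lastloop_ele(), i.e. N/2 packed q15
 * MAC steps, and every step feeds two output channels (ksrc / ksrc2) and
 * two im2col columns (ip_b0 / ip_b1) at once.
 */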
#define unroll_8inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 8;\
q31_t *ksrc2 = ksrc + 4;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_12inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 12;\
q31_t *ksrc2 = ksrc + 6;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_16inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 16;\
q31_t *ksrc2 = ksrc + 8;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_20inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 20;\
q31_t *ksrc2 = ksrc + 10;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_24inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 24;\
q31_t *ksrc2 = ksrc + 12;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_32inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 32;\
q31_t *ksrc2 = ksrc + 16;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_36inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 36;\
q31_t *ksrc2 = ksrc + 18;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_40inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 40;\
q31_t *ksrc2 = ksrc + 20;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_48inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 48;\
q31_t *ksrc2 = ksrc + 24;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\
/* END: Specialized Loop Unrolling */
#define b2_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
if(lower_bit == 1){\
*out = (q7_t) ((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2));\
*out_1 = (q7_t) ((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2));\
lower_bit = 3;\
}\
else{\
*out++ += (q7_t) (((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2)) << 4);\
*out_1++ += (q7_t) (((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2)) << 4);\
lower_bit = 1;\
}\
out_mult++;\
out_shift++;
#define b4_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ((ch_0_out_0 & 0x0F) + ((ch_1_out_0 & 0x0F) << 4));\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ((ch_0_out_1 & 0x0F) + ((ch_1_out_1 & 0x0F) << 4));\
out_mult++;\
out_shift++;
#define assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
*out++ = (q7_t) ch_0_out_0;\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_0_out_1;\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ch_1_out_0;\
\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_1_out_1;\
out_mult++;\
out_shift++;\
/* end of assign_requantize */
#endif /* ARMNN_INCLUDE_KERNEL_ELEMENT_H_ */

View File

@ -0,0 +1,236 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: mutable_function.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_
/* mutable functions */
#if KERNEL_PRE == 4
#define mix_read_and_pad_reordered b4_read_and_pad_reordered
#define mix_nn_read_q7x4 b4_nn_read_q7x4
#define mix_read_and_pad b4_read_and_pad
#elif KERNEL_PRE == 2
#define mix_read_and_pad_reordered b2_read_and_pad_reordered
#define mix_nn_read_q7x4 b2_nn_read_q7x4
#define mix_read_and_pad b2_read_and_pad
#else
#define mix_read_and_pad_reordered read_and_pad_reordered
#define mix_nn_read_q7x4 arm_nn_read_q7x4
#define mix_read_and_pad read_and_pad
#endif
#if INPUT_PRE == 4
#define mix_q7_q15_offset_ele b4_q7_q15_offset_ele
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_ele b2_q7_q15_offset_ele
#else
#define mix_q7_q15_offset_ele q7_q15_offset_ele
#endif
#if INPUT_PRE == 4
#define mix_q7_q15_offset_reordered_ele b4_q15_offset_reordered_ele
#define mix_load_1row_1col b4_load_1row_1col
#define mix_load_1row_2col b4_load_1row_2col
#define mix_load_1row_3col b4_load_1row_3col
#define mix_load_1row_4col b4_load_1row_4col
#define mix_load_1row_5col b4_load_1row_5col
#define mix_load_1row_6col b4_load_1row_6col
#define mix_load_1row_7col b4_load_1row_7col
#define mix_load_2row_1col b4_load_2row_1col
#define mix_load_2row_2col b4_load_2row_2col
#define mix_load_2row_3col b4_load_2row_3col
#define mix_load_2row_4col b4_load_2row_4col
#define mix_load_2row_5col b4_load_2row_5col
#define mix_load_2row_6col b4_load_2row_6col
#define mix_load_2row_7col b4_load_2row_7col
#define mix_load_3row_1col b4_load_3row_1col
#define mix_load_3row_2col b4_load_3row_2col
#define mix_load_3row_3col b4_load_3row_3col
#define mix_load_3row_4col b4_load_3row_4col
#define mix_load_3row_5col b4_load_3row_5col
#define mix_load_3row_6col b4_load_3row_6col
#define mix_load_3row_7col b4_load_3row_7col
#define mix_load_4row_1col b4_load_4row_1col
#define mix_load_4row_2col b4_load_4row_2col
#define mix_load_4row_3col b4_load_4row_3col
#define mix_load_4row_4col b4_load_4row_4col
#define mix_load_4row_5col b4_load_4row_5col
#define mix_load_4row_6col b4_load_4row_6col
#define mix_load_4row_7col b4_load_4row_7col
#define mix_load_5row_1col b4_load_5row_1col
#define mix_load_5row_2col b4_load_5row_2col
#define mix_load_5row_3col b4_load_5row_3col
#define mix_load_5row_4col b4_load_5row_4col
#define mix_load_5row_5col b4_load_5row_5col
#define mix_load_5row_6col b4_load_5row_6col
#define mix_load_5row_7col b4_load_5row_7col
#define mix_load_6row_1col b4_load_6row_1col
#define mix_load_6row_2col b4_load_6row_2col
#define mix_load_6row_3col b4_load_6row_3col
#define mix_load_6row_4col b4_load_6row_4col
#define mix_load_6row_5col b4_load_6row_5col
#define mix_load_6row_6col b4_load_6row_6col
#define mix_load_6row_7col b4_load_6row_7col
#define mix_load_7row_1col b4_load_7row_1col
#define mix_load_7row_2col b4_load_7row_2col
#define mix_load_7row_3col b4_load_7row_3col
#define mix_load_7row_4col b4_load_7row_4col
#define mix_load_7row_5col b4_load_7row_5col
#define mix_load_7row_6col b4_load_7row_6col
#define mix_load_7row_7col b4_load_7row_7col
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_reordered_ele b2_q15_offset_reordered_ele
#define mix_load_1row_1col b2_load_1row_1col
#define mix_load_1row_2col b2_load_1row_2col
#define mix_load_1row_3col b2_load_1row_3col
#define mix_load_1row_4col b2_load_1row_4col
#define mix_load_1row_5col b2_load_1row_5col
#define mix_load_1row_6col b2_load_1row_6col
#define mix_load_1row_7col b2_load_1row_7col
#define mix_load_2row_1col b2_load_2row_1col
#define mix_load_2row_2col b2_load_2row_2col
#define mix_load_2row_3col b2_load_2row_3col
#define mix_load_2row_4col b2_load_2row_4col
#define mix_load_2row_5col b2_load_2row_5col
#define mix_load_2row_6col b2_load_2row_6col
#define mix_load_2row_7col b2_load_2row_7col
#define mix_load_3row_1col b2_load_3row_1col
#define mix_load_3row_2col b2_load_3row_2col
#define mix_load_3row_3col b2_load_3row_3col
#define mix_load_3row_4col b2_load_3row_4col
#define mix_load_3row_5col b2_load_3row_5col
#define mix_load_3row_6col b2_load_3row_6col
#define mix_load_3row_7col b2_load_3row_7col
#define mix_load_4row_1col b2_load_4row_1col
#define mix_load_4row_2col b2_load_4row_2col
#define mix_load_4row_3col b2_load_4row_3col
#define mix_load_4row_4col b2_load_4row_4col
#define mix_load_4row_5col b2_load_4row_5col
#define mix_load_4row_6col b2_load_4row_6col
#define mix_load_4row_7col b2_load_4row_7col
#define mix_load_5row_1col b2_load_5row_1col
#define mix_load_5row_2col b2_load_5row_2col
#define mix_load_5row_3col b2_load_5row_3col
#define mix_load_5row_4col b2_load_5row_4col
#define mix_load_5row_5col b2_load_5row_5col
#define mix_load_5row_6col b2_load_5row_6col
#define mix_load_5row_7col b2_load_5row_7col
#define mix_load_6row_1col b2_load_6row_1col
#define mix_load_6row_2col b2_load_6row_2col
#define mix_load_6row_3col b2_load_6row_3col
#define mix_load_6row_4col b2_load_6row_4col
#define mix_load_6row_5col b2_load_6row_5col
#define mix_load_6row_6col b2_load_6row_6col
#define mix_load_6row_7col b2_load_6row_7col
#define mix_load_7row_1col b2_load_7row_1col
#define mix_load_7row_2col b2_load_7row_2col
#define mix_load_7row_3col b2_load_7row_3col
#define mix_load_7row_4col b2_load_7row_4col
#define mix_load_7row_5col b2_load_7row_5col
#define mix_load_7row_6col b2_load_7row_6col
#define mix_load_7row_7col b2_load_7row_7col
#else
#define mix_q7_q15_offset_reordered_ele q7_q15_offset_reordered_ele
#define mix_load_1row_1col load_1row_1col
#define mix_load_1row_2col load_1row_2col
#define mix_load_1row_3col load_1row_3col
#define mix_load_1row_4col load_1row_4col
#define mix_load_1row_5col load_1row_5col
#define mix_load_1row_6col load_1row_6col
#define mix_load_1row_7col load_1row_7col
#define mix_load_2row_1col load_2row_1col
#define mix_load_2row_2col load_2row_2col
#define mix_load_2row_3col load_2row_3col
#define mix_load_2row_4col load_2row_4col
#define mix_load_2row_5col load_2row_5col
#define mix_load_2row_6col load_2row_6col
#define mix_load_2row_7col load_2row_7col
#define mix_load_3row_1col load_3row_1col
#define mix_load_3row_2col load_3row_2col
#define mix_load_3row_3col load_3row_3col
#define mix_load_3row_4col load_3row_4col
#define mix_load_3row_5col load_3row_5col
#define mix_load_3row_6col load_3row_6col
#define mix_load_3row_7col load_3row_7col
#define mix_load_4row_1col load_4row_1col
#define mix_load_4row_2col load_4row_2col
#define mix_load_4row_3col load_4row_3col
#define mix_load_4row_4col load_4row_4col
#define mix_load_4row_5col load_4row_5col
#define mix_load_4row_6col load_4row_6col
#define mix_load_4row_7col load_4row_7col
#define mix_load_5row_1col load_5row_1col
#define mix_load_5row_2col load_5row_2col
#define mix_load_5row_3col load_5row_3col
#define mix_load_5row_4col load_5row_4col
#define mix_load_5row_5col load_5row_5col
#define mix_load_5row_6col load_5row_6col
#define mix_load_5row_7col load_5row_7col
#define mix_load_6row_1col load_6row_1col
#define mix_load_6row_2col load_6row_2col
#define mix_load_6row_3col load_6row_3col
#define mix_load_6row_4col load_6row_4col
#define mix_load_6row_5col load_6row_5col
#define mix_load_6row_6col load_6row_6col
#define mix_load_6row_7col load_6row_7col
#define mix_load_7row_1col load_7row_1col
#define mix_load_7row_2col load_7row_2col
#define mix_load_7row_3col load_7row_3col
#define mix_load_7row_4col load_7row_4col
#define mix_load_7row_5col load_7row_5col
#define mix_load_7row_6col load_7row_6col
#define mix_load_7row_7col load_7row_7col
#endif
#if OUTPUT_PRE == 4
#define mix_assign_requantize() b4_assign_requantize()
#elif OUTPUT_PRE == 2
#define mix_assign_requantize() b2_assign_requantize()
#else
#define mix_assign_requantize() assign_requantize()
#endif
#if KERNEL_PRE == 4
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b44_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b44_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b42_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b42_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b48_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b48_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif//OUTPUT
#elif KERNEL_PRE == 2
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b24_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b24_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b22_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b22_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b28_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b28_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif//OUTPUT
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered arm_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul arm_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif
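/*
 * Naming note (added for clarity): in the aliases above, b<K><O>_... appears
 * to encode the kernel and output precisions in bits, with 8 standing for the
 * plain 8-bit path, e.g. b42_nn_mat_mult_kernel_s8_s16_reordered is the
 * 4-bit-weight / 2-bit-output variant selected when KERNEL_PRE == 4 and
 * OUTPUT_PRE == 2.
 */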
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_ */

View File

@ -0,0 +1,31 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: precision_cnt.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_
/* MIX precision */
#define INPUT_PRE 8
#define KERNEL_PRE 8
#define OUTPUT_PRE 8
#define input_scaler (8 / INPUT_PRE)
#define weight_scaler (8 / KERNEL_PRE)
#define output_scaler (8 / OUTPUT_PRE)
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_ */
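/*
 * Illustrative note (not part of the original configuration): with the 8-bit
 * defaults above, input_scaler, weight_scaler and output_scaler all evaluate
 * to 1 and mutable_function.h resolves every mix_* alias to the plain 8-bit
 * routine. A hypothetical 4-bit-activation build would instead set
 *
 *     #define INPUT_PRE  4
 *     #define KERNEL_PRE 8
 *     #define OUTPUT_PRE 8
 *
 * so that input_scaler evaluates to 2 and the mix_load_*row_*col loaders map
 * to their b4_* counterparts.
 */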

View File

@ -0,0 +1,68 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: profile.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "stm32f7xx_hal.h"
#include <stdio.h>
#include <string.h>
static UART_HandleTypeDef UART;
#define RUNS 1
static int profile_i;
static int start, end;
static char buf[100];
void printLog(const char *s) {
static int is_initialized = 0;
if (!is_initialized) {
UART.Instance = USART1;
UART.Init.BaudRate = 115200;
UART.Init.WordLength = UART_WORDLENGTH_8B;
UART.Init.StopBits = UART_STOPBITS_1;
UART.Init.Parity = UART_PARITY_NONE;
UART.Init.Mode = UART_MODE_TX_RX;
UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
UART.Init.OverSampling = UART_OVERSAMPLING_16;
UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
if (HAL_UART_Init(&UART) != HAL_OK) {
//Error handling
}
is_initialized = 1;
}
HAL_UART_Transmit(&UART, (uint8_t*) s, strlen(s), 10);
}
void recieveChar(char *s) {
static int is_initialized = 0;
if (!is_initialized) {
UART.Instance = USART1;
UART.Init.BaudRate = 115200;
UART.Init.WordLength = UART_WORDLENGTH_8B;
UART.Init.StopBits = UART_STOPBITS_1;
UART.Init.Parity = UART_PARITY_NONE;
UART.Init.Mode = UART_MODE_TX_RX;
UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
UART.Init.OverSampling = UART_OVERSAMPLING_16;
UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
if (HAL_UART_Init(&UART) != HAL_OK) {
//Error handling
}
is_initialized = 1;
}
HAL_UART_Receive(&UART, (uint8_t*) s, 1, 10);
}
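/*
 * Minimal usage sketch (illustrative only, not part of the original header):
 * format a message into a local buffer and push it over USART1 via
 * printLog(). The layer name and cycle count are placeholder values.
 */
static inline void profile_report_example(const char *layer, int cycles) {
	char msg[64];
	snprintf(msg, sizeof(msg), "%s: %d cycles\r\n", layer, cycles);
	printLog(msg);
}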

View File

@ -0,0 +1,161 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: tinyengine_function.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <stdint.h>
#include <stdbool.h>
typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;
typedef enum {
STATE_SUCCESS = 0, /* No error */
PARAM_NO_SUPPORT = 1, /* Unsupported parameters */
} tinyengine_status;
typedef struct add_params {
int input_h, input_w, input_c, left_shift;
int input1_offset, input1_multiplier, input1_shift;
int input2_offset, input2_multiplier, input2_shift;
int output_offset, output_multiplier, output_shift;
int quantized_activation_max, quantized_activation_min;
} ADD_params;
#define TN_MAX(A,B) ((A) > (B) ? (A) : (B))
#define TN_MIN(A,B) ((A) < (B) ? (A) : (B))
// bit assignment and check
#define BIT_SET(a,b) ((a) |= (1ULL<<(b)))
#define BIT_CLEAR(a,b) ((a) &= ~(1ULL<<(b)))
#define BIT_FLIP(a,b) ((a) ^= (1ULL<<(b)))
#define BIT_CHECK(a,b) (!!((a) & (1ULL<<(b)))) // '!!' to make sure this returns 0 or 1
#define BITMASK_SET(x, mask) ((x) |= (mask))
#define BITMASK_CLEAR(x, mask) ((x) &= (~(mask)))
#define BITMASK_FLIP(x, mask) ((x) ^= (mask))
#define BITMASK_CHECK_ALL(x, mask) (!(~(x) & (mask)))
#define BITMASK_CHECK_ANY(x, mask) ((x) & (mask))
tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch16(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch24(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch48(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value);
tinyengine_status add(int size, ADD_params *params, const int8_t *input1_data,
const int8_t *input2_data, int8_t *output_data);
tinyengine_status avg_pooling(const q7_t *input, const uint16_t input_h,
const uint16_t input_w, const uint16_t input_c, const uint16_t sample_h,
const uint16_t sample_w, const uint16_t output_h,
const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output);
tinyengine_status fully_connected_fp(const float *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch,
const uint16_t output_ch, const float *bias, const float *weights,
float *output);
tinyengine_status statble_softmax_inplace(float *input, const uint16_t length);
tinyengine_status mat_mul_fp(const float *matA, const uint16_t matA_row,
const uint16_t matA_col, const float *matB, const uint16_t matB_col,
float *output);
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const float *scales, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value);
tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data);
tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask);
tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask);
tinyengine_status where_int8(const bool* inMask, const uint16_t size, signed char* input1_data,
const char* input2_data, char* output_data);
tinyengine_status convolve_1x1_s8_fpreq_mask_partialCH(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel_sram, const q7_t *kernel_flash, const uint16_t first_k_channel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
#include "genInclude.h"
#include "fp_requantize_op.h"

View File

@ -0,0 +1,31 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: tinyengine_lib.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#define TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#include <stdint.h>
#include <stdio.h>
typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;
#endif /* TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_ */

View File

@ -0,0 +1,33 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: yoloOutput.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
typedef struct box{
float x0;
float y0;
float x1;
float y1;
float score;
} det_box;
det_box** postprocessing(signed char *input_data[3], signed char y_zero[3], float y_scale[3],
unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);
det_box** postprocessing_fp(float *input_data[3], signed char y_zero[3], float y_scale[3],
unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);
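/*
 * Hedged consumer sketch (illustrative; the allocation and ownership of the
 * returned boxes are defined by the implementation, and treating each entry
 * as a det_box pointer is an assumption): walk the detections whose count
 * postprocessing() writes through box_ret and return the best score.
 */
static inline float max_box_score_example(det_box **boxes, int box_count) {
	float best = 0.0f;
	for (int i = 0; i < box_count; i++) {
		if (boxes[i]->score > best)
			best = boxes[i]->score;
	}
	return best;
}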

View File

@ -0,0 +1,88 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: add_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"
tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data) {
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
clamped_output = TN_MAX(clamped_output, -128);
clamped_output = TN_MIN(clamped_output, 127);
output_data[i] = (int8_t)(clamped_output);
}
return STATE_SUCCESS;
}
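/*
 * Worked example (illustrative values): with input1_scale = input2_scale =
 * 0.05f, output_scale = 0.1f and all zero points 0, adding the int8 inputs
 * 40 and 20 dequantizes to 2.0 + 1.0 = 3.0, which requantizes to
 * round(3.0 / 0.1) = 30, so 30 is stored in output_data.
 */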
const int activation_min = -128;
const int activation_max = 127;
tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask) {
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
int8_t mask_value = 1;
if (clamped_output < activation_min){
clamped_output = activation_min;
mask_value = 0;
}
if (clamped_output > activation_max){
clamped_output = activation_max;
mask_value = 0;
}
output_data[i] = (int8_t)(clamped_output);
output_mask[i] = mask_value;
}
return STATE_SUCCESS;
}
tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask) {
int mask_idx = 0;
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
int8_t mask_value = 1;
if (clamped_output < activation_min){
clamped_output = activation_min;
mask_value = 0;
}
if (clamped_output > activation_max){
clamped_output = activation_max;
mask_value = 0;
}
output_data[i] = (int8_t)(clamped_output);
if (mask_value == 1)
BIT_SET(*output_mask, mask_idx);
else
BIT_CLEAR(*output_mask, mask_idx);
mask_idx++;
if (mask_idx == 8){
mask_idx = 0;
output_mask++;
}
}
return STATE_SUCCESS;
}
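/*
 * Decoding sketch (illustrative, not part of the original source):
 * add_fpreq_bitmask() packs one mask bit per element, eight elements per
 * byte, least-significant bit first, so a hypothetical consumer can recover
 * the flag for element i with BIT_CHECK from tinyengine_function.h.
 */
static inline int mask_bit_example(const int8_t *output_mask, int i) {
	return BIT_CHECK(output_mask[i >> 3], i & 0x7);
}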

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch16_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch16_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
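/*
 * Buffer note (added for clarity): the partial im2col above stages two input
 * columns at a time, so runtime_buf is assumed to provide at least
 * 2 * input_ch q15_t entries for this 1x1 kernel.
 */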

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch24_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch24_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch48_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch48_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch8_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
output_ch, scales, (q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,125 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
output_ch, scales, (q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (q31_t) ((float) sum * scales[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
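/*
 * Hedged call sketch (illustrative): a 1x1 convolution over a 10x10x16
 * activation map with 16 output channels. All tensors are zero-filled
 * placeholders and the output_ch-major weight layout is an assumption; the
 * only requirement enforced above is that input_ch be a multiple of 4.
 */
static tinyengine_status convolve_1x1_example(void) {
	static const q7_t input[10 * 10 * 16];
	static const q7_t kernel[16 * 16]; /* assumed [output_ch][input_ch] */
	static const int32_t bias[16];
	static const float scales[16];
	static q7_t output[10 * 10 * 16];
	static q15_t runtime_buf[2 * 16]; /* two im2col columns */
	return convolve_1x1_s8_fpreq(input, 10, 10, 16, kernel, bias, scales,
			0, 0, -128, 127, output, 10, 10, 16, runtime_buf);
}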

View File

@ -0,0 +1,287 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_inputch3_stride2_pad1_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const float *scales, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input
+ (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
//4 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = mat_mult_kernel3_input3_s8_s16_fpreq(kernel, runtime_buf,
output_ch, scales, output_offset, output_activation_min,
output_activation_max, input_ch * kernel_y * kernel_x,
bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
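/* fpreq requantization: scale by the per-channel float multiplier, add the
 * output offset, and clamp to the activation range. */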
sum = (float) sum * scales[i];
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

File diff suppressed because it is too large

View File

@ -0,0 +1,92 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: add.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"
int32_t Add(int32_t a, int32_t b) {
return a + b;
}
int32_t ShiftRight(int32_t a, int offset) {
return a >> offset;
}
int32_t BitAnd(int32_t a, int32_t b) {
return a & b;
}
int32_t BitNot(int32_t a) {
return ~a;
}
int32_t MaskIfNonZero(int32_t a) {
static const int32_t zero = 0;
return a ? BitNot(zero) : zero;
}
int32_t MaskIfGreaterThan(int32_t a, int32_t b) {
return MaskIfNonZero(a > b);
}
int32_t MaskIfLessThan(int32_t a, int32_t b) {
return MaskIfNonZero(a < b);
}
static inline int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
int64_t a_64 = a;
int64_t b_64 = b;
int64_t ab_64 = a_64 * b_64;
int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
int32_t ab_x2_high32 = (int32_t)((ab_64 + nudge) / (1ll << 31));
return a == b && a == -2147483648 ? 2147483647 : ab_x2_high32;
}
static inline int32_t RoundingDivideByPOT(int32_t x, int exponent) {
const int32_t mask = ((1ll << exponent) - 1);
const int32_t zero = (0);
const int32_t one = (1);
const int32_t remainder = BitAnd(x, mask);
const int32_t threshold = Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
return Add(ShiftRight(x, exponent), BitAnd(MaskIfGreaterThan(remainder, threshold), one));
}
static inline int32_t MultiplyByQuantizedMultiplierSmallerThanOneExp(
int32_t x, int32_t quantized_multiplier, int left_shift) {
return RoundingDivideByPOT(
SaturatingRoundingDoublingHighMul(x, quantized_multiplier), -left_shift);
}
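/* Taken together, the helpers above implement TFLite-style fixed-point
 * requantization: with M = quantized_multiplier (Q31, in [2^30, 2^31)) and a
 * non-positive exponent e = left_shift, the result is approximately
 * round(x * M * 2^(e - 31)), i.e. multiplication by a real-valued scale
 * smaller than one using only integer arithmetic with round-to-nearest. */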
tinyengine_status add(int size, ADD_params* params, const int8_t* input1_data,
const int8_t* input2_data, int8_t* output_data) {
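/* Element-wise quantized add: both inputs are shifted into a higher-precision
 * domain (left_shift), rescaled to a common scale with their per-input
 * multipliers, summed, requantized to the output scale, and clamped. */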
for (int i = 0; i < size; ++i) {
const int32_t input1_val = params->input1_offset + input1_data[i];
const int32_t input2_val = params->input2_offset + input2_data[i];
const int32_t shifted_input1_val = input1_val * (1 << params->left_shift);
const int32_t shifted_input2_val = input2_val * (1 << params->left_shift);
const int32_t scaled_input1_val =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
shifted_input1_val, params->input1_multiplier, params->input1_shift);
const int32_t scaled_input2_val =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
shifted_input2_val, params->input2_multiplier, params->input2_shift);
const int32_t raw_sum = scaled_input1_val + scaled_input2_val;
const int32_t raw_output =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
raw_sum, params->output_multiplier, params->output_shift) +
params->output_offset;
const int32_t clamped_output = TN_MIN(params->quantized_activation_max,
TN_MAX(params->quantized_activation_min, raw_output));
output_data[i] = (int8_t)(clamped_output);
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,223 @@
/*
* Copyright (C) 2010-2022 Arm Limited or its affiliates.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_convolve_s8_4col.c
* Description: s8_4col version of convolution using symmetric quantization.
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_convolve_s8.c
*
* Target Processor: Cortex-M CPUs
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/**
* @ingroup groupNN
*/
/**
* @addtogroup NNConv
* @{
*/
/*
* Basic s8_4col convolution function.
*
* Refer to the header file for details. The optimal use case for the DSP/MVE implementation is when input and output
* channels are multiples of 4 or at least greater than 4.
*
*/
arm_status arm_convolve_s8_4col(const q7_t *input,
const uint16_t input_x,
const uint16_t input_y,
const uint16_t input_ch,
const uint16_t input_batches,
const q7_t *kernel,
const uint16_t output_ch,
const uint16_t kernel_x,
const uint16_t kernel_y,
const uint16_t pad_x,
const uint16_t pad_y,
const uint16_t stride_x,
const uint16_t stride_y,
const int32_t *bias,
q7_t *output,
const int32_t *output_shift,
const int32_t *output_mult,
const int32_t out_offset,
const int32_t input_offset,
const int32_t out_activation_min,
const int32_t out_activation_max,
const uint16_t output_x,
const uint16_t output_y,
q15_t *buffer_a)
{
int i_batch;
for (i_batch = 0; i_batch < input_batches; i_batch++)
{
input += i_batch * (input_x * input_y * input_ch);
output += i_batch * (output_x * output_y * output_ch);
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate four columns from the input tensor for a GEMM computation */
q15_t *four_column_buf = buffer_a;
q7_t *out = output;
/* This part implements the im2col function */
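/* Patches are widened into buffer_a with the input offset applied; once four
 * columns are staged, they are multiplied against the kernel matrix in one
 * arm_nn_mat_mult_kernel_s8_s16_4col() call. */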
for (i_out_y = 0; i_out_y < output_y; i_out_y++)
{
for (i_out_x = 0; i_out_x < output_x; i_out_x++)
{
for (i_ker_y = i_out_y * stride_y - pad_y; i_ker_y < i_out_y * stride_y - pad_y + kernel_y; i_ker_y++)
{
for (i_ker_x = i_out_x * stride_x - pad_x; i_ker_x < i_out_x * stride_x - pad_x + kernel_x; i_ker_x++)
{
if (i_ker_y < 0 || i_ker_y >= input_y || i_ker_x < 0 || i_ker_x >= input_x)
{
/* Filling 0 for out-of-bound paddings */
memset(four_column_buf, 0, sizeof(q15_t) * input_ch);
}
else
{
/* Copying the pixel data to column */
arm_q7_to_q15_with_offset(input + (i_ker_y * input_x + i_ker_x) * input_ch, four_column_buf, input_ch, input_offset);
}
four_column_buf += input_ch;
}
}
/* Computation is performed once every 4 columns */
if (four_column_buf == buffer_a + 4 * input_ch * kernel_y * kernel_x)
{
out =
arm_nn_mat_mult_kernel_s8_s16_4col(kernel,
buffer_a,
output_ch,
output_shift,
output_mult,
out_offset,
out_activation_min,
out_activation_max,
input_ch * kernel_y * kernel_x,
bias,
out);
/* counter reset */
four_column_buf = buffer_a;
}
}
}
q15_t *four_column_buf_mid = buffer_a;
if (four_column_buf >= four_column_buf_mid + 2 * input_ch * kernel_y * kernel_x) {
out =
arm_nn_mat_mult_kernel_s8_s16(kernel,
four_column_buf_mid,
output_ch,
output_shift,
output_mult,
out_offset,
out_activation_min,
out_activation_max,
input_ch * kernel_y * kernel_x,
bias,
out);
four_column_buf_mid = buffer_a + 2 * input_ch * kernel_y * kernel_x;
}
/* left-over because odd number of output pixels */
if (four_column_buf != four_column_buf_mid)
{
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++)
{
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = four_column_buf_mid;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count)
{
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count)
{
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t)sum;
}
}
}
/* Return to application */
return ARM_MATH_SUCCESS;
}
/**
* @} end of NNConv group
*/

View File

@ -0,0 +1,245 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel3_input3_s8_s16.c
* Description: Matrix-multiplication function for convolution (input channel = 3 and kernel size = 3).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication function for convolution with per-channel requantization.
*
* Refer header file for details.
*
*/
q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0,
q15_t *kbuf)
{
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q15_t *ksrc = &kbuf[0];
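/* kbuf holds the kernels already widened to q15 by the calling convolution
 * routine: 27 values per output channel, channel pairs stored 54 entries
 * apart, so the fully unrolled loop below computes a 2x2 (channel x column)
 * output tile with __SMLAD without re-widening the q7 weights. */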
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
const q31_t *ip31_b0 = ip_b0;
const q31_t *ip31_b1 = ip_b1;
/* align the second pointer for A */
const q15_t *ksrc2 = ksrc + 27;
const q31_t *ksrc_31 = (const q31_t *)ksrc;
const q31_t *ksrc2_31 = (const q31_t *)ksrc2;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
//------------------4
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[0], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[0], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[0], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[0], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[1], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[1], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[1], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[1], b1, ch_1_out_1);
//------------------8
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[2], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[2], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[2], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[2], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[3], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[3], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[3], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[3], b1, ch_1_out_1);
//------------------12
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[4], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[4], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[4], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[4], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[5], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[5], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[5], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[5], b1, ch_1_out_1);
//------------------16
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[6], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[6], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[6], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[6], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[7], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[7], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[7], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[7], b1, ch_1_out_1);
//------------------20
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[8], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[8], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[8], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[8], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[9], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[9], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[9], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[9], b1, ch_1_out_1);
//------------------24
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[10], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[10], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[10], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[10], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[11], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[11], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[11], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[11], b1, ch_1_out_1);
//------------------25,26,27
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[12], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[12], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[12], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[12], b1, ch_1_out_1);
q15_t _b0 = *ip_b0++;
q15_t _b1 = *ip_b1++;
ch_0_out_0 += ksrc[26] * _b0;
ch_0_out_1 += ksrc[26] * _b1;
ch_1_out_0 += ksrc2[26] * _b0;
ch_1_out_1 += ksrc2[26] * _b1;
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ksrc += 54;
row_count--;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
}

View File

@ -0,0 +1,174 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel_s8_s16_reordered_8mul.c
* Description: Matrix-multiplication function for convolution with reordered columns (input channel count a multiple of 8).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16_reordered.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
* requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
*
* Refer to the header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
* read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
* unifying these two functions is a potential future improvement.
*
*/
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0)
{
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q7_t *ip_a0 = input_a;
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + num_col_a;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
uint16_t col_count = num_col_a / 8;
/* accumulate over the vector */
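/* Each pass covers 8 entries of the dot product: read_and_pad_reordered() is
 * called twice per kernel row (8 q7 weights widened to q15), 8 q15 activations
 * are read from each of the two staged columns, and four __SMLAD dual-MACs
 * update each accumulator of the 2x2 (channel x column) output tile. */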
while (col_count)
{
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ip_a0 += num_col_a;
row_count--;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
}

View File

@ -0,0 +1,215 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel_s8_s16_reordered_oddch.c
* Description: Matrix-multiplication function for convolution with reordered columns (odd number of output channels).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16_reordered.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
* requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
*
* Refer to the header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
* read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
* unifying these two functions is a potential future improvement.
*
*/
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0)
{
#if defined(ARM_MATH_LOOPUNROLL) && defined(ARM_MATH_DSP)
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q7_t *ip_a0 = input_a;
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + num_col_a;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
uint16_t col_count = num_col_a / 4;
/* accumulate over the vector */
while (col_count)
{
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ip_a0 += num_col_a;
row_count--;
}
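/* When output_ch is odd, one output channel remains after the pairwise loop;
 * it is computed below for both staged columns from a single kernel row. */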
if (output_ch & 1)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* Init accumulator with bias for channel N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = ch_0_out_0;
int32_t col_count = num_col_a / 4;
while (col_count)
{
q31_t a01, a02;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
#else
(void)input_a;
(void)input_b;
(void)output_ch;
(void)out_shift;
(void)out_mult;
(void)out_offset;
(void)activation_min;
(void)activation_max;
(void)num_col_a;
(void)output_bias;
(void)out_0;
/* To be completed */
return NULL;
#endif
}

View File

@ -0,0 +1,57 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: avgpooling.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status avg_pooling(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t* output)
{
int h, w, c;
int sh, sw;
const int divider_half = ((sample_h * sample_w) / 2);
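/* Adding (or subtracting) half the divisor before the truncating division
 * below makes it round half away from zero, e.g. 7/4 -> (7+2)/4 = 2. */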
for(c = 0; c < input_c; c++){
for(h = 0; h < output_h; h++){
for(w = 0; w < output_w; w++){
int avg = 0;
for(sh = 0; sh < sample_h; sh++){
int height = sh + h * sample_h;
for(sw = 0; sw < sample_w; sw++){
int width = sw + w * sample_w;
avg += input[(width + height * input_w) * input_c + c];
}
}
// for rounded div
if (avg > 0)
avg += divider_half;
else
avg -= divider_half;
int out = avg / (sample_h * sample_w);
out = TN_MAX(out, out_activation_min);
out = TN_MIN(out, out_activation_max);
output[(w + h * output_w) * input_c + c] = out;
}
}
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: concat_ch.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
tinyengine_status concat_ch(const q7_t *input1, const uint16_t input_x,
const uint16_t input_y, const uint16_t input1_ch, const q7_t* input2, const uint16_t input2_ch, q7_t *output) {
int elements = input_y * input_x;
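/* HWC layout: concatenating along the channel axis is a per-pixel copy of
 * input1_ch bytes followed by input2_ch bytes. */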
while(elements--){
//place the first input
memcpy(output, input1, input1_ch);
input1 += input1_ch; output += input1_ch;
//place the second input
memcpy(output, input2, input2_ch);
input2 += input2_ch; output += input2_ch;
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,127 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
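/* A 1x1 kernel needs no spatial window, so im2col reduces to an
 * offset/widening copy of each pixel's channels; two pixels (columns) are
 * staged per pass for the 2-column matrix-multiplication kernel. */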
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = arm_nn_mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,158 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_SRAM.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
//#define FULL_UNROLL
tinyengine_status convolve_1x1_s8_SRAM(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
/* whether kernels can fit in the buffer */
//fill in kernels
const q7_t *ip_a0 = kernel;
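/* Expand the q7 kernels once into the q15 kbuf (two output channels per
 * iteration, reordered to match the im2col layout) so the widening cost is
 * not paid again for every output pixel. */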
for (int i = 0; i < output_ch; i += 2) {
q31_t *dst1 = &kbuf[i * input_ch / 2]; //each q31_t store 2 elements
q31_t *dst2 = dst1 + input_ch / 2;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + input_ch;
uint16_t col_count = input_ch / 4;
/* accumulate over the vector */
while (col_count) {
q31_t a01, a02, a11, a12;
ip_a0 = read_and_pad_reordered(ip_a0, &dst1[0], &dst1[1]);
ip_a1 = read_and_pad_reordered(ip_a1, &dst2[0], &dst2[1]);
dst1 += 2;
dst2 += 2;
col_count--;
} /* while over col_count */
/* skip row */
ip_a0 += input_ch;
}
/* output stationary */
for (i_element = 0; i_element < num_elements; i_element += 2) {
q7_t *src = &input[i_element * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_s16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch,
bias, out, kbuf);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch16.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch16(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch24.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch24(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch24(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch48.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch48(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch48(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch8.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch8(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,135 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_kbuf.c
* Description: Pointwise (1x1) convolution that nests its loops according to the runtime buffer size
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_kbuf(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const q31_t *kbuf, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf){
	if (input_ch % 4 != 0) {	/* input_ch must be a multiple of 4 */
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
volatile int sbufsize = get_sbuffer_size();
int maxcol = sbufsize / input_ch / 2;
/* whether kernels can fit in the buffer */
//fill in kernels
const q7_t *ip_a0 = kernel;
/* output stationary */
for (i_element = 0; i_element < num_elements; i_element += 2) {
q7_t *src = &input[i_element * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_s16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch,
bias, out, kbuf);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
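
/*
 * Usage sketch (assumed calling convention, with made-up shapes): the q7
 * kernel is expected to be pre-expanded into a q31 kernel buffer once and
 * reused across calls, e.g.
 *
 *   q31_t *kbuf = (q31_t *) get_kernel_buffer();
 *   // ... pack `kernel` into kbuf (two q15 weights per q31 word), in the
 *   //     layout expected by mat_mult_s16 ...
 *   convolve_1x1_s8_kbuf(input, 56, 56, 16, kernel, kbuf, bias,
 *                        output_shift, output_mult, out_offset, input_offset,
 *                        -128, 127, output, 56, 56, 16, runtime_buf);
 *
 * The 56x56x16 shapes are placeholder values; only the argument order matches
 * the signature above.
 */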

View File

@ -0,0 +1,128 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_oddch.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_oddch(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = arm_nn_mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,153 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_skip_pad.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_skip_pad(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q31_t *kbuf = get_kernel_buffer();
volatile int sbufsize = get_sbuffer_size();
int maxcol = sbufsize / input_ch / 2;
int h=0,w=0;
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int skip = 0;
//first element
if (w < pad_l || w >= input_x - pad_r){
if (h < pad_t || h >= input_y - pad_b){
skip++;
}
}
//move to the next element
w++;
if (w == input_x - 1){
h++; w = 0;
}
//second element
if (w < pad_l || w >= input_x - pad_r){
if (h < pad_t || h >= input_y - pad_b){
skip++;
}
}
if (skip == 2){
out += output_ch * 2;
continue;
}
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
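
/*
 * Reading of the skip logic above (hedged): for patch-based inference the
 * patch carries halo pixels that lie entirely inside the zero-padding region.
 * A pair of output pixels is skipped (out is advanced by 2 * output_ch with
 * no GEMM work) only when both pixels fall inside the padded border in x AND
 * y, i.e. w < pad_l || w >= input_x - pad_r together with
 * h < pad_t || h >= input_y - pad_b. For example, with
 * pad_t = pad_b = pad_l = pad_r = 1, the corner pixel at (w, h) = (0, 0)
 * satisfies both conditions and is skipped.
 */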

View File

@ -0,0 +1,213 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel2x3_inputch3_stride2_pad1.c
 * Description: for 2x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel2x3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
const int kernel_y = 2;
const int kernel_x = 3;
//check this during code gen for better performance
if(input_x % 2 != 0 || input_y % 2 != 0){
return PARAM_NO_SUPPORT;
}
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q15_t *kbuf = (q15_t*) get_kernel_buffer();
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 18]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 18;
const q7_t *ip_a1 = ip_a0 + 18;
		//18 weights for each output_ch (2x3 kernel, 3 input channels)
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//17, 18
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
/* skip row */
ip_a0 += 18;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) {
//load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
} else {
//first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
//load 6 elements
//4 * 1 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//4 * 1 + 2 = 6
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
}
} else {
//Padding the first row
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) {
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
} else {
src2 = input;
//pad the first col: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
}
}
two_column_buf += 18;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 18) {
out = mat_mult_unloop18_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* Return to application */
return STATE_SUCCESS;
}
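
/*
 * kbuf layout sketch (derived from the caching loop above): the
 * 2 x 3 x 3 = 18 q7 weights of each output channel are widened to q15 and
 * cached two output channels at a time:
 *
 *   kbuf[ 0..17]  output channel 0   (4 read_and_pad loads + 2 tail values)
 *   kbuf[18..35]  output channel 1
 *   kbuf[36..53]  output channel 2, ...
 *
 * so mat_mult_unloop18_s8_s16 can fetch two q15 weights per 32-bit load.
 * This assumes output_ch is even, matching the i += 2 stride of the loop.
 */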

View File

@ -0,0 +1,285 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_inputch3_stride2_pad1.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
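
/*
 * Worked view of the leftover MAC loop above (illustrative): read_and_pad()
 * widens four q7 weights into two q31 words, each packing two sign-extended
 * q15 values, and arm_nn_read_q15x2_ia() fetches q15 input pairs laid out to
 * match. Each
 *
 *   sum = __SMLAD(ker_a1, ip_b1, sum);   // two multiply-accumulates at once
 *
 * retires two MACs, so the 27 weights of a 3x3x3 window are consumed in
 * 27 >> 2 = 6 loop iterations (two SMLADs each) plus a 3-element scalar tail.
 */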

View File

@ -0,0 +1,300 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_stride1_pad1.c
 * Description: for 3x3 convolution with stride 1 and padding 1; input channels must be a multiple of 4
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_stride1_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
if (input_ch % 4 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q31_t pad_q15x2 = __PKHBT(pad_value, pad_value, 16);
q31_t pad_out_q15x2 = __SADD16(pad_q15x2, offset_q15x2);
int in_row_offset = input_ch * input_x;
for (int i_out_y = 0; i_out_y < output_y; i_out_y++) {
const int16_t base_idx_y = i_out_y - 1;
for (int i_out_x = 0; i_out_x < output_x; i_out_x++) {
const int16_t base_idx_x = i_out_x - 1;
//Img2col for 3x3 kernel
/* Used for SIMD instructions */
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int block_cnt;
q15_t *col_buffer = &two_column_buffer[0];
			//TODO: move these two if statements out of the inner loop to reduce overhead
int ypad_cnt = 0; //no pad by default
if (base_idx_y == -1) { //pad the first row
q31_t *dst_31 = (q31_t*) &col_buffer[0];
int block_cnt = channel_div4;//unroll by 2, 3 element
while (block_cnt > 0) {//total: 16bit * input_ch * 3
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
block_cnt--;
}
ypad_cnt = 1;
}
else if (base_idx_y + 2 == input_y) { //pad the third row
q31_t *dst_31 = (q31_t*) &col_buffer[input_ch * 6];
int block_cnt = channel_div4;//unroll by 2, 3 element
while (block_cnt > 0) {//total: 16bit * input_ch * 3
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
block_cnt--;
}
ypad_cnt = 2;
}
if (ypad_cnt == 0){ //filled all rows
if (base_idx_x == -1) {
/* use pad for the first 1 col */
q31_t *dst_31 = (q31_t*) &col_buffer[0];
q31_t *dst2_31 = (q31_t*) &col_buffer[input_ch * 3];
q31_t *dst3_31 = (q31_t*) &col_buffer[input_ch * 6];
pad_3row_1col(dst_31, dst2_31, dst3_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + base_idx_y * input_x * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
q15_t *dst3 = dst3_31;
load_3row_2col(src, src2, src3, dst, dst2, dst3)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col */
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
					q15_t *dst = &col_buffer[0];
					q15_t *dst2 = &col_buffer[input_ch * 3];
					q15_t *dst3 = &col_buffer[input_ch * 6];
load_3row_2col(src, src2, src3, dst, dst2, dst3)
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
q31_t *dst3_31 = (q31_t*) dst3;
/* use pad for the last 1 col*/
pad_3row_1col(dst_31,dst2_31,dst3_31,pad_out_q15x2)
} else {
/* load 3 col */
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
					q15_t *dst = &col_buffer[0];
					q15_t *dst2 = &col_buffer[input_ch * 3];
					q15_t *dst3 = &col_buffer[input_ch * 6];
load_3row_3col(src, src2, src3, dst, dst2, dst3)
}
}
else if (ypad_cnt == 1){//filled the last two rows
if (base_idx_x == -1){
/* use pad for the first 1 col */
					q31_t *dst_31 = (q31_t *) &col_buffer[input_ch * 3];
					q31_t *dst2_31 = (q31_t *) &col_buffer[input_ch * 6];
pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + 0;
const q7_t *src2 = src + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
load_2row_2col(src, src2, dst, dst2)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col*/
					q15_t *dst = &col_buffer[input_ch * 3];
					q15_t *dst2 = &col_buffer[input_ch * 6];
const q7_t *src = input + base_idx_x * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_2col(src, src2, dst, dst2)
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
/* use pad for the last 1 col*/
pad_2row_1col(dst_31,dst2_31,pad_out_q15x2)
}
else {
/* load 3 col*/
q15_t *dst = &col_buffer[input_ch * 3];
q15_t *dst2 = &col_buffer[input_ch * 6];
const q7_t *src = input + base_idx_x * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_3col(src, src2, dst, dst2)
}
} else{ //filled the first two rows
if (base_idx_x == -1) {
/* use pad for the first 1 col*/
q31_t *dst_31 = (q31_t*) &col_buffer[0];
q31_t *dst2_31 = (q31_t*) &col_buffer[input_ch * 3];
pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + (base_idx_y * input_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
load_2row_2col(src, src2, dst, dst2)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col*/
q15_t *dst = &col_buffer[input_ch * 0];
q15_t *dst2 = &col_buffer[input_ch * 3];
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_2col(src, src2, dst, dst2)
/* use pad for the last 1 col*/
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
pad_2row_1col(dst_31,dst2_31,pad_out_q15x2)
} else {
/* load 3 col*/
q15_t *dst = &col_buffer[input_ch * 0];
q15_t *dst2 = &col_buffer[input_ch * 3];
/* load input to 1 col*/
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_3col(src, src2, dst, dst2)
}
}
two_column_buffer += input_ch * 9;
			/* Computation is performed for every 2 columns */
if (two_column_buffer == runtime_buf + 2 * input_ch * 9)
{
out = mat_mult_kernel_s8_s16(kernel,
runtime_buf,
output_ch,
output_shift,
output_mult,
output_offset,
output_activation_min,
output_activation_max,
input_ch * 9,
bias,
out);
/* counter reset */
two_column_buffer = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buffer != runtime_buf)
{
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++)
{
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * 9) >> 2;
while (col_count)
{
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * 3 * 3 & 0x3;
while (col_count)
{
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t)sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
/**
* @} end of NNConv group
*/
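
/*
 * im2col layout sketch for the loop above (hedged): each output pixel's 3x3
 * window is written into col_buffer as three row segments of
 * kernel_x * input_ch = 3 * input_ch q15 values:
 *
 *   col_buffer[0            .. 3*input_ch - 1]   top row    (or padding)
 *   col_buffer[3*input_ch   .. 6*input_ch - 1]   middle row
 *   col_buffer[6*input_ch   .. 9*input_ch - 1]   bottom row (or padding)
 *
 * Two such 9*input_ch columns are packed back to back before
 * mat_mult_kernel_s8_s16 runs, which is why the trigger condition is
 * two_column_buffer == runtime_buf + 2 * input_ch * 9.
 */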

View File

@ -0,0 +1,232 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3x2_inputch3_stride2_pad1.c
 * Description: for 3x2 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3x2_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 2;
//check this during code gen for better performance
if(input_x % 2 != 0 || input_y % 2 != 0){
return PARAM_NO_SUPPORT;
}
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 18]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 18;
const q7_t *ip_a1 = ip_a0 + 18;
		//18 weights for each output_ch (3x2 kernel, 3 input channels)
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//17, 18
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
/* skip row */
		ip_a0 += 18;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 6;
dst3 = dst2 + 6;
if (base_idx_y != -1) {
if (base_idx_x != -1) {
//load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//3 * 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
} else {
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first col: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 3 elements
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else {
//Padding the first row
//3x2 = 6 elements
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) {
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//3 * 2 = 6 = 4 * 1 + 2
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
} else {
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 3 elements
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 18;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 18) {
out = mat_mult_unloop18_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* Return to application */
return STATE_SUCCESS;
}
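
/*
 * Buffer-size note (assumption, mirroring the 2x3 variant above): kbuf must
 * hold output_ch * 18 q15 values (3x2 kernel, 3 input channels, two output
 * channels cached per iteration of the packing loop), and runtime_buf must
 * hold 2 * 18 q15 values for the double im2col column consumed by
 * mat_mult_unloop18_s8_s16.
 */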

View File

@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_u8_kernel3_inputch3_stride1_pad1.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_u8_kernel3_stride1_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y) - 1;
const int16_t base_idx_x = (i_out_x) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q8_t *src;
q8_t *src2;
q8_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;//channel = 3
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src, dst)
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_u8_kernel3_inputch3_stride2_pad1.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_u8_kernel3_inputch3_stride2_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t* kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q8_t *src;
q8_t *src2;
q8_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src, dst)
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,46 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: element_mult.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
#include "arm_nnfunctions.h"
/*
 * Element-wise multiplication of an (n x n x c) tensor by a (1 x 1 x c) tensor,
 * broadcasting the 1x1xc operand over the spatial dimensions.
 */
tinyengine_status element_mult_nx1(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const q7_t* input2, const int16_t input1_offset, const int16_t input2_offset,
const int16_t output_offset, const int32_t out_activation_min, const int32_t out_activation_max,
const int32_t output_shift, const int32_t output_mult, q7_t* output)
{
int c, element;
for (element = 0; element < input_h * input_w; element++){
		const q7_t* multiplier = input2;
for (c = 0; c < input_c; c++){
const int32_t input1_val = input1_offset + *input++;
const int32_t input2_val = input2_offset + *multiplier++;
int32_t unclamped_result = input1_val * input2_val;
int32_t clamped_result = output_offset + arm_nn_requantize(unclamped_result, output_mult, output_shift);
clamped_result = MAX(clamped_result, out_activation_min);
clamped_result = MIN(clamped_result, out_activation_max);
*output++ = clamped_result;
}
}
	/* Return to application */
	return STATE_SUCCESS;
}
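
/*
 * Worked example (illustrative values only): with input1_offset = 128,
 * input2_offset = 3 and output_offset = -5, an input pair
 * (x1, x2) = (-100, 7) gives
 *
 *   unclamped = (-100 + 128) * (7 + 3) = 280
 *   result    = clamp(-5 + arm_nn_requantize(280, output_mult, output_shift),
 *                     out_activation_min, out_activation_max)
 *
 * i.e. the offsets recentre both q7 operands before the integer multiply, and
 * a single per-tensor (output_mult, output_shift) pair rescales the product
 * back to q7.
 */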

View File

@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: fully_connected.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status fully_connected_fp(
const float *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const uint16_t output_ch, const float *bias,
const float *weights, float *output)
{
int h, w, out_c, in_c;
for (h = 0; h < input_y; h++){
for (w = 0; w < input_x; w++){
int pixel_cnt = w + input_x * h;
for (out_c = 0; out_c < output_ch; out_c++){
float intermediate = bias[out_c];
				const float *start_weight = weights + out_c * input_ch;
				const float *start_input = input + input_ch * pixel_cnt;
				float *start_out = output + output_ch * pixel_cnt;
for (in_c = 0; in_c < input_ch; in_c++){
intermediate += start_weight[in_c] * start_input[in_c];
}
start_out[out_c] = intermediate;
}
}
}
return STATE_SUCCESS;
}
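
A small worked example with illustrative values: a 1x1 spatial input, 3 input channels, 2 output channels; weights are laid out as [output_ch][input_ch]:
#include "tinyengine_function.h"
static void example_fully_connected(void) {
    const float in[3] = {1.f, 2.f, 3.f};
    const float bias[2] = {0.5f, -0.5f};
    const float weights[2 * 3] = {1.f, 0.f, 0.f,   /* output channel 0 */
                                  0.f, 1.f, 1.f};  /* output channel 1 */
    float out[2];
    fully_connected_fp(in, 1, 1, 3, 2, bias, weights, out);
    /* out[0] = 0.5 + 1*1 = 1.5;  out[1] = -0.5 + 2 + 3 = 4.5 */
}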

View File

@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: mat_mul_fp.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status mat_mul_fp(
const float *matA, const uint16_t matA_row, const uint16_t matA_col,
const float* matB, const uint16_t matB_col, float* output)
{
int m, n, i;
for (n = 0; n < matA_row; n++){
for (m = 0; m < matB_col; m++){
float sum = 0;
for (i = 0; i < matA_col; i++){
sum += matA[i + n * matA_col] * matB[m + i * matB_col];
}
output[m + n * matB_col] = sum;
}
}
return STATE_SUCCESS;
}
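
A small worked example (illustrative values), assuming standard row-major storage for both operands: a [2x3] * [3x2] product.
#include "tinyengine_function.h"
static void example_mat_mul(void) {
    const float A[2 * 3] = {1, 2, 3,
                            4, 5, 6};
    const float B[3 * 2] = {1, 0,
                            0, 1,
                            1, 1};
    float C[2 * 2];
    mat_mul_fp(A, 2, 3, B, 2, C);
    /* C = {4, 5,
            10, 11} */
}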

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,50 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: maxpooling.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status max_pooling(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t* output)
{
int h, w, c;
int sh, sw;
for(c = 0; c < input_c; c++){
for(h = 0; h < output_h; h++){
for(w = 0; w < output_w; w++){
int max = out_activation_min;
for(sh = 0; sh < sample_h; sh++){
int height = sh + h * sample_h;
for(sw = 0; sw < sample_w; sw++){
int width = sw + w * sample_w;
max = TN_MAX(max,input[(width + height * input_w) * input_c + c]);
}
}
int out = max;
out = TN_MAX(out, out_activation_min);
out = TN_MIN(out, out_activation_max);
output[(w + h * output_w) * input_c + c] = out;
}
}
}
return STATE_SUCCESS;
}
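
A worked example with illustrative values: non-overlapping 2x2 max pooling of a 4x4 single-channel map.
#include "tinyengine_function.h"
static void example_max_pooling(void) {
    const q7_t in[4 * 4] = { 1,  2,  3,  4,
                             5,  6,  7,  8,
                             9, 10, 11, 12,
                            13, 14, 15, 16};
    q7_t out[2 * 2];
    max_pooling(in, 4, 4, 1, 2, 2, 2, 2, -128, 127, out);
    /* out = {6, 8, 14, 16} */
}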

View File

@ -0,0 +1,252 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_convolve_s8_kernel3_inputch3_stride2.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define HOLD_KERNEL
tinyengine_status patchpadding_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q15_t *kbuf = (q15_t*) get_kernel_buffer();
const q7_t *ip_a0 = kernel;
#ifdef HOLD_KERNEL
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = (q31_t *)dst1;
q31_t *dst2_31 = (q31_t *)dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = (q15_t *)dst1_31;
dst2 = (q15_t *)dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip the next filter, already loaded through ip_a1 */
ip_a0 += 27;
}
#endif
int skip = 0;
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
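/* Note: each output pixel yields one 27-entry q15 column (3x3 window x 3 input
 * channels). Valid pixels are stored as (pixel + input_offset); out-of-patch
 * positions are written as 0, which corresponds to the quantized zero point
 * and therefore acts as real-valued zero padding. */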
q15_t *col_buffer = two_column_buf;
int16_t base_idx_y = (i_out_y * 2);
int16_t base_idx_x = (i_out_x * 2);
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
const q7_t *src;
/* buffer for im2col:16bit */
q15_t *dst = col_buffer;
int skip_top = pad_t - base_idx_y;
int skip_bottom = MAX(0,(base_idx_y + 3) - (input_y - pad_b));//3x3
int y_cnt = 3;//3 rows to load
//fill zeros in the top regions
while (y_cnt > 0 && skip_top-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
base_idx_y++;
}
//fill in the middle
int skip_left = MAX(0,pad_l - base_idx_x);
int skip_right = MAX(0,(base_idx_x + 3) - (input_x - pad_r));//3x3
//address of the first valid values
int m;
for (m = 0; m < y_cnt - skip_bottom; m++){
src = input + ((base_idx_y+m) * input_x + base_idx_x + skip_left) * input_ch;
int x_cnt = 3;//3 columns to load
//fill zero for left regions
int cnt = skip_left;
while(x_cnt > 0 && cnt-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
//load the middle
while(x_cnt > skip_right){
*dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
x_cnt--;
}
//fill zero for right regions (for what's left)
while(x_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
}
y_cnt -= m;
//fill zeros in the bottom regions
while (y_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
#ifdef HOLD_KERNEL
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
// out = mat_mult_s16(kernel,
// runtime_buf, output_ch, output_shift, output_mult,
// output_offset, output_activation_min, output_activation_max,
// input_ch * kernel_y * kernel_x, bias, out, kbuf);
#else
out = arm_nn_mat_mult_kernel_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
27, bias, out);
#endif
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,175 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_depthwise_kernel3x3_stride1_inplace_CHW.c
* Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnsupportfunctions.h" //TODO: remove this in the future for self-contained
#include "tinyengine_function.h"
void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);
tinyengine_status patchpadding_depthwise_kernel3x3_stride1_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
uint16_t c,i,j;
q7_t *cols_8b_start = (q7_t *)runtime_buf;
q7_t* cols_8b = (q7_t* )cols_8b_start;
const q7_t *src;
const q7_t *ksrc = kernel;
//set the output for inplace update
q7_t *inplace_out = input;
int padding_cnt = pad_t * input_x;
//shift the input ptr accordingly for HWC->CHW
input += padding_cnt * input_ch;
//handle top padding
q7_t PAD8 = pad_value;
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (i = pad_t; i < input_y - pad_b; i++){
//handle left padding
for (j = 0; j < pad_l; j++){
*cols_8b++ = PAD8;
}
cols_8b += input_x - (pad_l + pad_r);
//handle right padding
for (j = 0; j < pad_r; j++){
*cols_8b++ = PAD8;
}
}
//handle bottom padding
padding_cnt = pad_b * input_x;
//no need to shift for bottom padding
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (c = 0; c < input_ch; c++){
src = input;
cols_8b = (q7_t*)(cols_8b_start + pad_t * (input_x)); //skip pad_t rows
for(i = pad_t; i < input_y - pad_b; i++){
cols_8b += pad_l;//skip left
src += pad_l * input_ch;
for(j = pad_l; j < input_x - pad_r; j++){
*cols_8b++ = *src;// + input_offset;
src += input_ch;
}
cols_8b += pad_r;//skip right
src += pad_r * input_ch;
}
patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++, output_shift++, inplace_out, output_offset,output_activation_min, output_activation_max,cols_8b_start, input_x, input_ch);
inplace_out++;
input++;
ksrc += 9;
}
return STATE_SUCCESS;
}
void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 1
int i, j;
/* MACs for each output */
for (i = 0; i < output_y; i++) {
for (j = 0; j < output_x / 2; j++) {
q7_t *cols_8b = cols_8b_iterptr;
q31_t sum0 = bias[0];
q31_t sum1 = bias[0];
/* computation */
sum0 += cols_8b[0]*ksrc[0];
sum1 += cols_8b[1]*ksrc[0];
sum0 += cols_8b[1]*ksrc[1];
sum1 += cols_8b[2]*ksrc[1];
sum0 += cols_8b[2]*ksrc[2];
sum1 += cols_8b[3]*ksrc[2];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[3];
sum1 += cols_8b[1]*ksrc[3];
sum0 += cols_8b[1]*ksrc[4];
sum1 += cols_8b[2]*ksrc[4];
sum0 += cols_8b[2]*ksrc[5];
sum1 += cols_8b[3]*ksrc[5];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[6];
sum1 += cols_8b[1]*ksrc[6];
sum0 += cols_8b[1]*ksrc[7];
sum1 += cols_8b[2]*ksrc[7];
sum0 += cols_8b[2]*ksrc[8];
sum1 += cols_8b[3]*ksrc[8];
/* requantize */
sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
sum0 += output_offset;
sum0 = MAX(sum0, activation_min);
sum0 = MIN(sum0, activation_max);
output[(i * output_x + j * 2) * channel_offset] = sum0;
sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
sum1 += output_offset;
sum1 = MAX(sum1, activation_min);
sum1 = MIN(sum1, activation_max);
output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;
cols_8b_iterptr += STRIDE * 2;
}
if (output_x & 1) {
q7_t * cols_8b = cols_8b_iterptr;
q31_t sum = bias[0];
sum += cols_8b[0]*ksrc[0];
sum += cols_8b[1]*ksrc[1];
sum += cols_8b[2]*ksrc[2];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[3];
sum += cols_8b[1]*ksrc[4];
sum += cols_8b[2]*ksrc[5];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[6];
sum += cols_8b[1]*ksrc[7];
sum += cols_8b[2]*ksrc[8];
sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
sum += output_offset;
sum = MAX(sum, activation_min);
sum = MIN(sum, activation_max);
output[(i * output_x + output_x - 1) * channel_offset] = sum;
cols_8b_iterptr += STRIDE;
}
cols_8b_iterptr += 1 * 2;
}
}
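
For clarity, a scalar sketch of what the unrolled kernel above computes for a single output element, assuming an already-padded single-channel CHW plane of width w (illustrative reference only, not part of the optimized path):
#include "arm_nnsupportfunctions.h"
#include "tinyengine_function.h"
static q7_t dw3x3_ref(const q7_t *plane, int w, const q7_t *ksrc,
                      int32_t bias, int32_t biasR, int32_t multiplier, int32_t shift,
                      int32_t output_offset, int32_t act_min, int32_t act_max) {
    int32_t sum = bias;
    for (int ky = 0; ky < 3; ky++)
        for (int kx = 0; kx < 3; kx++)
            sum += plane[ky * w + kx] * ksrc[ky * 3 + kx];
    sum = arm_nn_requantize(sum + biasR, multiplier, shift);
    sum += output_offset;
    if (sum < act_min) sum = act_min;
    if (sum > act_max) sum = act_max;
    return (q7_t)sum;
}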

View File

@ -0,0 +1,176 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_depthwise_kernel3x3_stride2_inplace_CHW.c
* Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnsupportfunctions.h" //TODO: remove this in the future for self-contained
#include "tinyengine_function.h"
void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);
tinyengine_status patchpadding_depthwise_kernel3x3_stride2_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
uint16_t c,i,j;
q7_t *cols_8b_start = (q7_t *)runtime_buf;
q7_t* cols_8b = (q7_t* )cols_8b_start;
const q7_t *src;
const q7_t *ksrc = kernel;
//set the output for inplace update
q7_t *inplace_out = input;
int padding_cnt = pad_t * input_x;
//shift the input ptr accordingly for HWC->CHW
input += padding_cnt * input_ch;
//handle top padding
q7_t PAD8 = pad_value;
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (i = pad_t; i < input_y - pad_b; i++){
//handle left padding
for (j = 0; j < pad_l; j++){
*cols_8b++ = PAD8;
}
cols_8b += input_x - (pad_l + pad_r);
//handle right padding
for (j = 0; j < pad_r; j++){
*cols_8b++ = PAD8;
}
}
//handle bottom padding
padding_cnt = pad_b * input_x;
//no need to shift for bottom padding
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (c = 0; c < input_ch; c++){
src = input;
cols_8b = (q7_t*)(cols_8b_start + pad_t * (input_x)); //skip pad_t rows
for(i = pad_t; i < input_y - pad_b; i++){
cols_8b += pad_l;//skip left
src += pad_l * input_ch;
for(j = pad_l; j < input_x - pad_r; j++){
*cols_8b++ = *src;// + input_offset;
src += input_ch;
}
cols_8b += pad_r;//skip right
src += pad_r * input_ch;
}
patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++, output_shift++, inplace_out, output_offset,output_activation_min, output_activation_max,cols_8b_start, input_x, input_ch);
inplace_out++;
input++;
ksrc += 9;
}
return STATE_SUCCESS;
}
void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 2
int i, j;
/* MACs for each output */
for (i = 0; i < output_y; i++) {
for (j = 0; j < output_x / 2; j++) {
q7_t *cols_8b = cols_8b_iterptr;
q31_t sum0 = bias[0];
q31_t sum1 = bias[0];
/* computation */
sum0 += cols_8b[0]*ksrc[0];
sum1 += cols_8b[2]*ksrc[0];
sum0 += cols_8b[1]*ksrc[1];
sum1 += cols_8b[3]*ksrc[1];
sum0 += cols_8b[2]*ksrc[2];
sum1 += cols_8b[4]*ksrc[2];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[3];
sum1 += cols_8b[2]*ksrc[3];
sum0 += cols_8b[1]*ksrc[4];
sum1 += cols_8b[3]*ksrc[4];
sum0 += cols_8b[2]*ksrc[5];
sum1 += cols_8b[4]*ksrc[5];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[6];
sum1 += cols_8b[2]*ksrc[6];
sum0 += cols_8b[1]*ksrc[7];
sum1 += cols_8b[3]*ksrc[7];
sum0 += cols_8b[2]*ksrc[8];
sum1 += cols_8b[4]*ksrc[8];
/* requantize */
sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
sum0 += output_offset;
sum0 = MAX(sum0, activation_min);
sum0 = MIN(sum0, activation_max);
output[(i * output_x + j * 2) * channel_offset] = sum0;
sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
sum1 += output_offset;
sum1 = MAX(sum1, activation_min);
sum1 = MIN(sum1, activation_max);
output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;
cols_8b_iterptr += STRIDE * 2;
}
if (output_x & 1) {
q7_t * cols_8b = cols_8b_iterptr;
q31_t sum = bias[0];
sum += cols_8b[0]*ksrc[0];
sum += cols_8b[1]*ksrc[1];
sum += cols_8b[2]*ksrc[2];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[3];
sum += cols_8b[1]*ksrc[4];
sum += cols_8b[2]*ksrc[5];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[6];
sum += cols_8b[1]*ksrc[7];
sum += cols_8b[2]*ksrc[8];
sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
sum += output_offset;
sum = MAX(sum, activation_min);
sum = MIN(sum, activation_max);
output[(i * output_x + output_x - 1) * channel_offset] = sum;
cols_8b_iterptr += STRIDE;
}
cols_8b_iterptr += 1 * 2 - (column_x & 1);
cols_8b_iterptr += (STRIDE - 1) * (column_x);
}
}

View File

@ -0,0 +1,179 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t* kernel, const q31_t *kbuf, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
q15_t *col_buffer = two_column_buf;
int16_t base_idx_y = (i_out_y * 2);
int16_t base_idx_x = (i_out_x * 2);
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
const q7_t *src;
/* buffer for im2col:16bit */
q15_t *dst = col_buffer;
int skip_top = pad_t - base_idx_y;
int skip_bottom = MAX(0,(base_idx_y + 3) - (input_y - pad_b));//3x3
int y_cnt = 3;//3 rows to load
//fill zeros in the top regions
while (y_cnt > 0 && skip_top-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
base_idx_y++;
}
//fill in the middle
int skip_left = MAX(0,pad_l - base_idx_x);
int skip_right = MAX(0,(base_idx_x + 3) - (input_x - pad_r));//3x3
//address of the first valid values
int m;
for (m = 0; m < y_cnt - skip_bottom; m++){
src = input + ((base_idx_y+m) * input_x + base_idx_x + skip_left) * input_ch;
int x_cnt = 3;//3 columns to load
//fill zero for left regions
int cnt = skip_left;
while(x_cnt > 0 && cnt-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
//load the middle
while(x_cnt > skip_right){
*dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
x_cnt--;
}
//fill zero for right regions (for what's left)
while(x_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
}
y_cnt -= m;
//fill zeros in the bottom regions
while (y_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = mat_mult_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,40 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: stable_softmax.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
#include <float.h>
#include <math.h>
tinyengine_status statble_softmax_inplace(float *input, const uint16_t length)
{
float max = -FLT_MAX;
float exp_sum = 0;
uint16_t i;
for (i = 0; i < length; i++){
if (input[i] > max) max = input[i];
}
// inplace update
for (i = 0; i < length; i++){
input[i] = expf(input[i] - max);
exp_sum += input[i];
}
for (i = 0; i < length; i++){
input[i] = input[i] / exp_sum;
}
return STATE_SUCCESS;
}
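
A quick illustrative check of the function above:
#include "tinyengine_function.h"
static void example_softmax(void) {
    float logits[3] = {1.0f, 2.0f, 3.0f};
    statble_softmax_inplace(logits, 3);
    /* logits is now approximately {0.090, 0.245, 0.665} and sums to 1 */
}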

View File

@ -0,0 +1,85 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: upsample_byte.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
tinyengine_status upsample_byte(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
//get output resolution
const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor , output_ch = input_ch;
//upsample in a repeated manner
for(int ih = 0; ih < input_y; ih++){
q7_t* out_head = output;
//place 1 row
for(int iw = 0; iw < input_x; iw++){
for(int s = 0; s < sample_factor; s++){
memcpy(output, input, input_ch);
output += input_ch;
}
input += input_ch;
}
//copy the remaining rows
for(int s = 1; s < sample_factor; s++){
memcpy(output, out_head, output_ch * output_x);
output += output_ch * output_x;
}
}
return STATE_SUCCESS;
}
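
A worked example with illustrative values: factor-2 nearest-neighbour upsampling of a 2x2 single-channel map.
#include "tinyengine_function.h"
static void example_upsample(void) {
    const q7_t in[2 * 2] = {1, 2,
                            3, 4};
    q7_t out[4 * 4];
    upsample_byte(in, 2, 2, 1, out, 2);
    /* out = {1, 1, 2, 2,
              1, 1, 2, 2,
              3, 3, 4, 4,
              3, 3, 4, 4} */
}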
//ref: https://www.cs.toronto.edu/~guerzhoy/320/lec/upsampling.pdf
tinyengine_status upsample_byte_bilinear(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
//get output resolution
const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor , output_ch = input_ch;
// //upsample in a repeated manner
// for(int oh = 0; oh < input_y; oh++){
// int ih = oh / sample_factor;
// int rh = oh % sample_factor;
//
// q7_t* out_head = output;
// //place 1 row
// for(int ow = 0; ow < output_x; ow++){
// int iw = ow / sample_factor;
// int rw = ow % sample_factor;
//
// //exact coordinate
// q7_t* ori_input = input + input_ch * (input_x * ih + iw);
// if((rh | rw) == 0){
// memcpy(output, ori_input, input_ch);
// continue;
// }
//
// //interpolate
// q7_t* topleft = ori_input;
// q7_t* topright = ori_input + input_ch;
// q7_t* bottomleft = topleft + input_ch * input_x;
// q7_t* bottomright = topright + input_ch * input_x;
// }
// }
return STATE_SUCCESS;
}

1
TinyEngine/third_party/CMSIS vendored Submodule

@ -0,0 +1 @@
Subproject commit 5b58d2da8af7cee64cc9145ee1154609bdfee9f9

BIN
assets/detection.tflite Normal file

Binary file not shown.

View File

@ -0,0 +1,23 @@
{
"output1": {
"name": "Yolo3Output",
"input_id": "175",
"num_class": 1,
"anchors": [116, 90, 156, 198, 373, 326],
"stride": 32
},
"output2": {
"name": "Yolo3Output",
"input_id": "36",
"num_class": 1,
"anchors": [30, 61, 62, 45, 59, 119],
"stride": 16
},
"output3": {
"name": "Yolo3Output",
"input_id": "5",
"num_class": 1,
"anchors": [10, 13, 16, 30, 33, 23],
"stride": 8
}
}

[Binary image assets added in this commit; previews not shown.]

BIN
assets/figures/overview.png Normal file

Binary file not shown.


BIN
assets/vww.tflite Normal file

Binary file not shown.

View File

@ -0,0 +1,667 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: CodeGenerator.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import os
from .OpGenerator import OpGenerator
Codegen_root = "./codegen/"
include_path = Codegen_root + "Include/"
source_path = Codegen_root + "Source/"
use_hard_switsh = False
gen_kernels = True
use_aggressive_unroll = True
class CodeGenerator:
"""Provide utilities to generate C code for a given model and memory schdeule."""
parse_count = 0
header_handle = None
source_handle = None
def __init__(
self,
memsche,
inplace,
precision=8,
unsigned_input=False,
patch_params=None,
FP_output=False,
profile_mode=False,
fp_requantize=False,
tflite_op=False,
dummy_address=False,
outputTables=None,
detectionUtils=None,
):
self.MemSche = memsche
# Check if path exists, create it if not
if not os.path.exists(include_path):
os.makedirs(include_path)
if not os.path.exists(source_path):
os.makedirs(source_path)
self.header_handle = open(include_path + "genModel.h", "w")
self.source_handle = open(source_path + "genModel.c", "w")
self.inplace = inplace
self.BIT = precision
self.unsigned_input = unsigned_input
self.patch_params = patch_params
self.FP_output = FP_output
self.profile_mode = profile_mode
self.fp_requantize = fp_requantize
self.tflite_op = tflite_op
self.dummy_address = dummy_address
self.trainSRAMTable = []
self.outputTables = outputTables
self.detectionUtils = detectionUtils
def _readOnly(self, name):
if self.outputTables is None or name is None:
return True
else:
for o in self.outputTables:
if o.name in name:
return False
return True
def codeGeneration(self):
# buffer in SRAM
self._genMemBuffer()
# parse trainable parameters & assign the corresponding buffers for layers
self._parseTrainable()
# include all headers
self._includeHeaders()
# generate detection output if any
self._genDetprocessing()
# generate patch-based
self._genPatchInference()
# generate invoke function
self._genInvoke()
self._closefp()
# generate operator kernels
if gen_kernels:
op_gen = OpGenerator(include_path, source_path, self.MemSche.layer, self.fp_requantize)
op_gen.genOpcode()
def _genDetprocessing(self):
if self.detectionUtils is not None:
fp = self.source_handle
fp.write(self.detectionUtils.genPostProcessing())
def _genOpstr(self, op, *args):
if self.profile_mode:
if len(args) > 0:
return op.generate_profiling_str(*args)
else:
return op.generate_profiling_str()
else:
if len(args) > 0:
return op.generate_inference_str(*args)
else:
return op.generate_inference_str()
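# Patch-based inference (MCUNetV2): the generated end2endinference() runs the
# memory-heavy first stage patch by patch (copying each input patch into
# buffer0 and calling invoke_1patch), stitches every patch's output into
# buffer1, and then runs the remaining layers once via invoke().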
def _genPatchInference(self):
schedule = self.MemSche
layer_info = schedule.layer[0].get_layer_info()
if "is_patch" in layer_info and layer_info["is_patch"]:
fp = self.source_handle
string = ""
first_height = layer_info["input_h"]
first_width = layer_info["input_w"]
img_w = (first_width - self.patch_params["pad_l"] - self.patch_params["pad_r"]) * self.patch_params[
"n_patch"
]
# by default, we go three stride 2 conv in the patch-based inference
patch_out_w = int((first_width - self.patch_params["pad_l"]) / 8)
# by default, we go three stride 2 conv in the patch-based inference
patch_out_h = int((first_height - self.patch_params["pad_l"]) / 8)
out_w = self.patch_params["output_w"]
# generate code for testing whole inference time
string += (
"""void end2endinference(q7_t* img){
//stage 1
int i, j, h, w, c;
for (i = 0; i < """
+ str(self.patch_params["n_patch"])
+ """; i++){
uint16_t pad_t=0,pad_b=0;
if (i == 0){
pad_t = """
+ str(self.patch_params["pad_l"])
+ """;
}
else if (i == """
+ str(self.patch_params["n_patch"] - 1)
+ """){
pad_b = """
+ str(self.patch_params["pad_r"])
+ """;
}
for (j = 0; j < """
+ str(self.patch_params["n_patch"])
+ """; j++){
uint16_t pad_l=0,pad_r=0;
if (j == 0){
pad_l = """
+ str(self.patch_params["pad_l"])
+ """;
}
else if (j == """
+ str(self.patch_params["n_patch"] - 1)
+ """){
pad_r = """
+ str(self.patch_params["pad_r"])
+ """;
}
/* load partial input from the img */
q7_t* patch_input = &buffer0[0]; // for partial input
int start_x = MAX("""
+ str(first_width - self.patch_params["pad_l"])
+ """ * j - """
+ str(self.patch_params["pad_l"])
+ """,0);
int start_y = MAX("""
+ str(first_height - self.patch_params["pad_l"])
+ """ * i - """
+ str(self.patch_params["pad_l"])
+ """,0);
q7_t* img_ptr = &img[(start_x + start_y * """
+ str(img_w)
+ """) * 3];
//skip top
patch_input += pad_t * """
+ str(first_width)
+ """ * 3;
for (h = pad_t; h < """
+ str(first_height)
+ """ - pad_b; h++){
//skip left
patch_input += pad_l * 3;
//fill middle
int bytes = ("""
+ str(first_width)
+ """ - (pad_l + pad_r)) * 3;
memcpy (patch_input, img_ptr, bytes);
img_ptr += """
+ str(img_w)
+ """ * 3;
patch_input += bytes;
//skip right
patch_input += pad_r * 3;
}
invoke_1patch(pad_t,pad_b,pad_l,pad_r);
/* concat the output from buffer0 (this is set manually for now) */
q7_t* output_ptr = buffer1 + (i * """
+ str(patch_out_w)
+ """ * """
+ str(out_w)
+ """ + j * """
+ str(patch_out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ ;
for (h = 0; h < """
+ str(patch_out_h)
+ """; h++){
for (w = 0; w < """
+ str(patch_out_w)
+ """; w++){
for (c = 0; c < """
+ str(self.patch_params["output_c"])
+ """; c++){
output_ptr[(w + h * """
+ str(out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ + c] = buffer0[(w + h * """
+ str(patch_out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ + c];
}
}
}
}
}
//stage 2
invoke();
}"""
)
string += """
void invoke_1patch(uint16_t pad_t, uint16_t pad_b, uint16_t pad_l ,uint16_t pad_r){
"""
fp.write(string)
# gen patch-based inference code
patch_layers = []
layercnt = 0
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
if "is_patch" not in layer_info or not layer_info["is_patch"]:
break # end of patch-based
string = "/* layer " + str(layercnt) + ":" + layer_info["op"] + " */\n"
layercnt += 1
fp.write(string)
if layer_info["op"] == "CONV_2D":
# hardcode this memory schedule for quick implementation
# TODO: adjust this according to model architecture and split index
next_layer_info = schedule.layer[i + 1].get_layer_info()
if "is_patch" not in next_layer_info or not next_layer_info["is_patch"]:
layer_info["output_buf_add"] = "front"
layer_info["output_buf_add_offset"] = 0
if self.unsigned_input:
raise Exception("unsigned input is not supported by patch-based yet")
string = self._genOpstr(
op,
False,
self.FP_output,
use_aggressive_unroll,
use_hard_switsh,
self.fp_requantize,
)
fp.write(string)
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
string = self._genOpstr(op, self.fp_requantize)
fp.write(string)
elif layer_info["op"] == "ADD":
string = self._genOpstr(op)
fp.write(string)
patch_layers.append(schedule.layer[i])
# remove these layers for patching for the following code gen
for layer in patch_layers:
schedule.layer.remove(layer)
string = "}\n\n"
fp.write(string)
else: # not patch-based
string = """void end2endinference(q7_t* img){
invoke(NULL);
}
"""
fp = self.source_handle
fp.write(string)
def _genInvoke(self):
fp = self.source_handle
string = "void invoke(float* labels){\n"
fp.write(string)
schedule = self.MemSche
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
string = "/* layer " + str(i) + ":" + layer_info["op"] + " */\n"
fp.write(string)
if layer_info["op"] == "CONV_2D":
if (
self.FP_output
and "effective_scale" in layer_info
and layer_info["output_scale"] is not None
and layer_info["effective_scale"] is not None
):
use_fp = True
else:
use_fp = False
string = self._genOpstr(
op,
self.unsigned_input,
use_fp,
use_aggressive_unroll,
use_hard_switsh,
self.fp_requantize,
self.tflite_op,
self.dummy_address,
)
fp.write(string)
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
string = self._genOpstr(op, self.fp_requantize)
fp.write(string)
else:
string = self._genOpstr(op)
fp.write(string)
string = "}\n"
fp.write(string)
def _getBufferIndex(self, location):
if location == "front":
return 0
elif location == "end":
return 0
elif location == "residual":
return 1
return None
def _genMemBuffer(self):
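# SRAM layout of the single static `buffer` (in order): activation buffer0,
# residual buffer1, im2col scratch sbuf (int16), and kernel scratch kbuf (int32).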
schedule = self.MemSche
# define output tensor
string = "#define NNoutput &buffer0[" + str(_findtheinferenceOutput(schedule.layer)) + "];"
fp = self.header_handle
fp.write("\n" + string + "\n")
# activation buffers
string = "\n/* sram:" + str(schedule.peakmem) + ", flash:" + str(schedule.flash) + " */\n"
fp.write(string + "\n")
string = "static signed char buffer[" + str(schedule.peakmem) + "];\n"
fp.write(string)
accumulate_ptr = 0
string = "static signed char *buffer0 = &buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["input_output"])
fp.write(string)
string = "static signed char *buffer1 = &buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["residual"])
fp.write(string)
string = "static int16_t *sbuf = (int16_t *)&buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["im2col"])
fp.write(string)
string = "static int32_t *kbuf = (int32_t *)&buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["kernel"])
fp.write(string)
string = "const int SBuffer_size = " + str(int(schedule.buffers["im2col"])) + ";\n"
fp.write(string)
string = "const int KBuffer_size = " + str(int(schedule.buffers["kernel"])) + ";\n"
fp.write(string + "\n")
def _includeHeaders(self):
include_string = """/* Automatically generated source file */
#include <float.h>
#include "arm_nnfunctions.h"
#include "genNN.h"
#include "genModel.h"
#include "tinyengine_function.h"
//#include "tinyengine_function_fp.h"
"""
if self.profile_mode:
include_string += '#include "profile.h"\n'
include_string += """
/* Variables used by all ops */
ADD_params add_params;
//Conv_Params conv_params;
//Depthwise_Params dpconv_params;
int i;
int8_t *int8ptr;
float *fptr,*fptr2,*fptr3;
signed char* getInput() {
return &buffer0[""" + f"{self.MemSche.layer[0].params['input_buf_add_offset']}" + """];
}
signed char* getOutput() {
return NNoutput;
}\n"""
fp = self.source_handle
fp.write(include_string)
def _parseTrainable(self):
schedule = self.MemSche
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D":
self._parseWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["weight_name"],
self._readOnly(layer_info["weight_name"]),
)
if "bias_name" in layer_info:
self._parseBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["bias_name"],
self._readOnly(layer_info["bias_name"]),
)
else:
self._parseBias(self.parse_count, layer_info["bias"].flatten())
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
self._parseRequantize(
self.parse_count,
layer_info["shift"].flatten(),
layer_info["multiplier"].flatten(),
)
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
if layer_info["kernel_h"] > layer_info["kernel_w"]:
self._parseCWHWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["kernel_h"],
layer_info["kernel_w"],
layer_info["input_c"],
)
else:
if "weight_name" in layer_info:
self._parseCHWWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
else:
self._parseCHWWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
if "bias_name" in layer_info:
self._parseoffsetBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["input_zero_point"] * -1,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
layer_info["bias_name"],
self._readOnly(layer_info["bias_name"]),
)
else:
self._parseoffsetBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["input_zero_point"] * -1,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
self._parseRequantize(
self.parse_count,
layer_info["shift"].flatten(),
layer_info["multiplier"].flatten(),
)
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "FULLY_CONNECTED":
self._parseWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["weight_name"],
self._readOnly(layer_info["weight_name"]),
)
self._parseBias(self.parse_count, layer_info["bias"].flatten())
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "SOFTMAX":
pass
def _parseCWHWeight(self, Lindex, weight, height, width, channel):
fp = self.header_handle
# 8bit implementation
if self.BIT == 8:
string = "const unsigned char CWHweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for j in range(channel):
for w in range(width):
for h in range(height):
value = weight[(h * width + w) * channel + j]
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
else:
raise NotImplementedError
fp.write("};\n")
def _parseCHWWeight(self, Lindex, weight, channel):
fp = self.header_handle
kernelsize = int(len(weight) / channel)
# 8bit implementation
if self.BIT == 8:
string = "const unsigned char CHWweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for j in range(channel):
for i in range(kernelsize):
value = int(weight[i * channel + j])
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
else:
raise NotImplementedError
fp.write("};\n")
def _parseEffectivescales(self, Lindex, scales):
fp = self.header_handle
string = "const float scales" + str(Lindex) + "[" + str(len(scales)) + "] = {"
fp.write(string)
for _, value in enumerate(scales):
fp.write(str(value) + ", ")
fp.write("};\n")
def _parseWeight(self, Lindex, weight, weight_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}unsigned char weight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for _, value in enumerate(weight):
value = int(value)
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
fp.write("};\n")
if weight_name is not None:
for r in self.trainSRAMTable:
if r.name == weight_name:
return
self.trainSRAMTable.append(tensorRecorder(weight_name, len(weight), "unknown"))
if weight.dtype == "int8":
string = f"{const_str}unsigned char* {weight_name}=weight" + str(Lindex) + ";\n"
else:
raise NotImplementedError
fp.write(string)
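# Note on the fused bias below: the depth-wise CHW kernels copy inputs without
# adding the input offset, so the term input_offset * sum(weights of a channel)
# is folded into the bias instead. offsetBias holds the int32-clipped fused
# value; offsetRBias holds the clipped-off remainder, which the kernel adds
# back (as biasR) just before requantization.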
def _parseoffsetBias(self, Lindex, bias, input_offset, weight, channel, bias_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}int32_t offsetBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
kernelsize = int(len(weight) / channel)
# fuse the offset into bias
for i in range(channel):
tmpW = 0
for j in range(kernelsize):
tmpW += weight[j * channel + i]
fp.write(str(self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
fp.write("};\n")
string = f"{const_str}int32_t offsetRBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
kernelsize = int(len(weight) / channel)
for i in range(channel):
tmpW = 0
for j in range(kernelsize):
tmpW += weight[j * channel + i]
fp.write(str(bias[i] + tmpW * input_offset - self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
fp.write("};\n")
def _parseBias(self, Lindex, bias, bias_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}int32_t bias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
for _, value in enumerate(bias):
value = int(value)
fp.write(str(value) + ", ")
fp.write("};\n")
def _parseRequantize(self, Lindex, shift, multiplier):
fp = self.header_handle
string = "const int32_t shift" + str(Lindex) + "[" + str(len(shift)) + "] = {"
fp.write(string)
for _, value in enumerate(shift):
fp.write(str(value) + ", ")
fp.write("};\n")
string = "const int32_t multiplier" + str(Lindex) + "[" + str(len(multiplier)) + "] = {"
fp.write(string)
for _, value in enumerate(multiplier):
fp.write(str(value) + ", ")
fp.write("};\n")
def int32_clip(self, a):
if a < -(2**31):
return -(2**31)
elif a > 2**31 - 1:
return 2**31 - 1
return a.astype(int)
def _closefp(self):
self.header_handle.close()
self.source_handle.close()
def _findtheinferenceOutput(layers):
for cnt, op in enumerate(layers):
if op.params["output_dtype"] != "int8":
return layers[cnt - 1].params["output_buf_add_offset"]
return layers[-1].params["output_buf_add_offset"]
class tensorRecorder:
def __init__(self, name, len, dtype):
self.name = name
self.len = len
self.dtype = dtype

View File

@ -0,0 +1,72 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: CodegenUtilTFlite.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import os
from tempfile import TemporaryDirectory
from .CodeGenerator import CodeGenerator
from .GeneralMemoryScheduler import GeneralMemoryScheduler
from .TfliteConvertor import TfliteConvertor
def GenerateSourceFilesFromTFlite(
tflite_path,
life_cycle_path=None,
):
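"""Generate TinyEngine C sources (codegen/Source/genModel.c and codegen/Include/genModel.h) from a .tflite model.
If life_cycle_path is given, the tensor life-cycle visualization is written there.
Returns the activation (input/output) buffer size in bytes required by the memory schedule.
"""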
use_inplace = True
with TemporaryDirectory() as WORKING_DIR:
if life_cycle_path is None:
schedule_image_path = os.path.join(WORKING_DIR, "schedule.png")
else:
schedule_image_path = life_cycle_path
tf_convertor = TfliteConvertor(tflite_path)
tf_convertor.parseOperatorInfo()
layer = tf_convertor.layer
outTable = []
VisaulizeTrainable = False # disable for code gen
memory_scheduler = GeneralMemoryScheduler(
layer,
False,
False,
outputTables=outTable,
inplace=use_inplace,
mem_visual_path=schedule_image_path,
VisaulizeTrainable=VisaulizeTrainable,
)
memory_scheduler.USE_INPLACE = use_inplace
memory_scheduler.allocateMemory()
outTable = tf_convertor.outputTables if hasattr(tf_convertor, "outputTables") else []
code_generator = CodeGenerator(
memsche=memory_scheduler,
inplace=memory_scheduler.USE_INPLACE,
unsigned_input=False,
patch_params=None,
FP_output=False,
profile_mode=False,
fp_requantize=True,
tflite_op=False,
dummy_address=False,
outputTables=outTable,
)
# set detection outputs before codegen if any
code_generator.codeGeneration()
return memory_scheduler.buffers["input_output"]

View File

@ -0,0 +1,389 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: GeneralMemoryScheduler.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
from .allocator.firstFit import FirstFit
from .constant import TTYPE_INFERNECE
class GeneralMemoryScheduler:
def __init__(
self,
layer,
tflite_op=False,
dummy_address=False,
memory_limit=10 * 1024 * 1024,
inplace=True,
outputTables=None,
mem_visual_path="codegen/allocation.png",
VisaulizeTrainable=True,
):
self.layer = layer
self.heads = 0
self.buffers = {
"input_output": 0,
"residual": 0,
"im2col": 0,
"kernel": 0,
"feature": 0,
"trainable": 0,
} # for feature pyramid
# overall memory info
self.peakmem = 0
self.flash = 0
self.bias = 0
self.scale = 0
self.code = 0
self.allocator = FirstFit(memory_limit)
self.outputTables = outputTables
self.USE_INPLACE = inplace
self.mem_visual_path = mem_visual_path
self.tflite_op = tflite_op
self.dummy_address = dummy_address
self.VisaulizeTrainable = VisaulizeTrainable
# for showing layer-wise memory usage
self.layermem = []
def _isTranable(self, name):
for o in self.outputTables:
if isinstance(name, str) and o.name in name:
return True
return False
def allocateMemory(self):
# assign the same graph index for inplace operations
# note: we need to handle stride == 2 for int8 depthwise to save memory
if self.USE_INPLACE:
for i, op in enumerate(self.layer):
if op.params["op"] == "DEPTHWISE_CONV_2D" and op.params["input_dtype"] == "int8" and not self.tflite_op:
# set the idx of output and next layer input
previous_output_idx = op.output_tensors[0].graph_idx
op.output_tensors[0].graph_idx = op.input_tensors[0].graph_idx
if (
i + 1 < len(self.layer)
and len(self.layer[i + 1].input_tensors) > 0
and str(self.layer[i + 1].input_tensors[0].graph_idx) == str(previous_output_idx)
):
self.layer[i + 1].input_tensors[0].graph_idx = op.input_tensors[0].graph_idx
# update following ops' tensors
for following_idx in range(i, len(self.layer)):
for cnt, inp_tensor in enumerate(self.layer[following_idx].input_tensors):
if str(inp_tensor.graph_idx) == str(previous_output_idx):
inp_tensor.graph_idx = op.input_tensors[0].graph_idx
num_layers = len(self.layer)
# go through all tensors in the model
for i, op in enumerate(self.layer):
# get all unallocated tensors for this layer
unallocated_tensors = []
for t in op.input_tensors:
if t.allocator_idx is None:
unallocated_tensors.append(t)
for cnt, t in enumerate(op.output_tensors):
if cnt == 0 and not (
self.USE_INPLACE
and op.params["op"] == "DEPTHWISE_CONV_2D"
and op.params["input_dtype"] == "int8"
and not self.tflite_op
):
if t.allocator_idx is None:
unallocated_tensors.append(t)
# assume second outputs will not be updated in place
else:
if t.allocator_idx is None:
unallocated_tensors.append(t)
# add each tensor
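# A tensor's lifetime starts at the layer that first produces or consumes it
# (start_idx) and ends right after its last consumer (end_idx); tensors that
# are never read again (e.g. the final output) stay alive until the last layer.
# Disjoint lifetimes allow the first-fit allocator to reuse the same SRAM region.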
for cnt, t in enumerate(unallocated_tensors):
start_idx = i
end_idx = i + 1 if i == 0 else num_layers
for idx in range(start_idx + 1, num_layers):
for input_t in self.layer[idx].input_tensors:
if str(t.graph_idx) == str(input_t.graph_idx):
end_idx = idx + 1
# check if this is output
ttype = TTYPE_INFERNECE
# add the tensor
t.allocator_idx = self.allocator.addTensor(start_idx, end_idx, t.len(), name=t.graph_idx, type=ttype)
# propagate the allocation to tensors with the same idx
for j in range(i + 1, num_layers):
opp = self.layer[j]
for tt in opp.input_tensors:
if str(t.graph_idx) == str(tt.graph_idx):
tt.allocator_idx = t.allocator_idx
# not inplace update
for tt in opp.output_tensors:
if str(t.graph_idx) == str(tt.graph_idx):
tt.allocator_idx = t.allocator_idx
# for detailed memory
layermem = {}
layermem["MAC"] = op.get_macs()
layermem["activation"] = op.get_activation_size()
layermem["scale"] = op.get_scale_size()
layermem["runtime"] = op.get_sbuf_size()
layermem["kernel"] = op.get_kbuf_size()
self._enlargeBuffer("im2col", layermem["runtime"])
self._enlargeBuffer("kernel", layermem["kernel"])
if (
"weight_name" in op.params
and self._isTranable(op.params["weight_name"])
and op.params["op"] != "TRANSPOSE_CONV_2D"
):
size = int(op.get_weights_size())
self.buffers["trainable"] += size
layermem["trainable"] = size
layermem["weight"] = 0
else:
layermem["weight"] = int(op.get_weights_size())
if "bias_name" in op.params and self._isTranable(op.params["bias_name"]):
size = int(op.get_bias_size())
self.buffers["trainable"] += size
if "trainable" in layermem:
layermem["trainable"] += size
else:
layermem["trainable"] = size
layermem["bias"] = 0
else:
layermem["bias"] = int(op.get_bias_size())
# if it is a float32 op, its weights/bias should come from SRAM buffers
if op.params["input_dtype"] != "int8":
layermem["scale"] = 0
layermem["bias"] = 0
layermem["weight"] = 0
self.__increaseFlash(layermem["weight"])
self.__increaseFlash(layermem["bias"])
self.__increaseFlash(layermem["scale"])
self.layermem.append(layermem)
# find out int8 inplace depthwise conv and stride == 2
for i, op in enumerate(self.layer):
if (
op.params["op"] == "DEPTHWISE_CONV_2D"
and op.params["input_dtype"] == "int8"
and op.params["stride_h"] == op.params["stride_w"] == 2
):
if op.input_tensors[0].allocator_idx == op.output_tensors[0].allocator_idx:
self.allocator.rectangles[op.input_tensors[0].allocator_idx]["stride2_inplace_idx"] = i
# Reorder the rectangles to decide which tensor needs to be scheduled first
self.allocator.sortSize()
self.allocator.allocate()
self.allocator.visualize(self.mem_visual_path)
self._enlargeBuffer("input_output", self.allocator.get_peak())
# sanity check, see if all tensors have been allocated
for i, op in enumerate(self.layer):
            # check that all tensors of this layer have been allocated
for cnt, t in enumerate(op.input_tensors):
assert t.allocator_idx is not None
for cnt, t in enumerate(op.output_tensors):
assert t.allocator_idx is not None
# assign the address according to placement
for i, op in enumerate(self.layer):
            # assign addresses to this layer's tensors
for cnt, t in enumerate(op.input_tensors):
if cnt == 0:
op.params["input_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input_buf_add"] = "front"
elif cnt == 1:
op.params["input2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input2_buf_add"] = "front"
elif cnt == 2:
op.params["input3_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input3_buf_add"] = "front"
op.input_tensors[cnt].buffer_name = "buffer0"
op.input_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
for cnt, t in enumerate(op.output_tensors):
if cnt == 0:
op.params["output_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["output_buf_add"] = "front"
op.output_tensors[cnt].buffer_name = "buffer0"
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
if cnt == 1:
op.params["output2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["output2_buf_add"] = "front"
op.output_tensors[cnt].buffer_name = "buffer0"
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
# calculate peak mem
self.peakmem = (
self.allocator.get_peak() + self.buffers["im2col"] + self.buffers["kernel"] # + self.buffers["trainable"]
)
def dumpLayerIndex(self):
# header
print("-" * 14 + " Tensor Allocation Details " + "-" * 14)
print(" #op | operator type | input index | output index |")
for cnt, l in enumerate(self.layer):
operator_num = "#" + str(cnt)
type = str(l.params["op"])
input_tensor = ""
for cnt_inp, inp in enumerate(l.input_tensors):
input_tensor += str(inp.allocator_idx)
if cnt_inp < len(l.input_tensors) - 1:
input_tensor += ","
output_tensor = str(l.output_tensors[0].allocator_idx)
string = (
operator_num.ljust(5)
+ "|"
+ type.ljust(19)
+ "|"
+ input_tensor.ljust(13)
+ "|"
+ output_tensor.ljust(14)
+ "|"
)
print(string)
def dumpLayerMem(self):
# header
print(
"---------------------------------------------------- Schedule Details ----------------------------------------------------------------" # noqa: E501
)
print(
"----------------------| SRAM || Flash | |" # noqa: E501
)
print(
"----------------------| activation | runtime | trainable | sum || weight | bias | scale | sum | MAC |" # noqa: E501
)
layermem = self.layermem
self.__dumpMemInfo(layermem)
def __dumpMemInfo(self, layermem):
string = "-------Schedule-------|"
maxActive = self.buffers["input_output"]
maxRuntime = self.buffers["im2col"] + self.buffers["kernel"]
maxTrainable = self.buffers["trainable"]
totalWeight = self.__sumKey(layermem, "weight")
totalBias = self.__sumKey(layermem, "bias")
totalScale = self.__sumKey(layermem, "scale")
totalMAC = self.__sumKey(layermem, "MAC")
string += str(maxActive).ljust(14) + "|"
string += str(maxRuntime).ljust(11) + "|"
string += str(maxTrainable).ljust(12) + "|"
string += str(maxActive + maxRuntime + maxTrainable).ljust(8) + "||"
string += str(totalWeight).ljust(12) + "|"
string += str(totalBias).ljust(10) + "|"
string += str(totalScale).ljust(10) + "|"
string += str(totalWeight + totalBias + totalScale).ljust(13) + "|"
string += str(totalMAC).ljust(13) + "|"
print(string)
for i, _ in enumerate(layermem):
layer_info = self.layer[i].get_layer_info()
string = ""
string += str(i) + ":" + layer_info["op"]
string = string.ljust(22) + "|"
SRAM = 0
if "activation" in layermem[i]:
substr = (
str(layermem[i]["activation"]) + " (" + "{:.0%}".format(layermem[i]["activation"] / maxActive) + ")"
)
string += substr.ljust(14) + "|"
SRAM += layermem[i]["activation"]
if "runtime" in layermem[i]:
sbuf = layermem[i]["runtime"] + layermem[i]["kernel"]
substr = str(sbuf) + " (" + "{:.0%}".format(sbuf / maxRuntime) + ")"
string += substr.ljust(11) + "|"
SRAM += sbuf
else:
string = string.ljust(49) + "|"
if "trainable" in layermem[i]:
substr = (
str(layermem[i]["trainable"])
+ " ("
+ "{:.0%}".format(layermem[i]["trainable"] / maxTrainable)
+ ")"
)
string += substr.ljust(12) + "|"
SRAM += layermem[i]["trainable"]
else:
string = string.ljust(62) + "|"
# SRAM end
string += str(SRAM)
string = string.ljust(71) + "||"
flash = 0
if "weight" in layermem[i]:
substr = (
str(layermem[i]["weight"])
+ " ("
+ "{:.0%}".format(layermem[i]["weight"] / (totalWeight + 0.0001))
+ ")"
)
string += str(substr).ljust(12) + "|"
flash += layermem[i]["weight"]
if "bias" in layermem[i]:
substr = (
str(layermem[i]["bias"]) + " (" + "{:.0%}".format(layermem[i]["bias"] / (totalBias + 0.0001)) + ")"
)
string += str(substr).ljust(10) + "|"
flash += layermem[i]["bias"]
if "scale" in layermem[i]:
substr = (
str(layermem[i]["scale"]) + " (" + "{:.0%}".format(layermem[i]["scale"] / totalScale + 0.0001) + ")"
)
string += str(substr).ljust(10) + "|"
flash += layermem[i]["scale"]
if flash > 0:
string += (
str(flash)
+ " ("
+ "{:.0%}".format(flash / (totalWeight + totalBias + totalScale + 0.0001))
+ ")"
)
string = string.ljust(121) + "|"
# flash end
if "MAC" in layermem[i]:
substr = str(layermem[i]["MAC"]) + " (" + "{:.0%}".format(layermem[i]["MAC"] / totalMAC) + ")"
string += str(substr).ljust(13) + "|"
print(string)
def __sumKey(self, layers, key):
result = 0
for _, layer in enumerate(layers):
if key in layer:
result += layer[key]
return result
def getBuffers(self):
return self.buffers
    # Maximum binary size: this should be updated whenever the inference side changes
    # TODO: Combine with code generation to get a more accurate result
def profileResult(self):
return self.peakmem, self.flash + self.bias + self.scale + int(self.code * 1024)
def __increaseFlash(self, size):
self.flash += int(size)
def _enlargeBuffer(self, buf_str, size):
if buf_str == "input_output" or buf_str == "residual":
self.buffers[buf_str] = max(self.buffers[buf_str], int(size))
else:
if buf_str not in self.buffers:
self.buffers[buf_str] = size
else:
self.buffers[buf_str] = max(self.buffers[buf_str], size)
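

if __name__ == "__main__":
    # Standalone sketch with toy tensor lifetimes (hypothetical numbers, not taken
    # from any real model): with lifetimes expressed as the half-open
    # [start_idx, end_idx) ranges handed to addTensor() above, no placement of the
    # tensors can reach an activation peak below the largest total size that is
    # alive at any single layer index.
    toy_tensors = [
        {"start": 0, "end": 3, "size": 48 * 48 * 16},
        {"start": 1, "end": 2, "size": 48 * 48 * 96},
        {"start": 2, "end": 4, "size": 24 * 24 * 96},
    ]
    last_idx = max(t["end"] for t in toy_tensors)
    lower_bound = max(
        sum(t["size"] for t in toy_tensors if t["start"] <= idx < t["end"])
        for idx in range(last_idx)
    )
    print("activation peak lower bound:", lower_bound, "bytes")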

View File

@ -0,0 +1,167 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: InputResizer.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import math
def _find_previous_info(layers, idx):
for layer in layers:
info = layer.get_layer_info()
if info["output_idx"] == idx:
return info
class InputResizer:
def __init__(self, layer):
self.layer = layer
def inputResize(self, input_h, input_w):
for i, layer in enumerate(self.layer):
layer_info = layer.get_layer_info()
previous_layer_info = _find_previous_info(self.layer, layer_info["input_idx"])
# we need to handle different op
op_code_str = layer_info["op"]
if i == 0:
layer_info["input_h"] = input_h
layer_info["input_w"] = input_w
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
else:
if op_code_str == "SE_AVG_POOL_2D":
SEinput_h = previous_layer_info["output_h"]
SEinput_w = previous_layer_info["output_w"]
layer_info["input_h"] = SEinput_h
layer_info["input_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
layer_info["sample_h"] = SEinput_h
layer_info["sample_w"] = SEinput_w
else:
layer_info["input_h"] = previous_layer_info["output_h"]
layer_info["input_w"] = previous_layer_info["output_w"]
layer_info["input_c"] = previous_layer_info["output_c"]
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
if op_code_str == "AVERAGE_POOL_2D":
layer_info["filter_h"] = layer_info["input_h"]
layer_info["filter_w"] = layer_info["input_w"]
layer_info["filter_c"] = layer_info["input_c"]
# handle nodes for dag op
# find the previous node
if "dagop_input0_key" in layer_info:
for op in self.layer:
l_into = op.get_layer_info()
if (
"dagop_output_key" in l_into
and l_into["dagop_output_key"] == layer_info["dagop_input0_key"]
):
layer_info["input_h"] = l_into["output_h"]
layer_info["input_w"] = l_into["output_w"]
layer_info["input_c"] = l_into["output_c"]
if "dagop_input1_key" in layer_info:
for op in self.layer:
l_into = op.get_layer_info()
if (
"dagop_output_key" in l_into
and l_into["dagop_output_key"] == layer_info["dagop_input1_key"]
):
layer_info["input_h"] = l_into["output_h"]
layer_info["input_w"] = l_into["output_w"]
layer_info["input_c"] = l_into["output_c"]
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
layer_info["output_h"] = math.ceil(layer_info["input_h"] / layer_info["stride_h"])
layer_info["output_w"] = math.ceil(layer_info["input_w"] / layer_info["stride_w"])
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "ADD":
layer_info["output_h"] = layer_info["input_h"]
layer_info["output_w"] = layer_info["input_w"]
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
layer_info["input2_h"] = layer_info["input_h"]
layer_info["input2_w"] = layer_info["input_w"]
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
elif op_code_str == "SE_ELEMENT_MULT_2D":
layer_info["input2_h"] = SEinput_h
layer_info["input2_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
layer_info["output_h"] = SEinput_h
layer_info["output_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "UPSAMPLE":
layer_info["output_h"] = layer_info["input_h"] * layer_info["factor"]
layer_info["output_w"] = layer_info["input_w"] * layer_info["factor"]
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "MAX_POOL_2D":
layer_info["output_h"] = int(layer_info["input_h"] / layer_info["filter_h"])
layer_info["output_w"] = int(layer_info["input_w"] / layer_info["filter_h"])
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
def _changeOPTensorSize(layer, tensor_type: str, tensor_idx: int, input_h: int, input_w: int):
if tensor_type == "input":
if hasattr(layer, "input_tensors") and len(layer.input_tensors) > tensor_idx:
layer.input_tensors[tensor_idx].set_input_w(input_w)
layer.input_tensors[tensor_idx].set_input_h(input_h)
elif tensor_type == "output":
if hasattr(layer, "output_tensors"):
layer.output_tensors[tensor_idx].set_input_w(input_w)
layer.output_tensors[tensor_idx].set_input_h(input_h)
class PatchResizer:
def __init__(self, layer):
self.layer = layer
# manually setting these variables for now
def patchResize(self, PatchLayers, PatchSize, PatchSize_height):
for i, layer in enumerate(self.layer):
layer_info = layer.get_layer_info()
if i < PatchLayers:
layer_info["is_patch"] = True
op_code_str = layer_info["op"]
if i == 0:
layer_info["input_h"] = PatchSize_height
layer_info["input_w"] = PatchSize
_changeOPTensorSize(self.layer[i], "input", 0, PatchSize_height, PatchSize)
else:
prev_layer_info = self.layer[i - 1].get_layer_info()
layer_info["input_h"] = prev_layer_info["output_h"]
layer_info["input_w"] = prev_layer_info["output_w"]
_changeOPTensorSize(
self.layer[i], "input", 0, prev_layer_info["output_h"], prev_layer_info["output_w"]
)
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
layer_info["output_h"] = math.ceil(
(layer_info["input_h"] - layer_info["kernel_h"] + 1) / layer_info["stride_h"]
)
layer_info["output_w"] = math.ceil(
(layer_info["input_w"] - layer_info["kernel_w"] + 1) / layer_info["stride_w"]
)
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "ADD":
layer_info["output_h"] = layer_info["input_h"]
layer_info["output_w"] = layer_info["input_w"]
layer_info["input2_h"] = layer_info["input_h"]
layer_info["input2_w"] = layer_info["input_w"]
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input_h"], layer_info["input_w"])
else:
layer_info["is_patch"] = False

View File

@ -0,0 +1,118 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: OpGenerator.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
from .codetemplate.depthwiseTemplate import depthwiseInplace
class OpGenerator:
def __init__(self, incpath, srcpath, layers, fp_requantize=False):
self.incpath = incpath
self.srcpath = srcpath
self.layers = layers
self.fp_requantize = fp_requantize
def genOpcode(self):
# find all conv ops
op_list = []
for op in self.layers:
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D":
op = convOp(layer_info)
if op not in op_list:
op_list.append(op)
# go through and generate all ops
incfile = includeFile(self.incpath)
for op in op_list:
if op.isDepthwise:
if op.kernel_h > op.kernel_w:
depthwise_template = depthwiseInplace(
op.kernel_h,
op.kernel_w,
op.pad_h,
op.pad_w,
op.stride,
"CWH",
self.fp_requantize,
)
else:
depthwise_template = depthwiseInplace(
op.kernel_h,
op.kernel_w,
op.pad_h,
op.pad_w,
op.stride,
"CHW",
self.fp_requantize,
)
depthwise_template.genFile(self.srcpath)
incfile.addDefine(depthwise_template.genFuncDefine())
incfile.writeFile()
class convOp:
def __init__(self, layer_info):
if layer_info["op"] == "CONV_2D":
isDepthwise = False
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
isDepthwise = True
kernel_h = layer_info["kernel_h"]
kernel_w = layer_info["kernel_w"]
pad_h = (kernel_h - 1) // 2
pad_w = (kernel_w - 1) // 2
stride = layer_info["stride_h"]
self.inchannel = layer_info["input_c"]
self.isDepthwise = isDepthwise
self.kernel_h = kernel_h
self.kernel_w = kernel_w
self.stride = stride
self.pad_h = pad_h
self.pad_w = pad_w
def __eq__(self, other):
if isinstance(other, convOp):
if (
self.isDepthwise == other.isDepthwise
and self.kernel_h == other.kernel_h
and self.kernel_w == other.kernel_w
and self.stride == other.stride
and self.pad_h == other.pad_h
and self.pad_w == other.pad_w
):
return True
else:
return False
return NotImplemented
class includeFile:
def __init__(self, path):
self.path = path
self.defstring = ""
def addDefine(self, defstr):
self.defstring += defstr + ";\n"
def writeFile(self):
import os
outpath = os.path.join(self.path, "genInclude.h")
outf = open(outpath, "w")
outf.write(self.defstring)
outf.close()
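

# Worked example of the deduplication in genOpcode() above (hypothetical layer
# dicts, for illustration only): two depthwise layers that differ only in channel
# count compare equal, so a single specialized kernel is generated for both.
#
#   a = convOp({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3,
#               "stride_h": 2, "input_c": 16})
#   b = convOp({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3,
#               "stride_h": 2, "input_c": 96})
#   assert a == b  # __eq__ ignores the input channel count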

View File

@ -0,0 +1,85 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: PatchBasedUtil.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
def getPatchParams(layers, split_idx, n_patch):
patch_params = {}
feat_stride = 8
patch_params["n_patch"] = n_patch
patch_params["layer_cnt"] = split_idx
resolution = max(layers[0].get_layer_info()["input_h"], layers[0].get_layer_info()["input_w"])
layer_cnt = layers[patch_params["layer_cnt"]].get_layer_info()
out_shape = max(layer_cnt["input_h"], layer_cnt["input_w"])
feat_stride = resolution // out_shape
grain_size = out_shape // n_patch
patch_params["single_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], 1)
patch_params["output_c"] = layer_cnt["input_c"]
patch_params["output_h"] = layer_cnt["output_h"]
patch_params["output_w"] = layer_cnt["output_w"]
patch_params["grain_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], grain_size)
patch_params["grain_rf_height"] = compute_receptive_field(
layers, patch_params["layer_cnt"], layer_cnt["input_h"] // n_patch
)
print("receptive field: single {} all {}".format(patch_params["single_rf"], patch_params["grain_rf"]))
# now generate the padding for each layer (two side)
patch_params["pad_l"] = patch_params["single_rf"] // 2
patch_params["pad_r"] = max(
0,
patch_params["grain_rf"]
+ feat_stride * grain_size * (n_patch - 1)
- patch_params["single_rf"] // 2
- resolution,
)
return patch_params
def get_recompute_layer(model, split_idx):
layer_cnt = 1 # first conv
for i in range(split_idx):
block = model["blocks"][i]
if "pointwise1" in block and block["pointwise1"] is not None:
layer_cnt += 1
if "depthwise" in block and block["depthwise"] is not None:
layer_cnt += 1
if "pointwise2" in block and block["pointwise2"] is not None:
layer_cnt += 1
return layer_cnt
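
# Note on get_recompute_layer(): it counts the first conv plus every
# pointwise/depthwise conv of the blocks before split_idx, i.e. the layers that
# are executed once per patch in the patch-based stage.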
def compute_receptive_field(layers, layer_cnt, grain=1):
for i in range(layer_cnt):
op = layers[(layer_cnt - 1) - i] # trace in a backward manner
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D": # receptive field will increase
stride = layer_info["stride_h"]
kernel_size = max(layer_info["kernel_h"], layer_info["kernel_w"])
if stride in [1, 2]:
if stride == 1:
grain += kernel_size - 1
else:
grain = (grain - 1) * 2 + kernel_size
else:
pass
return grain
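

if __name__ == "__main__":
    # Minimal sketch (hypothetical layers, for illustration only): trace the
    # receptive field of a single output pixel back through k3/s2 -> k3/s1 -> k3/s2.
    class _DemoLayer:
        def __init__(self, info):
            self._info = info

        def get_layer_info(self):
            return self._info

    demo_layers = [
        _DemoLayer({"op": "CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 2}),
        _DemoLayer({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 1}),
        _DemoLayer({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 2}),
    ]
    # Backward trace: (1-1)*2+3 = 3, then 3+2 = 5, then (5-1)*2+3 = 11
    rf = compute_receptive_field(demo_layers, len(demo_layers), grain=1)
    print("receptive field:", rf)  # -> 11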

View File

@ -0,0 +1,920 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: TfliteConvertor.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import math
import numpy as np
from .constant import SKIP_OPs
from .operators import add, avgpool2d, conv2d, depthwiseConv2d, maxpool2d, upsample
from .tflite import Model
from .tflite.BuiltinOperator import BuiltinOperator
from .tflite.BuiltinOptions import BuiltinOptions
from .tflite.Conv2DOptions import Conv2DOptions
from .tflite.DepthwiseConv2DOptions import DepthwiseConv2DOptions
from .tflite.Padding import Padding
from .tflite.Pool2DOptions import Pool2DOptions
from .tflite.TensorType import TensorType
# Parse tflite model into TinyEngine IR format
class TfliteConvertor(object):
def __init__(self, filepath):
# path to the tflite file
self.filepath = filepath
self.model = self.loadTFmodel(filepath)
self.subgraph = self.model.Subgraphs(0)
self.builtin_op_code = self._build_str_map(BuiltinOperator())
self.layer = []
self.tmpPADIndice = None
self.skip_transpose = None
self.average_1D_to_2D_holder = MEAN2D()
# public functions
    def loadTFmodel(self, filepath):
        with open(filepath, "rb") as f:
            buf = f.read()
        return Model.Model.GetRootAsModel(buf, 0)
def dumpModelInfo(self):
version = self.model.Version()
print("Model version:", version)
description = self.model.Description().decode("utf-8")
print("Description:", description)
subgraph_len = self.model.SubgraphsLength()
print("Subgraph length:", subgraph_len)
self.dumpLayerInfo()
def dumpLayerInfo(self):
print("Layer length:", len(self.layer))
# print brief info about each layer
for i, layer in enumerate(self.layer):
if self.layer[i]["op"] == "ADD":
print(
"op:",
layer["op"],
",input_idx:",
layer["input_idx"],
",input2_idx:",
layer["input2_idx"],
"output_idx:",
layer["output_idx"],
)
else:
print(
"op:",
layer["op"],
",input_idx:",
layer["input_idx"],
"output_idx:",
layer["output_idx"],
)
def parseOperatorInfo(self):
operators_len = self.subgraph.OperatorsLength()
for i in range(operators_len):
op = self.subgraph.Operators(i)
# parse the op
self._handleOperator(op)
# private functions
def _build_str_map(self, obj):
ret = {}
for field_name in dir(obj):
if not field_name.startswith("_"):
field_value = getattr(obj, field_name)
if isinstance(field_value, int):
ret[field_value] = field_name
return ret
def _getOpCodeStr(self, op):
op_code_list_idx = op.OpcodeIndex()
op_code_id = self.model.OperatorCodes(op_code_list_idx).DeprecatedBuiltinCode()
return self.builtin_op_code[op_code_id]
def _getTensorTypeStr(self, type):
if TensorType.INT8 == type:
return "int8"
if TensorType.UINT8 == type:
return "uint8"
if TensorType.FLOAT32 == type:
return "float32"
def _getMultiplierShift(self, effective_scale):
significand = np.zeros(len(effective_scale), dtype="int32")
shift = np.zeros(len(effective_scale), dtype="int32")
for i, s in enumerate(effective_scale):
if s == 0:
significand[i] = 0
shift[i] = 0
else:
sig, shi = math.frexp(s)
sig = int(round(sig * 2**31))
if sig == 2**31:
sig /= 2
shi += 1
if shi < -31:
shi = 0
sig = 0
significand[i] = sig
shift[i] = shi
return significand, shift
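    # Worked example for _getMultiplierShift() above: an effective_scale of
    # 0.0008 gives frexp(0.0008) = (0.8192, -10), so significand =
    # round(0.8192 * 2**31) = 1759218604 and shift = -10; the requantization
    # then rescales the accumulator by significand * 2**(shift - 31) ~= 0.0008.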
def _getSigShift(self, s):
sig, shi = math.frexp(s)
sig = int(round(sig * 2**31))
if sig == 2**31:
sig /= 2
shi += 1
if shi < -31:
shi = 0
sig = 0
return sig, shi
def _getADDMultiplierShift(self, input_scale, input2_scale, output_scale):
left_shift = 20
twice_max_input_scale = 2 * np.double(max(input_scale, input2_scale))
real_input1_multiplier = np.double(input_scale / twice_max_input_scale)
real_input2_multiplier = np.double(input2_scale / twice_max_input_scale)
real_output_multiplier = np.double(twice_max_input_scale / ((1 << left_shift) * output_scale))
input_multiplier, input_shift = self._getSigShift(real_input1_multiplier)
input2_multiplier, input2_shift = self._getSigShift(real_input2_multiplier)
output_multiplier, output_shift = self._getSigShift(real_output_multiplier)
return (
left_shift,
input_multiplier,
input_shift,
input2_multiplier,
input2_shift,
output_multiplier,
output_shift,
)
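    # Note on _getADDMultiplierShift() above: the fixed left_shift of 20 follows
    # the TFLite integer ADD reference, which scales both inputs up by 2**20
    # before applying the per-input multipliers so precision is preserved when
    # the two input scales differ; the output multiplier divides that factor
    # back out via the (1 << left_shift) term.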
def _preprocessSoftmaxScaling(self, beta, input_scale, input_integer_bits):
input_beta_real_multiplier = min(beta * input_scale * (1 << (31 - input_integer_bits)), (1 << 31) - 1.0)
multiplier, shift = self._getSigShift(input_beta_real_multiplier)
return multiplier, shift
# follow TFlite implementation
def _calculateInputRadius(self, input_integer_bits, input_left_shift, total_signed_bits=31):
max_input_rescaled = (
1.0
* ((1 << input_integer_bits) - 1)
* (1 << (total_signed_bits - input_integer_bits))
/ (1 << input_left_shift)
)
return math.floor(max_input_rescaled)
    # TFLite conversion functions
def _convert_convolution(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count >= 2, "input tensors length should be >= 2"
input_tensor = input_tensors[0]
weight_tensor = input_tensors[1]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# conv_2d options
if op_code_str == "CONV_2D":
assert op.BuiltinOptionsType() == BuiltinOptions.Conv2DOptions
op_options = op.BuiltinOptions()
conv_options = Conv2DOptions()
conv_options.Init(op_options.Bytes, op_options.Pos)
if op_code_str == "DEPTHWISE_CONV_2D":
assert op.BuiltinOptionsType() == BuiltinOptions.DepthwiseConv2DOptions
op_options = op.BuiltinOptions()
conv_options = DepthwiseConv2DOptions()
conv_options.Init(op_options.Bytes, op_options.Pos)
# conv parameters
stride_h = conv_options.StrideH()
stride_w = conv_options.StrideW()
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
if op_code_str == "CONV_2D":
output_c, kernel_h, kernel_w, _ = weight_tensor.tensor.ShapeAsNumpy()
elif op_code_str == "DEPTHWISE_CONV_2D":
_, kernel_h, kernel_w, output_c = weight_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
assert output_c_dual == output_c, "output channels not match"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
weight_type = self._getTensorTypeStr(weight_tensor.tensor.Type())
assert input_type == output_type == weight_type, "tensor type not consistent"
# tensor value: weight, scalers
weight_value = self._get_np_from_wrapper(weight_tensor)
if input_tensor_count == 3:
bias_tensor = input_tensors[2]
# bias = self._get_np_from_wrapper(bias_tensor).astype('int') # forcely casting for testing latency
bias = self._get_np_from_wrapper(bias_tensor)
else:
bias = None
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
weight_scale = weight_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)
# quantized inference, used for requantize
multiplier, shift = self._getMultiplierShift(effective_scale)
        # find the previous layer, redirect the input index, and fuse PAD into this conv
if self.tmpPADIndice is not None:
if self.tmpPADIndice.output_idx == input_tensor.tensor_idx:
input_idx = self.tmpPADIndice.input_idx
input_h = input_h - math.floor(kernel_h / 2) * 2
                input_w = input_w - math.floor(kernel_w / 2) * 2
else:
input_idx = input_tensor.tensor_idx
else:
input_idx = input_tensor.tensor_idx
# clean the buffer
self.tmpPADIndice = None
params = {
# operator
"op": op_code_str,
# conv
"kernel_h": kernel_h,
"kernel_w": kernel_w,
"padding": math.floor(kernel_h / 2),
"stride_h": stride_h,
"stride_w": stride_w,
# tensor
"input_idx": input_idx,
"output_idx": output_tensor.tensor_idx,
"input_dim": 3,
"output_dim": 3,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"weight_value": weight_value,
"bias": bias,
"effective_scale": effective_scale,
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"weight_scale": weight_scale,
"output_scale": output_scale,
            # quantized inference
"multiplier": multiplier,
"shift": shift,
}
if op_code_str == "CONV_2D":
op = conv2d.Conv2d(params)
elif op_code_str == "DEPTHWISE_CONV_2D":
op = depthwiseConv2d.DepthwiseConv2d(params)
return op
def _convert_ADD(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 2, "input should be 2 tensors"
input_tensor = input_tensors[0]
input2_tensor = input_tensors[1]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, input2_h, input2_w, input2_c = input2_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
        assert input_h == input2_h == output_h, "tensor shape not consistent"
        assert input_w == input2_w == output_w, "tensor shape not consistent"
        assert input_c == input2_c == output_c, "tensor shape not consistent"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
input_type2 = self._getTensorTypeStr(input2_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == input_type2 == output_type, "tensor type not consistent"
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
input2_zero_point = input2_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
input2_scale = input2_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
# get multipliers and shifts
(
left_shift,
input_multiplier,
input_shift,
input2_multiplier,
input2_shift,
output_multiplier,
output_shift,
) = self._getADDMultiplierShift(input_scale, input2_scale, output_scale)
# assign params
params = {
# operator
"op": op_code_str,
# tensor
"input_idx": input_tensor.tensor_idx,
"input2_idx": input2_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input2_h": input_h,
"input2_w": input_w,
"input2_c": input_c,
"input_dim": 3,
"input2_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"input2_zero_point": input2_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"input2_scale": input2_scale,
"output_scale": output_scale,
            # quantized inference
"left_shift": left_shift,
"input_multiplier": input_multiplier,
"input2_multiplier": input2_multiplier,
"input_shift": input_shift,
"input2_shift": input2_shift,
"output_multiplier": output_multiplier,
"output_shift": output_shift,
}
op = add.Add(params)
return op
def _convert_AVERAGE_POOL_2D(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == output_type, "tensor type not consistent"
# pool parameters
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
op_options = op.BuiltinOptions()
pool2d_options = Pool2DOptions()
pool2d_options.Init(op_options.Bytes, op_options.Pos)
stride_h = pool2d_options.StrideH()
stride_w = pool2d_options.StrideW()
padding = pool2d_options.Padding()
filter_h = pool2d_options.FilterHeight()
filter_w = pool2d_options.FilterWidth()
# padding
if padding == Padding.VALID:
pad_h = 0
pad_w = 0
elif padding == Padding.SAME:
pass # no support for now
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
params = {
# operator
"op": op_code_str,
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": stride_h,
"stride_w": stride_w,
"pad_h": pad_h,
"pad_w": pad_w,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": input_tensor.tensor.ShapeAsNumpy().size,
"output_dim": output_tensor.tensor.ShapeAsNumpy().size,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
}
op = avgpool2d.AvgPool2d(params)
return op
def _convert_upsample(self, op):
        # Defaults in case the op carries no quantization params
input_type = None
input_zero_point = None
output_zero_point = None
input_scale = None
output_scale = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
params = {
# operator
"op": "UPSAMPLE",
# upsample parameters
"factor": output_w / input_w,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtype": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
}
op = upsample.upSample(params)
return op
def _convert_PAD(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# fuse pad into conv
self.tmpPADIndice = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
def _convert_TRANSPOSE(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# fuse pad into conv
self.skip_transpose = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
def _convert_maxpool(self, op):
        # Defaults in case the op carries no quantization params
input_type = None
input_zero_point = None
output_zero_point = None
input_scale = None
output_scale = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
# pool parameters
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
op_options = op.BuiltinOptions()
pool2d_options = Pool2DOptions()
pool2d_options.Init(op_options.Bytes, op_options.Pos)
stride_h = pool2d_options.StrideH()
stride_w = pool2d_options.StrideW()
# padding = pool2d_options.Padding()
filter_h = pool2d_options.FilterHeight()
filter_w = pool2d_options.FilterWidth()
# fused_activation_fn = pool2d_options.FusedActivationFunction()
pool_params = {
# operator
"op": "MAX_POOL_2D",
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": stride_h,
"stride_w": stride_w,
"pad_h": 0,
"pad_w": 0,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtype": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
}
op = maxpool2d.maxPool2d(pool_params)
return op
def _convert_mean1D(self, op, MEAN2Dholder):
        # Defaults in case the op carries no quantization params
input_type = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
input_shape = input_tensor.tensor.ShapeAsNumpy()
output_shape = output_tensor.tensor.ShapeAsNumpy()
input_h, input_w, input_c = get_hwc_from_chwshape(input_shape)
output_h, output_w, output_c = get_hwc_from_chwshape(output_shape)
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
if not MEAN2Dholder.has_first_1D:
MEAN2Dholder.add_first_1D_op(input_tensor.tensor_idx, output_tensor.tensor_idx, input_h, input_w, input_c)
return None
elif not MEAN2Dholder.has_second_1D:
MEAN2Dholder.add_second_1D_op(
input_tensor.tensor_idx, output_tensor.tensor_idx, output_h, output_w, output_c
)
filter_h = input_h - output_h + 1
filter_w = input_w - output_w + 1
params = {
# operator
"op": "AVERAGE_POOL_2D",
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": 1,
"stride_w": 1,
"pad_h": 0,
"pad_w": 0,
# tensor
"input_idx": MEAN2Dholder.first_1D_input_idx,
"output_idx": MEAN2Dholder.second_1D_output_idx,
"input_h": MEAN2Dholder.input_h,
"input_w": MEAN2Dholder.input_w,
"input_c": MEAN2Dholder.input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": MEAN2Dholder.output_h,
"output_w": MEAN2Dholder.output_w,
"output_c": MEAN2Dholder.output_c,
"dtypte": input_type,
}
op = avgpool2d.AvgPool2d(params)
return op
else:
raise NotImplementedError
def _convert_FULLY_CONNECTED(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 3, "input tensors length should be 3"
input_tensor = input_tensors[0]
weight_tensor = input_tensors[1]
bias_tensor = input_tensors[2]
weight = self._get_np_from_wrapper(weight_tensor)
bias = self._get_np_from_wrapper(bias_tensor)
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
if input_tensor.tensor.ShapeAsNumpy().shape[0] == 2:
input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
input_h = 1
elif input_tensor.tensor.ShapeAsNumpy().shape[0] == 4:
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
output_c, input_c_dual = weight_tensor.tensor.ShapeAsNumpy()
output_h, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
assert input_c_dual == input_c, "channels not match"
assert output_c_dual == output_c, "channels not match"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == output_type, "tensor type not consistent"
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
weight_scale = weight_tensor.qnn_params["scale"]
bias_scale = bias_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
        # We support per-channel quantization in the CONV_2D operator, so broadcast per-tensor scales to arrays
if isinstance(bias_scale, float) and isinstance(weight_scale, float):
np_ones = np.ones(output_c)
bias_scale = np_ones * bias_scale
np_ones = np.ones(output_c)
output_scale = np_ones * output_scale
effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)
# follows tensorflow lite micro
multiplier, shift = self._getMultiplierShift(effective_scale)
params = {
# operator
"op": "CONV_2D",
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 2,
"output_h": output_h,
"output_w": 1,
"output_c": output_c,
"dtypte": input_type,
"kernel_h": 1,
"kernel_w": 1,
# trainable parameters
"weight_value": weight,
"bias": bias,
"effective_scale": effective_scale,
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
"multiplier": multiplier,
"shift": shift,
}
op = conv2d.Conv2d(params)
return op
# handle one op and parse it into layers[] for supported operators
def _handleOperator(self, op):
op_code_str = self._getOpCodeStr(op)
if op_code_str == "CONV_2D":
self.layer.append(self._convert_convolution(op))
elif op_code_str == "ADD":
self.layer.append(self._convert_ADD(op))
elif op_code_str == "AVERAGE_POOL_2D":
self.layer.append(self._convert_AVERAGE_POOL_2D(op))
elif op_code_str == "DEPTHWISE_CONV_2D":
self.layer.append(self._convert_convolution(op))
elif op_code_str == "PAD":
self._convert_PAD(op)
elif op_code_str == "RESIZE_NEAREST_NEIGHBOR":
self.layer.append(self._convert_upsample(op))
elif op_code_str == "MAX_POOL_2D":
self.layer.append(self._convert_maxpool(op))
elif op_code_str in "MEAN":
ret_op = self._convert_mean1D(op, self.average_1D_to_2D_holder)
if ret_op is not None:
                # TODO: this only handles a specific graph: TRANSPOSE -> MEAN -> MEAN
if self.skip_transpose is not None:
ret_op.params["input_idx"] = self.skip_transpose.input_idx
ret_op.input_tensors[0].graph_idx = self.skip_transpose.input_idx
self.layer.append(ret_op)
elif op_code_str == "TRANSPOSE":
self._convert_TRANSPOSE(op)
elif op_code_str in "FULLY_CONNECTED":
self.layer.append(self._convert_FULLY_CONNECTED(op))
elif op_code_str in SKIP_OPs:
pass
else:
raise NotImplementedError(f"Unsupported {op_code_str}")
def _get_np_from_wrapper(self, wrapper):
if wrapper.tensor.Type() == TensorType.INT8:
dtype = np.int8
elif wrapper.tensor.Type() == TensorType.INT32:
dtype = np.int32
else:
raise NotImplementedError("Current implementation only supports int8 and int32")
data = wrapper.buffer.DataAsNumpy()
shape = wrapper.tensor.ShapeAsNumpy() if wrapper.tensor.ShapeLength() != 0 else []
return np.frombuffer(data, dtype=dtype).reshape(shape)
def _get_tensor_type_str(self, tensor_type):
if tensor_type == TensorType.INT8:
return "int8"
raise NotImplementedError(f"Tensor type: {tensor_type} is not supported yet.")
def _get_input_tensors(self, op):
return self._get_wrapper_tensors(op.InputsAsNumpy())
def _get_output_tensors(self, op):
return self._get_wrapper_tensors(op.OutputsAsNumpy())
def _get_wrapper_tensors(self, tensor_index_list):
ret = []
for idx in tensor_index_list:
tensor = self.subgraph.Tensors(idx)
buffer_idx = tensor.Buffer()
buffer = self.model.Buffers(buffer_idx)
tflite_qparams = tensor.Quantization()
if tflite_qparams is None:
continue
assert tflite_qparams, "Quantization parameters not found in the model!"
scale = tflite_qparams.ScaleAsNumpy()
zero_point = tflite_qparams.ZeroPointAsNumpy()
qparams_to_tensor_wrapper = None
if isinstance(zero_point, np.ndarray):
# Per-channel quantization
if scale.size != 1 and zero_point.size != 1:
qparams_to_tensor_wrapper = {"scale": scale, "zero_point": zero_point}
# Per-tensor quantization
elif scale.size == 1 and zero_point.size == 1:
qparams_to_tensor_wrapper = {"scale": float(scale[0]), "zero_point": int(zero_point[0])}
else:
raise NotImplementedError
elif scale == zero_point == 0:
pass
ret.append(TFLiteTensorWrpper(idx, tensor, buffer, qparams_to_tensor_wrapper))
return ret
class PAD_tensorIndice(object):
def __init__(self, input_idx, output_idx):
self.input_idx = input_idx
self.output_idx = output_idx
class MEAN2D(object):
def __init__(self):
self.has_first_1D = False
self.has_second_1D = False
def add_first_1D_op(self, input_idx, output_idx, input_h, input_w, input_c):
self.first_1D_input_idx = input_idx
self.first_1D_output_idx = output_idx
self.input_h = input_h
self.input_w = input_w
self.input_c = input_c
self.has_first_1D = True
def add_second_1D_op(self, input_idx, output_idx, output_h, output_w, output_c):
self.second_1D_input_idx = input_idx
self.second_1D_output_idx = output_idx
self.output_h = output_h
self.output_w = output_w
self.output_c = output_c
self.has_second_1D = True
class TFLiteTensorWrpper:
def __init__(self, tensor_idx, tensor, buffer, qnn_params):
self.tensor_idx = tensor_idx
self.tensor = tensor
self.buffer = buffer
self.qnn_params = qnn_params
def get_hwc_from_chwshape(shape):
h = 1
w = 1
c = 1
if len(shape) == 4:
c = shape[1]
h = shape[2]
w = shape[3]
elif len(shape) == 3:
c = shape[1]
h = shape[2]
elif len(shape) == 2:
c = shape[1]
return h, w, c
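

if __name__ == "__main__":
    # Minimal usage sketch (hypothetical .tflite path, for illustration only):
    # parse a quantized TFLite model into the TinyEngine IR layer list.
    convertor = TfliteConvertor("model.tflite")  # assumed path to an int8 model
    convertor.parseOperatorInfo()
    print("Parsed layers:", len(convertor.layer))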

View File

View File

@ -0,0 +1 @@
__all__ = ["base_allocator", "firstFit"]
