initial commit

RaymondWang0 2022-08-26 17:42:09 +00:00
commit c71768bb55
823 changed files with 276191 additions and 0 deletions

5
.clang-format Normal file

@@ -0,0 +1,5 @@
BasedOnStyle: Google
ColumnLimit: 120
ContinuationIndentWidth: 4
IndentWidth: 4
TabWidth: 4

4
.gitignore vendored Normal file

@@ -0,0 +1,4 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

6
.gitmodules vendored Normal file

@@ -0,0 +1,6 @@
[submodule "mcunet"]
path = mcunet
url = https://github.com/mit-han-lab/mcunet.git
[submodule "TinyEngine/third_party/CMSIS"]
path = TinyEngine/third_party/CMSIS
url = https://github.com/ARM-software/CMSIS_5.git

51
.pre-commit-config.yaml Normal file

@@ -0,0 +1,51 @@
exclude: "code_generator/tflite/.*"
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: trailing-whitespace
      - id: mixed-line-ending
        args: ["--fix=lf"]
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: requirements-txt-fixer
      - id: fix-encoding-pragma
        args: ["--remove"]
      - id: debug-statements
      - id: check-toml
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.10
    hooks:
      - id: mdformat
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
        args: ["--sp", "pyproject.toml"]
  - repo: https://github.com/pycqa/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
        additional_dependencies:
          - flake8-comprehensions==3.7.0
          - flake8-docstrings==1.6.0
  - repo: local
    hooks:
      - id: pylint
        name: pylint
        entry: pylint
        language: system
        types: [python]
        require_serial: true
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.910-1
    hooks:
      - id: mypy
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v13.0.0
    hooks:
      - id: clang-format

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2022 MIT HAN Lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

224
README.md Normal file

@@ -0,0 +1,224 @@
# TinyEngine
This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for microcontrollers.
TinyEngine is part of MCUNet, which also includes TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers; TinyEngine and TinyNAS are co-designed to fit the tight memory budgets.
**The MCUNet and TinyNAS repo is [here](https://github.com/mit-han-lab/mcunet).**
### [MCUNetV1](https://mcunet.mit.edu/#mcunetv1) | [MCUNetV2](https://mcunet.mit.edu/#mcunetv2) | [MCUNetV3](https://mcunet.mit.edu/#mcunetv3)
### [Demo (Inference)](https://www.youtube.com/watch?v=YvioBgtec4U)
![demo](assets/figures/mcunet_demo.gif)
### [Demo (Training)](https://www.youtube.com/watch?v=XaDCO8YtmBw)
![demo_v3](assets/figures/mcunetV3_demo_2images.gif)
## News
We will soon release the **Tiny Training Engine** used in [MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3). **If you are interested in getting updates, please sign up [here](https://forms.gle/UW1uUmnfk1k6UJPPA) to get notified!**
- **(2022/08)** Our **New Course on TinyML and Efficient Deep Learning** will be released in September 2022: [efficientml.ai](https://efficientml.ai/).
- **(2022/08)** We include the [demo tutorial](tutorial) for deploying a visual wake word (VWW) model onto microcontrollers.
- **(2022/08)** We open-source the TinyEngine repo.
- **(2022/07)** We include the person detection model used in the video demo above in the [MCUNet repo](https://github.com/mit-han-lab/mcunet).
- **(2022/06)** We refactor the [MCUNet repo](https://github.com/mit-han-lab/mcunet) as a standalone repo (previous repo: https://github.com/mit-han-lab/tinyml)
- **(2021/10)** **MCUNetV2** is accepted to NeurIPS 2021: https://arxiv.org/abs/2110.15352 !
- **(2020/10)** **MCUNet** is accepted to NeurIPS 2020 as **spotlight**: https://arxiv.org/abs/2007.10319 !
- Our projects are covered by: [MIT News](https://news.mit.edu/2020/iot-deep-learning-1113), [MIT News (v2)](https://news.mit.edu/2021/tiny-machine-learning-design-alleviates-bottleneck-memory-usage-iot-devices-1208), [WIRED](https://www.wired.com/story/ai-algorithms-slimming-fit-fridge/), [Morning Brew](https://www.morningbrew.com/emerging-tech/stories/2020/12/07/researchers-figured-fit-ai-ever-onto-internet-things-microchips), [Stacey on IoT](https://staceyoniot.com/researchers-take-a-3-pronged-approach-to-edge-ai/), [Analytics Insight](https://www.analyticsinsight.net/amalgamating-ml-and-iot-in-smart-home-devices/), [Techable](https://techable.jp/archives/142462), etc.
## Overview
Microcontrollers are low-cost, low-power hardware. They are widely deployed across a broad range of applications, but their tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.
MCUNet is a **system-algorithm co-design** framework for tiny deep learning on microcontrollers. It consists of **TinyNAS** and **TinyEngine**. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.
![overview](assets/figures/overview.png)
Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts memory scheduling to the overall network topology rather than optimizing layer by layer, reducing memory usage and accelerating inference. It outperforms existing inference libraries such as [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers) from Google, [CMSIS-NN](https://arxiv.org/abs/1801.06601) from Arm, and [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) from STMicroelectronics.
TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.
* [**In-place depth-wise convolution**](https://mcunet.mit.edu/#mcunetv1): A unique data placement technique for depth-wise convolution that overwrites input data by intermediate/output data to reduce peak SRAM memory.
* [**Operator fusion**](https://docs.microsoft.com/en-us/windows/ai/directml/dml-fused-activations): A method that improves performance by merging one operator into a different operator so that they are executed together without requiring a roundtrip to memory.
* [**SIMD (Single instruction, multiple data) programming**](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data): A computing method that performs the same operation on multiple data points simultaneously.
* [**HWC to CHW weight format transformation**](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html): A weight format transformation technique that increases cache hit ratio for in-place depth-wise convolution.
* [**Image to Column (Im2col) convolution**](https://iq.opengenus.org/im2col/): An implementation technique of computing convolution operation using general matrix multiplication (GEMM) operations.
* [**Loop reordering**](https://xilinx.github.io/Vitis_Accel_Examples/2019.2/html/loop_reorder.html): A loop transformation technique that attempts to optimize a program's execution speed by reordering/interchanging the sequence of loops.
* [**Loop unrolling**](https://en.wikipedia.org/wiki/Loop_unrolling): A loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as the space-time tradeoff (a minimal sketch follows this list).
* [**Loop tiling**](https://en.wikipedia.org/wiki/Loop_nest_optimization): A loop transformation technique that attempts to reduce memory access latency by partitioning a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused.
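To make the loop-level techniques above concrete, here is a minimal, self-contained C sketch (illustrative only, not TinyEngine source code): it contrasts a plain int8 dot product, the innermost building block of the Im2col + GEMM convolution above, with a version unrolled by four accumulators. The function names are made up for illustration.
```c
#include <stdint.h>
#include <stdio.h>

/* Reference: one multiply-accumulate per loop iteration. */
static int32_t dot_ref(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by 4: fewer loop-overhead instructions per multiply-accumulate,
 * and independent accumulators the compiler can schedule in parallel. */
static int32_t dot_unroll4(const int8_t *a, const int8_t *b, int n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) acc0 += (int32_t)a[i] * b[i];  /* leftover tail */
    return acc0 + acc1 + acc2 + acc3;
}

int main(void) {
    int8_t a[10] = {1, -2, 3, -4, 5, -6, 7, -8, 9, -10};
    int8_t b[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    printf("%d %d\n", (int)dot_ref(a, b, 10), (int)dot_unroll4(a, b, 10));
    return 0;
}
```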
![inplace_depthwise](assets/figures/inplace_depthwise.png)
By adopting the above-mentioned optimization techniques, TinyEngine not only enhances inference speed but also reduces peak memory, as shown in the figures below.
**MAC/s improvement breakdown:**
![mac_result](assets/figures/mac_result.png)
**Peak memory reduction:**
![peakmem_result](assets/figures/peakmem_result.png)
To sum up, our **TinyEngine** inference engine could be a useful infrastructure for MCU-based AI applications. It significantly **improves the inference speed and reduces the memory usage** compared to existing libraries like [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers), [CMSIS-NN](https://arxiv.org/abs/1801.06601), [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html), etc. It improves the inference speed by **1.1-18.6x**, and reduces the peak memory by **1.3-3.6x**.
![measured_result](assets/figures/measured_result.png)
## Code Structure
`code_generator` contains a Python library that is used to compile neural networks into low-level source code (C/C++).
`TinyEngine` contains a C/C++ library that implements operators and performs inference on microcontrollers.
`examples` contains examples of transforming TFLite models into our TinyEngine models.
`tutorial` contains the demo tutorial of deploying a visual wake word (VWW) model onto microcontrollers.
`assets` contains misc assets.
## Requirement
- Python 3.6+
- STM32CubeIDE 1.5+
## Setup for Users
First, clone this repository:
```bash
git clone --recursive https://github.com/mit-han-lab/tinyengine.git
```
(Optional) Using a virtual environment with `conda` is recommended.
```bash
conda create -n tinyengine python=3.6 pip
conda activate tinyengine
```
Install dependencies:
```bash
pip install -r requirements.txt
```
## Setup for Developers
Install pre-commit hooks to automatically format changes in your code.
```bash
pre-commit install
```
## Deployment Example
Please see [tutorial](tutorial) to learn how to deploy a visual wake word (VWW) model onto microcontrollers by using TinyEngine.
## Measured Results
- All the tflite models are from the [Model Zoo in the MCUNet repo](https://github.com/mit-han-lab/mcunet#model-zoo). Please see the MCUNet repo for instructions on building the pre-trained int8-quantized models in TF-Lite format.
- All the **latency**, **peak memory (SRAM)** and **Flash memory usage** results are profiled on STM32F746G-DISCO discovery boards.
- Note that we measure newer versions of the libraries in this repo, so the results here may differ from those in the MCUNet papers.
- Since TF-Lite Micro no longer has version numbers, we use the git commit ID to indicate the newer version.
- All the tflite models are compiled with the `-Ofast` optimization level in STM32CubeIDE.
- OOM denotes Out Of Memory.
The **latency** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 624ms | 2346ms | 269ms | 137ms | 128ms |
| mcunet-10fps-vww | 345ms | 1230ms | 143ms | 76ms | 66ms |
| mcunet-320kB-vww | OOM | OOM | OOM | 657ms | 570ms |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 149ms | 135ms |
| mcunet-10fps | OOM | OOM | OOM | 84ms | 62ms |
| mcunet-256kB | OOM | OOM | OOM | 839ms | 681ms |
| mcunet-320kB | OOM | OOM | OOM | OOM | 819ms |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 292ms |
| proxyless-320kB | OOM | OOM | OOM | 484ms | 425ms |
The **peak memory (SRAM)** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 227kB | 220kB | 248kB | 123kB | 88kB |
| mcunet-10fps-vww | 169kB | 163kB | 199kB | 98kB | 56kB |
| mcunet-320kB-vww | OOM | OOM | OOM | 259kB | 162kB |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 126kB | 90kB |
| mcunet-10fps | OOM | OOM | OOM | 76kB | 45kB |
| mcunet-256kB | OOM | OOM | OOM | 311kB | 200kB |
| mcunet-320kB | OOM | OOM | OOM | OOM | 242kB |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 284kB |
| proxyless-320kB | OOM | OOM | OOM | 312kB | 242kB |
The **Flash memory usage** results:
| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----------------------- | -------------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | | |
| mcunet-5fps-vww | 782kB | 733kB | 743kB | 534kB | 517kB |
| mcunet-10fps-vww | 691kB | 643kB | 653kB | 463kB | 447kB |
| mcunet-320kB-vww | OOM | OOM | OOM | 773kB | 742kB |
| *# mcunet models (ImageNet)* | | | | | |
| mcunet-5fps | OOM | OOM | OOM | 737kB | 720kB |
| mcunet-10fps | OOM | OOM | OOM | 856kB | 837kB |
| mcunet-256kB | OOM | OOM | OOM | 850kB | 827kB |
| mcunet-320kB | OOM | OOM | OOM | OOM | 835kB |
| *# baseline models* | | | | | |
| mbv2-320kB | OOM | OOM | OOM | OOM | 828kB |
| proxyless-320kB | OOM | OOM | OOM | 866kB | 835kB |
## Citation
If you find the project helpful, please consider citing our paper:
```
@article{
lin2020mcunet,
title={Mcunet: Tiny deep learning on iot devices},
author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
@inproceedings{
lin2021mcunetv2,
title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
year={2021}
}
@inproceedings{
lin2022ondevice,
title={On-Device Training Under 256KB Memory},
author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
booktitle={ArXiv},
year={2022}
}
```
## Related Projects
[MCUNet: Tiny Deep Learning on IoT Devices](https://mcunet.mit.edu/#mcunetv1) (NeurIPS'20)
[MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning](https://mcunet.mit.edu/#mcunetv2) (NeurIPS'21)
[MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3)


@@ -0,0 +1,236 @@
/*
* Copyright (C) 2010-2022 Arm Limited or its affiliates.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nnfunctions_modified.h
* Description: Public header file for TinyEngine.
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nnfunctions.h
*
* Target Processor: Cortex-M CPUs
* -------------------------------------------------------------------- */
/**
\mainpage CMSIS NN Software Library
*
* Introduction
* ------------
*
* This user manual describes the CMSIS NN software library,
* a collection of efficient neural network kernels developed to maximize the
* performance and minimize the memory footprint of neural networks on Cortex-M processor cores.
*
* The library is divided into a number of functions each covering a specific category:
* - Convolution Functions
* - Activation Functions
* - Fully-connected Layer Functions
* - SVDF Layer Functions
* - Pooling Functions
* - Softmax Functions
* - Basic math Functions
*
* The library has separate functions for operating on different weight and activation data
* types, including 8-bit integers (q7_t) and 16-bit integers (q15_t). The description of the
* kernels is included in the function description. The implementation details are also
* described in this paper [1].
*
* Function Classification
* --------
* The functions can be classified into two segments:
* - Legacy functions supporting ARM's internal symmetric quantization (8 bits).
* - Functions that support the TensorFlow Lite framework with symmetric quantization (8 bits).
*
* The legacy functions can be identified by their _q7 or _q15 suffix, and no new development is done on them.
* The article in [2] describes in detail how to run a network using the legacy functions.
*
* The functions supporting the TensorFlow Lite framework are identified by the _s8 suffix and can be invoked from TFL
* Micro. The functions are bit-exact to TensorFlow Lite. Refer to TensorFlow's documentation in [3] on how to run
* a TensorFlow Lite model using optimized CMSIS-NN kernels.
*
* Block Diagram
* --------
* \image html CMSIS-NN-OVERVIEW.PNG
*
* Examples
* --------
*
* The library ships with a number of examples which demonstrate how to use the library functions.
*
* Pre-processor Macros
* ------------
*
* Each library project has different pre-processor macros.
*
* - ARM_MATH_DSP:
*
* Define the macro ARM_MATH_DSP if the silicon supports DSP instructions (DSP extension). A small illustrative sketch follows this comment block.
*
* - ARM_MATH_MVEI:
*
* Define the macro ARM_MATH_MVEI if the silicon supports the M-Profile Vector Extension.
* - ARM_MATH_AUTOVECTORIZE
* Used in conjunction with ARM_MATH_MVEI to let the compiler auto-vectorize the functions that use inline
* assembly. It does not affect functions that use C or intrinsics.
* - ARM_MATH_BIG_ENDIAN:
*
* Define the macro ARM_MATH_BIG_ENDIAN to build the library for big-endian targets. This is supported only for the legacy
* functions, i.e., functions targeted at TensorFlow Lite do not support big-endianness. By default the library builds for
* little-endian targets.
*
* - ARM_NN_TRUNCATE:
*
* Define macro ARM_NN_TRUNCATE to use floor instead of round-to-the-nearest-int for the computation.
*
*
* Copyright Notice
* ------------
*
* Copyright (C) 2010-2019 Arm Limited. All rights reserved.
*
* [1] CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs https://arxiv.org/abs/1801.06601
*
* [2] Converting a Neural Network for Arm Cortex-M with CMSIS-NN
*
https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page
* [3] https://www.tensorflow.org/lite/microcontrollers/library
*
* [4] https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis
*/
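/* Illustrative sketch only (not part of the original header): how the
 * ARM_MATH_DSP macro described above typically gates a DSP-extension code
 * path, with an equivalent plain-C fallback. The helper name below is made
 * up for illustration. */
static inline int32_t example_dual_mac_q15x2(int32_t packed_a, int32_t packed_b, int32_t acc)
{
#if defined(ARM_MATH_DSP)
    /* SMLAD: two signed 16-bit multiplies, both accumulated into acc. */
    return __SMLAD(packed_a, packed_b, acc);
#else
    /* Plain C equivalent: split each word into its two signed 16-bit halves. */
    int16_t a_lo = (int16_t)(packed_a & 0xFFFF), a_hi = (int16_t)(packed_a >> 16);
    int16_t b_lo = (int16_t)(packed_b & 0xFFFF), b_hi = (int16_t)(packed_b >> 16);
    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
#endif
}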
/**
* @defgroup groupNN Neural Network Functions
* A collection of functions to perform basic operations for neural network layers. Functions with a _s8 suffix support
* TensorFlow Lite framework.
*/
#ifndef _ARM_NNFUNCTIONS_H
#define _ARM_NNFUNCTIONS_H
#include "arm_nn_math_types.h"
#include "arm_nn_types.h"
#include "arm_nnsupportfunctions.h"
#define USE_INTRINSIC
//#define ARM_NN_TRUNCATE /* This config the rounding model to floor or round to the nearest int */
#ifdef __cplusplus
extern "C" {
#endif
/**
* @defgroup NNConv Convolution Functions
*
* Collection of convolution, depthwise convolution functions and their variants.
*
* The convolution is implemented in 2 steps: im2col and GEMM
*
* im2col is a process of converting each patch of image data into
* a column. After im2col, the convolution is computed as matrix-matrix
* multiplication.
*
* To reduce the memory footprint, the im2col is performed partially.
* In each iteration, only a few columns (i.e., patches) are generated and
* computed with GEMM kernels similar to CMSIS-DSP arm_mat_mult functions.
*
*/
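/* Illustrative sketch only (not part of the original header): the im2col step
 * described above, written out for a single output position. One kernel_y x
 * kernel_x patch of an HWC int8 image is copied, with the input offset added,
 * into one column of a q15 buffer; a GEMM kernel then multiplies the weight
 * matrix against such columns. The names and the HWC indexing here are
 * assumptions for illustration. */
static void example_im2col_one_patch(const int8_t *input, int input_x, int input_y, int input_ch,
                                     int out_x, int out_y, int kernel_x, int kernel_y,
                                     int pad_x, int pad_y, int stride_x, int stride_y,
                                     int32_t input_offset, int16_t *col)
{
    for (int ky = 0; ky < kernel_y; ky++) {
        for (int kx = 0; kx < kernel_x; kx++) {
            const int in_y = out_y * stride_y - pad_y + ky;
            const int in_x = out_x * stride_x - pad_x + kx;
            for (int c = 0; c < input_ch; c++) {
                if (in_y < 0 || in_y >= input_y || in_x < 0 || in_x >= input_x)
                    *col++ = 0; /* padded pixels equal the zero point, so the offset-shifted value is 0 */
                else
                    *col++ = (int16_t)(input[(in_y * input_x + in_x) * input_ch + c] + input_offset);
            }
        }
    }
}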
arm_status arm_convolve_s8_4col(const q7_t *input,
const uint16_t input_x,
const uint16_t input_y,
const uint16_t input_ch,
const uint16_t input_batches,
const q7_t *kernel,
const uint16_t output_ch,
const uint16_t kernel_x,
const uint16_t kernel_y,
const uint16_t pad_x,
const uint16_t pad_y,
const uint16_t stride_x,
const uint16_t stride_y,
const int32_t *bias,
q7_t *output,
const int32_t *output_shift,
const int32_t *output_mult,
const int32_t out_offset,
const int32_t input_offset,
const int32_t out_activation_min,
const int32_t out_activation_max,
const uint16_t output_x,
const uint16_t output_y,
q15_t *buffer_a);
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0);
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0);
q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0,
q15_t *kbuf);
#ifdef __cplusplus
}
#endif
#endif


@@ -0,0 +1,27 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: detectionUtility.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_DETECTIONUTILITY_H_
#define TINYENGINE_INCLUDE_DETECTIONUTILITY_H_
int postProcessing(signed char *input, unsigned char* runtime_buffer,
int y_zero, float y_scale, int shape_x, int shape_y, int shape_c, int resolution,
int width, int height , float conf_thresh, float out_boxes[10][6]);
#endif /* TINYENGINE_INCLUDE_DETECTIONUTILITY_H_ */


@@ -0,0 +1,99 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: fp_requantize_op.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_
#define TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_
tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_fpreq_bitmask(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
q7_t* mat_mult_kernel_s8_s16_reordered_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch8_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch16_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch24_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
q7_t* mat_mult_kernel_s8_s16_reordered_ch48_fpreq(const q7_t *input_a,
const q15_t *input_b, const uint16_t output_ch, const float *scales,
const int32_t out_offset, const int16_t activation_min,
const int16_t activation_max, const uint16_t num_col_a,
const int32_t *const output_bias, q7_t *out_0);
#endif /* TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_ */
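/* Illustrative sketch only (not part of fp_requantize_op.h): what the
 * floating-point requantization implied by the "fpreq" kernels above boils
 * down to for a single accumulator: scale the int32 accumulator by a
 * per-channel float scale, add the output offset, and clamp to the activation
 * range. The exact rounding behaviour is an assumption for illustration. */
static inline q7_t example_fp_requantize(int32_t acc, float scale, int32_t out_offset,
                                         int32_t activation_min, int32_t activation_max)
{
    int32_t out = (int32_t)((float)acc * scale) + out_offset; /* one float multiply replaces the fixed-point multiplier/shift pair */
    if (out < activation_min) out = activation_min;           /* clamp to the quantized activation range */
    if (out > activation_max) out = activation_max;
    return (q7_t)out;
}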


@@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: genNN.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef INC_GENNN_H_
#define INC_GENNN_H_
#include <stdint.h>
signed char* getInput();
signed char* getOutput();
float* getOutput_fp();
int32_t* getOutput_int32();
void setupBuffer();
void invoke(float* labels);
void getResult(uint8_t *P, uint8_t *NP);
int* getKbuffer();
void end2endinference();
#endif /* INC_GENNN_H_ */
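/* Illustrative usage sketch only (not part of genNN.h and not generated code):
 * one way application code might drive the generated model, assuming the
 * typical pattern of filling the buffer returned by getInput() and then
 * running inference. The exact calling sequence and output interpretation are
 * assumptions, not taken from this commit. */
#include <string.h>

void example_run_inference(const signed char *image, int image_bytes)
{
    setupBuffer();                                  /* prepare the runtime/activation buffers */
    memcpy(getInput(), image, (size_t)image_bytes); /* copy the int8 input into the model's input buffer */
    end2endinference();                             /* run all layers of the generated network */
    signed char *scores = getOutput();              /* int8 output scores of the final layer */
    (void)scores;                                   /* application-specific post-processing goes here */
}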


@@ -0,0 +1,546 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: img2col_element.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
#define ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
#include "arm_nnsupportfunctions.h"
#include "arm_math_memory.h"
#define b2_q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
/*in_q7x4 = b2_nn_read_q7x4_ia((const q7_t **)&src); \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); */ \
in_q15x2_1 = ((src[0] & 0x0C) >> 2) + ((src[0] & 0xC0) << 10);\
in_q15x2_2 = (src[0] & 0x03) + ((src[0] & 0x30) << 12);\
src +=1;\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define b4_q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
/*in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src); \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); */ \
in_q15x2_1 = ((src[0] & 0xF0) >> 4) + ((src[1] & 0xF0) << 12);\
in_q15x2_2 = (src[0] & 0x0F) + ((src[1] & 0x0F) << 16);\
src +=2;\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define q7_q15_offset_ele(src,dst) \
/* convert from q7 to q15 and then store the results in the destination buffer */ \
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src); \
/* Extract and sign extend each of the four q7 values to q15 */ \
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __SXTB16(in_q7x4); \
\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define q8_q15_offset_ele(src,dst) \
/* convert from q8 to q15 and then store the results in the destination buffer */ \
in_q7x4 = arm_nn_read_q7x4_ia((const q8_t **)&src); \
/* Extend each of the four q8 values to q15 */ \
in_q15x2_1 = __UXTB16(__ROR(in_q7x4, 8)); \
in_q15x2_2 = __UXTB16(in_q7x4); \
\
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
/* Maximum of 9 bits from the addition is expected */ \
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
\
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
\
write_q15x2_ia(&dst, out_q15x2_1); \
write_q15x2_ia(&dst, out_q15x2_2);
#define b4_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define b2_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = b2_nn_read_q7x4_ia(&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define q7_q15_offset_reordered_ele(src,dst)\
/* convert from q7 to q15 and then store the results in the destination buffer */\
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src);\
\
/* Extract and sign extend each of the four q7 values to q15 */\
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
out_q15x2_2 = __SXTB16(in_q7x4);\
\
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
\
write_q15x2_ia(&dst, out_q15x2_2);\
write_q15x2_ia(&dst, out_q15x2_1);
#define q31_assign2(src,dst) \
*dst++ = *src++; \
*dst++ = *src++;
#define q31_assign4(src,dst) \
q31_assign2(src,dst) \
q31_assign2(src,dst) \
#define q31_assign6(src,dst) \
q31_assign4(src,dst) \
q31_assign2(src,dst) \
#define q31_assign8(src,dst) \
q31_assign4(src,dst) \
q31_assign4(src,dst) \
#define q31_assign10(src,dst) \
q31_assign8(src,dst) \
q31_assign2(src,dst) \
#define q31_assign12(src,dst) \
q31_assign10(src,dst) \
q31_assign2(src,dst) \
#define q31_pad2(dst,padvalue) \
*dst++ = padvalue; \
*dst++ = padvalue; \
#define q31_pad4(dst,padvalue) \
q31_pad2(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define q31_pad6(dst,padvalue) \
q31_pad4(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define q31_pad10(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad4(dst,padvalue) \
#define q31_pad14(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad6(dst,padvalue) \
q31_pad2(dst,padvalue) \
#define assignq31toq15()\
dst = (q15_t*)dst_31;\
dst2 = (q15_t*)dst2_31;\
dst3 = (q15_t*)dst3_31;\
dst4 = (q15_t*)dst4_31;\
dst5 = (q15_t*)dst5_31;\
dst6 = (q15_t*)dst6_31;\
dst7 = (q15_t*)dst7_31;\
#define assignq15toq31()\
dst_31 = (q31_t*)dst;\
dst2_31 = (q31_t*)dst2;\
dst3_31 = (q31_t*)dst3;\
dst4_31 = (q31_t*)dst4;\
dst5_31 = (q31_t*)dst5;\
dst6_31 = (q31_t*)dst6;\
dst7_31 = (q31_t*)dst7;\
/* ---------------------------------- Pad ---------------------------------- */
#define basic_pad_1row(col,dst_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_2row(col,dst_31,dst2_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_3row(col,dst_31,dst2_31,dst3_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_4row(col,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
q31_pad2(dst4_31,pad_out_q15x2) \
block_cnt--; \
}
#define basic_pad_5row(col,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0)\
{ \
q31_pad2(dst_31,pad_out_q15x2) \
q31_pad2(dst2_31,pad_out_q15x2) \
q31_pad2(dst3_31,pad_out_q15x2) \
q31_pad2(dst4_31,pad_out_q15x2) \
q31_pad2(dst5_31,pad_out_q15x2) \
block_cnt--; \
}
#define pad_1row_1col(dst_31,pad_out_q15x2) basic_pad_1row(1,dst_31,pad_out_q15x2)
#define pad_1row_2col(dst_31,pad_out_q15x2) basic_pad_1row(2,dst_31,pad_out_q15x2)
#define pad_1row_3col(dst_31,pad_out_q15x2) basic_pad_1row(3,dst_31,pad_out_q15x2)
#define pad_2row_1col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(1,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_2col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(2,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_3col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(3,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_4col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(4,dst_31,dst2_31,pad_out_q15x2)
#define pad_2row_5col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(5,dst_31,dst2_31,pad_out_q15x2)
#define pad_3row_1col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(1,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_3row_2col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(2,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_3row_3col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(3,dst_31,dst2_31,dst3_31,pad_out_q15x2)
#define pad_4row_1col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(1,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_4row_2col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(2,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_4row_3col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(3,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
#define pad_5row_1col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(1,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
#define pad_5row_2col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(2,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
#define pad_5row_3col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(3,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
/* ---------------------------------- Load ---------------------------------- */
#define basic_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define basic_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define basic_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define basic_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define basic_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
q7_q15_offset_ele(src,dst)\
q7_q15_offset_ele(src2,dst2)\
q7_q15_offset_ele(src3,dst3)\
q7_q15_offset_ele(src4,dst4)\
q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
///////////////////////// 4bit //////////////////////////
#define b4_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define b4_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define b4_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define b4_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
b4_q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define b4_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b4_q7_q15_offset_ele(src,dst)\
b4_q7_q15_offset_ele(src2,dst2)\
b4_q7_q15_offset_ele(src3,dst3)\
b4_q7_q15_offset_ele(src4,dst4)\
b4_q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
///////////////////////// 2bit //////////////////////////
#define b2_load_1row(col,src,dst)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
block_cnt--;\
}
#define b2_load_2row(col,src,src2,dst,dst2)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
block_cnt--;\
}
#define b2_load_3row(col,src,src2,src3,dst,dst2,dst3)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
block_cnt--;\
}
#define b2_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
b2_q7_q15_offset_ele(src4,dst4)\
block_cnt--;\
}
#define b2_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
block_cnt = channel_div4 * col; \
while (block_cnt > 0) \
{\
b2_q7_q15_offset_ele(src,dst)\
b2_q7_q15_offset_ele(src2,dst2)\
b2_q7_q15_offset_ele(src3,dst3)\
b2_q7_q15_offset_ele(src4,dst4)\
b2_q7_q15_offset_ele(src5,dst5)\
block_cnt--;\
}
#define b4_load_1row_1col(src,dst) b4_load_1row(1,src,dst)
#define b4_load_1row_2col(src,dst) b4_load_1row(2,src,dst)
#define b4_load_1row_3col(src,dst) b4_load_1row(3,src,dst)
#define b4_load_1row_4col(src,dst) b4_load_1row(4,src,dst)
#define b4_load_2row_1col(src,src2,dst,dst2) b4_load_2row(1,src,src2,dst,dst2)
#define b4_load_2row_2col(src,src2,dst,dst2) b4_load_2row(2,src,src2,dst,dst2)
#define b4_load_2row_3col(src,src2,dst,dst2) b4_load_2row(3,src,src2,dst,dst2)
#define b4_load_2row_4col(src,src2,dst,dst2) b4_load_2row(4,src,src2,dst,dst2)
#define b4_load_3row_1col(src,src2,src3,dst,dst2,dst3) b4_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_2col(src,src2,src3,dst,dst2,dst3) b4_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_3col(src,src2,src3,dst,dst2,dst3) b4_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define b4_load_3row_4col(src,src2,src3,dst,dst2,dst3) b4_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define b4_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b4_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b4_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_1row_1col(src,dst) b2_load_1row(1,src,dst)
#define b2_load_1row_2col(src,dst) b2_load_1row(2,src,dst)
#define b2_load_1row_3col(src,dst) b2_load_1row(3,src,dst)
#define b2_load_1row_4col(src,dst) b2_load_1row(4,src,dst)
#define b2_load_2row_1col(src,src2,dst,dst2) b2_load_2row(1,src,src2,dst,dst2)
#define b2_load_2row_2col(src,src2,dst,dst2) b2_load_2row(2,src,src2,dst,dst2)
#define b2_load_2row_3col(src,src2,dst,dst2) b2_load_2row(3,src,src2,dst,dst2)
#define b2_load_2row_4col(src,src2,dst,dst2) b2_load_2row(4,src,src2,dst,dst2)
#define b2_load_3row_1col(src,src2,src3,dst,dst2,dst3) b2_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_2col(src,src2,src3,dst,dst2,dst3) b2_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_3col(src,src2,src3,dst,dst2,dst3) b2_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_4col(src,src2,src3,dst,dst2,dst3) b2_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define b2_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_1row_1col(src,dst) basic_load_1row(1,src,dst)
#define load_1row_2col(src,dst) basic_load_1row(2,src,dst)
#define load_1row_3col(src,dst) basic_load_1row(3,src,dst)
#define load_1row_4col(src,dst) basic_load_1row(4,src,dst)
#define load_2row_1col(src,src2,dst,dst2) basic_load_2row(1,src,src2,dst,dst2)
#define load_2row_2col(src,src2,dst,dst2) basic_load_2row(2,src,src2,dst,dst2)
#define load_2row_3col(src,src2,dst,dst2) basic_load_2row(3,src,src2,dst,dst2)
#define load_2row_4col(src,src2,dst,dst2) basic_load_2row(4,src,src2,dst,dst2)
#define load_3row_1col(src,src2,src3,dst,dst2,dst3) basic_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define load_3row_2col(src,src2,src3,dst,dst2,dst3) basic_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define load_3row_3col(src,src2,src3,dst,dst2,dst3) basic_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define load_3row_4col(src,src2,src3,dst,dst2,dst3) basic_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
/* ---------------------------------- Reuse ---------------------------------- */
#define basic_reuse_1row(col,src_31,dst_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
block_cnt--;\
}
#define basic_reuse_2row(col,src_31,src2_31,dst_31,dst2_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
block_cnt--;\
}
#define basic_reuse_3row(col,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
block_cnt--;\
}
#define basic_reuse_4row(col,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
block_cnt--;\
}
#define basic_reuse_5row(col,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
q31_assign2(src5_31,dst5_31)\
block_cnt--;\
}
#define reuse_1row_1col(src_31,dst_31) basic_reuse_1row(1,src_31,dst_31)
#define reuse_1row_2col(src_31,dst_31) basic_reuse_1row(2,src_31,dst_31)
#define reuse_1row_3col(src_31,dst_31) basic_reuse_1row(3,src_31,dst_31)
#define reuse_1row_4col(src_31,dst_31) basic_reuse_1row(4,src_31,dst_31)
#define reuse_1row_5col(src_31,dst_31) basic_reuse_1row(5,src_31,dst_31)
#define reuse_1row_6col(src_31,dst_31) basic_reuse_1row(6,src_31,dst_31)
#define reuse_2row_1col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(1,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_2col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(2,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_3col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(3,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_4col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(4,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_5col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(5,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_6col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(6,src_31,src2_31,dst_31,dst2_31)
#define reuse_3row_1col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(1,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_2col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(2,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_3col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(3,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_4col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(4,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_5col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(5,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_6col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(6,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_4row_3col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(3,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_4col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(4,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_5col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(5,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_6col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(6,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_5row_3col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(3,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_4col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(4,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_5col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(5,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_6col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(6,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#endif /* ARMNN_INCLUDE_IMG2COL_ELEMENT_H_ */


@@ -0,0 +1,421 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: kernel_element.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef ARMNN_INCLUDE_KERNEL_ELEMENT_H_
#define ARMNN_INCLUDE_KERNEL_ELEMENT_H_
#include "mutable_function.h"
#include "precision_cnt.h"
#define loop_ele_ext() \
sum = __SMLAD(col32[0], k_buf1[0], sum); \
sum_2 = __SMLAD(col32[1], k_buf1[1], sum_2); \
sum_3 = __SMLAD(col32[2], k_buf1[2], sum_3); \
sum_4 = __SMLAD(col32[3], k_buf1[3], sum_4); \
col32 += 4;\
k_buf1 += 4; \
#define loop_ele() \
op_a = arm_nn_read_q15x2(col_pos); \
op_b = arm_nn_read_q15x2(col_pos + input_ch); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum = __SMLAD(op_c, k_buf1[0], sum); \
sum_2 = __SMLAD(op_a, k_buf1[q32_elements], sum_2); \
\
op_a = arm_nn_read_q15x2(col_pos + 2); \
op_b = arm_nn_read_q15x2(col_pos + input_ch + 2); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum_3 = __SMLAD(op_c, k_buf1[q32_elements*2], sum_3); \
sum_4 = __SMLAD(op_a, k_buf1[q32_elements*3], sum_4); \
\
col_pos += two_inch; \
k_buf1++;
/* end of loop_ele() */
#define prepare_loops()\
q7_t *out_1 = out + output_ch / output_scaler;\
const int32_t *out_shift = output_shift;\
const int32_t *out_mult = output_mult;\
const int32_t *obias = bias;\
uint16_t row_count = output_ch / 2;\
q31_t *ksrc = &kbuf[0];\
/* end of prepare_loops() */
#define conv_1stloop_ele()\
q31_t ch_0_out_0 = *obias;\
q31_t ch_0_out_1 = *obias++;\
q31_t ch_1_out_0 = *obias;\
q31_t ch_1_out_1 = *obias++;\
q31_t b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
q31_t b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_1stloop_ele */
#define conv_lastloop_ele()\
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
\
ksrc = ksrc2;\
/* end of conv_lastloop_ele */
#define conv_midloop_ele(k_index) \
b1 = arm_nn_read_q15x2_ia(&ip_b1);\
ch_0_out_0 = __SMLAD(ksrc[k_index], b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(ksrc[k_index], b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(ksrc2[k_index], b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia(&ip_b0);\
ch_1_out_1 = __SMLAD(ksrc2[k_index], b1, ch_1_out_1);\
/* end of conv_midloop_ele */
#define conv_midloop_ptrele() \
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_midloop_ptrele */
/* Specialized Loop Unrolling */
//this can be selected for different models
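/*
 * Note on the pattern below (added for clarity): unroll_<N>inch() fully
 * unrolls the per-column multiply-accumulate loop for a layer with N input
 * channels. Each macro emits one conv_1stloop_ele(), (N/2 - 2)
 * conv_midloop_ptrele() and one conv_lastloop_ele(), i.e. N/2 packed q15
 * MAC steps, and every step feeds two output channels (ksrc / ksrc2) and
 * two im2col columns (ip_b0 / ip_b1) at once.
 */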
#define unroll_8inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 8;\
q31_t *ksrc2 = ksrc + 4;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_12inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 12;\
q31_t *ksrc2 = ksrc + 6;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_16inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 16;\
q31_t *ksrc2 = ksrc + 8;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_20inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 20;\
q31_t *ksrc2 = ksrc + 10;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_24inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 24;\
q31_t *ksrc2 = ksrc + 12;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_32inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 32;\
q31_t *ksrc2 = ksrc + 16;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_36inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 36;\
q31_t *ksrc2 = ksrc + 18;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_40inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 40;\
q31_t *ksrc2 = ksrc + 20;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}
#define unroll_48inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 48;\
q31_t *ksrc2 = ksrc + 24;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\
/* END: Specialized Loop Unrolling */
#define b2_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
if(lower_bit == 1){\
*out = (q7_t) ((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2));\
*out_1 = (q7_t) ((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2));\
lower_bit = 3;\
}\
else{\
*out++ += (q7_t) (((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2)) << 4);\
*out_1++ += (q7_t) (((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2)) << 4);\
lower_bit = 1;\
}\
out_mult++;\
out_shift++;
#define b4_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ((ch_0_out_0 & 0x0F) + ((ch_1_out_0 & 0x0F) << 4));\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ((ch_0_out_1 & 0x0F) + ((ch_1_out_1 & 0x0F) << 4));\
out_mult++;\
out_shift++;
#define assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
*out++ = (q7_t) ch_0_out_0;\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_0_out_1;\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ch_1_out_0;\
\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_1_out_1;\
out_mult++;\
out_shift++;\
/* end of assign_requantize */
#endif /* ARMNN_INCLUDE_KERNEL_ELEMENT_H_ */

View File

@ -0,0 +1,236 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: mutable_function.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_
/* mutable functions */
#if KERNEL_PRE == 4
#define mix_read_and_pad_reordered b4_read_and_pad_reordered
#define mix_nn_read_q7x4 b4_nn_read_q7x4
#define mix_read_and_pad b4_read_and_pad
#elif KERNEL_PRE == 2
#define mix_read_and_pad_reordered b2_read_and_pad_reordered
#define mix_nn_read_q7x4 b2_nn_read_q7x4
#define mix_read_and_pad b2_read_and_pad
#else
#define mix_read_and_pad_reordered read_and_pad_reordered
#define mix_nn_read_q7x4 arm_nn_read_q7x4
#define mix_read_and_pad read_and_pad
#endif
#if INPUT_PRE == 4
#define mix_q7_q15_offset_ele b4_q7_q15_offset_ele
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_ele b2_q7_q15_offset_ele
#else
#define mix_q7_q15_offset_ele q7_q15_offset_ele
#endif
#if INPUT_PRE == 4
#define mix_q7_q15_offset_reordered_ele b4_q15_offset_reordered_ele
#define mix_load_1row_1col b4_load_1row_1col
#define mix_load_1row_2col b4_load_1row_2col
#define mix_load_1row_3col b4_load_1row_3col
#define mix_load_1row_4col b4_load_1row_4col
#define mix_load_1row_5col b4_load_1row_5col
#define mix_load_1row_6col b4_load_1row_6col
#define mix_load_1row_7col b4_load_1row_7col
#define mix_load_2row_1col b4_load_2row_1col
#define mix_load_2row_2col b4_load_2row_2col
#define mix_load_2row_3col b4_load_2row_3col
#define mix_load_2row_4col b4_load_2row_4col
#define mix_load_2row_5col b4_load_2row_5col
#define mix_load_2row_6col b4_load_2row_6col
#define mix_load_2row_7col b4_load_2row_7col
#define mix_load_3row_1col b4_load_3row_1col
#define mix_load_3row_2col b4_load_3row_2col
#define mix_load_3row_3col b4_load_3row_3col
#define mix_load_3row_4col b4_load_3row_4col
#define mix_load_3row_5col b4_load_3row_5col
#define mix_load_3row_6col b4_load_3row_6col
#define mix_load_3row_7col b4_load_3row_7col
#define mix_load_4row_1col b4_load_4row_1col
#define mix_load_4row_2col b4_load_4row_2col
#define mix_load_4row_3col b4_load_4row_3col
#define mix_load_4row_4col b4_load_4row_4col
#define mix_load_4row_5col b4_load_4row_5col
#define mix_load_4row_6col b4_load_4row_6col
#define mix_load_4row_7col b4_load_4row_7col
#define mix_load_5row_1col b4_load_5row_1col
#define mix_load_5row_2col b4_load_5row_2col
#define mix_load_5row_3col b4_load_5row_3col
#define mix_load_5row_4col b4_load_5row_4col
#define mix_load_5row_5col b4_load_5row_5col
#define mix_load_5row_6col b4_load_5row_6col
#define mix_load_5row_7col b4_load_5row_7col
#define mix_load_6row_1col b4_load_6row_1col
#define mix_load_6row_2col b4_load_6row_2col
#define mix_load_6row_3col b4_load_6row_3col
#define mix_load_6row_4col b4_load_6row_4col
#define mix_load_6row_5col b4_load_6row_5col
#define mix_load_6row_6col b4_load_6row_6col
#define mix_load_6row_7col b4_load_6row_7col
#define mix_load_7row_1col b4_load_7row_1col
#define mix_load_7row_2col b4_load_7row_2col
#define mix_load_7row_3col b4_load_7row_3col
#define mix_load_7row_4col b4_load_7row_4col
#define mix_load_7row_5col b4_load_7row_5col
#define mix_load_7row_6col b4_load_7row_6col
#define mix_load_7row_7col b4_load_7row_7col
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_reordered_ele b2_q15_offset_reordered_ele
#define mix_load_1row_1col b2_load_1row_1col
#define mix_load_1row_2col b2_load_1row_2col
#define mix_load_1row_3col b2_load_1row_3col
#define mix_load_1row_4col b2_load_1row_4col
#define mix_load_1row_5col b2_load_1row_5col
#define mix_load_1row_6col b2_load_1row_6col
#define mix_load_1row_7col b2_load_1row_7col
#define mix_load_2row_1col b2_load_2row_1col
#define mix_load_2row_2col b2_load_2row_2col
#define mix_load_2row_3col b2_load_2row_3col
#define mix_load_2row_4col b2_load_2row_4col
#define mix_load_2row_5col b2_load_2row_5col
#define mix_load_2row_6col b2_load_2row_6col
#define mix_load_2row_7col b2_load_2row_7col
#define mix_load_3row_1col b2_load_3row_1col
#define mix_load_3row_2col b2_load_3row_2col
#define mix_load_3row_3col b2_load_3row_3col
#define mix_load_3row_4col b2_load_3row_4col
#define mix_load_3row_5col b2_load_3row_5col
#define mix_load_3row_6col b2_load_3row_6col
#define mix_load_3row_7col b2_load_3row_7col
#define mix_load_4row_1col b2_load_4row_1col
#define mix_load_4row_2col b2_load_4row_2col
#define mix_load_4row_3col b2_load_4row_3col
#define mix_load_4row_4col b2_load_4row_4col
#define mix_load_4row_5col b2_load_4row_5col
#define mix_load_4row_6col b2_load_4row_6col
#define mix_load_4row_7col b2_load_4row_7col
#define mix_load_5row_1col b2_load_5row_1col
#define mix_load_5row_2col b2_load_5row_2col
#define mix_load_5row_3col b2_load_5row_3col
#define mix_load_5row_4col b2_load_5row_4col
#define mix_load_5row_5col b2_load_5row_5col
#define mix_load_5row_6col b2_load_5row_6col
#define mix_load_5row_7col b2_load_5row_7col
#define mix_load_6row_1col b2_load_6row_1col
#define mix_load_6row_2col b2_load_6row_2col
#define mix_load_6row_3col b2_load_6row_3col
#define mix_load_6row_4col b2_load_6row_4col
#define mix_load_6row_5col b2_load_6row_5col
#define mix_load_6row_6col b2_load_6row_6col
#define mix_load_6row_7col b2_load_6row_7col
#define mix_load_7row_1col b2_load_7row_1col
#define mix_load_7row_2col b2_load_7row_2col
#define mix_load_7row_3col b2_load_7row_3col
#define mix_load_7row_4col b2_load_7row_4col
#define mix_load_7row_5col b2_load_7row_5col
#define mix_load_7row_6col b2_load_7row_6col
#define mix_load_7row_7col b2_load_7row_7col
#else
#define mix_q7_q15_offset_reordered_ele q7_q15_offset_reordered_ele
#define mix_load_1row_1col load_1row_1col
#define mix_load_1row_2col load_1row_2col
#define mix_load_1row_3col load_1row_3col
#define mix_load_1row_4col load_1row_4col
#define mix_load_1row_5col load_1row_5col
#define mix_load_1row_6col load_1row_6col
#define mix_load_1row_7col load_1row_7col
#define mix_load_2row_1col load_2row_1col
#define mix_load_2row_2col load_2row_2col
#define mix_load_2row_3col load_2row_3col
#define mix_load_2row_4col load_2row_4col
#define mix_load_2row_5col load_2row_5col
#define mix_load_2row_6col load_2row_6col
#define mix_load_2row_7col load_2row_7col
#define mix_load_3row_1col load_3row_1col
#define mix_load_3row_2col load_3row_2col
#define mix_load_3row_3col load_3row_3col
#define mix_load_3row_4col load_3row_4col
#define mix_load_3row_5col load_3row_5col
#define mix_load_3row_6col load_3row_6col
#define mix_load_3row_7col load_3row_7col
#define mix_load_4row_1col load_4row_1col
#define mix_load_4row_2col load_4row_2col
#define mix_load_4row_3col load_4row_3col
#define mix_load_4row_4col load_4row_4col
#define mix_load_4row_5col load_4row_5col
#define mix_load_4row_6col load_4row_6col
#define mix_load_4row_7col load_4row_7col
#define mix_load_5row_1col load_5row_1col
#define mix_load_5row_2col load_5row_2col
#define mix_load_5row_3col load_5row_3col
#define mix_load_5row_4col load_5row_4col
#define mix_load_5row_5col load_5row_5col
#define mix_load_5row_6col load_5row_6col
#define mix_load_5row_7col load_5row_7col
#define mix_load_6row_1col load_6row_1col
#define mix_load_6row_2col load_6row_2col
#define mix_load_6row_3col load_6row_3col
#define mix_load_6row_4col load_6row_4col
#define mix_load_6row_5col load_6row_5col
#define mix_load_6row_6col load_6row_6col
#define mix_load_6row_7col load_6row_7col
#define mix_load_7row_1col load_7row_1col
#define mix_load_7row_2col load_7row_2col
#define mix_load_7row_3col load_7row_3col
#define mix_load_7row_4col load_7row_4col
#define mix_load_7row_5col load_7row_5col
#define mix_load_7row_6col load_7row_6col
#define mix_load_7row_7col load_7row_7col
#endif
#if OUTPUT_PRE == 4
#define mix_assign_requantize() b4_assign_requantize()
#elif OUTPUT_PRE == 2
#define mix_assign_requantize() b2_assign_requantize()
#else
#define mix_assign_requantize() assign_requantize()
#endif
#if KERNEL_PRE == 4
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b44_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b44_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b42_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b42_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b48_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b48_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif//OUTPUT
#elif KERNEL_PRE == 2
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b24_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b24_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b22_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b22_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b28_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b28_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif//OUTPUT
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered arm_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul arm_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif
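/*
 * Naming note (added for clarity): in the aliases above, b<K><O>_... appears
 * to encode the kernel and output precisions in bits, with 8 standing for the
 * plain 8-bit path, e.g. b42_nn_mat_mult_kernel_s8_s16_reordered is the
 * 4-bit-weight / 2-bit-output variant selected when KERNEL_PRE == 4 and
 * OUTPUT_PRE == 2.
 */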
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_ */

View File

@ -0,0 +1,31 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: precision_cnt.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_
/* MIX precision */
#define INPUT_PRE 8
#define KERNEL_PRE 8
#define OUTPUT_PRE 8
#define input_scaler (8 / INPUT_PRE)
#define weight_scaler (8 / KERNEL_PRE)
#define output_scaler (8 / OUTPUT_PRE)
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_ */
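/*
 * Illustrative note (not part of the original configuration): with the 8-bit
 * defaults above, input_scaler, weight_scaler and output_scaler all evaluate
 * to 1 and mutable_function.h resolves every mix_* alias to the plain 8-bit
 * routine. A hypothetical 4-bit-activation build would instead set
 *
 *     #define INPUT_PRE  4
 *     #define KERNEL_PRE 8
 *     #define OUTPUT_PRE 8
 *
 * so that input_scaler evaluates to 2 and the mix_load_*row_*col loaders map
 * to their b4_* counterparts.
 */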

View File

@ -0,0 +1,68 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: profile.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "stm32f7xx_hal.h"
#include <stdio.h>
#include <string.h>
static UART_HandleTypeDef UART;
#define RUNS 1
static int profile_i;
static int start, end;
static char buf[100];
void printLog(const char *s) {
static int is_initialized = 0;
if (!is_initialized) {
UART.Instance = USART1;
UART.Init.BaudRate = 115200;
UART.Init.WordLength = UART_WORDLENGTH_8B;
UART.Init.StopBits = UART_STOPBITS_1;
UART.Init.Parity = UART_PARITY_NONE;
UART.Init.Mode = UART_MODE_TX_RX;
UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
UART.Init.OverSampling = UART_OVERSAMPLING_16;
UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
if (HAL_UART_Init(&UART) != HAL_OK) {
//Error handling
}
is_initialized = 1;
}
HAL_UART_Transmit(&UART, (uint8_t*) s, strlen(s), 10);
}
void recieveChar(char *s) {
static int is_initialized = 0;
if (!is_initialized) {
UART.Instance = USART1;
UART.Init.BaudRate = 115200;
UART.Init.WordLength = UART_WORDLENGTH_8B;
UART.Init.StopBits = UART_STOPBITS_1;
UART.Init.Parity = UART_PARITY_NONE;
UART.Init.Mode = UART_MODE_TX_RX;
UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
UART.Init.OverSampling = UART_OVERSAMPLING_16;
UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
if (HAL_UART_Init(&UART) != HAL_OK) {
//Error handling
}
is_initialized = 1;
}
HAL_UART_Receive(&UART, (uint8_t*) s, 1, 10);
}
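/*
 * Minimal usage sketch (illustrative only, not part of the original header):
 * format a message into a local buffer and push it over USART1 via
 * printLog(). The layer name and cycle count are placeholder values.
 */
static inline void profile_report_example(const char *layer, int cycles) {
	char msg[64];
	snprintf(msg, sizeof(msg), "%s: %d cycles\r\n", layer, cycles);
	printLog(msg);
}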

View File

@ -0,0 +1,161 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: tinyengine_function.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <stdint.h>
#include <stdbool.h>
typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;
typedef enum {
STATE_SUCCESS = 0, /* No error */
PARAM_NO_SUPPORT = 1, /* Unsupported parameters */
} tinyengine_status;
typedef struct add_params {
int input_h, input_w, input_c, left_shift;
int input1_offset, input1_multiplier, input1_shift;
int input2_offset, input2_multiplier, input2_shift;
int output_offset, output_multiplier, output_shift;
int quantized_activation_max, quantized_activation_min;
} ADD_params;
#define TN_MAX(A,B) ((A) > (B) ? (A) : (B))
#define TN_MIN(A,B) ((A) < (B) ? (A) : (B))
// bit assignment and check
#define BIT_SET(a,b) ((a) |= (1ULL<<(b)))
#define BIT_CLEAR(a,b) ((a) &= ~(1ULL<<(b)))
#define BIT_FLIP(a,b) ((a) ^= (1ULL<<(b)))
#define BIT_CHECK(a,b) (!!((a) & (1ULL<<(b)))) // '!!' to make sure this returns 0 or 1
#define BITMASK_SET(x, mask) ((x) |= (mask))
#define BITMASK_CLEAR(x, mask) ((x) &= (~(mask)))
#define BITMASK_FLIP(x, mask) ((x) ^= (mask))
#define BITMASK_CHECK_ALL(x, mask) (!(~(x) & (mask)))
#define BITMASK_CHECK_ANY(x, mask) ((x) & (mask))
tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch16(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch24(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_1x1_s8_ch48(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value);
tinyengine_status add(int size, ADD_params *params, const int8_t *input1_data,
const int8_t *input2_data, int8_t *output_data);
tinyengine_status avg_pooling(const q7_t *input, const uint16_t input_h,
const uint16_t input_w, const uint16_t input_c, const uint16_t sample_h,
const uint16_t sample_w, const uint16_t output_h,
const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output);
tinyengine_status fully_connected_fp(const float *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch,
const uint16_t output_ch, const float *bias, const float *weights,
float *output);
tinyengine_status statble_softmax_inplace(float *input, const uint16_t length);
tinyengine_status mat_mul_fp(const float *matA, const uint16_t matA_row,
const uint16_t matA_col, const float *matB, const uint16_t matB_col,
float *output);
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const float *scales, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value);
tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data);
tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask);
tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask);
tinyengine_status where_int8(const bool* inMask, const uint16_t size, signed char* input1_data,
const char* input2_data, char* output_data);
tinyengine_status convolve_1x1_s8_fpreq_mask_partialCH(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel_sram, const q7_t *kernel_flash, const uint16_t first_k_channel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf);
#include "genInclude.h"
#include "fp_requantize_op.h"

View File

@ -0,0 +1,31 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: tinyengine_lib.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#define TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#include <stdint.h>
#include <stdio.h>
typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;
#endif /* TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_ */

View File

@ -0,0 +1,33 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: yoloOutput.h
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
typedef struct box{
float x0;
float y0;
float x1;
float y1;
float score;
} det_box;
det_box** postprocessing(signed char *input_data[3], signed char y_zero[3], float y_scale[3],
unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);
det_box** postprocessing_fp(float *input_data[3], signed char y_zero[3], float y_scale[3],
unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);
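/*
 * Hedged consumer sketch (illustrative; the allocation and ownership of the
 * returned boxes are defined by the implementation, and treating each entry
 * as a det_box pointer is an assumption): walk the detections whose count
 * postprocessing() writes through box_ret and return the best score.
 */
static inline float max_box_score_example(det_box **boxes, int box_count) {
	float best = 0.0f;
	for (int i = 0; i < box_count; i++) {
		if (boxes[i]->score > best)
			best = boxes[i]->score;
	}
	return best;
}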

View File

@ -0,0 +1,88 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: add_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"
tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data) {
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
clamped_output = TN_MAX(clamped_output, -128);
clamped_output = TN_MIN(clamped_output, 127);
output_data[i] = (int8_t)(clamped_output);
}
return STATE_SUCCESS;
}
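/*
 * Worked example (illustrative values): with input1_scale = input2_scale =
 * 0.05f, output_scale = 0.1f and all zero points 0, adding the int8 inputs
 * 40 and 20 dequantizes to 2.0 + 1.0 = 3.0, which requantizes to
 * round(3.0 / 0.1) = 30, so 30 is stored in output_data.
 */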
const int activation_min = -128;
const int activation_max = 127;
tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask) {
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
int8_t mask_value = 1;
if (clamped_output < activation_min){
clamped_output = activation_min;
mask_value = 0;
}
if (clamped_output > activation_max){
clamped_output = activation_max;
mask_value = 0;
}
output_data[i] = (int8_t)(clamped_output);
output_mask[i] = mask_value;
}
return STATE_SUCCESS;
}
tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
const float zero_y, int8_t* output_data, int8_t* output_mask) {
int mask_idx = 0;
for (int i = 0; i < size; ++i) {
float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with tvm implementation
int8_t mask_value = 1;
if (clamped_output < activation_min){
clamped_output = activation_min;
mask_value = 0;
}
if (clamped_output > activation_max){
clamped_output = activation_max;
mask_value = 0;
}
output_data[i] = (int8_t)(clamped_output);
if (mask_value == 1)
BIT_SET(*output_mask, mask_idx);
else
BIT_CLEAR(*output_mask, mask_idx);
mask_idx++;
if (mask_idx == 8){
mask_idx = 0;
output_mask++;
}
}
return STATE_SUCCESS;
}
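/*
 * Decoding sketch (illustrative, not part of the original source):
 * add_fpreq_bitmask() packs one mask bit per element, eight elements per
 * byte, least-significant bit first, so a hypothetical consumer can recover
 * the flag for element i with BIT_CHECK from tinyengine_function.h.
 */
static inline int mask_bit_example(const int8_t *output_mask, int i) {
	return BIT_CHECK(output_mask[i >> 3], i & 0x7);
}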

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch16_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch16_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
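/*
 * Buffer note (added for clarity): the partial im2col above stages two input
 * columns at a time, so runtime_buf is assumed to provide at least
 * 2 * input_ch q15_t entries for this 1x1 kernel.
 */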

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch24_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch24_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch48_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch48_fpreq(kernel,
two_column_buffer, output_ch, scales, (q7_t) out_offset,
out_activation_min, out_activation_max,
input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch8_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
output_ch, scales, (q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (float) sum * scales[i_ch_out];
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,125 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
const q7_t *kernel, const int32_t *bias, const float *scales,
const int32_t out_offset, const int32_t input_offset,
const int32_t out_activation_min, const int32_t out_activation_max,
q7_t *output, const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
output_ch, scales, (q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = (q31_t) ((float) sum * scales[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
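/*
 * Hedged call sketch (illustrative): a 1x1 convolution over a 10x10x16
 * activation map with 16 output channels. All tensors are zero-filled
 * placeholders and the output_ch-major weight layout is an assumption; the
 * only requirement enforced above is that input_ch be a multiple of 4.
 */
static tinyengine_status convolve_1x1_example(void) {
	static const q7_t input[10 * 10 * 16];
	static const q7_t kernel[16 * 16]; /* assumed [output_ch][input_ch] */
	static const int32_t bias[16];
	static const float scales[16];
	static q7_t output[10 * 10 * 16];
	static q15_t runtime_buf[2 * 16]; /* two im2col columns */
	return convolve_1x1_s8_fpreq(input, 10, 10, 16, kernel, bias, scales,
			0, 0, -128, 127, output, 10, 10, 16, runtime_buf);
}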

View File

@ -0,0 +1,287 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_inputch3_stride2_pad1_fpreq.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const float *scales, const int32_t output_offset,
const int32_t input_offset, const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input
+ (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
//4 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = mat_mult_kernel3_input3_s8_s16_fpreq(kernel, runtime_buf,
output_ch, scales, output_offset, output_activation_min,
output_activation_max, input_ch * kernel_y * kernel_x,
bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
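/* fpreq requantization: scale by the per-channel float multiplier, add the
 * output offset, and clamp to the activation range. */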
sum = (float) sum * scales[i];
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

File diff suppressed because it is too large

View File

@ -0,0 +1,92 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: add.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"
int32_t Add(int32_t a, int32_t b) {
return a + b;
}
int32_t ShiftRight(int32_t a, int offset) {
return a >> offset;
}
int32_t BitAnd(int32_t a, int32_t b) {
return a & b;
}
int32_t BitNot(int32_t a) {
return ~a;
}
int32_t MaskIfNonZero(int32_t a) {
static const int32_t zero = 0;
return a ? BitNot(zero) : zero;
}
int32_t MaskIfGreaterThan(int32_t a, int32_t b) {
return MaskIfNonZero(a > b);
}
int32_t MaskIfLessThan(int32_t a, int32_t b) {
return MaskIfNonZero(a < b);
}
static inline int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
int64_t a_64 = a;
int64_t b_64 = b;
int64_t ab_64 = a_64 * b_64;
int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
int32_t ab_x2_high32 = (int32_t)((ab_64 + nudge) / (1ll << 31));
return a == b && a == -2147483648 ? 2147483647 : ab_x2_high32;
}
static inline int32_t RoundingDivideByPOT(int32_t x, int exponent) {
const int32_t mask = ((1ll << exponent) - 1);
const int32_t zero = (0);
const int32_t one = (1);
const int32_t remainder = BitAnd(x, mask);
const int32_t threshold = Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
return Add(ShiftRight(x, exponent), BitAnd(MaskIfGreaterThan(remainder, threshold), one));
}
static inline int32_t MultiplyByQuantizedMultiplierSmallerThanOneExp(
int32_t x, int32_t quantized_multiplier, int left_shift) {
return RoundingDivideByPOT(
SaturatingRoundingDoublingHighMul(x, quantized_multiplier), -left_shift);
}
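/* Taken together, the helpers above implement TFLite-style fixed-point
 * requantization: with M = quantized_multiplier (Q31, in [2^30, 2^31)) and a
 * non-positive exponent e = left_shift, the result is approximately
 * round(x * M * 2^(e - 31)), i.e. multiplication by a real-valued scale
 * smaller than one using only integer arithmetic with round-to-nearest. */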
tinyengine_status add(int size, ADD_params* params, const int8_t* input1_data,
const int8_t* input2_data, int8_t* output_data) {
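/* Element-wise quantized add: both inputs are shifted into a higher-precision
 * domain (left_shift), rescaled to a common scale with their per-input
 * multipliers, summed, requantized to the output scale, and clamped. */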
for (int i = 0; i < size; ++i) {
const int32_t input1_val = params->input1_offset + input1_data[i];
const int32_t input2_val = params->input2_offset + input2_data[i];
const int32_t shifted_input1_val = input1_val * (1 << params->left_shift);
const int32_t shifted_input2_val = input2_val * (1 << params->left_shift);
const int32_t scaled_input1_val =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
shifted_input1_val, params->input1_multiplier, params->input1_shift);
const int32_t scaled_input2_val =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
shifted_input2_val, params->input2_multiplier, params->input2_shift);
const int32_t raw_sum = scaled_input1_val + scaled_input2_val;
const int32_t raw_output =
MultiplyByQuantizedMultiplierSmallerThanOneExp(
raw_sum, params->output_multiplier, params->output_shift) +
params->output_offset;
const int32_t clamped_output = TN_MIN(params->quantized_activation_max,
TN_MAX(params->quantized_activation_min, raw_output));
output_data[i] = (int8_t)(clamped_output);
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,223 @@
/*
* Copyright (C) 2010-2022 Arm Limited or its affiliates.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_convolve_s8_4col.c
* Description: s8_4col version of convolution using symmetric quantization.
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_convolve_s8.c
*
* Target Processor: Cortex-M CPUs
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/**
* @ingroup groupNN
*/
/**
* @addtogroup NNConv
* @{
*/
/*
* Basic s8_4col convolution function.
*
* Refer to the header file for details. The optimal use case for the DSP/MVE implementation is when input and output
* channels are multiples of 4 or at least greater than 4.
*
*/
arm_status arm_convolve_s8_4col(const q7_t *input,
const uint16_t input_x,
const uint16_t input_y,
const uint16_t input_ch,
const uint16_t input_batches,
const q7_t *kernel,
const uint16_t output_ch,
const uint16_t kernel_x,
const uint16_t kernel_y,
const uint16_t pad_x,
const uint16_t pad_y,
const uint16_t stride_x,
const uint16_t stride_y,
const int32_t *bias,
q7_t *output,
const int32_t *output_shift,
const int32_t *output_mult,
const int32_t out_offset,
const int32_t input_offset,
const int32_t out_activation_min,
const int32_t out_activation_max,
const uint16_t output_x,
const uint16_t output_y,
q15_t *buffer_a)
{
int i_batch;
for (i_batch = 0; i_batch < input_batches; i_batch++)
{
input += i_batch * (input_x * input_y * input_ch);
output += i_batch * (output_x * output_y * output_ch);
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate four columns from the input tensor for a GEMM computation */
q15_t *four_column_buf = buffer_a;
q7_t *out = output;
/* This part implements the im2col function */
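/* Patches are widened into buffer_a with the input offset applied; once four
 * columns are staged, they are multiplied against the kernel matrix in one
 * arm_nn_mat_mult_kernel_s8_s16_4col() call. */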
for (i_out_y = 0; i_out_y < output_y; i_out_y++)
{
for (i_out_x = 0; i_out_x < output_x; i_out_x++)
{
for (i_ker_y = i_out_y * stride_y - pad_y; i_ker_y < i_out_y * stride_y - pad_y + kernel_y; i_ker_y++)
{
for (i_ker_x = i_out_x * stride_x - pad_x; i_ker_x < i_out_x * stride_x - pad_x + kernel_x; i_ker_x++)
{
if (i_ker_y < 0 || i_ker_y >= input_y || i_ker_x < 0 || i_ker_x >= input_x)
{
/* Filling 0 for out-of-bound paddings */
memset(four_column_buf, 0, sizeof(q15_t) * input_ch);
}
else
{
/* Copying the pixel data to column */
arm_q7_to_q15_with_offset(input + (i_ker_y * input_x + i_ker_x) * input_ch, four_column_buf, input_ch, input_offset);
}
four_column_buf += input_ch;
}
}
/* Computation is performed once every 4 columns */
if (four_column_buf == buffer_a + 4 * input_ch * kernel_y * kernel_x)
{
out =
arm_nn_mat_mult_kernel_s8_s16_4col(kernel,
buffer_a,
output_ch,
output_shift,
output_mult,
out_offset,
out_activation_min,
out_activation_max,
input_ch * kernel_y * kernel_x,
bias,
out);
/* counter reset */
four_column_buf = buffer_a;
}
}
}
q15_t *four_column_buf_mid = buffer_a;
if (four_column_buf >= four_column_buf_mid + 2 * input_ch * kernel_y * kernel_x) {
out =
arm_nn_mat_mult_kernel_s8_s16(kernel,
four_column_buf_mid,
output_ch,
output_shift,
output_mult,
out_offset,
out_activation_min,
out_activation_max,
input_ch * kernel_y * kernel_x,
bias,
out);
four_column_buf_mid = buffer_a + 2 * input_ch * kernel_y * kernel_x;
}
/* left-over because odd number of output pixels */
if (four_column_buf != four_column_buf_mid)
{
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++)
{
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = four_column_buf_mid;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count)
{
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count)
{
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t)sum;
}
}
}
/* Return to application */
return ARM_MATH_SUCCESS;
}
/**
* @} end of NNConv group
*/

View File

@ -0,0 +1,245 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel3_input3_s8_s16.c
* Description: Matrix-multiplication function for convolution (input channel = 3 and kernel size = 3).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication function for convolution with per-channel requantization.
*
* Refer header file for details.
*
*/
q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0,
q15_t *kbuf)
{
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q15_t *ksrc = &kbuf[0];
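/* kbuf holds the kernels already widened to q15 by the calling convolution
 * routine: 27 values per output channel, channel pairs stored 54 entries
 * apart, so the fully unrolled loop below computes a 2x2 (channel x column)
 * output tile with __SMLAD without re-widening the q7 weights. */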
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
const q31_t *ip31_b0 = ip_b0;
const q31_t *ip31_b1 = ip_b1;
/* align the second pointer for A */
const q15_t *ksrc2 = ksrc + 27;
const q31_t *ksrc_31 = (const q31_t *)ksrc;
const q31_t *ksrc2_31 = (const q31_t *)ksrc2;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
//------------------4
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[0], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[0], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[0], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[0], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[1], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[1], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[1], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[1], b1, ch_1_out_1);
//------------------8
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[2], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[2], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[2], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[2], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[3], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[3], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[3], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[3], b1, ch_1_out_1);
//------------------12
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[4], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[4], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[4], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[4], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[5], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[5], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[5], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[5], b1, ch_1_out_1);
//------------------16
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[6], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[6], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[6], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[6], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[7], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[7], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[7], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[7], b1, ch_1_out_1);
//------------------20
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[8], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[8], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[8], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[8], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[9], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[9], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[9], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[9], b1, ch_1_out_1);
//------------------24
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[10], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[10], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[10], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[10], b1, ch_1_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[11], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[11], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[11], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[11], b1, ch_1_out_1);
//------------------25,26,27
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(ksrc_31[12], b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(ksrc_31[12], b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(ksrc2_31[12], b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(ksrc2_31[12], b1, ch_1_out_1);
q15_t _b0 = *ip_b0++;
q15_t _b1 = *ip_b1++;
ch_0_out_0 += ksrc[26] * _b0;
ch_0_out_1 += ksrc[26] * _b1;
ch_1_out_0 += ksrc2[26] * _b0;
ch_1_out_1 += ksrc2[26] * _b1;
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ksrc += 54;
row_count--;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
}

View File

@ -0,0 +1,174 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel_s8_s16_reordered_8mul.c
* Description: Matrix-multiplication function for convolution with reordered columns (input channel count a multiple of 8).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16_reordered.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
* requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
*
* Refer to the header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
* read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
* unifying these two functions is a potential future improvement.
*
*/
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0)
{
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q7_t *ip_a0 = input_a;
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + num_col_a;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
uint16_t col_count = num_col_a / 8;
/* accumulate over the vector */
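/* Each pass covers 8 entries of the dot product: read_and_pad_reordered() is
 * called twice per kernel row (8 q7 weights widened to q15), 8 q15 activations
 * are read from each of the two staged columns, and four __SMLAD dual-MACs
 * update each accumulator of the 2x2 (channel x column) output tile. */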
while (col_count)
{
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ip_a0 += num_col_a;
row_count--;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
}

View File

@ -0,0 +1,215 @@
/*
* Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the License); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an AS IS BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ----------------------------------------------------------------------
* This file is MODIFIED from Arm CMSIS NN Library.
*
* Project: TinyEngine
* Title: arm_nn_mat_mult_kernel_s8_s16_reordered_oddch.c
* Description: Matrix-multiplication function for convolution with reordered columns (odd number of output channels).
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Original Project: CMSIS NN Library
* Original Title: arm_nn_mat_mult_kernel_s8_s16_reordered.c
*
* Target Processor: Cortex-M cores
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
/*
* Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
* requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
*
* Refer to the header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
* read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
* unifying these two functions is a potential future improvement.
*
*/
q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
const q15_t *input_b,
const uint16_t output_ch,
const int32_t *out_shift,
const int32_t *out_mult,
const int32_t out_offset,
const int16_t activation_min,
const int16_t activation_max,
const uint16_t num_col_a,
const int32_t *const output_bias,
q7_t *out_0)
{
#if defined(ARM_MATH_LOOPUNROLL) && defined(ARM_MATH_DSP)
/* set up the second output pointers */
q7_t *out_1 = out_0 + output_ch;
const int32_t *bias = output_bias;
uint16_t row_count = output_ch / 2;
const q7_t *ip_a0 = input_a;
/* this loop over rows in A */
while (row_count)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + num_col_a;
/* Init accumulator with bias for channel N and N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = *bias++;
q31_t ch_1_out_0 = *bias;
q31_t ch_1_out_1 = *bias++;
uint16_t col_count = num_col_a / 4;
/* accumulate over the vector */
while (col_count)
{
q31_t a01, a02, a11, a12;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
out_mult++;
out_shift++;
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
ch_1_out_0 += out_offset;
ch_1_out_0 = MAX(ch_1_out_0, activation_min);
ch_1_out_0 = MIN(ch_1_out_0, activation_max);
*out_0++ = (q7_t)ch_1_out_0;
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
ch_1_out_1 += out_offset;
ch_1_out_1 = MAX(ch_1_out_1, activation_min);
ch_1_out_1 = MIN(ch_1_out_1, activation_max);
*out_1++ = (q7_t)ch_1_out_1;
out_mult++;
out_shift++;
/* skip row */
ip_a0 += num_col_a;
row_count--;
}
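/* When output_ch is odd, one output channel remains after the pairwise loop;
 * it is computed below for both staged columns from a single kernel row. */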
if (output_ch & 1)
{
/* setup pointers for B */
const q15_t *ip_b0 = input_b;
const q15_t *ip_b1 = ip_b0 + num_col_a;
/* Init accumulator with bias for channel N + 1 */
q31_t ch_0_out_0 = *bias;
q31_t ch_0_out_1 = ch_0_out_0;
int32_t col_count = num_col_a / 4;
while (col_count)
{
q31_t a01, a02;
q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);
ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);
ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
b0 = arm_nn_read_q15x2_ia(&ip_b0);
b1 = arm_nn_read_q15x2_ia(&ip_b1);
ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
col_count--;
} /* while over col_count */
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
ch_0_out_0 += out_offset;
ch_0_out_0 = MAX(ch_0_out_0, activation_min);
ch_0_out_0 = MIN(ch_0_out_0, activation_max);
*out_0++ = (q7_t)ch_0_out_0;
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
ch_0_out_1 += out_offset;
ch_0_out_1 = MAX(ch_0_out_1, activation_min);
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
*out_1++ = (q7_t)ch_0_out_1;
}
out_0 += output_ch;
/* return the new output pointer with offset */
return out_0;
#else
(void)input_a;
(void)input_b;
(void)output_ch;
(void)out_shift;
(void)out_mult;
(void)out_offset;
(void)activation_min;
(void)activation_max;
(void)num_col_a;
(void)output_bias;
(void)out_0;
/* To be completed */
return NULL;
#endif
}

View File

@ -0,0 +1,57 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: avgpooling.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status avg_pooling(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t* output)
{
int h, w, c;
int sh, sw;
const int divider_half = ((sample_h * sample_w) / 2);
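/* Adding (or subtracting) half the divisor before the truncating division
 * below makes it round half away from zero, e.g. 7/4 -> (7+2)/4 = 2. */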
for(c = 0; c < input_c; c++){
for(h = 0; h < output_h; h++){
for(w = 0; w < output_w; w++){
int avg = 0;
for(sh = 0; sh < sample_h; sh++){
int height = sh + h * sample_h;
for(sw = 0; sw < sample_w; sw++){
int width = sw + w * sample_w;
avg += input[(width + height * input_w) * input_c + c];
}
}
// for rounded div
if (avg > 0)
avg += divider_half;
else
avg -= divider_half;
int out = avg / (sample_h * sample_w);
out = TN_MAX(out, out_activation_min);
out = TN_MIN(out, out_activation_max);
output[(w + h * output_w) * input_c + c] = out;
}
}
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: concat_ch.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
tinyengine_status concat_ch(const q7_t *input1, const uint16_t input_x,
const uint16_t input_y, const uint16_t input1_ch, const q7_t* input2, const uint16_t input2_ch, q7_t *output) {
int elements = input_y * input_x;
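/* HWC layout: concatenating along the channel axis is a per-pixel copy of
 * input1_ch bytes followed by input2_ch bytes. */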
while(elements--){
//place the first input
memcpy(output, input1, input1_ch);
input1 += input1_ch; output += input1_ch;
//place the second input
memcpy(output, input2, input2_ch);
input2 += input2_ch; output += input2_ch;
}
return STATE_SUCCESS;
}

View File

@ -0,0 +1,127 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
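/* A 1x1 kernel needs no spatial window, so im2col reduces to an
 * offset/widening copy of each pixel's channels; two pixels (columns) are
 * staged per pass for the 2-column matrix-multiplication kernel. */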
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = arm_nn_mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,158 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_SRAM.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
//#define FULL_UNROLL
tinyengine_status convolve_1x1_s8_SRAM(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf) {
if (input_ch % 4 != 0 || input_ch % 2 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
/* whether kernels can fit in the buffer */
//fill in kernels
const q7_t *ip_a0 = kernel;
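/* Expand the q7 kernels once into the q15 kbuf (two output channels per
 * iteration, reordered to match the im2col layout) so the widening cost is
 * not paid again for every output pixel. */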
for (int i = 0; i < output_ch; i += 2) {
q31_t *dst1 = &kbuf[i * input_ch / 2]; //each q31_t store 2 elements
q31_t *dst2 = dst1 + input_ch / 2;
/* align the second pointer for A */
const q7_t *ip_a1 = ip_a0 + input_ch;
uint16_t col_count = input_ch / 4;
/* accumulate over the vector */
while (col_count) {
q31_t a01, a02, a11, a12;
ip_a0 = read_and_pad_reordered(ip_a0, &dst1[0], &dst1[1]);
ip_a1 = read_and_pad_reordered(ip_a1, &dst2[0], &dst2[1]);
dst1 += 2;
dst2 += 2;
col_count--;
} /* while over col_count */
/* skip row */
ip_a0 += input_ch;
}
/* output stationary */
for (i_element = 0; i_element < num_elements; i_element += 2) {
q7_t *src = &input[i_element * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_s16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch,
bias, out, kbuf);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch16.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch16(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch24.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch24(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch24(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch48.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch48(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch48(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_ch8.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered_ch8(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,135 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_kbuf.c
* Description: Pointwise (1x1) convolution that nests its loops according to the runtime buffer size
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_kbuf(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const q31_t *kbuf, const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf){
	if (input_ch % 4 != 0) {	/* input_ch must be a multiple of 4 */
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
volatile int sbufsize = get_sbuffer_size();
int maxcol = sbufsize / input_ch / 2;
/* whether kernels can fit in the buffer */
//fill in kernels
const q7_t *ip_a0 = kernel;
/* output stationary */
for (i_element = 0; i_element < num_elements; i_element += 2) {
q7_t *src = &input[i_element * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_s16(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch,
bias, out, kbuf);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
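
/*
 * Usage sketch (assumed calling convention, with made-up shapes): the q7
 * kernel is expected to be pre-expanded into a q31 kernel buffer once and
 * reused across calls, e.g.
 *
 *   q31_t *kbuf = (q31_t *) get_kernel_buffer();
 *   // ... pack `kernel` into kbuf (two q15 weights per q31 word), in the
 *   //     layout expected by mat_mult_s16 ...
 *   convolve_1x1_s8_kbuf(input, 56, 56, 16, kernel, kbuf, bias,
 *                        output_shift, output_mult, out_offset, input_offset,
 *                        -128, 127, output, 56, 56, 16, runtime_buf);
 *
 * The 56x56x16 shapes are placeholder values; only the argument order matches
 * the signature above.
 */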

View File

@ -0,0 +1,128 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_oddch.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_oddch(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
if (input_ch % 4 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = arm_nn_mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,153 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_1x1_s8_skip_pad.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"
#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)
tinyengine_status convolve_1x1_s8_skip_pad(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
const int32_t *bias, const int32_t *output_shift,
const int32_t *output_mult, const int32_t out_offset,
const int32_t input_offset, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int32_t num_elements = output_x * output_y;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q31_t *kbuf = get_kernel_buffer();
volatile int sbufsize = get_sbuffer_size();
int maxcol = sbufsize / input_ch / 2;
int h=0,w=0;
for (i_element = 0; i_element < num_elements / 2; i_element++) {
/* Fill buffer for partial im2col - two columns at a time */
q7_t *src = &input[i_element * input_ch * 2];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int skip = 0;
//first element
if (w < pad_l || w >= input_x - pad_r){
if (h < pad_t || h >= input_y - pad_b){
skip++;
}
}
//move to the next element
w++;
if (w == input_x - 1){
h++; w = 0;
}
//second element
if (w < pad_l || w >= input_x - pad_r){
if (h < pad_t || h >= input_y - pad_b){
skip++;
}
}
if (skip == 2){
out += output_ch * 2;
continue;
}
int cnt = channel_div4; //two columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
out = mat_mult_kernel_s8_s16_reordered(kernel,
two_column_buffer, output_ch, output_shift, output_mult,
(q7_t) out_offset, out_activation_min,
out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
bias, out);
}
/* check if there is an odd column left-over for computation */
if (num_elements & 0x1) {
int32_t i_ch_out;
const q7_t *ker_a = kernel;
q7_t *src = &input[(num_elements - 1) * input_ch];
q15_t *dst = two_column_buffer;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int cnt = channel_div4; //two * numof2col columns
while (cnt > 0) {
q7_q15_offset_reordered_ele(src, dst)
cnt--;
}
for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
q31_t sum = bias[i_ch_out];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t in_b1, in_b2;
ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);
in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, in_b1, sum);
in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, in_b2, sum);
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i_ch_out],
output_shift[i_ch_out]);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
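
/*
 * Reading of the skip logic above (hedged): for patch-based inference the
 * patch carries halo pixels that lie entirely inside the zero-padding region.
 * A pair of output pixels is skipped (out is advanced by 2 * output_ch with
 * no GEMM work) only when both pixels fall inside the padded border in x AND
 * y, i.e. w < pad_l || w >= input_x - pad_r together with
 * h < pad_t || h >= input_y - pad_b. For example, with
 * pad_t = pad_b = pad_l = pad_r = 1, the corner pixel at (w, h) = (0, 0)
 * satisfies both conditions and is skipped.
 */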

View File

@ -0,0 +1,213 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel2x3_inputch3_stride2_pad1.c
 * Description: for 2x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel2x3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
const int kernel_y = 2;
const int kernel_x = 3;
//check this during code gen for better performance
if(input_x % 2 != 0 || input_y % 2 != 0){
return PARAM_NO_SUPPORT;
}
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q15_t *kbuf = (q15_t*) get_kernel_buffer();
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 18]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 18;
const q7_t *ip_a1 = ip_a0 + 18;
		//18 weights for each output_ch (2x3 kernel, 3 input channels)
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//17, 18
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
/* skip row */
ip_a0 += 18;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) {
//load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
} else {
//first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
//load 6 elements
//4 * 1 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//4 * 1 + 2 = 6
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
}
} else {
//Padding the first row
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) {
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
//4 * 2 + 1 = 9
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
} else {
src2 = input;
//pad the first col: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
}
}
two_column_buf += 18;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 18) {
out = mat_mult_unloop18_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* Return to application */
return STATE_SUCCESS;
}
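
/*
 * kbuf layout sketch (derived from the caching loop above): the
 * 2 x 3 x 3 = 18 q7 weights of each output channel are widened to q15 and
 * cached two output channels at a time:
 *
 *   kbuf[ 0..17]  output channel 0   (4 read_and_pad loads + 2 tail values)
 *   kbuf[18..35]  output channel 1
 *   kbuf[36..53]  output channel 2, ...
 *
 * so mat_mult_unloop18_s8_s16 can fetch two q15 weights per 32-bit load.
 * This assumes output_ch is even, matching the i += 2 stride of the loop.
 */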

View File

@ -0,0 +1,285 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_inputch3_stride2_pad1.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src, dst)
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q7_q15_offset_ele(src2, dst2)
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
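
/*
 * Worked view of the leftover MAC loop above (illustrative): read_and_pad()
 * widens four q7 weights into two q31 words, each packing two sign-extended
 * q15 values, and arm_nn_read_q15x2_ia() fetches q15 input pairs laid out to
 * match. Each
 *
 *   sum = __SMLAD(ker_a1, ip_b1, sum);   // two multiply-accumulates at once
 *
 * retires two MACs, so the 27 weights of a 3x3x3 window are consumed in
 * 27 >> 2 = 6 loop iterations (two SMLADs each) plus a 3-element scalar tail.
 */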

View File

@ -0,0 +1,300 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3_stride1_pad1.c
 * Description: for 3x3 convolution with stride 1 and padding 1; input channels must be a multiple of 4
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3_stride1_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
if (input_ch % 4 != 0) {
return PARAM_NO_SUPPORT;
}
int32_t i_element;
(void) input_x;
(void) input_y;
/* Partial(two columns) im2col buffer */
q15_t *two_column_buffer = runtime_buf;
q7_t *out = output;
const int channel_div4 = (input_ch >> 2);
const int16_t inoff16 = input_offset;
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q31_t pad_q15x2 = __PKHBT(pad_value, pad_value, 16);
q31_t pad_out_q15x2 = __SADD16(pad_q15x2, offset_q15x2);
int in_row_offset = input_ch * input_x;
for (int i_out_y = 0; i_out_y < output_y; i_out_y++) {
const int16_t base_idx_y = i_out_y - 1;
for (int i_out_x = 0; i_out_x < output_x; i_out_x++) {
const int16_t base_idx_x = i_out_x - 1;
//Img2col for 3x3 kernel
/* Used for SIMD instructions */
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
int block_cnt;
q15_t *col_buffer = &two_column_buffer[0];
			//TODO: move these two if statements out of the inner loop to reduce overhead
int ypad_cnt = 0; //no pad by default
if (base_idx_y == -1) { //pad the first row
q31_t *dst_31 = (q31_t*) &col_buffer[0];
int block_cnt = channel_div4;//unroll by 2, 3 element
while (block_cnt > 0) {//total: 16bit * input_ch * 3
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
block_cnt--;
}
ypad_cnt = 1;
}
else if (base_idx_y + 2 == input_y) { //pad the third row
q31_t *dst_31 = (q31_t*) &col_buffer[input_ch * 6];
int block_cnt = channel_div4;//unroll by 2, 3 element
while (block_cnt > 0) {//total: 16bit * input_ch * 3
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
block_cnt--;
}
ypad_cnt = 2;
}
if (ypad_cnt == 0){ //filled all rows
if (base_idx_x == -1) {
/* use pad for the first 1 col */
q31_t *dst_31 = (q31_t*) &col_buffer[0];
q31_t *dst2_31 = (q31_t*) &col_buffer[input_ch * 3];
q31_t *dst3_31 = (q31_t*) &col_buffer[input_ch * 6];
pad_3row_1col(dst_31, dst2_31, dst3_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + base_idx_y * input_x * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
q15_t *dst3 = dst3_31;
load_3row_2col(src, src2, src3, dst, dst2, dst3)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col */
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
					q15_t *dst = &col_buffer[0];
					q15_t *dst2 = &col_buffer[input_ch * 3];
					q15_t *dst3 = &col_buffer[input_ch * 6];
load_3row_2col(src, src2, src3, dst, dst2, dst3)
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
q31_t *dst3_31 = (q31_t*) dst3;
/* use pad for the last 1 col*/
pad_3row_1col(dst_31,dst2_31,dst3_31,pad_out_q15x2)
} else {
/* load 3 col */
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
const q7_t *src3 = src2 + in_row_offset;
					q15_t *dst = &col_buffer[0];
					q15_t *dst2 = &col_buffer[input_ch * 3];
					q15_t *dst3 = &col_buffer[input_ch * 6];
load_3row_3col(src, src2, src3, dst, dst2, dst3)
}
}
else if (ypad_cnt == 1){//filled the last two rows
if (base_idx_x == -1){
/* use pad for the first 1 col */
					q31_t *dst_31 = (q31_t *) &col_buffer[input_ch * 3];
					q31_t *dst2_31 = (q31_t *) &col_buffer[input_ch * 6];
pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + 0;
const q7_t *src2 = src + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
load_2row_2col(src, src2, dst, dst2)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col*/
					q15_t *dst = &col_buffer[input_ch * 3];
					q15_t *dst2 = &col_buffer[input_ch * 6];
const q7_t *src = input + base_idx_x * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_2col(src, src2, dst, dst2)
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
/* use pad for the last 1 col*/
pad_2row_1col(dst_31,dst2_31,pad_out_q15x2)
}
else {
/* load 3 col*/
q15_t *dst = &col_buffer[input_ch * 3];
q15_t *dst2 = &col_buffer[input_ch * 6];
const q7_t *src = input + base_idx_x * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_3col(src, src2, dst, dst2)
}
} else{ //filled the first two rows
if (base_idx_x == -1) {
/* use pad for the first 1 col*/
q31_t *dst_31 = (q31_t*) &col_buffer[0];
q31_t *dst2_31 = (q31_t*) &col_buffer[input_ch * 3];
pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
/* load input to 2 col*/
const q7_t *src = input + (base_idx_y * input_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
q15_t *dst = dst_31;
q15_t *dst2 = dst2_31;
load_2row_2col(src, src2, dst, dst2)
} else if (base_idx_x + 2 == input_x) {
/* load 2 col*/
q15_t *dst = &col_buffer[input_ch * 0];
q15_t *dst2 = &col_buffer[input_ch * 3];
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_2col(src, src2, dst, dst2)
/* use pad for the last 1 col*/
q31_t *dst_31 = (q31_t*) dst;
q31_t *dst2_31 = (q31_t*) dst2;
pad_2row_1col(dst_31,dst2_31,pad_out_q15x2)
} else {
/* load 3 col*/
q15_t *dst = &col_buffer[input_ch * 0];
q15_t *dst2 = &col_buffer[input_ch * 3];
/* load input to 1 col*/
const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
const q7_t *src2 = src + in_row_offset;
load_2row_3col(src, src2, dst, dst2)
}
}
two_column_buffer += input_ch * 9;
			/* Computation is performed for every 2 columns */
if (two_column_buffer == runtime_buf + 2 * input_ch * 9)
{
out = mat_mult_kernel_s8_s16(kernel,
runtime_buf,
output_ch,
output_shift,
output_mult,
output_offset,
output_activation_min,
output_activation_max,
input_ch * 9,
bias,
out);
/* counter reset */
two_column_buffer = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buffer != runtime_buf)
{
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++)
{
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * 9) >> 2;
while (col_count)
{
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * 3 * 3 & 0x3;
while (col_count)
{
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t)sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}
/**
* @} end of NNConv group
*/
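
/*
 * im2col layout sketch for the loop above (hedged): each output pixel's 3x3
 * window is written into col_buffer as three row segments of
 * kernel_x * input_ch = 3 * input_ch q15 values:
 *
 *   col_buffer[0            .. 3*input_ch - 1]   top row    (or padding)
 *   col_buffer[3*input_ch   .. 6*input_ch - 1]   middle row
 *   col_buffer[6*input_ch   .. 9*input_ch - 1]   bottom row (or padding)
 *
 * Two such 9*input_ch columns are packed back to back before
 * mat_mult_kernel_s8_s16 runs, which is why the trigger condition is
 * two_column_buffer == runtime_buf + 2 * input_ch * 9.
 */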

View File

@ -0,0 +1,232 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_s8_kernel3x2_inputch3_stride2_pad1.c
 * Description: for 3x2 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_s8_kernel3x2_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 2;
//check this during code gen for better performance
if(input_x % 2 != 0 || input_y % 2 != 0){
return PARAM_NO_SUPPORT;
}
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 18]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 18;
const q7_t *ip_a1 = ip_a0 + 18;
		//18 weights for each output_ch (3x2 kernel, 3 input channels)
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//17, 18
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
/* skip row */
		ip_a0 += 18;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q7_t *src;
q7_t *src2;
q7_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 6;
dst3 = dst2 + 6;
if (base_idx_y != -1) {
if (base_idx_x != -1) {
//load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//3 * 2 = 6
q7_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
} else {
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first col: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 3 elements
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else {
//Padding the first row
//3x2 = 6 elements
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) {
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//3 * 2 = 6 = 4 * 1 + 2
q7_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q7_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
} else {
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 3 elements
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 18;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 18) {
out = mat_mult_unloop18_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* Return to application */
return STATE_SUCCESS;
}
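
/*
 * Buffer-size note (assumption, mirroring the 2x3 variant above): kbuf must
 * hold output_ch * 18 q15 values (3x2 kernel, 3 input channels, two output
 * channels cached per iteration of the packing loop), and runtime_buf must
 * hold 2 * 18 q15 values for the double im2col column consumed by
 * mat_mult_unloop18_s8_s16.
 */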

View File

@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_u8_kernel3_inputch3_stride1_pad1.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_u8_kernel3_stride1_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y) - 1;
const int16_t base_idx_x = (i_out_x) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q8_t *src;
q8_t *src2;
q8_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;//channel = 3
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src, dst)
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: convolve_u8_kernel3_inputch3_stride2_pad1.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status convolve_u8_kernel3_inputch3_stride2_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q15_t* kbuf, q7_t pad_value) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
	/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
const q7_t *ip_a0 = kernel;
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t store 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = dst1;
q31_t *dst2_31 = dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = dst1_31;
dst2 = dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip row */
ip_a0 += 27;
}
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
const int16_t base_idx_y = (i_out_y * 2) - 1;
const int16_t base_idx_x = (i_out_x * 2) - 1;
const q15_t *col_buffer = two_column_buf;
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
q8_t *src;
q8_t *src2;
q8_t *src3;
/* buffer for load:16bit */
q15_t *dst;
q15_t *dst2;
q15_t *dst3;
int input_row_offset = 3 * input_x;
dst = col_buffer;
dst2 = dst + 9;
dst3 = dst2 + 9;
if (base_idx_y != -1) {
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src, dst)
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src = input + (base_idx_y * input_x) * input_ch;
src2 = src + input_row_offset;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst++ = pad_out;
*dst++ = pad_out;
*dst++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
				//4 * 1 + 2 = 6
q8_q15_offset_ele(src, dst)
*dst++ = *src++ + input_offset;
*dst++ = *src++ + input_offset;
//
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
} else { // first row is padded
//3x3 = 9 elements
*dst++ = pad_out;
q31_t *dst_31 = dst;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
*dst_31++ = pad_out_q15x2;
if (base_idx_x != -1) { //load all for now and unroll all
//3x3 = 9 elements
src2 = input + (base_idx_x) * input_ch;
src3 = src2 + input_row_offset;
//4 * 2 = 8
q8_q15_offset_ele(src2, dst2)
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
} else { //first element is pad
//3x3 = 9 elements
src2 = input;
src3 = src2 + input_row_offset;
//pad the first one: 1x3 = 3
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst2++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
*dst3++ = pad_out;
//load 6 elements
q8_q15_offset_ele(src2, dst2)
*dst2++ = *src2++ + input_offset;
*dst2++ = *src2++ + input_offset;
//
q8_q15_offset_ele(src3, dst3)
*dst3++ = *src3++ + input_offset;
*dst3++ = *src3++ + input_offset;
}
}
two_column_buf += 27;
			/* Computation is performed for every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,46 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: element_mult.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
#include "arm_nnfunctions.h"
/*
 * Element-wise multiplication of an (n x n x c) tensor by a (1 x 1 x c) tensor,
 * broadcasting the 1x1xc operand over the spatial dimensions.
 */
tinyengine_status element_mult_nx1(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const q7_t* input2, const int16_t input1_offset, const int16_t input2_offset,
const int16_t output_offset, const int32_t out_activation_min, const int32_t out_activation_max,
const int32_t output_shift, const int32_t output_mult, q7_t* output)
{
int c, element;
for (element = 0; element < input_h * input_w; element++){
		const q7_t* multiplier = input2;
for (c = 0; c < input_c; c++){
const int32_t input1_val = input1_offset + *input++;
const int32_t input2_val = input2_offset + *multiplier++;
int32_t unclamped_result = input1_val * input2_val;
int32_t clamped_result = output_offset + arm_nn_requantize(unclamped_result, output_mult, output_shift);
clamped_result = MAX(clamped_result, out_activation_min);
clamped_result = MIN(clamped_result, out_activation_max);
*output++ = clamped_result;
}
}
	/* Return to application */
	return STATE_SUCCESS;
}
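
/*
 * Worked example (illustrative values only): with input1_offset = 128,
 * input2_offset = 3 and output_offset = -5, an input pair
 * (x1, x2) = (-100, 7) gives
 *
 *   unclamped = (-100 + 128) * (7 + 3) = 280
 *   result    = clamp(-5 + arm_nn_requantize(280, output_mult, output_shift),
 *                     out_activation_min, out_activation_max)
 *
 * i.e. the offsets recentre both q7 operands before the integer multiply, and
 * a single per-tensor (output_mult, output_shift) pair rescales the product
 * back to q7.
 */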

View File

@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: fully_connected.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status fully_connected_fp(
const float *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const uint16_t output_ch, const float *bias,
const float *weights, float *output)
{
int h, w, out_c, in_c;
for (h = 0; h < input_y; h++){
for (w = 0; w < input_x; w++){
int pixel_cnt = w + input_x * h;
for (out_c = 0; out_c < output_ch; out_c++){
float intermediate = bias[out_c];
				const float *start_weight = weights + out_c * input_ch;
				const float *start_input = input + input_ch * pixel_cnt;
				float *start_out = output + output_ch * pixel_cnt;
for (in_c = 0; in_c < input_ch; in_c++){
intermediate += start_weight[in_c] * start_input[in_c];
}
start_out[out_c] = intermediate;
}
}
}
return STATE_SUCCESS;
}
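
A small worked example with illustrative values: a 1x1 spatial input, 3 input channels, 2 output channels; weights are laid out as [output_ch][input_ch]:
#include "tinyengine_function.h"
static void example_fully_connected(void) {
    const float in[3] = {1.f, 2.f, 3.f};
    const float bias[2] = {0.5f, -0.5f};
    const float weights[2 * 3] = {1.f, 0.f, 0.f,   /* output channel 0 */
                                  0.f, 1.f, 1.f};  /* output channel 1 */
    float out[2];
    fully_connected_fp(in, 1, 1, 3, 2, bias, weights, out);
    /* out[0] = 0.5 + 1*1 = 1.5;  out[1] = -0.5 + 2 + 3 = 4.5 */
}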

View File

@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: mat_mul_fp.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status mat_mul_fp(
const float *matA, const uint16_t matA_row, const uint16_t matA_col,
const float* matB, const uint16_t matB_col, float* output)
{
int m, n, i;
for (n = 0; n < matA_row; n++){
for (m = 0; m < matB_col; m++){
float sum = 0;
for (i = 0; i < matA_col; i++){
sum += matA[i + n * matA_col] * matB[m + i * matB_col];
}
output[m + n * matB_col] = sum;
}
}
return STATE_SUCCESS;
}
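
A small worked example (illustrative values), assuming standard row-major storage for both operands: a [2x3] * [3x2] product.
#include "tinyengine_function.h"
static void example_mat_mul(void) {
    const float A[2 * 3] = {1, 2, 3,
                            4, 5, 6};
    const float B[3 * 2] = {1, 0,
                            0, 1,
                            1, 1};
    float C[2 * 2];
    mat_mul_fp(A, 2, 3, B, 2, C);
    /* C = {4, 5,
            10, 11} */
}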

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,50 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: maxpooling.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
tinyengine_status max_pooling(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
const int32_t out_activation_max, q7_t* output)
{
int h, w, c;
int sh, sw;
for(c = 0; c < input_c; c++){
for(h = 0; h < output_h; h++){
for(w = 0; w < output_w; w++){
int max = out_activation_min;
for(sh = 0; sh < sample_h; sh++){
int height = sh + h * sample_h;
for(sw = 0; sw < sample_w; sw++){
int width = sw + w * sample_w;
max = TN_MAX(max,input[(width + height * input_w) * input_c + c]);
}
}
int out = max;
out = TN_MAX(out, out_activation_min);
out = TN_MIN(out, out_activation_max);
output[(w + h * output_w) * input_c + c] = out;
}
}
}
return STATE_SUCCESS;
}
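
A worked example with illustrative values: non-overlapping 2x2 max pooling of a 4x4 single-channel map.
#include "tinyengine_function.h"
static void example_max_pooling(void) {
    const q7_t in[4 * 4] = { 1,  2,  3,  4,
                             5,  6,  7,  8,
                             9, 10, 11, 12,
                            13, 14, 15, 16};
    q7_t out[2 * 2];
    max_pooling(in, 4, 4, 1, 2, 2, 2, 2, -128, 127, out);
    /* out = {6, 8, 14, 16} */
}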

View File

@ -0,0 +1,252 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_convolve_s8_kernel3_inputch3_stride2.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#define HOLD_KERNEL
tinyengine_status patchpadding_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
q15_t pad_out = pad16 + inoff16;
q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
q15_t *kbuf = (q15_t*) get_kernel_buffer();
const q7_t *ip_a0 = kernel;
#ifdef HOLD_KERNEL
for (int i = 0; i < output_ch; i += 2) {
q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
q15_t *dst2 = dst1 + 27;
const q7_t *ip_a1 = ip_a0 + 27;
//27 for each output_ch
q31_t *dst1_31 = (q31_t *)dst1;
q31_t *dst2_31 = (q31_t *)dst2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
dst1_31 += 2;
dst2_31 += 2;
//25, 26, 27
dst1 = (q15_t *)dst1_31;
dst2 = (q15_t *)dst2_31;
dst1[0] = *ip_a0++;
dst1[1] = *ip_a0++;
dst1[2] = *ip_a0++;
dst2[0] = *ip_a1++;
dst2[1] = *ip_a1++;
dst2[2] = *ip_a1++;
/* skip the next filter, already loaded through ip_a1 */
ip_a0 += 27;
}
#endif
int skip = 0;
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
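/* Note: each output pixel yields one 27-entry q15 column (3x3 window x 3 input
 * channels). Valid pixels are stored as (pixel + input_offset); out-of-patch
 * positions are written as 0, which corresponds to the quantized zero point
 * and therefore acts as real-valued zero padding. */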
q15_t *col_buffer = two_column_buf;
int16_t base_idx_y = (i_out_y * 2);
int16_t base_idx_x = (i_out_x * 2);
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
const q7_t *src;
/* buffer for im2col:16bit */
q15_t *dst = col_buffer;
int skip_top = pad_t - base_idx_y;
int skip_bottom = MAX(0,(base_idx_y + 3) - (input_y - pad_b));//3x3
int y_cnt = 3;//3 rows to load
//fill zeros in the top regions
while (y_cnt > 0 && skip_top-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
base_idx_y++;
}
//fill in the middle
int skip_left = MAX(0,pad_l - base_idx_x);
int skip_right = MAX(0,(base_idx_x + 3) - (input_x - pad_r));//3x3
//address of the first valid values
int m;
for (m = 0; m < y_cnt - skip_bottom; m++){
src = input + ((base_idx_y+m) * input_x + base_idx_x + skip_left) * input_ch;
int x_cnt = 3;//3 columns to load
//fill zero for left regions
int cnt = skip_left;
while(x_cnt > 0 && cnt-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
//load the middle
while(x_cnt > skip_right){
*dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
x_cnt--;
}
//fill zero for right regions (for what's left)
while(x_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
}
y_cnt -= m;
//fill zeros in the bottom regions
while (y_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
#ifdef HOLD_KERNEL
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
// out = mat_mult_s16(kernel,
// runtime_buf, output_ch, output_shift, output_mult,
// output_offset, output_activation_min, output_activation_max,
// input_ch * kernel_y * kernel_x, bias, out, kbuf);
#else
out = arm_nn_mat_mult_kernel_s8_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
27, bias, out);
#endif
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,175 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_depthwise_kernel3x3_stride1_inplace_CHW.c
* Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnsupportfunctions.h" //TODO: remove this in the future for self-contained
#include "tinyengine_function.h"
void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);
tinyengine_status patchpadding_depthwise_kernel3x3_stride1_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
uint16_t c,i,j;
q7_t *cols_8b_start = (q7_t *)runtime_buf;
q7_t* cols_8b = (q7_t* )cols_8b_start;
const q7_t *src;
const q7_t *ksrc = kernel;
//set the output for inplace update
q7_t *inplace_out = input;
int padding_cnt = pad_t * input_x;
//shift the input ptr accordingly for HWC->CHW
input += padding_cnt * input_ch;
//handle top padding
q7_t PAD8 = pad_value;
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (i = pad_t; i < input_y - pad_b; i++){
//handle left padding
for (j = 0; j < pad_l; j++){
*cols_8b++ = PAD8;
}
cols_8b += input_x - (pad_l + pad_r);
//handle right padding
for (j = 0; j < pad_r; j++){
*cols_8b++ = PAD8;
}
}
//handle bottom padding
padding_cnt = pad_b * input_x;
//no need to shift for bottom padding
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (c = 0; c < input_ch; c++){
src = input;
cols_8b = (q7_t*)(cols_8b_start + pad_t * (input_x)); //skip pad_t rows
for(i = pad_t; i < input_y - pad_b; i++){
cols_8b += pad_l;//skip left
src += pad_l * input_ch;
for(j = pad_l; j < input_x - pad_r; j++){
*cols_8b++ = *src;// + input_offset;
src += input_ch;
}
cols_8b += pad_r;//skip right
src += pad_r * input_ch;
}
patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++, output_shift++, inplace_out, output_offset,output_activation_min, output_activation_max,cols_8b_start, input_x, input_ch);
inplace_out++;
input++;
ksrc += 9;
}
return STATE_SUCCESS;
}
void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 1
int i, j;
/* MACs for each output */
for (i = 0; i < output_y; i++) {
for (j = 0; j < output_x / 2; j++) {
q7_t *cols_8b = cols_8b_iterptr;
q31_t sum0 = bias[0];
q31_t sum1 = bias[0];
/* computation */
sum0 += cols_8b[0]*ksrc[0];
sum1 += cols_8b[1]*ksrc[0];
sum0 += cols_8b[1]*ksrc[1];
sum1 += cols_8b[2]*ksrc[1];
sum0 += cols_8b[2]*ksrc[2];
sum1 += cols_8b[3]*ksrc[2];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[3];
sum1 += cols_8b[1]*ksrc[3];
sum0 += cols_8b[1]*ksrc[4];
sum1 += cols_8b[2]*ksrc[4];
sum0 += cols_8b[2]*ksrc[5];
sum1 += cols_8b[3]*ksrc[5];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[6];
sum1 += cols_8b[1]*ksrc[6];
sum0 += cols_8b[1]*ksrc[7];
sum1 += cols_8b[2]*ksrc[7];
sum0 += cols_8b[2]*ksrc[8];
sum1 += cols_8b[3]*ksrc[8];
/* requantize */
sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
sum0 += output_offset;
sum0 = MAX(sum0, activation_min);
sum0 = MIN(sum0, activation_max);
output[(i * output_x + j * 2) * channel_offset] = sum0;
sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
sum1 += output_offset;
sum1 = MAX(sum1, activation_min);
sum1 = MIN(sum1, activation_max);
output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;
cols_8b_iterptr += STRIDE * 2;
}
if (output_x & 1) {
q7_t * cols_8b = cols_8b_iterptr;
q31_t sum = bias[0];
sum += cols_8b[0]*ksrc[0];
sum += cols_8b[1]*ksrc[1];
sum += cols_8b[2]*ksrc[2];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[3];
sum += cols_8b[1]*ksrc[4];
sum += cols_8b[2]*ksrc[5];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[6];
sum += cols_8b[1]*ksrc[7];
sum += cols_8b[2]*ksrc[8];
sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
sum += output_offset;
sum = MAX(sum, activation_min);
sum = MIN(sum, activation_max);
output[(i * output_x + output_x - 1) * channel_offset] = sum;
cols_8b_iterptr += STRIDE;
}
cols_8b_iterptr += 1 * 2;
}
}
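
For clarity, a scalar sketch of what the unrolled kernel above computes for a single output element, assuming an already-padded single-channel CHW plane of width w (illustrative reference only, not part of the optimized path):
#include "arm_nnsupportfunctions.h"
#include "tinyengine_function.h"
static q7_t dw3x3_ref(const q7_t *plane, int w, const q7_t *ksrc,
                      int32_t bias, int32_t biasR, int32_t multiplier, int32_t shift,
                      int32_t output_offset, int32_t act_min, int32_t act_max) {
    int32_t sum = bias;
    for (int ky = 0; ky < 3; ky++)
        for (int kx = 0; kx < 3; kx++)
            sum += plane[ky * w + kx] * ksrc[ky * 3 + kx];
    sum = arm_nn_requantize(sum + biasR, multiplier, shift);
    sum += output_offset;
    if (sum < act_min) sum = act_min;
    if (sum > act_max) sum = act_max;
    return (q7_t)sum;
}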

View File

@ -0,0 +1,176 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_depthwise_kernel3x3_stride2_inplace_CHW.c
* Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnsupportfunctions.h" //TODO: remove this in the future for self-contained
#include "tinyengine_function.h"
void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);
tinyengine_status patchpadding_depthwise_kernel3x3_stride2_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
uint16_t c,i,j;
q7_t *cols_8b_start = (q7_t *)runtime_buf;
q7_t* cols_8b = (q7_t* )cols_8b_start;
const q7_t *src;
const q7_t *ksrc = kernel;
//set the output for inplace update
q7_t *inplace_out = input;
int padding_cnt = pad_t * input_x;
//shift the input ptr accordingly for HWC->CHW
input += padding_cnt * input_ch;
//handle top padding
q7_t PAD8 = pad_value;
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (i = pad_t; i < input_y - pad_b; i++){
//handle left padding
for (j = 0; j < pad_l; j++){
*cols_8b++ = PAD8;
}
cols_8b += input_x - (pad_l + pad_r);
//handle right padding
for (j = 0; j < pad_r; j++){
*cols_8b++ = PAD8;
}
}
//handle bottom padding
padding_cnt = pad_b * input_x;
//no need to shift for bottom padding
while (padding_cnt--){
*cols_8b++ = PAD8;
}
for (c = 0; c < input_ch; c++){
src = input;
cols_8b = (q7_t*)(cols_8b_start + pad_t * (input_x)); //skip pad_t rows
for(i = pad_t; i < input_y - pad_b; i++){
cols_8b += pad_l;//skip left
src += pad_l * input_ch;
for(j = pad_l; j < input_x - pad_r; j++){
*cols_8b++ = *src;// + input_offset;
src += input_ch;
}
cols_8b += pad_r;//skip right
src += pad_r * input_ch;
}
patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++, output_shift++, inplace_out, output_offset,output_activation_min, output_activation_max,cols_8b_start, input_x, input_ch);
inplace_out++;
input++;
ksrc += 9;
}
return STATE_SUCCESS;
}
void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
const uint16_t output_y, const uint16_t output_x,
const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
const int32_t *shift, q7_t *output, const int32_t output_offset,
const int32_t activation_min, const int32_t activation_max,
q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 2
int i, j;
/* MACs for each output */
for (i = 0; i < output_y; i++) {
for (j = 0; j < output_x / 2; j++) {
q7_t *cols_8b = cols_8b_iterptr;
q31_t sum0 = bias[0];
q31_t sum1 = bias[0];
/* computation */
sum0 += cols_8b[0]*ksrc[0];
sum1 += cols_8b[2]*ksrc[0];
sum0 += cols_8b[1]*ksrc[1];
sum1 += cols_8b[3]*ksrc[1];
sum0 += cols_8b[2]*ksrc[2];
sum1 += cols_8b[4]*ksrc[2];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[3];
sum1 += cols_8b[2]*ksrc[3];
sum0 += cols_8b[1]*ksrc[4];
sum1 += cols_8b[3]*ksrc[4];
sum0 += cols_8b[2]*ksrc[5];
sum1 += cols_8b[4]*ksrc[5];
cols_8b += column_x;
sum0 += cols_8b[0]*ksrc[6];
sum1 += cols_8b[2]*ksrc[6];
sum0 += cols_8b[1]*ksrc[7];
sum1 += cols_8b[3]*ksrc[7];
sum0 += cols_8b[2]*ksrc[8];
sum1 += cols_8b[4]*ksrc[8];
/* requantize */
sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
sum0 += output_offset;
sum0 = MAX(sum0, activation_min);
sum0 = MIN(sum0, activation_max);
output[(i * output_x + j * 2) * channel_offset] = sum0;
sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
sum1 += output_offset;
sum1 = MAX(sum1, activation_min);
sum1 = MIN(sum1, activation_max);
output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;
cols_8b_iterptr += STRIDE * 2;
}
if (output_x & 1) {
q7_t * cols_8b = cols_8b_iterptr;
q31_t sum = bias[0];
sum += cols_8b[0]*ksrc[0];
sum += cols_8b[1]*ksrc[1];
sum += cols_8b[2]*ksrc[2];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[3];
sum += cols_8b[1]*ksrc[4];
sum += cols_8b[2]*ksrc[5];
cols_8b += column_x;
sum += cols_8b[0]*ksrc[6];
sum += cols_8b[1]*ksrc[7];
sum += cols_8b[2]*ksrc[8];
sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
sum += output_offset;
sum = MAX(sum, activation_min);
sum = MIN(sum, activation_max);
output[(i * output_x + output_x - 1) * channel_offset] = sum;
cols_8b_iterptr += STRIDE;
}
cols_8b_iterptr += 1 * 2 - (column_x & 1);
cols_8b_iterptr += (STRIDE - 1) * (column_x);
}
}

View File

@ -0,0 +1,179 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2.c
* Description: for 3x3 convolution with 3 input channels, typically for image processing
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
tinyengine_status patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
const uint16_t input_ch, const q7_t* kernel, const q31_t *kbuf, const int32_t *bias,
const int32_t *output_shift, const int32_t *output_mult,
const int32_t output_offset, const int32_t input_offset,
const int32_t output_activation_min,
const int32_t output_activation_max, q7_t *output,
const uint16_t output_x, const uint16_t output_y,
const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
const int kernel_y = 3;
const int kernel_x = 3;
int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;
/* Generate two columns from the input tensor for a GEMM computation */
q15_t *two_column_buf = runtime_buf;
q7_t *out = output;
q15_t pad16 = pad_value;
const int16_t inoff16 = input_offset;
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
/* This part implements the im2col function */
q15_t *col_buffer = two_column_buf;
int16_t base_idx_y = (i_out_y * 2);
int16_t base_idx_x = (i_out_x * 2);
//use variables
q31_t in_q7x4;
q31_t in_q15x2_1;
q31_t in_q15x2_2;
q31_t out_q15x2_1;
q31_t out_q15x2_2;
/* load address:8bit */
const q7_t *src;
/* buffer for im2col:16bit */
q15_t *dst = col_buffer;
int skip_top = pad_t - base_idx_y;
int skip_bottom = MAX(0,(base_idx_y + 3) - (input_y - pad_b));//3x3
int y_cnt = 3;//3 rows to load
//fill zeros in the top regions
while (y_cnt > 0 && skip_top-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
base_idx_y++;
}
//fill in the middle
int skip_left = MAX(0,pad_l - base_idx_x);
int skip_right = MAX(0,(base_idx_x + 3) - (input_x - pad_r));//3x3
//address of the first valid values
int m;
for (m = 0; m < y_cnt - skip_bottom; m++){
src = input + ((base_idx_y+m) * input_x + base_idx_x + skip_left) * input_ch;
int x_cnt = 3;//3 columns to load
//fill zero for left regions
int cnt = skip_left;
while(x_cnt > 0 && cnt-- > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
//load the middle
while(x_cnt > skip_right){
*dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
x_cnt--;
}
//fill zero for right regions (for what's left)
while(x_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
x_cnt--;
}
}
y_cnt -= m;
//fill zeros in the bottom regions
while (y_cnt > 0){
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
*dst++ = 0; *dst++ = 0; *dst++ = 0;
y_cnt--;
}
two_column_buf += 27;
/* Computation is performed once every 2 columns */
if (two_column_buf == runtime_buf + 2 * 27) {
out = mat_mult_s16(kernel,
runtime_buf, output_ch, output_shift, output_mult,
output_offset, output_activation_min, output_activation_max,
input_ch * kernel_y * kernel_x, bias, out, kbuf);
/* counter reset */
two_column_buf = runtime_buf;
}
}
}
/* left-over because odd number of output pixels */
if (two_column_buf != runtime_buf) {
const q7_t *ker_a = kernel;
int i;
for (i = 0; i < output_ch; i++) {
/* Load the accumulator with bias first */
q31_t sum = bias[i];
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
const q15_t *ip_as_col = runtime_buf;
/* 4 multiply and accumulates are done in one loop. */
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
while (col_count) {
q31_t ker_a1, ker_a2;
q31_t ip_b1, ip_b2;
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a1, ip_b1, sum);
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
sum = __SMLAD(ker_a2, ip_b2, sum);
col_count--;
}
/* Handle left over mac */
col_count = input_ch * kernel_y * kernel_x & 0x3;
while (col_count) {
q7_t ker_a1 = *ker_a++;
q15_t ip_b1 = *ip_as_col++;
sum += ker_a1 * ip_b1;
col_count--;
}
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
sum += output_offset;
sum = MAX(sum, output_activation_min);
sum = MIN(sum, output_activation_max);
*out++ = (q7_t) sum;
}
}
/* Return to application */
return STATE_SUCCESS;
}

View File

@ -0,0 +1,40 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: stable_softmax.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "tinyengine_function.h"
#include <float.h>
#include <math.h>
tinyengine_status statble_softmax_inplace(float *input, const uint16_t length)
{
float max = -FLT_MAX;
float exp_sum = 0;
uint16_t i;
for (i = 0; i < length; i++){
if (input[i] > max) max = input[i];
}
// inplace update
for (i = 0; i < length; i++){
input[i] = expf(input[i] - max);
exp_sum += input[i];
}
for (i = 0; i < length; i++){
input[i] = input[i] / exp_sum;
}
return STATE_SUCCESS;
}
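
A quick illustrative check of the function above:
#include "tinyengine_function.h"
static void example_softmax(void) {
    float logits[3] = {1.0f, 2.0f, 3.0f};
    statble_softmax_inplace(logits, 3);
    /* logits is now approximately {0.090, 0.245, 0.665} and sums to 1 */
}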

View File

@ -0,0 +1,85 @@
/* ----------------------------------------------------------------------
* Project: TinyEngine
* Title: upsample_byte.c
*
* Reference papers:
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
* Contact authors:
* - Wei-Ming Chen, wmchen@mit.edu
* - Wei-Chen Wang, wweichen@mit.edu
* - Ji Lin, jilin@mit.edu
* - Ligeng Zhu, ligeng@mit.edu
* - Song Han, songhan@mit.edu
*
* Target ISA: ARMv7E-M
* -------------------------------------------------------------------- */
#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
tinyengine_status upsample_byte(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
//get output resolution
const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor , output_ch = input_ch;
//upsample in a repeated manner
for(int ih = 0; ih < input_y; ih++){
q7_t* out_head = output;
//place 1 row
for(int iw = 0; iw < input_x; iw++){
for(int s = 0; s < sample_factor; s++){
memcpy(output, input, input_ch);
output += input_ch;
}
input += input_ch;
}
//copy the remaining rows
for(int s = 1; s < sample_factor; s++){
memcpy(output, out_head, output_ch * output_x);
output += output_ch * output_x;
}
}
return STATE_SUCCESS;
}
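
A worked example with illustrative values: factor-2 nearest-neighbour upsampling of a 2x2 single-channel map.
#include "tinyengine_function.h"
static void example_upsample(void) {
    const q7_t in[2 * 2] = {1, 2,
                            3, 4};
    q7_t out[4 * 4];
    upsample_byte(in, 2, 2, 1, out, 2);
    /* out = {1, 1, 2, 2,
              1, 1, 2, 2,
              3, 3, 4, 4,
              3, 3, 4, 4} */
}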
//ref: https://www.cs.toronto.edu/~guerzhoy/320/lec/upsampling.pdf
tinyengine_status upsample_byte_bilinear(const q7_t *input, const uint16_t input_x,
const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
//get output resolution
const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor , output_ch = input_ch;
// //upsample in a repeated manner
// for(int oh = 0; oh < input_y; oh++){
// int ih = oh / sample_factor;
// int rh = oh % sample_factor;
//
// q7_t* out_head = output;
// //place 1 row
// for(int ow = 0; ow < output_x; ow++){
// int iw = ow / sample_factor;
// int rw = ow % sample_factor;
//
// //exact coordinate
// q7_t* ori_input = input + input_ch * (input_x * ih + iw);
// if((rh | rw) == 0){
// memcpy(output, ori_input, input_ch);
// continue;
// }
//
// //interpolate
// q7_t* topleft = ori_input;
// q7_t* topright = ori_input + input_ch;
// q7_t* bottomleft = topleft + input_ch * input_x;
// q7_t* bottomright = topright + input_ch * input_x;
// }
// }
return STATE_SUCCESS;
}

1
TinyEngine/third_party/CMSIS vendored Submodule

@ -0,0 +1 @@
Subproject commit 5b58d2da8af7cee64cc9145ee1154609bdfee9f9

BIN
assets/detection.tflite Normal file

Binary file not shown.

View File

@ -0,0 +1,23 @@
{
"output1": {
"name": "Yolo3Output",
"input_id": "175",
"num_class": 1,
"anchors": [116, 90, 156, 198, 373, 326],
"stride": 32
},
"output2": {
"name": "Yolo3Output",
"input_id": "36",
"num_class": 1,
"anchors": [30, 61, 62, 45, 59, 119],
"stride": 16
},
"output3": {
"name": "Yolo3Output",
"input_id": "5",
"num_class": 1,
"anchors": [10, 13, 16, 30, 33, 23],
"stride": 8
}
}

[Binary image assets added in this commit; previews not shown.]

BIN
assets/figures/overview.png Normal file

Binary file not shown.


BIN
assets/vww.tflite Normal file

Binary file not shown.

View File

@ -0,0 +1,667 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: CodeGenerator.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import os
from .OpGenerator import OpGenerator
Codegen_root = "./codegen/"
include_path = Codegen_root + "Include/"
source_path = Codegen_root + "Source/"
use_hard_switsh = False
gen_kernels = True
use_aggressive_unroll = True
class CodeGenerator:
"""Provide utilities to generate C code for a given model and memory schdeule."""
parse_count = 0
header_handle = None
source_handle = None
def __init__(
self,
memsche,
inplace,
precision=8,
unsigned_input=False,
patch_params=None,
FP_output=False,
profile_mode=False,
fp_requantize=False,
tflite_op=False,
dummy_address=False,
outputTables=None,
detectionUtils=None,
):
self.MemSche = memsche
# Check if path exists, create it if not
if not os.path.exists(include_path):
os.makedirs(include_path)
if not os.path.exists(source_path):
os.makedirs(source_path)
self.header_handle = open(include_path + "genModel.h", "w")
self.source_handle = open(source_path + "genModel.c", "w")
self.inplace = inplace
self.BIT = precision
self.unsigned_input = unsigned_input
self.patch_params = patch_params
self.FP_output = FP_output
self.profile_mode = profile_mode
self.fp_requantize = fp_requantize
self.tflite_op = tflite_op
self.dummy_address = dummy_address
self.trainSRAMTable = []
self.outputTables = outputTables
self.detectionUtils = detectionUtils
def _readOnly(self, name):
if self.outputTables is None or name is None:
return True
else:
for o in self.outputTables:
if o.name in name:
return False
return True
def codeGeneration(self):
# buffer in SRAM
self._genMemBuffer()
# parse trainable parameters & assign the corresponding buffers for layers
self._parseTrainable()
# include all headers
self._includeHeaders()
# generate detection output if any
self._genDetprocessing()
# generate patch-based
self._genPatchInference()
# generate invoke function
self._genInvoke()
self._closefp()
# generate operator kernels
if gen_kernels:
op_gen = OpGenerator(include_path, source_path, self.MemSche.layer, self.fp_requantize)
op_gen.genOpcode()
def _genDetprocessing(self):
if self.detectionUtils is not None:
fp = self.source_handle
fp.write(self.detectionUtils.genPostProcessing())
def _genOpstr(self, op, *args):
if self.profile_mode:
if len(args) > 0:
return op.generate_profiling_str(*args)
else:
return op.generate_profiling_str()
else:
if len(args) > 0:
return op.generate_inference_str(*args)
else:
return op.generate_inference_str()
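# Patch-based inference (MCUNetV2): the generated end2endinference() runs the
# memory-heavy first stage patch by patch (copying each input patch into
# buffer0 and calling invoke_1patch), stitches every patch's output into
# buffer1, and then runs the remaining layers once via invoke().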
def _genPatchInference(self):
schedule = self.MemSche
layer_info = schedule.layer[0].get_layer_info()
if "is_patch" in layer_info and layer_info["is_patch"]:
fp = self.source_handle
string = ""
first_height = layer_info["input_h"]
first_width = layer_info["input_w"]
img_w = (first_width - self.patch_params["pad_l"] - self.patch_params["pad_r"]) * self.patch_params[
"n_patch"
]
# by default, we go three stride 2 conv in the patch-based inference
patch_out_w = int((first_width - self.patch_params["pad_l"]) / 8)
# by default, we go three stride 2 conv in the patch-based inference
patch_out_h = int((first_height - self.patch_params["pad_l"]) / 8)
out_w = self.patch_params["output_w"]
# generate code for testing whole inference time
string += (
"""void end2endinference(q7_t* img){
//stage 1
int i, j, h, w, c;
for (i = 0; i < """
+ str(self.patch_params["n_patch"])
+ """; i++){
uint16_t pad_t=0,pad_b=0;
if (i == 0){
pad_t = """
+ str(self.patch_params["pad_l"])
+ """;
}
else if (i == """
+ str(self.patch_params["n_patch"] - 1)
+ """){
pad_b = """
+ str(self.patch_params["pad_r"])
+ """;
}
for (j = 0; j < """
+ str(self.patch_params["n_patch"])
+ """; j++){
uint16_t pad_l=0,pad_r=0;
if (j == 0){
pad_l = """
+ str(self.patch_params["pad_l"])
+ """;
}
else if (j == """
+ str(self.patch_params["n_patch"] - 1)
+ """){
pad_r = """
+ str(self.patch_params["pad_r"])
+ """;
}
/* load partial input from the img */
q7_t* patch_input = &buffer0[0]; // for partial input
int start_x = MAX("""
+ str(first_width - self.patch_params["pad_l"])
+ """ * j - """
+ str(self.patch_params["pad_l"])
+ """,0);
int start_y = MAX("""
+ str(first_height - self.patch_params["pad_l"])
+ """ * i - """
+ str(self.patch_params["pad_l"])
+ """,0);
q7_t* img_ptr = &img[(start_x + start_y * """
+ str(img_w)
+ """) * 3];
//skip top
patch_input += pad_t * """
+ str(first_width)
+ """ * 3;
for (h = pad_t; h < """
+ str(first_height)
+ """ - pad_b; h++){
//skip left
patch_input += pad_l * 3;
//fill middle
int bytes = ("""
+ str(first_width)
+ """ - (pad_l + pad_r)) * 3;
memcpy (patch_input, img_ptr, bytes);
img_ptr += """
+ str(img_w)
+ """ * 3;
patch_input += bytes;
//skip right
patch_input += pad_r * 3;
}
invoke_1patch(pad_t,pad_b,pad_l,pad_r);
/* concat the output from buffer0 (this is set manually for now) */
q7_t* output_ptr = buffer1 + (i * """
+ str(patch_out_w)
+ """ * """
+ str(out_w)
+ """ + j * """
+ str(patch_out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ ;
for (h = 0; h < """
+ str(patch_out_h)
+ """; h++){
for (w = 0; w < """
+ str(patch_out_w)
+ """; w++){
for (c = 0; c < """
+ str(self.patch_params["output_c"])
+ """; c++){
output_ptr[(w + h * """
+ str(out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ + c] = buffer0[(w + h * """
+ str(patch_out_w)
+ """) * """
+ str(self.patch_params["output_c"])
+ """ + c];
}
}
}
}
}
//stage 2
invoke();
}"""
)
string += """
void invoke_1patch(uint16_t pad_t, uint16_t pad_b, uint16_t pad_l ,uint16_t pad_r){
"""
fp.write(string)
# gen patch-based inference code
patch_layers = []
layercnt = 0
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
if "is_patch" not in layer_info or not layer_info["is_patch"]:
break # end of patch-based
string = "/* layer " + str(layercnt) + ":" + layer_info["op"] + " */\n"
layercnt += 1
fp.write(string)
if layer_info["op"] == "CONV_2D":
# hardcode this memory schedule for quick implementation
# TODO: adjust this according to model architecture and split index
next_layer_info = schedule.layer[i + 1].get_layer_info()
if "is_patch" not in next_layer_info or not next_layer_info["is_patch"]:
layer_info["output_buf_add"] = "front"
layer_info["output_buf_add_offset"] = 0
if self.unsigned_input:
raise Exception("unsigned input is not supported by patch-based yet")
string = self._genOpstr(
op,
False,
self.FP_output,
use_aggressive_unroll,
use_hard_switsh,
self.fp_requantize,
)
fp.write(string)
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
string = self._genOpstr(op, self.fp_requantize)
fp.write(string)
elif layer_info["op"] == "ADD":
string = self._genOpstr(op)
fp.write(string)
patch_layers.append(schedule.layer[i])
# remove these layers for patching for the following code gen
for layer in patch_layers:
schedule.layer.remove(layer)
string = "}\n\n"
fp.write(string)
else: # not patch-based
string = """void end2endinference(q7_t* img){
invoke(NULL);
}
"""
fp = self.source_handle
fp.write(string)
def _genInvoke(self):
fp = self.source_handle
string = "void invoke(float* labels){\n"
fp.write(string)
schedule = self.MemSche
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
string = "/* layer " + str(i) + ":" + layer_info["op"] + " */\n"
fp.write(string)
if layer_info["op"] == "CONV_2D":
if (
self.FP_output
and "effective_scale" in layer_info
and layer_info["output_scale"] is not None
and layer_info["effective_scale"] is not None
):
use_fp = True
else:
use_fp = False
string = self._genOpstr(
op,
self.unsigned_input,
use_fp,
use_aggressive_unroll,
use_hard_switsh,
self.fp_requantize,
self.tflite_op,
self.dummy_address,
)
fp.write(string)
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
string = self._genOpstr(op, self.fp_requantize)
fp.write(string)
else:
string = self._genOpstr(op)
fp.write(string)
string = "}\n"
fp.write(string)
def _getBufferIndex(self, location):
if location == "front":
return 0
elif location == "end":
return 0
elif location == "residual":
return 1
return None
def _genMemBuffer(self):
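# SRAM layout of the single static `buffer` (in order): activation buffer0,
# residual buffer1, im2col scratch sbuf (int16), and kernel scratch kbuf (int32).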
schedule = self.MemSche
# define output tensor
string = "#define NNoutput &buffer0[" + str(_findtheinferenceOutput(schedule.layer)) + "];"
fp = self.header_handle
fp.write("\n" + string + "\n")
# activation buffers
string = "\n/* sram:" + str(schedule.peakmem) + ", flash:" + str(schedule.flash) + " */\n"
fp.write(string + "\n")
string = "static signed char buffer[" + str(schedule.peakmem) + "];\n"
fp.write(string)
accumulate_ptr = 0
string = "static signed char *buffer0 = &buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["input_output"])
fp.write(string)
string = "static signed char *buffer1 = &buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["residual"])
fp.write(string)
string = "static int16_t *sbuf = (int16_t *)&buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["im2col"])
fp.write(string)
string = "static int32_t *kbuf = (int32_t *)&buffer[" + str(accumulate_ptr) + "];\n"
accumulate_ptr += int(schedule.buffers["kernel"])
fp.write(string)
string = "const int SBuffer_size = " + str(int(schedule.buffers["im2col"])) + ";\n"
fp.write(string)
string = "const int KBuffer_size = " + str(int(schedule.buffers["kernel"])) + ";\n"
fp.write(string + "\n")
def _includeHeaders(self):
include_string = """/* Automatically generated source file */
#include <float.h>
#include "arm_nnfunctions.h"
#include "genNN.h"
#include "genModel.h"
#include "tinyengine_function.h"
//#include "tinyengine_function_fp.h"
"""
if self.profile_mode:
include_string += '#include "profile.h"\n'
include_string += """
/* Variables used by all ops */
ADD_params add_params;
//Conv_Params conv_params;
//Depthwise_Params dpconv_params;
int i;
int8_t *int8ptr;
float *fptr,*fptr2,*fptr3;
signed char* getInput() {
return &buffer0[""" + f"{self.MemSche.layer[0].params['input_buf_add_offset']}" + """];
}
signed char* getOutput() {
return NNoutput;
}\n"""
fp = self.source_handle
fp.write(include_string)
def _parseTrainable(self):
schedule = self.MemSche
for i, op in enumerate(schedule.layer):
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D":
self._parseWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["weight_name"],
self._readOnly(layer_info["weight_name"]),
)
if "bias_name" in layer_info:
self._parseBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["bias_name"],
self._readOnly(layer_info["bias_name"]),
)
else:
self._parseBias(self.parse_count, layer_info["bias"].flatten())
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
self._parseRequantize(
self.parse_count,
layer_info["shift"].flatten(),
layer_info["multiplier"].flatten(),
)
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
if layer_info["kernel_h"] > layer_info["kernel_w"]:
self._parseCWHWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["kernel_h"],
layer_info["kernel_w"],
layer_info["input_c"],
)
else:
if "weight_name" in layer_info:
self._parseCHWWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
else:
self._parseCHWWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
if "bias_name" in layer_info:
self._parseoffsetBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["input_zero_point"] * -1,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
layer_info["bias_name"],
self._readOnly(layer_info["bias_name"]),
)
else:
self._parseoffsetBias(
self.parse_count,
layer_info["bias"].flatten(),
layer_info["input_zero_point"] * -1,
layer_info["weight_value"].flatten(),
layer_info["input_c"],
)
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
self._parseRequantize(
self.parse_count,
layer_info["shift"].flatten(),
layer_info["multiplier"].flatten(),
)
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "FULLY_CONNECTED":
self._parseWeight(
self.parse_count,
layer_info["weight_value"].flatten(),
layer_info["weight_name"],
self._readOnly(layer_info["weight_name"]),
)
self._parseBias(self.parse_count, layer_info["bias"].flatten())
layer_info["parsed_trainable"] = self.parse_count
self.parse_count += 1
elif layer_info["op"] == "SOFTMAX":
pass
def _parseCWHWeight(self, Lindex, weight, height, width, channel):
fp = self.header_handle
# 8bit implementation
if self.BIT == 8:
string = "const unsigned char CWHweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for j in range(channel):
for w in range(width):
for h in range(height):
value = weight[(h * width + w) * channel + j]
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
else:
raise NotImplementedError
fp.write("};\n")
def _parseCHWWeight(self, Lindex, weight, channel):
fp = self.header_handle
kernelsize = int(len(weight) / channel)
# 8bit implementation
if self.BIT == 8:
string = "const unsigned char CHWweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for j in range(channel):
for i in range(kernelsize):
value = int(weight[i * channel + j])
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
else:
raise NotImplementedError
fp.write("};\n")
def _parseEffectivescales(self, Lindex, scales):
fp = self.header_handle
string = "const float scales" + str(Lindex) + "[" + str(len(scales)) + "] = {"
fp.write(string)
for _, value in enumerate(scales):
fp.write(str(value) + ", ")
fp.write("};\n")
def _parseWeight(self, Lindex, weight, weight_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}unsigned char weight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
fp.write(string)
for _, value in enumerate(weight):
value = int(value)
if value < 0:
value += 256
fp.write(str(format(value, "#04x")) + ", ")
fp.write("};\n")
if weight_name is not None:
for r in self.trainSRAMTable:
if r.name == weight_name:
return
self.trainSRAMTable.append(tensorRecorder(weight_name, len(weight), "unknown"))
if weight.dtype == "int8":
string = f"{const_str}unsigned char* {weight_name}=weight" + str(Lindex) + ";\n"
else:
raise NotImplementedError
fp.write(string)
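# Note on the fused bias below: the depth-wise CHW kernels copy inputs without
# adding the input offset, so the term input_offset * sum(weights of a channel)
# is folded into the bias instead. offsetBias holds the int32-clipped fused
# value; offsetRBias holds the clipped-off remainder, which the kernel adds
# back (as biasR) just before requantization.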
def _parseoffsetBias(self, Lindex, bias, input_offset, weight, channel, bias_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}int32_t offsetBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
kernelsize = int(len(weight) / channel)
# fuse the offset into bias
for i in range(channel):
tmpW = 0
for j in range(kernelsize):
tmpW += weight[j * channel + i]
fp.write(str(self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
fp.write("};\n")
string = f"{const_str}int32_t offsetRBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
kernelsize = int(len(weight) / channel)
for i in range(channel):
tmpW = 0
for j in range(kernelsize):
tmpW += weight[j * channel + i]
fp.write(str(bias[i] + tmpW * input_offset - self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
fp.write("};\n")
def _parseBias(self, Lindex, bias, bias_name=None, is_const=True):
fp = self.header_handle
const_str = "const " if is_const else ""
string = f"{const_str}int32_t bias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
fp.write(string)
for _, value in enumerate(bias):
value = int(value)
fp.write(str(value) + ", ")
fp.write("};\n")
def _parseRequantize(self, Lindex, shift, multiplier):
fp = self.header_handle
string = "const int32_t shift" + str(Lindex) + "[" + str(len(shift)) + "] = {"
fp.write(string)
for _, value in enumerate(shift):
fp.write(str(value) + ", ")
fp.write("};\n")
string = "const int32_t multiplier" + str(Lindex) + "[" + str(len(multiplier)) + "] = {"
fp.write(string)
for _, value in enumerate(multiplier):
fp.write(str(value) + ", ")
fp.write("};\n")
def int32_clip(self, a):
if a < -(2**31):
return -(2**31)
elif a > 2**31 - 1:
return 2**31 - 1
return a.astype(int)
def _closefp(self):
self.header_handle.close()
self.source_handle.close()
def _findtheinferenceOutput(layers):
for cnt, op in enumerate(layers):
if op.params["output_dtype"] != "int8":
return layers[cnt - 1].params["output_buf_add_offset"]
return layers[-1].params["output_buf_add_offset"]
class tensorRecorder:
def __init__(self, name, len, dtype):
self.name = name
self.len = len
self.dtype = dtype

View File

@ -0,0 +1,72 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: CodegenUtilTFlite.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import os
from tempfile import TemporaryDirectory
from .CodeGenerator import CodeGenerator
from .GeneralMemoryScheduler import GeneralMemoryScheduler
from .TfliteConvertor import TfliteConvertor
def GenerateSourceFilesFromTFlite(
tflite_path,
life_cycle_path=None,
):
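"""Generate TinyEngine C sources (codegen/Source/genModel.c and codegen/Include/genModel.h) from a .tflite model.
If life_cycle_path is given, the tensor life-cycle visualization is written there.
Returns the activation (input/output) buffer size in bytes required by the memory schedule.
"""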
use_inplace = True
with TemporaryDirectory() as WORKING_DIR:
if life_cycle_path is None:
schedule_image_path = os.path.join(WORKING_DIR, "schedule.png")
else:
schedule_image_path = life_cycle_path
tf_convertor = TfliteConvertor(tflite_path)
tf_convertor.parseOperatorInfo()
layer = tf_convertor.layer
outTable = []
VisaulizeTrainable = False # disable for code gen
memory_scheduler = GeneralMemoryScheduler(
layer,
False,
False,
outputTables=outTable,
inplace=use_inplace,
mem_visual_path=schedule_image_path,
VisaulizeTrainable=VisaulizeTrainable,
)
memory_scheduler.USE_INPLACE = use_inplace
memory_scheduler.allocateMemory()
outTable = tf_convertor.outputTables if hasattr(tf_convertor, "outputTables") else []
code_generator = CodeGenerator(
memsche=memory_scheduler,
inplace=memory_scheduler.USE_INPLACE,
unsigned_input=False,
patch_params=None,
FP_output=False,
profile_mode=False,
fp_requantize=True,
tflite_op=False,
dummy_address=False,
outputTables=outTable,
)
# set detection outputs before codegen if any
code_generator.codeGeneration()
return memory_scheduler.buffers["input_output"]

View File

@ -0,0 +1,389 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: GeneralMemoryScheduler.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
from .allocator.firstFit import FirstFit
from .constant import TTYPE_INFERNECE
class GeneralMemoryScheduler:
def __init__(
self,
layer,
tflite_op=False,
dummy_address=False,
memory_limit=10 * 1024 * 1024,
inplace=True,
outputTables=None,
mem_visual_path="codegen/allocation.png",
VisaulizeTrainable=True,
):
self.layer = layer
self.heads = 0
self.buffers = {
"input_output": 0,
"residual": 0,
"im2col": 0,
"kernel": 0,
"feature": 0,
"trainable": 0,
} # for feature pyramid
# overall memory info
self.peakmem = 0
self.flash = 0
self.bias = 0
self.scale = 0
self.code = 0
self.allocator = FirstFit(memory_limit)
self.outputTables = outputTables
self.USE_INPLACE = inplace
self.mem_visual_path = mem_visual_path
self.tflite_op = tflite_op
self.dummy_address = dummy_address
self.VisaulizeTrainable = VisaulizeTrainable
# for showing layer-wise memory usage
self.layermem = []
def _isTranable(self, name):
for o in self.outputTables:
if isinstance(name, str) and o.name in name:
return True
return False
def allocateMemory(self):
# assign the same graph index for inplace operations
# note: we need to handle stride == 2 for int8 depthwise to save memory
if self.USE_INPLACE:
for i, op in enumerate(self.layer):
if op.params["op"] == "DEPTHWISE_CONV_2D" and op.params["input_dtype"] == "int8" and not self.tflite_op:
# set the idx of output and next layer input
previous_output_idx = op.output_tensors[0].graph_idx
op.output_tensors[0].graph_idx = op.input_tensors[0].graph_idx
if (
i + 1 < len(self.layer)
and len(self.layer[i + 1].input_tensors) > 0
and str(self.layer[i + 1].input_tensors[0].graph_idx) == str(previous_output_idx)
):
self.layer[i + 1].input_tensors[0].graph_idx = op.input_tensors[0].graph_idx
# update following ops' tensors
for following_idx in range(i, len(self.layer)):
for cnt, inp_tensor in enumerate(self.layer[following_idx].input_tensors):
if str(inp_tensor.graph_idx) == str(previous_output_idx):
inp_tensor.graph_idx = op.input_tensors[0].graph_idx
num_layers = len(self.layer)
# go through all tensors in the model
for i, op in enumerate(self.layer):
# get all unallocated tensors for this layer
unallocated_tensors = []
for t in op.input_tensors:
if t.allocator_idx is None:
unallocated_tensors.append(t)
for cnt, t in enumerate(op.output_tensors):
if cnt == 0 and not (
self.USE_INPLACE
and op.params["op"] == "DEPTHWISE_CONV_2D"
and op.params["input_dtype"] == "int8"
and not self.tflite_op
):
if t.allocator_idx is None:
unallocated_tensors.append(t)
# assume second outputs will not be updated in place
else:
if t.allocator_idx is None:
unallocated_tensors.append(t)
# add each tensor
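# A tensor's lifetime starts at the layer that first produces or consumes it
# (start_idx) and ends right after its last consumer (end_idx); tensors that
# are never read again (e.g. the final output) stay alive until the last layer.
# Disjoint lifetimes allow the first-fit allocator to reuse the same SRAM region.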
for cnt, t in enumerate(unallocated_tensors):
start_idx = i
end_idx = i + 1 if i == 0 else num_layers
for idx in range(start_idx + 1, num_layers):
for input_t in self.layer[idx].input_tensors:
if str(t.graph_idx) == str(input_t.graph_idx):
end_idx = idx + 1
# check if this is output
ttype = TTYPE_INFERNECE
# add the tensor
t.allocator_idx = self.allocator.addTensor(start_idx, end_idx, t.len(), name=t.graph_idx, type=ttype)
# propagate the allocation to tensors with the same idx
for j in range(i + 1, num_layers):
opp = self.layer[j]
for tt in opp.input_tensors:
if str(t.graph_idx) == str(tt.graph_idx):
tt.allocator_idx = t.allocator_idx
# not inplace update
for tt in opp.output_tensors:
if str(t.graph_idx) == str(tt.graph_idx):
tt.allocator_idx = t.allocator_idx
# for detailed memory
layermem = {}
layermem["MAC"] = op.get_macs()
layermem["activation"] = op.get_activation_size()
layermem["scale"] = op.get_scale_size()
layermem["runtime"] = op.get_sbuf_size()
layermem["kernel"] = op.get_kbuf_size()
self._enlargeBuffer("im2col", layermem["runtime"])
self._enlargeBuffer("kernel", layermem["kernel"])
if (
"weight_name" in op.params
and self._isTranable(op.params["weight_name"])
and op.params["op"] != "TRANSPOSE_CONV_2D"
):
size = int(op.get_weights_size())
self.buffers["trainable"] += size
layermem["trainable"] = size
layermem["weight"] = 0
else:
layermem["weight"] = int(op.get_weights_size())
if "bias_name" in op.params and self._isTranable(op.params["bias_name"]):
size = int(op.get_bias_size())
self.buffers["trainable"] += size
if "trainable" in layermem:
layermem["trainable"] += size
else:
layermem["trainable"] = size
layermem["bias"] = 0
else:
layermem["bias"] = int(op.get_bias_size())
# if it is a float32 op, its weights/bias should come from SRAM buffers
if op.params["input_dtype"] != "int8":
layermem["scale"] = 0
layermem["bias"] = 0
layermem["weight"] = 0
self.__increaseFlash(layermem["weight"])
self.__increaseFlash(layermem["bias"])
self.__increaseFlash(layermem["scale"])
self.layermem.append(layermem)
# find out int8 inplace depthwise conv and stride == 2
for i, op in enumerate(self.layer):
if (
op.params["op"] == "DEPTHWISE_CONV_2D"
and op.params["input_dtype"] == "int8"
and op.params["stride_h"] == op.params["stride_w"] == 2
):
if op.input_tensors[0].allocator_idx == op.output_tensors[0].allocator_idx:
self.allocator.rectangles[op.input_tensors[0].allocator_idx]["stride2_inplace_idx"] = i
# Reorder the rectangles to decide which tensor needs to be scheduled first
self.allocator.sortSize()
self.allocator.allocate()
self.allocator.visualize(self.mem_visual_path)
self._enlargeBuffer("input_output", self.allocator.get_peak())
# sanity check, see if all tensors have been allocated
for i, op in enumerate(self.layer):
            # check that all tensors of this layer have been allocated
for cnt, t in enumerate(op.input_tensors):
assert t.allocator_idx is not None
for cnt, t in enumerate(op.output_tensors):
assert t.allocator_idx is not None
# assign the address according to placement
for i, op in enumerate(self.layer):
            # assign addresses to this layer's tensors
for cnt, t in enumerate(op.input_tensors):
if cnt == 0:
op.params["input_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input_buf_add"] = "front"
elif cnt == 1:
op.params["input2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input2_buf_add"] = "front"
elif cnt == 2:
op.params["input3_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["input3_buf_add"] = "front"
op.input_tensors[cnt].buffer_name = "buffer0"
op.input_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
for cnt, t in enumerate(op.output_tensors):
if cnt == 0:
op.params["output_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["output_buf_add"] = "front"
op.output_tensors[cnt].buffer_name = "buffer0"
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
if cnt == 1:
op.params["output2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
op.params["output2_buf_add"] = "front"
op.output_tensors[cnt].buffer_name = "buffer0"
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
# calculate peak mem
self.peakmem = (
self.allocator.get_peak() + self.buffers["im2col"] + self.buffers["kernel"] # + self.buffers["trainable"]
)
def dumpLayerIndex(self):
# header
print("-" * 14 + " Tensor Allocation Details " + "-" * 14)
print(" #op | operator type | input index | output index |")
for cnt, l in enumerate(self.layer):
operator_num = "#" + str(cnt)
type = str(l.params["op"])
input_tensor = ""
for cnt_inp, inp in enumerate(l.input_tensors):
input_tensor += str(inp.allocator_idx)
if cnt_inp < len(l.input_tensors) - 1:
input_tensor += ","
output_tensor = str(l.output_tensors[0].allocator_idx)
string = (
operator_num.ljust(5)
+ "|"
+ type.ljust(19)
+ "|"
+ input_tensor.ljust(13)
+ "|"
+ output_tensor.ljust(14)
+ "|"
)
print(string)
def dumpLayerMem(self):
# header
print(
"---------------------------------------------------- Schedule Details ----------------------------------------------------------------" # noqa: E501
)
print(
"----------------------| SRAM || Flash | |" # noqa: E501
)
print(
"----------------------| activation | runtime | trainable | sum || weight | bias | scale | sum | MAC |" # noqa: E501
)
layermem = self.layermem
self.__dumpMemInfo(layermem)
def __dumpMemInfo(self, layermem):
string = "-------Schedule-------|"
maxActive = self.buffers["input_output"]
maxRuntime = self.buffers["im2col"] + self.buffers["kernel"]
maxTrainable = self.buffers["trainable"]
totalWeight = self.__sumKey(layermem, "weight")
totalBias = self.__sumKey(layermem, "bias")
totalScale = self.__sumKey(layermem, "scale")
totalMAC = self.__sumKey(layermem, "MAC")
string += str(maxActive).ljust(14) + "|"
string += str(maxRuntime).ljust(11) + "|"
string += str(maxTrainable).ljust(12) + "|"
string += str(maxActive + maxRuntime + maxTrainable).ljust(8) + "||"
string += str(totalWeight).ljust(12) + "|"
string += str(totalBias).ljust(10) + "|"
string += str(totalScale).ljust(10) + "|"
string += str(totalWeight + totalBias + totalScale).ljust(13) + "|"
string += str(totalMAC).ljust(13) + "|"
print(string)
for i, _ in enumerate(layermem):
layer_info = self.layer[i].get_layer_info()
string = ""
string += str(i) + ":" + layer_info["op"]
string = string.ljust(22) + "|"
SRAM = 0
if "activation" in layermem[i]:
substr = (
str(layermem[i]["activation"]) + " (" + "{:.0%}".format(layermem[i]["activation"] / maxActive) + ")"
)
string += substr.ljust(14) + "|"
SRAM += layermem[i]["activation"]
if "runtime" in layermem[i]:
sbuf = layermem[i]["runtime"] + layermem[i]["kernel"]
substr = str(sbuf) + " (" + "{:.0%}".format(sbuf / maxRuntime) + ")"
string += substr.ljust(11) + "|"
SRAM += sbuf
else:
string = string.ljust(49) + "|"
if "trainable" in layermem[i]:
substr = (
str(layermem[i]["trainable"])
+ " ("
+ "{:.0%}".format(layermem[i]["trainable"] / maxTrainable)
+ ")"
)
string += substr.ljust(12) + "|"
SRAM += layermem[i]["trainable"]
else:
string = string.ljust(62) + "|"
# SRAM end
string += str(SRAM)
string = string.ljust(71) + "||"
flash = 0
if "weight" in layermem[i]:
substr = (
str(layermem[i]["weight"])
+ " ("
+ "{:.0%}".format(layermem[i]["weight"] / (totalWeight + 0.0001))
+ ")"
)
string += str(substr).ljust(12) + "|"
flash += layermem[i]["weight"]
if "bias" in layermem[i]:
substr = (
str(layermem[i]["bias"]) + " (" + "{:.0%}".format(layermem[i]["bias"] / (totalBias + 0.0001)) + ")"
)
string += str(substr).ljust(10) + "|"
flash += layermem[i]["bias"]
if "scale" in layermem[i]:
substr = (
str(layermem[i]["scale"]) + " (" + "{:.0%}".format(layermem[i]["scale"] / totalScale + 0.0001) + ")"
)
string += str(substr).ljust(10) + "|"
flash += layermem[i]["scale"]
if flash > 0:
string += (
str(flash)
+ " ("
+ "{:.0%}".format(flash / (totalWeight + totalBias + totalScale + 0.0001))
+ ")"
)
string = string.ljust(121) + "|"
# flash end
if "MAC" in layermem[i]:
substr = str(layermem[i]["MAC"]) + " (" + "{:.0%}".format(layermem[i]["MAC"] / totalMAC) + ")"
string += str(substr).ljust(13) + "|"
print(string)
def __sumKey(self, layers, key):
result = 0
for _, layer in enumerate(layers):
if key in layer:
result += layer[key]
return result
def getBuffers(self):
return self.buffers
    # Maximum binary size: this should be updated whenever the inference side changes
    # TODO: Combine with code generation to get a more accurate result
def profileResult(self):
return self.peakmem, self.flash + self.bias + self.scale + int(self.code * 1024)
def __increaseFlash(self, size):
self.flash += int(size)
def _enlargeBuffer(self, buf_str, size):
if buf_str == "input_output" or buf_str == "residual":
self.buffers[buf_str] = max(self.buffers[buf_str], int(size))
else:
if buf_str not in self.buffers:
self.buffers[buf_str] = size
else:
self.buffers[buf_str] = max(self.buffers[buf_str], size)
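

if __name__ == "__main__":
    # Standalone sketch with toy tensor lifetimes (hypothetical numbers, not taken
    # from any real model): with lifetimes expressed as the half-open
    # [start_idx, end_idx) ranges handed to addTensor() above, no placement of the
    # tensors can reach an activation peak below the largest total size that is
    # alive at any single layer index.
    toy_tensors = [
        {"start": 0, "end": 3, "size": 48 * 48 * 16},
        {"start": 1, "end": 2, "size": 48 * 48 * 96},
        {"start": 2, "end": 4, "size": 24 * 24 * 96},
    ]
    last_idx = max(t["end"] for t in toy_tensors)
    lower_bound = max(
        sum(t["size"] for t in toy_tensors if t["start"] <= idx < t["end"])
        for idx in range(last_idx)
    )
    print("activation peak lower bound:", lower_bound, "bytes")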

View File

@ -0,0 +1,167 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: InputResizer.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import math
def _find_previous_info(layers, idx):
for layer in layers:
info = layer.get_layer_info()
if info["output_idx"] == idx:
return info
class InputResizer:
def __init__(self, layer):
self.layer = layer
def inputResize(self, input_h, input_w):
for i, layer in enumerate(self.layer):
layer_info = layer.get_layer_info()
previous_layer_info = _find_previous_info(self.layer, layer_info["input_idx"])
# we need to handle different op
op_code_str = layer_info["op"]
if i == 0:
layer_info["input_h"] = input_h
layer_info["input_w"] = input_w
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
else:
if op_code_str == "SE_AVG_POOL_2D":
SEinput_h = previous_layer_info["output_h"]
SEinput_w = previous_layer_info["output_w"]
layer_info["input_h"] = SEinput_h
layer_info["input_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
layer_info["sample_h"] = SEinput_h
layer_info["sample_w"] = SEinput_w
else:
layer_info["input_h"] = previous_layer_info["output_h"]
layer_info["input_w"] = previous_layer_info["output_w"]
layer_info["input_c"] = previous_layer_info["output_c"]
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
if op_code_str == "AVERAGE_POOL_2D":
layer_info["filter_h"] = layer_info["input_h"]
layer_info["filter_w"] = layer_info["input_w"]
layer_info["filter_c"] = layer_info["input_c"]
# handle nodes for dag op
# find the previous node
if "dagop_input0_key" in layer_info:
for op in self.layer:
l_into = op.get_layer_info()
if (
"dagop_output_key" in l_into
and l_into["dagop_output_key"] == layer_info["dagop_input0_key"]
):
layer_info["input_h"] = l_into["output_h"]
layer_info["input_w"] = l_into["output_w"]
layer_info["input_c"] = l_into["output_c"]
if "dagop_input1_key" in layer_info:
for op in self.layer:
l_into = op.get_layer_info()
if (
"dagop_output_key" in l_into
and l_into["dagop_output_key"] == layer_info["dagop_input1_key"]
):
layer_info["input_h"] = l_into["output_h"]
layer_info["input_w"] = l_into["output_w"]
layer_info["input_c"] = l_into["output_c"]
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
layer_info["output_h"] = math.ceil(layer_info["input_h"] / layer_info["stride_h"])
layer_info["output_w"] = math.ceil(layer_info["input_w"] / layer_info["stride_w"])
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "ADD":
layer_info["output_h"] = layer_info["input_h"]
layer_info["output_w"] = layer_info["input_w"]
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
layer_info["input2_h"] = layer_info["input_h"]
layer_info["input2_w"] = layer_info["input_w"]
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
elif op_code_str == "SE_ELEMENT_MULT_2D":
layer_info["input2_h"] = SEinput_h
layer_info["input2_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
layer_info["output_h"] = SEinput_h
layer_info["output_w"] = SEinput_w
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "UPSAMPLE":
layer_info["output_h"] = layer_info["input_h"] * layer_info["factor"]
layer_info["output_w"] = layer_info["input_w"] * layer_info["factor"]
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "MAX_POOL_2D":
layer_info["output_h"] = int(layer_info["input_h"] / layer_info["filter_h"])
layer_info["output_w"] = int(layer_info["input_w"] / layer_info["filter_h"])
layer_info["output_c"] = layer_info["input_c"]
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
def _changeOPTensorSize(layer, tensor_type: str, tensor_idx: int, input_h: int, input_w: int):
if tensor_type == "input":
if hasattr(layer, "input_tensors") and len(layer.input_tensors) > tensor_idx:
layer.input_tensors[tensor_idx].set_input_w(input_w)
layer.input_tensors[tensor_idx].set_input_h(input_h)
elif tensor_type == "output":
if hasattr(layer, "output_tensors"):
layer.output_tensors[tensor_idx].set_input_w(input_w)
layer.output_tensors[tensor_idx].set_input_h(input_h)
class PatchResizer:
def __init__(self, layer):
self.layer = layer
# manually setting these variables for now
def patchResize(self, PatchLayers, PatchSize, PatchSize_height):
for i, layer in enumerate(self.layer):
layer_info = layer.get_layer_info()
if i < PatchLayers:
layer_info["is_patch"] = True
op_code_str = layer_info["op"]
if i == 0:
layer_info["input_h"] = PatchSize_height
layer_info["input_w"] = PatchSize
_changeOPTensorSize(self.layer[i], "input", 0, PatchSize_height, PatchSize)
else:
prev_layer_info = self.layer[i - 1].get_layer_info()
layer_info["input_h"] = prev_layer_info["output_h"]
layer_info["input_w"] = prev_layer_info["output_w"]
_changeOPTensorSize(
self.layer[i], "input", 0, prev_layer_info["output_h"], prev_layer_info["output_w"]
)
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
layer_info["output_h"] = math.ceil(
(layer_info["input_h"] - layer_info["kernel_h"] + 1) / layer_info["stride_h"]
)
layer_info["output_w"] = math.ceil(
(layer_info["input_w"] - layer_info["kernel_w"] + 1) / layer_info["stride_w"]
)
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
elif op_code_str == "ADD":
layer_info["output_h"] = layer_info["input_h"]
layer_info["output_w"] = layer_info["input_w"]
layer_info["input2_h"] = layer_info["input_h"]
layer_info["input2_w"] = layer_info["input_w"]
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input_h"], layer_info["input_w"])
else:
layer_info["is_patch"] = False

View File

@ -0,0 +1,118 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: OpGenerator.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
from .codetemplate.depthwiseTemplate import depthwiseInplace
class OpGenerator:
def __init__(self, incpath, srcpath, layers, fp_requantize=False):
self.incpath = incpath
self.srcpath = srcpath
self.layers = layers
self.fp_requantize = fp_requantize
def genOpcode(self):
# find all conv ops
op_list = []
for op in self.layers:
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D":
op = convOp(layer_info)
if op not in op_list:
op_list.append(op)
# go through and generate all ops
incfile = includeFile(self.incpath)
for op in op_list:
if op.isDepthwise:
if op.kernel_h > op.kernel_w:
depthwise_template = depthwiseInplace(
op.kernel_h,
op.kernel_w,
op.pad_h,
op.pad_w,
op.stride,
"CWH",
self.fp_requantize,
)
else:
depthwise_template = depthwiseInplace(
op.kernel_h,
op.kernel_w,
op.pad_h,
op.pad_w,
op.stride,
"CHW",
self.fp_requantize,
)
depthwise_template.genFile(self.srcpath)
incfile.addDefine(depthwise_template.genFuncDefine())
incfile.writeFile()
class convOp:
def __init__(self, layer_info):
if layer_info["op"] == "CONV_2D":
isDepthwise = False
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
isDepthwise = True
kernel_h = layer_info["kernel_h"]
kernel_w = layer_info["kernel_w"]
pad_h = (kernel_h - 1) // 2
pad_w = (kernel_w - 1) // 2
stride = layer_info["stride_h"]
self.inchannel = layer_info["input_c"]
self.isDepthwise = isDepthwise
self.kernel_h = kernel_h
self.kernel_w = kernel_w
self.stride = stride
self.pad_h = pad_h
self.pad_w = pad_w
def __eq__(self, other):
if isinstance(other, convOp):
if (
self.isDepthwise == other.isDepthwise
and self.kernel_h == other.kernel_h
and self.kernel_w == other.kernel_w
and self.stride == other.stride
and self.pad_h == other.pad_h
and self.pad_w == other.pad_w
):
return True
else:
return False
return NotImplemented
class includeFile:
def __init__(self, path):
self.path = path
self.defstring = ""
def addDefine(self, defstr):
self.defstring += defstr + ";\n"
def writeFile(self):
import os
outpath = os.path.join(self.path, "genInclude.h")
outf = open(outpath, "w")
outf.write(self.defstring)
outf.close()
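

# Worked example of the deduplication in genOpcode() above (hypothetical layer
# dicts, for illustration only): two depthwise layers that differ only in channel
# count compare equal, so a single specialized kernel is generated for both.
#
#   a = convOp({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3,
#               "stride_h": 2, "input_c": 16})
#   b = convOp({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3,
#               "stride_h": 2, "input_c": 96})
#   assert a == b  # __eq__ ignores the input channel count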

View File

@ -0,0 +1,85 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: PatchBasedUtil.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
def getPatchParams(layers, split_idx, n_patch):
patch_params = {}
feat_stride = 8
patch_params["n_patch"] = n_patch
patch_params["layer_cnt"] = split_idx
resolution = max(layers[0].get_layer_info()["input_h"], layers[0].get_layer_info()["input_w"])
layer_cnt = layers[patch_params["layer_cnt"]].get_layer_info()
out_shape = max(layer_cnt["input_h"], layer_cnt["input_w"])
feat_stride = resolution // out_shape
grain_size = out_shape // n_patch
patch_params["single_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], 1)
patch_params["output_c"] = layer_cnt["input_c"]
patch_params["output_h"] = layer_cnt["output_h"]
patch_params["output_w"] = layer_cnt["output_w"]
patch_params["grain_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], grain_size)
patch_params["grain_rf_height"] = compute_receptive_field(
layers, patch_params["layer_cnt"], layer_cnt["input_h"] // n_patch
)
print("receptive field: single {} all {}".format(patch_params["single_rf"], patch_params["grain_rf"]))
# now generate the padding for each layer (two side)
patch_params["pad_l"] = patch_params["single_rf"] // 2
patch_params["pad_r"] = max(
0,
patch_params["grain_rf"]
+ feat_stride * grain_size * (n_patch - 1)
- patch_params["single_rf"] // 2
- resolution,
)
return patch_params
def get_recompute_layer(model, split_idx):
layer_cnt = 1 # first conv
for i in range(split_idx):
block = model["blocks"][i]
if "pointwise1" in block and block["pointwise1"] is not None:
layer_cnt += 1
if "depthwise" in block and block["depthwise"] is not None:
layer_cnt += 1
if "pointwise2" in block and block["pointwise2"] is not None:
layer_cnt += 1
return layer_cnt
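
# Note on get_recompute_layer(): it counts the first conv plus every
# pointwise/depthwise conv of the blocks before split_idx, i.e. the layers that
# are executed once per patch in the patch-based stage.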
def compute_receptive_field(layers, layer_cnt, grain=1):
for i in range(layer_cnt):
op = layers[(layer_cnt - 1) - i] # trace in a backward manner
layer_info = op.get_layer_info()
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D": # receptive field will increase
stride = layer_info["stride_h"]
kernel_size = max(layer_info["kernel_h"], layer_info["kernel_w"])
if stride in [1, 2]:
if stride == 1:
grain += kernel_size - 1
else:
grain = (grain - 1) * 2 + kernel_size
else:
pass
return grain
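

if __name__ == "__main__":
    # Minimal sketch (hypothetical layers, for illustration only): trace the
    # receptive field of a single output pixel back through k3/s2 -> k3/s1 -> k3/s2.
    class _DemoLayer:
        def __init__(self, info):
            self._info = info

        def get_layer_info(self):
            return self._info

    demo_layers = [
        _DemoLayer({"op": "CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 2}),
        _DemoLayer({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 1}),
        _DemoLayer({"op": "DEPTHWISE_CONV_2D", "kernel_h": 3, "kernel_w": 3, "stride_h": 2}),
    ]
    # Backward trace: (1-1)*2+3 = 3, then 3+2 = 5, then (5-1)*2+3 = 11
    rf = compute_receptive_field(demo_layers, len(demo_layers), grain=1)
    print("receptive field:", rf)  # -> 11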

View File

@ -0,0 +1,920 @@
# ----------------------------------------------------------------------
# Project: TinyEngine
# Title: TfliteConvertor.py
#
# Reference papers:
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
# Contact authors:
# - Wei-Ming Chen, wmchen@mit.edu
# - Wei-Chen Wang, wweichen@mit.edu
# - Ji Lin, jilin@mit.edu
# - Ligeng Zhu, ligeng@mit.edu
# - Song Han, songhan@mit.edu
#
# Target ISA: ARMv7E-M
# ----------------------------------------------------------------------
import math
import numpy as np
from .constant import SKIP_OPs
from .operators import add, avgpool2d, conv2d, depthwiseConv2d, maxpool2d, upsample
from .tflite import Model
from .tflite.BuiltinOperator import BuiltinOperator
from .tflite.BuiltinOptions import BuiltinOptions
from .tflite.Conv2DOptions import Conv2DOptions
from .tflite.DepthwiseConv2DOptions import DepthwiseConv2DOptions
from .tflite.Padding import Padding
from .tflite.Pool2DOptions import Pool2DOptions
from .tflite.TensorType import TensorType
# Parse tflite model into TinyEngine IR format
class TfliteConvertor(object):
def __init__(self, filepath):
# path to the tflite file
self.filepath = filepath
self.model = self.loadTFmodel(filepath)
self.subgraph = self.model.Subgraphs(0)
self.builtin_op_code = self._build_str_map(BuiltinOperator())
self.layer = []
self.tmpPADIndice = None
self.skip_transpose = None
self.average_1D_to_2D_holder = MEAN2D()
# public functions
    def loadTFmodel(self, filepath):
        with open(filepath, "rb") as f:
            buf = f.read()
        return Model.Model.GetRootAsModel(buf, 0)
def dumpModelInfo(self):
version = self.model.Version()
print("Model version:", version)
description = self.model.Description().decode("utf-8")
print("Description:", description)
subgraph_len = self.model.SubgraphsLength()
print("Subgraph length:", subgraph_len)
self.dumpLayerInfo()
def dumpLayerInfo(self):
print("Layer length:", len(self.layer))
# print brief info about each layer
for i, layer in enumerate(self.layer):
if self.layer[i]["op"] == "ADD":
print(
"op:",
layer["op"],
",input_idx:",
layer["input_idx"],
",input2_idx:",
layer["input2_idx"],
"output_idx:",
layer["output_idx"],
)
else:
print(
"op:",
layer["op"],
",input_idx:",
layer["input_idx"],
"output_idx:",
layer["output_idx"],
)
def parseOperatorInfo(self):
operators_len = self.subgraph.OperatorsLength()
for i in range(operators_len):
op = self.subgraph.Operators(i)
# parse the op
self._handleOperator(op)
# private functions
def _build_str_map(self, obj):
ret = {}
for field_name in dir(obj):
if not field_name.startswith("_"):
field_value = getattr(obj, field_name)
if isinstance(field_value, int):
ret[field_value] = field_name
return ret
def _getOpCodeStr(self, op):
op_code_list_idx = op.OpcodeIndex()
op_code_id = self.model.OperatorCodes(op_code_list_idx).DeprecatedBuiltinCode()
return self.builtin_op_code[op_code_id]
def _getTensorTypeStr(self, type):
if TensorType.INT8 == type:
return "int8"
if TensorType.UINT8 == type:
return "uint8"
if TensorType.FLOAT32 == type:
return "float32"
def _getMultiplierShift(self, effective_scale):
significand = np.zeros(len(effective_scale), dtype="int32")
shift = np.zeros(len(effective_scale), dtype="int32")
for i, s in enumerate(effective_scale):
if s == 0:
significand[i] = 0
shift[i] = 0
else:
sig, shi = math.frexp(s)
sig = int(round(sig * 2**31))
if sig == 2**31:
sig /= 2
shi += 1
if shi < -31:
shi = 0
sig = 0
significand[i] = sig
shift[i] = shi
return significand, shift
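    # Worked example for _getMultiplierShift() above: an effective_scale of
    # 0.0008 gives frexp(0.0008) = (0.8192, -10), so significand =
    # round(0.8192 * 2**31) = 1759218604 and shift = -10; the requantization
    # then rescales the accumulator by significand * 2**(shift - 31) ~= 0.0008.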
def _getSigShift(self, s):
sig, shi = math.frexp(s)
sig = int(round(sig * 2**31))
if sig == 2**31:
sig /= 2
shi += 1
if shi < -31:
shi = 0
sig = 0
return sig, shi
def _getADDMultiplierShift(self, input_scale, input2_scale, output_scale):
left_shift = 20
twice_max_input_scale = 2 * np.double(max(input_scale, input2_scale))
real_input1_multiplier = np.double(input_scale / twice_max_input_scale)
real_input2_multiplier = np.double(input2_scale / twice_max_input_scale)
real_output_multiplier = np.double(twice_max_input_scale / ((1 << left_shift) * output_scale))
input_multiplier, input_shift = self._getSigShift(real_input1_multiplier)
input2_multiplier, input2_shift = self._getSigShift(real_input2_multiplier)
output_multiplier, output_shift = self._getSigShift(real_output_multiplier)
return (
left_shift,
input_multiplier,
input_shift,
input2_multiplier,
input2_shift,
output_multiplier,
output_shift,
)
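    # Note on _getADDMultiplierShift() above: the fixed left_shift of 20 follows
    # the TFLite integer ADD reference, which scales both inputs up by 2**20
    # before applying the per-input multipliers so precision is preserved when
    # the two input scales differ; the output multiplier divides that factor
    # back out via the (1 << left_shift) term.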
def _preprocessSoftmaxScaling(self, beta, input_scale, input_integer_bits):
input_beta_real_multiplier = min(beta * input_scale * (1 << (31 - input_integer_bits)), (1 << 31) - 1.0)
multiplier, shift = self._getSigShift(input_beta_real_multiplier)
return multiplier, shift
# follow TFlite implementation
def _calculateInputRadius(self, input_integer_bits, input_left_shift, total_signed_bits=31):
max_input_rescaled = (
1.0
* ((1 << input_integer_bits) - 1)
* (1 << (total_signed_bits - input_integer_bits))
/ (1 << input_left_shift)
)
return math.floor(max_input_rescaled)
    # TFLite conversion functions
def _convert_convolution(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count >= 2, "input tensors length should be >= 2"
input_tensor = input_tensors[0]
weight_tensor = input_tensors[1]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# conv_2d options
if op_code_str == "CONV_2D":
assert op.BuiltinOptionsType() == BuiltinOptions.Conv2DOptions
op_options = op.BuiltinOptions()
conv_options = Conv2DOptions()
conv_options.Init(op_options.Bytes, op_options.Pos)
if op_code_str == "DEPTHWISE_CONV_2D":
assert op.BuiltinOptionsType() == BuiltinOptions.DepthwiseConv2DOptions
op_options = op.BuiltinOptions()
conv_options = DepthwiseConv2DOptions()
conv_options.Init(op_options.Bytes, op_options.Pos)
# conv parameters
stride_h = conv_options.StrideH()
stride_w = conv_options.StrideW()
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
if op_code_str == "CONV_2D":
output_c, kernel_h, kernel_w, _ = weight_tensor.tensor.ShapeAsNumpy()
elif op_code_str == "DEPTHWISE_CONV_2D":
_, kernel_h, kernel_w, output_c = weight_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
assert output_c_dual == output_c, "output channels not match"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
weight_type = self._getTensorTypeStr(weight_tensor.tensor.Type())
assert input_type == output_type == weight_type, "tensor type not consistent"
# tensor value: weight, scalers
weight_value = self._get_np_from_wrapper(weight_tensor)
if input_tensor_count == 3:
bias_tensor = input_tensors[2]
# bias = self._get_np_from_wrapper(bias_tensor).astype('int') # forcely casting for testing latency
bias = self._get_np_from_wrapper(bias_tensor)
else:
bias = None
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
weight_scale = weight_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)
# quantized inference, used for requantize
multiplier, shift = self._getMultiplierShift(effective_scale)
        # find the previous layer, redirect the input index, and fuse PAD into this conv
if self.tmpPADIndice is not None:
if self.tmpPADIndice.output_idx == input_tensor.tensor_idx:
input_idx = self.tmpPADIndice.input_idx
input_h = input_h - math.floor(kernel_h / 2) * 2
                input_w = input_w - math.floor(kernel_w / 2) * 2
else:
input_idx = input_tensor.tensor_idx
else:
input_idx = input_tensor.tensor_idx
# clean the buffer
self.tmpPADIndice = None
params = {
# operator
"op": op_code_str,
# conv
"kernel_h": kernel_h,
"kernel_w": kernel_w,
"padding": math.floor(kernel_h / 2),
"stride_h": stride_h,
"stride_w": stride_w,
# tensor
"input_idx": input_idx,
"output_idx": output_tensor.tensor_idx,
"input_dim": 3,
"output_dim": 3,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"weight_value": weight_value,
"bias": bias,
"effective_scale": effective_scale,
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"weight_scale": weight_scale,
"output_scale": output_scale,
            # quantized inference
"multiplier": multiplier,
"shift": shift,
}
if op_code_str == "CONV_2D":
op = conv2d.Conv2d(params)
elif op_code_str == "DEPTHWISE_CONV_2D":
op = depthwiseConv2d.DepthwiseConv2d(params)
return op
def _convert_ADD(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 2, "input should be 2 tensors"
input_tensor = input_tensors[0]
input2_tensor = input_tensors[1]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, input2_h, input2_w, input2_c = input2_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
        assert input_h == input2_h == output_h, "tensor shape not consistent"
        assert input_w == input2_w == output_w, "tensor shape not consistent"
        assert input_c == input2_c == output_c, "tensor shape not consistent"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
input_type2 = self._getTensorTypeStr(input2_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == input_type2 == output_type, "tensor type not consistent"
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
input2_zero_point = input2_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
input2_scale = input2_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
# get multipliers and shifts
(
left_shift,
input_multiplier,
input_shift,
input2_multiplier,
input2_shift,
output_multiplier,
output_shift,
) = self._getADDMultiplierShift(input_scale, input2_scale, output_scale)
# assign params
params = {
# operator
"op": op_code_str,
# tensor
"input_idx": input_tensor.tensor_idx,
"input2_idx": input2_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input2_h": input_h,
"input2_w": input_w,
"input2_c": input_c,
"input_dim": 3,
"input2_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"input2_zero_point": input2_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"input2_scale": input2_scale,
"output_scale": output_scale,
            # quantized inference
"left_shift": left_shift,
"input_multiplier": input_multiplier,
"input2_multiplier": input2_multiplier,
"input_shift": input_shift,
"input2_shift": input2_shift,
"output_multiplier": output_multiplier,
"output_shift": output_shift,
}
op = add.Add(params)
return op
def _convert_AVERAGE_POOL_2D(self, op):
# operator
op_code_str = self._getOpCodeStr(op)
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == output_type, "tensor type not consistent"
# pool parameters
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
op_options = op.BuiltinOptions()
pool2d_options = Pool2DOptions()
pool2d_options.Init(op_options.Bytes, op_options.Pos)
stride_h = pool2d_options.StrideH()
stride_w = pool2d_options.StrideW()
padding = pool2d_options.Padding()
filter_h = pool2d_options.FilterHeight()
filter_w = pool2d_options.FilterWidth()
# padding
if padding == Padding.VALID:
pad_h = 0
pad_w = 0
elif padding == Padding.SAME:
pass # no support for now
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
params = {
# operator
"op": op_code_str,
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": stride_h,
"stride_w": stride_w,
"pad_h": pad_h,
"pad_w": pad_w,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": input_tensor.tensor.ShapeAsNumpy().size,
"output_dim": output_tensor.tensor.ShapeAsNumpy().size,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtypte": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
}
op = avgpool2d.AvgPool2d(params)
return op
def _convert_upsample(self, op):
        # Defaults in case the op carries no quantization params
input_type = None
input_zero_point = None
output_zero_point = None
input_scale = None
output_scale = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
params = {
# operator
"op": "UPSAMPLE",
# upsample parameters
"factor": output_w / input_w,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtype": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
}
op = upsample.upSample(params)
return op
def _convert_PAD(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# fuse pad into conv
self.tmpPADIndice = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
def _convert_TRANSPOSE(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# fuse pad into conv
self.skip_transpose = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
def _convert_maxpool(self, op):
        # Defaults in case the op carries no quantization params
input_type = None
input_zero_point = None
output_zero_point = None
input_scale = None
output_scale = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
# pool parameters
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
op_options = op.BuiltinOptions()
pool2d_options = Pool2DOptions()
pool2d_options.Init(op_options.Bytes, op_options.Pos)
stride_h = pool2d_options.StrideH()
stride_w = pool2d_options.StrideW()
# padding = pool2d_options.Padding()
filter_h = pool2d_options.FilterHeight()
filter_w = pool2d_options.FilterWidth()
# fused_activation_fn = pool2d_options.FusedActivationFunction()
pool_params = {
# operator
"op": "MAX_POOL_2D",
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": stride_h,
"stride_w": stride_w,
"pad_h": 0,
"pad_w": 0,
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": output_h,
"output_w": output_w,
"output_c": output_c,
"dtype": input_type,
# trainable parameters
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
}
op = maxpool2d.maxPool2d(pool_params)
return op
def _convert_mean1D(self, op, MEAN2Dholder):
        # Defaults in case the op carries no quantization params
input_type = None
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 1, "input tensors length should be 1"
input_tensor = input_tensors[0]
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
input_shape = input_tensor.tensor.ShapeAsNumpy()
output_shape = output_tensor.tensor.ShapeAsNumpy()
input_h, input_w, input_c = get_hwc_from_chwshape(input_shape)
output_h, output_w, output_c = get_hwc_from_chwshape(output_shape)
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
if not MEAN2Dholder.has_first_1D:
MEAN2Dholder.add_first_1D_op(input_tensor.tensor_idx, output_tensor.tensor_idx, input_h, input_w, input_c)
return None
elif not MEAN2Dholder.has_second_1D:
MEAN2Dholder.add_second_1D_op(
input_tensor.tensor_idx, output_tensor.tensor_idx, output_h, output_w, output_c
)
filter_h = input_h - output_h + 1
filter_w = input_w - output_w + 1
params = {
# operator
"op": "AVERAGE_POOL_2D",
# pool parameters
"filter_h": filter_h,
"filter_w": filter_w,
"stride_h": 1,
"stride_w": 1,
"pad_h": 0,
"pad_w": 0,
# tensor
"input_idx": MEAN2Dholder.first_1D_input_idx,
"output_idx": MEAN2Dholder.second_1D_output_idx,
"input_h": MEAN2Dholder.input_h,
"input_w": MEAN2Dholder.input_w,
"input_c": MEAN2Dholder.input_c,
"input_dim": 3,
"output_dim": 3,
"output_h": MEAN2Dholder.output_h,
"output_w": MEAN2Dholder.output_w,
"output_c": MEAN2Dholder.output_c,
"dtypte": input_type,
}
op = avgpool2d.AvgPool2d(params)
return op
else:
raise NotImplementedError
def _convert_FULLY_CONNECTED(self, op):
# get input, weight, and output tensors
input_tensors = self._get_input_tensors(op)
input_tensor_count = len(input_tensors)
assert input_tensor_count == 3, "input tensors length should be 3"
input_tensor = input_tensors[0]
weight_tensor = input_tensors[1]
bias_tensor = input_tensors[2]
weight = self._get_np_from_wrapper(weight_tensor)
bias = self._get_np_from_wrapper(bias_tensor)
output_tensors = self._get_output_tensors(op)
assert len(output_tensors) == 1, "output tensors length should be 1"
output_tensor = output_tensors[0]
# shapes
if input_tensor.tensor.ShapeAsNumpy().shape[0] == 2:
input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
input_h = 1
elif input_tensor.tensor.ShapeAsNumpy().shape[0] == 4:
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
output_c, input_c_dual = weight_tensor.tensor.ShapeAsNumpy()
output_h, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
assert input_c_dual == input_c, "channels not match"
assert output_c_dual == output_c, "channels not match"
# tensor types
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
assert input_type == output_type, "tensor type not consistent"
# quantized setting
input_zero_point = input_tensor.qnn_params["zero_point"]
output_zero_point = output_tensor.qnn_params["zero_point"]
input_scale = input_tensor.qnn_params["scale"]
weight_scale = weight_tensor.qnn_params["scale"]
bias_scale = bias_tensor.qnn_params["scale"]
output_scale = output_tensor.qnn_params["scale"]
        # We support per-channel quantization in the CONV_2D operator, so broadcast per-tensor scales to arrays
if isinstance(bias_scale, float) and isinstance(weight_scale, float):
np_ones = np.ones(output_c)
bias_scale = np_ones * bias_scale
np_ones = np.ones(output_c)
output_scale = np_ones * output_scale
effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)
# follows tensorflow lite micro
multiplier, shift = self._getMultiplierShift(effective_scale)
params = {
# operator
"op": "CONV_2D",
# tensor
"input_idx": input_tensor.tensor_idx,
"output_idx": output_tensor.tensor_idx,
"input_h": input_h,
"input_w": input_w,
"input_c": input_c,
"input_dim": 3,
"output_dim": 2,
"output_h": output_h,
"output_w": 1,
"output_c": output_c,
"dtypte": input_type,
"kernel_h": 1,
"kernel_w": 1,
# trainable parameters
"weight_value": weight,
"bias": bias,
"effective_scale": effective_scale,
"input_zero_point": input_zero_point,
"output_zero_point": output_zero_point,
"input_scale": input_scale,
"output_scale": output_scale,
            # quantized inference
"multiplier": multiplier,
"shift": shift,
}
op = conv2d.Conv2d(params)
return op
# handle one op and parse it into layers[] for supported operators
def _handleOperator(self, op):
op_code_str = self._getOpCodeStr(op)
if op_code_str == "CONV_2D":
self.layer.append(self._convert_convolution(op))
elif op_code_str == "ADD":
self.layer.append(self._convert_ADD(op))
elif op_code_str == "AVERAGE_POOL_2D":
self.layer.append(self._convert_AVERAGE_POOL_2D(op))
elif op_code_str == "DEPTHWISE_CONV_2D":
self.layer.append(self._convert_convolution(op))
elif op_code_str == "PAD":
self._convert_PAD(op)
elif op_code_str == "RESIZE_NEAREST_NEIGHBOR":
self.layer.append(self._convert_upsample(op))
elif op_code_str == "MAX_POOL_2D":
self.layer.append(self._convert_maxpool(op))
elif op_code_str in "MEAN":
ret_op = self._convert_mean1D(op, self.average_1D_to_2D_holder)
if ret_op is not None:
                # TODO: this only handles a specific graph: TRANSPOSE -> MEAN -> MEAN
if self.skip_transpose is not None:
ret_op.params["input_idx"] = self.skip_transpose.input_idx
ret_op.input_tensors[0].graph_idx = self.skip_transpose.input_idx
self.layer.append(ret_op)
elif op_code_str == "TRANSPOSE":
self._convert_TRANSPOSE(op)
elif op_code_str in "FULLY_CONNECTED":
self.layer.append(self._convert_FULLY_CONNECTED(op))
elif op_code_str in SKIP_OPs:
pass
else:
raise NotImplementedError(f"Unsupported {op_code_str}")
def _get_np_from_wrapper(self, wrapper):
if wrapper.tensor.Type() == TensorType.INT8:
dtype = np.int8
elif wrapper.tensor.Type() == TensorType.INT32:
dtype = np.int32
else:
raise NotImplementedError("Current implementation only supports int8 and int32")
data = wrapper.buffer.DataAsNumpy()
shape = wrapper.tensor.ShapeAsNumpy() if wrapper.tensor.ShapeLength() != 0 else []
return np.frombuffer(data, dtype=dtype).reshape(shape)
def _get_tensor_type_str(self, tensor_type):
if tensor_type == TensorType.INT8:
return "int8"
raise NotImplementedError(f"Tensor type: {tensor_type} is not supported yet.")
def _get_input_tensors(self, op):
return self._get_wrapper_tensors(op.InputsAsNumpy())
def _get_output_tensors(self, op):
return self._get_wrapper_tensors(op.OutputsAsNumpy())
def _get_wrapper_tensors(self, tensor_index_list):
ret = []
for idx in tensor_index_list:
tensor = self.subgraph.Tensors(idx)
buffer_idx = tensor.Buffer()
buffer = self.model.Buffers(buffer_idx)
tflite_qparams = tensor.Quantization()
if tflite_qparams is None:
continue
assert tflite_qparams, "Quantization parameters not found in the model!"
scale = tflite_qparams.ScaleAsNumpy()
zero_point = tflite_qparams.ZeroPointAsNumpy()
qparams_to_tensor_wrapper = None
if isinstance(zero_point, np.ndarray):
# Per-channel quantization
if scale.size != 1 and zero_point.size != 1:
qparams_to_tensor_wrapper = {"scale": scale, "zero_point": zero_point}
# Per-tensor quantization
elif scale.size == 1 and zero_point.size == 1:
qparams_to_tensor_wrapper = {"scale": float(scale[0]), "zero_point": int(zero_point[0])}
else:
raise NotImplementedError
elif scale == zero_point == 0:
pass
ret.append(TFLiteTensorWrpper(idx, tensor, buffer, qparams_to_tensor_wrapper))
return ret
class PAD_tensorIndice(object):
def __init__(self, input_idx, output_idx):
self.input_idx = input_idx
self.output_idx = output_idx
class MEAN2D(object):
def __init__(self):
self.has_first_1D = False
self.has_second_1D = False
def add_first_1D_op(self, input_idx, output_idx, input_h, input_w, input_c):
self.first_1D_input_idx = input_idx
self.first_1D_output_idx = output_idx
self.input_h = input_h
self.input_w = input_w
self.input_c = input_c
self.has_first_1D = True
def add_second_1D_op(self, input_idx, output_idx, output_h, output_w, output_c):
self.second_1D_input_idx = input_idx
self.second_1D_output_idx = output_idx
self.output_h = output_h
self.output_w = output_w
self.output_c = output_c
self.has_second_1D = True
class TFLiteTensorWrpper:
def __init__(self, tensor_idx, tensor, buffer, qnn_params):
self.tensor_idx = tensor_idx
self.tensor = tensor
self.buffer = buffer
self.qnn_params = qnn_params
def get_hwc_from_chwshape(shape):
h = 1
w = 1
c = 1
if len(shape) == 4:
c = shape[1]
h = shape[2]
w = shape[3]
elif len(shape) == 3:
c = shape[1]
h = shape[2]
elif len(shape) == 2:
c = shape[1]
return h, w, c
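

if __name__ == "__main__":
    # Minimal usage sketch (hypothetical .tflite path, for illustration only):
    # parse a quantized TFLite model into the TinyEngine IR layer list.
    convertor = TfliteConvertor("model.tflite")  # assumed path to an int8 model
    convertor.parseOperatorInfo()
    print("Parsed layers:", len(convertor.layer))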

View File

View File

@ -0,0 +1 @@
__all__ = ["base_allocator", "firstFit"]
