initial commit

.clang-format (Normal file)
@@ -0,0 +1,5 @@
BasedOnStyle: Google
ColumnLimit: 120
ContinuationIndentWidth: 4
IndentWidth: 4
TabWidth: 4
.gitignore (vendored, Normal file)
@@ -0,0 +1,4 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
.gitmodules (vendored, Normal file)
@@ -0,0 +1,6 @@
[submodule "mcunet"]
	path = mcunet
	url = https://github.com/mit-han-lab/mcunet.git
[submodule "TinyEngine/third_party/CMSIS"]
	path = TinyEngine/third_party/CMSIS
	url = https://github.com/ARM-software/CMSIS_5.git
.pre-commit-config.yaml (Normal file)
@@ -0,0 +1,51 @@
exclude: "code_generator/tflite/.*"
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: trailing-whitespace
      - id: mixed-line-ending
        args: ["--fix=lf"]
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: requirements-txt-fixer
      - id: fix-encoding-pragma
        args: ["--remove"]
      - id: debug-statements
      - id: check-toml
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.10
    hooks:
      - id: mdformat
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.10.1
    hooks:
      - id: isort
        args: ["--sp", "pyproject.toml"]
  - repo: https://github.com/pycqa/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
        additional_dependencies:
          - flake8-comprehensions==3.7.0
          - flake8-docstrings==1.6.0
  - repo: local
    hooks:
      - id: pylint
        name: pylint
        entry: pylint
        language: system
        types: [python]
        require_serial: true
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.910-1
    hooks:
      - id: mypy
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v13.0.0
    hooks:
      - id: clang-format
LICENSE (Normal file)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 MIT HAN Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md (Normal file)
@@ -0,0 +1,224 @@
# TinyEngine

This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for microcontrollers.
TinyEngine is a part of MCUNet, which also consists of TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. TinyEngine and TinyNAS are co-designed to fit the tight memory budgets.

**The MCUNet and TinyNAS repo is [here](https://github.com/mit-han-lab/mcunet).**

### [MCUNetV1](https://mcunet.mit.edu/#mcunetv1) | [MCUNetV2](https://mcunet.mit.edu/#mcunetv2) | [MCUNetV3](https://mcunet.mit.edu/#mcunetv3)

### [Demo (Inference)](https://www.youtube.com/watch?v=YvioBgtec4U)

![demo](assets/figures/mcunet_demo.gif)

### [Demo (Training)](https://www.youtube.com/watch?v=XaDCO8YtmBw)

![demo_v3](assets/figures/mcunet_demo_v3.gif)

## News

We will soon release the **Tiny Training Engine** used in [MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3). **If you are interested in getting updates, please sign up [here](https://forms.gle/UW1uUmnfk1k6UJPPA) to get notified!**

- **(2022/08)** Our **new course on TinyML and Efficient Deep Learning** will be released soon in September 2022: [efficientml.ai](https://efficientml.ai/).
- **(2022/08)** We include the [demo tutorial](tutorial) for deploying a visual wake word (VWW) model onto microcontrollers.
- **(2022/08)** We open-source the TinyEngine repo.
- **(2022/07)** We include the person detection model used in the video demo above in the [MCUNet repo](https://github.com/mit-han-lab/mcunet).
- **(2022/06)** We refactor the [MCUNet repo](https://github.com/mit-han-lab/mcunet) as a standalone repo (previous repo: https://github.com/mit-han-lab/tinyml).
- **(2021/10)** **MCUNetV2** is accepted to NeurIPS 2021: https://arxiv.org/abs/2110.15352
- **(2020/10)** **MCUNet** is accepted to NeurIPS 2020 as a **spotlight**: https://arxiv.org/abs/2007.10319
- Our projects are covered by: [MIT News](https://news.mit.edu/2020/iot-deep-learning-1113), [MIT News (v2)](https://news.mit.edu/2021/tiny-machine-learning-design-alleviates-bottleneck-memory-usage-iot-devices-1208), [WIRED](https://www.wired.com/story/ai-algorithms-slimming-fit-fridge/), [Morning Brew](https://www.morningbrew.com/emerging-tech/stories/2020/12/07/researchers-figured-fit-ai-ever-onto-internet-things-microchips), [Stacey on IoT](https://staceyoniot.com/researchers-take-a-3-pronged-approach-to-edge-ai/), [Analytics Insight](https://www.analyticsinsight.net/amalgamating-ml-and-iot-in-smart-home-devices/), [Techable](https://techable.jp/archives/142462), etc.

## Overview

Microcontrollers are low-cost, low-power hardware. They are widely deployed across applications, but their tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.

MCUNet is a **system-algorithm co-design** framework for tiny deep learning on microcontrollers. It consists of **TinyNAS** and **TinyEngine**, which are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve deep learning performance on the same tiny memory budget.

![overview](assets/figures/overview.png)

Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing memory usage and accelerating inference. It outperforms existing inference libraries such as [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers) from Google, [CMSIS-NN](https://arxiv.org/abs/1801.06601) from Arm, and [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) from STMicroelectronics.

TinyEngine adopts the following optimization techniques to accelerate inference and minimize the memory footprint.

* [**In-place depth-wise convolution**](https://mcunet.mit.edu/#mcunetv1): A unique data placement technique for depth-wise convolution that overwrites input data with intermediate/output data to reduce peak SRAM memory.
* [**Operator fusion**](https://docs.microsoft.com/en-us/windows/ai/directml/dml-fused-activations): A method that improves performance by merging one operator into a different operator so that they are executed together without a round trip to memory.
* [**SIMD (single instruction, multiple data) programming**](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data): A computing method that performs the same operation on multiple data points simultaneously.
* [**HWC to CHW weight format transformation**](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html): A weight format transformation technique that increases the cache hit ratio for in-place depth-wise convolution.
* [**Image to Column (Im2col) convolution**](https://iq.opengenus.org/im2col/): An implementation technique that computes convolution using general matrix multiplication (GEMM) operations; see the sketch after this list.
* [**Loop reordering**](https://xilinx.github.io/Vitis_Accel_Examples/2019.2/html/loop_reorder.html): A loop transformation technique that optimizes a program's execution speed by reordering/interchanging the sequence of loops.
* [**Loop unrolling**](https://en.wikipedia.org/wiki/Loop_unrolling): A loop transformation technique that optimizes a program's execution speed at the expense of its binary size, an approach known as the space-time tradeoff.
* [**Loop tiling**](https://en.wikipedia.org/wiki/Loop_nest_optimization): A loop transformation technique that reduces memory access latency by partitioning a loop's iteration space into smaller blocks, helping ensure that data used in a loop stays in the cache until it is reused.
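
To make the Im2col + GEMM idea concrete, here is a minimal C sketch for a single-channel input and one KxK filter with stride 1. It illustrates the general technique only, not TinyEngine's optimized int8 kernels; the function names and the use of `float` here are hypothetical.

```c
/* Im2col: copy each KxK input patch (with zero padding) into one column of
 * `cols`, so that convolution becomes a plain matrix multiplication (GEMM). */
static void im2col(const float *in, int h, int w, int k, int pad, float *cols) {
    int out_h = h + 2 * pad - k + 1, out_w = w + 2 * pad - k + 1;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox)
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx) {
                    int iy = oy + ky - pad, ix = ox + kx - pad; /* input coords */
                    int col = oy * out_w + ox;     /* one output pixel per column */
                    cols[(ky * k + kx) * (out_h * out_w) + col] =
                        (iy >= 0 && iy < h && ix >= 0 && ix < w) ? in[iy * w + ix] : 0.0f;
                }
}

/* GEMM over the unrolled patches: out[col] = sum_i kernel[i] * cols[i][col]. */
static void conv_gemm(const float *kernel, const float *cols, int k2, int n, float *out) {
    for (int col = 0; col < n; ++col) {
        float acc = 0.0f;
        for (int i = 0; i < k2; ++i)
            acc += kernel[i] * cols[i * n + col];
        out[col] = acc;
    }
}
```

With multiple channels, the kernel vector becomes an (out_ch x K*K*in_ch) matrix and the same GEMM covers all output channels at once. The kernels shipped in this commit do the analogous thing on int8/int16 fixed-point data and perform the im2col only partially, a few columns at a time, to bound the buffer size (see `arm_nnfunctions_modified.h` below).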

![inplace_depthwise](assets/figures/inplace_depthwise.png)

By adopting the optimization techniques above, TinyEngine not only enhances inference speed but also reduces peak memory, as shown in the figures below.

**MAC/s improvement breakdown:**
![mac_result](assets/figures/mac_result.png)

**Peak memory reduction:**
![peakmem_result](assets/figures/peakmem_result.png)

To sum up, our **TinyEngine** inference engine could be a useful infrastructure for MCU-based AI applications. It significantly **improves the inference speed and reduces the memory usage** compared to existing libraries like [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers), [CMSIS-NN](https://arxiv.org/abs/1801.06601), [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html), etc. It improves the inference speed by **1.1-18.6x** and reduces the peak memory by **1.3-3.6x**.

![measured_result](assets/figures/measured_result.png)

## Code Structure

`code_generator` contains a Python library that compiles neural networks into low-level source code (C/C++).

`TinyEngine` contains a C/C++ library that implements operators and performs inference on microcontrollers.

`examples` contains examples of transforming TFLite models into our TinyEngine models.

`tutorial` contains the demo tutorial of deploying a visual wake word (VWW) model onto microcontrollers.

`assets` contains miscellaneous assets.

## Requirements

- Python 3.6+
- STM32CubeIDE 1.5+

## Setup for Users

First, clone this repository:

```bash
git clone --recursive https://github.com/mit-han-lab/tinyengine.git
```

(Optional) Using a virtual environment with `conda` is recommended:

```bash
conda create -n tinyengine python=3.6 pip
conda activate tinyengine
```

Install dependencies:

```bash
pip install -r requirements.txt
```

## Setup for Developers

Install pre-commit hooks to automatically format changes in your code:

```bash
pre-commit install
```

## Deployment Example

Please see the [tutorial](tutorial) to learn how to deploy a visual wake word (VWW) model onto microcontrollers using TinyEngine.

## Measured Results

- All the TFLite models are from the [Model Zoo in the MCUNet repo](https://github.com/mit-han-lab/mcunet#model-zoo). Please see the MCUNet repo for how to build the pre-trained int8 quantized models in TF-Lite format.
- All the **latency**, **peak memory (SRAM)**, and **Flash memory usage** results are profiled on STM32F746G-DISCO discovery boards.
- Note that we measure the newer versions of the libraries in this repo, so the results here might differ from the ones in the MCUNet papers.
- Since TF-Lite Micro no longer has version numbers, we use the git commit ID to indicate its newer version.
- All the TFLite models are compiled with the `-Ofast` optimization level in STM32CubeIDE.
- OOM denotes Out Of Memory.

The **latency** results:

| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----- | ------ | ----- | ----- | ----- |
| *# mcunet models (VWW)*      |       |        |       |       |       |
| mcunet-5fps-vww              | 624ms | 2346ms | 269ms | 137ms | 128ms |
| mcunet-10fps-vww             | 345ms | 1230ms | 143ms | 76ms  | 66ms  |
| mcunet-320kB-vww             | OOM   | OOM    | OOM   | 657ms | 570ms |
| *# mcunet models (ImageNet)* |       |        |       |       |       |
| mcunet-5fps                  | OOM   | OOM    | OOM   | 149ms | 135ms |
| mcunet-10fps                 | OOM   | OOM    | OOM   | 84ms  | 62ms  |
| mcunet-256kB                 | OOM   | OOM    | OOM   | 839ms | 681ms |
| mcunet-320kB                 | OOM   | OOM    | OOM   | OOM   | 819ms |
| *# baseline models*          |       |        |       |       |       |
| mbv2-320kB                   | OOM   | OOM    | OOM   | OOM   | 292ms |
| proxyless-320kB              | OOM   | OOM    | OOM   | 484ms | 425ms |

The **peak memory (SRAM)** results:

| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----- | ----- | ----- | ----- | ----- |
| *# mcunet models (VWW)*      |       |       |       |       |       |
| mcunet-5fps-vww              | 227kB | 220kB | 248kB | 123kB | 88kB  |
| mcunet-10fps-vww             | 169kB | 163kB | 199kB | 98kB  | 56kB  |
| mcunet-320kB-vww             | OOM   | OOM   | OOM   | 259kB | 162kB |
| *# mcunet models (ImageNet)* |       |       |       |       |       |
| mcunet-5fps                  | OOM   | OOM   | OOM   | 126kB | 90kB  |
| mcunet-10fps                 | OOM   | OOM   | OOM   | 76kB  | 45kB  |
| mcunet-256kB                 | OOM   | OOM   | OOM   | 311kB | 200kB |
| mcunet-320kB                 | OOM   | OOM   | OOM   | OOM   | 242kB |
| *# baseline models*          |       |       |       |       |       |
| mbv2-320kB                   | OOM   | OOM   | OOM   | OOM   | 284kB |
| proxyless-320kB              | OOM   | OOM   | OOM   | 312kB | 242kB |

The **Flash memory usage** results:

| net_id | TF-Lite Micro<br>v2.1.0 | TF-Lite Micro<br>[@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN<br>v2.0.0 | X-CUBE-AI<br>v7.1.0 | TinyEngine |
| ---------------------------- | ----- | ----- | ----- | ----- | ----- |
| *# mcunet models (VWW)*      |       |       |       |       |       |
| mcunet-5fps-vww              | 782kB | 733kB | 743kB | 534kB | 517kB |
| mcunet-10fps-vww             | 691kB | 643kB | 653kB | 463kB | 447kB |
| mcunet-320kB-vww             | OOM   | OOM   | OOM   | 773kB | 742kB |
| *# mcunet models (ImageNet)* |       |       |       |       |       |
| mcunet-5fps                  | OOM   | OOM   | OOM   | 737kB | 720kB |
| mcunet-10fps                 | OOM   | OOM   | OOM   | 856kB | 837kB |
| mcunet-256kB                 | OOM   | OOM   | OOM   | 850kB | 827kB |
| mcunet-320kB                 | OOM   | OOM   | OOM   | OOM   | 835kB |
| *# baseline models*          |       |       |       |       |       |
| mbv2-320kB                   | OOM   | OOM   | OOM   | OOM   | 828kB |
| proxyless-320kB              | OOM   | OOM   | OOM   | 866kB | 835kB |

## Citation

If you find the project helpful, please consider citing our papers:

```bibtex
@article{lin2020mcunet,
  title={MCUNet: Tiny deep learning on IoT devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

@inproceedings{lin2021mcunetv2,
  title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
  author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

@inproceedings{lin2022ondevice,
  title={On-Device Training Under 256KB Memory},
  author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
  booktitle={ArXiv},
  year={2022}
}
```

## Related Projects

[MCUNet: Tiny Deep Learning on IoT Devices](https://mcunet.mit.edu/#mcunetv1) (NeurIPS'20)

[MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning](https://mcunet.mit.edu/#mcunetv2) (NeurIPS'21)

[MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3)

TinyEngine/include/arm_nnfunctions_modified.h (Normal file)
@@ -0,0 +1,236 @@
/*
 * Copyright (C) 2010-2022 Arm Limited or its affiliates.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* ----------------------------------------------------------------------
 * This file is MODIFIED from the Arm CMSIS NN Library.
 *
 * Project:     TinyEngine
 * Title:       arm_nnfunctions_modified.h
 * Description: Public header file for TinyEngine.
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Original Project: CMSIS NN Library
 * Original Title:   arm_nnfunctions.h
 *
 * Target Processor: Cortex-M CPUs
 * -------------------------------------------------------------------- */

/**
   \mainpage CMSIS NN Software Library
 *
 * Introduction
 * ------------
 *
 * This user manual describes the CMSIS NN software library,
 * a collection of efficient neural network kernels developed to maximize the
 * performance and minimize the memory footprint of neural networks on Cortex-M processor cores.
 *
 * The library is divided into a number of functions, each covering a specific category:
 * - Convolution Functions
 * - Activation Functions
 * - Fully-connected Layer Functions
 * - SVDF Layer Functions
 * - Pooling Functions
 * - Softmax Functions
 * - Basic math Functions
 *
 * The library has separate functions for operating on different weight and activation data
 * types, including 8-bit integers (q7_t) and 16-bit integers (q15_t). The description of the
 * kernels is included in the function descriptions. The implementation details are also
 * described in paper [1].
 *
 * Function Classification
 * --------
 * The functions can be classified into two segments:
 * - Legacy functions supporting ARM's internal symmetric quantization (8 bits).
 * - Functions that support the TensorFlow Lite framework with symmetric quantization (8 bits).
 *
 * The legacy functions can be identified by their _q7 or _q15 suffix, and no new development is done there.
 * The article in [2] describes in detail how to run a network using the legacy functions.
 *
 * The functions supporting the TensorFlow Lite framework are identified by the _s8 suffix and can be invoked from TFL
 * micro. The functions are bit exact to TensorFlow Lite. Refer to TensorFlow's documentation in [3] on how to run
 * a TensorFlow Lite model using optimized CMSIS-NN kernels.
 *
 * Block Diagram
 * --------
 * \image html CMSIS-NN-OVERVIEW.PNG
 *
 * Examples
 * --------
 *
 * The library ships with a number of examples which demonstrate how to use the library functions.
 *
 * Pre-processor Macros
 * ------------
 *
 * Each library project has different pre-processor macros.
 *
 * - ARM_MATH_DSP:
 *
 * Define the macro ARM_MATH_DSP if the silicon supports DSP instructions (DSP extension).
 *
 * - ARM_MATH_MVEI:
 *
 * Define the macro ARM_MATH_MVEI if the silicon supports the M-Profile Vector Extension.

 * - ARM_MATH_AUTOVECTORIZE
 * Used in conjunction with ARM_MATH_MVEI to let the compiler auto-vectorize the functions that use inline
 * assembly. It does not affect functions that use C or intrinsics.
 * - ARM_MATH_BIG_ENDIAN:
 *
 * Define the macro ARM_MATH_BIG_ENDIAN to build the library for big-endian targets. This is supported only for the legacy
 * functions, i.e., functions targeted at TensorFlow Lite do not support big endianness. By default the library builds for
 * little-endian targets.
 *
 * - ARM_NN_TRUNCATE:
 *
 * Define the macro ARM_NN_TRUNCATE to use floor instead of round-to-the-nearest-int for the computation.
 *
 *
 * Copyright Notice
 * ------------
 *
 * Copyright (C) 2010-2019 Arm Limited. All rights reserved.
 *
 * [1] CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs https://arxiv.org/abs/1801.06601
 *
 * [2] Converting a Neural Network for Arm Cortex-M with CMSIS-NN
 *
 https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page
 * [3] https://www.tensorflow.org/lite/microcontrollers/library
 *
 * [4] https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis
 */

/**
 * @defgroup groupNN Neural Network Functions
 * A collection of functions to perform basic operations for neural network layers. Functions with a _s8 suffix support
 * the TensorFlow Lite framework.
 */

#ifndef _ARM_NNFUNCTIONS_H
#define _ARM_NNFUNCTIONS_H

#include "arm_nn_math_types.h"
#include "arm_nn_types.h"
#include "arm_nnsupportfunctions.h"

#define USE_INTRINSIC

//#define ARM_NN_TRUNCATE /* This configures the rounding mode: floor or round to the nearest int */

#ifdef __cplusplus
extern "C" {
#endif

/**
 * @defgroup NNConv Convolution Functions
 *
 * Collection of convolution, depthwise convolution functions and their variants.
 *
 * The convolution is implemented in 2 steps: im2col and GEMM.
 *
 * im2col is a process of converting each patch of image data into
 * a column. After im2col, the convolution is computed as matrix-matrix
 * multiplication.
 *
 * To reduce the memory footprint, the im2col is performed partially.
 * In each iteration, only a few columns (i.e., patches) are generated and
 * computed with GEMM kernels similar to the CMSIS-DSP arm_mat_mult functions.
 *
 */

arm_status arm_convolve_s8_4col(const q7_t *input,
                                const uint16_t input_x,
                                const uint16_t input_y,
                                const uint16_t input_ch,
                                const uint16_t input_batches,
                                const q7_t *kernel,
                                const uint16_t output_ch,
                                const uint16_t kernel_x,
                                const uint16_t kernel_y,
                                const uint16_t pad_x,
                                const uint16_t pad_y,
                                const uint16_t stride_x,
                                const uint16_t stride_y,
                                const int32_t *bias,
                                q7_t *output,
                                const int32_t *output_shift,
                                const int32_t *output_mult,
                                const int32_t out_offset,
                                const int32_t input_offset,
                                const int32_t out_activation_min,
                                const int32_t out_activation_max,
                                const uint16_t output_x,
                                const uint16_t output_y,
                                q15_t *buffer_a);

q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
                                                    const q15_t *input_b,
                                                    const uint16_t output_ch,
                                                    const int32_t *out_shift,
                                                    const int32_t *out_mult,
                                                    const int32_t out_offset,
                                                    const int16_t activation_min,
                                                    const int16_t activation_max,
                                                    const uint16_t num_col_a,
                                                    const int32_t *const output_bias,
                                                    q7_t *out_0);

q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
                                                   const q15_t *input_b,
                                                   const uint16_t output_ch,
                                                   const int32_t *out_shift,
                                                   const int32_t *out_mult,
                                                   const int32_t out_offset,
                                                   const int16_t activation_min,
                                                   const int16_t activation_max,
                                                   const uint16_t num_col_a,
                                                   const int32_t *const output_bias,
                                                   q7_t *out_0);

q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
                                            const q15_t *input_b,
                                            const uint16_t output_ch,
                                            const int32_t *out_shift,
                                            const int32_t *out_mult,
                                            const int32_t out_offset,
                                            const int16_t activation_min,
                                            const int16_t activation_max,
                                            const uint16_t num_col_a,
                                            const int32_t *const output_bias,
                                            q7_t *out_0,
                                            q15_t *kbuf);

#ifdef __cplusplus
}
#endif

#endif
TinyEngine/include/detectionUtility.h (Normal file)
@@ -0,0 +1,27 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   detectionUtility.h
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef TINYENGINE_INCLUDE_DETECTIONUTILITY_H_
#define TINYENGINE_INCLUDE_DETECTIONUTILITY_H_

int postProcessing(signed char *input, unsigned char *runtime_buffer,
                   int y_zero, float y_scale, int shape_x, int shape_y, int shape_c, int resolution,
                   int width, int height, float conf_thresh, float out_boxes[10][6]);

#endif /* TINYENGINE_INCLUDE_DETECTIONUTILITY_H_ */
TinyEngine/include/fp_requantize_op.h (Normal file)
@@ -0,0 +1,99 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   fp_requantize_op.h
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */
#ifndef TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_
#define TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_

tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_fpreq_bitmask(const q7_t *input,
	const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
	const q7_t *kernel, const int32_t *bias, const float *scales,
	const int32_t out_offset, const int32_t input_offset,
	const int32_t out_activation_min, const int32_t out_activation_max,
	q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
	const uint16_t output_ch, q15_t *runtime_buf);

q7_t* mat_mult_kernel_s8_s16_reordered_fpreq(const q7_t *input_a,
	const q15_t *input_b, const uint16_t output_ch, const float *scales,
	const int32_t out_offset, const int16_t activation_min,
	const int16_t activation_max, const uint16_t num_col_a,
	const int32_t *const output_bias, q7_t *out_0);

q7_t* mat_mult_kernel_s8_s16_reordered_ch8_fpreq(const q7_t *input_a,
	const q15_t *input_b, const uint16_t output_ch, const float *scales,
	const int32_t out_offset, const int16_t activation_min,
	const int16_t activation_max, const uint16_t num_col_a,
	const int32_t *const output_bias, q7_t *out_0);

q7_t* mat_mult_kernel_s8_s16_reordered_ch16_fpreq(const q7_t *input_a,
	const q15_t *input_b, const uint16_t output_ch, const float *scales,
	const int32_t out_offset, const int16_t activation_min,
	const int16_t activation_max, const uint16_t num_col_a,
	const int32_t *const output_bias, q7_t *out_0);

q7_t* mat_mult_kernel_s8_s16_reordered_ch24_fpreq(const q7_t *input_a,
	const q15_t *input_b, const uint16_t output_ch, const float *scales,
	const int32_t out_offset, const int16_t activation_min,
	const int16_t activation_max, const uint16_t num_col_a,
	const int32_t *const output_bias, q7_t *out_0);

q7_t* mat_mult_kernel_s8_s16_reordered_ch48_fpreq(const q7_t *input_a,
	const q15_t *input_b, const uint16_t output_ch, const float *scales,
	const int32_t out_offset, const int16_t activation_min,
	const int16_t activation_max, const uint16_t num_col_a,
	const int32_t *const output_bias, q7_t *out_0);

#endif /* TINYENGINE_INCLUDE_FP_REQUANTIZE_OP_H_ */
TinyEngine/include/genNN.h (Normal file)
@@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   genNN.h
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef INC_GENNN_H_
#define INC_GENNN_H_

#include <stdint.h>

signed char* getInput();
signed char* getOutput();
float* getOutput_fp();
int32_t* getOutput_int32();

void setupBuffer();
void invoke(float* labels);
void getResult(uint8_t *P, uint8_t *NP);
int* getKbuffer();
void end2endinference();

#endif /* INC_GENNN_H_ */
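
These prototypes are the interface that the code generator's emitted model source implements. A minimal sketch of how an application could drive it, based only on the signatures above (the `fill_input_from_camera` helper and the post-processing step are hypothetical):

```c
#include <stdint.h>
#include "genNN.h"

/* Hypothetical helper that writes a quantized int8 frame into the input tensor. */
extern void fill_input_from_camera(signed char *input);

void run_one_inference(void) {
    setupBuffer();                      /* set up the runtime activation buffer */

    signed char *input = getInput();    /* input tensor exposed by the generated code */
    fill_input_from_camera(input);

    end2endinference();                 /* run the compiled network end to end */

    signed char *scores = getOutput();  /* int8 output tensor, e.g., class scores */
    (void)scores;                       /* application-specific post-processing here */
}
```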
|
546
TinyEngine/include/img2col_element.h
Normal file
@ -0,0 +1,546 @@
|
||||
/* ----------------------------------------------------------------------
|
||||
* Project: TinyEngine
|
||||
* Title: img2col_element.h
|
||||
*
|
||||
* Reference papers:
|
||||
* - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
* - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
* - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
* Contact authors:
|
||||
* - Wei-Ming Chen, wmchen@mit.edu
|
||||
* - Wei-Chen Wang, wweichen@mit.edu
|
||||
* - Ji Lin, jilin@mit.edu
|
||||
* - Ligeng Zhu, ligeng@mit.edu
|
||||
* - Song Han, songhan@mit.edu
|
||||
*
|
||||
* Target ISA: ARMv7E-M
|
||||
* -------------------------------------------------------------------- */
|
||||
|
||||
#ifndef ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
|
||||
#define ARMNN_INCLUDE_IMG2COL_ELEMENT_H_
|
||||
|
||||
#include "arm_nnsupportfunctions.h"
|
||||
#include "arm_math_memory.h"
|
||||
|
||||
#define b2_q7_q15_offset_ele(src,dst) \
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */ \
|
||||
/*in_q7x4 = b2_nn_read_q7x4_ia((const q7_t **)&src); \
|
||||
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
|
||||
in_q15x2_2 = __SXTB16(in_q7x4); */ \
|
||||
in_q15x2_1 = ((src[0] & 0x0C) >> 2) + ((src[0] & 0xC0) << 10);\
|
||||
in_q15x2_2 = (src[0] & 0x03) + ((src[0] & 0x30) << 12);\
|
||||
src +=1;\
|
||||
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
|
||||
/* Maximum of 9 bits from the addition is expected */ \
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
|
||||
\
|
||||
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_1); \
|
||||
write_q15x2_ia(&dst, out_q15x2_2);
|
||||
|
||||
#define b4_q7_q15_offset_ele(src,dst) \
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */ \
|
||||
/*in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src); \
|
||||
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
|
||||
in_q15x2_2 = __SXTB16(in_q7x4); */ \
|
||||
in_q15x2_1 = ((src[0] & 0xF0) >> 4) + ((src[1] & 0xF0) << 12);\
|
||||
in_q15x2_2 = (src[0] & 0x0F) + ((src[1] & 0x0F) << 16);\
|
||||
src +=2;\
|
||||
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
|
||||
/* Maximum of 9 bits from the addition is expected */ \
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
|
||||
\
|
||||
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_1); \
|
||||
write_q15x2_ia(&dst, out_q15x2_2);
|
||||
|
||||
#define q7_q15_offset_ele(src,dst) \
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */ \
|
||||
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src); \
|
||||
/* Extract and sign extend each of the four q7 values to q15 */ \
|
||||
in_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8)); \
|
||||
in_q15x2_2 = __SXTB16(in_q7x4); \
|
||||
\
|
||||
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
|
||||
/* Maximum of 9 bits from the addition is expected */ \
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
|
||||
\
|
||||
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_1); \
|
||||
write_q15x2_ia(&dst, out_q15x2_2);
|
||||
|
||||
#define q8_q15_offset_ele(src,dst) \
|
||||
/* convert from q8 to q15 and then store the results in the destination buffer */ \
|
||||
in_q7x4 = arm_nn_read_q7x4_ia((const q8_t **)&src); \
|
||||
/* Extend each of the four q8 values to q15 */ \
|
||||
in_q15x2_1 = __UXTB16(__ROR(in_q7x4, 8)); \
|
||||
in_q15x2_2 = __UXTB16(in_q7x4); \
|
||||
\
|
||||
out_q15x2_2 = __PKHTB(in_q15x2_1, in_q15x2_2, 16); \
|
||||
/* Maximum of 9 bits from the addition is expected */ \
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2); \
|
||||
\
|
||||
out_q15x2_1 = __PKHBT(in_q15x2_2, in_q15x2_1, 16); \
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2); \
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_1); \
|
||||
write_q15x2_ia(&dst, out_q15x2_2);
|
||||
|
||||
#define b4_q15_offset_reordered_ele(src,dst)\
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */\
|
||||
in_q7x4 = b4_nn_read_q7x4_ia((const q7_t **)&src);\
|
||||
\
|
||||
/* Extract and sign extend each of the four q7 values to q15 */\
|
||||
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
|
||||
out_q15x2_2 = __SXTB16(in_q7x4);\
|
||||
\
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_2);\
|
||||
write_q15x2_ia(&dst, out_q15x2_1);
|
||||
|
||||
#define b2_q15_offset_reordered_ele(src,dst)\
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */\
|
||||
in_q7x4 = b2_nn_read_q7x4_ia(&src);\
|
||||
\
|
||||
/* Extract and sign extend each of the four q7 values to q15 */\
|
||||
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
|
||||
out_q15x2_2 = __SXTB16(in_q7x4);\
|
||||
\
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_2);\
|
||||
write_q15x2_ia(&dst, out_q15x2_1);
|
||||
|
||||
#define q7_q15_offset_reordered_ele(src,dst)\
|
||||
/* convert from q7 to q15 and then store the results in the destination buffer */\
|
||||
in_q7x4 = arm_nn_read_q7x4_ia((const q7_t **)&src);\
|
||||
\
|
||||
/* Extract and sign extend each of the four q7 values to q15 */\
|
||||
out_q15x2_1 = __SXTB16(__ROR(in_q7x4, 8));\
|
||||
out_q15x2_2 = __SXTB16(in_q7x4);\
|
||||
\
|
||||
out_q15x2_1 = __SADD16(out_q15x2_1, offset_q15x2);\
|
||||
out_q15x2_2 = __SADD16(out_q15x2_2, offset_q15x2);\
|
||||
\
|
||||
write_q15x2_ia(&dst, out_q15x2_2);\
|
||||
write_q15x2_ia(&dst, out_q15x2_1);
|
||||
|
||||
#define q31_assign2(src,dst) \
|
||||
*dst++ = *src++; \
|
||||
*dst++ = *src++;
|
||||
|
||||
#define q31_assign4(src,dst) \
|
||||
q31_assign2(src,dst) \
|
||||
q31_assign2(src,dst) \
|
||||
|
||||
#define q31_assign6(src,dst) \
|
||||
q31_assign4(src,dst) \
|
||||
q31_assign2(src,dst) \
|
||||
|
||||
#define q31_assign8(src,dst) \
|
||||
q31_assign4(src,dst) \
|
||||
q31_assign4(src,dst) \
|
||||
|
||||
#define q31_assign10(src,dst) \
|
||||
q31_assign8(src,dst) \
|
||||
q31_assign2(src,dst) \
|
||||
|
||||
#define q31_assign12(src,dst) \
|
||||
q31_assign10(src,dst) \
|
||||
q31_assign2(src,dst) \
|
||||
|
||||
#define q31_pad2(dst,padvalue) \
|
||||
*dst++ = padvalue; \
|
||||
*dst++ = padvalue; \
|
||||
|
||||
#define q31_pad4(dst,padvalue) \
|
||||
q31_pad2(dst,padvalue) \
|
||||
q31_pad2(dst,padvalue) \
|
||||
|
||||
#define q31_pad6(dst,padvalue) \
|
||||
q31_pad4(dst,padvalue) \
|
||||
q31_pad2(dst,padvalue) \
|
||||
|
||||
#define q31_pad10(dst,padvalue) \
|
||||
q31_pad6(dst,padvalue) \
|
||||
q31_pad4(dst,padvalue) \
|
||||
|
||||
#define q31_pad14(dst,padvalue) \
|
||||
q31_pad6(dst,padvalue) \
|
||||
q31_pad6(dst,padvalue) \
|
||||
q31_pad2(dst,padvalue) \
|
||||
|
||||
|
||||
#define assignq31toq15()\
|
||||
dst = (q15_t*)dst_31;\
|
||||
dst2 = (q15_t*)dst2_31;\
|
||||
dst3 = (q15_t*)dst3_31;\
|
||||
dst4 = (q15_t*)dst4_31;\
|
||||
dst5 = (q15_t*)dst5_31;\
|
||||
dst6 = (q15_t*)dst6_31;\
|
||||
dst7 = (q15_t*)dst7_31;\
|
||||
|
||||
#define assignq15toq31()\
|
||||
dst_31 = (q31_t*)dst;\
|
||||
dst2_31 = (q31_t*)dst2;\
|
||||
dst3_31 = (q31_t*)dst3;\
|
||||
dst4_31 = (q31_t*)dst4;\
|
||||
dst5_31 = (q31_t*)dst5;\
|
||||
dst6_31 = (q31_t*)dst6;\
|
||||
dst7_31 = (q31_t*)dst7;\
|
||||
|
||||
/* ---------------------------------- Pad ---------------------------------- */
|
||||
#define basic_pad_1row(col,dst_31,pad_out_q15x2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0)\
|
||||
{ \
|
||||
q31_pad2(dst_31,pad_out_q15x2) \
|
||||
block_cnt--; \
|
||||
}
|
||||
|
||||
#define basic_pad_2row(col,dst_31,dst2_31,pad_out_q15x2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0)\
|
||||
{ \
|
||||
q31_pad2(dst_31,pad_out_q15x2) \
|
||||
q31_pad2(dst2_31,pad_out_q15x2) \
|
||||
block_cnt--; \
|
||||
}
|
||||
|
||||
#define basic_pad_3row(col,dst_31,dst2_31,dst3_31,pad_out_q15x2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0)\
|
||||
{ \
|
||||
q31_pad2(dst_31,pad_out_q15x2) \
|
||||
q31_pad2(dst2_31,pad_out_q15x2) \
|
||||
q31_pad2(dst3_31,pad_out_q15x2) \
|
||||
block_cnt--; \
|
||||
}
|
||||
|
||||
#define basic_pad_4row(col,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0)\
|
||||
{ \
|
||||
q31_pad2(dst_31,pad_out_q15x2) \
|
||||
q31_pad2(dst2_31,pad_out_q15x2) \
|
||||
q31_pad2(dst3_31,pad_out_q15x2) \
|
||||
q31_pad2(dst4_31,pad_out_q15x2) \
|
||||
block_cnt--; \
|
||||
}
|
||||
|
||||
#define basic_pad_5row(col,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0)\
|
||||
{ \
|
||||
q31_pad2(dst_31,pad_out_q15x2) \
|
||||
q31_pad2(dst2_31,pad_out_q15x2) \
|
||||
q31_pad2(dst3_31,pad_out_q15x2) \
|
||||
q31_pad2(dst4_31,pad_out_q15x2) \
|
||||
q31_pad2(dst5_31,pad_out_q15x2) \
|
||||
block_cnt--; \
|
||||
}
|
||||
|
||||
#define pad_1row_1col(dst_31,pad_out_q15x2) basic_pad_1row(1,dst_31,pad_out_q15x2)
|
||||
#define pad_1row_2col(dst_31,pad_out_q15x2) basic_pad_1row(2,dst_31,pad_out_q15x2)
|
||||
#define pad_1row_3col(dst_31,pad_out_q15x2) basic_pad_1row(3,dst_31,pad_out_q15x2)
|
||||
#define pad_2row_1col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(1,dst_31,dst2_31,pad_out_q15x2)
|
||||
#define pad_2row_2col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(2,dst_31,dst2_31,pad_out_q15x2)
|
||||
#define pad_2row_3col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(3,dst_31,dst2_31,pad_out_q15x2)
|
||||
#define pad_2row_4col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(4,dst_31,dst2_31,pad_out_q15x2)
|
||||
#define pad_2row_5col(dst_31,dst2_31,pad_out_q15x2) basic_pad_2row(5,dst_31,dst2_31,pad_out_q15x2)
|
||||
#define pad_3row_1col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(1,dst_31,dst2_31,dst3_31,pad_out_q15x2)
|
||||
#define pad_3row_2col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(2,dst_31,dst2_31,dst3_31,pad_out_q15x2)
|
||||
#define pad_3row_3col(dst_31,dst2_31,dst3_31,pad_out_q15x2) basic_pad_3row(3,dst_31,dst2_31,dst3_31,pad_out_q15x2)
|
||||
#define pad_4row_1col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(1,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
|
||||
#define pad_4row_2col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(2,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
|
||||
#define pad_4row_3col(dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2) basic_pad_4row(3,dst_31,dst2_31,dst3_31,dst4_31,pad_out_q15x2)
|
||||
#define pad_5row_1col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(1,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
|
||||
#define pad_5row_2col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(2,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
|
||||
#define pad_5row_3col(dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2) basic_pad_5row(3,dst_31,dst2_31,dst3_31,dst4_31,dst5_31,pad_out_q15x2)
|
||||
|
||||
/* ---------------------------------- Load ---------------------------------- */
|
||||
#define basic_load_1row(col,src,dst)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
q7_q15_offset_ele(src,dst)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define basic_load_2row(col,src,src2,dst,dst2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
q7_q15_offset_ele(src,dst)\
|
||||
q7_q15_offset_ele(src2,dst2)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define basic_load_3row(col,src,src2,src3,dst,dst2,dst3)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
q7_q15_offset_ele(src,dst)\
|
||||
q7_q15_offset_ele(src2,dst2)\
|
||||
q7_q15_offset_ele(src3,dst3)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define basic_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
q7_q15_offset_ele(src,dst)\
|
||||
q7_q15_offset_ele(src2,dst2)\
|
||||
q7_q15_offset_ele(src3,dst3)\
|
||||
q7_q15_offset_ele(src4,dst4)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define basic_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
q7_q15_offset_ele(src,dst)\
|
||||
q7_q15_offset_ele(src2,dst2)\
|
||||
q7_q15_offset_ele(src3,dst3)\
|
||||
q7_q15_offset_ele(src4,dst4)\
|
||||
q7_q15_offset_ele(src5,dst5)\
|
||||
block_cnt--;\
|
||||
}
|
||||
|
||||
///////////////////////// 4bit //////////////////////////
|
||||
#define b4_load_1row(col,src,dst)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b4_q7_q15_offset_ele(src,dst)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b4_load_2row(col,src,src2,dst,dst2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b4_q7_q15_offset_ele(src,dst)\
|
||||
b4_q7_q15_offset_ele(src2,dst2)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b4_load_3row(col,src,src2,src3,dst,dst2,dst3)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b4_q7_q15_offset_ele(src,dst)\
|
||||
b4_q7_q15_offset_ele(src2,dst2)\
|
||||
b4_q7_q15_offset_ele(src3,dst3)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b4_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b4_q7_q15_offset_ele(src,dst)\
|
||||
b4_q7_q15_offset_ele(src2,dst2)\
|
||||
b4_q7_q15_offset_ele(src3,dst3)\
|
||||
b4_q7_q15_offset_ele(src4,dst4)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b4_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b4_q7_q15_offset_ele(src,dst)\
|
||||
b4_q7_q15_offset_ele(src2,dst2)\
|
||||
b4_q7_q15_offset_ele(src3,dst3)\
|
||||
b4_q7_q15_offset_ele(src4,dst4)\
|
||||
b4_q7_q15_offset_ele(src5,dst5)\
|
||||
block_cnt--;\
|
||||
}
|
||||
///////////////////////// 2bit //////////////////////////
|
||||
#define b2_load_1row(col,src,dst)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b2_q7_q15_offset_ele(src,dst)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b2_load_2row(col,src,src2,dst,dst2)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b2_q7_q15_offset_ele(src,dst)\
|
||||
b2_q7_q15_offset_ele(src2,dst2)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b2_load_3row(col,src,src2,src3,dst,dst2,dst3)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b2_q7_q15_offset_ele(src,dst)\
|
||||
b2_q7_q15_offset_ele(src2,dst2)\
|
||||
b2_q7_q15_offset_ele(src3,dst3)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b2_load_4row(col,src,src2,src3,src4,dst,dst2,dst3,dst4)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b2_q7_q15_offset_ele(src,dst)\
|
||||
b2_q7_q15_offset_ele(src2,dst2)\
|
||||
b2_q7_q15_offset_ele(src3,dst3)\
|
||||
b2_q7_q15_offset_ele(src4,dst4)\
|
||||
block_cnt--;\
|
||||
}
|
||||
#define b2_load_5row(col,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)\
|
||||
block_cnt = channel_div4 * col; \
|
||||
while (block_cnt > 0) \
|
||||
{\
|
||||
b2_q7_q15_offset_ele(src,dst)\
|
||||
b2_q7_q15_offset_ele(src2,dst2)\
|
||||
b2_q7_q15_offset_ele(src3,dst3)\
|
||||
b2_q7_q15_offset_ele(src4,dst4)\
|
||||
b2_q7_q15_offset_ele(src5,dst5)\
|
||||
block_cnt--;\
|
||||
}
|
||||
|
||||
#define b4_load_1row_1col(src,dst) b4_load_1row(1,src,dst)
|
||||
#define b4_load_1row_2col(src,dst) b4_load_1row(2,src,dst)
|
||||
#define b4_load_1row_3col(src,dst) b4_load_1row(3,src,dst)
|
||||
#define b4_load_1row_4col(src,dst) b4_load_1row(4,src,dst)
|
||||
#define b4_load_2row_1col(src,src2,dst,dst2) b4_load_2row(1,src,src2,dst,dst2)
|
||||
#define b4_load_2row_2col(src,src2,dst,dst2) b4_load_2row(2,src,src2,dst,dst2)
|
||||
#define b4_load_2row_3col(src,src2,dst,dst2) b4_load_2row(3,src,src2,dst,dst2)
|
||||
#define b4_load_2row_4col(src,src2,dst,dst2) b4_load_2row(4,src,src2,dst,dst2)
|
||||
#define b4_load_3row_1col(src,src2,src3,dst,dst2,dst3) b4_load_3row(1,src,src2,src3,dst,dst2,dst3)
|
||||
#define b4_load_3row_2col(src,src2,src3,dst,dst2,dst3) b4_load_3row(2,src,src2,src3,dst,dst2,dst3)
|
||||
#define b4_load_3row_3col(src,src2,src3,dst,dst2,dst3) b4_load_3row(3,src,src2,src3,dst,dst2,dst3)
|
||||
#define b4_load_3row_4col(src,src2,src3,dst,dst2,dst3) b4_load_3row(4,src,src2,src3,dst,dst2,dst3)
|
||||
#define b4_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
|
||||
#define b4_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
|
||||
#define b4_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
|
||||
#define b4_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b4_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
|
||||
#define b4_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
|
||||
#define b4_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
|
||||
#define b4_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
|
||||
#define b4_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b4_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
|
||||
|
||||
#define b2_load_1row_1col(src,dst) b2_load_1row(1,src,dst)
|
||||
#define b2_load_1row_2col(src,dst) b2_load_1row(2,src,dst)
|
||||
#define b2_load_1row_3col(src,dst) b2_load_1row(3,src,dst)
|
||||
#define b2_load_1row_4col(src,dst) b2_load_1row(4,src,dst)
|
||||
#define b2_load_2row_1col(src,src2,dst,dst2) b2_load_2row(1,src,src2,dst,dst2)
|
||||
#define b2_load_2row_2col(src,src2,dst,dst2) b2_load_2row(2,src,src2,dst,dst2)
#define b2_load_2row_3col(src,src2,dst,dst2) b2_load_2row(3,src,src2,dst,dst2)
#define b2_load_2row_4col(src,src2,dst,dst2) b2_load_2row(4,src,src2,dst,dst2)
#define b2_load_3row_1col(src,src2,src3,dst,dst2,dst3) b2_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_2col(src,src2,src3,dst,dst2,dst3) b2_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_3col(src,src2,src3,dst,dst2,dst3) b2_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define b2_load_3row_4col(src,src2,src3,dst,dst2,dst3) b2_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define b2_load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) b2_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define b2_load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define b2_load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) b2_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)

#define load_1row_1col(src,dst) basic_load_1row(1,src,dst)
#define load_1row_2col(src,dst) basic_load_1row(2,src,dst)
#define load_1row_3col(src,dst) basic_load_1row(3,src,dst)
#define load_1row_4col(src,dst) basic_load_1row(4,src,dst)
#define load_2row_1col(src,src2,dst,dst2) basic_load_2row(1,src,src2,dst,dst2)
#define load_2row_2col(src,src2,dst,dst2) basic_load_2row(2,src,src2,dst,dst2)
#define load_2row_3col(src,src2,dst,dst2) basic_load_2row(3,src,src2,dst,dst2)
#define load_2row_4col(src,src2,dst,dst2) basic_load_2row(4,src,src2,dst,dst2)
#define load_3row_1col(src,src2,src3,dst,dst2,dst3) basic_load_3row(1,src,src2,src3,dst,dst2,dst3)
#define load_3row_2col(src,src2,src3,dst,dst2,dst3) basic_load_3row(2,src,src2,src3,dst,dst2,dst3)
#define load_3row_3col(src,src2,src3,dst,dst2,dst3) basic_load_3row(3,src,src2,src3,dst,dst2,dst3)
#define load_3row_4col(src,src2,src3,dst,dst2,dst3) basic_load_3row(4,src,src2,src3,dst,dst2,dst3)
#define load_4row_1col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(1,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_2col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(2,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_3col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(3,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_4row_4col(src,src2,src3,src4,dst,dst2,dst3,dst4) basic_load_4row(4,src,src2,src3,src4,dst,dst2,dst3,dst4)
#define load_5row_1col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(1,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_2col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(2,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_3col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(3,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)
#define load_5row_4col(src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5) basic_load_5row(4,src,src2,src3,src4,src5,dst,dst2,dst3,dst4,dst5)

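/*
 * Editorial sketch (not part of the library): the load_XrowYcol wrappers
 * above only pin the column count of basic_load_Xrow at compile time.
 * Assuming basic_load_Nrow copies channel_div4 * col four-value blocks per
 * row while widening q7 inputs to q15 with the input offset applied, one
 * row amounts to the following plain-C loop:
 */
static inline void load_1row_scalar(const int8_t *src, int16_t *dst,
                                    int channel_div4, int col, int16_t input_offset) {
    int n = channel_div4 * col * 4; /* four q7 values per unrolled block */
    while (n-- > 0) {
        *dst++ = (int16_t)(*src++) + input_offset; /* offset-and-widen */
    }
}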
/* ---------------------------------- Reuse ---------------------------------- */
#define basic_reuse_1row(col,src_31,dst_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
block_cnt--;\
}
#define basic_reuse_2row(col,src_31,src2_31,dst_31,dst2_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
block_cnt--;\
}
#define basic_reuse_3row(col,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
block_cnt--;\
}
#define basic_reuse_4row(col,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
block_cnt--;\
}
#define basic_reuse_5row(col,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)\
block_cnt = channel_div4 * col;\
while (block_cnt > 0)\
{\
q31_assign2(src_31,dst_31)\
q31_assign2(src2_31,dst2_31)\
q31_assign2(src3_31,dst3_31)\
q31_assign2(src4_31,dst4_31)\
q31_assign2(src5_31,dst5_31)\
block_cnt--;\
}

#define reuse_1row_1col(src_31,dst_31) basic_reuse_1row(1,src_31,dst_31)
#define reuse_1row_2col(src_31,dst_31) basic_reuse_1row(2,src_31,dst_31)
#define reuse_1row_3col(src_31,dst_31) basic_reuse_1row(3,src_31,dst_31)
#define reuse_1row_4col(src_31,dst_31) basic_reuse_1row(4,src_31,dst_31)
#define reuse_1row_5col(src_31,dst_31) basic_reuse_1row(5,src_31,dst_31)
#define reuse_1row_6col(src_31,dst_31) basic_reuse_1row(6,src_31,dst_31)
#define reuse_2row_1col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(1,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_2col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(2,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_3col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(3,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_4col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(4,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_5col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(5,src_31,src2_31,dst_31,dst2_31)
#define reuse_2row_6col(src_31,src2_31,dst_31,dst2_31) basic_reuse_2row(6,src_31,src2_31,dst_31,dst2_31)
#define reuse_3row_1col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(1,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_2col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(2,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_3col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(3,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_4col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(4,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_5col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(5,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_3row_6col(src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31) basic_reuse_3row(6,src_31,src2_31,src3_31,dst_31,dst2_31,dst3_31)
#define reuse_4row_3col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(3,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_4col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(4,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_5col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(5,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_4row_6col(src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31) basic_reuse_4row(6,src_31,src2_31,src3_31,src4_31,dst_31,dst2_31,dst3_31,dst4_31)
#define reuse_5row_3col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(3,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_4col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(4,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_5col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(5,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
#define reuse_5row_6col(src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31) basic_reuse_5row(6,src_31,src2_31,src3_31,src4_31,src5_31,dst_31,dst2_31,dst3_31,dst4_31,dst5_31)
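/*
 * Editorial sketch (not part of the library): the reuse helpers copy
 * columns that were already im2col'ed instead of re-reading the input.
 * Assuming q31_assign2 copies two 32-bit words (four q15 values) and
 * advances both pointers, one reuse step amounts to:
 */
static inline void reuse_blocks_scalar(const int32_t **src, int32_t **dst,
                                       int channel_div4, int col) {
    int block_cnt = channel_div4 * col;
    while (block_cnt-- > 0) {
        *(*dst)++ = *(*src)++; /* first q31 word of the block */
        *(*dst)++ = *(*src)++; /* second q31 word of the block */
    }
}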
#endif /* ARMNN_INCLUDE_IMG2COL_ELEMENT_H_ */
421
TinyEngine/include/kernel_element.h
Normal file
@ -0,0 +1,421 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: kernel_element.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef ARMNN_INCLUDE_KERNEL_ELEMENT_H_
#define ARMNN_INCLUDE_KERNEL_ELEMENT_H_

#include "mutable_function.h"
#include "precision_cnt.h"

#define loop_ele_ext() \
sum = __SMLAD(col32[0], k_buf1[0], sum); \
sum_2 = __SMLAD(col32[1], k_buf1[1], sum_2); \
sum_3 = __SMLAD(col32[2], k_buf1[2], sum_3); \
sum_4 = __SMLAD(col32[3], k_buf1[3], sum_4); \
col32 += 4;\
k_buf1 += 4;

#define loop_ele() \
op_a = arm_nn_read_q15x2(col_pos); \
op_b = arm_nn_read_q15x2(col_pos + input_ch); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum = __SMLAD(op_c, k_buf1[0], sum); \
sum_2 = __SMLAD(op_a, k_buf1[q32_elements], sum_2); \
\
op_a = arm_nn_read_q15x2(col_pos + 2); \
op_b = arm_nn_read_q15x2(col_pos + input_ch + 2); \
\
op_c = __PKHBT(op_b, op_a, 16); \
op_a = __PKHTB(op_b, op_a, 16); \
sum_3 = __SMLAD(op_c, k_buf1[q32_elements*2], sum_3); \
sum_4 = __SMLAD(op_a, k_buf1[q32_elements*3], sum_4); \
\
col_pos += two_inch; \
k_buf1++;
/* end of loop_ele() */

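/*
 * Editorial note: __SMLAD is the ARMv7E-M dual 16-bit multiply-accumulate
 * used throughout the loops above; each call performs two MACs. Its scalar
 * equivalent, for reference:
 */
static inline int32_t smlad_scalar(int32_t x, int32_t y, int32_t acc) {
    int16_t x_lo = (int16_t)(x & 0xFFFF), x_hi = (int16_t)((uint32_t)x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFF), y_hi = (int16_t)((uint32_t)y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi; /* two 16x16 MACs */
}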
#define prepare_loops()\
q7_t *out_1 = out + output_ch / output_scaler;\
const int32_t *out_shift = output_shift;\
const int32_t *out_mult = output_mult;\
const int32_t *obias = bias;\
uint16_t row_count = output_ch / 2;\
q31_t *ksrc = &kbuf[0];\
/* end of prepare_loops() */

#define conv_1stloop_ele()\
q31_t ch_0_out_0 = *obias;\
q31_t ch_0_out_1 = *obias++;\
q31_t ch_1_out_0 = *obias;\
q31_t ch_1_out_1 = *obias++;\
q31_t b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
q31_t b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_1stloop_ele */

#define conv_lastloop_ele()\
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
\
ksrc = ksrc2;\
/* end of conv_lastloop_ele */

#define conv_midloop_ele(k_index) \
b1 = arm_nn_read_q15x2_ia(&ip_b1);\
ch_0_out_0 = __SMLAD(ksrc[k_index], b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(ksrc[k_index], b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(ksrc2[k_index], b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia(&ip_b0);\
ch_1_out_1 = __SMLAD(ksrc2[k_index], b1, ch_1_out_1);\
/* end of conv_midloop_ele */

#define conv_midloop_ptrele() \
b1 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b1);\
ch_0_out_0 = __SMLAD(*ksrc, b0, ch_0_out_0);\
ch_0_out_1 = __SMLAD(*ksrc++, b1, ch_0_out_1);\
ch_1_out_0 = __SMLAD(*ksrc2, b0, ch_1_out_0);\
b0 = arm_nn_read_q15x2_ia((const q15_t **)&ip_b0);\
ch_1_out_1 = __SMLAD(*ksrc2++, b1, ch_1_out_1);\
/* end of conv_midloop_ptrele */

/* Specialized Loop Unrolling */
// One macro per input-channel count; the code generator selects the depth per model.
#define unroll_8inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 8;\
q31_t *ksrc2 = ksrc + 4;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_12inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 12;\
q31_t *ksrc2 = ksrc + 6;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_16inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 16;\
q31_t *ksrc2 = ksrc + 8;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_20inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 20;\
q31_t *ksrc2 = ksrc + 10;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_24inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 24;\
q31_t *ksrc2 = ksrc + 12;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\


#define unroll_32inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 32;\
q31_t *ksrc2 = ksrc + 16;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_36inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 36;\
q31_t *ksrc2 = ksrc + 18;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_40inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 40;\
q31_t *ksrc2 = ksrc + 20;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\

#define unroll_48inch()\
prepare_loops();\
while (row_count) {\
const q15_t *ip_b0 = two_column_buffer;\
const q15_t *ip_b1 = ip_b0 + 48;\
q31_t *ksrc2 = ksrc + 24;\
conv_1stloop_ele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_midloop_ptrele()\
conv_lastloop_ele()\
mix_assign_requantize()\
row_count--;\
}\


/* END: Specialized Loop Unrolling */

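/*
 * Editorial sketch (not part of the library): every unroll_*inch macro
 * implements the same 2x2 register tiling: two output channels and two
 * output pixels accumulate together, so each loaded input word is reused
 * twice. A plain-int model of one tile, for reference:
 */
static void tile2x2_scalar(const int8_t *col0, const int8_t *col1,
                           const int8_t *k0, const int8_t *k1,
                           int num_ch_in, int32_t acc[4]) {
    for (int i = 0; i < num_ch_in; i++) {
        acc[0] += (int32_t)k0[i] * col0[i]; /* ch_0_out_0 */
        acc[1] += (int32_t)k0[i] * col1[i]; /* ch_0_out_1 */
        acc[2] += (int32_t)k1[i] * col0[i]; /* ch_1_out_0 */
        acc[3] += (int32_t)k1[i] * col1[i]; /* ch_1_out_1 */
    }
}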
#define b2_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
if(lower_bit == 1){\
*out = (q7_t) ((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2));\
*out_1 = (q7_t) ((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2));\
lower_bit = 3;\
}\
else{\
*out++ += (q7_t) (((ch_0_out_0 & 0x03) + ((ch_1_out_0 & 0x03) << 2)) << 4);\
*out_1++ += (q7_t) (((ch_0_out_1 & 0x03) + ((ch_1_out_1 & 0x03) << 2)) << 4);\
lower_bit = 1;\
}\
out_mult++;\
out_shift++;\

#define b4_assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ((ch_0_out_0 & 0x0F) + ((ch_1_out_0 & 0x0F) << 4));\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ((ch_0_out_1 & 0x0F) + ((ch_1_out_1 & 0x0F) << 4));\
out_mult++;\
out_shift++;\

#define assign_requantize() \
ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult,*out_shift);\
ch_0_out_0 += out_offset;\
ch_0_out_0 = MAX(ch_0_out_0, out_activation_min);\
ch_0_out_0 = MIN(ch_0_out_0, out_activation_max);\
*out++ = (q7_t) ch_0_out_0;\
\
ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult,*out_shift);\
ch_0_out_1 += out_offset;\
ch_0_out_1 = MAX(ch_0_out_1, out_activation_min);\
ch_0_out_1 = MIN(ch_0_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_0_out_1;\
out_mult++;\
out_shift++;\
ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult,*out_shift);\
ch_1_out_0 += out_offset;\
ch_1_out_0 = MAX(ch_1_out_0, out_activation_min);\
ch_1_out_0 = MIN(ch_1_out_0, out_activation_max);\
*out++ = (q7_t) ch_1_out_0;\
\
ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult,*out_shift);\
ch_1_out_1 += out_offset;\
ch_1_out_1 = MAX(ch_1_out_1, out_activation_min);\
ch_1_out_1 = MIN(ch_1_out_1, out_activation_max);\
*out_1++ = (q7_t) ch_1_out_1;\
out_mult++;\
out_shift++;\
/* end of assign_requantize */

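/*
 * Editorial note: arm_nn_requantize(v, mult, shift) rescales an int32
 * accumulator by the per-channel factor mult * 2^shift (mult is a Q31
 * fixed-point multiplier). A floating-point model of the assign path
 * above, for intuition only (requires <math.h>):
 */
static inline int8_t requantize_model(int32_t acc, float scale, int32_t out_offset,
                                      int32_t act_min, int32_t act_max) {
    int32_t v = (int32_t)lroundf((float)acc * scale) + out_offset;
    if (v < act_min) v = act_min; /* clamp to the quantized activation range */
    if (v > act_max) v = act_max;
    return (int8_t)v;
}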
#endif /* ARMNN_INCLUDE_KERNEL_ELEMENT_H_ */
236
TinyEngine/include/mutable_function.h
Normal file
@ -0,0 +1,236 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: mutable_function.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_

/* mutable functions */
#if KERNEL_PRE == 4
#define mix_read_and_pad_reordered b4_read_and_pad_reordered
#define mix_nn_read_q7x4 b4_nn_read_q7x4
#define mix_read_and_pad b4_read_and_pad
#elif KERNEL_PRE == 2
#define mix_read_and_pad_reordered b2_read_and_pad_reordered
#define mix_nn_read_q7x4 b2_nn_read_q7x4
#define mix_read_and_pad b2_read_and_pad
#else
#define mix_read_and_pad_reordered read_and_pad_reordered
#define mix_nn_read_q7x4 arm_nn_read_q7x4
#define mix_read_and_pad read_and_pad
#endif

#if INPUT_PRE == 4
#define mix_q7_q15_offset_ele b4_q7_q15_offset_ele
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_ele b2_q7_q15_offset_ele
#else
#define mix_q7_q15_offset_ele q7_q15_offset_ele
#endif

#if INPUT_PRE == 4
#define mix_q7_q15_offset_reordered_ele b4_q15_offset_reordered_ele
#define mix_load_1row_1col b4_load_1row_1col
#define mix_load_1row_2col b4_load_1row_2col
#define mix_load_1row_3col b4_load_1row_3col
#define mix_load_1row_4col b4_load_1row_4col
#define mix_load_1row_5col b4_load_1row_5col
#define mix_load_1row_6col b4_load_1row_6col
#define mix_load_1row_7col b4_load_1row_7col
#define mix_load_2row_1col b4_load_2row_1col
#define mix_load_2row_2col b4_load_2row_2col
#define mix_load_2row_3col b4_load_2row_3col
#define mix_load_2row_4col b4_load_2row_4col
#define mix_load_2row_5col b4_load_2row_5col
#define mix_load_2row_6col b4_load_2row_6col
#define mix_load_2row_7col b4_load_2row_7col
#define mix_load_3row_1col b4_load_3row_1col
#define mix_load_3row_2col b4_load_3row_2col
#define mix_load_3row_3col b4_load_3row_3col
#define mix_load_3row_4col b4_load_3row_4col
#define mix_load_3row_5col b4_load_3row_5col
#define mix_load_3row_6col b4_load_3row_6col
#define mix_load_3row_7col b4_load_3row_7col
#define mix_load_4row_1col b4_load_4row_1col
#define mix_load_4row_2col b4_load_4row_2col
#define mix_load_4row_3col b4_load_4row_3col
#define mix_load_4row_4col b4_load_4row_4col
#define mix_load_4row_5col b4_load_4row_5col
#define mix_load_4row_6col b4_load_4row_6col
#define mix_load_4row_7col b4_load_4row_7col
#define mix_load_5row_1col b4_load_5row_1col
#define mix_load_5row_2col b4_load_5row_2col
#define mix_load_5row_3col b4_load_5row_3col
#define mix_load_5row_4col b4_load_5row_4col
#define mix_load_5row_5col b4_load_5row_5col
#define mix_load_5row_6col b4_load_5row_6col
#define mix_load_5row_7col b4_load_5row_7col
#define mix_load_6row_1col b4_load_6row_1col
#define mix_load_6row_2col b4_load_6row_2col
#define mix_load_6row_3col b4_load_6row_3col
#define mix_load_6row_4col b4_load_6row_4col
#define mix_load_6row_5col b4_load_6row_5col
#define mix_load_6row_6col b4_load_6row_6col
#define mix_load_6row_7col b4_load_6row_7col
#define mix_load_7row_1col b4_load_7row_1col
#define mix_load_7row_2col b4_load_7row_2col
#define mix_load_7row_3col b4_load_7row_3col
#define mix_load_7row_4col b4_load_7row_4col
#define mix_load_7row_5col b4_load_7row_5col
#define mix_load_7row_6col b4_load_7row_6col
#define mix_load_7row_7col b4_load_7row_7col
#elif INPUT_PRE == 2
#define mix_q7_q15_offset_reordered_ele b2_q15_offset_reordered_ele
#define mix_load_1row_1col b2_load_1row_1col
#define mix_load_1row_2col b2_load_1row_2col
#define mix_load_1row_3col b2_load_1row_3col
#define mix_load_1row_4col b2_load_1row_4col
#define mix_load_1row_5col b2_load_1row_5col
#define mix_load_1row_6col b2_load_1row_6col
#define mix_load_1row_7col b2_load_1row_7col
#define mix_load_2row_1col b2_load_2row_1col
#define mix_load_2row_2col b2_load_2row_2col
#define mix_load_2row_3col b2_load_2row_3col
#define mix_load_2row_4col b2_load_2row_4col
#define mix_load_2row_5col b2_load_2row_5col
#define mix_load_2row_6col b2_load_2row_6col
#define mix_load_2row_7col b2_load_2row_7col
#define mix_load_3row_1col b2_load_3row_1col
#define mix_load_3row_2col b2_load_3row_2col
#define mix_load_3row_3col b2_load_3row_3col
#define mix_load_3row_4col b2_load_3row_4col
#define mix_load_3row_5col b2_load_3row_5col
#define mix_load_3row_6col b2_load_3row_6col
#define mix_load_3row_7col b2_load_3row_7col
#define mix_load_4row_1col b2_load_4row_1col
#define mix_load_4row_2col b2_load_4row_2col
#define mix_load_4row_3col b2_load_4row_3col
#define mix_load_4row_4col b2_load_4row_4col
#define mix_load_4row_5col b2_load_4row_5col
#define mix_load_4row_6col b2_load_4row_6col
#define mix_load_4row_7col b2_load_4row_7col
#define mix_load_5row_1col b2_load_5row_1col
#define mix_load_5row_2col b2_load_5row_2col
#define mix_load_5row_3col b2_load_5row_3col
#define mix_load_5row_4col b2_load_5row_4col
#define mix_load_5row_5col b2_load_5row_5col
#define mix_load_5row_6col b2_load_5row_6col
#define mix_load_5row_7col b2_load_5row_7col
#define mix_load_6row_1col b2_load_6row_1col
#define mix_load_6row_2col b2_load_6row_2col
#define mix_load_6row_3col b2_load_6row_3col
#define mix_load_6row_4col b2_load_6row_4col
#define mix_load_6row_5col b2_load_6row_5col
#define mix_load_6row_6col b2_load_6row_6col
#define mix_load_6row_7col b2_load_6row_7col
#define mix_load_7row_1col b2_load_7row_1col
#define mix_load_7row_2col b2_load_7row_2col
#define mix_load_7row_3col b2_load_7row_3col
#define mix_load_7row_4col b2_load_7row_4col
#define mix_load_7row_5col b2_load_7row_5col
#define mix_load_7row_6col b2_load_7row_6col
#define mix_load_7row_7col b2_load_7row_7col
#else
#define mix_q7_q15_offset_reordered_ele q7_q15_offset_reordered_ele
#define mix_load_1row_1col load_1row_1col
#define mix_load_1row_2col load_1row_2col
#define mix_load_1row_3col load_1row_3col
#define mix_load_1row_4col load_1row_4col
#define mix_load_1row_5col load_1row_5col
#define mix_load_1row_6col load_1row_6col
#define mix_load_1row_7col load_1row_7col
#define mix_load_2row_1col load_2row_1col
#define mix_load_2row_2col load_2row_2col
#define mix_load_2row_3col load_2row_3col
#define mix_load_2row_4col load_2row_4col
#define mix_load_2row_5col load_2row_5col
#define mix_load_2row_6col load_2row_6col
#define mix_load_2row_7col load_2row_7col
#define mix_load_3row_1col load_3row_1col
#define mix_load_3row_2col load_3row_2col
#define mix_load_3row_3col load_3row_3col
#define mix_load_3row_4col load_3row_4col
#define mix_load_3row_5col load_3row_5col
#define mix_load_3row_6col load_3row_6col
#define mix_load_3row_7col load_3row_7col
#define mix_load_4row_1col load_4row_1col
#define mix_load_4row_2col load_4row_2col
#define mix_load_4row_3col load_4row_3col
#define mix_load_4row_4col load_4row_4col
#define mix_load_4row_5col load_4row_5col
#define mix_load_4row_6col load_4row_6col
#define mix_load_4row_7col load_4row_7col
#define mix_load_5row_1col load_5row_1col
#define mix_load_5row_2col load_5row_2col
#define mix_load_5row_3col load_5row_3col
#define mix_load_5row_4col load_5row_4col
#define mix_load_5row_5col load_5row_5col
#define mix_load_5row_6col load_5row_6col
#define mix_load_5row_7col load_5row_7col
#define mix_load_6row_1col load_6row_1col
#define mix_load_6row_2col load_6row_2col
#define mix_load_6row_3col load_6row_3col
#define mix_load_6row_4col load_6row_4col
#define mix_load_6row_5col load_6row_5col
#define mix_load_6row_6col load_6row_6col
#define mix_load_6row_7col load_6row_7col
#define mix_load_7row_1col load_7row_1col
#define mix_load_7row_2col load_7row_2col
#define mix_load_7row_3col load_7row_3col
#define mix_load_7row_4col load_7row_4col
#define mix_load_7row_5col load_7row_5col
#define mix_load_7row_6col load_7row_6col
#define mix_load_7row_7col load_7row_7col
#endif

#if OUTPUT_PRE == 4
#define mix_assign_requantize() b4_assign_requantize()
#elif OUTPUT_PRE == 2
#define mix_assign_requantize() b2_assign_requantize()
#else
#define mix_assign_requantize() assign_requantize()
#endif

#if KERNEL_PRE == 4
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b44_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b44_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b42_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b42_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b48_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b48_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif /* OUTPUT_PRE */
#elif KERNEL_PRE == 2
#if OUTPUT_PRE == 4
#define mix_nn_mat_mult_kernel_s8_s16_reordered b24_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b24_nn_mat_mult_kernel_s8_s16_reordered_8mul
#elif OUTPUT_PRE == 2
#define mix_nn_mat_mult_kernel_s8_s16_reordered b22_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b22_nn_mat_mult_kernel_s8_s16_reordered_8mul
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered b28_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul b28_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif /* OUTPUT_PRE */
#else
#define mix_nn_mat_mult_kernel_s8_s16_reordered arm_nn_mat_mult_kernel_s8_s16_reordered
#define mix_nn_mat_mult_kernel_s8_s16_reordered_8mul arm_nn_mat_mult_kernel_s8_s16_reordered_8mul
#endif

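/*
 * Editorial note: the b<K><O>_ prefixes above encode kernel/output
 * bit-widths, so building with
 *
 *     #define KERNEL_PRE 4
 *     #define OUTPUT_PRE 2
 *
 * in precision_cnt.h makes mix_nn_mat_mult_kernel_s8_s16_reordered resolve
 * to b42_nn_mat_mult_kernel_s8_s16_reordered. Generated kernels call only
 * the mix_* names, so changing two macros re-targets every convolution.
 */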
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_MUTABLE_FUNCTION_H_ */
31
TinyEngine/include/precision_cnt.h
Normal file
@ -0,0 +1,31 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: precision_cnt.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_
#define TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_

/* MIX precision */
#define INPUT_PRE 8
#define KERNEL_PRE 8
#define OUTPUT_PRE 8
#define input_scaler (8 / INPUT_PRE)
#define weight_scaler (8 / KERNEL_PRE)
#define output_scaler (8 / OUTPUT_PRE)

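/*
 * Editorial note: with sub-byte precision several values share one byte,
 * so buffer sizes shrink by a factor of 8 / PRE. For example, OUTPUT_PRE
 * == 4 gives output_scaler == 2 (two 4-bit outputs per byte), which is
 * why prepare_loops() computes out_1 = out + output_ch / output_scaler.
 */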
#endif /* TINYENGINE_SOURCE_CONVOLUTIONFUNCTIONS_MIX_PRECISION_CNT_H_ */
68
TinyEngine/include/profile.h
Normal file
@ -0,0 +1,68 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: profile.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "stm32f7xx_hal.h"
#include <stdio.h>
#include <string.h>
static UART_HandleTypeDef UART;
#define RUNS 1
static int profile_i;
static int start, end;
static char buf[100];

void printLog(const char *s) {
    static int is_initialized = 0;
    if (!is_initialized) {
        UART.Instance = USART1;
        UART.Init.BaudRate = 115200;
        UART.Init.WordLength = UART_WORDLENGTH_8B;
        UART.Init.StopBits = UART_STOPBITS_1;
        UART.Init.Parity = UART_PARITY_NONE;
        UART.Init.Mode = UART_MODE_TX_RX;
        UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
        UART.Init.OverSampling = UART_OVERSAMPLING_16;
        UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
        UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
        if (HAL_UART_Init(&UART) != HAL_OK) {
            /* Error handling */
        }
        is_initialized = 1;
    }
    HAL_UART_Transmit(&UART, (uint8_t*) s, strlen(s), 10);
}

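/*
 * Editorial usage sketch (not part of the library): time a kernel with
 * HAL_GetTick() and report the result over UART through printLog(), the
 * way the file-scope start/end/buf variables above are intended to be used.
 */
static void profile_example(void (*kernel)(void)) {
    start = (int) HAL_GetTick(); /* millisecond tick before the runs */
    for (profile_i = 0; profile_i < RUNS; profile_i++)
        kernel();
    end = (int) HAL_GetTick();
    sprintf(buf, "elapsed: %d ms\r\n", end - start);
    printLog(buf);
}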
void recieveChar(char *s) {
    static int is_initialized = 0;
    if (!is_initialized) {
        UART.Instance = USART1;
        UART.Init.BaudRate = 115200;
        UART.Init.WordLength = UART_WORDLENGTH_8B;
        UART.Init.StopBits = UART_STOPBITS_1;
        UART.Init.Parity = UART_PARITY_NONE;
        UART.Init.Mode = UART_MODE_TX_RX;
        UART.Init.HwFlowCtl = UART_HWCONTROL_NONE;
        UART.Init.OverSampling = UART_OVERSAMPLING_16;
        UART.Init.OneBitSampling = UART_ONE_BIT_SAMPLE_DISABLE;
        UART.AdvancedInit.AdvFeatureInit = UART_ADVFEATURE_NO_INIT;
        if (HAL_UART_Init(&UART) != HAL_OK) {
            /* Error handling */
        }
        is_initialized = 1;
    }
    HAL_UART_Receive(&UART, (uint8_t*) s, 1, 10);
}
161
TinyEngine/include/tinyengine_function.h
Normal file
@ -0,0 +1,161 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: tinyengine_function.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include <stdint.h>
#include <stdbool.h>

typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;

typedef enum {
    STATE_SUCCESS = 0,    /* No error */
    PARAM_NO_SUPPORT = 1, /* Unsupported parameters */
} tinyengine_status;

typedef struct add_params {
    int input_h, input_w, input_c, left_shift;
    int input1_offset, input1_multiplier, input1_shift;
    int input2_offset, input2_multiplier, input2_shift;
    int output_offset, output_multiplier, output_shift;
    int quantized_activation_max, quantized_activation_min;
} ADD_params;

#define TN_MAX(A,B) ((A) > (B) ? (A) : (B))
#define TN_MIN(A,B) ((A) < (B) ? (A) : (B))

// bit assignment and check
#define BIT_SET(a,b) ((a) |= (1ULL<<(b)))
#define BIT_CLEAR(a,b) ((a) &= ~(1ULL<<(b)))
#define BIT_FLIP(a,b) ((a) ^= (1ULL<<(b)))
#define BIT_CHECK(a,b) (!!((a) & (1ULL<<(b)))) // '!!' to make sure this returns 0 or 1

#define BITMASK_SET(x, mask) ((x) |= (mask))
#define BITMASK_CLEAR(x, mask) ((x) &= (~(mask)))
#define BITMASK_FLIP(x, mask) ((x) ^= (mask))
#define BITMASK_CHECK_ALL(x, mask) (!(~(x) & (mask)))
#define BITMASK_CHECK_ANY(x, mask) ((x) & (mask))

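/*
 * Editorial usage sketch: packing per-element validity flags into a
 * bitmask with the macros above, as add_fpreq_bitmask does for its
 * masks (one bit per element, eight elements per byte).
 */
static inline void set_flag(uint8_t *mask, int idx, int valid) {
    if (valid)
        BIT_SET(mask[idx / 8], idx % 8);
    else
        BIT_CLEAR(mask[idx / 8], idx % 8);
}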
tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
    const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
    const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t out_offset,
    const int32_t input_offset, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
    const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
    const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
    const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t out_offset,
    const int32_t input_offset, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
    const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch16(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t out_offset,
    const int32_t input_offset, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
    const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch24(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t out_offset,
    const int32_t input_offset, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
    const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_1x1_s8_ch48(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t out_offset,
    const int32_t input_offset, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
    const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf);

tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const int32_t *output_shift,
    const int32_t *output_mult, const int32_t output_offset,
    const int32_t input_offset, const int32_t output_activation_min,
    const int32_t output_activation_max, q7_t *output,
    const uint16_t output_x, const uint16_t output_y,
    const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
    q7_t pad_value);

tinyengine_status add(int size, ADD_params *params, const int8_t *input1_data,
    const int8_t *input2_data, int8_t *output_data);

tinyengine_status avg_pooling(const q7_t *input, const uint16_t input_h,
    const uint16_t input_w, const uint16_t input_c, const uint16_t sample_h,
    const uint16_t sample_w, const uint16_t output_h,
    const uint16_t output_w, const int32_t out_activation_min,
    const int32_t out_activation_max, q7_t *output);

tinyengine_status fully_connected_fp(const float *input, const uint16_t input_x,
    const uint16_t input_y, const uint16_t input_ch,
    const uint16_t output_ch, const float *bias, const float *weights,
    float *output);

tinyengine_status statble_softmax_inplace(float *input, const uint16_t length);

tinyengine_status mat_mul_fp(const float *matA, const uint16_t matA_row,
    const uint16_t matA_col, const float *matB, const uint16_t matB_col,
    float *output);

tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
    const q7_t *input, const uint16_t input_x, const uint16_t input_y,
    const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
    const float *scales, const int32_t output_offset,
    const int32_t input_offset, const int32_t output_activation_min,
    const int32_t output_activation_max, q7_t *output,
    const uint16_t output_x, const uint16_t output_y,
    const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
    q7_t pad_value);

tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
    const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
    const float zero_y, int8_t* output_data);

tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
    const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
    const float zero_y, int8_t* output_data, int8_t* output_mask);

tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
    const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
    const float zero_y, int8_t* output_data, int8_t* output_mask);

tinyengine_status where_int8(const bool* inMask, const uint16_t size, signed char* input1_data,
    const char* input2_data, char* output_data);

tinyengine_status convolve_1x1_s8_fpreq_mask_partialCH(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel_sram, const q7_t *kernel_flash, const uint16_t first_k_channel, const int32_t *bias, const float *scales,
    const int32_t out_offset, const int32_t input_offset,
    const int32_t out_activation_min, const int32_t out_activation_max,
    q7_t *output, q7_t *mask, const uint16_t output_x, const uint16_t output_y,
    const uint16_t output_ch, q15_t *runtime_buf);

#include "genInclude.h"
#include "fp_requantize_op.h"
31
TinyEngine/include/tinyengine_lib.h
Normal file
@ -0,0 +1,31 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: tinyengine_lib.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#ifndef TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#define TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_
#include <stdint.h> /* fixed-width types required by the typedefs below */
#include <stdio.h>

typedef int8_t q7_t;
typedef uint8_t q8_t;
typedef int16_t q15_t;
typedef uint16_t q16_t;
typedef int32_t q31_t;
typedef uint32_t q32_t;

#endif /* TINYENGINE_INCLUDE_TINYENGINE_FUNCTIONLIB_H_ */
33
TinyEngine/include/yoloOutput.h
Normal file
@ -0,0 +1,33 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: yoloOutput.h
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

typedef struct box{
    float x0;
    float y0;
    float x1;
    float y1;
    float score;
} det_box;

det_box** postprocessing(signed char *input_data[3], signed char y_zero[3], float y_scale[3],
    unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
    const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);

det_box** postprocessing_fp(float *input_data[3], signed char y_zero[3], float y_scale[3],
    unsigned char *data_buf, int w, int h, int output_c, int num_classes, const int anchors[3][3][2], int outputs,
    const float NMS_threshold, const float VALID_THRESHOLD, int* box_ret, det_box** ret_box);
88
TinyEngine/src/kernels/fp_requantize_op/add_fpreq.c
Normal file
@ -0,0 +1,88 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: add_fpreq.c
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"

tinyengine_status add_fpreq(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
        const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
        const float zero_y, int8_t* output_data) {
    for (int i = 0; i < size; ++i) {
        float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
        float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
        int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with the TVM implementation
        clamped_output = TN_MAX(clamped_output, -128);
        clamped_output = TN_MIN(clamped_output, 127);
        output_data[i] = (int8_t)(clamped_output);
    }
    return STATE_SUCCESS;
}

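/*
 * Editorial usage sketch (parameter values are made up): add two int8
 * tensors quantized with different scales/zero-points, requantizing the
 * sum to the output's scale on the fly.
 */
static void add_fpreq_demo(void) {
    int8_t a[4] = {10, 20, 30, 40}, b[4] = {1, 2, 3, 4}, out[4];
    /* scales and zero points normally come from the model's quantization metadata */
    add_fpreq(4, a, 0.05f, 0.0f, b, 0.05f, 0.0f, 0.1f, 0.0f, out);
}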
const int activation_min = -128;
const int activation_max = 127;
tinyengine_status add_fpreq_mask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
        const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
        const float zero_y, int8_t* output_data, int8_t* output_mask) {
    for (int i = 0; i < size; ++i) {
        float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
        float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
        int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with the TVM implementation
        int8_t mask_value = 1;
        if (clamped_output < activation_min){
            clamped_output = activation_min;
            mask_value = 0;
        }
        if (clamped_output > activation_max){
            clamped_output = activation_max;
            mask_value = 0;
        }
        output_data[i] = (int8_t)(clamped_output);
        output_mask[i] = mask_value;
    }
    return STATE_SUCCESS;
}


tinyengine_status add_fpreq_bitmask(int size, const int8_t* input1_data, const float input1_scale, const float input1_zero,
        const int8_t* input2_data, const float input2_scale, const float input2_zero, const float output_scale,
        const float zero_y, int8_t* output_data, int8_t* output_mask) {
    int mask_idx = 0;
    for (int i = 0; i < size; ++i) {
        float input1_fp = ((float)*input1_data++ - input1_zero) * input1_scale;
        float input2_fp = ((float)*input2_data++ - input2_zero) * input2_scale;
        int clamped_output = (int)round((input1_fp + input2_fp) / output_scale + zero_y); // to align with the TVM implementation
        int8_t mask_value = 1;
        if (clamped_output < activation_min){
            clamped_output = activation_min;
            mask_value = 0;
        }
        if (clamped_output > activation_max){
            clamped_output = activation_max;
            mask_value = 0;
        }
        output_data[i] = (int8_t)(clamped_output);
        if (mask_value == 1)
            BIT_SET(*output_mask, mask_idx);
        else
            BIT_CLEAR(*output_mask, mask_idx);
        mask_idx++;
        if (mask_idx == 8){
            mask_idx = 0;
            output_mask++;
        }
    }
    return STATE_SUCCESS;
}
@ -0,0 +1,122 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_1x1_s8_ch16_fpreq.c
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch16_fpreq(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const float *scales,
    const int32_t out_offset, const int32_t input_offset,
    const int32_t out_activation_min, const int32_t out_activation_max,
    q7_t *output, const uint16_t output_x, const uint16_t output_y,
    const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void) input_x;
    (void) input_y;

    /* Partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* Fill buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch variables consumed by the img2col load macro */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch16_fpreq(kernel,
            two_column_buffer, output_ch, scales, (q7_t) out_offset,
            out_activation_min, out_activation_max,
            input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch variables consumed by the img2col load macro */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* single leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = (float) sum * scales[i_ch_out];
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
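/*
 * Editorial note on the kernel above: only two im2col columns are ever
 * materialized in runtime_buf, so peak buffer use is bounded by
 * 2 * input_ch q15 values rather than a full im2col matrix; a trailing
 * odd pixel falls back to the generic single-column dot product.
 */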
@ -0,0 +1,122 @@
|
||||
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_1x1_s8_ch24_fpreq.c
 *
 * Reference papers:
 * - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 * - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 * - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 * - Wei-Ming Chen, wmchen@mit.edu
 * - Wei-Chen Wang, wweichen@mit.edu
 * - Ji Lin, jilin@mit.edu
 * - Ligeng Zhu, ligeng@mit.edu
 * - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch24_fpreq(const q7_t *input,
    const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
    const q7_t *kernel, const int32_t *bias, const float *scales,
    const int32_t out_offset, const int32_t input_offset,
    const int32_t out_activation_min, const int32_t out_activation_max,
    q7_t *output, const uint16_t output_x, const uint16_t output_y,
    const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void) input_x;
    (void) input_y;

    /* Partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* Fill buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch variables consumed by the img2col load macro */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch24_fpreq(kernel,
            two_column_buffer, output_ch, scales, (q7_t) out_offset,
            out_activation_min, out_activation_max,
            input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch variables consumed by the img2col load macro */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* single leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = (float) sum * scales[i_ch_out];
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch48_fpreq.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch48_fpreq(const q7_t *input,
        const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
        const q7_t *kernel, const int32_t *bias, const float *scales,
        const int32_t out_offset, const int32_t input_offset,
        const int32_t out_activation_min, const int32_t out_activation_max,
        q7_t *output, const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void) input_x;
    (void) input_y;

    /* Partial (two columns) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* Fill buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // two columns
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch48_fpreq(kernel,
                two_column_buffer, output_ch, scales, (q7_t) out_offset,
                out_activation_min, out_activation_max,
                input_ch * DIM_KER_Y * DIM_KER_X, bias, out);
    }

    /* check if there is an odd column left-over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // one leftover column
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = (q31_t) ((float) sum * scales[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
@ -0,0 +1,122 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch8_fpreq.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"
#include "fp_requantize_op.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch8_fpreq(const q7_t *input,
        const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
        const q7_t *kernel, const int32_t *bias, const float *scales,
        const int32_t out_offset, const int32_t input_offset,
        const int32_t out_activation_min, const int32_t out_activation_max,
        q7_t *output, const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void) input_x;
    (void) input_y;

    /* Partial (two columns) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* Fill buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // two columns
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
                output_ch, scales, (q7_t) out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
                out);
    }

    /* check if there is an odd column left-over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // one leftover column
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = (q31_t) ((float) sum * scales[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
125
TinyEngine/src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq.c
Normal file
@ -0,0 +1,125 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_fpreq.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_fpreq(const q7_t *input,
        const uint16_t input_x, const uint16_t input_y, const uint16_t input_ch,
        const q7_t *kernel, const int32_t *bias, const float *scales,
        const int32_t out_offset, const int32_t input_offset,
        const int32_t out_activation_min, const int32_t out_activation_max,
        q7_t *output, const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf) {
    /* the partial im2col below consumes 4 input channels per step */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int32_t i_element;
    (void) input_x;
    (void) input_y;

    /* Partial (two columns) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* Fill buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // two columns
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_fpreq(kernel, two_column_buffer,
                output_ch, scales, (q7_t) out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X, bias,
                out);
    }

    /* check if there is an odd column left-over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        // variables used by the im2col macros
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; // one leftover column
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = (q31_t) ((float) sum * scales[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
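The q7_q15_offset_reordered_ele macro used by all of the 1x1 kernels above is defined in img2col_element.h, which is not shown in this listing. As a rough scalar sketch only, under the assumption that each invocation expands four q7 inputs into offset-adjusted q15 values and advances both pointers (the real macro does this with SIMD, producing the interleaved halfword order the *_reordered GEMM kernels expect):

/* hypothetical scalar equivalent, ignoring the SXTB16 halfword reordering */
static void q7_to_q15_offset_scalar(const q7_t **src, q15_t **dst, int32_t input_offset) {
    for (int i = 0; i < 4; i++) {
        *(*dst)++ = (q15_t) (*(*src)++ + input_offset);
    }
}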
@ -0,0 +1,287 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_s8_kernel3_inputch3_stride2_pad1_fpreq.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1_fpreq(
        const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const float *scales, const int32_t output_offset,
        const int32_t input_offset, const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf,
        q7_t pad_value) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 27]; // each q31_t write stores 2 elements
        q15_t *dst2 = dst1 + 27;

        const q7_t *ip_a1 = ip_a0 + 27;

        // 27 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;
        // elements 25, 26, 27
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst1[2] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;
        dst2[2] = *ip_a1++;

        /* skip the row already consumed through ip_a1 */
        ip_a0 += 27;
    }

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            const int16_t base_idx_y = (i_out_y * 2) - 1;
            const int16_t base_idx_x = (i_out_x * 2) - 1;
            q15_t *col_buffer = two_column_buf;

            // variables used by the im2col macros
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load addresses: 8 bit */
            const q7_t *src;
            const q7_t *src2;
            const q7_t *src3;

            /* buffers for load: 16 bit */
            q15_t *dst;
            q15_t *dst2;
            q15_t *dst3;

            int input_row_offset = 3 * input_x;
            dst = col_buffer;
            dst2 = dst + 9;
            dst3 = dst2 + 9;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) { // load all for now and unroll all
                    // 3x3 = 9 elements
                    src = input
                            + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    // 4 * 2 = 8 elements, plus 1 scalar load
                    q7_q15_offset_ele(src, dst)
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    //
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { // first element is pad
                    // 3x3 = 9 elements
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    // pad the first one: 1x3 = 3
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    // load 6 elements per row: 4 + 2 = 6
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            } else { // first row is padded
                // 3x3 = 9 elements
                *dst++ = pad_out;
                q31_t *dst_31 = (q31_t *) dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) { // load all for now and unroll all
                    // 3x3 = 9 elements
                    src2 = input + (base_idx_x) * input_ch;
                    src3 = src2 + input_row_offset;

                    // 4 * 2 = 8 elements, plus 1 scalar load
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { // first element is pad
                    // 3x3 = 9 elements
                    src2 = input;
                    src3 = src2 + input_row_offset;

                    // pad the first one: 1x3 = 3
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    // load 6 elements per row: 4 + 2 = 6
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            }

            two_column_buf += 27;
            /* Computation is performed for every 2 columns */
            if (two_column_buf == runtime_buf + 2 * 27) {

                out = mat_mult_kernel3_input3_s8_s16_fpreq(kernel, runtime_buf,
                        output_ch, scales, output_offset, output_activation_min,
                        output_activation_max, input_ch * kernel_y * kernel_x,
                        bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* left-over because odd number of output pixels */
    if (two_column_buf != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply and accumulates are done in one loop. */
            uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle left over mac */
            col_count = (input_ch * kernel_y * kernel_x) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = (q31_t) ((float) sum * scales[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
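The weight pre-expansion loop above relies on CMSIS-NN's read_and_pad(). As a reader-side model only (an assumption based on the standard CMSIS semantics, not code from this tree): it consumes four q7 values and emits them, sign-extended and in original order, as two q31 words each packing a pair of q15 values:

/* scalar model of read_and_pad(): {a, b, c, d} -> out1 = (a, b), out2 = (c, d) */
static const q7_t *read_and_pad_model(const q7_t *src, q31_t *out1, q31_t *out2) {
    q15_t a = src[0], b = src[1], c = src[2], d = src[3]; /* q7 -> q15 sign extension */
    *out1 = (q31_t) ((uint16_t) a | ((uint32_t) (uint16_t) b << 16));
    *out2 = (q31_t) ((uint16_t) c | ((uint32_t) (uint16_t) d << 16));
    return src + 4;
}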
2017
TinyEngine/src/kernels/fp_requantize_op/mat_mul_kernels_fpreq.c
Normal file
92
TinyEngine/src/kernels/int_only/add.c
Normal file
@ -0,0 +1,92 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   add.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include <math.h>
#include "arm_math.h"
#include "tinyengine_function.h"

int32_t Add(int32_t a, int32_t b) {
    return a + b;
}
int32_t ShiftRight(int32_t a, int offset) {
    return a >> offset;
}
int32_t BitAnd(int32_t a, int32_t b) {
    return a & b;
}
int32_t BitNot(int32_t a) {
    return ~a;
}
int32_t MaskIfNonZero(int32_t a) {
    static const int32_t zero = 0;
    return a ? BitNot(zero) : zero;
}
int32_t MaskIfGreaterThan(int32_t a, int32_t b) {
    return MaskIfNonZero(a > b);
}
int32_t MaskIfLessThan(int32_t a, int32_t b) {
    return MaskIfNonZero(a < b);
}

static inline int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
    int64_t a_64 = a;
    int64_t b_64 = b;
    int64_t ab_64 = a_64 * b_64;
    int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
    int32_t ab_x2_high32 = (int32_t)((ab_64 + nudge) / (1ll << 31));
    return a == b && a == -2147483648 ? 2147483647 : ab_x2_high32;
}

static inline int32_t RoundingDivideByPOT(int32_t x, int exponent) {
    const int32_t mask = (int32_t)((1ll << exponent) - 1);
    const int32_t zero = (0);
    const int32_t one = (1);
    const int32_t remainder = BitAnd(x, mask);
    const int32_t threshold = Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
    return Add(ShiftRight(x, exponent), BitAnd(MaskIfGreaterThan(remainder, threshold), one));
}

static inline int32_t MultiplyByQuantizedMultiplierSmallerThanOneExp(
        int32_t x, int32_t quantized_multiplier, int left_shift) {
    return RoundingDivideByPOT(
        SaturatingRoundingDoublingHighMul(x, quantized_multiplier), -left_shift);
}

tinyengine_status add(int size, ADD_params* params, const int8_t* input1_data,
        const int8_t* input2_data, int8_t* output_data) {
    for (int i = 0; i < size; ++i) {
        const int32_t input1_val = params->input1_offset + input1_data[i];
        const int32_t input2_val = params->input2_offset + input2_data[i];
        const int32_t shifted_input1_val = input1_val * (1 << params->left_shift);
        const int32_t shifted_input2_val = input2_val * (1 << params->left_shift);
        const int32_t scaled_input1_val =
            MultiplyByQuantizedMultiplierSmallerThanOneExp(
                shifted_input1_val, params->input1_multiplier, params->input1_shift);
        const int32_t scaled_input2_val =
            MultiplyByQuantizedMultiplierSmallerThanOneExp(
                shifted_input2_val, params->input2_multiplier, params->input2_shift);
        const int32_t raw_sum = scaled_input1_val + scaled_input2_val;
        const int32_t raw_output =
            MultiplyByQuantizedMultiplierSmallerThanOneExp(
                raw_sum, params->output_multiplier, params->output_shift) +
            params->output_offset;
        const int32_t clamped_output = TN_MIN(params->quantized_activation_max,
                TN_MAX(params->quantized_activation_min, raw_output));
        output_data[i] = (int8_t)(clamped_output);
    }
    return STATE_SUCCESS;
}
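A quick numeric sanity check of the fixed-point helpers above (not part of the source tree): with multiplier 1 << 30, the doubling high-mul is a multiplication by 0.5 in Q31, and RoundingDivideByPOT rounds to nearest with ties resolved upward for positive remainders.

#include <assert.h>
static void requantize_helpers_check(void) {
    assert(SaturatingRoundingDoublingHighMul(100, 1 << 30) == 50); /* 100 * 0.5 */
    assert(RoundingDivideByPOT(7, 2) == 2);   /*  7 / 4 =  1.75 ->  2 */
    assert(RoundingDivideByPOT(-7, 2) == -2); /* -7 / 4 = -1.75 -> -2 */
}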
223
TinyEngine/src/kernels/int_only/arm_convolve_s8_4col.c
Normal file
@ -0,0 +1,223 @@
/*
 * Copyright (C) 2010-2022 Arm Limited or its affiliates.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* ----------------------------------------------------------------------
 * This file is MODIFIED from Arm CMSIS NN Library.
 *
 * Project: TinyEngine
 * Title:   arm_convolve_s8_4col.c
 * Description: s8_4col version of convolution using symmetric quantization.
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Original Project: CMSIS NN Library
 * Original Title:   arm_convolve_s8.c
 *
 * Target Processor: Cortex-M CPUs
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"

/**
 * @ingroup groupNN
 */

/**
 * @addtogroup NNConv
 * @{
 */

/*
 * Basic s8_4col convolution function.
 *
 * Refer header file for details. Optimal use case for the DSP/MVE implementation is when input and output channels
 * are multiples of 4 or at least greater than 4.
 *
 */

arm_status arm_convolve_s8_4col(const q7_t *input,
                                const uint16_t input_x,
                                const uint16_t input_y,
                                const uint16_t input_ch,
                                const uint16_t input_batches,
                                const q7_t *kernel,
                                const uint16_t output_ch,
                                const uint16_t kernel_x,
                                const uint16_t kernel_y,
                                const uint16_t pad_x,
                                const uint16_t pad_y,
                                const uint16_t stride_x,
                                const uint16_t stride_y,
                                const int32_t *bias,
                                q7_t *output,
                                const int32_t *output_shift,
                                const int32_t *output_mult,
                                const int32_t out_offset,
                                const int32_t input_offset,
                                const int32_t out_activation_min,
                                const int32_t out_activation_max,
                                const uint16_t output_x,
                                const uint16_t output_y,
                                q15_t *buffer_a)
{
    int i_batch;
    for (i_batch = 0; i_batch < input_batches; i_batch++)
    {
        int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

        /* Generate four columns from the input tensor for the GEMM computation */
        q15_t *four_column_buf = buffer_a;

        q7_t *out = output;

        /* This part implements the im2col function */
        for (i_out_y = 0; i_out_y < output_y; i_out_y++)
        {
            for (i_out_x = 0; i_out_x < output_x; i_out_x++)
            {
                for (i_ker_y = i_out_y * stride_y - pad_y; i_ker_y < i_out_y * stride_y - pad_y + kernel_y; i_ker_y++)
                {
                    for (i_ker_x = i_out_x * stride_x - pad_x; i_ker_x < i_out_x * stride_x - pad_x + kernel_x; i_ker_x++)
                    {
                        if (i_ker_y < 0 || i_ker_y >= input_y || i_ker_x < 0 || i_ker_x >= input_x)
                        {
                            /* Filling 0 for out-of-bound paddings */
                            memset(four_column_buf, 0, sizeof(q15_t) * input_ch);
                        }
                        else
                        {
                            /* Copying the pixel data to column */
                            arm_q7_to_q15_with_offset(input + (i_ker_y * input_x + i_ker_x) * input_ch, four_column_buf, input_ch, input_offset);
                        }
                        four_column_buf += input_ch;
                    }
                }

                /* Computation is performed for every 4 columns */
                if (four_column_buf == buffer_a + 4 * input_ch * kernel_y * kernel_x)
                {
                    out =
                        arm_nn_mat_mult_kernel_s8_s16_4col(kernel,
                                                           buffer_a,
                                                           output_ch,
                                                           output_shift,
                                                           output_mult,
                                                           out_offset,
                                                           out_activation_min,
                                                           out_activation_max,
                                                           input_ch * kernel_y * kernel_x,
                                                           bias,
                                                           out);

                    /* counter reset */
                    four_column_buf = buffer_a;
                }
            }
        }

        q15_t *four_column_buf_mid = buffer_a;

        /* if at least two leftover columns remain, consume two with the 2-column kernel */
        if (four_column_buf >= four_column_buf_mid + 2 * input_ch * kernel_y * kernel_x) {
            out =
                arm_nn_mat_mult_kernel_s8_s16(kernel,
                                              four_column_buf_mid,
                                              output_ch,
                                              output_shift,
                                              output_mult,
                                              out_offset,
                                              out_activation_min,
                                              out_activation_max,
                                              input_ch * kernel_y * kernel_x,
                                              bias,
                                              out);

            four_column_buf_mid = buffer_a + 2 * input_ch * kernel_y * kernel_x;

        }

        /* left-over because odd number of output pixels */
        if (four_column_buf != four_column_buf_mid)
        {
            const q7_t *ker_a = kernel;
            int i;

            for (i = 0; i < output_ch; i++)
            {
                /* Load the accumulator with bias first */
                q31_t sum = bias[i];

                /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
                const q15_t *ip_as_col = four_column_buf_mid;

                /* 4 multiply and accumulates are done in one loop. */
                uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

                while (col_count)
                {
                    q31_t ker_a1, ker_a2;
                    q31_t ip_b1, ip_b2;

                    ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                    ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                    sum = __SMLAD(ker_a1, ip_b1, sum);
                    ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                    sum = __SMLAD(ker_a2, ip_b2, sum);

                    col_count--;
                }
                /* Handle left over mac */
                col_count = (input_ch * kernel_y * kernel_x) & 0x3;
                while (col_count)
                {
                    q7_t ker_a1 = *ker_a++;
                    q15_t ip_b1 = *ip_as_col++;
                    sum += ker_a1 * ip_b1;
                    col_count--;
                }

                sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
                sum += out_offset;
                sum = MAX(sum, out_activation_min);
                sum = MIN(sum, out_activation_max);
                *out++ = (q7_t)sum;
            }
        }

        /* advance the source and destination pointers for the next batch */
        input += (input_x * input_y * input_ch);
        output += (output_x * output_y * output_ch);
    }

    /* Return to application */
    return ARM_MATH_SUCCESS;
}

/**
 * @} end of NNConv group
 */
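Caller-side note (not from the source): the easy-to-miss contract here is the size of buffer_a, which must hold four full im2col columns. A sizing sketch under hypothetical shapes:

/* hypothetical shapes, for illustration only */
enum { SK_IN_CH = 16, SK_KER_X = 3, SK_KER_Y = 3 };
/* four-column im2col scratch required by arm_convolve_s8_4col() */
static q15_t sk_buffer_a[4 * SK_IN_CH * SK_KER_X * SK_KER_Y];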
@ -0,0 +1,245 @@
/*
 * Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* ----------------------------------------------------------------------
 * This file is MODIFIED from Arm CMSIS NN Library.
 *
 * Project: TinyEngine
 * Title:   arm_nn_mat_mult_kernel3_input3_s8_s16.c
 * Description: Matrix-multiplication function for convolution (input channel = 3 and kernel size = 3).
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Original Project: CMSIS NN Library
 * Original Title:   arm_nn_mat_mult_kernel_s8_s16.c
 *
 * Target Processor: Cortex-M cores
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"

/*
 * Matrix-multiplication function for convolution with per-channel requantization.
 *
 * Refer header file for details.
 *
 */

q7_t *arm_nn_mat_mult_kernel3_input3_s8_s16(const q7_t *input_a,
                                            const q15_t *input_b,
                                            const uint16_t output_ch,
                                            const int32_t *out_shift,
                                            const int32_t *out_mult,
                                            const int32_t out_offset,
                                            const int16_t activation_min,
                                            const int16_t activation_max,
                                            const uint16_t num_col_a,
                                            const int32_t *const output_bias,
                                            q7_t *out_0,
                                            q15_t *kbuf)
{
    /* set up the second output pointers */
    q7_t *out_1 = out_0 + output_ch;
    const int32_t *bias = output_bias;

    uint16_t row_count = output_ch / 2;
    const q15_t *ksrc = &kbuf[0];
    /* this loop over rows in A */
    while (row_count)
    {
        /* setup pointers for B */
        const q15_t *ip_b0 = input_b;
        const q15_t *ip_b1 = ip_b0 + num_col_a;

        /* align the second pointer for A */
        const q15_t *ksrc2 = ksrc + 27;
        const q31_t *ksrc_31 = (const q31_t *) ksrc;
        const q31_t *ksrc2_31 = (const q31_t *) ksrc2;

        /* Init accumulator with bias for channel N and N + 1 */
        q31_t ch_0_out_0 = *bias;
        q31_t ch_0_out_1 = *bias++;
        q31_t ch_1_out_0 = *bias;
        q31_t ch_1_out_1 = *bias++;

        //------------------4
        q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
        q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[0], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[0], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[0], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[0], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[1], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[1], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[1], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[1], b1, ch_1_out_1);

        //------------------8
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[2], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[2], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[2], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[2], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[3], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[3], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[3], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[3], b1, ch_1_out_1);

        //------------------12
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[4], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[4], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[4], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[4], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[5], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[5], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[5], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[5], b1, ch_1_out_1);

        //------------------16
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);
        ch_0_out_0 = __SMLAD(ksrc_31[6], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[6], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[6], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[6], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[7], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[7], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[7], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[7], b1, ch_1_out_1);

        //------------------20
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);
        ch_0_out_0 = __SMLAD(ksrc_31[8], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[8], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[8], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[8], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[9], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[9], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[9], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[9], b1, ch_1_out_1);

        //------------------24
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);
        ch_0_out_0 = __SMLAD(ksrc_31[10], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[10], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[10], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[10], b1, ch_1_out_1);

        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);

        ch_0_out_0 = __SMLAD(ksrc_31[11], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[11], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[11], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[11], b1, ch_1_out_1);

        //------------------25,26,27
        b0 = arm_nn_read_q15x2_ia(&ip_b0);
        b1 = arm_nn_read_q15x2_ia(&ip_b1);
        ch_0_out_0 = __SMLAD(ksrc_31[12], b0, ch_0_out_0);
        ch_0_out_1 = __SMLAD(ksrc_31[12], b1, ch_0_out_1);
        ch_1_out_0 = __SMLAD(ksrc2_31[12], b0, ch_1_out_0);
        ch_1_out_1 = __SMLAD(ksrc2_31[12], b1, ch_1_out_1);
        q15_t _b0 = *ip_b0++;
        q15_t _b1 = *ip_b1++;

        ch_0_out_0 += ksrc[26] * _b0;
        ch_0_out_1 += ksrc[26] * _b1;
        ch_1_out_0 += ksrc2[26] * _b0;
        ch_1_out_1 += ksrc2[26] * _b1;

        ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
        ch_0_out_0 += out_offset;
        ch_0_out_0 = MAX(ch_0_out_0, activation_min);
        ch_0_out_0 = MIN(ch_0_out_0, activation_max);
        *out_0++ = (q7_t)ch_0_out_0;

        ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
        ch_0_out_1 += out_offset;
        ch_0_out_1 = MAX(ch_0_out_1, activation_min);
        ch_0_out_1 = MIN(ch_0_out_1, activation_max);
        *out_1++ = (q7_t)ch_0_out_1;
        out_mult++;
        out_shift++;

        ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
        ch_1_out_0 += out_offset;
        ch_1_out_0 = MAX(ch_1_out_0, activation_min);
        ch_1_out_0 = MIN(ch_1_out_0, activation_max);
        *out_0++ = (q7_t)ch_1_out_0;

        ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
        ch_1_out_1 += out_offset;
        ch_1_out_1 = MAX(ch_1_out_1, activation_min);
        ch_1_out_1 = MIN(ch_1_out_1, activation_max);
        *out_1++ = (q7_t)ch_1_out_1;
        out_mult++;
        out_shift++;

        /* skip row */
        ksrc += 54;
        row_count--;
    }

    out_0 += output_ch;

    /* return the new output pointer with offset */
    return out_0;
}
@ -0,0 +1,174 @@
/*
 * Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* ----------------------------------------------------------------------
 * This file is MODIFIED from Arm CMSIS NN Library.
 *
 * Project: TinyEngine
 * Title:   arm_nn_mat_mult_kernel_s8_s16_reordered_8mul.c
 * Description: Matrix-multiplication function for convolution with reordered columns (input channels a multiple of 8).
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Original Project: CMSIS NN Library
 * Original Title:   arm_nn_mat_mult_kernel_s8_s16_reordered.c
 *
 * Target Processor: Cortex-M cores
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"

/*
 * Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
 * requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
 *
 * Refer header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
 * read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
 * unifying these two functions is a potential future improvement.
 *
 */

q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_8mul(const q7_t *input_a,
                                                   const q15_t *input_b,
                                                   const uint16_t output_ch,
                                                   const int32_t *out_shift,
                                                   const int32_t *out_mult,
                                                   const int32_t out_offset,
                                                   const int16_t activation_min,
                                                   const int16_t activation_max,
                                                   const uint16_t num_col_a,
                                                   const int32_t *const output_bias,
                                                   q7_t *out_0)
{
    /* set up the second output pointers */
    q7_t *out_1 = out_0 + output_ch;
    const int32_t *bias = output_bias;

    uint16_t row_count = output_ch / 2;
    const q7_t *ip_a0 = input_a;
    /* this loop over rows in A */
    while (row_count)
    {
        /* setup pointers for B */
        const q15_t *ip_b0 = input_b;
        const q15_t *ip_b1 = ip_b0 + num_col_a;

        /* align the second pointer for A */
        const q7_t *ip_a1 = ip_a0 + num_col_a;

        /* Init accumulator with bias for channel N and N + 1 */
        q31_t ch_0_out_0 = *bias;
        q31_t ch_0_out_1 = *bias++;
        q31_t ch_1_out_0 = *bias;
        q31_t ch_1_out_1 = *bias++;

        uint16_t col_count = num_col_a / 8;
        /* accumulate over the vector */
        while (col_count)
        {
            q31_t a01, a02, a11, a12;
            q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
            q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);

            ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
            ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
            ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
            b0 = arm_nn_read_q15x2_ia(&ip_b0);
            ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);

            b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
            ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
            b0 = arm_nn_read_q15x2_ia(&ip_b0);
            ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);

            b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);

            ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
            ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
            ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
            b0 = arm_nn_read_q15x2_ia(&ip_b0);
            ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);

            b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
            ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
            ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);

            col_count--;
        } /* while over col_count */

        ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
        ch_0_out_0 += out_offset;
        ch_0_out_0 = MAX(ch_0_out_0, activation_min);
        ch_0_out_0 = MIN(ch_0_out_0, activation_max);
        *out_0++ = (q7_t)ch_0_out_0;

        ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
        ch_0_out_1 += out_offset;
        ch_0_out_1 = MAX(ch_0_out_1, activation_min);
        ch_0_out_1 = MIN(ch_0_out_1, activation_max);
        *out_1++ = (q7_t)ch_0_out_1;
        out_mult++;
        out_shift++;

        ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
        ch_1_out_0 += out_offset;
        ch_1_out_0 = MAX(ch_1_out_0, activation_min);
        ch_1_out_0 = MIN(ch_1_out_0, activation_max);
        *out_0++ = (q7_t)ch_1_out_0;

        ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
        ch_1_out_1 += out_offset;
        ch_1_out_1 = MAX(ch_1_out_1, activation_min);
        ch_1_out_1 = MIN(ch_1_out_1, activation_max);
        *out_1++ = (q7_t)ch_1_out_1;
        out_mult++;
        out_shift++;

        /* skip row */
        ip_a0 += num_col_a;
        row_count--;
    }

    out_0 += output_ch;

    /* return the new output pointer with offset */
    return out_0;
}
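For readers without the CMSIS sources at hand, a scalar model (an assumption based on the standard read_and_pad_reordered() semantics, not code from this tree) of why the weights must be pre-reordered: SXTB16 sign-extends bytes 0 and 2 of a word in one step, so four consecutive q7 values {a, b, c, d} come out interleaved as the pairs (a, c) and (b, d), which the offline weight reordering compensates for.

/* scalar model: {a, b, c, d} -> out1 = (a, c), out2 = (b, d) */
static const q7_t *read_and_pad_reordered_model(const q7_t *src, q31_t *out1, q31_t *out2) {
    q15_t a = src[0], b = src[1], c = src[2], d = src[3]; /* q7 -> q15 sign extension */
    *out1 = (q31_t) ((uint16_t) a | ((uint32_t) (uint16_t) c << 16));
    *out2 = (q31_t) ((uint16_t) b | ((uint32_t) (uint16_t) d << 16));
    return src + 4;
}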
@ -0,0 +1,215 @@
/*
 * Copyright (C) 2010-2020 Arm Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* ----------------------------------------------------------------------
 * This file is MODIFIED from Arm CMSIS NN Library.
 *
 * Project: TinyEngine
 * Title:   arm_nn_mat_mult_kernel_s8_s16_reordered_oddch.c
 * Description: Matrix-multiplication function for convolution with reordered columns (odd number of output channels).
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Original Project: CMSIS NN Library
 * Original Title:   arm_nn_mat_mult_kernel_s8_s16_reordered.c
 *
 * Target Processor: Cortex-M cores
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"

/*
 * Matrix-multiplication with re-ordered input and bias inputs for convolution with per-channel
 * requantization. The re-ordering is a consequence of the sign extension performed by the SXTB16 instruction.
 *
 * Refer header file for details. This function differs from arm_nn_mat_mult_kernel_s8_s16() in that it uses
 * read_and_pad_reordered() instead of read_and_pad(). Investigating the cycles impact and
 * unifying these two functions is a potential future improvement.
 *
 */

q7_t *arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(const q7_t *input_a,
                                                    const q15_t *input_b,
                                                    const uint16_t output_ch,
                                                    const int32_t *out_shift,
                                                    const int32_t *out_mult,
                                                    const int32_t out_offset,
                                                    const int16_t activation_min,
                                                    const int16_t activation_max,
                                                    const uint16_t num_col_a,
                                                    const int32_t *const output_bias,
                                                    q7_t *out_0)
{
#if defined(ARM_MATH_LOOPUNROLL) && defined(ARM_MATH_DSP)
    /* set up the second output pointers */
    q7_t *out_1 = out_0 + output_ch;
    const int32_t *bias = output_bias;

    uint16_t row_count = output_ch / 2;
    const q7_t *ip_a0 = input_a;
    /* this loop over rows in A */
    while (row_count)
    {
        /* setup pointers for B */
        const q15_t *ip_b0 = input_b;
        const q15_t *ip_b1 = ip_b0 + num_col_a;

        /* align the second pointer for A */
        const q7_t *ip_a1 = ip_a0 + num_col_a;

        /* Init accumulator with bias for channel N and N + 1 */
        q31_t ch_0_out_0 = *bias;
        q31_t ch_0_out_1 = *bias++;
        q31_t ch_1_out_0 = *bias;
        q31_t ch_1_out_1 = *bias++;

        uint16_t col_count = num_col_a / 4;
        /* accumulate over the vector */
        while (col_count)
        {
            q31_t a01, a02, a11, a12;
            q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
            q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);

            ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
            ip_a1 = read_and_pad_reordered(ip_a1, &a11, &a12);
            ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a11, b0, ch_1_out_0);
            b0 = arm_nn_read_q15x2_ia(&ip_b0);
            ch_1_out_1 = __SMLAD(a11, b1, ch_1_out_1);

            b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
            ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);
            ch_1_out_0 = __SMLAD(a12, b0, ch_1_out_0);
            ch_1_out_1 = __SMLAD(a12, b1, ch_1_out_1);

            col_count--;
        } /* while over col_count */

        ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
        ch_0_out_0 += out_offset;
        ch_0_out_0 = MAX(ch_0_out_0, activation_min);
        ch_0_out_0 = MIN(ch_0_out_0, activation_max);
        *out_0++ = (q7_t)ch_0_out_0;

        ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
        ch_0_out_1 += out_offset;
        ch_0_out_1 = MAX(ch_0_out_1, activation_min);
        ch_0_out_1 = MIN(ch_0_out_1, activation_max);
        *out_1++ = (q7_t)ch_0_out_1;
        out_mult++;
        out_shift++;

        ch_1_out_0 = arm_nn_requantize(ch_1_out_0, *out_mult, *out_shift);
        ch_1_out_0 += out_offset;
        ch_1_out_0 = MAX(ch_1_out_0, activation_min);
        ch_1_out_0 = MIN(ch_1_out_0, activation_max);
        *out_0++ = (q7_t)ch_1_out_0;

        ch_1_out_1 = arm_nn_requantize(ch_1_out_1, *out_mult, *out_shift);
        ch_1_out_1 += out_offset;
        ch_1_out_1 = MAX(ch_1_out_1, activation_min);
        ch_1_out_1 = MIN(ch_1_out_1, activation_max);
        *out_1++ = (q7_t)ch_1_out_1;
        out_mult++;
        out_shift++;

        /* skip row */
        ip_a0 += num_col_a;
        row_count--;
    }

    if (output_ch & 1)
    {
        /* setup pointers for B */
        const q15_t *ip_b0 = input_b;
        const q15_t *ip_b1 = ip_b0 + num_col_a;

        /* Init accumulator with bias for the last (odd) channel */
        q31_t ch_0_out_0 = *bias;
        q31_t ch_0_out_1 = ch_0_out_0;

        int32_t col_count = num_col_a / 4;
        while (col_count)
        {
            q31_t a01, a02;
            q31_t b0 = arm_nn_read_q15x2_ia(&ip_b0);
            q31_t b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ip_a0 = read_and_pad_reordered(ip_a0, &a01, &a02);

            ch_0_out_0 = __SMLAD(a01, b0, ch_0_out_0);
            ch_0_out_1 = __SMLAD(a01, b1, ch_0_out_1);

            b0 = arm_nn_read_q15x2_ia(&ip_b0);
            b1 = arm_nn_read_q15x2_ia(&ip_b1);

            ch_0_out_0 = __SMLAD(a02, b0, ch_0_out_0);
            ch_0_out_1 = __SMLAD(a02, b1, ch_0_out_1);

            col_count--;
        } /* while over col_count */

        ch_0_out_0 = arm_nn_requantize(ch_0_out_0, *out_mult, *out_shift);
        ch_0_out_0 += out_offset;
        ch_0_out_0 = MAX(ch_0_out_0, activation_min);
        ch_0_out_0 = MIN(ch_0_out_0, activation_max);
        *out_0++ = (q7_t)ch_0_out_0;

        ch_0_out_1 = arm_nn_requantize(ch_0_out_1, *out_mult, *out_shift);
        ch_0_out_1 += out_offset;
        ch_0_out_1 = MAX(ch_0_out_1, activation_min);
|
||||
ch_0_out_1 = MIN(ch_0_out_1, activation_max);
|
||||
*out_1++ = (q7_t)ch_0_out_1;
|
||||
}
|
||||
|
||||
out_0 += output_ch;
|
||||
|
||||
/* return the new output pointer with offset */
|
||||
return out_0;
|
||||
#else
|
||||
(void)input_a;
|
||||
(void)input_b;
|
||||
(void)output_ch;
|
||||
(void)out_shift;
|
||||
(void)out_mult;
|
||||
(void)out_offset;
|
||||
(void)activation_min;
|
||||
(void)activation_max;
|
||||
(void)num_col_a;
|
||||
(void)output_bias;
|
||||
(void)out_0;
|
||||
/* To be completed */
|
||||
return NULL;
|
||||
#endif
|
||||
}
|
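For readers unfamiliar with the CMSIS-NN requantization idiom, every accumulator in the epilogue above goes through the same four steps. A minimal scalar sketch of those steps (the helper name requantize_clamp is illustrative, not part of the library):

#include "arm_nnsupportfunctions.h" /* arm_nn_requantize, MAX, MIN, q7_t, q31_t */

/* Scalar equivalent of the per-channel epilogue above: scale a 32-bit
 * accumulator by a fixed-point multiplier/shift, add the output offset,
 * then clamp to the activation range before narrowing to int8. */
static q7_t requantize_clamp(q31_t acc, int32_t mult, int32_t shift,
                             int32_t out_offset, int32_t act_min, int32_t act_max)
{
    acc = arm_nn_requantize(acc, mult, shift);
    acc += out_offset;
    acc = MAX(acc, act_min);
    acc = MIN(acc, act_max);
    return (q7_t)acc;
}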
57
TinyEngine/src/kernels/int_only/avgpooling.c
Normal file
@ -0,0 +1,57 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   avgpooling.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"

tinyengine_status avg_pooling(const q7_t* input, const uint16_t input_h, const uint16_t input_w,
                              const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
                              const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
                              const int32_t out_activation_max, q7_t* output)
{
    int h, w, c;
    int sh, sw;
    const int divider_half = (sample_h * sample_w) / 2;
    for (c = 0; c < input_c; c++) {
        for (h = 0; h < output_h; h++) {
            for (w = 0; w < output_w; w++) {
                int avg = 0;

                for (sh = 0; sh < sample_h; sh++) {
                    int height = sh + h * sample_h;
                    for (sw = 0; sw < sample_w; sw++) {
                        int width = sw + w * sample_w;
                        avg += input[(width + height * input_w) * input_c + c];
                    }
                }

                /* rounded division: bias the sum by half the window size toward its sign */
                if (avg > 0)
                    avg += divider_half;
                else
                    avg -= divider_half;

                int out = avg / (sample_h * sample_w);
                out = TN_MAX(out, out_activation_min);
                out = TN_MIN(out, out_activation_max);
                output[(w + h * output_w) * input_c + c] = out;
            }
        }
    }

    return STATE_SUCCESS;
}
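The sign-aware bias applied before the division above implements round-to-nearest integer division. A standalone sketch of the same trick (the helper name is illustrative):

/* Round-to-nearest division: bias the numerator by half the divisor toward
 * its own sign before C's truncating division. Example: a 2x2 window
 * summing to 7 gives (7 + 2) / 4 = 2, where plain truncation gives 1. */
static int rounded_div(int sum, int divisor)
{
    return (sum > 0 ? sum + divisor / 2 : sum - divisor / 2) / divisor;
}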
43
TinyEngine/src/kernels/int_only/concat_ch.c
Normal file
@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   concat_ch.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include <string.h> /* memcpy */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"

tinyengine_status concat_ch(const q7_t *input1, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input1_ch, const q7_t *input2,
        const uint16_t input2_ch, q7_t *output) {
    int elements = input_y * input_x;

    while (elements--) {
        /* place the first input */
        memcpy(output, input1, input1_ch);
        input1 += input1_ch;
        output += input1_ch;

        /* place the second input */
        memcpy(output, input2, input2_ch);
        input2 += input2_ch;
        output += input2_ch;
    }

    return STATE_SUCCESS;
}
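As a usage sketch, channel-concatenating two HWC feature maps (buffer names and sizes are illustrative; the declaration of concat_ch is assumed to be visible through a header):

#include "tinyengine_function.h" /* q7_t */

/* Concatenate a 16-channel and an 8-channel 10x10 HWC map into one
 * 24-channel map; each pixel keeps input1's channels first. */
static void concat_example(void)
{
    static q7_t feat_a[10 * 10 * 16];
    static q7_t feat_b[10 * 10 * 8];
    static q7_t merged[10 * 10 * 24];
    concat_ch(feat_a, 10, 10, 16, feat_b, 8, merged);
}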
127
TinyEngine/src/kernels/int_only/convolve_1x1_s8.c
Normal file
@ -0,0 +1,127 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    /* this kernel requires the channel count to be a multiple of 4 */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = arm_nn_mat_mult_kernel_s8_s16_reordered(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
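Functionally, the SIMD pipeline above computes a plain per-pixel GEMM. A naive scalar reference of the same arithmetic, useful for unit-testing the optimized kernel (the function name is illustrative, not part of the library):

#include "arm_nnsupportfunctions.h" /* arm_nn_requantize, MAX, MIN, q7_t, q31_t */

/* Naive int8 1x1 convolution over an HWC tensor:
 * out[p][o] = requantize(bias[o] + sum_c (in[p][c] + in_off) * w[o][c]). */
static void conv_1x1_s8_ref(const q7_t *in, int num_pixels, int in_ch,
                            const q7_t *w, const int32_t *bias,
                            const int32_t *mult, const int32_t *shift,
                            int32_t in_off, int32_t out_off,
                            int32_t act_min, int32_t act_max,
                            int out_ch, q7_t *out)
{
    for (int p = 0; p < num_pixels; p++) {
        for (int o = 0; o < out_ch; o++) {
            q31_t acc = bias[o];
            for (int c = 0; c < in_ch; c++)
                acc += (in[p * in_ch + c] + in_off) * w[o * in_ch + c];
            acc = arm_nn_requantize(acc, mult[o], shift[o]);
            acc += out_off;
            acc = MAX(acc, act_min);
            acc = MIN(acc, act_max);
            *out++ = (q7_t)acc;
        }
    }
}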
158
TinyEngine/src/kernels/int_only/convolve_1x1_s8_SRAM.c
Normal file
@ -0,0 +1,158 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_SRAM.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

//#define FULL_UNROLL

tinyengine_status convolve_1x1_s8_SRAM(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf) {
    /* this kernel requires the channel count to be a multiple of 4 */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    /* the kernels fit in the buffer: expand them into kbuf once, so the
     * inner GEMM can consume 16-bit weights without re-expanding per pixel */
    const q7_t *ip_a0 = kernel;
    for (int i = 0; i < output_ch; i += 2) {
        q31_t *dst1 = (q31_t *)kbuf + i * input_ch / 2; /* each q31_t stores 2 q15 weights */
        q31_t *dst2 = dst1 + input_ch / 2;

        /* align the second pointer for A */
        const q7_t *ip_a1 = ip_a0 + input_ch;

        uint16_t col_count = input_ch / 4;
        /* expand 4 weights per channel per iteration */
        while (col_count) {
            ip_a0 = read_and_pad_reordered(ip_a0, &dst1[0], &dst1[1]);
            ip_a1 = read_and_pad_reordered(ip_a1, &dst2[0], &dst2[1]);

            dst1 += 2;
            dst2 += 2;
            col_count--;
        } /* while over col_count */

        /* skip the row consumed through ip_a1 */
        ip_a0 += input_ch;
    }

    /* output stationary */
    for (i_element = 0; i_element + 1 < num_elements; i_element += 2) {
        const q7_t *src = &input[i_element * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_s16(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch,
                bias, out, (q31_t *)kbuf);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
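A sizing sketch for the kernel buffer consumed above, assuming the layout written by the fill loop (every output channel keeps input_ch weights expanded to 16 bits, packed two per q31_t word; the helper is illustrative):

#include <stddef.h>

/* q15_t capacity the pre-expanded kernel buffer must provide:
 * output_ch rows of input_ch 16-bit weights (input_ch / 2 q31_t words each). */
static size_t kbuf_q15_len(int output_ch, int input_ch)
{
    return (size_t)output_ch * (size_t)input_ch;
}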
123
TinyEngine/src/kernels/int_only/convolve_1x1_s8_ch16.c
Normal file
@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch16.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch16(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch16(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
124
TinyEngine/src/kernels/int_only/convolve_1x1_s8_ch24.c
Normal file
@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch24.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch24(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch24(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
124
TinyEngine/src/kernels/int_only/convolve_1x1_s8_ch48.c
Normal file
@ -0,0 +1,124 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch48.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch48(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch48(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
123
TinyEngine/src/kernels/int_only/convolve_1x1_s8_ch8.c
Normal file
@ -0,0 +1,123 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_ch8.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_ch8(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered_ch8(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
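The ch8/ch16/ch24/ch48 variants above differ only in the specialized matrix-multiplication kernel they call; the variant is normally chosen ahead of time by TinyEngine's code generator. A purely illustrative runtime dispatch over the same family (not part of the library; variant declarations assumed visible):

#include "tinyengine_function.h" /* tinyengine_status, q7_t, q15_t */

typedef tinyengine_status (*conv_1x1_fn)(const q7_t *, const uint16_t, const uint16_t,
        const uint16_t, const q7_t *, const int32_t *, const int32_t *, const int32_t *,
        const int32_t, const int32_t, const int32_t, const int32_t, q7_t *,
        const uint16_t, const uint16_t, const uint16_t, q15_t *);

/* Pick the specialized 1x1 kernel for common channel counts; fall back to
 * the generic version (which itself requires input_ch % 4 == 0). */
static conv_1x1_fn pick_conv_1x1(uint16_t input_ch)
{
    switch (input_ch) {
        case 8:  return convolve_1x1_s8_ch8;
        case 16: return convolve_1x1_s8_ch16;
        case 24: return convolve_1x1_s8_ch24;
        case 48: return convolve_1x1_s8_ch48;
        default: return convolve_1x1_s8;
    }
}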
135
TinyEngine/src/kernels/int_only/convolve_1x1_s8_kbuf.c
Normal file
@ -0,0 +1,135 @@
/* ----------------------------------------------------------------------
 * Project:     TinyEngine
 * Title:       convolve_1x1_s8_kbuf.c
 * Description: pointwise (1x1) convolution that nests its loops according to the runtime buffer size
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_kbuf(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const q31_t *kbuf, const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    /* this kernel requires the channel count to be a multiple of 4 */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    volatile int sbufsize = get_sbuffer_size();
    int maxcol = sbufsize / input_ch / 2; /* column pairs that fit in the scratch buffer */
    (void)maxcol;

    /* the kernel has already been expanded into kbuf (see convolve_1x1_s8_SRAM.c) */
    /* output stationary */
    for (i_element = 0; i_element + 1 < num_elements; i_element += 2) {
        const q7_t *src = &input[i_element * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_s16(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch,
                bias, out, kbuf);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
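The maxcol value above estimates how many im2col column pairs fit in the scratch buffer. A standalone restatement of that arithmetic, assuming get_sbuffer_size() reports capacity in q15_t entries (an assumption, not confirmed by this file):

/* Column pairs of expanded input (input_ch q15_t values per column) that
 * fit in a scratch buffer of sbufsize q15_t entries. */
static int max_column_pairs(int sbufsize, int input_ch)
{
    return sbufsize / input_ch / 2;
}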
128
TinyEngine/src/kernels/int_only/convolve_1x1_s8_oddch.c
Normal file
@ -0,0 +1,128 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_oddch.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_oddch(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf) {
    /* this kernel requires the input channel count to be a multiple of 4 */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int32_t i_element;
    (void)input_x;
    (void)input_y;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        /* use the odd-output-channel variant of the reordered matmul */
        out = arm_nn_mat_mult_kernel_s8_s16_reordered_oddch(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
153
TinyEngine/src/kernels/int_only/convolve_1x1_s8_skip_pad.c
Normal file
@ -0,0 +1,153 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title:   convolve_1x1_s8_skip_pad.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"
#include "img2col_element.h"
#include "kernel_element.h"

#define DIM_KER_X (1U)
#define DIM_KER_Y (1U)

tinyengine_status convolve_1x1_s8_skip_pad(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, const q7_t *kernel,
        const int32_t *bias, const int32_t *output_shift,
        const int32_t *output_mult, const int32_t out_offset,
        const int32_t input_offset, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output, const uint16_t output_x,
        const uint16_t output_y, const uint16_t output_ch, q15_t *runtime_buf,
        const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
    int32_t i_element;

    /* partial (two-column) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int32_t num_elements = output_x * output_y;
    const int channel_div4 = (input_ch >> 2);

    /* input offset packed into both q15 lanes; consumed by the img2col macros */
    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    int h = 0, w = 0;
    for (i_element = 0; i_element < num_elements / 2; i_element++) {
        /* fill the buffer for partial im2col - two columns at a time */
        const q7_t *src = &input[i_element * input_ch * 2];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int skip = 0;
        /* first element of the pair */
        if (w < pad_l || w >= input_x - pad_r) {
            if (h < pad_t || h >= input_y - pad_b) {
                skip++;
            }
        }
        /* move to the next element */
        w++;
        if (w == input_x) {
            h++;
            w = 0;
        }
        /* second element of the pair */
        if (w < pad_l || w >= input_x - pad_r) {
            if (h < pad_t || h >= input_y - pad_b) {
                skip++;
            }
        }
        /* move past the second element */
        w++;
        if (w == input_x) {
            h++;
            w = 0;
        }
        if (skip == 2) {
            /* both outputs fall in the padded border: skip the GEMM */
            out += output_ch * 2;
            continue;
        }

        int cnt = channel_div4; /* two columns */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        out = mat_mult_kernel_s8_s16_reordered(kernel,
                two_column_buffer, output_ch, output_shift, output_mult,
                (q7_t)out_offset, out_activation_min,
                out_activation_max, input_ch * DIM_KER_Y * DIM_KER_X,
                bias, out);
    }

    /* check if there is an odd column left over for computation */
    if (num_elements & 0x1) {
        int32_t i_ch_out;
        const q7_t *ker_a = kernel;
        const q7_t *src = &input[(num_elements - 1) * input_ch];
        q15_t *dst = two_column_buffer;

        /* scratch registers consumed by the img2col macros */
        q31_t in_q7x4;
        q31_t in_q15x2_1;
        q31_t in_q15x2_2;
        q31_t out_q15x2_1;
        q31_t out_q15x2_2;

        int cnt = channel_div4; /* one leftover column */
        while (cnt > 0) {
            q7_q15_offset_reordered_ele(src, dst)
            cnt--;
        }

        for (i_ch_out = 0; i_ch_out < output_ch; i_ch_out++) {
            q31_t sum = bias[i_ch_out];

            /* the im2col buffer holds the input as one rearranged column */
            const q15_t *ip_as_col = runtime_buf;
            uint16_t col_count = (input_ch * DIM_KER_X * DIM_KER_Y) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t in_b1, in_b2;
                ker_a = read_and_pad_reordered(ker_a, &ker_a1, &ker_a2);

                in_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, in_b1, sum);
                in_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, in_b2, sum);

                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i_ch_out], output_shift[i_ch_out]);
            sum += out_offset;
            sum = MAX(sum, out_activation_min);
            sum = MIN(sum, out_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
213
TinyEngine/src/kernels/int_only/convolve_s8_kernel2x3_inputch3_stride2_pad1.c
Normal file
@ -0,0 +1,213 @@
/* ----------------------------------------------------------------------
 * Project:     TinyEngine
 * Title:       convolve_s8_kernel2x3_inputch3_stride2_pad1.c
 * Description: for 2x3 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_s8_kernel2x3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
    const int kernel_y = 2;
    const int kernel_x = 3;

    /* this could be checked during code generation instead, for performance */
    if (input_x % 2 != 0 || input_y % 2 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int16_t i_out_y, i_out_x;

    /* generate two columns from the input tensor for a GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    q15_t *kbuf = (q15_t *)get_kernel_buffer();
    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 18]; /* 18 expanded q15 weights per output channel */
        q15_t *dst2 = dst1 + 18;

        const q7_t *ip_a1 = ip_a0 + 18;

        /* 18 weights for each output channel (2x3 kernel x 3 input channels);
         * expand 16 of them 4 at a time, each q31_t storing 2 q15 weights */
        q31_t *dst1_31 = (q31_t *)dst1;
        q31_t *dst2_31 = (q31_t *)dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        /* weights 17 and 18 */
        dst1 = (q15_t *)dst1_31;
        dst2 = (q15_t *)dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;

        /* skip the row consumed through ip_a1 */
        ip_a0 += 18;
    }

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* this part implements the im2col function */
            const int16_t base_idx_y = (i_out_y * 2) - 1;
            const int16_t base_idx_x = (i_out_x * 2) - 1;

            /* scratch registers consumed by the img2col macros */
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load addresses: 8-bit */
            const q7_t *src;
            const q7_t *src2;

            /* store buffers: 16-bit */
            q15_t *dst;
            q15_t *dst2;

            int input_row_offset = 3 * input_x;
            dst = two_column_buf;
            dst2 = dst + 9;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) {
                    /* no padding: load 2 rows x (3 pixels x 3 channels) = 2 x 9 elements */
                    src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;

                    /* 4 * 2 + 1 = 9 */
                    q7_q15_offset_ele(src, dst)
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    /* 4 * 2 + 1 = 9 */
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                } else {
                    /* the first pixel of each row is padding */
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;

                    /* pad the first pixel: 1 x 3 channels per row */
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    /* load the remaining 2 pixels x 3 channels = 6 elements: 4 * 1 + 2 */
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    /* 4 * 1 + 2 = 6 */
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                }
            } else {
                /* pad the whole first row: 9 elements */
                *dst++ = pad_out;
                q31_t *dst_31 = (q31_t *)dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) {
                    /* second row comes from input row 0: 9 elements */
                    src2 = input + (base_idx_x)*input_ch;

                    /* 4 * 2 + 1 = 9 */
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                } else {
                    src2 = input;

                    /* pad the first pixel: 1 x 3 channels */
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    /* load 6 elements */
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                }
            }

            two_column_buf += 18;
            /* the GEMM is performed for every two columns */
            if (two_column_buf == runtime_buf + 2 * 18) {
                out = mat_mult_unloop18_s8_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* return to the application */
    return STATE_SUCCESS;
}
@ -0,0 +1,285 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_s8_kernel3_inputch3_stride2_pad1.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_s8_kernel3_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
        q15_t *dst2 = dst1 + 27;

        const q7_t *ip_a1 = ip_a0 + 27;

        //27 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;
        //elements 25, 26, 27
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst1[2] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;
        dst2[2] = *ip_a1++;

        /* skip the second output channel's weights, already loaded via ip_a1 */
        ip_a0 += 27;
    }
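
    /*
     * kbuf layout after the loop above: output channels sit in consecutive
     * 27-entry q15 blocks, filled two channels per iteration, so the
     * matrix-multiply kernel can walk a channel pair with a single pointer.
     */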

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            const int16_t base_idx_y = (i_out_y * 2) - 1;
            const int16_t base_idx_x = (i_out_x * 2) - 1;
            q15_t *col_buffer = two_column_buf;

            //variables used by the im2col macros
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load address: 8 bit */
            const q7_t *src;
            const q7_t *src2;
            const q7_t *src3;

            /* buffer for load: 16 bit */
            q15_t *dst;
            q15_t *dst2;
            q15_t *dst3;

            int input_row_offset = 3 * input_x;
            dst = col_buffer;
            dst2 = dst + 9;
            dst3 = dst2 + 9;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) { //load all for now and unroll all
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q7_q15_offset_ele(src, dst)
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    //
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    //4 * 1 + 2 = 6
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            } else { // first row is padded
                //pad the whole first row: 9 elements (1 scalar + 4 packed pairs)
                *dst++ = pad_out;
                q31_t *dst_31 = (q31_t *) dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) { //load all for now and unroll all
                    //two remaining rows, 9 elements each
                    src2 = input + (base_idx_x) * input_ch;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q7_q15_offset_ele(src2, dst2)
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    src2 = input;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            }

            two_column_buf += 27;
            /* Computation is performed once every 2 columns are filled */
            if (two_column_buf == runtime_buf + 2 * 27) {

                out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* left-over: an odd number of output pixels leaves one column in the buffer */
    if (two_column_buf != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply-and-accumulates are done in one loop. */
            uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

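            /*
             * __SMLAD(a, b, acc) multiplies the two signed 16-bit halves of a
             * by the corresponding halves of b and adds both products to acc,
             * so each pass of the loop below retires four MACs (two __SMLAD
             * calls, two products each).
             */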
            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle the left-over MACs */
            col_count = (input_ch * kernel_y * kernel_x) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
@ -0,0 +1,300 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_s8_kernel3_stride1_pad1.c
 * Description: for 3x3 convolution kernels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_s8_kernel3_stride1_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value) {
    /* the SIMD path below consumes 4 channels at a time */
    if (input_ch % 4 != 0) {
        return PARAM_NO_SUPPORT;
    }

    /* Partial (two columns) im2col buffer */
    q15_t *two_column_buffer = runtime_buf;
    q7_t *out = output;
    const int channel_div4 = (input_ch >> 2);

    const int16_t inoff16 = input_offset;
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);
    q31_t pad_q15x2 = __PKHBT(pad_value, pad_value, 16);
    q31_t pad_out_q15x2 = __SADD16(pad_q15x2, offset_q15x2);
    int in_row_offset = input_ch * input_x;

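    /*
     * Each 3x3 receptive field is materialized as one im2col column of
     * input_ch * 9 q15 values. Per axis there are three cases: the first
     * row/column of the window may fall on the top/left pad, the last may
     * fall on the bottom/right pad, and interior windows need no padding.
     */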
    for (int i_out_y = 0; i_out_y < output_y; i_out_y++) {
        const int16_t base_idx_y = i_out_y - 1;
        for (int i_out_x = 0; i_out_x < output_x; i_out_x++) {
            const int16_t base_idx_x = i_out_x - 1;
            //Img2col for the 3x3 kernel
            /* Used for SIMD instructions */
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;
            q15_t *col_buffer = &two_column_buffer[0];

            //TODO: swap these two if statements out to reduce overhead
            int ypad_cnt = 0; //no pad by default
            if (base_idx_y == -1) { //pad the first row
                q31_t *dst_31 = (q31_t *) &col_buffer[0];
                int block_cnt = channel_div4; //each iteration pads 4 channels x 3 pixels
                while (block_cnt > 0) { //total: 16 bit * input_ch * 3
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    block_cnt--;
                }
                ypad_cnt = 1;
            }
            else if (base_idx_y + 2 == input_y) { //pad the third row
                q31_t *dst_31 = (q31_t *) &col_buffer[input_ch * 6];
                int block_cnt = channel_div4; //each iteration pads 4 channels x 3 pixels
                while (block_cnt > 0) { //total: 16 bit * input_ch * 3
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    *dst_31++ = pad_out_q15x2;
                    block_cnt--;
                }
                ypad_cnt = 2;
            }

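            /*
             * x-direction handling mirrors the y cases: pad the first im2col
             * column when the window hangs over the left edge, pad the last
             * one when it hangs over the right edge, or load all three
             * columns for interior windows. The load_* / pad_* helpers come
             * from img2col_element.h.
             */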
            if (ypad_cnt == 0) { //all three rows are inside the input
                if (base_idx_x == -1) {
                    /* use pad for the first col */
                    q31_t *dst_31 = (q31_t *) &col_buffer[0];
                    q31_t *dst2_31 = (q31_t *) &col_buffer[input_ch * 3];
                    q31_t *dst3_31 = (q31_t *) &col_buffer[input_ch * 6];

                    pad_3row_1col(dst_31, dst2_31, dst3_31, pad_out_q15x2)

                    /* load input to 2 cols */
                    const q7_t *src = input + base_idx_y * input_x * input_ch;
                    const q7_t *src2 = src + in_row_offset;
                    const q7_t *src3 = src2 + in_row_offset;
                    q15_t *dst = (q15_t *) dst_31;
                    q15_t *dst2 = (q15_t *) dst2_31;
                    q15_t *dst3 = (q15_t *) dst3_31;

                    load_3row_2col(src, src2, src3, dst, dst2, dst3)
                } else if (base_idx_x + 2 == input_x) {
                    /* load 2 cols */
                    const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    const q7_t *src2 = src + in_row_offset;
                    const q7_t *src3 = src2 + in_row_offset;
                    q15_t *dst = &col_buffer[0];
                    q15_t *dst2 = &col_buffer[input_ch * 3];
                    q15_t *dst3 = &col_buffer[input_ch * 6];

                    load_3row_2col(src, src2, src3, dst, dst2, dst3)

                    q31_t *dst_31 = (q31_t *) dst;
                    q31_t *dst2_31 = (q31_t *) dst2;
                    q31_t *dst3_31 = (q31_t *) dst3;

                    /* use pad for the last col */
                    pad_3row_1col(dst_31, dst2_31, dst3_31, pad_out_q15x2)
                } else {
                    /* load 3 cols */
                    const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    const q7_t *src2 = src + in_row_offset;
                    const q7_t *src3 = src2 + in_row_offset;
                    q15_t *dst = &col_buffer[0];
                    q15_t *dst2 = &col_buffer[input_ch * 3];
                    q15_t *dst3 = &col_buffer[input_ch * 6];

                    load_3row_3col(src, src2, src3, dst, dst2, dst3)
                }
            }
            else if (ypad_cnt == 1) { //fill the last two rows
                if (base_idx_x == -1) {
                    /* use pad for the first col */
                    q31_t *dst_31 = (q31_t *) &col_buffer[input_ch * 3];
                    q31_t *dst2_31 = (q31_t *) &col_buffer[input_ch * 6];
                    pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)

                    /* load input to 2 cols */
                    const q7_t *src = input + 0;
                    const q7_t *src2 = src + in_row_offset;
                    q15_t *dst = (q15_t *) dst_31;
                    q15_t *dst2 = (q15_t *) dst2_31;

                    load_2row_2col(src, src2, dst, dst2)
                } else if (base_idx_x + 2 == input_x) {
                    /* load 2 cols */
                    q15_t *dst = &col_buffer[input_ch * 3];
                    q15_t *dst2 = &col_buffer[input_ch * 6];
                    const q7_t *src = input + base_idx_x * input_ch;
                    const q7_t *src2 = src + in_row_offset;

                    load_2row_2col(src, src2, dst, dst2)
                    q31_t *dst_31 = (q31_t *) dst;
                    q31_t *dst2_31 = (q31_t *) dst2;

                    /* use pad for the last col */
                    pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
                }
                else {
                    /* load 3 cols */
                    q15_t *dst = &col_buffer[input_ch * 3];
                    q15_t *dst2 = &col_buffer[input_ch * 6];
                    const q7_t *src = input + base_idx_x * input_ch;
                    const q7_t *src2 = src + in_row_offset;

                    load_2row_3col(src, src2, dst, dst2)
                }
            } else { //fill the first two rows
                if (base_idx_x == -1) {
                    /* use pad for the first col */
                    q31_t *dst_31 = (q31_t *) &col_buffer[0];
                    q31_t *dst2_31 = (q31_t *) &col_buffer[input_ch * 3];

                    pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)

                    /* load input to 2 cols */
                    const q7_t *src = input + (base_idx_y * input_x) * input_ch;
                    const q7_t *src2 = src + in_row_offset;
                    q15_t *dst = (q15_t *) dst_31;
                    q15_t *dst2 = (q15_t *) dst2_31;

                    load_2row_2col(src, src2, dst, dst2)
                } else if (base_idx_x + 2 == input_x) {
                    /* load 2 cols */
                    q15_t *dst = &col_buffer[input_ch * 0];
                    q15_t *dst2 = &col_buffer[input_ch * 3];
                    const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    const q7_t *src2 = src + in_row_offset;

                    load_2row_2col(src, src2, dst, dst2)

                    /* use pad for the last col */
                    q31_t *dst_31 = (q31_t *) dst;
                    q31_t *dst2_31 = (q31_t *) dst2;

                    pad_2row_1col(dst_31, dst2_31, pad_out_q15x2)
                } else {
                    /* load 3 cols */
                    q15_t *dst = &col_buffer[input_ch * 0];
                    q15_t *dst2 = &col_buffer[input_ch * 3];
                    const q7_t *src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    const q7_t *src2 = src + in_row_offset;

                    load_2row_3col(src, src2, dst, dst2)
                }
            }

            two_column_buffer += input_ch * 9;

            /* Computation is performed once every 2 columns are filled */
            if (two_column_buffer == runtime_buf + 2 * input_ch * 9) {
                out = mat_mult_kernel_s8_s16(kernel,
                        runtime_buf,
                        output_ch,
                        output_shift,
                        output_mult,
                        output_offset,
                        output_activation_min,
                        output_activation_max,
                        input_ch * 9,
                        bias,
                        out);
                /* counter reset */
                two_column_buffer = runtime_buf;
            }
        }
    }

    /* left-over: an odd number of output pixels leaves one column in the buffer */
    if (two_column_buffer != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply-and-accumulates are done in one loop. */
            uint16_t col_count = (input_ch * 9) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle the left-over MACs */
            col_count = (input_ch * 3 * 3) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}

/**
 * @} end of NNConv group
 */
@ -0,0 +1,232 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_s8_kernel3x2_inputch3_stride2_pad1.c
 * Description: for 3x2 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_s8_kernel3x2_inputch3_stride2_pad1(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
    const int kernel_y = 3;
    const int kernel_x = 2;

    //check this during code gen for better performance
    if (input_x % 2 != 0 || input_y % 2 != 0) {
        return PARAM_NO_SUPPORT;
    }

    int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 18]; //each q31_t stores 2 elements
        q15_t *dst2 = dst1 + 18;

        const q7_t *ip_a1 = ip_a0 + 18;

        //18 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        //elements 17, 18
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;

        /* skip the second output channel's weights, already loaded via ip_a1 */
        ip_a0 += 18;
    }
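
    /*
     * With a 3x2 kernel and 3 input channels there are 3 * 2 * 3 = 18
     * weights per output channel: four read_and_pad() calls cover 16 of
     * them and the two scalar stores above pick up elements 17 and 18.
     */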

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            const int16_t base_idx_y = (i_out_y * 2) - 1;
            const int16_t base_idx_x = (i_out_x * 2) - 1;
            q15_t *col_buffer = two_column_buf;

            //variables used by the im2col macros
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load address: 8 bit */
            const q7_t *src;
            const q7_t *src2;
            const q7_t *src3;

            /* buffer for load: 16 bit */
            q15_t *dst;
            q15_t *dst2;
            q15_t *dst3;

            int input_row_offset = 3 * input_x;
            dst = col_buffer;
            dst2 = dst + 6;
            dst3 = dst2 + 6;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) {
                    //load all for now and unroll all
                    //3 rows x 2 cols x 3 channels = 18 elements
                    src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //3 * 2 = 6 = 4 * 1 + 2
                    q7_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                } else {
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //pad the first col: 1x3 = 3
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 3 elements per row
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            } else {
                //Padding the first row
                //3x2 = 6 elements
                q31_t *dst_31 = (q31_t *) dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) {
                    //two remaining rows, 6 elements each
                    src2 = input + (base_idx_x) * input_ch;
                    src3 = src2 + input_row_offset;

                    //3 * 2 = 6 = 4 * 1 + 2
                    q7_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q7_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                } else {
                    src2 = input;
                    src3 = src2 + input_row_offset;

                    //pad the first col: 1x3 = 3
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 3 elements per row
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            }

            two_column_buf += 18;
            /* Computation is performed once every 2 columns are filled */
            if (two_column_buf == runtime_buf + 2 * 18) {

                out = mat_mult_unloop18_s8_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_u8_kernel3_inputch3_stride1_pad1.c
 * Description: for 3x3 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_u8_kernel3_stride1_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
        q15_t *dst2 = dst1 + 27;

        const q7_t *ip_a1 = ip_a0 + 27;

        //27 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;
        //elements 25, 26, 27
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst1[2] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;
        dst2[2] = *ip_a1++;

        /* skip the second output channel's weights, already loaded via ip_a1 */
        ip_a0 += 27;
    }
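
    /*
     * This u8 variant mirrors the s8 kernel of the same shape: activations
     * are stored as unsigned 8-bit (q8_t) and are re-centered by
     * input_offset while being widened to q15, so the inner GEMM and the
     * requantization path stay identical.
     */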

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            const int16_t base_idx_y = (i_out_y) - 1;
            const int16_t base_idx_x = (i_out_x) - 1;
            q15_t *col_buffer = two_column_buf;

            //variables used by the im2col macros
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load address: 8 bit */
            const q8_t *src;
            const q8_t *src2;
            const q8_t *src3;

            /* buffer for load: 16 bit */
            q15_t *dst;
            q15_t *dst2;
            q15_t *dst3;

            int input_row_offset = 3 * input_x; //input_ch == 3
            dst = col_buffer;
            dst2 = dst + 9;
            dst3 = dst2 + 9;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) { //load all for now and unroll all
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q8_q15_offset_ele(src, dst)
                    q8_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    //
                    q8_q15_offset_ele(src2, dst2)
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    //4 * 1 + 2 = 6
                    q8_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            } else { // first row is padded
                //pad the whole first row: 9 elements (1 scalar + 4 packed pairs)
                *dst++ = pad_out;
                q31_t *dst_31 = (q31_t *) dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) { //load all for now and unroll all
                    //two remaining rows, 9 elements each
                    src2 = input + (base_idx_x) * input_ch;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q8_q15_offset_ele(src2, dst2)
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    src2 = input;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            }

            two_column_buf += 27;
            /* Computation is performed once every 2 columns are filled */
            if (two_column_buf == runtime_buf + 2 * 27) {

                out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* left-over: an odd number of output pixels leaves one column in the buffer */
    if (two_column_buf != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply-and-accumulates are done in one loop. */
            uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle the left-over MACs */
            col_count = (input_ch * kernel_y * kernel_x) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
@ -0,0 +1,286 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: convolve_u8_kernel3_inputch3_stride2_pad1.c
 * Description: for 3x3 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status convolve_u8_kernel3_inputch3_stride2_pad1(const q8_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q15_t *kbuf, q7_t pad_value) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    const q7_t *ip_a0 = kernel;

    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
        q15_t *dst2 = dst1 + 27;

        const q7_t *ip_a1 = ip_a0 + 27;

        //27 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;
        //elements 25, 26, 27
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst1[2] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;
        dst2[2] = *ip_a1++;

        /* skip the second output channel's weights, already loaded via ip_a1 */
        ip_a0 += 27;
    }
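
    /*
     * Stride-2, pad-1 geometry: output pixel (x, y) reads the 3x3 window
     * whose top-left corner is (2x - 1, 2y - 1) in the input, so base_idx
     * is -1 exactly when the window overlaps the zero-padded border.
     */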

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            const int16_t base_idx_y = (i_out_y * 2) - 1;
            const int16_t base_idx_x = (i_out_x * 2) - 1;
            q15_t *col_buffer = two_column_buf;

            //variables used by the im2col macros
            q31_t in_q7x4;
            q31_t in_q15x2_1;
            q31_t in_q15x2_2;
            q31_t out_q15x2_1;
            q31_t out_q15x2_2;

            /* load address: 8 bit */
            const q8_t *src;
            const q8_t *src2;
            const q8_t *src3;

            /* buffer for load: 16 bit */
            q15_t *dst;
            q15_t *dst2;
            q15_t *dst3;

            int input_row_offset = 3 * input_x;
            dst = col_buffer;
            dst2 = dst + 9;
            dst3 = dst2 + 9;
            if (base_idx_y != -1) {
                if (base_idx_x != -1) { //load all for now and unroll all
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x + base_idx_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q8_q15_offset_ele(src, dst)
                    q8_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    //
                    q8_q15_offset_ele(src2, dst2)
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    //3 cols x 3 channels = 9 elements per row
                    src = input + (base_idx_y * input_x) * input_ch;
                    src2 = src + input_row_offset;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    //4 * 1 + 2 = 6
                    q8_q15_offset_ele(src, dst)
                    *dst++ = *src++ + input_offset;
                    *dst++ = *src++ + input_offset;
                    //
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            } else { // first row is padded
                //pad the whole first row: 9 elements (1 scalar + 4 packed pairs)
                *dst++ = pad_out;
                q31_t *dst_31 = (q31_t *) dst;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                *dst_31++ = pad_out_q15x2;
                if (base_idx_x != -1) { //load all for now and unroll all
                    //two remaining rows, 9 elements each
                    src2 = input + (base_idx_x) * input_ch;
                    src3 = src2 + input_row_offset;

                    //4 * 2 + 1 = 9
                    q8_q15_offset_ele(src2, dst2)
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                } else { //first element is pad
                    src2 = input;
                    src3 = src2 + input_row_offset;

                    //pad the first column: 1x3 = 3
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst2++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    *dst3++ = pad_out;
                    //load 6 elements per row
                    q8_q15_offset_ele(src2, dst2)
                    *dst2++ = *src2++ + input_offset;
                    *dst2++ = *src2++ + input_offset;
                    //
                    q8_q15_offset_ele(src3, dst3)
                    *dst3++ = *src3++ + input_offset;
                    *dst3++ = *src3++ + input_offset;
                }
            }

            two_column_buf += 27;
            /* Computation is performed once every 2 columns are filled */
            if (two_column_buf == runtime_buf + 2 * 27) {

                out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* left-over: an odd number of output pixels leaves one column in the buffer */
    if (two_column_buf != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply-and-accumulates are done in one loop. */
            uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle the left-over MACs */
            col_count = (input_ch * kernel_y * kernel_x) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t) sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
46
TinyEngine/src/kernels/int_only/element_mult.c
Normal file
@ -0,0 +1,46 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: element_mult.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"
#include "arm_nnfunctions.h"

/*
 * Spatial elementwise multiplication for nxnxc * 1x1xc
 */
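/*
 * Per element the math is the usual integer-only broadcast multiply:
 *   out = clamp(output_offset
 *               + requantize((in1 + input1_offset) * (in2 + input2_offset)))
 * with the same 1x1xc vector (input2) reused for every one of the
 * input_h * input_w pixels.
 */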
tinyengine_status element_mult_nx1(const q7_t *input, const uint16_t input_h, const uint16_t input_w,
        const uint16_t input_c, const q7_t *input2, const int16_t input1_offset, const int16_t input2_offset,
        const int16_t output_offset, const int32_t out_activation_min, const int32_t out_activation_max,
        const int32_t output_shift, const int32_t output_mult, q7_t *output)
{
    int c, element;
    for (element = 0; element < input_h * input_w; element++) {
        const q7_t *multiplier = input2;
        for (c = 0; c < input_c; c++) {
            const int32_t input1_val = input1_offset + *input++;
            const int32_t input2_val = input2_offset + *multiplier++;
            int32_t unclamped_result = input1_val * input2_val;
            int32_t clamped_result = output_offset + arm_nn_requantize(unclamped_result, output_mult, output_shift);
            clamped_result = MAX(clamped_result, out_activation_min);
            clamped_result = MIN(clamped_result, out_activation_max);
            *output++ = (q7_t) clamped_result;
        }
    }

    return STATE_SUCCESS;
}
43
TinyEngine/src/kernels/int_only/fully_connected.c
Normal file
@ -0,0 +1,43 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: fully_connected.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"

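/*
 * Applies a dense layer independently at every one of the input_x * input_y
 * pixels, i.e. a 1x1 convolution in float. Weights are laid out row-major
 * as [output_ch][input_ch].
 */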
tinyengine_status fully_connected_fp(
        const float *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const uint16_t output_ch, const float *bias,
        const float *weights, float *output)
{
    int h, w, out_c, in_c;
    for (h = 0; h < input_y; h++) {
        for (w = 0; w < input_x; w++) {
            int pixel_cnt = w + input_x * h;
            for (out_c = 0; out_c < output_ch; out_c++) {
                float intermediate = bias[out_c];
                const float *start_weight = weights + out_c * input_ch;
                const float *start_input = input + input_ch * pixel_cnt;
                float *start_out = output + output_ch * pixel_cnt;
                for (in_c = 0; in_c < input_ch; in_c++) {
                    intermediate += start_weight[in_c] * start_input[in_c];
                }
                start_out[out_c] = intermediate;
            }
        }
    }

    return STATE_SUCCESS;
}
35
TinyEngine/src/kernels/int_only/mat_mul_fp.c
Normal file
@ -0,0 +1,35 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: mat_mul_fp.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"

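/*
 * Plain row-major GEMM: output (matA_row x matB_col) =
 * matA (matA_row x matA_col) * matB (matA_col x matB_col).
 * No SIMD; a straightforward triple loop.
 */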
tinyengine_status mat_mul_fp(
        const float *matA, const uint16_t matA_row, const uint16_t matA_col,
        const float *matB, const uint16_t matB_col, float *output)
{
    int m, n, i;
    for (n = 0; n < matA_row; n++) {
        for (m = 0; m < matB_col; m++) {
            float sum = 0;
            for (i = 0; i < matA_col; i++) {
                sum += matA[i + n * matA_col] * matB[m + i * matB_col];
            }
            output[m + n * matB_col] = sum;
        }
    }

    return STATE_SUCCESS;
}
1813
TinyEngine/src/kernels/int_only/mat_mult_kernels.c
Normal file
50
TinyEngine/src/kernels/int_only/maxpooling.c
Normal file
@ -0,0 +1,50 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: maxpooling.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"

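/*
 * Non-overlapping max pooling: the window origin advances by exactly
 * (sample_h, sample_w) per output pixel, so the stride is assumed equal
 * to the pooling window size.
 */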
tinyengine_status max_pooling(const q7_t *input, const uint16_t input_h, const uint16_t input_w,
        const uint16_t input_c, const uint16_t sample_h, const uint16_t sample_w,
        const uint16_t output_h, const uint16_t output_w, const int32_t out_activation_min,
        const int32_t out_activation_max, q7_t *output)
{
    int h, w, c;
    int sh, sw;
    for (c = 0; c < input_c; c++) {
        for (h = 0; h < output_h; h++) {
            for (w = 0; w < output_w; w++) {
                int max = out_activation_min;

                for (sh = 0; sh < sample_h; sh++) {
                    int height = sh + h * sample_h;
                    for (sw = 0; sw < sample_w; sw++) {
                        int width = sw + w * sample_w;
                        max = TN_MAX(max, input[(width + height * input_w) * input_c + c]);
                    }
                }

                int out = max;
                out = TN_MAX(out, out_activation_min);
                out = TN_MIN(out, out_activation_max);
                output[(w + h * output_w) * input_c + c] = (q7_t) out;
            }
        }
    }

    return STATE_SUCCESS;
}
@ -0,0 +1,252 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: patchpadding_convolve_s8_kernel3_inputch3_stride2.c
 * Description: for 3x3 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

#define HOLD_KERNEL

tinyengine_status patchpadding_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
        const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x, i_ker_y, i_ker_x;

    /* Generate two columns from the input tensor for the GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    q15_t pad16 = pad_value;
    const int16_t inoff16 = input_offset;
    q15_t pad_out = pad16 + inoff16;
    q31_t pad_out_q15x2 = __PKHBT(pad_out, pad_out, 16);
    q31_t offset_q15x2 = __PKHBT(inoff16, inoff16, 16);

    q15_t *kbuf = (q15_t *) get_kernel_buffer();
    const q7_t *ip_a0 = kernel;

#ifdef HOLD_KERNEL
    for (int i = 0; i < output_ch; i += 2) {
        q15_t *dst1 = &kbuf[i * 27]; //each q31_t stores 2 elements
        q15_t *dst2 = dst1 + 27;

        const q7_t *ip_a1 = ip_a0 + 27;

        //27 weights for each output_ch
        q31_t *dst1_31 = (q31_t *) dst1;
        q31_t *dst2_31 = (q31_t *) dst2;
        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;

        ip_a0 = read_and_pad(ip_a0, &dst1_31[0], &dst1_31[1]);
        ip_a1 = read_and_pad(ip_a1, &dst2_31[0], &dst2_31[1]);
        dst1_31 += 2;
        dst2_31 += 2;
        //elements 25, 26, 27
        dst1 = (q15_t *) dst1_31;
        dst2 = (q15_t *) dst2_31;
        dst1[0] = *ip_a0++;
        dst1[1] = *ip_a0++;
        dst1[2] = *ip_a0++;
        dst2[0] = *ip_a1++;
        dst2[1] = *ip_a1++;
        dst2[2] = *ip_a1++;

        /* skip the second output channel's weights, already loaded via ip_a1 */
        ip_a0 += 27;
    }
#endif
|
||||
|
||||
int skip = 0;
|
||||
for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
|
||||
for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
|
||||
/* This part implements the im2col function */
|
||||
const q15_t *col_buffer = two_column_buf;
|
||||
int16_t base_idx_y = (i_out_y * 2);
|
||||
int16_t base_idx_x = (i_out_x * 2);
|
||||
|
||||
//use variables
|
||||
q31_t in_q7x4;
|
||||
q31_t in_q15x2_1;
|
||||
q31_t in_q15x2_2;
|
||||
q31_t out_q15x2_1;
|
||||
q31_t out_q15x2_2;
|
||||
|
||||
/* load address:8bit */
|
||||
q7_t *src;
|
||||
|
||||
/* buffer for im2col:16bit */
|
||||
q15_t *dst = col_buffer;
|
||||
|
||||
int skip_top = pad_t - base_idx_y;
|
||||
int skip_bottom = MAX(0,(base_idx_y + 3) - (input_y - pad_b));//3x3
|
||||
|
||||
int y_cnt = 3;//3 rows to load
|
||||
//fill zeros in the top regions
|
||||
while (y_cnt > 0 && skip_top-- > 0){
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
y_cnt--;
|
||||
base_idx_y++;
|
||||
}
|
||||
|
||||
//fill in the middle
|
||||
int skip_left = MAX(0,pad_l - base_idx_x);
|
||||
int skip_right = MAX(0,(base_idx_x + 3) - (input_x - pad_r));//3x3
|
||||
//address of the first valid values
|
||||
int m;
|
||||
for (m = 0; m < y_cnt - skip_bottom; m++){
|
||||
src = input + ((base_idx_y+m) * input_x + base_idx_x + skip_left) * input_ch;
|
||||
int x_cnt = 3;//3 columns to load
|
||||
//fill zero for left regions
|
||||
int cnt = skip_left;
|
||||
while(x_cnt > 0 && cnt-- > 0){
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
|
||||
x_cnt--;
|
||||
}
|
||||
|
||||
//load the middle
|
||||
while(x_cnt > skip_right){
|
||||
*dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
|
||||
x_cnt--;
|
||||
}
|
||||
|
||||
//fill zero for right regions (for what's left)
|
||||
while(x_cnt > 0){
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;//input_ch == 3
|
||||
x_cnt--;
|
||||
}
|
||||
}
|
||||
y_cnt -= m;
|
||||
|
||||
//fill zeros in the bottom regions
|
||||
while (y_cnt > 0){
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
*dst++ = 0; *dst++ = 0; *dst++ = 0;
|
||||
y_cnt--;
|
||||
}
|
||||
|
||||
two_column_buf += 27;
|
||||
/* Computation is filed for every 2 columns */
|
||||
if (two_column_buf == runtime_buf + 2 * 27) {
|
||||
#ifdef HOLD_KERNEL
|
||||
out = arm_nn_mat_mult_kernel3_input3_s8_s16(kernel,
|
||||
runtime_buf, output_ch, output_shift, output_mult,
|
||||
output_offset, output_activation_min, output_activation_max,
|
||||
input_ch * kernel_y * kernel_x, bias, out, kbuf);
|
||||
// out = mat_mult_s16(kernel,
|
||||
// runtime_buf, output_ch, output_shift, output_mult,
|
||||
// output_offset, output_activation_min, output_activation_max,
|
||||
// input_ch * kernel_y * kernel_x, bias, out, kbuf);
|
||||
#else
|
||||
out = arm_nn_mat_mult_kernel_s8_s16(kernel,
|
||||
runtime_buf, output_ch, output_shift, output_mult,
|
||||
output_offset, output_activation_min, output_activation_max,
|
||||
27, bias, out);
|
||||
#endif
|
||||
|
||||
/* counter reset */
|
||||
two_column_buf = runtime_buf;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* left-over because odd number of output pixels */
|
||||
if (two_column_buf != runtime_buf) {
|
||||
const q7_t *ker_a = kernel;
|
||||
int i;
|
||||
|
||||
for (i = 0; i < output_ch; i++) {
|
||||
/* Load the accumulator with bias first */
|
||||
q31_t sum = bias[i];
|
||||
|
||||
/* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
|
||||
const q15_t *ip_as_col = runtime_buf;
|
||||
|
||||
/* 4 multiply and accumulates are done in one loop. */
|
||||
uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;
|
||||
|
||||
while (col_count) {
|
||||
q31_t ker_a1, ker_a2;
|
||||
q31_t ip_b1, ip_b2;
|
||||
|
||||
ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);
|
||||
|
||||
ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
|
||||
sum = __SMLAD(ker_a1, ip_b1, sum);
|
||||
ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
|
||||
sum = __SMLAD(ker_a2, ip_b2, sum);
|
||||
|
||||
col_count--;
|
||||
}
|
||||
/* Handle left over mac */
|
||||
col_count = input_ch * kernel_y * kernel_x & 0x3;
|
||||
while (col_count) {
|
||||
q7_t ker_a1 = *ker_a++;
|
||||
q15_t ip_b1 = *ip_as_col++;
|
||||
sum += ker_a1 * ip_b1;
|
||||
col_count--;
|
||||
}
|
||||
|
||||
sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
|
||||
sum += output_offset;
|
||||
sum = MAX(sum, output_activation_min);
|
||||
sum = MIN(sum, output_activation_max);
|
||||
*out++ = (q7_t) sum;
|
||||
}
|
||||
}
|
||||
|
||||
/* Return to application */
|
||||
return STATE_SUCCESS;
|
||||
}
|
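/* Background sketch (added illustration): a portable little-endian model of
 * the CMSIS-NN read_and_pad() helper used above. It consumes four q7_t values
 * and emits two q31_t words, each packing two sign-extended q15_t values,
 * which is why the 27 weights per output channel are expanded with six
 * read_and_pad() calls (24 values) plus three scalar copies. */
static const q7_t *read_and_pad_model(const q7_t *src, q31_t *out1, q31_t *out2) {
    q15_t a = src[0], b = src[1], c = src[2], d = src[3]; /* sign-extend q7 -> q15 */
    *out1 = (q31_t)(((uint32_t)(uint16_t)a) | ((uint32_t)(uint16_t)b << 16)); /* pack pair (0,1) */
    *out2 = (q31_t)(((uint32_t)(uint16_t)c) | ((uint32_t)(uint16_t)d << 16)); /* pack pair (2,3) */
    return src + 4;
}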
@ -0,0 +1,175 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: patchpadding_depthwise_kernel3x3_stride1_inplace_CHW.c
 * Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnsupportfunctions.h" /* TODO: remove this in the future to make the kernel self-contained */
#include "tinyengine_function.h"

void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
        const uint16_t output_y, const uint16_t output_x,
        const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
        const int32_t *shift, q7_t *output, const int32_t output_offset,
        const int32_t activation_min, const int32_t activation_max,
        q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);

tinyengine_status patchpadding_depthwise_kernel3x3_stride1_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
        const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
    uint16_t c, i, j;
    q7_t *cols_8b_start = (q7_t *)runtime_buf;
    q7_t *cols_8b = cols_8b_start;

    const q7_t *src;
    const q7_t *ksrc = kernel;

    /* set the output pointer for the in-place update */
    q7_t *inplace_out = input;

    int padding_cnt = pad_t * input_x;
    /* skip the (stale) top-padding rows of the HWC input */
    input += padding_cnt * input_ch;
    /* handle the top padding */
    q7_t PAD8 = pad_value;
    while (padding_cnt--) {
        *cols_8b++ = PAD8;
    }

    for (i = pad_t; i < input_y - pad_b; i++) {
        /* handle the left padding */
        for (j = 0; j < pad_l; j++) {
            *cols_8b++ = PAD8;
        }
        cols_8b += input_x - (pad_l + pad_r);
        /* handle the right padding */
        for (j = 0; j < pad_r; j++) {
            *cols_8b++ = PAD8;
        }
    }

    /* handle the bottom padding */
    padding_cnt = pad_b * input_x;
    /* no need to shift the input pointer for the bottom padding */
    while (padding_cnt--) {
        *cols_8b++ = PAD8;
    }

    for (c = 0; c < input_ch; c++) {
        src = input;
        cols_8b = cols_8b_start + pad_t * input_x; /* skip the pad_t rows */
        for (i = pad_t; i < input_y - pad_b; i++) {
            cols_8b += pad_l;          /* skip the left padding */
            src += pad_l * input_ch;
            for (j = pad_l; j < input_x - pad_r; j++) {
                *cols_8b++ = *src; /* input_offset is fused into the bias (offsetBias) at code generation */
                src += input_ch;
            }
            cols_8b += pad_r;          /* skip the right padding */
            src += pad_r * input_ch;
        }
        patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++,
                output_shift++, inplace_out, output_offset, output_activation_min, output_activation_max,
                cols_8b_start, input_x, input_ch);
        /* advance to the next channel */
        inplace_out++;
        input++;
        ksrc += 9;
    }

    return STATE_SUCCESS;
}

void patch_depthwise_kernel3x3_stride1_inplace_kernel_CHW(
        const uint16_t output_y, const uint16_t output_x,
        const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
        const int32_t *shift, q7_t *output, const int32_t output_offset,
        const int32_t activation_min, const int32_t activation_max,
        q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 1
    int i, j;
    /* MACs for each output */
    for (i = 0; i < output_y; i++) {
        for (j = 0; j < output_x / 2; j++) {
            q7_t *cols_8b = cols_8b_iterptr;

            q31_t sum0 = bias[0];
            q31_t sum1 = bias[0];

            /* computation: two horizontally adjacent outputs per iteration */
            sum0 += cols_8b[0] * ksrc[0];
            sum1 += cols_8b[1] * ksrc[0];
            sum0 += cols_8b[1] * ksrc[1];
            sum1 += cols_8b[2] * ksrc[1];
            sum0 += cols_8b[2] * ksrc[2];
            sum1 += cols_8b[3] * ksrc[2];
            cols_8b += column_x;
            sum0 += cols_8b[0] * ksrc[3];
            sum1 += cols_8b[1] * ksrc[3];
            sum0 += cols_8b[1] * ksrc[4];
            sum1 += cols_8b[2] * ksrc[4];
            sum0 += cols_8b[2] * ksrc[5];
            sum1 += cols_8b[3] * ksrc[5];
            cols_8b += column_x;
            sum0 += cols_8b[0] * ksrc[6];
            sum1 += cols_8b[1] * ksrc[6];
            sum0 += cols_8b[1] * ksrc[7];
            sum1 += cols_8b[2] * ksrc[7];
            sum0 += cols_8b[2] * ksrc[8];
            sum1 += cols_8b[3] * ksrc[8];

            /* requantize */
            sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
            sum0 += output_offset;
            sum0 = MAX(sum0, activation_min);
            sum0 = MIN(sum0, activation_max);
            output[(i * output_x + j * 2) * channel_offset] = sum0;

            sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
            sum1 += output_offset;
            sum1 = MAX(sum1, activation_min);
            sum1 = MIN(sum1, activation_max);
            output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;

            cols_8b_iterptr += STRIDE * 2;
        }
        if (output_x & 1) {
            q7_t *cols_8b = cols_8b_iterptr;
            q31_t sum = bias[0];
            sum += cols_8b[0] * ksrc[0];
            sum += cols_8b[1] * ksrc[1];
            sum += cols_8b[2] * ksrc[2];
            cols_8b += column_x;
            sum += cols_8b[0] * ksrc[3];
            sum += cols_8b[1] * ksrc[4];
            sum += cols_8b[2] * ksrc[5];
            cols_8b += column_x;
            sum += cols_8b[0] * ksrc[6];
            sum += cols_8b[1] * ksrc[7];
            sum += cols_8b[2] * ksrc[8];

            sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
            sum += output_offset;
            sum = MAX(sum, activation_min);
            sum = MIN(sum, activation_max);
            output[(i * output_x + output_x - 1) * channel_offset] = sum;

            cols_8b_iterptr += STRIDE;
        }
        /* skip the 2 right-most columns: the buffer row width is column_x = output_x + 2 */
        cols_8b_iterptr += 1 * 2;
    }
}
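/* Note (added): the in-place trick above is possible because each depthwise
 * channel is first copied into the q7_t scratch buffer (cols_8b) before its
 * input plane is overwritten with results, so no separate output tensor is
 * needed. The scratch requirement per call is a single padded channel plane: */
static int depthwise_inplace_scratch_bytes(int input_x, int input_y) {
    return input_x * input_y * (int)sizeof(q7_t); /* one CHW channel plane, padding included */
}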
@ -0,0 +1,176 @@
/* This file is automatically generated */
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: patchpadding_depthwise_kernel3x3_stride2_inplace_CHW.c
 * Description: for sparse in-place 3x3 depth-wise convolution (HWC->CHW->HWC)
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_nnsupportfunctions.h" /* TODO: remove this in the future to make the kernel self-contained */
#include "tinyengine_function.h"

void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
        const uint16_t output_y, const uint16_t output_x,
        const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
        const int32_t *shift, q7_t *output, const int32_t output_offset,
        const int32_t activation_min, const int32_t activation_max,
        q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset);

tinyengine_status patchpadding_depthwise_kernel3x3_stride2_inplace_CHW(q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const int32_t *bias, const int32_t *biasR,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
        const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r)
{
    uint16_t c, i, j;
    q7_t *cols_8b_start = (q7_t *)runtime_buf;
    q7_t *cols_8b = cols_8b_start;

    const q7_t *src;
    const q7_t *ksrc = kernel;

    /* set the output pointer for the in-place update */
    q7_t *inplace_out = input;

    int padding_cnt = pad_t * input_x;
    /* skip the (stale) top-padding rows of the HWC input */
    input += padding_cnt * input_ch;
    /* handle the top padding */
    q7_t PAD8 = pad_value;
    while (padding_cnt--) {
        *cols_8b++ = PAD8;
    }

    for (i = pad_t; i < input_y - pad_b; i++) {
        /* handle the left padding */
        for (j = 0; j < pad_l; j++) {
            *cols_8b++ = PAD8;
        }
        cols_8b += input_x - (pad_l + pad_r);
        /* handle the right padding */
        for (j = 0; j < pad_r; j++) {
            *cols_8b++ = PAD8;
        }
    }

    /* handle the bottom padding */
    padding_cnt = pad_b * input_x;
    /* no need to shift the input pointer for the bottom padding */
    while (padding_cnt--) {
        *cols_8b++ = PAD8;
    }

    for (c = 0; c < input_ch; c++) {
        src = input;
        cols_8b = cols_8b_start + pad_t * input_x; /* skip the pad_t rows */
        for (i = pad_t; i < input_y - pad_b; i++) {
            cols_8b += pad_l;          /* skip the left padding */
            src += pad_l * input_ch;
            for (j = pad_l; j < input_x - pad_r; j++) {
                *cols_8b++ = *src; /* input_offset is fused into the bias (offsetBias) at code generation */
                src += input_ch;
            }
            cols_8b += pad_r;          /* skip the right padding */
            src += pad_r * input_ch;
        }
        patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(output_y, output_x, bias++, biasR++, ksrc, output_mult++,
                output_shift++, inplace_out, output_offset, output_activation_min, output_activation_max,
                cols_8b_start, input_x, input_ch);
        /* advance to the next channel */
        inplace_out++;
        input++;
        ksrc += 9;
    }

    return STATE_SUCCESS;
}

void patch_depthwise_kernel3x3_stride2_inplace_kernel_CHW(
        const uint16_t output_y, const uint16_t output_x,
        const int32_t *bias, const int32_t *biasR, const q7_t *ksrc, const int32_t *multiplier,
        const int32_t *shift, q7_t *output, const int32_t output_offset,
        const int32_t activation_min, const int32_t activation_max,
        q7_t *cols_8b_iterptr, const uint16_t column_x, int channel_offset)
{
#define STRIDE 2
    int i, j;
    /* MACs for each output */
    for (i = 0; i < output_y; i++) {
        for (j = 0; j < output_x / 2; j++) {
            q7_t *cols_8b = cols_8b_iterptr;

            q31_t sum0 = bias[0];
            q31_t sum1 = bias[0];

            /* computation: two outputs per iteration, 2 columns apart (stride 2) */
            sum0 += cols_8b[0] * ksrc[0];
            sum1 += cols_8b[2] * ksrc[0];
            sum0 += cols_8b[1] * ksrc[1];
            sum1 += cols_8b[3] * ksrc[1];
            sum0 += cols_8b[2] * ksrc[2];
            sum1 += cols_8b[4] * ksrc[2];
            cols_8b += column_x;
            sum0 += cols_8b[0] * ksrc[3];
            sum1 += cols_8b[2] * ksrc[3];
            sum0 += cols_8b[1] * ksrc[4];
            sum1 += cols_8b[3] * ksrc[4];
            sum0 += cols_8b[2] * ksrc[5];
            sum1 += cols_8b[4] * ksrc[5];
            cols_8b += column_x;
            sum0 += cols_8b[0] * ksrc[6];
            sum1 += cols_8b[2] * ksrc[6];
            sum0 += cols_8b[1] * ksrc[7];
            sum1 += cols_8b[3] * ksrc[7];
            sum0 += cols_8b[2] * ksrc[8];
            sum1 += cols_8b[4] * ksrc[8];

            /* requantize */
            sum0 = arm_nn_requantize(sum0 + biasR[0], *multiplier, *shift);
            sum0 += output_offset;
            sum0 = MAX(sum0, activation_min);
            sum0 = MIN(sum0, activation_max);
            output[(i * output_x + j * 2) * channel_offset] = sum0;

            sum1 = arm_nn_requantize(sum1 + biasR[0], *multiplier, *shift);
            sum1 += output_offset;
            sum1 = MAX(sum1, activation_min);
            sum1 = MIN(sum1, activation_max);
            output[(i * output_x + (j * 2 + 1)) * channel_offset] = sum1;

            cols_8b_iterptr += STRIDE * 2;
        }
        if (output_x & 1) {
            q7_t *cols_8b = cols_8b_iterptr;
            q31_t sum = bias[0];
            sum += cols_8b[0] * ksrc[0];
            sum += cols_8b[1] * ksrc[1];
            sum += cols_8b[2] * ksrc[2];
            cols_8b += column_x;
            sum += cols_8b[0] * ksrc[3];
            sum += cols_8b[1] * ksrc[4];
            sum += cols_8b[2] * ksrc[5];
            cols_8b += column_x;
            sum += cols_8b[0] * ksrc[6];
            sum += cols_8b[1] * ksrc[7];
            sum += cols_8b[2] * ksrc[8];

            sum = arm_nn_requantize(sum + biasR[0], *multiplier, *shift);
            sum += output_offset;
            sum = MAX(sum, activation_min);
            sum = MIN(sum, activation_max);
            output[(i * output_x + output_x - 1) * channel_offset] = sum;

            cols_8b_iterptr += STRIDE;
        }
        /* finish the remainder of this row, then skip one full row (stride 2) */
        cols_8b_iterptr += 1 * 2 - (column_x & 1);
        cols_8b_iterptr += (STRIDE - 1) * (column_x);
    }
}
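/* Pointer-math check (added note): for a padded row of width column_x and a
 * 3x3 stride-2 window, output_x = (column_x - 1) / 2 (integer division), so
 * the j-loop above advances the iterator by 2 * output_x, i.e. column_x - 1
 * for odd column_x or column_x - 2 for even column_x. The two final
 * increments add (2 - (column_x & 1)) + (STRIDE - 1) * column_x, which lands
 * the iterator exactly at the start of the row two below the current one, as
 * stride 2 requires. */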
@ -0,0 +1,179 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2.c
 * Description: for 3x3 convolution with 3 input channels, typically for image processing
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "arm_math.h"
#include "arm_nnfunctions.h"
#include "arm_nnsupportfunctions.h"
#include "img2col_element.h"
#include "tinyengine_function.h"

tinyengine_status patchpadding_kbuf_convolve_s8_kernel3_inputch3_stride2(const q7_t *input, const uint16_t input_x, const uint16_t input_y,
        const uint16_t input_ch, const q7_t *kernel, const q31_t *kbuf, const int32_t *bias,
        const int32_t *output_shift, const int32_t *output_mult,
        const int32_t output_offset, const int32_t input_offset,
        const int32_t output_activation_min,
        const int32_t output_activation_max, q7_t *output,
        const uint16_t output_x, const uint16_t output_y,
        const uint16_t output_ch, q15_t *runtime_buf, q7_t pad_value,
        const uint16_t pad_t, const uint16_t pad_b, const uint16_t pad_l, const uint16_t pad_r) {
    const int kernel_y = 3;
    const int kernel_x = 3;

    int16_t i_out_y, i_out_x;

    /* Generate two columns from the input tensor for a GEMM computation */
    q15_t *two_column_buf = runtime_buf;
    q7_t *out = output;

    for (i_out_y = 0; i_out_y < output_y; i_out_y++) {
        for (i_out_x = 0; i_out_x < output_x; i_out_x++) {
            /* This part implements the im2col function */
            int16_t base_idx_y = (i_out_y * 2);
            int16_t base_idx_x = (i_out_x * 2);

            /* load address: 8bit */
            const q7_t *src;

            /* buffer for im2col: 16bit; padded positions are zero-filled */
            q15_t *dst = two_column_buf;

            int skip_top = pad_t - base_idx_y;
            int skip_bottom = MAX(0, (base_idx_y + 3) - (input_y - pad_b)); /* 3x3 */

            int y_cnt = 3; /* 3 rows to load */
            /* fill zeros in the top regions */
            while (y_cnt > 0 && skip_top-- > 0) {
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                y_cnt--;
                base_idx_y++;
            }

            /* fill in the middle */
            int skip_left = MAX(0, pad_l - base_idx_x);
            int skip_right = MAX(0, (base_idx_x + 3) - (input_x - pad_r)); /* 3x3 */
            /* address of the first valid values */
            int m;
            for (m = 0; m < y_cnt - skip_bottom; m++) {
                src = input + ((base_idx_y + m) * input_x + base_idx_x + skip_left) * input_ch;
                int x_cnt = 3; /* 3 columns to load */
                /* fill zeros for the left regions */
                int cnt = skip_left;
                while (x_cnt > 0 && cnt-- > 0) {
                    *dst++ = 0; *dst++ = 0; *dst++ = 0; /* input_ch == 3 */
                    x_cnt--;
                }

                /* load the middle */
                while (x_cnt > skip_right) {
                    *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset; *dst++ = *src++ + input_offset;
                    x_cnt--;
                }

                /* fill zeros for the right regions (for what's left) */
                while (x_cnt > 0) {
                    *dst++ = 0; *dst++ = 0; *dst++ = 0; /* input_ch == 3 */
                    x_cnt--;
                }
            }
            y_cnt -= m;

            /* fill zeros in the bottom regions */
            while (y_cnt > 0) {
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                *dst++ = 0; *dst++ = 0; *dst++ = 0;
                y_cnt--;
            }

            two_column_buf += 27;
            /* Computation is performed for every 2 columns */
            if (two_column_buf == runtime_buf + 2 * 27) {
                out = mat_mult_s16(kernel,
                        runtime_buf, output_ch, output_shift, output_mult,
                        output_offset, output_activation_min, output_activation_max,
                        input_ch * kernel_y * kernel_x, bias, out, kbuf);

                /* counter reset */
                two_column_buf = runtime_buf;
            }
        }
    }

    /* left-over because of an odd number of output pixels */
    if (two_column_buf != runtime_buf) {
        const q7_t *ker_a = kernel;
        int i;

        for (i = 0; i < output_ch; i++) {
            /* Load the accumulator with the bias first */
            q31_t sum = bias[i];

            /* Point to the beginning of the im2col buffer where the input is available as a rearranged column */
            const q15_t *ip_as_col = runtime_buf;

            /* 4 multiply-accumulates are done in one loop. */
            uint16_t col_count = (input_ch * kernel_y * kernel_x) >> 2;

            while (col_count) {
                q31_t ker_a1, ker_a2;
                q31_t ip_b1, ip_b2;

                ker_a = read_and_pad(ker_a, &ker_a1, &ker_a2);

                ip_b1 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a1, ip_b1, sum);
                ip_b2 = arm_nn_read_q15x2_ia(&ip_as_col);
                sum = __SMLAD(ker_a2, ip_b2, sum);

                col_count--;
            }
            /* Handle the left-over MACs */
            col_count = (input_ch * kernel_y * kernel_x) & 0x3;
            while (col_count) {
                q7_t ker_a1 = *ker_a++;
                q15_t ip_b1 = *ip_as_col++;
                sum += ker_a1 * ip_b1;
                col_count--;
            }

            sum = arm_nn_requantize(sum, output_mult[i], output_shift[i]);
            sum += output_offset;
            sum = MAX(sum, output_activation_min);
            sum = MIN(sum, output_activation_max);
            *out++ = (q7_t)sum;
        }
    }

    /* Return to application */
    return STATE_SUCCESS;
}
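/* Note (added): unlike the HOLD_KERNEL variant of this kernel, the weights
 * here arrive already expanded to 16 bits in kbuf, so no per-call expansion
 * loop is needed; mat_mult_s16 can feed __SMLAD directly without re-running
 * read_and_pad() on the weights for every pair of output columns. */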
40
TinyEngine/src/kernels/int_only/stable_softmax.c
Normal file
@ -0,0 +1,40 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: stable_softmax.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include "tinyengine_function.h"
#include <float.h>
#include <math.h>

tinyengine_status statble_softmax_inplace(float *input, const uint16_t length)
{
    float max = -FLT_MAX; /* FLT_MIN is the smallest positive float, not the most negative one */
    float exp_sum = 0;
    uint16_t i;
    for (i = 0; i < length; i++) {
        if (input[i] > max) max = input[i];
    }

    /* in-place update */
    for (i = 0; i < length; i++) {
        input[i] = expf(input[i] - max);
        exp_sum += input[i];
    }
    for (i = 0; i < length; i++) {
        input[i] = input[i] / exp_sum;
    }

    return STATE_SUCCESS;
}
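/* Why subtracting the max is safe (added derivation): for any constant m,
 *
 *     expf(x_i - m) / sum_j expf(x_j - m)
 *   = (expf(x_i) * expf(-m)) / (expf(-m) * sum_j expf(x_j))
 *   = expf(x_i) / sum_j expf(x_j),
 *
 * so choosing m = max_j x_j leaves the softmax unchanged while keeping every
 * exponent <= 0, which prevents expf() from overflowing to infinity. */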
85
TinyEngine/src/kernels/int_only/upsample_byte.c
Normal file
@ -0,0 +1,85 @@
/* ----------------------------------------------------------------------
 * Project: TinyEngine
 * Title: upsample_byte.c
 *
 * Reference papers:
 *  - MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020
 *  - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
 *  - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
 * Contact authors:
 *  - Wei-Ming Chen, wmchen@mit.edu
 *  - Wei-Chen Wang, wweichen@mit.edu
 *  - Ji Lin, jilin@mit.edu
 *  - Ligeng Zhu, ligeng@mit.edu
 *  - Song Han, songhan@mit.edu
 *
 * Target ISA: ARMv7E-M
 * -------------------------------------------------------------------- */

#include <string.h> /* for memcpy */

#include "arm_nnfunctions.h"
#include "tinyengine_function.h"

tinyengine_status upsample_byte(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
    /* get the output resolution */
    const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor, output_ch = input_ch;

    /* nearest-neighbor upsampling: repeat each pixel, then repeat each row */
    for (int ih = 0; ih < input_y; ih++) {
        q7_t *out_head = output;
        /* place 1 row */
        for (int iw = 0; iw < input_x; iw++) {
            for (int s = 0; s < sample_factor; s++) {
                memcpy(output, input, input_ch);
                output += input_ch;
            }
            input += input_ch;
        }

        /* copy the remaining rows */
        for (int s = 1; s < sample_factor; s++) {
            memcpy(output, out_head, output_ch * output_x);
            output += output_ch * output_x;
        }
    }
    return STATE_SUCCESS;
}


/* ref: https://www.cs.toronto.edu/~guerzhoy/320/lec/upsampling.pdf */
/* TODO: bilinear upsampling is not implemented yet; the commented-out draft
 * below is kept for reference. */
tinyengine_status upsample_byte_bilinear(const q7_t *input, const uint16_t input_x,
        const uint16_t input_y, const uint16_t input_ch, q7_t *output, const uint16_t sample_factor) {
    /* get the output resolution */
    const uint16_t output_x = input_x * sample_factor, output_y = input_y * sample_factor, output_ch = input_ch;

//    //upsample row by row
//    for (int oh = 0; oh < output_y; oh++){
//        int ih = oh / sample_factor;
//        int rh = oh % sample_factor;
//
//        q7_t* out_head = output;
//        //place 1 row
//        for (int ow = 0; ow < output_x; ow++){
//            int iw = ow / sample_factor;
//            int rw = ow % sample_factor;
//
//            //exact coordinate
//            const q7_t* ori_input = input + input_ch * (input_x * ih + iw);
//            if ((rh | rw) == 0){
//                memcpy(output, ori_input, input_ch);
//                continue;
//            }
//
//            //interpolate
//            const q7_t* topleft = ori_input;
//            const q7_t* topright = ori_input + input_ch;
//            const q7_t* bottomleft = topleft + input_ch * input_x;
//            const q7_t* bottomright = topright + input_ch * input_x;
//        }
//    }
    return STATE_SUCCESS;
}
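/* Usage sketch (added illustration; the buffer names are hypothetical): 2x
 * nearest-neighbor upsampling of an 8x8x3 HWC tensor into 16x16x3. */
static void upsample_byte_usage_sketch(void) {
    static q7_t small[8 * 8 * 3];
    static q7_t large[16 * 16 * 3];
    upsample_byte(small, 8, 8, 3, large, 2);
}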
1
TinyEngine/third_party/CMSIS
vendored
Submodule
@ -0,0 +1 @@
Subproject commit 5b58d2da8af7cee64cc9145ee1154609bdfee9f9
BIN
assets/detection.tflite
Normal file
23
assets/detection_config.json
Normal file
@ -0,0 +1,23 @@
{
    "output1": {
        "name": "Yolo3Output",
        "input_id": "175",
        "num_class": 1,
        "anchors": [116, 90, 156, 198, 373, 326],
        "stride": 32
    },
    "output2": {
        "name": "Yolo3Output",
        "input_id": "36",
        "num_class": 1,
        "anchors": [30, 61, 62, 45, 59, 119],
        "stride": 16
    },
    "output3": {
        "name": "Yolo3Output",
        "input_id": "5",
        "num_class": 1,
        "anchors": [10, 13, 16, 30, 33, 23],
        "stride": 8
    }
}
BIN
assets/figures/0_import_project_0.png
Normal file
BIN
assets/figures/10_mcu_top_view.png
Normal file
BIN
assets/figures/11_mcu_side_view.png
Normal file
BIN
assets/figures/12_stlink_0.png
Normal file
BIN
assets/figures/13_stlink_1.png
Normal file
BIN
assets/figures/14_stlink_2.png
Normal file
BIN
assets/figures/15_demo_person.png
Normal file
BIN
assets/figures/16_demo_no_person.png
Normal file
BIN
assets/figures/1_import_project_1.png
Normal file
BIN
assets/figures/2_project_explorer.png
Normal file
BIN
assets/figures/3_main_cpp.png
Normal file
BIN
assets/figures/4_gcc_include_paths.png
Normal file
BIN
assets/figures/5_gcc_optimization.png
Normal file
BIN
assets/figures/6_gplusplus_include_paths.png
Normal file
BIN
assets/figures/7_gplusplus_optimization.png
Normal file
BIN
assets/figures/8_run_configurations_0.png
Normal file
BIN
assets/figures/9_run_configurations_1.png
Normal file
BIN
assets/figures/applications.png
Normal file
BIN
assets/figures/inplace_depthwise.png
Normal file
BIN
assets/figures/latency_mem.png
Normal file
BIN
assets/figures/mac_result.png
Normal file
BIN
assets/figures/mcunetV3_demo.gif
Normal file
BIN
assets/figures/mcunetV3_demo_2images.gif
Normal file
BIN
assets/figures/mcunet_demo.gif
Normal file
BIN
assets/figures/measured_result.png
Normal file
BIN
assets/figures/memory_size.png
Normal file
BIN
assets/figures/overview.png
Normal file
BIN
assets/figures/peakmem_result.png
Normal file
BIN
assets/vww.tflite
Normal file
667
code_generator/CodeGenerator.py
Normal file
@ -0,0 +1,667 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: CodeGenerator.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
import os
|
||||
|
||||
from .OpGenerator import OpGenerator
|
||||
|
||||
Codegen_root = "./codegen/"
|
||||
include_path = Codegen_root + "Include/"
|
||||
source_path = Codegen_root + "Source/"
|
||||
|
||||
use_hard_switsh = False
|
||||
gen_kernels = True
|
||||
use_aggressive_unroll = True
|
||||
|
||||
|
||||
class CodeGenerator:
|
||||
"""Provide utilities to generate C code for a given model and memory schdeule."""
|
||||
|
||||
parse_count = 0
|
||||
header_handle = None
|
||||
source_handle = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
memsche,
|
||||
inplace,
|
||||
precision=8,
|
||||
unsigned_input=False,
|
||||
patch_params=None,
|
||||
FP_output=False,
|
||||
profile_mode=False,
|
||||
fp_requantize=False,
|
||||
tflite_op=False,
|
||||
dummy_address=False,
|
||||
outputTables=None,
|
||||
detectionUtils=None,
|
||||
):
|
||||
self.MemSche = memsche
|
||||
|
||||
# Check if path exists, create it if not
|
||||
if not os.path.exists(include_path):
|
||||
os.makedirs(include_path)
|
||||
if not os.path.exists(source_path):
|
||||
os.makedirs(source_path)
|
||||
|
||||
self.header_handle = open(include_path + "genModel.h", "w")
|
||||
self.source_handle = open(source_path + "genModel.c", "w")
|
||||
self.inplace = inplace
|
||||
self.BIT = precision
|
||||
self.unsigned_input = unsigned_input
|
||||
self.patch_params = patch_params
|
||||
self.FP_output = FP_output
|
||||
self.profile_mode = profile_mode
|
||||
self.fp_requantize = fp_requantize
|
||||
self.tflite_op = tflite_op
|
||||
self.dummy_address = dummy_address
|
||||
self.trainSRAMTable = []
|
||||
self.outputTables = outputTables
|
||||
self.detectionUtils = detectionUtils
|
||||
|
||||
def _readOnly(self, name):
|
||||
if self.outputTables is None or name is None:
|
||||
return True
|
||||
else:
|
||||
for o in self.outputTables:
|
||||
if o.name in name:
|
||||
return False
|
||||
return True
|
||||
|
||||
def codeGeneration(self):
|
||||
# buffer in SRAM
|
||||
self._genMemBuffer()
|
||||
|
||||
# parse trainable parameters & assign the corresponding buffers for layers
|
||||
self._parseTrainable()
|
||||
|
||||
# include all headers
|
||||
self._includeHeaders()
|
||||
|
||||
# generate detection output if any
|
||||
self._genDetprocessing()
|
||||
|
||||
# generate patch-based
|
||||
self._genPatchInference()
|
||||
|
||||
# generate invoke function
|
||||
self._genInvoke()
|
||||
|
||||
self._closefp()
|
||||
|
||||
# generate operatior kernels
|
||||
if gen_kernels:
|
||||
op_gen = OpGenerator(include_path, source_path, self.MemSche.layer, self.fp_requantize)
|
||||
op_gen.genOpcode()
|
||||
|
||||
def _genDetprocessing(self):
|
||||
if self.detectionUtils is not None:
|
||||
fp = self.source_handle
|
||||
fp.write(self.detectionUtils.genPostProcessing())
|
||||
|
||||
def _genOpstr(self, op, *args):
|
||||
if self.profile_mode:
|
||||
if len(args) > 0:
|
||||
return op.generate_profiling_str(*args)
|
||||
else:
|
||||
return op.generate_profiling_str()
|
||||
else:
|
||||
if len(args) > 0:
|
||||
return op.generate_inference_str(*args)
|
||||
else:
|
||||
return op.generate_inference_str()
|
||||
|
||||
def _genPatchInference(self):
|
||||
schedule = self.MemSche
|
||||
layer_info = schedule.layer[0].get_layer_info()
|
||||
if "is_patch" in layer_info and layer_info["is_patch"]:
|
||||
fp = self.source_handle
|
||||
string = ""
|
||||
first_height = layer_info["input_h"]
|
||||
first_width = layer_info["input_w"]
|
||||
img_w = (first_width - self.patch_params["pad_l"] - self.patch_params["pad_r"]) * self.patch_params[
|
||||
"n_patch"
|
||||
]
|
||||
# by default, we go three stride 2 conv in the patch-based inference
|
||||
patch_out_w = int((first_width - self.patch_params["pad_l"]) / 8)
|
||||
# by default, we go three stride 2 conv in the patch-based inference
|
||||
patch_out_h = int((first_height - self.patch_params["pad_l"]) / 8)
|
||||
out_w = self.patch_params["output_w"]
|
||||
# generate code for testing whole inference time
|
||||
string += (
|
||||
"""void end2endinference(q7_t* img){
|
||||
//stage 1
|
||||
int i, j, h, w, c;
|
||||
for (i = 0; i < """
|
||||
+ str(self.patch_params["n_patch"])
|
||||
+ """; i++){
|
||||
uint16_t pad_t=0,pad_b=0;
|
||||
if (i == 0){
|
||||
pad_t = """
|
||||
+ str(self.patch_params["pad_l"])
|
||||
+ """;
|
||||
}
|
||||
else if (i == """
|
||||
+ str(self.patch_params["n_patch"] - 1)
|
||||
+ """){
|
||||
pad_b = """
|
||||
+ str(self.patch_params["pad_r"])
|
||||
+ """;
|
||||
}
|
||||
for (j = 0; j < """
|
||||
+ str(self.patch_params["n_patch"])
|
||||
+ """; j++){
|
||||
uint16_t pad_l=0,pad_r=0;
|
||||
if (j == 0){
|
||||
pad_l = """
|
||||
+ str(self.patch_params["pad_l"])
|
||||
+ """;
|
||||
}
|
||||
else if (j == """
|
||||
+ str(self.patch_params["n_patch"] - 1)
|
||||
+ """){
|
||||
pad_r = """
|
||||
+ str(self.patch_params["pad_r"])
|
||||
+ """;
|
||||
}
|
||||
/* load partial input from the img */
|
||||
q7_t* patch_input = &buffer0[0]; // for partial input
|
||||
int start_x = MAX("""
|
||||
+ str(first_width - self.patch_params["pad_l"])
|
||||
+ """ * j - """
|
||||
+ str(self.patch_params["pad_l"])
|
||||
+ """,0);
|
||||
int start_y = MAX("""
|
||||
+ str(first_height - self.patch_params["pad_l"])
|
||||
+ """ * i - """
|
||||
+ str(self.patch_params["pad_l"])
|
||||
+ """,0);
|
||||
q7_t* img_ptr = &img[(start_x + start_y * """
|
||||
+ str(img_w)
|
||||
+ """) * 3];
|
||||
|
||||
//skip top
|
||||
patch_input += pad_t * """
|
||||
+ str(first_width)
|
||||
+ """ * 3;
|
||||
for (h = pad_t; h < """
|
||||
+ str(first_height)
|
||||
+ """ - pad_b; h++){
|
||||
//skip left
|
||||
patch_input += pad_l * 3;
|
||||
//fill middle
|
||||
int bytes = ("""
|
||||
+ str(first_width)
|
||||
+ """ - (pad_l + pad_r)) * 3;
|
||||
memcpy (patch_input, img_ptr, bytes);
|
||||
img_ptr += """
|
||||
+ str(img_w)
|
||||
+ """ * 3;
|
||||
patch_input += bytes;
|
||||
//skip right
|
||||
patch_input += pad_r * 3;
|
||||
}
|
||||
invoke_1patch(pad_t,pad_b,pad_l,pad_r);
|
||||
/* concat the output from buffer0 (this is set manually for now) */
|
||||
q7_t* output_ptr = buffer1 + (i * """
|
||||
+ str(patch_out_w)
|
||||
+ """ * """
|
||||
+ str(out_w)
|
||||
+ """ + j * """
|
||||
+ str(patch_out_w)
|
||||
+ """) * """
|
||||
+ str(self.patch_params["output_c"])
|
||||
+ """ ;
|
||||
for (h = 0; h < """
|
||||
+ str(patch_out_h)
|
||||
+ """; h++){
|
||||
for (w = 0; w < """
|
||||
+ str(patch_out_w)
|
||||
+ """; w++){
|
||||
for (c = 0; c < """
|
||||
+ str(self.patch_params["output_c"])
|
||||
+ """; c++){
|
||||
output_ptr[(w + h * """
|
||||
+ str(out_w)
|
||||
+ """) * """
|
||||
+ str(self.patch_params["output_c"])
|
||||
+ """ + c] = buffer0[(w + h * """
|
||||
+ str(patch_out_w)
|
||||
+ """) * """
|
||||
+ str(self.patch_params["output_c"])
|
||||
+ """ + c];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
//stage 2
|
||||
invoke();
|
||||
}"""
|
||||
)
|
||||
string += """
|
||||
|
||||
void invoke_1patch(uint16_t pad_t, uint16_t pad_b, uint16_t pad_l ,uint16_t pad_r){
|
||||
"""
|
||||
fp.write(string)
|
||||
|
||||
# gen patch-based inference code
|
||||
patch_layers = []
|
||||
layercnt = 0
|
||||
for i, op in enumerate(schedule.layer):
|
||||
layer_info = op.get_layer_info()
|
||||
if "is_patch" not in layer_info or not layer_info["is_patch"]:
|
||||
break # end of patch-based
|
||||
string = "/* layer " + str(layercnt) + ":" + layer_info["op"] + " */\n"
|
||||
layercnt += 1
|
||||
fp.write(string)
|
||||
if layer_info["op"] == "CONV_2D":
|
||||
# hardcode this memory schedule for quick implementation
|
||||
# TODO: adjust this according to model architecture and split index
|
||||
next_layer_info = schedule.layer[i + 1].get_layer_info()
|
||||
if "is_patch" not in next_layer_info or not next_layer_info["is_patch"]:
|
||||
layer_info["output_buf_add"] = "front"
|
||||
layer_info["output_buf_add_offset"] = 0
|
||||
if self.unsigned_input:
|
||||
raise Exception("unsigned input is not supported by patch-based yet")
|
||||
|
||||
string = self._genOpstr(
|
||||
op,
|
||||
False,
|
||||
self.FP_output,
|
||||
use_aggressive_unroll,
|
||||
use_hard_switsh,
|
||||
self.fp_requantize,
|
||||
)
|
||||
fp.write(string)
|
||||
|
||||
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
|
||||
string = self._genOpstr(op, self.fp_requantize)
|
||||
fp.write(string)
|
||||
|
||||
elif layer_info["op"] == "ADD":
|
||||
string = self._genOpstr(op)
|
||||
fp.write(string)
|
||||
|
||||
patch_layers.append(schedule.layer[i])
|
||||
|
||||
# remove these layers for patching for the following code gen
|
||||
for layer in patch_layers:
|
||||
schedule.layer.remove(layer)
|
||||
|
||||
string = "}\n\n"
|
||||
|
||||
fp.write(string)
|
||||
else: # not patch-based
|
||||
string = """void end2endinference(q7_t* img){
|
||||
invoke(NULL);
|
||||
}
|
||||
"""
|
||||
fp = self.source_handle
|
||||
fp.write(string)
|
||||
|
||||
def _genInvoke(self):
|
||||
fp = self.source_handle
|
||||
string = "void invoke(float* labels){\n"
|
||||
fp.write(string)
|
||||
|
||||
schedule = self.MemSche
|
||||
for i, op in enumerate(schedule.layer):
|
||||
layer_info = op.get_layer_info()
|
||||
string = "/* layer " + str(i) + ":" + layer_info["op"] + " */\n"
|
||||
fp.write(string)
|
||||
|
||||
if layer_info["op"] == "CONV_2D":
|
||||
if (
|
||||
self.FP_output
|
||||
and "effective_scale" in layer_info
|
||||
and layer_info["output_scale"] is not None
|
||||
and layer_info["effective_scale"] is not None
|
||||
):
|
||||
use_fp = True
|
||||
else:
|
||||
use_fp = False
|
||||
string = self._genOpstr(
|
||||
op,
|
||||
self.unsigned_input,
|
||||
use_fp,
|
||||
use_aggressive_unroll,
|
||||
use_hard_switsh,
|
||||
self.fp_requantize,
|
||||
self.tflite_op,
|
||||
self.dummy_address,
|
||||
)
|
||||
fp.write(string)
|
||||
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
|
||||
string = self._genOpstr(op, self.fp_requantize)
|
||||
fp.write(string)
|
||||
else:
|
||||
string = self._genOpstr(op)
|
||||
fp.write(string)
|
||||
|
||||
string = "}\n"
|
||||
fp.write(string)
|
||||
|
||||
def _getBufferIndex(self, location):
|
||||
if location == "front":
|
||||
return 0
|
||||
elif location == "end":
|
||||
return 0
|
||||
elif location == "residual":
|
||||
return 1
|
||||
return None
|
||||
|
||||
def _genMemBuffer(self):
|
||||
schedule = self.MemSche
|
||||
# define output tensor
|
||||
string = "#define NNoutput &buffer0[" + str(_findtheinferenceOutput(schedule.layer)) + "];"
|
||||
fp = self.header_handle
|
||||
fp.write("\n" + string + "\n")
|
||||
|
||||
# activation buffers
|
||||
string = "\n/* sram:" + str(schedule.peakmem) + ", flash:" + str(schedule.flash) + " */\n"
|
||||
fp.write(string + "\n")
|
||||
|
||||
string = "static signed char buffer[" + str(schedule.peakmem) + "];\n"
|
||||
fp.write(string)
|
||||
accumulate_ptr = 0
|
||||
string = "static signed char *buffer0 = &buffer[" + str(accumulate_ptr) + "];\n"
|
||||
accumulate_ptr += int(schedule.buffers["input_output"])
|
||||
fp.write(string)
|
||||
string = "static signed char *buffer1 = &buffer[" + str(accumulate_ptr) + "];\n"
|
||||
accumulate_ptr += int(schedule.buffers["residual"])
|
||||
fp.write(string)
|
||||
|
||||
string = "static int16_t *sbuf = (int16_t *)&buffer[" + str(accumulate_ptr) + "];\n"
|
||||
accumulate_ptr += int(schedule.buffers["im2col"])
|
||||
fp.write(string)
|
||||
string = "static int32_t *kbuf = (int32_t *)&buffer[" + str(accumulate_ptr) + "];\n"
|
||||
accumulate_ptr += int(schedule.buffers["kernel"])
|
||||
fp.write(string)
|
||||
string = "const int SBuffer_size = " + str(int(schedule.buffers["im2col"])) + ";\n"
|
||||
fp.write(string)
|
||||
string = "const int KBuffer_size = " + str(int(schedule.buffers["kernel"])) + ";\n"
|
||||
fp.write(string + "\n")
|
||||
|
||||
def _includeHeaders(self):
|
||||
include_string = """/* Automatically generated source file */
|
||||
#include <float.h>
|
||||
#include "arm_nnfunctions.h"
|
||||
|
||||
#include "genNN.h"
|
||||
#include "genModel.h"
|
||||
|
||||
#include "tinyengine_function.h"
|
||||
//#include "tinyengine_function_fp.h"
|
||||
|
||||
"""
|
||||
if self.profile_mode:
|
||||
include_string += '#include "profile.h"\n'
|
||||
|
||||
include_string += """
|
||||
/* Variables used by all ops */
|
||||
ADD_params add_params;
|
||||
//Conv_Params conv_params;
|
||||
//Depthwise_Params dpconv_params;
|
||||
int i;
|
||||
int8_t *int8ptr;
|
||||
float *fptr,*fptr2,*fptr3;
|
||||
|
||||
signed char* getInput() {
|
||||
return &buffer0[""" + f"{self.MemSche.layer[0].params['input_buf_add_offset']}" + """];
|
||||
}
|
||||
signed char* getOutput() {
|
||||
return NNoutput;
|
||||
}\n"""
|
||||
fp = self.source_handle
|
||||
fp.write(include_string)
|
||||
|
||||
def _parseTrainable(self):
|
||||
schedule = self.MemSche
|
||||
for i, op in enumerate(schedule.layer):
|
||||
layer_info = op.get_layer_info()
|
||||
if layer_info["op"] == "CONV_2D":
|
||||
self._parseWeight(
|
||||
self.parse_count,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["weight_name"],
|
||||
self._readOnly(layer_info["weight_name"]),
|
||||
)
|
||||
|
||||
if "bias_name" in layer_info:
|
||||
self._parseBias(
|
||||
self.parse_count,
|
||||
layer_info["bias"].flatten(),
|
||||
layer_info["bias_name"],
|
||||
self._readOnly(layer_info["bias_name"]),
|
||||
)
|
||||
else:
|
||||
self._parseBias(self.parse_count, layer_info["bias"].flatten())
|
||||
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
|
||||
self._parseRequantize(
|
||||
self.parse_count,
|
||||
layer_info["shift"].flatten(),
|
||||
layer_info["multiplier"].flatten(),
|
||||
)
|
||||
|
||||
layer_info["parsed_trainable"] = self.parse_count
|
||||
self.parse_count += 1
|
||||
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
|
||||
if layer_info["kernel_h"] > layer_info["kernel_w"]:
|
||||
self._parseCWHWeight(
|
||||
self.parse_count,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["kernel_h"],
|
||||
layer_info["kernel_w"],
|
||||
layer_info["input_c"],
|
||||
)
|
||||
else:
|
||||
if "weight_name" in layer_info:
|
||||
self._parseCHWWeight(
|
||||
self.parse_count,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["input_c"],
|
||||
)
|
||||
else:
|
||||
self._parseCHWWeight(
|
||||
self.parse_count,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["input_c"],
|
||||
)
|
||||
if "bias_name" in layer_info:
|
||||
self._parseoffsetBias(
|
||||
self.parse_count,
|
||||
layer_info["bias"].flatten(),
|
||||
layer_info["input_zero_point"] * -1,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["input_c"],
|
||||
layer_info["bias_name"],
|
||||
self._readOnly(layer_info["bias_name"]),
|
||||
)
|
||||
else:
|
||||
self._parseoffsetBias(
|
||||
self.parse_count,
|
||||
layer_info["bias"].flatten(),
|
||||
layer_info["input_zero_point"] * -1,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["input_c"],
|
||||
)
|
||||
self._parseEffectivescales(self.parse_count, layer_info["effective_scale"].flatten())
|
||||
self._parseRequantize(
|
||||
self.parse_count,
|
||||
layer_info["shift"].flatten(),
|
||||
layer_info["multiplier"].flatten(),
|
||||
)
|
||||
|
||||
layer_info["parsed_trainable"] = self.parse_count
|
||||
self.parse_count += 1
|
||||
|
||||
elif layer_info["op"] == "FULLY_CONNECTED":
|
||||
self._parseWeight(
|
||||
self.parse_count,
|
||||
layer_info["weight_value"].flatten(),
|
||||
layer_info["weight_name"],
|
||||
self._readOnly(layer_info["weight_name"]),
|
||||
)
|
||||
self._parseBias(self.parse_count, layer_info["bias"].flatten())
|
||||
|
||||
layer_info["parsed_trainable"] = self.parse_count
|
||||
self.parse_count += 1
|
||||
|
||||
elif layer_info["op"] == "SOFTMAX":
|
||||
pass
|
||||
|
||||
def _parseCWHWeight(self, Lindex, weight, height, width, channel):
|
||||
fp = self.header_handle
|
||||
# 8bit implementation
|
||||
if self.BIT == 8:
|
||||
string = "const unsigned char CWHweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
|
||||
fp.write(string)
|
||||
for j in range(channel):
|
||||
for w in range(width):
|
||||
for h in range(height):
|
||||
value = weight[(h * width + w) * channel + j]
|
||||
if value < 0:
|
||||
value += 256
|
||||
fp.write(str(format(value, "#04x")) + ", ")
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
fp.write("};\n")
|
||||
|
||||
def _parseCHWWeight(self, Lindex, weight, channel):
|
||||
fp = self.header_handle
|
||||
kernelsize = int(len(weight) / channel)
|
||||
# 8bit implementation
|
||||
if self.BIT == 8:
|
||||
string = "const unsigned char CHWweight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
|
||||
fp.write(string)
|
||||
for j in range(channel):
|
||||
for i in range(kernelsize):
|
||||
value = int(weight[i * channel + j])
|
||||
if value < 0:
|
||||
value += 256
|
||||
fp.write(str(format(value, "#04x")) + ", ")
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
fp.write("};\n")
|
||||
|
||||
def _parseEffectivescales(self, Lindex, scales):
|
||||
fp = self.header_handle
|
||||
string = "const float scales" + str(Lindex) + "[" + str(len(scales)) + "] = {"
|
||||
fp.write(string)
|
||||
for _, value in enumerate(scales):
|
||||
fp.write(str(value) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
def _parseWeight(self, Lindex, weight, weight_name=None, is_const=True):
|
||||
fp = self.header_handle
|
||||
const_str = "const " if is_const else ""
|
||||
string = f"{const_str}unsigned char weight" + str(Lindex) + "[" + str(len(weight)) + "] = {"
|
||||
fp.write(string)
|
||||
for _, value in enumerate(weight):
|
||||
value = int(value)
|
||||
if value < 0:
|
||||
value += 256
|
||||
fp.write(str(format(value, "#04x")) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
if weight_name is not None:
|
||||
for r in self.trainSRAMTable:
|
||||
if r.name == weight_name:
|
||||
return
|
||||
self.trainSRAMTable.append(tensorRecorder(weight_name, len(weight), "unknown"))
|
||||
|
||||
if weight.dtype == "int8":
|
||||
string = f"{const_str}unsigned char* {weight_name}=weight" + str(Lindex) + ";\n"
|
||||
else:
|
||||
raise NotImplementedError
|
||||
fp.write(string)
|
||||
|
||||
def _parseoffsetBias(self, Lindex, bias, input_offset, weight, channel, bias_name=None, is_const=True):
|
||||
fp = self.header_handle
|
||||
const_str = "const " if is_const else ""
|
||||
string = f"{const_str}int32_t offsetBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
|
||||
fp.write(string)
|
||||
kernelsize = int(len(weight) / channel)
|
||||
# fuse the offset into bias
|
||||
for i in range(channel):
|
||||
tmpW = 0
|
||||
for j in range(kernelsize):
|
||||
tmpW += weight[j * channel + i]
|
||||
fp.write(str(self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
|
||||
fp.write("};\n")
|
||||
string = f"{const_str}int32_t offsetRBias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
|
||||
fp.write(string)
|
||||
kernelsize = int(len(weight) / channel)
|
||||
for i in range(channel):
|
||||
tmpW = 0
|
||||
for j in range(kernelsize):
|
||||
tmpW += weight[j * channel + i]
|
||||
fp.write(str(bias[i] + tmpW * input_offset - self.int32_clip(bias[i] + tmpW * input_offset)) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
def _parseBias(self, Lindex, bias, bias_name=None, is_const=True):
|
||||
fp = self.header_handle
|
||||
const_str = "const " if is_const else ""
|
||||
string = f"{const_str}int32_t bias" + str(Lindex) + "[" + str(len(bias)) + "] = {"
|
||||
fp.write(string)
|
||||
for _, value in enumerate(bias):
|
||||
value = int(value)
|
||||
fp.write(str(value) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
def _parseRequantize(self, Lindex, shift, multiplier):
|
||||
fp = self.header_handle
|
||||
string = "const int32_t shift" + str(Lindex) + "[" + str(len(shift)) + "] = {"
|
||||
fp.write(string)
|
||||
for _, value in enumerate(shift):
|
||||
fp.write(str(value) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
string = "const int32_t multiplier" + str(Lindex) + "[" + str(len(multiplier)) + "] = {"
|
||||
fp.write(string)
|
||||
for _, value in enumerate(multiplier):
|
||||
fp.write(str(value) + ", ")
|
||||
fp.write("};\n")
|
||||
|
||||
def int32_clip(self, a):
|
||||
if a < -(2**31):
|
||||
return -(2**31)
|
||||
elif a > 2**31 - 1:
|
||||
return 2**31 - 1
|
||||
return a.astype(int)
|
||||
|
||||
def _closefp(self):
|
||||
self.header_handle.close()
|
||||
self.source_handle.close()
|
||||
|
||||
|
||||
def _findtheinferenceOutput(layers):
|
||||
for cnt, op in enumerate(layers):
|
||||
if op.params["output_dtype"] != "int8":
|
||||
return layers[cnt - 1].params["output_buf_add_offset"]
|
||||
return layers[-1].params["output_buf_add_offset"]
|
||||
|
||||
|
||||
class tensorRecorder:
|
||||
def __init__(self, name, len, dtype):
|
||||
self.name = name
|
||||
self.len = len
|
||||
self.dtype = dtype
|
72
code_generator/CodegenUtilTFlite.py
Normal file
@ -0,0 +1,72 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: CodegenUtilTFlite.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
import os
|
||||
from tempfile import TemporaryDirectory
|
||||
|
||||
from .CodeGenerator import CodeGenerator
|
||||
from .GeneralMemoryScheduler import GeneralMemoryScheduler
|
||||
from .TfliteConvertor import TfliteConvertor
|
||||
|
||||
|
||||
def GenerateSourceFilesFromTFlite(
|
||||
tflite_path,
|
||||
life_cycle_path=None,
|
||||
):
|
||||
use_inplace = True
|
||||
|
||||
with TemporaryDirectory() as WORKING_DIR:
|
||||
if life_cycle_path is None:
|
||||
schedule_image_path = os.path.join(WORKING_DIR, "schedule.png")
|
||||
else:
|
||||
schedule_image_path = life_cycle_path
|
||||
|
||||
tf_convertor = TfliteConvertor(tflite_path)
|
||||
tf_convertor.parseOperatorInfo()
|
||||
layer = tf_convertor.layer
|
||||
outTable = []
|
||||
VisaulizeTrainable = False # disable for code gen
|
||||
memory_scheduler = GeneralMemoryScheduler(
|
||||
layer,
|
||||
False,
|
||||
False,
|
||||
outputTables=outTable,
|
||||
inplace=use_inplace,
|
||||
mem_visual_path=schedule_image_path,
|
||||
VisaulizeTrainable=VisaulizeTrainable,
|
||||
)
|
||||
memory_scheduler.USE_INPLACE = use_inplace
|
||||
memory_scheduler.allocateMemory()
|
||||
|
||||
outTable = tf_convertor.outputTables if hasattr(tf_convertor, "outputTables") else []
|
||||
code_generator = CodeGenerator(
|
||||
memsche=memory_scheduler,
|
||||
inplace=memory_scheduler.USE_INPLACE,
|
||||
unsigned_input=False,
|
||||
patch_params=None,
|
||||
FP_output=False,
|
||||
profile_mode=False,
|
||||
fp_requantize=True,
|
||||
tflite_op=False,
|
||||
dummy_address=False,
|
||||
outputTables=outTable,
|
||||
)
|
||||
# set detection outputs before codegen if any
|
||||
code_generator.codeGeneration()
|
||||
|
||||
return memory_scheduler.buffers["input_output"]
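|
||||
|
||||
# Example use (a minimal sketch; the model path is hypothetical):
|
||||
#   peak_sram = GenerateSourceFilesFromTFlite("model.tflite")
|
||||
#   print("peak activation buffer: {} bytes".format(peak_sram))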
|
389
code_generator/GeneralMemoryScheduler.py
Normal file
@ -0,0 +1,389 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: GeneralMemoryScheduler.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
from .allocator.firstFit import FirstFit
|
||||
from .constant import TTYPE_INFERNECE
|
||||
|
||||
|
||||
class GeneralMemoryScheduler:
|
||||
def __init__(
|
||||
self,
|
||||
layer,
|
||||
tflite_op=False,
|
||||
dummy_address=False,
|
||||
memory_limit=10 * 1024 * 1024,
|
||||
inplace=True,
|
||||
outputTables=None,
|
||||
mem_visual_path="codegen/allocation.png",
|
||||
VisaulizeTrainable=True,
|
||||
):
|
||||
self.layer = layer
|
||||
self.heads = 0
|
||||
self.buffers = {
|
||||
"input_output": 0,
|
||||
"residual": 0,
|
||||
"im2col": 0,
|
||||
"kernel": 0,
|
||||
"feature": 0,
|
||||
"trainable": 0,
|
||||
} # for feature pyramid
|
||||
# overall memory info
|
||||
self.peakmem = 0
|
||||
self.flash = 0
|
||||
self.bias = 0
|
||||
self.scale = 0
|
||||
self.code = 0
|
||||
self.allocator = FirstFit(memory_limit)
|
||||
self.outputTables = outputTables
|
||||
self.USE_INPLACE = inplace
|
||||
self.mem_visual_path = mem_visual_path
|
||||
self.tflite_op = tflite_op
|
||||
self.dummy_address = dummy_address
|
||||
self.VisaulizeTrainable = VisaulizeTrainable
|
||||
|
||||
# for showing layer-wise memory usage
|
||||
self.layermem = []
|
||||
|
||||
def _isTranable(self, name):
|
||||
for o in self.outputTables:
|
||||
if isinstance(name, str) and o.name in name:
|
||||
return True
|
||||
return False
|
||||
|
||||
def allocateMemory(self):
|
||||
# assign the same graph index for inplace operations
|
||||
# note: we need to handle stride == 2 for int8 depthwise to save memory
|
||||
if self.USE_INPLACE:
|
||||
for i, op in enumerate(self.layer):
|
||||
if op.params["op"] == "DEPTHWISE_CONV_2D" and op.params["input_dtype"] == "int8" and not self.tflite_op:
|
||||
# set the idx of output and next layer input
|
||||
previous_output_idx = op.output_tensors[0].graph_idx
|
||||
op.output_tensors[0].graph_idx = op.input_tensors[0].graph_idx
|
||||
if (
|
||||
i + 1 < len(self.layer)
|
||||
and len(self.layer[i + 1].input_tensors) > 0
|
||||
and str(self.layer[i + 1].input_tensors[0].graph_idx) == str(previous_output_idx)
|
||||
):
|
||||
self.layer[i + 1].input_tensors[0].graph_idx = op.input_tensors[0].graph_idx
|
||||
# update following ops' tensors
|
||||
for following_idx in range(i, len(self.layer)):
|
||||
for cnt, inp_tensor in enumerate(self.layer[following_idx].input_tensors):
|
||||
if str(inp_tensor.graph_idx) == str(previous_output_idx):
|
||||
inp_tensor.graph_idx = op.input_tensors[0].graph_idx
|
||||
|
||||
num_layers = len(self.layer)
|
||||
# go through all tensors in the model
|
||||
for i, op in enumerate(self.layer):
|
||||
# get all unallocated tensors for this layer
|
||||
unallocated_tensors = []
|
||||
for t in op.input_tensors:
|
||||
if t.allocator_idx is None:
|
||||
unallocated_tensors.append(t)
|
||||
for cnt, t in enumerate(op.output_tensors):
|
||||
if cnt == 0 and not (
|
||||
self.USE_INPLACE
|
||||
and op.params["op"] == "DEPTHWISE_CONV_2D"
|
||||
and op.params["input_dtype"] == "int8"
|
||||
and not self.tflite_op
|
||||
):
|
||||
if t.allocator_idx is None:
|
||||
unallocated_tensors.append(t)
|
||||
# assume second outputs will not be updated in place
|
||||
else:
|
||||
if t.allocator_idx is None:
|
||||
unallocated_tensors.append(t)
|
||||
|
||||
# add each tensor
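|
||||
# a tensor lives from the layer that produces it (start_idx) to just past its last
|
||||
# consumer (end_idx); the first-fit allocator packs tensors by these lifetime intervals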
|
||||
for cnt, t in enumerate(unallocated_tensors):
|
||||
start_idx = i
|
||||
end_idx = i + 1 if i == 0 else num_layers
|
||||
for idx in range(start_idx + 1, num_layers):
|
||||
for input_t in self.layer[idx].input_tensors:
|
||||
if str(t.graph_idx) == str(input_t.graph_idx):
|
||||
end_idx = idx + 1
|
||||
# check if this is output
|
||||
ttype = TTYPE_INFERNECE
|
||||
# add the tensor
|
||||
t.allocator_idx = self.allocator.addTensor(start_idx, end_idx, t.len(), name=t.graph_idx, type=ttype)
|
||||
# propagate the allocation to tensors with the same idx
|
||||
for j in range(i + 1, num_layers):
|
||||
opp = self.layer[j]
|
||||
for tt in opp.input_tensors:
|
||||
if str(t.graph_idx) == str(tt.graph_idx):
|
||||
tt.allocator_idx = t.allocator_idx
|
||||
# not inplace update
|
||||
for tt in opp.output_tensors:
|
||||
if str(t.graph_idx) == str(tt.graph_idx):
|
||||
tt.allocator_idx = t.allocator_idx
|
||||
|
||||
# for detailed memory
|
||||
layermem = {}
|
||||
|
||||
layermem["MAC"] = op.get_macs()
|
||||
layermem["activation"] = op.get_activation_size()
|
||||
layermem["scale"] = op.get_scale_size()
|
||||
layermem["runtime"] = op.get_sbuf_size()
|
||||
layermem["kernel"] = op.get_kbuf_size()
|
||||
self._enlargeBuffer("im2col", layermem["runtime"])
|
||||
self._enlargeBuffer("kernel", layermem["kernel"])
|
||||
|
||||
if (
|
||||
"weight_name" in op.params
|
||||
and self._isTranable(op.params["weight_name"])
|
||||
and op.params["op"] != "TRANSPOSE_CONV_2D"
|
||||
):
|
||||
size = int(op.get_weights_size())
|
||||
self.buffers["trainable"] += size
|
||||
layermem["trainable"] = size
|
||||
layermem["weight"] = 0
|
||||
else:
|
||||
layermem["weight"] = int(op.get_weights_size())
|
||||
if "bias_name" in op.params and self._isTranable(op.params["bias_name"]):
|
||||
size = int(op.get_bias_size())
|
||||
self.buffers["trainable"] += size
|
||||
if "trainable" in layermem:
|
||||
layermem["trainable"] += size
|
||||
else:
|
||||
layermem["trainable"] = size
|
||||
layermem["bias"] = 0
|
||||
else:
|
||||
layermem["bias"] = int(op.get_bias_size())
|
||||
# if it is a float32 op, its weights/bias come from SRAM buffers, so they do not count toward flash
|
||||
if op.params["input_dtype"] != "int8":
|
||||
layermem["scale"] = 0
|
||||
layermem["bias"] = 0
|
||||
layermem["weight"] = 0
|
||||
self.__increaseFlash(layermem["weight"])
|
||||
self.__increaseFlash(layermem["bias"])
|
||||
self.__increaseFlash(layermem["scale"])
|
||||
|
||||
self.layermem.append(layermem)
|
||||
|
||||
# find out int8 inplace depthwise conv and stride == 2
|
||||
for i, op in enumerate(self.layer):
|
||||
if (
|
||||
op.params["op"] == "DEPTHWISE_CONV_2D"
|
||||
and op.params["input_dtype"] == "int8"
|
||||
and op.params["stride_h"] == op.params["stride_w"] == 2
|
||||
):
|
||||
if op.input_tensors[0].allocator_idx == op.output_tensors[0].allocator_idx:
|
||||
self.allocator.rectangles[op.input_tensors[0].allocator_idx]["stride2_inplace_idx"] = i
|
||||
|
||||
# Reorder the rectangles to decide which tensor needs to be scheduled first
|
||||
self.allocator.sortSize()
|
||||
self.allocator.allocate()
|
||||
self.allocator.visualize(self.mem_visual_path)
|
||||
self._enlargeBuffer("input_output", self.allocator.get_peak())
|
||||
|
||||
# sanity check, see if all tensors have been allocated
|
||||
for i, op in enumerate(self.layer):
|
||||
# get all unallocated tensors for this layer
|
||||
for cnt, t in enumerate(op.input_tensors):
|
||||
assert t.allocator_idx is not None
|
||||
for cnt, t in enumerate(op.output_tensors):
|
||||
assert t.allocator_idx is not None
|
||||
|
||||
# assign the address according to placement
|
||||
for i, op in enumerate(self.layer):
|
||||
# get all unallocated tensors for this layer
|
||||
for cnt, t in enumerate(op.input_tensors):
|
||||
if cnt == 0:
|
||||
op.params["input_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
op.params["input_buf_add"] = "front"
|
||||
elif cnt == 1:
|
||||
op.params["input2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
op.params["input2_buf_add"] = "front"
|
||||
elif cnt == 2:
|
||||
op.params["input3_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
op.params["input3_buf_add"] = "front"
|
||||
op.input_tensors[cnt].buffer_name = "buffer0"
|
||||
op.input_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
for cnt, t in enumerate(op.output_tensors):
|
||||
if cnt == 0:
|
||||
op.params["output_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
op.params["output_buf_add"] = "front"
|
||||
op.output_tensors[cnt].buffer_name = "buffer0"
|
||||
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
if cnt == 1:
|
||||
op.params["output2_buf_add_offset"] = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
op.params["output2_buf_add"] = "front"
|
||||
op.output_tensors[cnt].buffer_name = "buffer0"
|
||||
op.output_tensors[cnt].buffer_address = self.allocator.getIdxAddress(t.allocator_idx)
|
||||
|
||||
# calculate peak mem
|
||||
self.peakmem = (
|
||||
self.allocator.get_peak() + self.buffers["im2col"] + self.buffers["kernel"] # + self.buffers["trainable"]
|
||||
)
|
||||
|
||||
def dumpLayerIndex(self):
|
||||
# header
|
||||
print("-" * 14 + " Tensor Allocation Details " + "-" * 14)
|
||||
print(" #op | operator type | input index | output index |")
|
||||
for cnt, l in enumerate(self.layer):
|
||||
operator_num = "#" + str(cnt)
|
||||
type = str(l.params["op"])
|
||||
input_tensor = ""
|
||||
for cnt_inp, inp in enumerate(l.input_tensors):
|
||||
input_tensor += str(inp.allocator_idx)
|
||||
if cnt_inp < len(l.input_tensors) - 1:
|
||||
input_tensor += ","
|
||||
output_tensor = str(l.output_tensors[0].allocator_idx)
|
||||
string = (
|
||||
operator_num.ljust(5)
|
||||
+ "|"
|
||||
+ type.ljust(19)
|
||||
+ "|"
|
||||
+ input_tensor.ljust(13)
|
||||
+ "|"
|
||||
+ output_tensor.ljust(14)
|
||||
+ "|"
|
||||
)
|
||||
print(string)
|
||||
|
||||
def dumpLayerMem(self):
|
||||
# header
|
||||
print(
|
||||
"---------------------------------------------------- Schedule Details ----------------------------------------------------------------" # noqa: E501
|
||||
)
|
||||
print(
|
||||
"----------------------| SRAM || Flash | |" # noqa: E501
|
||||
)
|
||||
print(
|
||||
"----------------------| activation | runtime | trainable | sum || weight | bias | scale | sum | MAC |" # noqa: E501
|
||||
)
|
||||
|
||||
layermem = self.layermem
|
||||
self.__dumpMemInfo(layermem)
|
||||
|
||||
def __dumpMemInfo(self, layermem):
|
||||
string = "-------Schedule-------|"
|
||||
maxActive = self.buffers["input_output"]
|
||||
maxRuntime = self.buffers["im2col"] + self.buffers["kernel"]
|
||||
maxTrainable = self.buffers["trainable"]
|
||||
totalWeight = self.__sumKey(layermem, "weight")
|
||||
totalBias = self.__sumKey(layermem, "bias")
|
||||
totalScale = self.__sumKey(layermem, "scale")
|
||||
totalMAC = self.__sumKey(layermem, "MAC")
|
||||
string += str(maxActive).ljust(14) + "|"
|
||||
string += str(maxRuntime).ljust(11) + "|"
|
||||
string += str(maxTrainable).ljust(12) + "|"
|
||||
string += str(maxActive + maxRuntime + maxTrainable).ljust(8) + "||"
|
||||
string += str(totalWeight).ljust(12) + "|"
|
||||
string += str(totalBias).ljust(10) + "|"
|
||||
string += str(totalScale).ljust(10) + "|"
|
||||
string += str(totalWeight + totalBias + totalScale).ljust(13) + "|"
|
||||
string += str(totalMAC).ljust(13) + "|"
|
||||
print(string)
|
||||
for i, _ in enumerate(layermem):
|
||||
layer_info = self.layer[i].get_layer_info()
|
||||
string = ""
|
||||
string += str(i) + ":" + layer_info["op"]
|
||||
string = string.ljust(22) + "|"
|
||||
SRAM = 0
|
||||
if "activation" in layermem[i]:
|
||||
substr = (
|
||||
str(layermem[i]["activation"]) + " (" + "{:.0%}".format(layermem[i]["activation"] / maxActive) + ")"
|
||||
)
|
||||
string += substr.ljust(14) + "|"
|
||||
SRAM += layermem[i]["activation"]
|
||||
if "runtime" in layermem[i]:
|
||||
sbuf = layermem[i]["runtime"] + layermem[i]["kernel"]
|
||||
substr = str(sbuf) + " (" + "{:.0%}".format(sbuf / maxRuntime) + ")"
|
||||
string += substr.ljust(11) + "|"
|
||||
SRAM += sbuf
|
||||
else:
|
||||
string = string.ljust(49) + "|"
|
||||
if "trainable" in layermem[i]:
|
||||
substr = (
|
||||
str(layermem[i]["trainable"])
|
||||
+ " ("
|
||||
+ "{:.0%}".format(layermem[i]["trainable"] / maxTrainable)
|
||||
+ ")"
|
||||
)
|
||||
string += substr.ljust(12) + "|"
|
||||
SRAM += layermem[i]["trainable"]
|
||||
else:
|
||||
string = string.ljust(62) + "|"
|
||||
|
||||
# SRAM end
|
||||
string += str(SRAM)
|
||||
string = string.ljust(71) + "||"
|
||||
flash = 0
|
||||
if "weight" in layermem[i]:
|
||||
substr = (
|
||||
str(layermem[i]["weight"])
|
||||
+ " ("
|
||||
+ "{:.0%}".format(layermem[i]["weight"] / (totalWeight + 0.0001))
|
||||
+ ")"
|
||||
)
|
||||
string += str(substr).ljust(12) + "|"
|
||||
flash += layermem[i]["weight"]
|
||||
if "bias" in layermem[i]:
|
||||
substr = (
|
||||
str(layermem[i]["bias"]) + " (" + "{:.0%}".format(layermem[i]["bias"] / (totalBias + 0.0001)) + ")"
|
||||
)
|
||||
string += str(substr).ljust(10) + "|"
|
||||
flash += layermem[i]["bias"]
|
||||
if "scale" in layermem[i]:
|
||||
substr = (
|
||||
str(layermem[i]["scale"]) + " (" + "{:.0%}".format(layermem[i]["scale"] / totalScale + 0.0001) + ")"
|
||||
)
|
||||
string += str(substr).ljust(10) + "|"
|
||||
flash += layermem[i]["scale"]
|
||||
|
||||
if flash > 0:
|
||||
string += (
|
||||
str(flash)
|
||||
+ " ("
|
||||
+ "{:.0%}".format(flash / (totalWeight + totalBias + totalScale + 0.0001))
|
||||
+ ")"
|
||||
)
|
||||
string = string.ljust(121) + "|"
|
||||
# flash end
|
||||
if "MAC" in layermem[i]:
|
||||
substr = str(layermem[i]["MAC"]) + " (" + "{:.0%}".format(layermem[i]["MAC"] / totalMAC) + ")"
|
||||
string += str(substr).ljust(13) + "|"
|
||||
print(string)
|
||||
|
||||
def __sumKey(self, layers, key):
|
||||
result = 0
|
||||
for layer in layers:
|
||||
if key in layer:
|
||||
result += layer[key]
|
||||
|
||||
return result
|
||||
|
||||
def getBuffers(self):
|
||||
return self.buffers
|
||||
|
||||
# Maximum binary size: This should be updated if any change in the inference side
|
||||
# TODO: Combine with code generation to get more accurate result
|
||||
def profileResult(self):
|
||||
return self.peakmem, self.flash + self.bias + self.scale + int(self.code * 1024)
|
||||
|
||||
def __increaseFlash(self, size):
|
||||
self.flash += int(size)
|
||||
|
||||
def _enlargeBuffer(self, buf_str, size):
|
||||
if buf_str == "input_output" or buf_str == "residual":
|
||||
self.buffers[buf_str] = max(self.buffers[buf_str], int(size))
|
||||
else:
|
||||
if buf_str not in self.buffers:
|
||||
self.buffers[buf_str] = size
|
||||
else:
|
||||
self.buffers[buf_str] = max(self.buffers[buf_str], size)
|
167
code_generator/InputResizer.py
Normal file
@ -0,0 +1,167 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: InputResizer.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
import math
|
||||
|
||||
|
||||
def _find_previous_info(layers, idx):
|
||||
for layer in layers:
|
||||
info = layer.get_layer_info()
|
||||
if info["output_idx"] == idx:
|
||||
return info
|
||||
|
||||
|
||||
class InputResizer:
|
||||
def __init__(self, layer):
|
||||
self.layer = layer
|
||||
|
||||
def inputResize(self, input_h, input_w):
|
||||
for i, layer in enumerate(self.layer):
|
||||
layer_info = layer.get_layer_info()
|
||||
|
||||
previous_layer_info = _find_previous_info(self.layer, layer_info["input_idx"])
|
||||
# we need to handle different op
|
||||
op_code_str = layer_info["op"]
|
||||
if i == 0:
|
||||
layer_info["input_h"] = input_h
|
||||
layer_info["input_w"] = input_w
|
||||
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
|
||||
else:
|
||||
if op_code_str == "SE_AVG_POOL_2D":
|
||||
SEinput_h = previous_layer_info["output_h"]
|
||||
SEinput_w = previous_layer_info["output_w"]
|
||||
layer_info["input_h"] = SEinput_h
|
||||
layer_info["input_w"] = SEinput_w
|
||||
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
|
||||
layer_info["sample_h"] = SEinput_h
|
||||
layer_info["sample_w"] = SEinput_w
|
||||
else:
|
||||
layer_info["input_h"] = previous_layer_info["output_h"]
|
||||
layer_info["input_w"] = previous_layer_info["output_w"]
|
||||
layer_info["input_c"] = previous_layer_info["output_c"]
|
||||
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
|
||||
if op_code_str == "AVERAGE_POOL_2D":
|
||||
layer_info["filter_h"] = layer_info["input_h"]
|
||||
layer_info["filter_w"] = layer_info["input_w"]
|
||||
layer_info["filter_c"] = layer_info["input_c"]
|
||||
|
||||
# handle nodes for dag op
|
||||
# find the previous node
|
||||
if "dagop_input0_key" in layer_info:
|
||||
for op in self.layer:
|
||||
l_into = op.get_layer_info()
|
||||
if (
|
||||
"dagop_output_key" in l_into
|
||||
and l_into["dagop_output_key"] == layer_info["dagop_input0_key"]
|
||||
):
|
||||
layer_info["input_h"] = l_into["output_h"]
|
||||
layer_info["input_w"] = l_into["output_w"]
|
||||
layer_info["input_c"] = l_into["output_c"]
|
||||
if "dagop_input1_key" in layer_info:
|
||||
for op in self.layer:
|
||||
l_into = op.get_layer_info()
|
||||
if (
|
||||
"dagop_output_key" in l_into
|
||||
and l_into["dagop_output_key"] == layer_info["dagop_input1_key"]
|
||||
):
|
||||
layer_info["input_h"] = l_into["output_h"]
|
||||
layer_info["input_w"] = l_into["output_w"]
|
||||
layer_info["input_c"] = l_into["output_c"]
|
||||
|
||||
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
|
||||
layer_info["output_h"] = math.ceil(layer_info["input_h"] / layer_info["stride_h"])
|
||||
layer_info["output_w"] = math.ceil(layer_info["input_w"] / layer_info["stride_w"])
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
elif op_code_str == "ADD":
|
||||
layer_info["output_h"] = layer_info["input_h"]
|
||||
layer_info["output_w"] = layer_info["input_w"]
|
||||
layer_info["output_c"] = layer_info["input_c"]
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
layer_info["input2_h"] = layer_info["input_h"]
|
||||
layer_info["input2_w"] = layer_info["input_w"]
|
||||
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
|
||||
elif op_code_str == "SE_ELEMENT_MULT_2D":
|
||||
layer_info["input2_h"] = SEinput_h
|
||||
layer_info["input2_w"] = SEinput_w
|
||||
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input2_h"], layer_info["input_w"])
|
||||
layer_info["output_h"] = SEinput_h
|
||||
layer_info["output_w"] = SEinput_w
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
elif op_code_str == "UPSAMPLE":
|
||||
layer_info["output_h"] = layer_info["input_h"] * layer_info["factor"]
|
||||
layer_info["output_w"] = layer_info["input_w"] * layer_info["factor"]
|
||||
layer_info["output_c"] = layer_info["input_c"]
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
elif op_code_str == "MAX_POOL_2D":
|
||||
layer_info["output_h"] = int(layer_info["input_h"] / layer_info["filter_h"])
|
||||
layer_info["output_w"] = int(layer_info["input_w"] / layer_info["filter_h"])
|
||||
layer_info["output_c"] = layer_info["input_c"]
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
|
||||
|
||||
def _changeOPTensorSize(layer, tensor_type: str, tensor_idx: int, input_h: int, input_w: int):
|
||||
if tensor_type == "input":
|
||||
if hasattr(layer, "input_tensors") and len(layer.input_tensors) > tensor_idx:
|
||||
layer.input_tensors[tensor_idx].set_input_w(input_w)
|
||||
layer.input_tensors[tensor_idx].set_input_h(input_h)
|
||||
elif tensor_type == "output":
|
||||
if hasattr(layer, "output_tensors"):
|
||||
layer.output_tensors[tensor_idx].set_input_w(input_w)
|
||||
layer.output_tensors[tensor_idx].set_input_h(input_h)
|
||||
|
||||
|
||||
class PatchResizer:
|
||||
def __init__(self, layer):
|
||||
self.layer = layer
|
||||
|
||||
# manually setting these variables for now
|
||||
def patchResize(self, PatchLayers, PatchSize, PatchSize_height):
|
||||
for i, layer in enumerate(self.layer):
|
||||
layer_info = layer.get_layer_info()
|
||||
if i < PatchLayers:
|
||||
layer_info["is_patch"] = True
|
||||
op_code_str = layer_info["op"]
|
||||
if i == 0:
|
||||
layer_info["input_h"] = PatchSize_height
|
||||
layer_info["input_w"] = PatchSize
|
||||
_changeOPTensorSize(self.layer[i], "input", 0, PatchSize_height, PatchSize)
|
||||
else:
|
||||
prev_layer_info = self.layer[i - 1].get_layer_info()
|
||||
layer_info["input_h"] = prev_layer_info["output_h"]
|
||||
layer_info["input_w"] = prev_layer_info["output_w"]
|
||||
_changeOPTensorSize(
|
||||
self.layer[i], "input", 0, prev_layer_info["output_h"], prev_layer_info["output_w"]
|
||||
)
|
||||
|
||||
if op_code_str == "CONV_2D" or op_code_str == "DEPTHWISE_CONV_2D":
|
||||
layer_info["output_h"] = math.ceil(
|
||||
(layer_info["input_h"] - layer_info["kernel_h"] + 1) / layer_info["stride_h"]
|
||||
)
|
||||
layer_info["output_w"] = math.ceil(
|
||||
(layer_info["input_w"] - layer_info["kernel_w"] + 1) / layer_info["stride_w"]
|
||||
)
|
||||
_changeOPTensorSize(self.layer[i], "output", 0, layer_info["output_h"], layer_info["output_w"])
|
||||
elif op_code_str == "ADD":
|
||||
layer_info["output_h"] = layer_info["input_h"]
|
||||
layer_info["output_w"] = layer_info["input_w"]
|
||||
layer_info["input2_h"] = layer_info["input_h"]
|
||||
layer_info["input2_w"] = layer_info["input_w"]
|
||||
_changeOPTensorSize(self.layer[i], "input", 0, layer_info["input_h"], layer_info["input_w"])
|
||||
_changeOPTensorSize(self.layer[i], "input", 1, layer_info["input_h"], layer_info["input_w"])
|
||||
else:
|
||||
layer_info["is_patch"] = False
|
118
code_generator/OpGenerator.py
Normal file
@ -0,0 +1,118 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: OpGenerator.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
from .codetemplate.depthwiseTemplate import depthwiseInplace
|
||||
|
||||
|
||||
class OpGenerator:
|
||||
def __init__(self, incpath, srcpath, layers, fp_requantize=False):
|
||||
self.incpath = incpath
|
||||
self.srcpath = srcpath
|
||||
self.layers = layers
|
||||
self.fp_requantize = fp_requantize
|
||||
|
||||
def genOpcode(self):
|
||||
# find all conv ops
|
||||
op_list = []
|
||||
for op in self.layers:
|
||||
layer_info = op.get_layer_info()
|
||||
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D":
|
||||
op = convOp(layer_info)
|
||||
if op not in op_list:
|
||||
op_list.append(op)
|
||||
|
||||
# go through and generate all ops
|
||||
incfile = includeFile(self.incpath)
|
||||
for op in op_list:
|
||||
if op.isDepthwise:
|
||||
if op.kernel_h > op.kernel_w:
|
||||
depthwise_template = depthwiseInplace(
|
||||
op.kernel_h,
|
||||
op.kernel_w,
|
||||
op.pad_h,
|
||||
op.pad_w,
|
||||
op.stride,
|
||||
"CWH",
|
||||
self.fp_requantize,
|
||||
)
|
||||
else:
|
||||
depthwise_template = depthwiseInplace(
|
||||
op.kernel_h,
|
||||
op.kernel_w,
|
||||
op.pad_h,
|
||||
op.pad_w,
|
||||
op.stride,
|
||||
"CHW",
|
||||
self.fp_requantize,
|
||||
)
|
||||
depthwise_template.genFile(self.srcpath)
|
||||
incfile.addDefine(depthwise_template.genFuncDefine())
|
||||
|
||||
incfile.writeFile()
|
||||
|
||||
|
||||
class convOp:
|
||||
def __init__(self, layer_info):
|
||||
if layer_info["op"] == "CONV_2D":
|
||||
isDepthwise = False
|
||||
elif layer_info["op"] == "DEPTHWISE_CONV_2D":
|
||||
isDepthwise = True
|
||||
kernel_h = layer_info["kernel_h"]
|
||||
kernel_w = layer_info["kernel_w"]
|
||||
pad_h = (kernel_h - 1) // 2
|
||||
pad_w = (kernel_w - 1) // 2
|
||||
stride = layer_info["stride_h"]
|
||||
self.inchannel = layer_info["input_c"]
|
||||
self.isDepthwise = isDepthwise
|
||||
self.kernel_h = kernel_h
|
||||
self.kernel_w = kernel_w
|
||||
self.stride = stride
|
||||
self.pad_h = pad_h
|
||||
self.pad_w = pad_w
|
||||
|
||||
def __eq__(self, other):
|
||||
if isinstance(other, convOp):
|
||||
if (
|
||||
self.isDepthwise == other.isDepthwise
|
||||
and self.kernel_h == other.kernel_h
|
||||
and self.kernel_w == other.kernel_w
|
||||
and self.stride == other.stride
|
||||
and self.pad_h == other.pad_h
|
||||
and self.pad_w == other.pad_w
|
||||
):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
return NotImplemented
|
||||
|
||||
|
||||
class includeFile:
|
||||
def __init__(self, path):
|
||||
self.path = path
|
||||
self.defstring = ""
|
||||
|
||||
def addDefine(self, defstr):
|
||||
self.defstring += defstr + ";\n"
|
||||
|
||||
def writeFile(self):
|
||||
import os
|
||||
|
||||
outpath = os.path.join(self.path, "genInclude.h")
|
||||
outf = open(outpath, "w")
|
||||
outf.write(self.defstring)
|
||||
outf.close()
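|
||||
|
||||
# Example use (a minimal sketch; paths and the parsed layer list are hypothetical):
|
||||
#   gen = OpGenerator("codegen/Include", "codegen/Source", parsed_layers)
|
||||
#   gen.genOpcode()  # one specialized depthwise kernel per unique config, plus genInclude.h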
|
85
code_generator/PatchBasedUtil.py
Normal file
@ -0,0 +1,85 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: PatchBasedUtil.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
def getPatchParams(layers, split_idx, n_patch):
|
||||
patch_params = {}
|
||||
|
||||
patch_params["n_patch"] = n_patch
|
||||
patch_params["layer_cnt"] = split_idx
|
||||
|
||||
resolution = max(layers[0].get_layer_info()["input_h"], layers[0].get_layer_info()["input_w"])
|
||||
layer_cnt = layers[patch_params["layer_cnt"]].get_layer_info()
|
||||
out_shape = max(layer_cnt["input_h"], layer_cnt["input_w"])
|
||||
feat_stride = resolution // out_shape
|
||||
grain_size = out_shape // n_patch
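|
||||
# feat_stride: total downsampling ratio from the input to the split layer;
|
||||
# grain_size: output pixels per patch along one axis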
|
||||
|
||||
patch_params["single_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], 1)
|
||||
patch_params["output_c"] = layer_cnt["input_c"]
|
||||
patch_params["output_h"] = layer_cnt["output_h"]
|
||||
patch_params["output_w"] = layer_cnt["output_w"]
|
||||
patch_params["grain_rf"] = compute_receptive_field(layers, patch_params["layer_cnt"], grain_size)
|
||||
patch_params["grain_rf_height"] = compute_receptive_field(
|
||||
layers, patch_params["layer_cnt"], layer_cnt["input_h"] // n_patch
|
||||
)
|
||||
print("receptive field: single {} all {}".format(patch_params["single_rf"], patch_params["grain_rf"]))
|
||||
|
||||
# now generate the padding for each layer (two side)
|
||||
patch_params["pad_l"] = patch_params["single_rf"] // 2
|
||||
patch_params["pad_r"] = max(
|
||||
0,
|
||||
patch_params["grain_rf"]
|
||||
+ feat_stride * grain_size * (n_patch - 1)
|
||||
- patch_params["single_rf"] // 2
|
||||
- resolution,
|
||||
)
|
||||
|
||||
return patch_params
|
||||
|
||||
|
||||
def get_recompute_layer(model, split_idx):
|
||||
layer_cnt = 1 # first conv
|
||||
|
||||
for i in range(split_idx):
|
||||
block = model["blocks"][i]
|
||||
if "pointwise1" in block and block["pointwise1"] is not None:
|
||||
layer_cnt += 1
|
||||
if "depthwise" in block and block["depthwise"] is not None:
|
||||
layer_cnt += 1
|
||||
if "pointwise2" in block and block["pointwise2"] is not None:
|
||||
layer_cnt += 1
|
||||
|
||||
return layer_cnt
|
||||
|
||||
|
||||
def compute_receptive_field(layers, layer_cnt, grain=1):
|
||||
for i in range(layer_cnt):
|
||||
op = layers[(layer_cnt - 1) - i] # trace in a backward manner
|
||||
layer_info = op.get_layer_info()
|
||||
if layer_info["op"] == "CONV_2D" or layer_info["op"] == "DEPTHWISE_CONV_2D": # receptive field will increase
|
||||
stride = layer_info["stride_h"]
|
||||
kernel_size = max(layer_info["kernel_h"], layer_info["kernel_w"])
|
||||
if stride in [1, 2]:
|
||||
if stride == 1:
|
||||
grain += kernel_size - 1
|
||||
else:
|
||||
grain = (grain - 1) * 2 + kernel_size
|
||||
else:
|
||||
pass
|
||||
|
||||
return grain
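|
||||
|
||||
# worked example (hypothetical stack): tracing one output pixel (grain = 1) backward
|
||||
# through [3x3 s2, 3x3 s1, 3x3 s2] gives (1-1)*2+3 = 3, then 3+(3-1) = 5, then
|
||||
# (5-1)*2+3 = 11, i.e. an 11x11 input patch feeds one output pixel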
|
920
code_generator/TfliteConvertor.py
Normal file
@ -0,0 +1,920 @@
|
||||
# ----------------------------------------------------------------------
|
||||
# Project: TinyEngine
|
||||
# Title: TfliteConvertor.py
|
||||
#
|
||||
# Reference papers:
|
||||
# - MCUNet: Tiny Deep Learning on IoT Device, NeurIPS 2020
|
||||
# - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning, NeurIPS 2021
|
||||
# - MCUNetV3: On-Device Training Under 256KB Memory, arXiv:2206.15472
|
||||
# Contact authors:
|
||||
# - Wei-Ming Chen, wmchen@mit.edu
|
||||
# - Wei-Chen Wang, wweichen@mit.edu
|
||||
# - Ji Lin, jilin@mit.edu
|
||||
# - Ligeng Zhu, ligeng@mit.edu
|
||||
# - Song Han, songhan@mit.edu
|
||||
#
|
||||
# Target ISA: ARMv7E-M
|
||||
# ----------------------------------------------------------------------
|
||||
|
||||
import math
|
||||
|
||||
import numpy as np
|
||||
|
||||
from .constant import SKIP_OPs
|
||||
from .operators import add, avgpool2d, conv2d, depthwiseConv2d, maxpool2d, upsample
|
||||
from .tflite import Model
|
||||
from .tflite.BuiltinOperator import BuiltinOperator
|
||||
from .tflite.BuiltinOptions import BuiltinOptions
|
||||
from .tflite.Conv2DOptions import Conv2DOptions
|
||||
from .tflite.DepthwiseConv2DOptions import DepthwiseConv2DOptions
|
||||
from .tflite.Padding import Padding
|
||||
from .tflite.Pool2DOptions import Pool2DOptions
|
||||
from .tflite.TensorType import TensorType
|
||||
|
||||
|
||||
# Parse tflite model into TinyEngine IR format
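|
||||
# Typical use (a minimal sketch; the path is hypothetical):
|
||||
#   convertor = TfliteConvertor("model.tflite")
|
||||
#   convertor.parseOperatorInfo()  # fills convertor.layer with TinyEngine IR ops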
|
||||
class TfliteConvertor(object):
|
||||
def __init__(self, filepath):
|
||||
# path to the tflite file
|
||||
self.filepath = filepath
|
||||
self.model = self.loadTFmodel(filepath)
|
||||
self.subgraph = self.model.Subgraphs(0)
|
||||
self.builtin_op_code = self._build_str_map(BuiltinOperator())
|
||||
self.layer = []
|
||||
self.tmpPADIndice = None
|
||||
self.skip_transpose = None
|
||||
self.average_1D_to_2D_holder = MEAN2D()
|
||||
|
||||
# public functions
|
||||
def loadTFmodel(self, filepath):
|
||||
buf = open(filepath, "rb").read()
|
||||
return Model.Model.GetRootAsModel(buf, 0)
|
||||
|
||||
def dumpModelInfo(self):
|
||||
version = self.model.Version()
|
||||
print("Model version:", version)
|
||||
description = self.model.Description().decode("utf-8")
|
||||
print("Description:", description)
|
||||
subgraph_len = self.model.SubgraphsLength()
|
||||
print("Subgraph length:", subgraph_len)
|
||||
|
||||
self.dumpLayerInfo()
|
||||
|
||||
def dumpLayerInfo(self):
|
||||
print("Layer length:", len(self.layer))
|
||||
|
||||
# print brief info about each layer
|
||||
for i, layer in enumerate(self.layer):
|
||||
if self.layer[i]["op"] == "ADD":
|
||||
print(
|
||||
"op:",
|
||||
layer["op"],
|
||||
",input_idx:",
|
||||
layer["input_idx"],
|
||||
",input2_idx:",
|
||||
layer["input2_idx"],
|
||||
"output_idx:",
|
||||
layer["output_idx"],
|
||||
)
|
||||
else:
|
||||
print(
|
||||
"op:",
|
||||
layer["op"],
|
||||
",input_idx:",
|
||||
layer["input_idx"],
|
||||
"output_idx:",
|
||||
layer["output_idx"],
|
||||
)
|
||||
|
||||
def parseOperatorInfo(self):
|
||||
operators_len = self.subgraph.OperatorsLength()
|
||||
|
||||
for i in range(operators_len):
|
||||
op = self.subgraph.Operators(i)
|
||||
|
||||
# parse the op
|
||||
self._handleOperator(op)
|
||||
|
||||
# private functions
|
||||
def _build_str_map(self, obj):
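|
||||
# build a reverse lookup from enum value to field name,
|
||||
# e.g. BuiltinOperator.CONV_2D (== 3) yields the entry {3: "CONV_2D"}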
|
||||
ret = {}
|
||||
for field_name in dir(obj):
|
||||
if not field_name.startswith("_"):
|
||||
field_value = getattr(obj, field_name)
|
||||
if isinstance(field_value, int):
|
||||
ret[field_value] = field_name
|
||||
return ret
|
||||
|
||||
def _getOpCodeStr(self, op):
|
||||
op_code_list_idx = op.OpcodeIndex()
|
||||
op_code_id = self.model.OperatorCodes(op_code_list_idx).DeprecatedBuiltinCode()
|
||||
return self.builtin_op_code[op_code_id]
|
||||
|
||||
def _getTensorTypeStr(self, type):
|
||||
if TensorType.INT8 == type:
|
||||
return "int8"
|
||||
if TensorType.UINT8 == type:
|
||||
return "uint8"
|
||||
if TensorType.FLOAT32 == type:
|
||||
return "float32"
|
||||
|
||||
def _getMultiplierShift(self, effective_scale):
|
||||
significand = np.zeros(len(effective_scale), dtype="int32")
|
||||
shift = np.zeros(len(effective_scale), dtype="int32")
|
||||
|
||||
for i, s in enumerate(effective_scale):
|
||||
if s == 0:
|
||||
significand[i] = 0
|
||||
shift[i] = 0
|
||||
else:
|
||||
sig, shi = math.frexp(s)
|
||||
sig = int(round(sig * 2**31))
|
||||
|
||||
if sig == 2**31:
|
||||
sig /= 2
|
||||
shi += 1
|
||||
if shi < -31:
|
||||
shi = 0
|
||||
sig = 0
|
||||
|
||||
significand[i] = sig
|
||||
shift[i] = shi
|
||||
|
||||
return significand, shift
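|
||||
|
||||
# e.g. an effective scale of 0.00724 decomposes via frexp into 0.92672 * 2**-7,
|
||||
# stored as the Q31 multiplier round(0.92672 * 2**31) with shift = -7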
|
||||
|
||||
def _getSigShift(self, s):
|
||||
sig, shi = math.frexp(s)
|
||||
sig = int(round(sig * 2**31))
|
||||
if sig == 2**31:
|
||||
sig /= 2
|
||||
shi += 1
|
||||
if shi < -31:
|
||||
shi = 0
|
||||
sig = 0
|
||||
|
||||
return sig, shi
|
||||
|
||||
def _getADDMultiplierShift(self, input_scale, input2_scale, output_scale):
|
||||
left_shift = 20
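|
||||
# inputs are pre-shifted left by 20 bits so the rescale to a common scale keeps
|
||||
# precision; this mirrors the TFLite reference implementation of quantized ADD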
|
||||
|
||||
twice_max_input_scale = 2 * np.double(max(input_scale, input2_scale))
|
||||
real_input1_multiplier = np.double(input_scale / twice_max_input_scale)
|
||||
real_input2_multiplier = np.double(input2_scale / twice_max_input_scale)
|
||||
real_output_multiplier = np.double(twice_max_input_scale / ((1 << left_shift) * output_scale))
|
||||
|
||||
input_multiplier, input_shift = self._getSigShift(real_input1_multiplier)
|
||||
input2_multiplier, input2_shift = self._getSigShift(real_input2_multiplier)
|
||||
output_multiplier, output_shift = self._getSigShift(real_output_multiplier)
|
||||
|
||||
return (
|
||||
left_shift,
|
||||
input_multiplier,
|
||||
input_shift,
|
||||
input2_multiplier,
|
||||
input2_shift,
|
||||
output_multiplier,
|
||||
output_shift,
|
||||
)
|
||||
|
||||
def _preprocessSoftmaxScaling(self, beta, input_scale, input_integer_bits):
|
||||
|
||||
input_beta_real_multiplier = min(beta * input_scale * (1 << (31 - input_integer_bits)), (1 << 31) - 1.0)
|
||||
|
||||
multiplier, shift = self._getSigShift(input_beta_real_multiplier)
|
||||
|
||||
return multiplier, shift
|
||||
|
||||
# follow TFlite implementation
|
||||
def _calculateInputRadius(self, input_integer_bits, input_left_shift, total_signed_bits=31):
|
||||
max_input_rescaled = (
|
||||
1.0
|
||||
* ((1 << input_integer_bits) - 1)
|
||||
* (1 << (total_signed_bits - input_integer_bits))
|
||||
/ (1 << input_left_shift)
|
||||
)
|
||||
return math.floor(max_input_rescaled)
|
||||
|
||||
# tflite conversion functions
|
||||
def _convert_convolution(self, op):
|
||||
# operator
|
||||
op_code_str = self._getOpCodeStr(op)
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count >= 2, "input tensors length should be >= 2"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
weight_tensor = input_tensors[1]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# conv_2d options
|
||||
if op_code_str == "CONV_2D":
|
||||
assert op.BuiltinOptionsType() == BuiltinOptions.Conv2DOptions
|
||||
op_options = op.BuiltinOptions()
|
||||
conv_options = Conv2DOptions()
|
||||
conv_options.Init(op_options.Bytes, op_options.Pos)
|
||||
if op_code_str == "DEPTHWISE_CONV_2D":
|
||||
assert op.BuiltinOptionsType() == BuiltinOptions.DepthwiseConv2DOptions
|
||||
op_options = op.BuiltinOptions()
|
||||
conv_options = DepthwiseConv2DOptions()
|
||||
conv_options.Init(op_options.Bytes, op_options.Pos)
|
||||
|
||||
# conv parameters
|
||||
stride_h = conv_options.StrideH()
|
||||
stride_w = conv_options.StrideW()
|
||||
|
||||
# shapes
|
||||
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
|
||||
if op_code_str == "CONV_2D":
|
||||
output_c, kernel_h, kernel_w, _ = weight_tensor.tensor.ShapeAsNumpy()
|
||||
elif op_code_str == "DEPTHWISE_CONV_2D":
|
||||
_, kernel_h, kernel_w, output_c = weight_tensor.tensor.ShapeAsNumpy()
|
||||
_, output_h, output_w, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
|
||||
assert output_c_dual == output_c, "output channels do not match"
|
||||
|
||||
# tensor types
|
||||
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
|
||||
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
|
||||
weight_type = self._getTensorTypeStr(weight_tensor.tensor.Type())
|
||||
assert input_type == output_type == weight_type, "tensor type not consistent"
|
||||
|
||||
# tensor value: weight, scalers
|
||||
weight_value = self._get_np_from_wrapper(weight_tensor)
|
||||
if input_tensor_count == 3:
|
||||
bias_tensor = input_tensors[2]
|
||||
# bias = self._get_np_from_wrapper(bias_tensor).astype('int')  # forcibly cast for latency testing
|
||||
bias = self._get_np_from_wrapper(bias_tensor)
|
||||
else:
|
||||
bias = None
|
||||
|
||||
# quantized setting
|
||||
input_zero_point = input_tensor.qnn_params["zero_point"]
|
||||
output_zero_point = output_tensor.qnn_params["zero_point"]
|
||||
input_scale = input_tensor.qnn_params["scale"]
|
||||
weight_scale = weight_tensor.qnn_params["scale"]
|
||||
output_scale = output_tensor.qnn_params["scale"]
|
||||
effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)
|
||||
|
||||
# quantized inference, used for requantize
|
||||
multiplier, shift = self._getMultiplierShift(effective_scale)
|
||||
|
||||
# find the previous layer, redirect the index, and fuse PAD into this conv
|
||||
if self.tmpPADIndice is not None:
|
||||
if self.tmpPADIndice.output_idx == input_tensor.tensor_idx:
|
||||
input_idx = self.tmpPADIndice.input_idx
|
||||
input_h = input_h - math.floor(kernel_h / 2) * 2
|
||||
input_w = input_w - math.floor(kernel_w / 2) * 2
|
||||
else:
|
||||
input_idx = input_tensor.tensor_idx
|
||||
else:
|
||||
input_idx = input_tensor.tensor_idx
|
||||
# clean the buffer
|
||||
self.tmpPADIndice = None
|
||||
|
||||
params = {
|
||||
# operator
|
||||
"op": op_code_str,
|
||||
# conv
|
||||
"kernel_h": kernel_h,
|
||||
"kernel_w": kernel_w,
|
||||
"padding": math.floor(kernel_h / 2),
|
||||
"stride_h": stride_h,
|
||||
"stride_w": stride_w,
|
||||
# tensor
|
||||
"input_idx": input_idx,
|
||||
"output_idx": output_tensor.tensor_idx,
|
||||
"input_dim": 3,
|
||||
"output_dim": 3,
|
||||
"input_h": input_h,
|
||||
"input_w": input_w,
|
||||
"input_c": input_c,
|
||||
"output_h": output_h,
|
||||
"output_w": output_w,
|
||||
"output_c": output_c,
|
||||
"dtypte": input_type,
|
||||
# trainable parameters
|
||||
"weight_value": weight_value,
|
||||
"bias": bias,
|
||||
"effective_scale": effective_scale,
|
||||
"input_zero_point": input_zero_point,
|
||||
"output_zero_point": output_zero_point,
|
||||
"input_scale": input_scale,
|
||||
"weight_scale": weight_scale,
|
||||
"output_scale": output_scale,
|
||||
# quantized inference
|
||||
"multiplier": multiplier,
|
||||
"shift": shift,
|
||||
}
|
||||
|
||||
if op_code_str == "CONV_2D":
|
||||
op = conv2d.Conv2d(params)
|
||||
elif op_code_str == "DEPTHWISE_CONV_2D":
|
||||
op = depthwiseConv2d.DepthwiseConv2d(params)
|
||||
|
||||
return op
|
||||
|
||||
def _convert_ADD(self, op):
|
||||
# operator
|
||||
op_code_str = self._getOpCodeStr(op)
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count == 2, "input should be 2 tensors"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
input2_tensor = input_tensors[1]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# shapes
|
||||
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
|
||||
_, input2_h, input2_w, input2_c = input2_tensor.tensor.ShapeAsNumpy()
|
||||
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
|
||||
assert input_h == input2_h == output_h, "tensor shape not consistent"
|
||||
assert input_w == input2_w == output_w, "tensor shape not consistent"
|
||||
assert input_c == input2_c == output_c, "tensor shape not consistent"
|
||||
|
||||
# tensor types
|
||||
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
|
||||
input_type2 = self._getTensorTypeStr(input2_tensor.tensor.Type())
|
||||
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
|
||||
assert input_type == input_type2 == output_type, "tensor type not consistent"
|
||||
|
||||
# quantized setting
|
||||
input_zero_point = input_tensor.qnn_params["zero_point"]
|
||||
input2_zero_point = input2_tensor.qnn_params["zero_point"]
|
||||
output_zero_point = output_tensor.qnn_params["zero_point"]
|
||||
input_scale = input_tensor.qnn_params["scale"]
|
||||
input2_scale = input2_tensor.qnn_params["scale"]
|
||||
output_scale = output_tensor.qnn_params["scale"]
|
||||
|
||||
# get multipliers and shifts
|
||||
(
|
||||
left_shift,
|
||||
input_multiplier,
|
||||
input_shift,
|
||||
input2_multiplier,
|
||||
input2_shift,
|
||||
output_multiplier,
|
||||
output_shift,
|
||||
) = self._getADDMultiplierShift(input_scale, input2_scale, output_scale)
|
||||
|
||||
# assign params
|
||||
params = {
|
||||
# operator
|
||||
"op": op_code_str,
|
||||
# tensor
|
||||
"input_idx": input_tensor.tensor_idx,
|
||||
"input2_idx": input2_tensor.tensor_idx,
|
||||
"output_idx": output_tensor.tensor_idx,
|
||||
"input_h": input_h,
|
||||
"input_w": input_w,
|
||||
"input_c": input_c,
|
||||
"input2_h": input_h,
|
||||
"input2_w": input_w,
|
||||
"input2_c": input_c,
|
||||
"input_dim": 3,
|
||||
"input2_dim": 3,
|
||||
"output_dim": 3,
|
||||
"output_h": output_h,
|
||||
"output_w": output_w,
|
||||
"output_c": output_c,
|
||||
"dtypte": input_type,
|
||||
# trainable parameters
|
||||
"input_zero_point": input_zero_point,
|
||||
"input2_zero_point": input2_zero_point,
|
||||
"output_zero_point": output_zero_point,
|
||||
"input_scale": input_scale,
|
||||
"input2_scale": input2_scale,
|
||||
"output_scale": output_scale,
|
||||
# quantized inference
|
||||
"left_shift": left_shift,
|
||||
"input_multiplier": input_multiplier,
|
||||
"input2_multiplier": input2_multiplier,
|
||||
"input_shift": input_shift,
|
||||
"input2_shift": input2_shift,
|
||||
"output_multiplier": output_multiplier,
|
||||
"output_shift": output_shift,
|
||||
}
|
||||
op = add.Add(params)
|
||||
|
||||
return op
|
||||
|
||||
def _convert_AVERAGE_POOL_2D(self, op):
|
||||
# operator
|
||||
op_code_str = self._getOpCodeStr(op)
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count == 1, "input tensors length should be 1"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# shapes
|
||||
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
|
||||
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
|
||||
|
||||
# tensor types
|
||||
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
|
||||
output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
|
||||
assert input_type == output_type, "tensor type not consistent"
|
||||
|
||||
# pool parameters
|
||||
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
|
||||
op_options = op.BuiltinOptions()
|
||||
pool2d_options = Pool2DOptions()
|
||||
pool2d_options.Init(op_options.Bytes, op_options.Pos)
|
||||
stride_h = pool2d_options.StrideH()
|
||||
stride_w = pool2d_options.StrideW()
|
||||
padding = pool2d_options.Padding()
|
||||
filter_h = pool2d_options.FilterHeight()
|
||||
filter_w = pool2d_options.FilterWidth()
|
||||
|
||||
# padding
|
||||
if padding == Padding.VALID:
|
||||
pad_h = 0
|
||||
pad_w = 0
|
||||
elif padding == Padding.SAME:
|
||||
raise NotImplementedError("SAME padding for pooling is not supported yet")  # fail loudly instead of leaving pad_h/pad_w unset
|
||||
|
||||
# quantized setting
|
||||
input_zero_point = input_tensor.qnn_params["zero_point"]
|
||||
output_zero_point = output_tensor.qnn_params["zero_point"]
|
||||
input_scale = input_tensor.qnn_params["scale"]
|
||||
output_scale = output_tensor.qnn_params["scale"]
|
||||
|
||||
params = {
|
||||
# operator
|
||||
"op": op_code_str,
|
||||
# pool parameters
|
||||
"filter_h": filter_h,
|
||||
"filter_w": filter_w,
|
||||
"stride_h": stride_h,
|
||||
"stride_w": stride_w,
|
||||
"pad_h": pad_h,
|
||||
"pad_w": pad_w,
|
||||
# tensor
|
||||
"input_idx": input_tensor.tensor_idx,
|
||||
"output_idx": output_tensor.tensor_idx,
|
||||
"input_h": input_h,
|
||||
"input_w": input_w,
|
||||
"input_c": input_c,
|
||||
"input_dim": input_tensor.tensor.ShapeAsNumpy().size,
|
||||
"output_dim": output_tensor.tensor.ShapeAsNumpy().size,
|
||||
"output_h": output_h,
|
||||
"output_w": output_w,
|
||||
"output_c": output_c,
|
||||
"dtypte": input_type,
|
||||
# trainable parameters
|
||||
"input_zero_point": input_zero_point,
|
||||
"output_zero_point": output_zero_point,
|
||||
"input_scale": input_scale,
|
||||
"output_scale": output_scale,
|
||||
}
|
||||
|
||||
op = avgpool2d.AvgPool2d(params)
|
||||
|
||||
return op
|
||||
|
||||
def _convert_upsample(self, op):
|
||||
# defaults in case the op carries no params
|
||||
input_type = None
|
||||
input_zero_point = None
|
||||
output_zero_point = None
|
||||
input_scale = None
|
||||
output_scale = None
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count == 1, "input tensors length should be 1"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# shapes
|
||||
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
|
||||
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
|
||||
|
||||
params = {
|
||||
# operator
|
||||
"op": "UPSAMPLE",
|
||||
# upsample parameters
|
||||
"factor": output_w / input_w,
|
||||
# tensor
|
||||
"input_idx": input_tensor.tensor_idx,
|
||||
"output_idx": output_tensor.tensor_idx,
|
||||
"input_h": input_h,
|
||||
"input_w": input_w,
|
||||
"input_c": input_c,
|
||||
"input_dim": 3,
|
||||
"output_dim": 3,
|
||||
"output_h": output_h,
|
||||
"output_w": output_w,
|
||||
"output_c": output_c,
|
||||
"dtype": input_type,
|
||||
# trainable parameters
|
||||
"input_zero_point": input_zero_point,
|
||||
"output_zero_point": output_zero_point,
|
||||
"input_scale": input_scale,
|
||||
"output_scale": output_scale,
|
||||
# quantized inference
|
||||
}
|
||||
op = upsample.upSample(params)
|
||||
|
||||
return op
|
||||
|
||||
def _convert_PAD(self, op):
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# fuse pad into conv
|
||||
self.tmpPADIndice = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
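|
||||
# PAD emits no layer of its own; the following convolution consumes tmpPADIndice,
|
||||
# redirects its input index, and shrinks its input size (see _convert_convolution)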
|
||||
|
||||
def _convert_TRANSPOSE(self, op):
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# record the transpose so it can be skipped during codegen
|
||||
self.skip_transpose = PAD_tensorIndice(input_tensor.tensor_idx, output_tensor.tensor_idx)
|
||||
|
||||
def _convert_maxpool(self, op):
|
||||
# defaults in case the op carries no params
|
||||
input_type = None
|
||||
input_zero_point = None
|
||||
output_zero_point = None
|
||||
input_scale = None
|
||||
output_scale = None
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count == 1, "input tensors length should be 1"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# shapes
|
||||
_, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
|
||||
_, output_h, output_w, output_c = output_tensor.tensor.ShapeAsNumpy()
|
||||
|
||||
# pool parameters
|
||||
assert op.BuiltinOptionsType() == BuiltinOptions.Pool2DOptions
|
||||
op_options = op.BuiltinOptions()
|
||||
pool2d_options = Pool2DOptions()
|
||||
pool2d_options.Init(op_options.Bytes, op_options.Pos)
|
||||
stride_h = pool2d_options.StrideH()
|
||||
stride_w = pool2d_options.StrideW()
|
||||
# padding = pool2d_options.Padding()
|
||||
filter_h = pool2d_options.FilterHeight()
|
||||
filter_w = pool2d_options.FilterWidth()
|
||||
# fused_activation_fn = pool2d_options.FusedActivationFunction()
|
||||
|
||||
pool_params = {
|
||||
# operator
|
||||
"op": "MAX_POOL_2D",
|
||||
# pool parameters
|
||||
"filter_h": filter_h,
|
||||
"filter_w": filter_w,
|
||||
"stride_h": stride_h,
|
||||
"stride_w": stride_w,
|
||||
"pad_h": 0,
|
||||
"pad_w": 0,
|
||||
# tensor
|
||||
"input_idx": input_tensor.tensor_idx,
|
||||
"output_idx": output_tensor.tensor_idx,
|
||||
"input_h": input_h,
|
||||
"input_w": input_w,
|
||||
"input_c": input_c,
|
||||
"input_dim": 3,
|
||||
"output_dim": 3,
|
||||
"output_h": output_h,
|
||||
"output_w": output_w,
|
||||
"output_c": output_c,
|
||||
"dtype": input_type,
|
||||
# trainable parameters
|
||||
"input_zero_point": input_zero_point,
|
||||
"output_zero_point": output_zero_point,
|
||||
"input_scale": input_scale,
|
||||
"output_scale": output_scale,
|
||||
# quantized inference
|
||||
}
|
||||
op = maxpool2d.maxPool2d(pool_params)
|
||||
|
||||
return op
|
||||
|
||||
def _convert_mean1D(self, op, MEAN2Dholder):
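|
||||
# TFLite lowers a 2-D mean into two consecutive 1-D MEAN ops; buffer the first,
|
||||
# then fuse the pair into a single AVERAGE_POOL_2D layer on the second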
|
||||
# defaults in case the op carries no params
|
||||
input_type = None
|
||||
|
||||
# get input, weight, and output tensors
|
||||
input_tensors = self._get_input_tensors(op)
|
||||
input_tensor_count = len(input_tensors)
|
||||
assert input_tensor_count == 1, "input tensors length should be 1"
|
||||
|
||||
input_tensor = input_tensors[0]
|
||||
|
||||
output_tensors = self._get_output_tensors(op)
|
||||
assert len(output_tensors) == 1, "output tensors length should be 1"
|
||||
output_tensor = output_tensors[0]
|
||||
|
||||
# shapes
|
||||
input_shape = input_tensor.tensor.ShapeAsNumpy()
|
||||
output_shape = output_tensor.tensor.ShapeAsNumpy()
|
||||
|
||||
input_h, input_w, input_c = get_hwc_from_chwshape(input_shape)
|
||||
output_h, output_w, output_c = get_hwc_from_chwshape(output_shape)
|
||||
input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
|
||||
|
||||
if not MEAN2Dholder.has_first_1D:
|
||||
MEAN2Dholder.add_first_1D_op(input_tensor.tensor_idx, output_tensor.tensor_idx, input_h, input_w, input_c)
|
||||
return None
|
||||
elif not MEAN2Dholder.has_second_1D:
|
||||
MEAN2Dholder.add_second_1D_op(
|
||||
input_tensor.tensor_idx, output_tensor.tensor_idx, output_h, output_w, output_c
|
||||
)
|
||||
filter_h = input_h - output_h + 1
|
||||
filter_w = input_w - output_w + 1
|
||||
params = {
|
||||
# operator
|
||||
"op": "AVERAGE_POOL_2D",
|
||||
# pool parameters
|
||||
"filter_h": filter_h,
|
||||
"filter_w": filter_w,
|
||||
"stride_h": 1,
|
||||
"stride_w": 1,
|
||||
"pad_h": 0,
|
||||
"pad_w": 0,
|
||||
# tensor
|
||||
"input_idx": MEAN2Dholder.first_1D_input_idx,
|
||||
"output_idx": MEAN2Dholder.second_1D_output_idx,
|
||||
"input_h": MEAN2Dholder.input_h,
|
||||
"input_w": MEAN2Dholder.input_w,
|
||||
"input_c": MEAN2Dholder.input_c,
|
||||
"input_dim": 3,
|
||||
"output_dim": 3,
|
||||
"output_h": MEAN2Dholder.output_h,
|
||||
"output_w": MEAN2Dholder.output_w,
|
||||
"output_c": MEAN2Dholder.output_c,
|
||||
"dtypte": input_type,
|
||||
}
|
||||
|
||||
op = avgpool2d.AvgPool2d(params)
|
||||
|
||||
return op
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
    def _convert_FULLY_CONNECTED(self, op):
        # get input, weight, and output tensors
        input_tensors = self._get_input_tensors(op)
        input_tensor_count = len(input_tensors)
        assert input_tensor_count == 3, "input tensors length should be 3"

        input_tensor = input_tensors[0]
        weight_tensor = input_tensors[1]
        bias_tensor = input_tensors[2]
        weight = self._get_np_from_wrapper(weight_tensor)
        bias = self._get_np_from_wrapper(bias_tensor)

        output_tensors = self._get_output_tensors(op)
        assert len(output_tensors) == 1, "output tensors length should be 1"
        output_tensor = output_tensors[0]

        # shapes
        if input_tensor.tensor.ShapeAsNumpy().shape[0] == 2:
            input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
            input_h = 1
        elif input_tensor.tensor.ShapeAsNumpy().shape[0] == 4:
            _, input_h, input_w, input_c = input_tensor.tensor.ShapeAsNumpy()
        output_c, input_c_dual = weight_tensor.tensor.ShapeAsNumpy()
        output_h, output_c_dual = output_tensor.tensor.ShapeAsNumpy()
        assert input_c_dual == input_c, "channels do not match"
        assert output_c_dual == output_c, "channels do not match"

        # tensor types
        input_type = self._getTensorTypeStr(input_tensor.tensor.Type())
        output_type = self._getTensorTypeStr(output_tensor.tensor.Type())
        assert input_type == output_type, "tensor type not consistent"

        # quantized setting
        input_zero_point = input_tensor.qnn_params["zero_point"]
        output_zero_point = output_tensor.qnn_params["zero_point"]
        input_scale = input_tensor.qnn_params["scale"]
        weight_scale = weight_tensor.qnn_params["scale"]
        bias_scale = bias_tensor.qnn_params["scale"]
        output_scale = output_tensor.qnn_params["scale"]
        # The CONV_2D kernels support per-channel quantization, so broadcast
        # per-tensor (scalar) scales into per-channel arrays of length output_c.
        if isinstance(bias_scale, float) and isinstance(weight_scale, float):
            np_ones = np.ones(output_c)
            bias_scale = np_ones * bias_scale
            np_ones = np.ones(output_c)
            output_scale = np_ones * output_scale
        effective_scale = np.double(input_scale) * np.double(weight_scale) / np.double(output_scale)

        # follows TensorFlow Lite Micro's requantization scheme
        multiplier, shift = self._getMultiplierShift(effective_scale)
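        # Editor's note (standard TFLite Micro convention; _getMultiplierShift is
        # defined elsewhere in this file): each effective_scale value, here
        # input_scale * weight_scale / output_scale, is encoded as an int32
        # fixed-point multiplier plus a shift such that
        #     effective_scale ~= (multiplier / 2**31) * 2**shift,
        # which lets the int8 kernels requantize accumulators using
        # integer-only arithmetic on the MCU.
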
        params = {
            # operator
            "op": "CONV_2D",
            # tensor
            "input_idx": input_tensor.tensor_idx,
            "output_idx": output_tensor.tensor_idx,
            "input_h": input_h,
            "input_w": input_w,
            "input_c": input_c,
            "input_dim": 3,
            "output_dim": 2,
            "output_h": output_h,
            "output_w": 1,
            "output_c": output_c,
            "dtype": input_type,
            "kernel_h": 1,
            "kernel_w": 1,
            # trainable parameters
            "weight_value": weight,
            "bias": bias,
            "effective_scale": effective_scale,
            "input_zero_point": input_zero_point,
            "output_zero_point": output_zero_point,
            "input_scale": input_scale,
            "output_scale": output_scale,
            # quantized inference
            "multiplier": multiplier,
            "shift": shift,
        }

        op = conv2d.Conv2d(params)

        return op

    # handle one op and parse it into layers[] for supported operators
    def _handleOperator(self, op):
        op_code_str = self._getOpCodeStr(op)
        if op_code_str == "CONV_2D":
            self.layer.append(self._convert_convolution(op))
        elif op_code_str == "ADD":
            self.layer.append(self._convert_ADD(op))
        elif op_code_str == "AVERAGE_POOL_2D":
            self.layer.append(self._convert_AVERAGE_POOL_2D(op))
        elif op_code_str == "DEPTHWISE_CONV_2D":
            self.layer.append(self._convert_convolution(op))
        elif op_code_str == "PAD":
            self._convert_PAD(op)
        elif op_code_str == "RESIZE_NEAREST_NEIGHBOR":
            self.layer.append(self._convert_upsample(op))
        elif op_code_str == "MAX_POOL_2D":
            self.layer.append(self._convert_maxpool(op))
        elif op_code_str == "MEAN":
            ret_op = self._convert_mean1D(op, self.average_1D_to_2D_holder)
            if ret_op is not None:
                # TODO: This only handles one specific graph pattern: TRANSPOSE -> MEAN -> MEAN
                if self.skip_transpose is not None:
                    ret_op.params["input_idx"] = self.skip_transpose.input_idx
                    ret_op.input_tensors[0].graph_idx = self.skip_transpose.input_idx
                self.layer.append(ret_op)
        elif op_code_str == "TRANSPOSE":
            self._convert_TRANSPOSE(op)
        elif op_code_str == "FULLY_CONNECTED":
            self.layer.append(self._convert_FULLY_CONNECTED(op))
        elif op_code_str in SKIP_OPs:
            pass
        else:
            raise NotImplementedError(f"Unsupported {op_code_str}")

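    # Editor's note: the flatbuffer stores each constant tensor as a raw byte
    # buffer; np.frombuffer below reinterprets those bytes with the tensor's
    # dtype, and reshape restores the original tensor shape.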
    def _get_np_from_wrapper(self, wrapper):
        if wrapper.tensor.Type() == TensorType.INT8:
            dtype = np.int8
        elif wrapper.tensor.Type() == TensorType.INT32:
            dtype = np.int32
        else:
            raise NotImplementedError("Current implementation only supports int8 and int32")

        data = wrapper.buffer.DataAsNumpy()
        shape = wrapper.tensor.ShapeAsNumpy() if wrapper.tensor.ShapeLength() != 0 else []

        return np.frombuffer(data, dtype=dtype).reshape(shape)

    def _get_tensor_type_str(self, tensor_type):
        if tensor_type == TensorType.INT8:
            return "int8"
        raise NotImplementedError(f"Tensor type: {tensor_type} is not supported yet.")

    def _get_input_tensors(self, op):
        return self._get_wrapper_tensors(op.InputsAsNumpy())

    def _get_output_tensors(self, op):
        return self._get_wrapper_tensors(op.OutputsAsNumpy())

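    # Editor's note: in int8-quantized TFLite models, quantization parameters
    # are attached per tensor; ScaleAsNumpy/ZeroPointAsNumpy return arrays of
    # length C for per-channel quantized weights and of length 1 for per-tensor
    # quantized activations. The branches below normalize both cases into
    # the qnn_params dict stored on each wrapper.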
    def _get_wrapper_tensors(self, tensor_index_list):
        ret = []
        for idx in tensor_index_list:
            tensor = self.subgraph.Tensors(idx)
            buffer_idx = tensor.Buffer()
            buffer = self.model.Buffers(buffer_idx)

            tflite_qparams = tensor.Quantization()

            if tflite_qparams is None:
                # Skip tensors that carry no quantization parameters at all.
                continue

            scale = tflite_qparams.ScaleAsNumpy()
            zero_point = tflite_qparams.ZeroPointAsNumpy()
            qparams_to_tensor_wrapper = None

            if isinstance(zero_point, np.ndarray):
                # Per-channel quantization
                if scale.size != 1 and zero_point.size != 1:
                    qparams_to_tensor_wrapper = {"scale": scale, "zero_point": zero_point}
                # Per-tensor quantization
                elif scale.size == 1 and zero_point.size == 1:
                    qparams_to_tensor_wrapper = {"scale": float(scale[0]), "zero_point": int(zero_point[0])}
                else:
                    raise NotImplementedError
            elif scale == zero_point == 0:
                # Quantization table present but empty; leave qnn_params as None.
                pass

            ret.append(TFLiteTensorWrpper(idx, tensor, buffer, qparams_to_tensor_wrapper))
        return ret


class PAD_tensorIndice(object):
    def __init__(self, input_idx, output_idx):
        self.input_idx = input_idx
        self.output_idx = output_idx


class MEAN2D(object):
    def __init__(self):
        self.has_first_1D = False
        self.has_second_1D = False

    def add_first_1D_op(self, input_idx, output_idx, input_h, input_w, input_c):
        self.first_1D_input_idx = input_idx
        self.first_1D_output_idx = output_idx
        self.input_h = input_h
        self.input_w = input_w
        self.input_c = input_c
        self.has_first_1D = True

    def add_second_1D_op(self, input_idx, output_idx, output_h, output_w, output_c):
        self.second_1D_input_idx = input_idx
        self.second_1D_output_idx = output_idx
        self.output_h = output_h
        self.output_w = output_w
        self.output_c = output_c
        self.has_second_1D = True


class TFLiteTensorWrpper:
    def __init__(self, tensor_idx, tensor, buffer, qnn_params):
        self.tensor_idx = tensor_idx
        self.tensor = tensor
        self.buffer = buffer
        self.qnn_params = qnn_params


def get_hwc_from_chwshape(shape):
    h = 1
    w = 1
    c = 1
    if len(shape) == 4:
        c = shape[1]
        h = shape[2]
        w = shape[3]
    elif len(shape) == 3:
        c = shape[1]
        h = shape[2]
    elif len(shape) == 2:
        c = shape[1]
    return h, w, c
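
# Editor's note: example with hypothetical values - an NCHW-style shape
# [1, 3, 224, 224] yields (h, w, c) == (224, 224, 3); dimensions missing from
# shorter shapes default to 1.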
0
code_generator/__init__.py
Normal file
1
code_generator/allocator/__init__.py
Normal file
@ -0,0 +1 @@
__all__ = ["base_allocator", "firstFit"]