Training demo on openmv cam (#45)

* sparse training example on openmv cam

* minor fix for openmv firmware compilation

* python side code

* minor

* update README

* remove fc only and update README

* Update README.md

* update news

* update news

* update link
Wei-Ming Chen
2023-02-07 14:03:45 -08:00
committed by GitHub
parent 8ff3ade724
commit d47d179e49
9 changed files with 729 additions and 784 deletions


@@ -19,6 +19,7 @@ TinyEngine is a part of MCUNet, which also consists of TinyNAS. MCUNet is a syst
**If you are interested in getting updates, please sign up [here](https://forms.gle/UW1uUmnfk1k6UJPPA) to get notified!**
- **(2023/02)** We release the source code of the [training demo](examples/openmv_training_sparse) on OpenMV Cam H7.
- **(2022/12)** We update the [measured results](README.md#measured-results) on STM32H743 with the new versions of the inference libraries.
- **(2022/12)** We release the source code for patch-based inference and update the [tutorial of our inference demo](tutorial/inference/README.md) to provide option that generates patch-based inference code for the visual wake words (VWW) demo.
- **(2022/11)** We release the source code of Tiny Training Engine, and include the [tutorial of our training demo](tutorial/training) for training a visual wake words (VWW) model on microcontrollers.


@@ -28,8 +28,8 @@ signed char* getInput();
signed char* getOutput();
float* getOutput_fp();
int32_t* getOutput_int32();
-static float lr = 0.0008;
-static float blr = 0.0004;
+static float lr __attribute__((unused)) = 0.0008; // To suppress warning
+static float blr __attribute__((unused)) = 0.0004; // To suppress warning
void setupBuffer();
void invoke(float* labels);

File diff suppressed because it is too large


@@ -0,0 +1,85 @@
# Training on OpenMV Cam H7
This is an example showing how to train a model using a predefined sparse update schema with TinyEngine.
## Install build dependencies on Linux
Note: This section follows https://github.com/openmv/openmv/blob/master/src/README.md. Please refer to the OpenMV repo for more details or for steps on other environments.
```
sudo apt-get update
sudo apt-get install git build-essential
```
## Install GNU ARM toolchain
```
# Install the GNU Arm toolchain
TOOLCHAIN_PATH=/usr/local/arm-none-eabi
TOOLCHAIN_URL="https://armkeil.blob.core.windows.net/developer/Files/downloads/gnu-rm/10-2020q4/gcc-arm-none-eabi-10-2020-q4-major-x86_64-linux.tar.bz2"
sudo mkdir ${TOOLCHAIN_PATH}
wget --no-check-certificate -O - ${TOOLCHAIN_URL} | sudo tar --strip-components=1 -jx -C ${TOOLCHAIN_PATH}
export PATH=${TOOLCHAIN_PATH}/bin:${PATH}
```
## Clone the OpenMV source
```
cd tinyengine/examples/openmv_training_sparse/
git clone https://github.com/openmv/openmv.git
```
Currently, we don't have compatibility tests for the OpenMV source, so let's use the version that has been manually tested before.
```
cd openmv
git checkout 918ccb937730cc759ee5709df089d9de516dc7bf
git submodule update --init --recursive
```
## Build the source
Let's first build the firmware from source to make sure all required dependencies are correctly installed. `TARGET` is set to `OPENMV4` for the OpenMV Cam H7.
```
make -j4 -C src/micropython/mpy-cross
make -j4 TARGET=OPENMV4 -C src
```
You should see the compiled binary at `openmv/src/build/bin/firmware.bin`.
## Apply customized patch
The patch:
1. disables some features in the firmware to free SRAM and flash space
1. sets up the build for the TinyEngine source
1. adds the application code for training in `examplemodule.c`
```
git apply ../openmv_sparse_training.patch
```
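Before applying a patch, `git apply --check` can dry-run it and report whether it would apply cleanly. The self-contained demo below builds a throwaway repository to illustrate; the file and patch names are illustrative, not from this repo:

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q demo && cd demo
echo "hello" > file.txt
git add file.txt
git -c user.email=a@b -c user.name=demo commit -qm init
# Craft a patch that changes file.txt, then revert the working tree.
echo "hello world" > file.txt
git diff > ../change.patch
git checkout -- file.txt
# Dry-run first: exits non-zero if the patch would not apply cleanly.
git apply --check ../change.patch && echo "patch applies cleanly"
git apply ../change.patch
grep -q "hello world" file.txt && echo "patch applied"
```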
## Generate model-specific code and recompile the firmware with TinyEngine
```
cd ..
sh gen_code.sh
cd openmv
make -j4 TARGET=OPENMV4 -C src
```
Flash the binary `openmv/src/build/bin/firmware.bin` onto your OpenMV Cam. Please refer to the official [instructions](https://github.com/openmv/openmv/blob/master/src/README.md#flashing-the-firmware).
## Connect two buttons to your board
Connect two buttons with jumper wires to pin 1 and pin 4. Please refer to the [pinout](http://wiki.amperka.ru/_media/products:openmv-cam-h7:openmv-cam-h7-pinout.pdf).
These two buttons will be used to label images captured by the camera.
![image](https://user-images.githubusercontent.com/17592131/217367877-6a500f31-be3b-4258-a86e-4eabbb947a7e.png)
## Start the demo
1. Open OpenMV IDE
1. Connect your OpenMV cam to the PC
1. Run the Python script `tinyengine/examples/openmv_vww/vww_openmv_demo.py` in OpenMV IDE.
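The on-device update performed by `train_one_img`/`train_one_feat` in `examplemodule.c` is a softmax cross-entropy gradient on a two-class fully connected head followed by a plain SGD step (`dw = dy ⊗ feat`, `w -= lr * dw`). A pure-Python sketch of the same math, using toy sizes rather than the firmware's 160-channel head:

```python
import math

INPUT_CH, OUTPUT_CH = 4, 2   # toy sizes; the firmware uses 160 and 2
lr = 0.1

def softmax(x):
    # numerically stable softmax, as in statble_softmax_inplace
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def train_one_feat(feat, cls, w, b):
    # forward pass: out = w @ feat + b
    out = [b[o] + sum(w[o * INPUT_CH + i] * feat[i] for i in range(INPUT_CH))
           for o in range(OUTPUT_CH)]
    dy = softmax(out)
    dy[cls] -= 1                       # gradient of cross-entropy w.r.t. logits
    for o in range(OUTPUT_CH):         # dw = dy outer feat; SGD update
        for i in range(INPUT_CH):
            w[o * INPUT_CH + i] -= lr * dy[o] * feat[i]
        b[o] -= lr * dy[o]

# Repeated updates should raise the probability of the trained class.
w = [0.0] * (INPUT_CH * OUTPUT_CH)
b = [0.0] * OUTPUT_CH
feat = [0.5, -1.0, 2.0, 0.25]          # illustrative feature vector
for _ in range(20):
    train_one_feat(feat, 1, w, b)
out = [b[o] + sum(w[o * INPUT_CH + i] * feat[i] for i in range(INPUT_CH))
       for o in range(OUTPUT_CH)]
probs = softmax(out)
print(f"p(class 1) = {probs[1]:.3f}")
```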


@@ -0,0 +1,7 @@
cd ../../
export PYTHONPATH=${PYTHONPATH}:$(pwd)
cp -r TinyEngine examples/openmv_training_sparse/openmv/src/omv/modules/
cd examples/openmv_training_sparse
mkdir codegen
python ../tiny_training.py -f ../../assets/49kb-int8-graph.json -D ../../assets/full-int8-params.pkl -QAS ../../assets/scale.json -m -g -d -FR
mv codegen openmv/src/omv/modules/TinyEngine/


@@ -150,10 +150,10 @@ index 84601904..abc6fe04 100644
* @brief defition to adding rouding offset
*/
diff --git a/src/omv/Makefile b/src/omv/Makefile
index 159d07a5..6bdfd47a 100644
index 159d07a5..239fa50a 100644
--- a/src/omv/Makefile
+++ b/src/omv/Makefile
@@ -96,6 +96,25 @@ SRCS += $(addprefix imlib/, \
@@ -96,6 +96,50 @@ SRCS += $(addprefix imlib/, \
zbar.c \
)
@@ -162,18 +162,43 @@ index 159d07a5..6bdfd47a 100644
+ codegen/Source/depthwise_kernel3x3_stride1_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel3x3_stride2_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel5x5_stride1_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel5x5_stride2_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel7x7_stride1_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel7x7_stride2_inplace_CHW_fpreq.c \
+ codegen/Source/depthwise_kernel3x3_stride1_inplace_CHW_fpreq_bitmask.c \
+ codegen/Source/depthwise_kernel3x3_stride2_inplace_CHW_fpreq_bitmask.c \
+ codegen/Source/depthwise_kernel5x5_stride1_inplace_CHW_fpreq_bitmask.c \
+ codegen/Source/depthwise_kernel7x7_stride1_inplace_CHW_fpreq_bitmask.c \
+ codegen/Source/depthwise_kernel7x7_stride2_inplace_CHW_fpreq_bitmask.c \
+ src/kernels/fp_requantize_op/add_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch8_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch16_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch24_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch48_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq.c \
+ src/kernels/int_only/avgpooling.c \
+ src/kernels/int_forward_op/avgpooling.c \
+ src/kernels/fp_requantize_op/convolve_s8_kernel3_inputch3_stride2_pad1_fpreq.c \
+ src/kernels/fp_requantize_op/mat_mul_kernels_fpreq.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq_mask.c \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq_mask_partialCH.c \
+ src/kernels/fp_backward_op/sum_4D_exclude_fp.c \
+ src/kernels/fp_backward_op/where_fp.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel3_stride1_inpad1_outpad0.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel3_stride2_inpad1_outpad1.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel5_stride1_inpad2_outpad0.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel5_stride2_inpad2_outpad1.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel7_stride1_inpad3_outpad0.c \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel7_stride2_inpad3_outpad1.c \
+ src/kernels/fp_backward_op/tte_exp_fp.c \
+ src/kernels/fp_backward_op/sub_fp.c \
+ src/kernels/fp_backward_op/mul_fp.c \
+ src/kernels/fp_backward_op/pointwise_conv_fp.c \
+ src/kernels/fp_backward_op/group_pointwise_conv_fp.c \
+ src/kernels/fp_backward_op/group_conv_fp_kernel4_stride1_pad0.c \
+ src/kernels/fp_backward_op/group_conv_fp_kernel8_stride1_pad0.c \
+ src/kernels/fp_backward_op/strided_slice_4Dto4D_fp.c \
+ src/kernels/fp_backward_op/sum_3D_fp.c \
+ src/kernels/fp_backward_op/nll_loss_fp.c \
+ src/kernels/fp_backward_op/log_softmax_fp.c \
+ )
+
SRCS += $(wildcard ports/$(PORT)/*.c)
@@ -406,10 +431,10 @@ index 412de472..f7da2c03 100644
// Domain 2 DMA buffers region.
diff --git a/src/omv/modules/examplemodule.c b/src/omv/modules/examplemodule.c
index 37e2b4f4..1f6ce7d4 100644
index 37e2b4f4..52d1bda2 100644
--- a/src/omv/modules/examplemodule.c
+++ b/src/omv/modules/examplemodule.c
@@ -1,17 +1,277 @@
@@ -1,17 +1,81 @@
// Include MicroPython API.
#include "py/runtime.h"
+#include "genNN.h"
@@ -417,224 +442,27 @@ index 37e2b4f4..1f6ce7d4 100644
+#include <stdio.h>
+#include "py_image.h"
+#define TEST_SIZE 1 * 1024
+// signed char w[TEST_SIZE];
// This is the function which will be called from Python as cexample.add_ints(a, b).
STATIC mp_obj_t example_add_ints(mp_obj_t a_obj, mp_obj_t b_obj) {
-// This is the function which will be called from Python as cexample.add_ints(a, b).
-STATIC mp_obj_t example_add_ints(mp_obj_t a_obj, mp_obj_t b_obj) {
- // Extract the ints from the micropython input objects.
- int a = mp_obj_get_int(a_obj);
- int b = mp_obj_get_int(b_obj);
+ invoke(NULL);
+ return mp_obj_new_int(999);
+}
+
+#define TEST_SIZE 1 * 1024
+#define TN_MAX(A,B) ((A) > (B) ? (A) : (B))
+#define TN_MIN(A,B) ((A) < (B) ? (A) : (B))
+
+// for fc only
+#define ORIGIN_H 72
+#define ORIGIN_W 88
+#define IMAGE_H 80
+#define IMAGE_W 80
+#define INPUT_CH 160
+#define OUTPUT_CH 2
+#define IMAGES 6
+
+float feat_fp[INPUT_CH];
+int8_t feat[INPUT_CH];
+float w[INPUT_CH * OUTPUT_CH];
+float b[OUTPUT_CH];
+float out[OUTPUT_CH];
+float dw[OUTPUT_CH*INPUT_CH];
+float lr = 0.1;
+const signed char zero_x = 6;
+const float scale_x = 0.060486205;
+
+void fully_connected_fp(
+ const float *input, const uint16_t input_x, const uint16_t input_y,
+ const uint16_t input_ch, const uint16_t output_ch, const float *bias,
+ const float *weights, float *output)
+{
+ int h, w, out_c, in_c;
+ for (h = 0; h < input_y; h++){
+ for (w = 0; w < input_x; w++){
+ int pixel_cnt = w + input_x * h;
+ for (out_c = 0; out_c < output_ch; out_c++){
+ float intermediate = bias[out_c];
+ const float *start_weight = weights + out_c * input_ch;
+ const float *start_input = input + input_ch * pixel_cnt;
+ float *start_out = output + output_ch * pixel_cnt;
+ for (in_c = 0; in_c < input_ch; in_c++){
+ intermediate += start_weight[in_c] * start_input[in_c];
+ }
+ start_out[out_c] = intermediate;
+ }
+ }
+ }
+}
+
+void mat_mul_fp(
+ const float *matA, const uint16_t matA_row, const uint16_t matA_col,
+ const float* matB, const uint16_t matB_col, float* output)
+{
+ int m, n, i;
+ for (n = 0; n < matA_row; n++){
+ for (m = 0; m < matB_col; m++){
+ float sum = 0;
+ for (i = 0; i < matA_col; i++){
+ sum += matA[i + n * matA_col] * matB[m + i * matA_col];
+ }
+ output[m + n * matB_col] = sum;
+ }
+ }
+}
+
+void statble_softmax_inplace(float *input, const uint16_t length)
+{
+ float max = FLT_MIN;
+ float exp_sum = 0;
+ uint16_t i;
+ for (i = 0; i < length; i++){
+ if (input[i] > max) max = input[i];
+ }
+
+ // inplace update
+ for (i = 0; i < length; i++){
+ input[i] = exp(input[i] - max);
+ exp_sum += input[i];
+ }
+ for (i = 0; i < length; i++){
+ input[i] = input[i] / exp_sum;
+ }
+}
+
+
+void invoke_new_weights(const int8_t* img, float *out){
+ int i;
+ signed char *input = getInput();
+ const int8_t* image = img;
+ for (i = 0; i < IMAGE_H * IMAGE_W * 3; i++){
+ input[i] = *image++;
+ }
+ invoke(NULL);
+ signed char *output = getOutput();
+ for (i = 0; i < INPUT_CH; i++){
+ feat_fp[i] = (output[i] - zero_x)*scale_x;
+ }
+
+ // out = new_w @ feat + new_b
+ fully_connected_fp(feat_fp, 1, 1, INPUT_CH, OUTPUT_CH, b, w, out);
+}
+
+void invoke_new_weights_givenimg(float *out){
+ int i;
+ invoke(NULL);
+ signed char *output = getOutput();
+ for (i = 0; i < INPUT_CH; i++){
+ feat_fp[i] = (output[i] - zero_x)*scale_x;
+ }
+
+ // out = new_w @ feat + new_b
+ fully_connected_fp(feat_fp, 1, 1, INPUT_CH, OUTPUT_CH, b, w, out);
+}
+
+void train_one_img(const int8_t* img, int cls)
+{
+ int i;
+ signed char *input = getInput();
+ const int8_t* image = img;
+ for (i = 0; i < IMAGE_H * IMAGE_W * 3; i++){
+ input[i] = *image++;
+ }
+ invoke(NULL);
+ signed char *output = getOutput();
+ for (i = 0; i < INPUT_CH; i++){
+ feat_fp[i] = (output[i] - zero_x)*scale_x;
+ }
+
+ // out = new_w @ feat + new_b
+ fully_connected_fp(feat_fp, 1, 1, INPUT_CH, OUTPUT_CH, b, w, out);
+
+ // softmax = _stable_softmax(out)
+ statble_softmax_inplace(out, OUTPUT_CH);
+#define ORIGIN_H 128
+#define ORIGIN_W 128
+#define IMAGE_H 128
+#define IMAGE_W 128
- // Calculate the addition and convert to MicroPython object.
- return mp_obj_new_int(a + b);
+ out[cls] -= 1;
+
+ //dw = dy.reshape(-1, 1) @ feat.reshape(1, -1)
+ mat_mul_fp(out, OUTPUT_CH, 1, feat_fp, INPUT_CH, dw);
+
+ for (i = 0; i < OUTPUT_CH * INPUT_CH; i++){
+ w[i] = w[i] - lr * dw[i];
+ }
+ //new_w = new_w - lr * dw
+ //new_b = new_b - lr *
+ b[0] = b[0] - lr * out[0];
+ b[1] = b[1] - lr * out[1];
}
+
+void train(int cls)
+{
+ int i;
+ invoke(NULL);
+ signed char *output = getOutput();
+ for (i = 0; i < INPUT_CH; i++){
+ feat_fp[i] = (output[i] - zero_x)*scale_x;
+ }
+
+ // out = new_w @ feat + new_b
+ fully_connected_fp(feat_fp, 1, 1, INPUT_CH, OUTPUT_CH, b, w, out);
+
+ // softmax = _stable_softmax(out)
+ statble_softmax_inplace(out, OUTPUT_CH);
+
+ out[cls] -= 1;
+
+ //dw = dy.reshape(-1, 1) @ feat.reshape(1, -1)
+ mat_mul_fp(out, OUTPUT_CH, 1, feat_fp, INPUT_CH, dw);
+
+ for (i = 0; i < OUTPUT_CH * INPUT_CH; i++){
+ w[i] = w[i] - lr * dw[i];
+ }
+ //new_w = new_w - lr * dw
+ //new_b = new_b - lr *
+ b[0] = b[0] - lr * out[0];
+ b[1] = b[1] - lr * out[1];
+}
+
+void train_one_feat(const float* feat, int cls)
+{
+ int i;
+ signed char *input = getInput();
+ for (i = 0; i < IMAGE_H * IMAGE_W * 3; i++){
+ input[i] = feat[i];
+ }
+
+ // out = new_w @ feat + new_b
+ fully_connected_fp(feat, 1, 1, INPUT_CH, OUTPUT_CH, b, w, out);
+
+ // softmax = _stable_softmax(out)
+ statble_softmax_inplace(out, OUTPUT_CH);
+
+ out[cls] -= 1;
+
+ //dw = dy.reshape(-1, 1) @ feat.reshape(1, -1)
+ mat_mul_fp(out, OUTPUT_CH, 1, feat, INPUT_CH, dw);
+
+ for (i = 0; i < OUTPUT_CH * INPUT_CH; i++){
+ w[i] = w[i] - lr * dw[i];
+ }
+ //new_w = new_w - lr * dw
+ //new_b = new_b - lr *
+ b[0] = b[0] - lr * out[0];
+ b[1] = b[1] - lr * out[1];
+}
+
+
+uint16_t color;
+float labels[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+// This is the function which will be called from Python as cexample.add_ints(a, b).
+STATIC mp_obj_t example_VWW(mp_obj_t a, mp_obj_t b) {
+STATIC mp_obj_t example_train_demo_fn(mp_obj_t a, mp_obj_t b) {
+ image_t* img = py_image_cobj(a);
+ // >= 0, for training with the label, -1 is for inference
+ int command = mp_obj_get_int(b);
@@ -663,12 +491,16 @@ index 37e2b4f4..1f6ce7d4 100644
+ }
+ }
+ if (command >= 0){
+ labels[0] = 0;
+ labels[1] = 0;
+ labels[command] = 1;
+ invoke(labels);
+ printf("train class %d\n", command);
+ train(command);
+ }
+ else{
+ invoke_new_weights_givenimg(out);
+ if(out[0] > out[1]){
+ invoke_inf();
+ uint8_t* output = (uint8_t*)getOutput();
+ if(output[0] > output[1]){
+ printf("infer class 0\n");
+ color = 63488;
+ }
@@ -684,30 +516,31 @@ index 37e2b4f4..1f6ce7d4 100644
+ }
+ }
+ return mp_obj_new_int(0);
+}
}
+
// Define a Python reference to the function above.
STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_add_ints_obj, example_add_ints);
+STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_VWW_obj, example_VWW);
-STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_add_ints_obj, example_add_ints);
+STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_train_demo, example_train_demo_fn);
// Define all properties of the module.
// Table entries are key/value pairs of the attribute name (a string)
@@ -21,6 +281,7 @@ STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_add_ints_obj, example_add_ints);
@@ -20,7 +84,7 @@ STATIC MP_DEFINE_CONST_FUN_OBJ_2(example_add_ints_obj, example_add_ints);
// optimized to word-sized integers by the build system (interned strings).
STATIC const mp_rom_map_elem_t example_module_globals_table[] = {
{ MP_ROM_QSTR(MP_QSTR___name__), MP_ROM_QSTR(MP_QSTR_cexample) },
{ MP_ROM_QSTR(MP_QSTR_add_ints), MP_ROM_PTR(&example_add_ints_obj) },
+ { MP_ROM_QSTR(MP_QSTR_VWW), MP_ROM_PTR(&example_VWW_obj) },
- { MP_ROM_QSTR(MP_QSTR_add_ints), MP_ROM_PTR(&example_add_ints_obj) },
+ { MP_ROM_QSTR(MP_QSTR_train_demo), MP_ROM_PTR(&example_train_demo) },
};
STATIC MP_DEFINE_CONST_DICT(example_module_globals, example_module_globals_table);
@@ -33,4 +294,4 @@ const mp_obj_module_t example_user_cmodule = {
@@ -33,4 +97,4 @@ const mp_obj_module_t example_user_cmodule = {
// Register the module to make it available in Python.
// Note: This module is disabled, set the thrid argument to 1 to enable it, or
// use a macro like MODULE_CEXAMPLE_ENABLED to conditionally enable this module.
-MP_REGISTER_MODULE(MP_QSTR_cexample, example_user_cmodule, 0);
+MP_REGISTER_MODULE(MP_QSTR_cexample, example_user_cmodule, 1);
diff --git a/src/omv/ports/stm32/omv_portconfig.mk b/src/omv/ports/stm32/omv_portconfig.mk
index 200ffb7d..e742c135 100644
index 200ffb7d..b3049e25 100644
--- a/src/omv/ports/stm32/omv_portconfig.mk
+++ b/src/omv/ports/stm32/omv_portconfig.mk
@@ -4,7 +4,7 @@ STARTUP ?= st/startup_$(shell echo $(MCU) | tr '[:upper:]' '[:lower:]')
@@ -715,7 +548,7 @@ index 200ffb7d..e742c135 100644
# Compiler Flags
-CFLAGS += -std=gnu99 -Wall -Werror -Warray-bounds -mthumb -nostartfiles -fdata-sections -ffunction-sections
+CFLAGS += -std=gnu99 -Wall -Warray-bounds -mthumb -nostartfiles -fdata-sections -ffunction-sections
+CFLAGS += -std=gnu99 -Warray-bounds -mthumb -nostartfiles -fdata-sections -ffunction-sections -lm
CFLAGS += -fno-inline-small-functions -D$(MCU) -D$(CFLAGS_MCU) -D$(ARM_MATH) -DARM_NN_TRUNCATE\
-fsingle-precision-constant -Wdouble-promotion -mcpu=$(CPU) -mtune=$(CPU) -mfpu=$(FPU) -mfloat-abi=hard
CFLAGS += -D__FPU_PRESENT=1 -D__VFP_FP__ -DUSE_USB_FS -DUSE_DEVICE_MODE -DUSE_USB_OTG_ID=0 -DHSE_VALUE=$(OMV_HSE_VALUE)\
@@ -730,7 +563,7 @@ index 200ffb7d..e742c135 100644
OMV_CFLAGS += -I$(TOP_DIR)/$(OMV_DIR)/sensors/
OMV_CFLAGS += -I$(TOP_DIR)/$(OMV_DIR)/ports/$(PORT)/
OMV_CFLAGS += -I$(TOP_DIR)/$(OMV_DIR)/ports/$(PORT)/modules/
@@ -213,6 +217,25 @@ FIRM_OBJ += $(addprefix $(BUILD)/$(OMV_DIR)/imlib/, \
@@ -213,6 +217,50 @@ FIRM_OBJ += $(addprefix $(BUILD)/$(OMV_DIR)/imlib/, \
zbar.o \
)
@@ -739,20 +572,63 @@ index 200ffb7d..e742c135 100644
+ codegen/Source/depthwise_kernel3x3_stride1_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel3x3_stride2_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel5x5_stride1_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel5x5_stride2_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel7x7_stride1_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel7x7_stride2_inplace_CHW_fpreq.o \
+ codegen/Source/depthwise_kernel3x3_stride1_inplace_CHW_fpreq_bitmask.o \
+ codegen/Source/depthwise_kernel3x3_stride2_inplace_CHW_fpreq_bitmask.o \
+ codegen/Source/depthwise_kernel5x5_stride1_inplace_CHW_fpreq_bitmask.o \
+ codegen/Source/depthwise_kernel7x7_stride1_inplace_CHW_fpreq_bitmask.o \
+ codegen/Source/depthwise_kernel7x7_stride2_inplace_CHW_fpreq_bitmask.o \
+ src/kernels/fp_requantize_op/add_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch8_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch16_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch24_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_ch48_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq.o \
+ src/kernels/int_only/avgpooling.o \
+ src/kernels/int_forward_op/avgpooling.o \
+ src/kernels/fp_requantize_op/convolve_s8_kernel3_inputch3_stride2_pad1_fpreq.o \
+ src/kernels/fp_requantize_op/mat_mul_kernels_fpreq.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq_mask.o \
+ src/kernels/fp_requantize_op/convolve_1x1_s8_fpreq_mask_partialCH.o \
+ src/kernels/fp_backward_op/sum_4D_exclude_fp.o \
+ src/kernels/fp_backward_op/where_fp.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel3_stride1_inpad1_outpad0.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel3_stride2_inpad1_outpad1.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel5_stride1_inpad2_outpad0.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel5_stride2_inpad2_outpad1.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel7_stride1_inpad3_outpad0.o \
+ src/kernels/fp_backward_op/transpose_depthwise_conv_fp_kernel7_stride2_inpad3_outpad1.o \
+ src/kernels/fp_backward_op/tte_exp_fp.o \
+ src/kernels/fp_backward_op/sub_fp.o \
+ src/kernels/fp_backward_op/mul_fp.o \
+ src/kernels/fp_backward_op/pointwise_conv_fp.o \
+ src/kernels/fp_backward_op/group_pointwise_conv_fp.o \
+ src/kernels/fp_backward_op/group_conv_fp_kernel4_stride1_pad0.o \
+ src/kernels/fp_backward_op/group_conv_fp_kernel8_stride1_pad0.o \
+ src/kernels/fp_backward_op/strided_slice_4Dto4D_fp.o \
+ src/kernels/fp_backward_op/sum_3D_fp.o \
+ src/kernels/fp_backward_op/nll_loss_fp.o \
+ src/kernels/fp_backward_op/log_softmax_fp.o \
+ )
+
FIRM_OBJ += $(wildcard $(BUILD)/$(OMV_DIR)/ports/$(PORT)/*.o)
FIRM_OBJ += $(wildcard $(BUILD)/$(MICROPY_DIR)/modules/*.o)
FIRM_OBJ += $(wildcard $(BUILD)/$(MICROPY_DIR)/ports/$(PORT)/modules/*.o)
@@ -625,7 +673,7 @@ endif
# This target generates the main/app firmware image located at 0x08010000
$(FIRMWARE): FIRMWARE_OBJS
$(CPP) -P -E -I$(OMV_BOARD_CONFIG_DIR) $(OMV_DIR)/ports/$(PORT)/$(LDSCRIPT).ld.S > $(BUILD)/$(LDSCRIPT).lds
- $(CC) $(LDFLAGS) $(FIRM_OBJ) -o $(FW_DIR)/$(FIRMWARE).elf $(LIBS) -lgcc
+ $(CC) $(LDFLAGS) $(FIRM_OBJ) -o $(FW_DIR)/$(FIRMWARE).elf $(LIBS) -lgcc -lm
$(OBJCOPY) -Obinary -R .big_const* $(FW_DIR)/$(FIRMWARE).elf $(FW_DIR)/$(FIRMWARE).bin
$(PYTHON) $(MKDFU) -D $(DFU_DEVICE) -b $(MAIN_APP_ADDR):$(FW_DIR)/$(FIRMWARE).bin $(FW_DIR)/$(FIRMWARE).dfu
@@ -633,7 +681,7 @@ ifeq ($(OMV_ENABLE_BL), 1)
# This target generates the bootloader.
$(BOOTLOADER): FIRMWARE_OBJS BOOTLOADER_OBJS
$(CPP) -P -E -I$(OMV_BOARD_CONFIG_DIR) $(BOOTLDR_DIR)/stm32fxxx.ld.S > $(BUILD)/$(BOOTLDR_DIR)/stm32fxxx.lds
- $(CC) $(BL_LDFLAGS) $(BOOT_OBJ) -o $(FW_DIR)/$(BOOTLOADER).elf -lgcc
+ $(CC) $(BL_LDFLAGS) $(BOOT_OBJ) -o $(FW_DIR)/$(BOOTLOADER).elf -lgcc -lm
$(OBJCOPY) -Obinary $(FW_DIR)/$(BOOTLOADER).elf $(FW_DIR)/$(BOOTLOADER).bin
$(PYTHON) $(MKDFU) -D $(DFU_DEVICE) -b 0x08000000:$(FW_DIR)/$(BOOTLOADER).bin $(FW_DIR)/$(BOOTLOADER).dfu
endif


@@ -0,0 +1,31 @@
# This example shows how to invoke the training and inference function calls of TinyEngine.
import cexample
import lcd
import sensor
from pyb import Pin
sensor.reset() # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565) # Set pixel format to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.B128X128)  # Set frame size to 128x128
lcd.init() # Initialize the lcd screen.
# class 0: red
pin4 = Pin("P4", Pin.IN, Pin.PULL_UP)
# class 1: green
pin1 = Pin("P1", Pin.IN, Pin.PULL_UP)
while True:
img = sensor.snapshot() # Take a picture and return the image.
pin4_value = pin4.value()
pin1_value = pin1.value()
if pin4_value == 0:
ret = cexample.train_demo(img, 0)
print("train class 0")
elif pin1_value == 0:
ret = cexample.train_demo(img, 1)
print("train class 1")
else:
ret = cexample.train_demo(img, -1)
lcd.display(img) # Display the image.
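The button-to-command mapping in the loop above can be factored into a small pure function. This is a sketch, not part of the script; the command convention (0 or 1 to train that class, -1 for inference) comes from the calls to `cexample.train_demo`:

```python
def command_for_pins(pin1_value: int, pin4_value: int) -> int:
    """Map active-low button reads to a train/inference command.

    The buttons are wired with pull-ups, so a value of 0 means pressed.
    Returns the class to train on (0 or 1), or -1 for inference.
    """
    if pin4_value == 0:      # button on P4 pressed -> train class 0
        return 0
    if pin1_value == 0:      # button on P1 pressed -> train class 1
        return 1
    return -1                # no button pressed -> run inference

print(command_for_pins(1, 1), command_for_pins(1, 0), command_for_pins(0, 1))  # -1 0 1
```

Checking P4 before P1 preserves the script's `if`/`elif` priority when both buttons are held.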


@@ -12,6 +12,7 @@ sudo apt-get install git build-essential
```
## Install GNU ARM toolchain
```
# Install the GNU Arm toolchain
TOOLCHAIN_PATH=/usr/local/arm-none-eabi
@@ -27,41 +28,48 @@ export PATH=${TOOLCHAIN_PATH}/bin:${PATH}
cd tinyengine/examples/openmv_vww/
git clone https://github.com/openmv/openmv.git
```
Currently, we don't have compatibility tests for the OpenMV source, so let's use the version that has been manually tested before.
```
cd openmv
git checkout 918ccb937730cc759ee5709df089d9de516dc7bf
git submodule update --init --recursive
```
## Build the source
Let's first build the firmware from source to make sure all required dependencies are correctly installed. `TARGET` is set to `OPENMV4` for the OpenMV Cam H7.
```
make -j4 -C src/micropython/mpy-cross
make -j4 TARGET=OPENMV4 -C src
```
You should see the compiled binary at `openmv/src/build/bin/firmware.bin`.
## Apply customized patch
The patch is to
1. disable some features in the firmware for SRAM and Flash space
-2. setup for TinyEngine source
-3. add vww application code in `examplemodule.c`
+1. setup for TinyEngine source
+1. add vww application code in `examplemodule.c`
```
cd tinyengine/examples/openmv_vww/openmv
git apply ../openmv.patch
```
## Add TinyEngine into openmv
```
cd tinyengine
cp -r TinyEngine examples/openmv_vww/openmv/src/omv/modules/
```
## Generate model-specific code for VWW
```
cd tinyengine/examples/openmv_vww/
python ../vww.py
@@ -71,15 +79,16 @@ cp -r codegen/ openmv/src/omv/modules/TinyEngine/
Copy the generated code at `tinyengine/examples/openmv_vww/codegen` into TinyEngine.
## Recompile the firmware with TinyEngine
```
cd tinyengine/examples/openmv_vww/openmv/
make -j4 TARGET=OPENMV4 -C src
```
-Flash the binary `openmv/src/build/bin/firmware.bin` into your OpenMV. Please refer to the official [Instructions](https://github.com/openmv/openmv/blob/master/src/README.md#flashing-the-firmware]).
+Flash the binary `openmv/src/build/bin/firmware.bin` into your OpenMV. Please refer to the official [Instructions](https://github.com/openmv/openmv/blob/master/src/README.md#flashing-the-firmware%5D).
## Start the demo
-1. download OpenMV IDE
-2. Connect your OpenMV cam to the PC
-3. Run the python script `tinyengine/examples/openmv_vww/vww_openmv_demo.py` in OpenMV IDE.
+1. Open OpenMV IDE
+1. Connect your OpenMV cam to the PC
+1. Run the Python script `tinyengine/examples/openmv_vww/vww_openmv_demo.py` in OpenMV IDE.