Building an ML Processor using CFU Playground (Part 3)
We have built a machine learning (ML) processor on an Arty A7-35T using CFU Playground. In Part 3, we accelerate MobileNetV2 inference by 5.5 times.
CFU stands for Custom Function Unit, which is a mechanism to add hardware for RISC-V custom instructions in LiteX-VexRiscv.
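As a rough illustration of what a RISC-V custom instruction looks like at the bit level, the sketch below packs the fields of an R-type instruction into a 32-bit word. It assumes the CFU is reached through the custom-0 major opcode; the funct7/funct3 values and register choices are hypothetical and are not taken from our gateware.

```python
CUSTOM0_OPCODE = 0b0001011  # "custom-0" major opcode in the RISC-V opcode map


def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int,
                  opcode: int = CUSTOM0_OPCODE) -> int:
    """Pack the six R-type fields into one 32-bit instruction word."""
    assert 0 <= funct7 < 128 and 0 <= funct3 < 8
    assert all(0 <= reg < 32 for reg in (rs2, rs1, rd))
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


# Hypothetical operation: funct7=1, funct3=0,
# result in a0 (x10), operands in a1 (x11) and a2 (x12).
word = encode_r_type(funct7=1, rs2=12, rs1=11, funct3=0, rd=10)
print(f"{word:#010x}")
```

The funct7 and funct3 fields are what let a single opcode select between many different CFU operations, while rs1 and rs2 carry the two 32-bit operands.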
Related articles in this series:
- Part 1: Person Detection int8 model
- Part 2: Keyword Spotting model
- Part 3: MobileNetV2 model (this article)
CFU Playground is a framework for tuning hardware (actually FPGA gateware) and software to accelerate ML models running on Google’s TensorFlow Lite for Microcontrollers (TFLM, tflite-micro).
See the Part 1 article for more details.
MobileNetV2 is the successor to MobileNetV1, which is well known as a lightweight model. It keeps the depthwise separable convolution of MobileNetV1 and adds inverted residuals and linear bottlenecks.
The figure below, quoted from the arXiv paper, shows the inverted residual block in MobileNetV2.
For more information on MobileNetV2, see the arXiv paper.
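To make the structure of the inverted residual block more concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for its three stages: a 1×1 expansion, a 3×3 depthwise convolution, and a 1×1 linear projection. The feature-map shape is an illustrative assumption and is not taken from the model used in this article; the expansion factor of 6 is the default from the MobileNetV2 paper.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_macs(h, w, c, k):
    """MACs for a k x k depthwise convolution: one filter per channel."""
    return h * w * c * k * k


def inverted_residual_macs(h, w, c_in, c_out, expansion=6, k=3):
    """MACs for expand (1x1) -> depthwise (k x k) -> linear projection (1x1)."""
    c_mid = c_in * expansion
    return {
        "expand_1x1": conv_macs(h, w, c_in, c_mid, 1),
        "depthwise": depthwise_macs(h, w, c_mid, k),
        "project_1x1": conv_macs(h, w, c_mid, c_out, 1),
    }


# Illustrative shape only: 14x14 feature map, 32 -> 32 channels.
block = inverted_residual_macs(14, 14, 32, 32)
total = sum(block.values())
for name, macs in block.items():
    print(f"{name}: {macs} MACs ({100 * macs / total:.1f}%)")
```

For this assumed shape, the two 1×1 convolutions account for most of the MACs and the depthwise convolution for the rest, so these two layer types are the natural targets for acceleration.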
The multimodel_accel project is an in-house project that aims to accelerate multiple ML models, whereas most CFU Playground projects are model-specific and accelerate only one ML model.
As shown in the console image above, MobileNetV2 is called the mnv2 model in CFU Playground.
To start with the results: as shown in the featured image and the table below, the total cycle count of the mnv2 model has been reduced from 1079M to 197M, a 5.5x speedup.
Note that our project is about twice as fast as the CFU Playground mnv2_first project, which reduces the total cycle count of the mnv2 model to 397.5M.
Because multimodel_accel aims to accelerate multiple ML models, the Person Detection int8 (hereafter pdti8) model from Part 1 has also been accelerated from 48M to 38.4M cycles, and the Keyword Spotting (hereafter kws) model from Part 2 from 15.7M to 9.8M cycles.
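The speedup factors follow directly from the cycle counts quoted above. The short sketch below simply divides the "before" count by the "after" count for each model; the dictionary layout and function name are ours, chosen for illustration.

```python
# Cycle counts in millions, as quoted in the text: (before, after).
cycles_m = {
    "mnv2": (1079.0, 197.0),
    "pdti8": (48.0, 38.4),
    "kws": (15.7, 9.8),
}
MNV2_FIRST_M = 397.5  # CFU Playground mnv2_first project, mnv2 model


def speedup(before: float, after: float) -> float:
    """Speedup factor: how many times fewer cycles the 'after' run needs."""
    return before / after


for model, (before, after) in cycles_m.items():
    print(f"{model}: {speedup(before, after):.2f}x")

print(f"vs mnv2_first: {speedup(MNV2_FIRST_M, cycles_m['mnv2'][1]):.2f}x")
```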
Software Specialization & CFU Optimization
The multimodel_accel project uses a 1×1 convolution (hereafter 1x1_conv) and a depthwise convolution (hereafter dw_conv) that are specialized and optimized for the kws model as well as the other models the project targets.
In Part 3, we improved dw_conv, which had a smaller speedup factor than 1x1_conv in Part 1 and Part 2. Specifically, we achieved the speedup by making dw_conv, which already used the CFU in Part 1, support single instruction, multiple data (SIMD) processing. This requires changes to both the gateware and the software.
Also, as the other layers became faster, the first 2D convolution, which is not a 1x1_conv, took up a larger share of the processing time, so we added a dedicated kernel for it.
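To illustrate the SIMD idea behind dw_conv, the following is a minimal software model, not our actual gateware or kernel. It assumes four int8 channels are packed into one 32-bit word and that each lane is multiplied and accumulated independently, since depthwise channels never mix. All function names and the lane count are hypothetical; the `input_offset` term mirrors the way int8 depthwise convolution in TFLM adds an offset to each input value before multiplying.

```python
LANES = 4  # assumed: four int8 values per 32-bit operand


def pack_int8x4(values):
    """Pack four signed 8-bit values into one 32-bit word, lane 0 in the low byte."""
    assert len(values) == LANES
    word = 0
    for lane, v in enumerate(values):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_int8x4(word):
    """Split a 32-bit word back into four signed 8-bit lanes."""
    lanes = []
    for lane in range(LANES):
        byte = (word >> (8 * lane)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def simd_dw_mac(acc, in_word, filt_word, input_offset=0):
    """Per-lane MAC: acc[c] += (in[c] + input_offset) * filt[c], no cross-lane sum."""
    ins = unpack_int8x4(in_word)
    filts = unpack_int8x4(filt_word)
    return [a + (x + input_offset) * f for a, x, f in zip(acc, ins, filts)]


def dw_conv_pixel(window_words, filter_words, input_offset=0):
    """One output pixel for four channels: accumulate over every filter tap."""
    acc = [0] * LANES
    for in_word, filt_word in zip(window_words, filter_words):
        acc = simd_dw_mac(acc, in_word, filt_word, input_offset)
    return acc


# Two filter taps, four channels each (a real 3x3 kernel would have nine taps).
window = [pack_int8x4([1, 2, 3, 4]), pack_int8x4([-1, -2, -3, -4])]
filters = [pack_int8x4([1, 1, 1, 1]), pack_int8x4([2, -2, 2, -2])]
print(dw_conv_pixel(window, filters))
```

In this model one call to `simd_dw_mac` handles four channels at once, which is where the gain over a one-channel-per-instruction approach would come from. This per-lane accumulation differs from a 1×1 convolution, where the products across input channels are summed into a single accumulator.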
We have built an ML processor on an Arty A7-35T using CFU Playground. It runs MobileNetV2 inference 5.5 times faster than the baseline, in 197M cycles.