Building an ML Accelerator using CFU Playground (Part 2)
We have built a tiny machine learning (TinyML) accelerator on an Arty A7-35T using CFU Playground.
In part 2, we have accelerated inference of the Keyword Spotting model, called kws in CFU Playground.
CFU is an abbreviation for Custom Function Unit: a mechanism for adding hardware that implements RISC-V custom instructions in LiteX/VexRiscv.
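In CFU Playground, software issues these custom instructions through small wrapper macros (the cfu_op* macros in cfu.h), and the CFU gateware computes the result. As a rough illustration, here is a software model — not the actual gateware — of the kind of packed 8-bit multiply-accumulate a convolution-oriented CFU typically implements; the function name and lane layout are illustrative assumptions.

```c
#include <stdint.h>

/* Software model of a hypothetical CFU instruction: a multiply-accumulate
 * over four signed 8-bit lanes packed into each 32-bit operand, a common
 * pattern for int8 convolution CFUs. On the FPGA, the equivalent operation
 * would be issued from C as a RISC-V custom instruction (in CFU Playground,
 * via the cfu_op* wrapper macros) and computed by the CFU datapath. */
static int32_t cfu_simd_mac_model(int32_t acc, uint32_t rs1, uint32_t rs2) {
    for (int lane = 0; lane < 4; lane++) {
        int8_t a = (int8_t)(rs1 >> (8 * lane)); /* lane from operand rs1 */
        int8_t b = (int8_t)(rs2 >> (8 * lane)); /* lane from operand rs2 */
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}
```

For example, packing {1, 2, 3, 4} and {5, 6, 7, 8} into the two operands yields their dot product, 70, in a single modeled "instruction" instead of four separate multiplies and adds.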
CFU Playground
CFU Playground is a framework for tuning hardware (actually FPGA gateware) and software to accelerate the ML model of Google’s TensorFlow Lite for Microcontrollers (TFLM).
See this article for more details.
Keyword Spotting Model
The kws model is one of the TFLM models from the MLPerf Tiny Deep Learning Benchmarks.
As shown in the figure below, it is a 13-layer model consisting of five CONV_2D layers, four DEPTHWISE_CONV_2D layers (hereinafter dw_conv), and others.
The CONV_2D layers from the third layer onward operate as 1×1 convolutions (hereinafter 1x1_conv).
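Because a 1×1 convolution has no spatial extent, each output pixel reduces to a matrix-vector product over channels, which is what makes this layer type so amenable to acceleration. A minimal sketch — shapes and names are illustrative, not the actual TFLM kernel (quantization offsets and requantization are omitted):

```c
#include <stdint.h>

/* Sketch of a 1x1 convolution: for every pixel, each output channel is a
 * dot product of that pixel's input-channel vector with one filter row,
 * plus a bias. There is no sliding window, so no spatial loops over a
 * kernel are needed. Layouts are row-major and illustrative. */
void conv_1x1(const int8_t *input,   /* [pixels][in_ch]  */
              const int8_t *filter,  /* [out_ch][in_ch]  */
              const int32_t *bias,   /* [out_ch]         */
              int32_t *output,       /* [pixels][out_ch] */
              int pixels, int in_ch, int out_ch) {
    for (int p = 0; p < pixels; p++) {
        for (int o = 0; o < out_ch; o++) {
            int32_t acc = bias[o];
            for (int i = 0; i < in_ch; i++) {
                acc += (int32_t)input[p * in_ch + i] *
                       (int32_t)filter[o * in_ch + i];
            }
            output[p * out_ch + o] = acc;
        }
    }
}
```

The inner channel loop is exactly the multiply-accumulate pattern that a CFU can absorb, which is why 1x1_conv layers speed up so well.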
multimodel_accel project
The multimodel_accel project is an in-house project that aims to accelerate multiple ML models, whereas most CFU Playground projects are model-specific and accelerate only one ML model.
The background is that the kws_micro_accel project in CFU Playground is specialized for the kws model, so its speedups do not carry over to other models such as the Person Detection int8 (hereinafter pdti8) model introduced in the previous article.
To introduce the results first: as shown in the featured image and the table below, the total cycle count of the kws model has been reduced from 86.7M to 15.7M, a 5.5x speedup.
| Keyword Spotting Model | Cycles (Before) | Cycles (After) | Speedup factor |
|---|---|---|---|
| CONV_2D | 67.8M | 8.7M | 7.8 |
| DEPTHWISE_CONV_2D | 18.6M | 6.9M | 2.7 |
| Total | 86.7M | 15.7M | 5.5 |
The gateware for the Arty A7-35T runs at 100 MHz, so the latency is 157 ms.
Since the latency of an Arm Cortex-M4 running at 120 MHz in the MLPerf Tiny v0.5 inference benchmarks is 181.92 ms, this result is 16% faster in a direct comparison, and 39% faster when normalized for operating frequency.
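The comparison above can be checked with simple arithmetic. The constants come from the text; the helpers below just make the two comparison methods (direct latency vs. frequency-normalized cycle counts) explicit — this is a sanity check, not benchmark code.

```c
/* Latency in milliseconds for a given cycle count and clock (Hz). */
static double latency_ms(double cycles, double clock_hz) {
    return cycles / clock_hz * 1e3;
}

/* Speedup of `fast` over `slow`, as a fraction (0.16 == 16% faster). */
static double speedup(double slow, double fast) {
    return slow / fast - 1.0;
}
```

With the article's numbers: latency_ms(15.7e6, 100e6) gives 157 ms; the direct comparison speedup(181.92, 157.0) gives about 0.16; and normalizing by clock first — the Cortex-M4 spends 181.92e-3 * 120e6 ≈ 21.8M cycles — speedup(21.8e6, 15.7e6) gives about 0.39.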
The total cycle count of the pdti8 model is 50.3M. Since the pdti8-specialized project introduced in the previous article took 48.0M cycles, the cycle count has increased slightly.
Software Specialization & CFU Optimization
The multimodel_accel project uses 1x1_conv and dw_conv kernels that are specialized and optimized for both the kws and pdti8 models.
However, as the other layers were sped up, the first layer of the kws model came to account for a larger share of the processing time, so a kws_conv kernel specialized for the first layer of the kws model has been added.
In the pdti8 model, which has 31 layers in total, the first layer is neither specialized nor optimized; in the 13-layer kws model, however, the first layer has a stronger influence on total latency.
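For reference, dw_conv differs from a regular convolution in that each channel is filtered independently, with no accumulation across channels. A minimal sketch of a 3×3 valid-padding depthwise convolution — shapes and the function name are illustrative, not the project's actual optimized kernel (quantization is omitted):

```c
#include <stdint.h>

/* Depthwise convolution sketch: each input channel c is convolved only
 * with its own 3x3 filter slice, so the channel loop carries no cross-
 * channel sum. Layout is [height][width][channels], valid padding. */
void dw_conv_3x3_valid(const int8_t *input,  /* [h][w][ch]     */
                       const int8_t *filter, /* [3][3][ch]     */
                       int32_t *output,      /* [h-2][w-2][ch] */
                       int h, int w, int ch) {
    for (int y = 0; y + 2 < h; y++)
        for (int x = 0; x + 2 < w; x++)
            for (int c = 0; c < ch; c++) {
                int32_t acc = 0;
                for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        acc += (int32_t)input[((y + ky) * w + (x + kx)) * ch + c] *
                               (int32_t)filter[(ky * 3 + kx) * ch + c];
                output[(y * (w - 2) + x) * ch + c] = acc;
            }
}
```

Because the per-channel work is independent, dw_conv parallelizes differently from 1x1_conv, which is one reason the two kernels are specialized separately.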
Summary
We have built a TinyML accelerator on an Arty A7-35T using CFU Playground.
The accelerator infers the Keyword Spotting model, called kws in CFU Playground, 5.5 times faster, running in 15.7M cycles.