# Matrix Multiplication based on the RISC-V Vector Extension

We have created a matrix multiplication kernel based on the RISC-V Vector Extension (RVV) and evaluated its performance using an RTL simulator.

Click here for related articles.

- Running Auto-Vectorized Program on RISC-V Vector RTL Simulator
- Matrix Multiplication based on the RISC-V Vector Extension (this article)
- 1×1 Convolution based on the RISC-V Vector Extension

## Matrix Multiplication

The matrix multiplication kernel computes an n × p matrix C, which is the product of an n × m matrix A and an m × p matrix B (C=AB). Let c_{ij} be the element of i-th row and j-th column of matrix C, and c_{ij} is expressed by the following formula:

The code before vectorization looks like this:

// C = AB with A = [n x m], B = [m x p], C = [n x p] void matmul_int8(int32_t* c, const int8_t* a, const int8_t* b, const unsigned long int n, const unsigned long int m, const unsigned long int p) { for (int i = 0; i < n; ++i) { for (int j = 0; j < p; ++j) { int32_t sum = 0; for (int k = 0; k < m; ++k) { sum += a[i * m + k] * b[k * p + j]; } c[i * p + j] = sum; } } }

Considering application to machine learning, the elements of matrix A and B are `int8_t`

, and the elements of matrix C are `int32_t`

.

In previous article, we tested LLVM/Clang auto-vectorization, but this time we created a matrix multiplication kernel by hand tuning.

## Ara

Ara is an implementation of the RISC-V Vector Extension developed by the Parallel Ultra Low Power (PULP) project. Ara’s repository has the following description:

Ara is a vector unit working as a coprocessor for the CVA6 core. It supports the RISC-V Vector Extension, version 1.0.

Ara ensures scalability by implementing a number of 64-bit vector unit called lane. The config directory contains `[2|4|8|16]_lanes.mk`

for the Ara system configuration. Each lane contains an integer Arithmetic Logic Unit (ALU), an integer multiplier (MUL), and a Floating Point Unit (FPU) that can perform 64-bit wide integer and double precision floating point operations.

Note that the CVA6 in the quote is a 64-bit RISC-V core previously called PULP Ariane.

## Matrix Multiplication on Ara RTL Simulator

Ara can create an RTL simulator for Ara system using Verilator. This time, we used the default configuration `4_lanes.mk`

.

Below is the console output when running our matrix multiplication kernel.

$ cd $ARA/hardware $ app=imatmul_int8 make simv ... ============= = IMATMUL = ============= ... ------------------------------------------------------------ Calculating a (32 x 32) x (32 x 32) matrix multiplication... ------------------------------------------------------------ Initializing matrices... Calculating imatmul... The execution took 11119 cycles. The performance is 5.894055 OP/cycle. Verifying result... Passed. ------------------------------------------------------------ Calculating a (64 x 64) x (64 x 64) matrix multiplication... ------------------------------------------------------------ Initializing matrices... Calculating imatmul... The execution took 55627 cycles. The performance is 9.425063 OP/cycle. Verifying result... Passed.

Compared to the number of cycles in CVA6, we achieved a 40x speedup for 32 x 32 matrix multiplication and a 62x speedup for 64 x 64 matrix multiplication.

## Summary

We have created a matrix multiplication kernel based on the RISC-V Vector Extension and evaluated its performance using Ara’s RTL simulator.