# Matrix Multiplication on FPGA with the RISC-V Vector Extension

We implemented Vicuna, a coprocessor for the RISC-V Vector Extension, on an FPGA board and evaluated the performance of a matrix multiplication kernel running on it.

Related articles:

- Running Auto-Vectorized Program on RISC-V Vector RTL Simulator
- Matrix Multiplication based on the RISC-V Vector Extension
- 1×1 Convolution based on the RISC-V Vector Extension
- Matrix Multiplication on FPGA with the RISC-V Vector Extension (this article)

## Vicuna

Vicuna is a 32-bit integer vector coprocessor written in SystemVerilog. More precisely, Vicuna complies with the `Zve32x` extension that supports vector element widths of 8, 16, and 32 bits and does not require 64-bit elements or floating point support. However, at the time of writing, the divide instructions are missing.

Because Vicuna is a coprocessor, it must be paired with a main processor, either Ibex or CV32E40X.

## FPGA with Vicuna

For this article, we created gateware for the Digilent Nexys Video FPGA board.

The main specifications of the gateware are as follows.

- Processor
  - Main processor: Ibex
  - Coprocessor: Vicuna
    - VLEN (bit length of a vector register): 512 bits
    - Multiplier bit length: 256 bits
  - ISA: RV32IMCV
  - Operating frequency: 100 MHz
- SRAM: 256 KiB
- UART: 1 ch
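To make these figures concrete, the sketch below derives how many elements fit in one vector register and how many multiplier lanes are available for a given element width. The helper names are hypothetical, and reading the 256-bit multiplier width as `256 / SEW` parallel lanes is our assumption rather than a documented property of Vicuna.

```c
#include <stdint.h>

/* Configuration values taken from the gateware specification above. */
enum { VLEN_BITS = 512, MUL_WIDTH_BITS = 256 };

/* Elements of width sew_bits held by one vector register (LMUL = 1). */
static inline uint32_t elems_per_vreg(uint32_t sew_bits) {
    return VLEN_BITS / sew_bits;
}

/* Assumption: the multiplier handles MUL_WIDTH_BITS / sew_bits elements per pass. */
static inline uint32_t mul_lanes(uint32_t sew_bits) {
    return MUL_WIDTH_BITS / sew_bits;
}

/* Assumption: passes needed to multiply one full register, ignoring pipeline latency. */
static inline uint32_t mul_passes_per_vreg(uint32_t sew_bits) {
    return elems_per_vreg(sew_bits) / mul_lanes(sew_bits);
}
```

Under these assumptions, 32-bit elements give 16 elements per register and 8 multiplier lanes, so one full register takes 2 passes through the multiplier.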

## Matrix Multiplication on FPGA with Vicuna

We used the code below as a reference kernel for matrix multiplication.

```c
#include <stdint.h>

// C = AB with A = [M x K], B = [K x N], C = [M x N]
void imatmul_ref(const int M, const int N, const int K,
                 const int32_t* A, const int32_t* B, int32_t* C) {
  int i, j, k;
  int32_t sum;
  for (i = 0; i < M; ++i) {
    for (j = 0; j < N; ++j) {
      sum = 0;
      // Dot product of row i of A and column j of B
      for (k = 0; k < K; ++k) {
        sum += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = sum;
    }
  }
}
```
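The reference kernel computes one output element at a time. A vectorized kernel typically strip-mines the `j` loop instead, so that one row segment of `C` is accumulated in a vector register. The sketch below models that loop structure in plain C so it can run on any host. It is not the kernel measured in this article, and `imatmul_strip` and `VLMAX_E32` are hypothetical names. The comments mark where RVV instructions such as `vsetvli`, `vle32.v`, `vmacc.vx`, and `vse32.v` would take over.

```c
#include <stdint.h>

/* Hypothetical maximum vector length in 32-bit elements (VLEN = 512, LMUL = 1). */
#define VLMAX_E32 16

/* Plain-C model of a strip-mined kernel: C = AB, A = [M x K], B = [K x N]. */
void imatmul_strip(const int M, const int N, const int K,
                   const int32_t* A, const int32_t* B, int32_t* C) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; j += VLMAX_E32) {
            /* Models vsetvli: handle at most VLMAX_E32 columns per strip. */
            const int vl = (N - j < VLMAX_E32) ? (N - j) : VLMAX_E32;
            int32_t acc[VLMAX_E32] = {0};  /* models one accumulator register */
            for (int k = 0; k < K; ++k) {
                const int32_t a = A[i * K + k];        /* scalar operand */
                const int32_t* b_row = &B[k * N + j];  /* models vle32.v */
                for (int l = 0; l < vl; ++l) {
                    acc[l] += a * b_row[l];            /* models vmacc.vx */
                }
            }
            for (int l = 0; l < vl; ++l) {
                C[i * N + j + l] = acc[l];             /* models vse32.v */
            }
        }
    }
}
```

The point of this ordering is that the innermost loop walks a row of `B` with unit stride, which maps onto contiguous vector loads, while the element of `A` stays a scalar.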

In the other articles, the elements of matrices A and B were `int8_t`. Here we changed them to `int32_t` because Vicuna’s `vsext.vf2` instruction was giving incorrect results. Following the investigation, an issue was raised on GitHub.
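For context, `vsext.vf2` sign-extends each source element of width SEW/2 to a destination element of width SEW. A scalar model like the one below can serve as a cross-check against hardware output. It is only a sketch: the function names are hypothetical, and it assumes SEW = 16 (that is, `int8_t` to `int16_t`), since the element widths used in the original kernel are not stated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vsext.vf2 with SEW = 16: dst[i] = sign-extend(src[i]). */
void vsext_vf2_ref(const int8_t* src, int16_t* dst, size_t vl) {
    for (size_t i = 0; i < vl; ++i) {
        dst[i] = (int16_t)src[i];
    }
}

/* Returns the index of the first mismatch, or vl if got matches the model. */
size_t vsext_vf2_first_mismatch(const int8_t* src, const int16_t* got, size_t vl) {
    for (size_t i = 0; i < vl; ++i) {
        if (got[i] != (int16_t)src[i]) {
            return i;
        }
    }
    return vl;
}
```

Feeding the same input to the vector unit and to `vsext_vf2_ref`, then calling `vsext_vf2_first_mismatch` on the hardware result, would locate the first element that differs.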

The featured image above shows Vicuna’s performance.

For square matrices of size M = N = K = 32, 64, and 128, the matrix multiplication kernel based on the RISC-V Vector Extension achieves 5.865, 6.544, and 6.913 OP/cycle, respectively. Compared to the reference kernel, this is a speedup of 39x to 45x.
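The OP/cycle metric can be sketched as follows. We assume here that each inner-loop iteration counts as two operations (one multiply and one add), giving 2·M·N·K operations in total. The article does not spell out its counting convention, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: one multiply and one add per inner-loop iteration. */
static inline uint64_t matmul_ops(uint32_t M, uint32_t N, uint32_t K) {
    return 2ull * M * N * K;
}

/* Operations per cycle, given a measured cycle count for the kernel. */
static inline double ops_per_cycle(uint32_t M, uint32_t N, uint32_t K,
                                   uint64_t cycles) {
    return (double)matmul_ops(M, N, K) / (double)cycles;
}
```

Under this convention, M = N = K = 32 corresponds to 65,536 operations and M = N = K = 128 to 4,194,304 operations. The speedup over the reference kernel is then the ratio of the two OP/cycle values at the same matrix size.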

## Summary

We have implemented Vicuna, which complies with the RISC-V Vector Extension `Zve32x`, on an FPGA board and evaluated the performance of the matrix multiplication kernel.