Matrix Multiplication on FPGA with the RISC-V Vector Extension

We have implemented Vicuna, a coprocessor for the RISC-V Vector Extension, on an FPGA board and evaluated the performance of a matrix multiplication kernel running on it.

Related articles:

Running Auto-Vectorized Program on RISC-V Vector RTL Simulator
Matrix Multiplication based on the RISC-V Vector Extension
1×1 Convolution based on the RISC-V Vector Extension
Matrix Multiplication on FPGA with the RISC-V Vector Extension (this article)

Vicuna

Vicuna is a 32-bit integer vector coprocessor written in SystemVerilog. More precisely, Vicuna complies with the Zve32x extension, which supports vector element widths of 8, 16, and 32 bits and requires neither 64-bit elements nor floating-point support. At the time of writing, however, the divide instructions are not implemented.

Since Vicuna is a coprocessor, it requires a main processor: either Ibex or CV32E40X.

FPGA with Vicuna

For this article, we created gateware for Digilent’s Nexys Video FPGA board.

The main specifications of the gateware for Nexys Video are as follows.

- Processor
  - Main processor: Ibex
  - Coprocessor: Vicuna
    - VLEN (vector register length): 512 bits
    - Multiplier width: 256 bits
  - ISA: RV32IMCV
  - Operating frequency: 100 MHz
- SRAM: 256 KiB
- UART: 1 channel
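The figures in this list imply a few derived quantities. The following is a minimal sketch in C, not taken from the gateware sources. All names in it are hypothetical, and reading the 256-bit multiplier width as 256 / SEW element products per cycle is an assumption rather than something the list states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the specification list. */
enum {
  VLEN_BITS = 512,         /* vector register length */
  MUL_BITS = 256,          /* multiplier width */
  SRAM_BYTES = 256 * 1024  /* 256 KiB */
};

/* Elements of width sew_bits that fit in one vector register (LMUL = 1). */
static unsigned elems_per_vreg(unsigned sew_bits) {
  assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32); /* Zve32x widths */
  return VLEN_BITS / sew_bits;
}

/* ASSUMPTION: the multiplier handles MUL_BITS / sew_bits products per cycle. */
static unsigned mul_elems_per_cycle(unsigned sew_bits) {
  assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32);
  return MUL_BITS / sew_bits;
}

/* Bytes needed to hold A, B, and C for square n x n int32_t matrices. */
static size_t matmul_footprint_bytes(size_t n) {
  return 3 * n * n * sizeof(int32_t);
}
```

With these values, one vector register holds 16 int32_t elements. Three 128 x 128 int32_t matrices occupy 192 KiB, which leaves 64 KiB of the SRAM for everything else, whereas three 256 x 256 matrices would not fit.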

Matrix Multiplication on FPGA with Vicuna

We used the code below as a reference kernel for matrix multiplication.

#include <stdint.h>

// C = AB with A = [M x K], B = [K x N], C = [M x N]
void imatmul_ref(const int M, const int N, const int K, const int32_t* A,
                 const int32_t* B, int32_t* C) {
  int i, j, k;
  int32_t sum;
  for (i = 0; i < M; ++i) {
    for (j = 0; j < N; ++j) {
      sum = 0;
      for (k = 0; k < K; ++k) {
        sum += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = sum;
    }
  }
}

In the earlier articles, the elements of matrices A and B were int8_t. Here we changed them to int32_t because Vicuna’s vsext.vf2 instruction was giving incorrect results. We investigated the problem, and an issue for it has been raised on GitHub.
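The vectorized kernel itself is not listed in this article. As a rough illustration only, the sketch below models in portable C how a vector kernel for this computation is often structured: each row of C is processed in strips of up to one vector register, and each step multiplies a strip of a row of B by a single scalar from A and accumulates the result. The names imatmul_strip and VL_MAX are hypothetical, and the strip length of 16 assumes 32-bit elements, the 512-bit VLEN above, and LMUL = 1. It is not the kernel that was measured.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 512-bit VLEN / 32-bit elements = 16 lanes (LMUL = 1). */
#define VL_MAX 16

/* C = AB with A = [M x K], B = [K x N], C = [M x N], all row-major. */
void imatmul_strip(const int M, const int N, const int K, const int32_t* A,
                   const int32_t* B, int32_t* C) {
  assert(M >= 0 && N >= 0 && K >= 0);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; j += VL_MAX) {
      /* Strip-mining: the last strip of a row may be shorter than VL_MAX. */
      const int vl = (N - j < VL_MAX) ? (N - j) : VL_MAX;
      int32_t acc[VL_MAX] = {0}; /* stands in for one vector register */
      for (int k = 0; k < K; ++k) {
        const int32_t a = A[i * K + k]; /* scalar operand shared by all lanes */
        for (int v = 0; v < vl; ++v) {
          /* Models a vector-scalar multiply-accumulate over vl lanes. */
          acc[v] += a * B[k * N + j + v];
        }
      }
      for (int v = 0; v < vl; ++v) {
        C[i * N + j + v] = acc[v]; /* models a unit-stride vector store */
      }
    }
  }
}
```

Compared with imatmul_ref, the k loop sits outside the lane loop, so every access to B is to consecutive elements of one row. That access pattern is what lets a unit-stride vector load replace the strided column walk of the reference kernel.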

The featured image above shows Vicuna’s performance.

For square matrices of size M = N = K = 32, 64, and 128, the matrix multiplication kernel based on the RISC-V Vector Extension achieves 5.865, 6.544, and 6.913 OP/cycle, respectively. This is a speedup of 39x to 45x over the reference kernel.
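To relate these OP/cycle figures to run time, the sketch below converts them into cycle counts. The article does not define how operations are counted, so the sketch assumes the common convention of one multiply plus one add per inner-loop step, that is 2·M·N·K operations in total. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: each inner-loop step counts as 2 operations (multiply + add). */
static uint64_t matmul_ops(uint64_t m, uint64_t n, uint64_t k) {
  return 2 * m * n * k;
}

/* Cycle count implied by a reported OP/cycle figure. */
static double implied_cycles(uint64_t ops, double op_per_cycle) {
  assert(op_per_cycle > 0.0);
  return (double)ops / op_per_cycle;
}

/* Run time in microseconds for a clock frequency given in MHz. */
static double cycles_to_us(double cycles, double clock_mhz) {
  assert(clock_mhz > 0.0);
  return cycles / clock_mhz;
}
```

Under that assumption, the 32 x 32 x 32 case is 65,536 operations, so 5.865 OP/cycle corresponds to roughly 11,200 cycles, or about 112 µs at 100 MHz.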

Summary

We have implemented Vicuna, which complies with the Zve32x subset of the RISC-V Vector Extension, on an FPGA board and evaluated the performance of a matrix multiplication kernel running on it.