
Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors

Garcia, Adriano Marques; Malenza, Giulio; Birke, Robert; Aldinucci, Marco
2025-01-01

Abstract

The rising usage of compute-intensive AI applications with fast response time requirements, such as text generation using large language models (LLMs), underscores the need for more efficient and versatile hardware solutions. This drives the exploration of emerging architectures like RISC-V, which have the potential to deliver strong performance within tight power constraints. The recent commercial release of processors with silicon-enabled RISC-V Vector (RVV) extensions further amplifies the significance of RISC-V architectures, offering enhanced capabilities for parallel processing and for accelerating tasks critical to LLMs. This work evaluates the inference performance of the BERT, GPT-2, Gemma-2, LLaMA-3.2, and DeepSeek-LLM language models on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled RVV v0.7.1. We benchmarked the models with and without RVV, using OpenBLAS and BLIS as backends for PyTorch to enable vectorization. Our results show that the performance impact of RVV is closely tied to matrix shape and arithmetic intensity. In fact, vectorization can slow down GEMM operations due to memory-bound behavior, whereas higher batch sizes shift execution into the compute-bound region, where RVV shows clear benefits. We validate this behavior experimentally using roofline modeling and traced GEMM timing, revealing performance bottlenecks that are invisible to synthetic micro-benchmarks. While enabling RVV in OpenBLAS can speed up inference by up to 1.3x, its benefits are highly configuration-dependent. These insights suggest that workload characteristics, threading behavior, and datatype must be carefully aligned to unlock RVV's full potential. Our findings highlight both the promise and the current software limitations of running LLMs on RVV-enabled RISC-V platforms.
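The abstract's core argument, that GEMM shapes with low arithmetic intensity are memory-bound (so vectorization may not pay off) while larger batch sizes push execution into the compute-bound region, can be illustrated with a minimal roofline-style sketch. The peak-FLOPS and peak-bandwidth numbers below are placeholders, not SG2042 measurements, and the helper names are hypothetical:

```python
# Roofline-style classification of a GEMM C = A @ B with A of shape (M, K)
# and B of shape (K, N). Arithmetic intensity = FLOPs / bytes moved; the
# ridge point peak_flops / peak_bw separates memory- from compute-bound.

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k                                # one mul + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)   # minimal read A, B; write C
    return flops / traffic

def bound_region(intensity, peak_flops, peak_bw):
    ridge = peak_flops / peak_bw                         # ridge point of the roofline
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Growing the batch dimension M raises arithmetic intensity: with placeholder
# peaks of 1e11 FLOP/s and 1e10 B/s the ridge point is 10 FLOPs/byte.
for m in (1, 8, 64):
    ai = gemm_arithmetic_intensity(m, 4096, 4096)
    print(m, round(ai, 2), bound_region(ai, peak_flops=1e11, peak_bw=1e10))
```

With these placeholder peaks, single-token decoding (M = 1) lands deep in the memory-bound region, which matches the paper's observation that RVV can even slow such GEMMs down, while batched execution crosses the ridge point and benefits from vectorization.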
2025, vol. 177, pp. 1-15
https://doi.org/10.1016/j.future.2025.108242
Keywords: RISC-V, RVV, PyTorch, LLM, XuanTie C920, SOPHON SG2042, OpenBLAS, BLIS, GEMM
Files in this record:
1-s2.0-S0167739X25005369-main_compressed.pdf (publisher's PDF, Adobe PDF, 9.99 MB, restricted access)
FGCS_RVV_PrePrint-1.pdf (preprint, first draft, Adobe PDF, 3.4 MB, open access)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2105617