
2021 17th International Conference on Mobility, Sensing and Networking (MSN)

Design and Implementation of a RISC V Processor on FPGA
2021 17th International Conference on Mobility, Sensing and Networking (MSN) | 978-1-6654-0668-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/MSN53354.2021.00037

Ludovico Poli, Sangeet Saha, Xiaojun Zhai, Klaus D. McDonald-Maier
Embedded and Intelligent Systems Laboratory
University of Essex
Essex, UK

Abstract—
The RISC-V ISA is becoming one of the leading instruction sets for Internet-of-Things and System-on-Chip applications. Due to its strong security features and open-source nature, it is becoming a competitor to the popular ARM architecture. This paper describes the design of a lightweight, open-source implementation of a RISC-V processor using modern hardware design techniques, the implementation of the design onto a Field Programmable Gate Array (FPGA), and its testing. We wanted to create a RISC-V processor that is easy for beginners to learn from and lightweight enough to be implemented on even small FPGAs. While there are existing open-source implementations of RISC-V processors, none are intuitive enough for a beginner to follow. For this reason, in this paper we have minimised the use of conventions and components in modern processors that are not strictly necessary for a barebones implementation. For example, the processor does not include pipelining and uses a simple Harvard architecture. The barebones nature of the design allows for a lot of potential for upgradability. The implementation of each component, and the corresponding test benches, are written in concise and conventional System Verilog. The project produced a RISC-V processor with files for targeting the Basys 3 Artix-7 FPGA. Performance was tested using the Dhrystone benchmark and achieved a strong 2.276 DMIPS/MHz, even outperforming the ARM Cortex-A9, while maintaining very low resource utilization on the FPGA.

Keywords—RISC-V, CPU, Verilog, FPGA, open-source

I. INTRODUCTION

Instruction set architectures (ISAs) are standards for how the hardware of a processor should function and interact with the ISA’s own assembly language. ISAs have all the information necessary to create a processor that will run machine code correctly and consistently according to the standard [1]. This includes information about instructions, registers, memory access, arithmetic, data buses and so on. While every function of the processor is defined in the standard, how hardware and circuits are implemented is left up to the designer. Most ISAs will typically have many extensions and variations to suit the requirements of different designs. This could include extensions with support for multiplication, floating point numbers, or different data, address and instruction widths, or ISAs targeted for embedded systems, personal computers, super computers, etc.

Commercially successful ISAs have typically been proprietary and required licenses, with the specific designs and implementations being hidden behind patents and non-disclosure agreements. The new RISC-V ISA [2] is promising for the hardware industry with it being the first open-source modern ISA. RISC-V is a combination of work by the University of California and companies such as AMD, Google, Microsoft, IBM and many more. Since the finalisation of the first variants of RISC-V, many open-source tools and designs have been created and published [3].

RISC-V is exciting not only because it is open source, but also because it is an ISA that can compete with the other restricted commercial ISAs by Intel, AMD and ARM. While most academic designs are optimised for learning purposes, RISC-V is intended for modern practical use.

The open-source aspect of RISC-V made it particularly well suited for this project as we could use it to create an open-source CPU without having to use one of the "toy" or academic ISAs, and instead use something more realistic and practical. While there already exist open-source implementations of RISC-V processors, none is simple and intuitive enough for a beginner to follow.

In this paper, we have designed a lightweight, open-source implementation of a RISC-V processor using modern hardware design techniques, implemented the design onto a Field Programmable Gate Array (FPGA), and tested it. We wanted to create a RISC-V processor that is simple enough to learn from and lightweight enough to be implemented on even small FPGAs that students could afford.

While there are already several open-source implementations of RISC-V processors, they are only intended for practical usage. The consequence of this is that they are often too complex for a student to learn from. The commonly used PicoRV32 by Claire Wolf [5] is considered to be a very compact CPU, but even this uses complex periphery such as XIP SPI Flash Controllers, UARTs, GPIO controllers, and the RV32IMC variable length instruction ISA. The Berkeley CPUs (Rocket, BOOM and Sodor [6, 7, 8]) have the difficult entry requirement of being designed using the unique and rarely used Chisel [9] hardware design language. SCR1 by Syntacore [10] is considered to be industry grade with the core including 2 to 4 stage pipelining, debug support, AXI4/AHB-Lite external interfaces, machine and many other powerful features. There are many instances of processors with this level of complexity, and so there has been a rising demand for RISC-V processors that can be used as a learning example.

To make the design accessible, in our CPU we neglected all those conventions and components in modern processors that are not strictly necessary for a barebones implementation. For instance, the processor does not include pipelining and uses a simple Harvard architecture, which reduces complexity. Also, the implementation of each component, and the corresponding test benches, are written in concise and conventional System Verilog [4].

II. METHODOLOGY

A. The Instruction Set Architecture

The ISA used in this project is a reduced version of the RV64I base integer instruction set. RV64I only supports 64-bit integers and the instructions included are based on the minimal instruction set from “Computer Organization and Design RISC-V Edition: The Hardware Software Interface.” [11] The exact seven instructions are presented in Table 1.

Table 1 Instruction set

Instruction   Description
ld            Load double integer
sd            Store double integer
add           Addition
sub           Subtraction
and           Bitwise AND operation
or            Bitwise OR operation
beq           Branch if equal

ld and sd are the two memory related instructions. ld takes a double (64-bit value) from a given memory address and puts it into a register, while sd does the opposite and takes a double from a register and stores it at an address in memory.

add and sub are the two arithmetic instructions. add takes the contents of two source registers, adds them together using signed addition, and stores the result in a destination register. sub subtracts the contents of the two source registers instead. If the result of an arithmetic operation overflows (i.e., the result cannot fit in a double integer), then the result will wrap around back through the negative numbers. As all arithmetic is signed, the negative numbers are represented in two’s complement.

and and or are the two logical instructions in the instruction set. and takes the contents of two source registers, performs a bitwise AND operation, and stores the result in a destination register. or does the same but performs a bitwise OR operation on the source registers instead.

beq is the only control instruction (a conditional instruction). beq stands for “branch if equal”. It takes the contents of two source registers and verifies whether they are equal by checking whether their difference is zero. If they are equal, then the program will jump to a new instruction, based on a signed offset provided by the beq instruction. Normally the program counter register will increment through each instruction one at a time with each new clock cycle. However, when beq is used, the program counter can be incremented or decremented by a specified amount, jumping the program forwards or backwards by some number of instructions.
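The wrap-around arithmetic and the branch behaviour described for these instructions can be illustrated with a small behavioural model. The Python sketch below is not the authors' System Verilog; the function names are hypothetical, and the branch offset is assumed to be counted in whole instructions, as the description of beq suggests.

```python
MASK64 = (1 << 64) - 1  # keeps only the low 64 bits of a result


def to_signed64(value):
    """Interpret the low 64 bits of value as a two's-complement integer."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def add64(a, b):
    """Signed 64-bit addition that wraps around on overflow."""
    return to_signed64(a + b)


def sub64(a, b):
    """Signed 64-bit subtraction that wraps around on overflow."""
    return to_signed64(a - b)


def beq_next_pc(pc, rs1_value, rs2_value, offset):
    """Return the next program counter, counted in instructions (assumption).

    The branch is taken when the difference of the two operands is zero;
    otherwise the counter simply advances to the next instruction.
    """
    taken = sub64(rs1_value, rs2_value) == 0
    return pc + offset if taken else pc + 1


INT64_MAX = (1 << 63) - 1
wrapped = add64(INT64_MAX, 1)  # overflows into the most negative value
```

Adding one to the largest positive 64-bit value yields the most negative one, which is the wrap-around "back through the negative numbers" mentioned in the text.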

Figure 1 Processor Architecture

B. System design

The top level of the processor is based on the theory from “Computer Organization and Design RISC-V Edition: The Hardware Software Interface.” [11]. Each component is implemented as a System Verilog module with a focus on the code being conventional and highly readable. Each component includes a simple test bench written in System Verilog so that the behavior of the module can be tested in simulation. The entire system is designed so that each instruction is executed in one clock cycle (no pipelining). The main components can be seen in Figure 1 (the control unit is not included for simplicity as it connects to every other main component with control signals).

The ALU performs four different operations on its inputs and outputs the result. It can perform addition, subtraction, bitwise AND and bitwise OR. Which operation is performed is determined by the control signal coming from the control unit, which in turn is determined by the instruction type. Naturally, the ALU is used by the add, sub, and and or instructions. However, the ld, sd and beq instructions also require a calculation to be performed (such as the calculation of memory addresses). The ALU also has a “zero flag” output, which is a single-bit output that goes high when the result of an operation is zero. This is used during the beq instruction execution to decide whether to branch or not.
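As a rough illustration of the ALU behaviour just described, the following Python sketch models the four operations and the zero flag on 64-bit values. It is a behavioural model only: the operation names and the `alu` function are illustrative assumptions, since the text does not give the encoding of the ALU control signal.

```python
MASK64 = (1 << 64) - 1  # results are truncated to 64 bits

# Hypothetical operation names standing in for the ALU control signal.
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def alu(op, a, b):
    """Return (result, zero_flag) for one ALU operation on 64-bit inputs."""
    result = ALU_OPS[op](a, b) & MASK64
    zero_flag = result == 0  # goes high when the result is zero
    return result, zero_flag


# beq compares two registers by subtracting them and checking the zero flag.
_, branch_taken = alu("sub", 42, 42)

# ld and sd reuse the adder to compute a memory address (base + offset).
address, _ = alu("add", 1000, 33)
```

The two usage lines show why ld, sd and beq also pass through the ALU even though they are not arithmetic instructions.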
The data memory is an asynchronous read, synchronous write memory used for storing data generated by the program or data flashed during power up. It has a single address input shared during read and write operations. There is a write data input and a read data output. There are also control signals for enabling write and enabling read separately. The write data input is provided by the contents of one of the read register outputs of the register file. The read data output goes to the register file’s write data input. The address input is provided by the ALU output, since the address for reading and writing needs to be calculated by adding the register named in the instruction’s source register field to the offset field.

The instruction memory is a read-only memory (ROM) that is flashed with the program instructions when the design is uploaded onto the FPGA. The size of the memory is parameterizable within the System Verilog code of the instruction memory. The width of each memory location is 32 bits so that each can contain exactly one 32-bit RISC-V instruction. The instruction memory can be flashed by providing the compiler with a “.mem” file, which is a space-separated value file.
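To make the idea of a space-separated ".mem" file concrete, the sketch below parses such a file into a list of 32-bit words. It is only an illustration under stated assumptions: the words are assumed to be written in hexadecimal, `//` is assumed to start a comment, and `parse_mem` is a hypothetical helper rather than part of the authors' tool flow.

```python
def parse_mem(text, width_bits=32):
    """Parse whitespace-separated hexadecimal words into a list of integers."""
    words = []
    for line in text.splitlines():
        # Drop anything after a '//' comment marker (an assumed convention).
        payload = line.split("//", 1)[0]
        for token in payload.split():
            word = int(token, 16)
            if word >= (1 << width_bits):
                raise ValueError(f"{token} does not fit in {width_bits} bits")
            words.append(word)
    return words


# Three sample 32-bit words, one memory location each.
rom = parse_mem("02153483 00B50533\nFE000CE3  // last word")
```

Each parsed word fills one 32-bit location, so the index of a word in the list corresponds to its position in the memory.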
A small region of the data memory is also read-only and flashable on start-up, which is useful for a programmer since the small instruction set doesn't provide any immediate instructions. This can be flashed with the same method as instruction memory by using a “.mem” file.

The register file contains the 32 general-purpose registers, with each register being 64 bits wide. For the sake of simplicity, this deviates slightly from the RV64I ISA standard as usually certain registers are dedicated to holding specific values (stack pointers, frame pointers, etc.). It has two read address inputs and two corresponding asynchronous read data outputs. This is useful so that two registers can be read from in the same clock cycle, as in the case of adding two registers together. It has a single write address input and a synchronous write data input; this means that any register that is written to will only be updated with the new values on the next clock cycle. Writing is enabled with a control signal from the control unit. Register #31 is tied to the output pins so that it can be used as a debug register output for better observability.

The control unit is the most important component as it sends control signals to all the other components so they can coordinate with each other. It takes the opcode field of an instruction (see Figure 2) as the input and uses a lookup table to send the corresponding signals to the components. We chose to use a purely combinational control unit, rather than a microprocessor with microcode, since the instruction set is small and does not require the additional layer of abstraction provided by microcode.
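A lookup-table control unit of this kind can be sketched in Python as follows. The opcode values are the standard RISC-V ones for R-type, load, store and branch instructions; the signal names and values follow the single-cycle datapath in the textbook the design is based on [11], and are an assumption about, not a copy of, the authors' System Verilog.

```python
from typing import NamedTuple


class Control(NamedTuple):
    alu_src: int     # 1: second ALU input is the immediate offset
    mem_to_reg: int  # 1: register write data comes from data memory
    reg_write: int   # 1: enable the register file write
    mem_read: int    # 1: enable the data memory read
    mem_write: int   # 1: enable the data memory write
    branch: int      # 1: instruction is a conditional branch
    alu_op: int      # 2-bit hint used to select the ALU operation


# Purely combinational lookup keyed by the 7-bit opcode field.
CONTROL_TABLE = {
    0b0110011: Control(0, 0, 1, 0, 0, 0, 0b10),  # add, sub, and, or
    0b0000011: Control(1, 1, 1, 1, 0, 0, 0b00),  # ld
    0b0100011: Control(1, 0, 0, 0, 1, 0, 0b00),  # sd (mem_to_reg is a don't-care)
    0b1100011: Control(0, 0, 0, 0, 0, 1, 0b01),  # beq (mem_to_reg is a don't-care)
}


def control_unit(instruction):
    """Return the control signals for a 32-bit instruction word."""
    opcode = instruction & 0x7F  # the opcode occupies bits 6..0
    return CONTROL_TABLE[opcode]


signals = control_unit(0x00B50533)  # an R-type add in the standard encoding
```

In this table no entry asserts both `reg_write` and `mem_write`, so the register file and the data memory are never written in the same clock cycle.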

Figure 2 Instruction format
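For readers without access to the figure, the sketch below extracts the fields of a 32-bit instruction word using the standard RISC-V bit positions for the R, I, S and SB formats. It assumes the processor follows these standard layouts; the `decode` and `sign_extend` helpers are hypothetical names introduced for illustration.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python integer."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def decode(inst):
    """Split a 32-bit instruction word into its standard RISC-V fields."""
    inst &= 0xFFFFFFFF
    fields = {
        "opcode": inst & 0x7F,          # bits 6..0
        "rd": (inst >> 7) & 0x1F,       # bits 11..7
        "funct3": (inst >> 12) & 0x7,   # bits 14..12
        "rs1": (inst >> 15) & 0x1F,     # bits 19..15
        "rs2": (inst >> 20) & 0x1F,     # bits 24..20
        "funct7": (inst >> 25) & 0x7F,  # bits 31..25
    }
    # I-type offset (used by ld): bits 31..20.
    fields["imm_i"] = sign_extend(inst >> 20, 12)
    # S-type offset (used by sd): bits 31..25 and 11..7.
    fields["imm_s"] = sign_extend(((inst >> 25) << 5) | ((inst >> 7) & 0x1F), 12)
    # SB-type offset (used by beq): bit 31, bit 7, bits 30..25, bits 11..8.
    imm_b = (
        (((inst >> 31) & 0x1) << 12)
        | (((inst >> 7) & 0x1) << 11)
        | (((inst >> 25) & 0x3F) << 5)
        | (((inst >> 8) & 0xF) << 1)
    )
    fields["imm_b"] = sign_extend(imm_b, 13)
    return fields


load = decode(0x02153483)    # a load with destination x9, base x10, offset 33
branch = decode(0xFE000CE3)  # a branch comparing x0 with x0, offset -8 bytes
```

Only the opcode field is needed by the control unit; the register fields address the register file and the offset fields feed the ALU.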

A fundamental design principle of the processor is that each instruction is executed in one clock cycle. This is because it makes many aspects of the design simpler (e.g., no pipeline registers are required) and because getting high clock speeds is not a priority for the project. A design consequence of this is that any logical path in the design can only have one clocked process. In the design there are only three clocked processes: the program counter incrementing, the synchronous write of the register file and the synchronous write of the data memory. Even though the latter two processes are on the same logical path, they cannot happen at the same time because the control unit never asserts both write enables in the same clock cycle.

C. Turing Completeness

While the instruction set of our CPU is much reduced, it is possible to prove that it is Turing complete. An instruction set or language is Turing complete if it can implement a Turing machine or if it can implement the instructions of a language that has already been proven to be Turing complete.

We use the second method to show Turing completeness. In particular, we want to show that our assembly code can implement the eight simple instructions of the Brainfuck language [14], which is Turing complete. The instructions are shown in Table 2 with their equivalent C programming language implementation.

Table 2 Instructions in the Brainfuck language and their C equivalent.

brainfuck command   C equivalent
(Program Start)     char array[30000] = {0}; char *ptr = &array[0];
>                   ++ptr;
<                   --ptr;
+                   ++*ptr;
-                   --*ptr;
.                   putchar(*ptr);
,                   scanf(" %c",ptr);
[                   while (*ptr) {
]                   }

For each of the C equivalent statements one can create an equivalent with our CPU assembly language:
• (Program Start) We make sure we have RAM (>30K) memory preallocated. We devote a register (let us call it P) to act as ptr. The CPU already initialises all registers to 0.
• ‘>’ We can use add to increment P.
• ‘<’ As above, but using sub to decrement P.
• ‘+’ We use ld to load the data stored where P points, then add to increment it, and finally sd to store the result back.
• ‘-’ Same as above but using sub.
• ‘.’ We reserve part of the RAM (e.g., above address 30000) to write output. We use another register, O (for output), to point to that area. We use ld to load the data stored where P points, then sd to store the result where O points, and finally we use add to increment O (so future output is not overwritten).
• ‘,’ We reserve part of the ROM to get inputs from. We use another register, I (for input), to point to that area. We use ld to load the data stored where I points, then sd to store the result where P points, and finally we use add to increment I (so it points to the next input data, for future input operations).
• ‘[’ and ‘]’ We implement the while loop by simply devoting a register, Z (for zero), to containing the constant 0, then adding a few instructions before and after the body of the loop. At the beginning we use one ld to load the data stored where P points, then beq to compare that data with Z and jump forward to the end of the loop if the condition is met. At the end of the body, we do beq Z Z: because the two operands are identical, the instruction will always cause a jump. So, we can jump back to the conditional instructions just before the body of the loop.

D. Physical Implementation

Synthesizable designs can either be sent to a manufacturer to be printed onto an Application-specific Integrated Circuit (ASIC) or be programmed onto a Field Programmable Gate Array (FPGA), the latter being what was used in this project. FPGAs are programmable digital circuits that can have their logic reconfigured to match the HDL code [5]. The main appeal of FPGAs, compared to traditional circuit printing methods, is that digital circuits can be prototyped and changed quickly, easily and with no added cost. A processor put onto an FPGA, rather than an ASIC, is known as a soft processor.

The board we used is the Basys 3 Artix-7 FPGA Trainer Board [12]. This is a board suited for educational purposes as it offers lots of I/O (switches, buttons, LEDs, displays, etc.) and examples to use online. The main drawback of this board is that high performance and high clock speeds are difficult to achieve, which is an acceptable compromise given the aims of this project. The design tool that was used in this project is the Vivado Design Suite [13], which handles design synthesis, implementation and programming the FPGA.

III. RESULTS

Once the CPU was fully designed in System Verilog and tested successfully in simulation, two tests were run on the FPGA implementation described in II. Methodology D. Physical Implementation. The first involved running a very

small program to test the FPGA utilization of the processor (how “lightweight” it is). The program simply calculates the Fibonacci sequence and stores the sequence in the debug register using six assembly instructions, with one instruction being executed per clock cycle:

    ld r0, 33, r1
    add r1, r2, r31
    add r2, r0, r1
    add r31, r0, r2
    add r31, r0, r2
    beq r0, r0, -2

The instruction memory ROM, data memory ROM and data memory RAM were parameterized to be 256 bits each in depth. Using these parameters, the implementation was able to be optimized to achieve very low utilization, as shown in Table 3.

Table 3 FPGA Utilization Metrics

Resource         Utilization   Utilization %
Look-up tables   322           1.55
Flip-Flops       229           0.55
IO               18            16.98

The second test was to run a standard benchmark on the processor. The benchmark we used is the industry standard Dhrystone benchmark by Reinhold P. Weicker [15]. Dhrystone is a synthetic benchmark that measures the non-floating-point performance of a CPU using a realistic set of operations. The program itself is written in simple C and the main program loop is small, which makes it well suited for small processors. Dhrystone performance is measured in DMIPS and is usually normalized for the clock speed using DMIPS/MHz. Using a clock speed of 100 MHz, our processor performs very well against other common CPUs, as seen in Figure 3.

Figure 3 Dhrystone Benchmark Performance

One of the reasons our processor performed so well might be that the small instruction set, and hence the small amount of logic, means that it can be optimized more significantly by the compiler. It could also be the fact that the memory of our chip is completely on-chip and doesn’t require slower memory interfaces to do memory accesses. The small amount of FPGA utilization, as shown in Table 3, might also allow the compiler to optimize the instruction memory ROM into combinational logic, which could potentially increase the performance significantly.

IV. CONCLUSIONS

This paper presented the design methodology of a lightweight, open-source RISC-V processor. By sticking to only implementing what we considered to be the essential theory, we were able to create a system design we believe to be simple and intuitive for students to learn from. Due to the minimalistic instruction set, the logic required is small and so the utilization on hardware is very efficient, which achieves the objective of being lightweight. In addition to this, the Dhrystone performance could allow for potential practical and industry uses even if the processor was not originally intended for this application.

Due to our RISC-V instruction set being so minimal, it is not currently supported by common RISC-V toolchains. So, a natural progression to this project would be providing support for assemblers and compilers. The design hardware also has a lot of intentional room for upgrades, with the possibility of variants being created with pipelining and support for additional instruction sets such as RV64IM.

A. Source

The System Verilog design files and source can be found at https://github.com/Ludini1/minimal-risc-v-cpu.

B. Acknowledgements

This work is supported by the UK Engineering and Physical Sciences Research Council through grants EP/R02572X/1, EP/P017487/1, EP/V000462/1 and EP/V034111.

REFERENCES

[1] David A. Patterson and John L. Hennessy “Computer Organization and Design RISC-V Edition: The Hardware Software Interface.” Page 22 Morgan Kaufmann Publishers Inc., 2017, San Francisco, CA, USA.
[2] Andrew Waterman, Krste Asanović, and SiFive Inc. “The RISC-V Instruction Set Manual Volume I: User-Level ISA Document Version 2.” CS Division, EECS Department, University of California, Berkeley 2017
[3] RISC-V "RISC-V GNU Compiler Toolchain" GitHub 2021 https://github.com/riscv/riscv-gnu-toolchain/
[4] IEEE "IEEE Standard for SystemVerilog--Unified Hardware Design, Specification, and Verification Language," IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012) 2018
[5] Claire Wolf “PicoRV32 - A Size-Optimized RISC-V CPU” GitHub 2019 https://github.com/cliffordwolf/picorv32
[6] CHIPS Alliance "Rocket Chip" GitHub 2021, RISC-V International https://github.com/chipsalliance/rocket-chip
[7] Christopher Celio "The Berkeley Out-of-Order RISC-V Processor" GitHub 2021, University of California https://github.com/riscv-boom/riscv-boom
[8] Christopher Celio, "The Sodor Processor Collection" GitHub 2021, University of California https://github.com/ucb-bar/riscv-sodor
[9] Bachrach et al. "Chisel: constructing hardware in a Scala embedded language" Proceedings of the 49th Annual Design Automation Conference (DAC 2012). San Francisco, California, USA
[10] Syntacore "SCR1 RISC-V Core" GitHub 2021 https://github.com/syntacore/scr1
[11] David A. Patterson and John L. Hennessy “Computer Organization and Design RISC-V Edition: The Hardware Software Interface.” Morgan Kaufmann Publishers Inc., 2017, San Francisco, CA, USA.
[12] Digilent “Basys 3 Reference Manual” reference.digilentinc.com 2021 https://reference.digilentinc.com/reference/programmable-logic/basys-3/reference-manual
[13] Xilinx “Vivado Design Suite - HLx Editions” Xilinx.com 2021 https://www.xilinx.com/products/design-tools/vivado.html
[14] Brian Raiter "Brainfuck: An Eight-Instruction Turing-Complete Programming Language" muppetlabs.com 2013, http://www.muppetlabs.com/~breadbox/bf
[15] Alan R. Weiss "Dhrystone Benchmark History, Analysis, "Scores" and Recommendations" ECL, LLC, 2011, El Dorado Hills, California, USA
