Assembly #02 - More on x86
Loop without LOOP instruction
In the last lecture we relied mainly on the LOOP instruction, which uses RCX (ECX in
32-bit code) as its counter: each time LOOP executes, it decrements the register and
jumps back to the label while the result is nonzero. Although LOOP is the standard
way to do this, there are other ways to build a for-loop.
** The current print function (i.e., print_ret) has a slight problem. **
Sadly, because of how I implemented the signed version of printing, and the way some
operating systems are set up, print_ret may overwrite registers, so “custom” loops
require working with the stack. In any case, it is important to always follow a
safety procedure, and this is where stacking values is useful: push the registers you
are using before a call, then pop them (in reverse order) once the call returns.
For example, let’s try converting a for-loop without the LOOP instruction.
Python:
for i in range(0, 6):  # inclusive of 5 (i.e., i <= 5)
    print(i)
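One way the conversion could look is sketched below. It swaps LOOP for an explicit
counter, CMP, and conditional jumps, and it pushes the counter around the call as
described above. The label names are made up for illustration, and the sketch assumes
print_ret is the course helper that prints the value in RAX and may overwrite other
registers.

```nasm
extern print_ret            ; course helper: assumed to print the value in rax

section .text
global _start

_start:
    xor rcx, rcx            ; i = 0

for_loop:                   ; hypothetical label name
    cmp rcx, 5
    jg  for_done            ; leave once i > 5

    push rcx                ; save the counter; print_ret may overwrite it
    mov rax, rcx
    call print_ret          ; print(i)
    pop rcx                 ; restore the counter

    inc rcx                 ; i += 1
    jmp for_loop

for_done:                   ; hypothetical label name
    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Unlike LOOP, this form counts upward and lets us pick any register as the counter.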
While-Loops
While-loops are fundamental to programming, and there are a few ways to handle them.
Take this Python code, for example:
Python:
counter = 0
while counter <= 5:
    print(counter)
    counter += 1  # count up, otherwise the loop never ends
While-loops work on a true/false condition, meaning we need a CMP or TEST instruction
to set the flags and a conditional jump to act on them. Remember, CMP and TEST are a
main part of our control flow.
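As a minimal sketch of one of those ways, the version below places the condition at
the bottom of the loop and jumps to it first, so the check still happens before the
first pass. It assumes the counter is meant to count up to 5, that print_ret is the
course helper printing the value in RAX, and the label names are made up for
illustration.

```nasm
extern print_ret            ; course helper: assumed to print the value in rax

section .text
global _start

_start:
    xor rbx, rbx            ; counter = 0
    jmp while_check         ; test the condition before the first pass

while_body:                 ; hypothetical label name
    push rbx                ; save counter; print_ret may overwrite it
    mov rax, rbx
    call print_ret          ; print(counter)
    pop rbx                 ; restore counter
    inc rbx                 ; counter += 1

while_check:                ; hypothetical label name
    cmp rbx, 5              ; sets the flags from counter - 5
    jle while_body          ; condition true (counter <= 5): run the body

    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Putting the check at the bottom costs one extra jump at the start but leaves a single
conditional jump per pass.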
Advanced Memory Addressing in x86-64 (without hiding behind abstractions)
The x86-64 addressing formula is:
[base + index*scale + displacement]
Where:
• base = general-purpose register
• index = general-purpose register (optional)
• scale = 1, 2, 4, or 8 (usually matching the element size in bytes)
• displacement = constant offset
This idea was discussed earlier; in general it is used to index arrays.
Let’s look at an example:
# Python code for arrays (really a list lmao)
array = [10, 20, 30, 40, 50]
index = 2
print(array[index])      # prints 30
print(array[index + 1])  # prints 40
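The same two lookups could be written with the addressing formula roughly as follows.
This is a sketch, assuming 8-byte (dq) elements and that print_ret is the course
helper that prints the value in RAX and may overwrite other registers.

```nasm
extern print_ret            ; course helper: assumed to print the value in rax

section .data
array dq 10, 20, 30, 40, 50 ; five 8-byte (qword) elements

section .text
global _start

_start:
    mov rbx, array          ; base  = address of array[0]
    mov rcx, 2              ; index = 2

    mov rax, [rbx + rcx*8]  ; array[index]: scale 8 = element size, loads 30
    push rbx                ; print_ret may overwrite these
    push rcx
    call print_ret
    pop rcx                 ; pop in reverse order
    pop rbx

    mov rax, [rbx + rcx*8 + 8] ; array[index + 1]: displacement 8 = one element, loads 40
    call print_ret

    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Here base, index, scale, and displacement each map onto one piece of the Python
expression, which is why this form suits arrays so well.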
LEA - Compute Effective Address
One of the most useful instructions is LEA (Load Effective Address). It evaluates an
address expression and stores the resulting address in the destination register
without ever reading memory (i.e., no dereference happens). Because the address
expression is ordinary arithmetic, LEA can also be used purely as a computation.
Instruction syntax:
lea destination, [base + index*scale + displacement]
Essentially, we get the address arithmetic without the memory access. Let’s look at an example:
# Python sample code
rbx = 10
rcx = 3
sum = rbx + rcx * 4 + 8
Normally this would require multiple instructions (a multiply, then two adds) to
compute the sum. Let’s see how it is handled with LEA:
extern print_ret

section .text
global _start

_start:
    mov rbx, 10
    mov rcx, 3
    lea rax, [rbx + rcx*4 + 8]  ; rax = rbx + rcx*4 + 8
                                ; rax = 10 + 3*4 + 8 = 30
    ; rax holds 30 without a single ADD or MUL instruction!
    call print_ret              ; view rax

    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
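It’s important to note that LEA is not limited to standing in for MUL and ADD: it can
also stand in for SUB, but only when subtracting a constant (a negative displacement),
never a register. Likewise, the multiplication only works for the scale factors 1, 2,
4, and 8.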
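As a small illustration of LEA doing arithmetic beyond the add-and-multiply case, the
sketch below subtracts a constant through a negative displacement and then multiplies
by 5 by using the same register as both base and index. It assumes print_ret is the
course helper that prints the value in RAX.

```nasm
extern print_ret            ; course helper: assumed to print the value in rax

section .text
global _start

_start:
    mov rbx, 10
    lea rcx, [rbx - 3]      ; rcx = 10 - 3 = 7 (SUB by a constant only)
    lea rax, [rcx + rcx*4]  ; rax = 7 + 7*4 = 35 (multiply by 5)
    call print_ret          ; view rax

    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Subtracting one register from another has no place in the addressing formula, so that
case still needs a real SUB.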
LEA is also more powerful than this, as it can be used for address calculation (i.e.,
its normal use case).
; assume we have this array again
section .data
array dq 10, 20, 30, 40, 50  ; array[0] to array[4]

; how can we get the address of array[2] => 30 using LEA?
section .text
lea rsi, [array + 2*8]       ; rsi = &array[2]
; rsi now points to the memory where array[2] is stored
; it did not load array[2]'s value, only its address
This is just good old address math! A single LEA is typically at least as fast as the
equivalent MOV plus ADD/MUL sequence, and it leaves the flags untouched.
Arguments/Parameters
We understand how to hard-code values into registers (i.e., as constants) but what if
we want to handle dynamic user input?
For example, what if we want to pass in the numbers 10 and 5 to a program?
$ ./if 10 5
It was mentioned that specific registers are dedicated to passing arguments (RDI,
RSI, RDX, RCX, R8, and R9). While that is true, it is only a calling convention for
function calls: something still has to put the values into those registers.
For example:
$ ./if 10 5
# does not mean RDI now equals 10 and RSI equals 5
# we must grab them from the stack and store them accordingly
So how do we handle this? Don’t CLI arguments arrive as strings (‘char *’ in C)? They
do, so we will have to do something similar to what is normal in C → atoi.
In C:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;   // need at least one argument
    // int x = argv[1];       // wrong: argv[1] points to ASCII text, not a number
    int x = atoi(argv[1]);    // converts the ASCII text to an int
    printf("%d\n", x);
    return 0;
}
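To handle this ourselves, we first need to grab the arguments from the stack through
the stack pointer (i.e., RSP): on Linux, at _start, [rsp] holds argc, [rsp+8] holds
argv[0] (the program name), and [rsp+16] holds argv[1]. We then convert each argument
from its ASCII representation into an integer representation.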
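A minimal sketch of those two steps for the first argument is shown below. It assumes
a Linux process started directly at _start, handles only unsigned decimal digits (no
sign, no overflow check), and assumes print_ret is the course helper that prints the
value in RAX. The label names are made up for illustration.

```nasm
extern print_ret            ; course helper: assumed to print the value in rax

section .text
global _start

_start:
    mov rcx, [rsp]          ; argc
    cmp rcx, 2
    jl  done                ; no argument given: nothing to convert

    mov rsi, [rsp + 16]     ; argv[1], a pointer to the ASCII text
    xor rax, rax            ; result = 0

atoi_loop:                  ; hypothetical label name
    movzx rdx, byte [rsi]   ; next character, zero-extended
    cmp rdx, '0'
    jb  atoi_done           ; below '0' (including the 0 terminator): stop
    cmp rdx, '9'
    ja  atoi_done           ; above '9': stop
    sub rdx, '0'            ; ASCII digit -> value 0..9
    imul rax, rax, 10       ; shift the result one decimal place
    add rax, rdx            ; append the new digit
    inc rsi                 ; advance to the next character
    jmp atoi_loop

atoi_done:
    call print_ret          ; view rax

done:
    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Running it as ./if 10 5 would convert only the "10"; the second argument sits at
[rsp + 24] and could be converted the same way.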
Register Renaming
In reality, the CPU internally has more physical registers than the 16 we have
shown/discussed in past lectures.
A problem arises when we reuse registers, because we may introduce:
• True Dependency (Read After Write, RAW)
– This is where an instruction needs a result from a previous instruction.
• Output Dependency (Write After Write, WAW), a false dependency
– This is where two instructions write to the same register.
• Anti Dependency (Write After Read, WAR), also a false dependency
– This is where an instruction writes to a register that an earlier instruction
still needs to read as a source.
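The three cases above can be spotted in a few lines of code. The sketch below is only
an illustration with arbitrary values; it computes nothing useful and simply exits.

```nasm
section .text
global _start

_start:
    xor rbx, rbx
    xor rdx, rdx

    ; RAW (true dependency): the add reads the rax written just before it
    mov rax, 10
    add rbx, rax            ; needs the result of the mov above

    ; WAW: both instructions write rcx, and the second write must win
    mov rcx, rax
    mov rcx, 7

    ; WAR: the mov overwrites rax, which the add before it still reads
    add rdx, rax            ; reads rax
    mov rax, 5              ; writes rax; must not take effect before the read

    ; exit
    mov rax, 60
    xor rdi, rdi
    syscall
```

Only the RAW case reflects real data flow; the other two exist purely because the
same register name was reused.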
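Register renaming lets the CPU map the registers named in our code onto those extra
physical registers, which removes the false dependencies in hardware. On our side,
the main idea is to write code that avoids unnecessary dependencies.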
Let’s take a look at some Bad Code:
mov rax, [array]       ; rax = array[0]
add rax, [array + 8]   ; rax = array[0] + array[1]
sub rax, [array + 16]  ; rax = rax - array[2]
What are the issues here?
• RAX is used for both accumulation and subtraction in sequence.
• This forces the CPU to wait for the previous RAX instruction to finish before
starting the next.
How can we improve this?
mov rax, [array]       ; rax = array[0]
mov rbx, [array + 8]   ; rbx = array[1]
add rax, rbx           ; rax = array[0] + array[1]
mov rcx, [array + 16]  ; rcx = array[2]
sub rax, rcx           ; rax = rax - array[2]
What is different here?
• Now the CPU can schedule mov rbx, [array+8] and mov rcx, [array+16] in
“parallel”
• No false dependencies
• Exploits instruction-level parallelism (ILP)
Register renaming helps remove WAR and WAW hazards.
Without Renaming:
rax --> rax --> rax (sequential dependency chain)
With Renaming:
[one core]      [one core]
[rax + rbx] --> [rax + rcx] => result (parallelizable)