
Undocumented CPU Behavior:

Analyzing Undocumented Opcodes on Intel x86-64


Catherine Easdon
Why investigate undocumented behavior?
The “golden screwdriver” approach
● Intel have confirmed they add undocumented features to general-release chips for
key customers

"As far as the etching goes, we have done different things for different
customers, and we have put different things into the silicon, such as adding
instructions or pins or signals for logic for them. The difference is that it
goes into all of the silicon for that product. And so the way that you do it
is somebody gives you a feature, and they say, 'Hey, can you get this into
the product?' You can't do something that takes up a huge amount of die, but
you can do an instruction, you can do a signal, you can do a number of
things that are logic-related." ~ Jason Waxman, Intel Cloud Infrastructure
Group

(Source: [Link] )
Poor documentation
● Intel has a long history of withholding information from their manuals

(Source: [Link])
Poor documentation
● Even when the manuals don’t withhold information, they are often misleading or
inconsistent

Section 22.15, Intel Developer Manual Vol. 3:

Section 6.15 (#UD exception):


Poor documentation leads to vulnerabilities
● In operating systems
○ POP SS/MOV SS (May 2018)
■ Developer confusion over #DB handling
■ Load + execute unsigned kernel code on Windows
■ Also affected: Linux, MacOS, FreeBSD...
● In virtual machines / emulators
○ Privilege escalation on a cloud instance!
● In disassemblers
○ “Anti-disassembly”
○ Hide malicious code in plain sight

Source: Breaking the x86 ISA, Christopher Domas


Why investigate undocumented opcodes?
● Many undocumented x86 opcodes in the past
○ LOADALL, SALC, INT1 / ICEBP, UD0 / UD1…
● “Halt and catch fire” instructions
○ e.g. F00F C7C8 on Pentium
○ Could be used for denial of service attacks
○ “Killer poke”: POKE 62975, 0 (TRS-80 M100); POKE 59458,62 (Commodore PET)
● Instructions which create exploitable side-channels
○ CLFLUSH, PREFETCH…
● Hidden debug mechanisms
● Malicious microcode updates
● Undocumented behavior - bugs (“errata”)
● Capabilities of ultra-privileged modes (Intel ME, SMM…)
○ Security by obscurity is not enough
How can we find undocumented opcodes?
How many opcodes are documented?

● No-one knows!
● 1569 XED iclasses (~mnemonics)
● But >30 different encodings for MOV alone…
● 6290 iforms (e.g. ADC_GPRv_IMMb)
● Iforms still don’t account for all variations, e.g. some prefixes
The instruction search space

Opcode != instruction, but max instruction length: 15 bytes

2^(15*8) = 2^120 possibilities (~1.33 x 10^36) = 1,329,227,995,784,915,872,903,807,060,280,344,576


• If we test 1 billion instructions a second...it’ll only take us 4.21 x 10^19 years of 24/7/365 testing (that’s roughly 3 billion times the age of the universe)
• And there’ll be crashes + processor failures to deal with. Any volunteers?
• So a brute-force search for undocumented (or documented!) instructions is infeasible
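The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the names `search_space`, `RATE_PER_SECOND` and `SECONDS_PER_YEAR` are illustrative, and a 365-day year is assumed.

```python
# Back-of-the-envelope size of the x86 instruction search space.
MAX_INSTRUCTION_BYTES = 15            # architectural maximum instruction length
RATE_PER_SECOND = 1_000_000_000       # assumed test rate: 1 billion instructions/s
SECONDS_PER_YEAR = 365 * 24 * 60 * 60 # assumption: 365-day year


def search_space(n_bytes: int) -> int:
    """Number of distinct byte strings of exactly n_bytes bytes."""
    return 2 ** (n_bytes * 8)


total = search_space(MAX_INSTRUCTION_BYTES)
years = total / RATE_PER_SECOND / SECONDS_PER_YEAR

print(f"{total:,} candidate byte strings")
print(f"{years:.3g} years at {RATE_PER_SECOND:,} per second")
```

Passing a smaller length to `search_space` shows how quickly the space shrinks when only a short prefix of the instruction is searched.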
Approach 1: manual targeting

● Requires a deep understanding of the ISA


● Lots of time needed (and there’ll always be just one more opcode to
investigate…)
● Example: Corkami Standard Test by Ange Albertini (for Windows)
○ Aimed at identifying disassembler / OS flaws
○ Exploring misunderstood/undocumented behavior

[Link]

[Link] (Talk: ‘Such a weird processor - messing with x86 opcodes’)


Approach 2: opcode search

• Just search within 3-byte opcode range


• 2 ^ (3*8) = 16,777,216. Feasible for brute search
• BUT extremely buggy when executing (stack smashing, chains of seg faults…most likely
undocumented jumps!)
• Seg faults are much harder to handle than illegal opcode exceptions
Approach 3: tunneling
• Previous research in this area by Christopher
Domas - created Sandsifter tool
• Tunneling algorithm
• Depth-first search: execute instruction, observe its
length, increment last byte, repeat…
■ If length change - start incrementing new
last byte
■ If FF, set last byte to 00, start
incrementing second to last byte
• Instruction length determined via page faults (where
does the CPU stop decoding?)
[Link]
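The tunneling steps above can be sketched in Python against a toy decoder. This is not Sandsifter's code: `toy_length` is a made-up length oracle standing in for the page-fault measurement on a real CPU, and `tunnel` is a simplified reading of the algorithm as described on this slide.

```python
from typing import Callable, Iterator

MAX_LEN = 15  # architectural maximum instruction length in bytes


def toy_length(code: bytes) -> int:
    """Hypothetical decoder: length depends only on the first byte."""
    if code[0] == 0x0F:
        return 3
    if code[0] >= 0x80:
        return 2
    return 1


def tunnel(length_of: Callable[[bytes], int]) -> Iterator[bytes]:
    """Yield candidate instructions in tunneling (depth-first) order."""
    buf = bytearray(MAX_LEN)            # candidate, zero-padded to 15 bytes
    length = length_of(bytes(buf))
    index = length - 1                  # the byte currently being incremented
    while True:
        yield bytes(buf[:length])
        # On FF: reset to 00 and move to the previous byte.
        while index >= 0 and buf[index] == 0xFF:
            buf[index] = 0x00
            index -= 1
        if index < 0:
            return                      # first byte wrapped: search finished
        buf[index] += 1
        buf[index + 1:] = bytes(MAX_LEN - index - 1)  # clear trailing bytes
        new_length = length_of(bytes(buf))
        if new_length != length:
            # Length changed: start incrementing the new last byte.
            index = min(new_length, MAX_LEN) - 1
        length = new_length


candidates = list(tunnel(toy_length))
print(len(candidates), "candidates instead of", 2 ** 24, "three-byte strings")
```

With this toy decoder the search visits `0f 00 01` and `0f 01 00` but never `0f 01 01`, because the index only moves deeper when the observed length changes.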
Approach 3: tunneling
• Advantages:
• Reduces search space to ~1,000,000,000 instructions
• Much more stable (as it is guided by the CPU decoder)
• Flaws:
• Reduces search space to ~1,000,000,000 instructions!
• Assumes a length change is the only “interesting” change
• Assumes the CPU will stop decoding after one instruction...it doesn’t
• 00 00 isn’t padding - it decodes as an ADD. Note: a single 00 byte isn’t a complete instruction.
Approach 3: tunneling
● Sandsifter is currently the most stable
automated approach for detecting
undocumented instructions
○ Note: by default Sandsifter also searches for
disassembler bugs. Run with --unk flag only (not
--dis or --len).
● But Sandsifter has problems:
○ Lots of false positives - many valid but unusual
instructions unknown to Capstone disassembler
○ Assumes all instructions which throw SIGILL are
invalid and can never execute (not true)
Approach 3: tunneling
“If a REX prefix is used when it has no meaning, it is
ignored.”
How can we execute an undocumented
opcode?
Execution in ring 3 (user mode)
● Unsigned char array of hex
instruction bytes + function
prologue and epilogue
● mprotect to make page containing
array executable
○ Must align to start of page boundary
○ Assumes array fits in one page
● Create a function pointer to the
array and call it
● Need a signal handler if testing
undocumented instructions!
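The two mprotect caveats (page alignment, and the array fitting in one page) come down to simple address arithmetic. A minimal sketch follows; the helper names `page_span` and `fits_in_one_page` are hypothetical, and the addresses in the usage lines are made up for illustration.

```python
import mmap
from typing import Tuple

PAGE_SIZE = mmap.PAGESIZE  # page size of the machine running this sketch


def page_span(addr: int, size: int, page_size: int = PAGE_SIZE) -> Tuple[int, int]:
    """Return (aligned_start, length) of the whole pages covering
    the byte range [addr, addr + size). Assumes size > 0."""
    start = addr & ~(page_size - 1)                          # round down
    end = (addr + size + page_size - 1) & ~(page_size - 1)   # round up
    return start, end - start


def fits_in_one_page(addr: int, size: int, page_size: int = PAGE_SIZE) -> bool:
    """True if one page's worth of mprotect covers the whole array."""
    return page_span(addr, size, page_size)[1] == page_size


# A 32-byte array in the middle of a 4 KiB page fits in one page...
print(page_span(0x7F001234, 32, 4096), fits_in_one_page(0x7F001234, 32, 4096))
# ...but the same array 16 bytes before a page boundary straddles two pages.
print(page_span(0x7F001FF0, 32, 4096), fits_in_one_page(0x7F001FF0, 32, 4096))
```

Passing the rounded-down start and the full span length to mprotect would cover the straddling case too, at the cost of making a second page executable.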
Execution in ring 0 (kernel driver)
● Similar to user mode using
kernel functions
● Exception handling is the
hardest part
○ We’re not supposed to throw
exceptions in the kernel, but
most undocumented
instructions do fault
○ Die notifiers are the kernel
equivalent of signal handlers
Handling exceptions in the kernel driver
● Kernel source digging: do_error_trap calls do_trap_no_signal
which calls die. End result: kernel oops and our user process gets
killed
● Solution: use a die notifier to return NOTIFY_STOP
● But this alone will hang the system!
○ Need to re-enable interrupts on this core
○ Need to move the instruction pointer past the faulting
instruction: (instruction length - 2) if length > 2
● Alternatives: Systemtap, modify IDT
● Important to minimise messages in the kernel log

Source: [Link]
See also:
[Link]
[Link]
How can we determine what an
undocumented opcode does?
Monitoring opcode execution

● Clock cycles
○ Notoriously difficult to measure accurately
○ Kernel RDTSCP vs. counters
○ More realistic: distinguish between NOPs, simple instructions, and complex instructions
● Performance counters
○ Uops per execution port
○ Floating point operations
○ Memory operations
○ Lots more!
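For the clock-cycle measurements, one common way to get the coarse NOP / simple / complex distinction is to take the median of many timings and subtract the median of an empty harness. The sketch below assumes the cycle samples have already been collected elsewhere. The thresholds `nop_max` and `simple_max` are hypothetical placeholders that would need calibrating per machine.

```python
from statistics import median
from typing import Sequence


def classify_by_cycles(samples: Sequence[float],
                       baseline: Sequence[float],
                       nop_max: float = 0.5,
                       simple_max: float = 5.0) -> str:
    """Bucket an instruction by its net median cycle cost.

    samples  - cycle counts measured around the instruction under test
    baseline - cycle counts measured around an empty harness
    The thresholds are illustrative assumptions, not measured values.
    """
    net = median(samples) - median(baseline)
    if net <= nop_max:
        return "nop-like"
    if net <= simple_max:
        return "simple"
    return "complex"


# The median shrugs off the single outlier (90) caused by e.g. an interrupt.
print(classify_by_cycles([30, 31, 30, 90, 30], [30, 30, 31, 30, 30]))
print(classify_by_cycles([33, 34, 33], [30, 30, 30]))
print(classify_by_cycles([70, 72, 71], [30, 30, 30]))
```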
Execution port profiling
● Information sources:
○ Intel Optimization Reference
Manual
○ Agner Fog’s optimization manuals
(3 and 4)
● Microarchitecture-specific
● Needs runtime profiling (all
counters have overhead)
● Assume we can saturate all
relevant ports for an instruction
Execution port profiling
● Combine port uop counters
with other counters (memory,
FPU, branches…) to make a
best guess at functionality
● Surprisingly effective for
identifying broad categories
(branch, load/store address,
division…)
● Program must be locked to a
single core (taskset -c 0)
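The "best guess" step can be sketched as baseline subtraction over per-port uop counts. Everything here is illustrative: the counter values are invented, and `PORT_ROLES` is an assumed mapping loosely modelled on a Skylake-style port layout. The real mapping is microarchitecture-specific and should come from the manuals listed above.

```python
from typing import Dict, List

# Assumption: illustrative port roles, NOT valid for every microarchitecture.
PORT_ROLES: Dict[int, str] = {
    0: "alu/divide/branch", 1: "alu",
    2: "load address", 3: "load address",
    4: "store data", 5: "alu/shuffle",
    6: "alu/branch", 7: "store address",
}


def active_ports(test: Dict[int, int], baseline: Dict[int, int],
                 iterations: int, threshold: float = 0.1) -> Dict[int, float]:
    """Uops per iteration on each port, after subtracting a baseline run
    of the same harness without the instruction under test."""
    result = {}
    for port, count in test.items():
        per_iter = (count - baseline.get(port, 0)) / iterations
        if per_iter >= threshold:
            result[port] = per_iter
    return result


def guess_roles(ports: Dict[int, float]) -> List[str]:
    """Broad functional categories suggested by the active ports."""
    return sorted({PORT_ROLES[p] for p in ports})


# Hypothetical counter readings for 2,000,000 executions of one instruction.
test_run = {0: 1_000, 2: 1_000_500, 3: 1_000_200, 4: 300}
baseline_run = {0: 900, 2: 400, 3: 300, 4: 280}
active = active_ports(test_run, baseline_run, iterations=2_000_000)
print(active, guess_roles(active))
```

In this made-up run the uops split evenly across ports 2 and 3, which under the assumed mapping points to a load.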
Can we learn anything about faulting instructions?

(Source: Intel Developer Manual Vol. 3, section 6.15)


Can we learn anything about faulting instructions?

So...we can learn nothing, as the machine state is entirely restored to the pre-execution
state? What about that subset of exceptions which do result in loss of execution state?
Can we learn anything about faulting instructions?

Source: Table 6-2 ‘Priority among Simultaneous Exceptions and Interrupts’, Intel Developer Manual Vol. 3

Wait...if #UD is a fault from decoding the next instruction, does the instruction even
execute at all?
Can we learn anything about faulting instructions?

(Source: Intel Developer Manual Vol. 3, section 6.15)

Apparently yes: it can be speculatively executed.

Did someone say Spectre?...We can also target it with performance counters!

Do #UD instructions leave microarchitectural traces behind?


Can we defend against unknown
undocumented behavior?
Defending random.c from RDRAND
random: mix in architectural randomness in
extract_buf() [July 2012]

“Mix in any architectural randomness in extract_buf()
instead of xfer_secondary_buf(). This allows us to mix in
more architectural randomness, and it also makes
xfer_secondary_buf() faster, moving a tiny bit of additional
CPU overhead to process which is extracting the
randomness.

[ Commit description modified by tytso to remove an
extended advertisement for the RDRAND instruction. ]

Signed-off-by: H. Peter Anvin <hpa@[Link]>
Acked-by: Ingo Molnar <mingo@[Link]>
Cc: DJ Johnston <[Link]@[Link]>
Signed-off-by: Theodore Ts'o <tytso@[Link]>
Cc: stable@[Link]”

For full code, see:
[Link]
drivers/char/random.c
[Link]
source/drivers/char/random.c

What’s wrong here?
● HWRNG output added after all mixing - so it
controls the final “random” output
● Imagine HWRNG is compromised via malicious
microcode update
● Microcode has access to our hash.l[i] value
● It can output a value v which XORs to 0
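The cancellation is easy to reproduce in a toy model. The sketch below is not the kernel's code: `extract_xor_last` is a hypothetical stand-in for "hash the pool, then XOR in the HWRNG output". The HWRNG is modelled as a callable that is handed the digest, to represent microcode that can read the hash value.

```python
import hashlib
import os
from typing import Callable


def extract_xor_last(pool: bytes, hwrng: Callable[[bytes], bytes]) -> bytes:
    """Toy model: hash the pool, then XOR the HWRNG output in at the very end."""
    digest = hashlib.sha1(pool).digest()
    v = hwrng(digest)  # assumption: a compromised HWRNG can see the digest
    return bytes(a ^ b for a, b in zip(digest, v))


def honest_hwrng(_digest: bytes) -> bytes:
    """Ignores the digest and returns 20 fresh random bytes."""
    return os.urandom(20)


def malicious_hwrng(digest: bytes) -> bytes:
    """Returns v == digest, so digest XOR v is all zeros."""
    return digest


pool = b"entropy gathered from interrupts, disk timings, ..."
print(extract_xor_last(pool, honest_hwrng).hex())
print(extract_xor_last(pool, malicious_hwrng).hex())  # forty zeros
```

Returning `digest XOR target` instead would let the same attacker force any chosen 20-byte output, not just zero.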
Defending random.c from RDRAND
random: mix in architectural randomness in
extract_buf() [July 2012]
(commit message as quoted on the previous slide)

random: mix in architectural randomness earlier
in extract_buf() [September 2013]

“Previously if CPU chip had a built-in random number generator
(i.e., RDRAND on newer x86 chips), we mixed it in at the very
end of extract_buf() using an XOR operation.

We now mix it in right after the calculate a hash across the
entire pool. This has the advantage that any contribution of
entropy from the CPU's HWRNG will get mixed back into the
pool. In addition, it means that if the HWRNG has any defects
(either accidentally or maliciously introduced), this will be
mitigated via the non-linear transform of the SHA-1 hash
function before we hand out generated
output.

Signed-off-by: "Theodore Ts'o" <tytso@[Link]>”

random: use the architectural HWRNG for the
SHA's IV in extract_buf() [December 2013]

“To help assuage the fears of those who think the NSA can
introduce a massive hack into the instruction decode and out
of order execution engine in the CPU without hundreds of Intel
engineers knowing about it (only one of which woud need to
have the conscience and courage of
Edward Snowden to spill the beans to the public), use the
HWRNG to initialize the SHA starting value, instead of xor'ing it
in afterwards.

Signed-off-by: "Theodore Ts'o" <tytso@[Link]>”

For full code, see:
[Link]
drivers/char/random.c
[Link]
source/drivers/char/random.c
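Why the ordering matters can be shown with a small comparison. This is a sketch under stated assumptions, not the kernel's construction. Python's `hashlib` does not expose the SHA-1 starting value, so `extract_mix_first` prepends the HWRNG value to the hashed input as a stand-in for "mix before the non-linear transform". Both function names are hypothetical.

```python
import hashlib


def extract_xor_last(pool: bytes, v: bytes) -> bytes:
    """Toy model of the July 2012 ordering: hash first, XOR HWRNG value last."""
    digest = hashlib.sha1(pool).digest()
    return bytes(a ^ b for a, b in zip(digest, v))


def extract_mix_first(pool: bytes, v: bytes) -> bytes:
    """Toy model of the later ordering: the HWRNG value goes through SHA-1.
    Prepending v is an assumption of this sketch, not the kernel's IV change."""
    return hashlib.sha1(v + pool).digest()


pool = b"entropy pool contents"
v = hashlib.sha1(pool).digest()  # value chosen to cancel the final XOR

print(extract_xor_last(pool, v).hex())   # all zeros: attacker controls output
print(extract_mix_first(pool, v).hex())  # the same v no longer cancels
```

To control the output of the second variant, the attacker would have to steer the SHA-1 result itself rather than cancel one XOR.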
Opcode Tester
● Concept: automate CPU analysis as far as possible
● Execute and analyze instructions in ring 3 and ring 0
○ Command-line tool
○ Input: Sandsifter log file (or similarly formatted instruction list)
○ Filter Sandsifter false positives with XED
○ Clock cycles, functionality analysis
○ Stable ring 0 (kernel driver) #UD handling
■ Test in ring 0 to see if an instruction is valid but privileged
■ Execute 500,000+ illegal instructions in the kernel...And nothing explodes!
■ Segfaults are harder to handle - but only crash the program, not the OS
● Lots more potential for development

[Link]
Where next?
Unanswered questions...
● What might we find looking for undocumented instructions in:
○ SGX
○ SMM
○ Other machine modes
○ ME (separate coprocessor)
○ Non-Intel processors
Unanswered questions...
● Why do instructions which normally throw #UD sometimes throw #GP
instead? Is this a hardware bug, and could it be exploited?
Unanswered questions...
● How can we feasibly test for undocumented behavior which depends on
‘password’ register values?

EDI=9C5A203A
activates 4 debug MSRs on AMD K7
Unanswered questions...
● Recall: do speculatively executed #UD instructions leave
microarchitectural traces?

(Source: Intel Developer Manual Vol. 3, section 6.15)


Thank you for listening!
Any questions?
Bonus: incrementing IP - when does an exception occur?
● User program:
○ LEA: Copy RIP + 0 offset to RDX
○ MOV: Set EAX to 0
○ CALL: Push return address (current RIP
value) onto stack and jump to absolute
address (value of RDX)
○ CALL E8 rel32 calls near with displacement
relative to next instruction (so call next
instruction, basically)
● Kernel program:
○ Mov RIP to RDX and then CALL the line
itself
○ CALL FF calls near absolute indirect
address given in register
