The instruction set of the x86 is too complicated

When reading the report of an exploit by a security firm, one invariably finds x86 assembly code. I would stumble on

 xbegin  mayEnd
 cmp     mutex, 0
 jz      weAreDone
 xabort  $0xff

and not be sure what it did. Long ago I had programmed other chips in assembly, so I felt a day or two would give me some idea of the assembly language for the Intel chips.

Wrong! The instruction set is so complex that in 2016 Intel needs 3796 reference pages to describe it, and that does not include examples or application notes. To get an idea of the complexity of the instruction set, I created a visualization of the reference manual (below). I used chapter 5, "Instruction Set Summary", as a guide. It groups the x86 instructions into categories:

1 General purpose
2 X87 FPU
3 MMX
4 SSE
5 SSE2
6 SSE3
7 Supplemental SSE3 (SSSE3)
8 SSE4.1
9 SSE4.2
10 16-Bit Float Conversion
11 Fused-multiply-add (FMA)
12 Vector Extensions 2 (AVX2)
13 Transactional Synchronization (TSX)
14 System
15 64-Bit Mode
16 Virtual-machine Extensions
17 Safer Mode Extensions
18 Memory Protection Extensions
19 Software Guard Extensions

Intel went through several generations of single instruction, multiple data (SIMD) instruction sets, starting with MMX and continuing through several generations of SSE. These instructions specialize in the array operations that are common in linear algebra and in image and sound processing. There are also several groups to help run operating systems: memory protection to isolate multiple programs on one computer, TSX to help with multithreaded operations, and virtual machine extensions to help multiple operating systems coexist. Some of the instruction groups are for secure computing: Safer Mode Extensions, to help verify code integrity and execute what was intended, and Software Guard Extensions.

I collected the instructions into these groups and asked how often the different groups occur together on one page. The idea was that the more commingling there is, the harder the manual is to understand. I extracted the instructions from chapter 5 and then went through all the pages, creating a column for each page. For each of the 19 groups, I checked whether instructions from that group appeared on the page. Plotting the result gives a picture of the distribution of the instructions.

[Figure: abstract version of the sparklines chart]

The general purpose instructions (row 1) are used throughout the manual. The next most used group is the system instructions (row 14), which are useful for writing multi-user, multi-process programs such as operating systems. This group includes the LOCK prefix and the model-specific register instructions RDMSR and WRMSR.
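The per-page check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script I actually used: the group contents, the sample pages, and the names GROUPS, presence_matrix, and render are made up for the example, and the real mnemonic lists would come from chapter 5.

```python
import re

# Hypothetical sample: a few mnemonics per group. The real lists would be
# extracted from chapter 5 of the manual.
GROUPS = {
    "General purpose": {"MOV", "ADD", "CMP", "JZ"},
    "TSX": {"XBEGIN", "XEND", "XABORT", "XTEST"},
    "System": {"LOCK", "RDMSR", "WRMSR"},
}

# Assumption: mnemonics are written in upper case, so matching is
# case-sensitive. This avoids counting ordinary words such as "add" or "lock".
MNEMONIC = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def presence_matrix(pages, groups):
    """Return {group: [0 or 1 per page]}, 1 if the page mentions the group."""
    matrix = {name: [] for name in groups}
    for text in pages:
        words = set(MNEMONIC.findall(text))
        for name, mnemonics in groups.items():
            matrix[name].append(1 if words & mnemonics else 0)
    return matrix


def render(matrix):
    """Draw one text row per group: '#' where the group appears, '.' elsewhere."""
    return "\n".join(
        f"{name:<16} " + "".join("#" if hit else "." for hit in row)
        for name, row in matrix.items()
    )


# Made-up stand-ins for three pages of the manual.
pages = [
    "MOV copies data and ADD adds two operands.",
    "XBEGIN starts a transaction, XABORT aborts it. CMP compares operands.",
    "RDMSR and WRMSR read and write model specific registers.",
]

matrix = presence_matrix(pages, GROUPS)
print(render(matrix))
```

Each row of the output corresponds to one group and each character to one page, which is the same layout as the chart, just much smaller.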

There are about 1.6 million words in the manual. Reading it 8 hours a day at novel-reading speed would take me about 17 days. But this is not a novel; it is a complicated reference book. Learning it well as a full-time job could take from months to years.
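The 17-day figure is simple arithmetic. A quick check in Python, where the 200 words per minute is my assumption for "novel-reading speed" (the word count and the 8-hour day come from the text above):

```python
WORDS_IN_MANUAL = 1_600_000  # approximate word count of the manual
WORDS_PER_MINUTE = 200       # assumed novel-reading speed
HOURS_PER_DAY = 8            # reading as a full-time job


def reading_days(words, words_per_minute, hours_per_day):
    """Days needed to read `words` at the given pace."""
    minutes = words / words_per_minute
    return minutes / 60 / hours_per_day


days = reading_days(WORDS_IN_MANUAL, WORDS_PER_MINUTE, HOURS_PER_DAY)
print(f"{days:.1f} days")  # prints "16.7 days"
```

A slower, more careful pace pushes the number up quickly; at half that speed it is already over a month of full working days.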

Just to start reading code, I went through sensepost’s crash course in x86 assembly for reverse engineers and Nayuki’s "A fundamental introduction to x86 assembly programming". From there I just played with the excellent in-browser compiler by Matt Godbolt. And if all else fails, there are always the big Intel manuals.