How can we produce reassemblable disassembly? In other words, why can't the assembly generated by objdump be assembled again by an assembler like gas? First of all, on CISC machines such as x86, it is non-trivial to disassemble a binary faithfully if it is obfuscated or self-modifying. Fortunately, binaries built with typical compilers and written in a modern style, which we call commodity off-the-shelf (COTS) binaries, are not too difficult to disassemble. Then, given a correct disassembly, what issues remain in reproducing a binary that is semantically equivalent to the original?

1) Direct jmp/call

jmp     0x800000ff
calll   0x800000ff

Without a precise restriction on the instruction layout (e.g., via a linker script), the above code misbehaves after reassembly if the layout shifts by even a single byte. Such shifts are common around the nop-like (or faulty) gaps between two functions.

In fact, the solution is rather simple: symbolize all addresses. As in objdump's output, 1) we first label every single instruction, and 2) replace each operand that is a target address with the corresponding symbol (label). The above assembly is then rendered as follows:

A8000000: jmp        A800000ff
A8000005: calll      A800000ff

A800000ff: ...

This principle, which we call symbolization, is the foundation that enables us to produce reassemblable disassembly from COTS binaries.
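
The two steps above can be sketched in Python as follows. This is only a minimal illustration, not the actual tool: it assumes the disassembler output is already available as (address, text) pairs in AT&T syntax, and the helper names (label_for, symbolize) and the regular expression are hypothetical.

```python
import re

# Matches a direct branch/call whose operand is a bare hex address.
# Indirect forms such as "calll *%eax" do not match because of the "*".
DIRECT_BRANCH = re.compile(r"^(jmp|call[lq]?|j[a-z]{1,4})\s+0x([0-9a-f]+)$")


def label_for(addr):
    """Assumed naming scheme: 'A' followed by the hex address."""
    return "A%x" % addr


def symbolize(instructions):
    """instructions: list of (address, text) pairs from a disassembler.

    Step 1: every instruction gets its own label.
    Step 2: every direct branch target is replaced by the target's label.
    """
    lines = []
    for addr, text in instructions:
        text = text.strip()
        m = DIRECT_BRANCH.match(text)
        if m:
            target = int(m.group(2), 16)
            text = "%s %s" % (m.group(1), label_for(target))
        lines.append("%s: %s" % (label_for(addr), text))
    return lines


if __name__ == "__main__":
    for line in symbolize([(0x8000000, "jmp     0x800000ff"),
                           (0x8000005, "calll   0x800000ff")]):
        print(line)
```

Because the operand is now a label, the assembler recomputes the actual target address, so the code keeps working even if the layout changes.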

2) Indirect jmp/call

Symbolization naturally makes indirect jmp/call instructions reassemblable as well. Consider the example below.

movl     $0x80000ff, %eax
calll    *%eax

As long as we correctly symbolize the operand of movl (here a function pointer, i.e., a source of code pointers), the assembly can be reassembled correctly.

A8000000: movl     $A80000ff, %eax
A8000011: calll    *%eax

A80000ff: ...

Then, how can we identify all instructions that might introduce code pointers?

3) Operands

To faithfully symbolize all code pointers, we have to inspect every operand of every instruction and symbolize it if it is a code pointer. To do so, we rely on the operand type information (X86_OP_IMM or X86_OP_MEM) reported by Capstone. A constant (immediate) address or a displacement is symbolized if it points into the code segment, whose range we extract from the ELF header.
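
The range check can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the input is a 32-bit little-endian ELF image held in memory, "code segment" is taken to mean any executable PT_LOAD segment, and the function names (executable_ranges, is_code_pointer) are hypothetical. With Capstone in detail mode, the value passed to is_code_pointer would presumably come from an operand's imm field (X86_OP_IMM) or its mem.disp field (X86_OP_MEM).

```python
import struct

PT_LOAD = 1   # loadable segment
PF_X = 0x1    # segment is executable


def executable_ranges(elf):
    """Return [(start, end)] virtual-address ranges of the executable
    PT_LOAD segments of a 32-bit little-endian ELF image (bytes)."""
    if elf[:4] != b"\x7fELF" or elf[4] != 1 or elf[5] != 1:
        raise ValueError("expected a 32-bit little-endian ELF file")
    # ELF32 header: e_phoff at offset 28, e_phentsize/e_phnum at 42/44.
    e_phoff, = struct.unpack_from("<I", elf, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 42)
    ranges = []
    for i in range(e_phnum):
        # ELF32 program header: eight 32-bit fields.
        (p_type, _offset, p_vaddr, _paddr,
         _filesz, p_memsz, p_flags, _align) = struct.unpack_from(
            "<8I", elf, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD and p_flags & PF_X:
            ranges.append((p_vaddr, p_vaddr + p_memsz))
    return ranges


def is_code_pointer(value, ranges):
    """True if an immediate or displacement falls inside a code range."""
    return any(lo <= value < hi for lo, hi in ranges)
```

An operand that passes this check is only a candidate; the check says nothing about whether the value is actually used as a pointer.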

4) Data section

To relocate the data sections (e.g., .data or .bss), we also have to symbolize the code pointers embedded in them (e.g., jump tables, or global variables initialized with a function pointer).

We first scan every 4-byte chunk of the data section (for a 32-bit binary) and check whether it points into the code section. But we immediately realized that this is a large source of false positives: data values that are not code pointers get symbolized incorrectly. In some binaries, in particular those with crypto-related static tables, such a false positive is critical because it changes the semantics of the binary.

To reduce false positives, one could instrument the binary dynamically to figure out whether a value is used as a code pointer or just as data. However, we decided to check validity statically, with two conditions: the chunk should point to the start (prefix) of a valid, well-aligned instruction in the code section, and a group of code pointers should have an xref on its leading code pointer (an xref is the set of addresses that refer to the current address). To do so, we label the address of every single byte of the data region, and we symbolize a 4-byte chunk if it points into the code segment and some instruction refers to it (its xref is not empty). The example below should make this clear.

.section .data
A80001000: .byte 0x00
A80001001: .byte 0x11
A80001002: .long A80000000   /* {xref:['A800000ee']} */
A80001006: .byte 0x22
A80001007: .byte 0x22
A80001008: .long A80000011   /* {xref:['A800000ff']} */
A8000100c: .long A80000022
A80001010: .long A80000033

For example, A80001008 might be a jump table and A80001002 might be a global variable that was initialized with a function pointer (A80000000). The global variable was accessed by the instruction at 0x800000ee (in the original binary) and the jump table was accessed by the instruction at 0x800000ff.
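
The two static conditions can be sketched as a single pass over the data section. This is a minimal sketch, not the actual implementation: it assumes the set of instruction start addresses and the xref map have already been computed by the disassembly pass, and the function name and parameters are hypothetical. A chunk is accepted if it points to an instruction start and either has an xref itself (the leader of a group) or directly follows an accepted pointer (a later entry of the same group).

```python
import struct


def find_code_pointers(data, data_base, insn_starts, xrefs):
    """Return {address: target} for 4-byte chunks judged to be code pointers.

    data        -- raw bytes of the data section (32-bit little-endian)
    data_base   -- virtual address of data[0]
    insn_starts -- set of addresses where a disassembled instruction begins
    xrefs       -- dict: data address -> set of referring instruction addresses
    """
    found = {}
    off = 0
    while off + 4 <= len(data):
        addr = data_base + off
        value, = struct.unpack_from("<I", data, off)
        in_group = (addr - 4) in found   # continues a run of pointers
        if value in insn_starts and (xrefs.get(addr) or in_group):
            found[addr] = value
            off += 4                     # skip the bytes of this pointer
        else:
            off += 1                     # plain data byte, try next offset
    return found
```

On the layout shown above, this accepts the lone function pointer at A80001002 and the three-entry table starting at A80001008, while a value that merely looks like a code address but has no xref stays a plain byte sequence.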

5) Oddities

There are a few oddities that are not easy to anticipate at first glance. For example, SSE instructions (those using xmm registers on x86) can require 16-byte alignment of their memory operand, i.e., of a data pointer. So, whenever we meet an xmm instruction, we enforce that alignment in the data section.

.section .data
.align 16 
A80001000: .byte 0x00 /* {xref:['A800000ff']} */
A80001001: .byte 0x11
A80001002: .byte 0x22

A800000ff: movsd A80001000, %xmm1
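
A sketch of how the data emitter could enforce this is given below. It assumes the set of data addresses referenced by xmm instructions has been collected beforehand; the function name emit_data and its parameters are hypothetical, and the output format simply mirrors the listing above.

```python
def emit_data(data, base, xmm_targets):
    """Emit one labeled .byte per data byte, inserting `.align 16`
    before every address that an xmm instruction refers to.

    data        -- raw bytes of the data section
    base        -- virtual address of data[0]
    xmm_targets -- set of data addresses used as xmm memory operands
    """
    lines = [".section .data"]
    for off, byte in enumerate(data):
        addr = base + off
        if addr in xmm_targets:
            lines.append(".align 16")
        lines.append("A%x: .byte 0x%02x" % (addr, byte))
    return lines


if __name__ == "__main__":
    print("\n".join(emit_data(bytes([0x00, 0x11, 0x22]),
                              0x80001000, {0x80001000})))
```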

Binary Patching

In fact, the goal of this pipeline, which produces relocatable, reassemblable disassembly, is to make binary patching easy (e.g., no trampolines, no crafting to make room for patches, etc.) and, more importantly, to keep the result human-readable (and handy for analysis). For example, we could easily detect well-known code gadgets (say, malloc()) and then symbolize them (adding a label func_malloc: to the beginning of the gadget). A patch (e.g., CFI, safe stack, etc.) can then easily reuse such code or a library. Since the disassembly is relocatable, inserting new code (a patch) or transforming existing code does not require special care, and this opens up the possibility of applying well-understood optimization techniques such as dead-code/data elimination or peephole optimization.