How can we produce reassemblable disassembly? In other words, why can't
the assembly generated by objdump be assembled again by an assembler
like gas? First of all, on CISC-like machines including x86, it is
non-trivial to faithfully disassemble binaries that are obfuscated or
self-modifying. Fortunately, binaries built with typical compilers and
written in a modern style, which we call commodity off-the-shelf (COTS)
binaries, are not too difficult to disassemble. Then, given correctly
disassembled assembly, what issues remain in reproducing a binary that
is semantically equivalent to the original?
1) Direct jmp/call
jmp 0x800000ff
calll 0x800000ff
Without precise control over the instruction layout (via a linker script), the above assembly behaves incorrectly after being reassembled if the code shifts by even a single byte, which often happens around the nop-like (or faulty) gaps between two functions.
In fact, the solution is rather simple: symbolize all addresses. For
example, starting from objdump-like output, 1) we first label every
single instruction, and 2) replace each operand (target address) with
the corresponding symbol (label). The above assembly is then rendered
as follows:
0x8000000: jmp A800000ff
0x8000000: calll A800000ff
A800000ff: ...
This principle, which we call symbolization, is the foundation that enables us to produce reassemblable disassembly from COTS binaries.
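The two steps above (label every instruction, then replace direct branch targets with labels) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (address, mnemonic, operand) tuple format, the BRANCHES set, and the second and third addresses in the sample input are made up for the example and are not the tool's actual data model.

```python
import re

# Hypothetical set of mnemonics treated as direct branches in this sketch.
BRANCHES = {"jmp", "call", "calll", "je", "jne"}

def label(addr):
    """Name a label after the original address, e.g. 0x800000ff -> A800000ff."""
    return "A%x" % addr

def symbolize(insns):
    """Label every instruction and replace direct branch targets with labels."""
    out = []
    for addr, mnemonic, operand in insns:
        # A direct target is written as a bare hexadecimal address.
        if mnemonic in BRANCHES and re.fullmatch(r"0x[0-9a-f]+", operand):
            operand = label(int(operand, 16))
        # Every instruction gets a label, so any of them can be a target.
        out.append(("%s: %s %s" % (label(addr), mnemonic, operand)).rstrip())
    return out

# Illustrative input; the addresses after the first one are assumptions.
insns = [
    (0x8000000, "jmp", "0x800000ff"),
    (0x8000005, "calll", "0x800000ff"),
    (0x800000ff, "ret", ""),
]

for line in symbolize(insns):
    print(line)
```

Because the targets are now names rather than numbers, the assembler is free to place A800000ff wherever the layout ends up, and both branches still reach it.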
2) Indirect jmp/call
Symbolization naturally makes indirect jmp/call instructions reassemblable as well. Consider the example below.
movl $0x80000ff, %eax
calll *%eax
As long as we correctly symbolize the operand of movl (e.g., a function
pointer), which is the source of the code pointer, the assembly can be
correctly reassembled.
0x8000000: movl $A80000ff, %eax
0x8000011: calll *%eax
A80000ff: ...
Then, how can we identify all instructions that might introduce code pointers?
3) Operands
To faithfully symbolize all code pointers, we have to interpret every
single operand of every instruction and symbolize it if it is a code
pointer. To do so, we rely on the operand metadata (X86_OP_IMM or
X86_OP_MEM) generated by capstone. We symbolize a constant (immediate)
address or displacement if it points into the code segment, whose
bounds we can extract from the ELF header.
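As a rough sketch of this check, the snippet below decides whether an immediate or displacement should be turned into a label. The Operand tuple with its "imm" and "mem" kinds is a hypothetical stand-in that mirrors X86_OP_IMM and X86_OP_MEM; it is not capstone's API, and the code segment bounds are assumed values rather than ones read from a real ELF header.

```python
from collections import namedtuple

# Hypothetical operand record: kind is "imm" or "mem", value is the
# immediate or the displacement. Not capstone's actual operand type.
Operand = namedtuple("Operand", "kind value")

def in_code(addr, code_start, code_end):
    """True if addr falls inside the code segment [code_start, code_end)."""
    return code_start <= addr < code_end

def symbolize_operand(op, code_start, code_end):
    """Return a label for a code pointer, or the operand's numeric text."""
    if op.kind in ("imm", "mem") and in_code(op.value, code_start, code_end):
        text = "A%x" % op.value
    else:
        text = "0x%x" % op.value
    # AT&T syntax prefixes immediates with '$'; displacements are bare.
    return "$" + text if op.kind == "imm" else text

# Assumed code segment bounds, as if extracted from the ELF header.
CODE_START, CODE_END = 0x8000000, 0x8001000

print(symbolize_operand(Operand("imm", 0x80000ff), CODE_START, CODE_END))
print(symbolize_operand(Operand("imm", 0x10), CODE_START, CODE_END))
```

A range check alone treats any constant that happens to fall inside the code segment as a pointer, so it is a necessary condition rather than proof that the value really is a code pointer.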
4) Data section
To relocate the data sections (e.g., .data or .bss), we also have to symbolize code pointers embedded in them (e.g., jump tables, global variables initialized with a function pointer).
We first scanned each 4-byte chunk of the data section (for a 32-bit binary) and checked whether it points into the code section. But we immediately realized that this is a large source of false positives: data values that are not code pointers get incorrectly symbolized. In some binaries, in particular those with crypto-related static tables, such a false positive is critical because it changes the semantics of the binary.
To reduce false positives, one could dynamically instrument the binary to figure out whether a value is used as a code pointer or just as data. However, we decided to check validity statically, with two conditions: the chunk should point to the start (prefix) of a valid, well-aligned instruction in the code section, and a group of code pointers should have an xref on its leading code pointer (an xref is the set of addresses of instructions that refer to the current address). To do so, we label the address of every single byte of the data region, and symbolize a run of four bytes if it points into the code segment and some instruction refers to it (its xref is not empty). The example below should make this clear.
.section .data
A80001000: .byte 0x00
A80001001: .byte 0x11
A80001002: A80000000 /* {xref:['A800000ee']} */
A80001006: .byte 0x22
A80001007: .byte 0x22
A80001008: A80000011 /* {xref:['A800000ff']} */
A8000100c: A80000022
A80001010: A80000033
For example, A80001008 might be a jump table and A80001002 might be a
global variable that was initialized with a function pointer
(A80000000). The global variable was accessed by an instruction at
0x800000ee (in the original binary) and the jump table was accessed by
an instruction at 0x800000ff.
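The two static conditions can be sketched as a scan over the raw data bytes. This is a minimal sketch, not the actual implementation: the function name, its parameters, and the rule that a pointer without its own xref is accepted only when it directly follows an accepted pointer are our reading of "a group of code pointers should have an xref on its leading code pointer". The sample bytes reproduce the listing above.

```python
import struct

def find_code_pointers(data, base, insn_starts, xrefs):
    """Return {data address: target} for 4-byte chunks accepted as code pointers.

    data        -- raw bytes of the data section (32-bit little-endian binary)
    base        -- address of data[0]
    insn_starts -- set of addresses where a decoded instruction begins
    xrefs       -- {data address: set of instruction addresses referring to it}
    """
    found = {}
    off = 0
    while off + 4 <= len(data):
        addr = base + off
        (value,) = struct.unpack_from("<I", data, off)
        # Condition 1: the value must land on an instruction boundary.
        is_target = value in insn_starts
        # Condition 2: the pointer leads a group (it has an xref), or it
        # directly follows an accepted pointer (the rest of a jump table).
        leads = bool(xrefs.get(addr))
        follows = (addr - 4) in found
        if is_target and (leads or follows):
            found[addr] = value
            off += 4
        else:
            off += 1          # keep scanning byte by byte
    return found

# Data bytes matching the example listing.
data = (bytes([0x00, 0x11]) + struct.pack("<I", 0x80000000) +
        bytes([0x22, 0x22]) +
        struct.pack("<III", 0x80000011, 0x80000022, 0x80000033))
insn_starts = {0x80000000, 0x80000011, 0x80000022, 0x80000033}
xrefs = {0x80001002: {0x800000ee}, 0x80001008: {0x800000ff}}

pointers = find_code_pointers(data, 0x80001000, insn_starts, xrefs)
for addr in sorted(pointers):
    print("A%x: A%x" % (addr, pointers[addr]))
```

A value that looks like a code address but has no xref and no accepted neighbor, such as an entry in a crypto table, is left as plain bytes.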
5) Oddities
There are a few oddities that we could not easily foresee at first glance. For example, some SSE instructions (those using xmm registers on x86) require their memory operand, a data pointer, to be 16-byte aligned. So, whenever we meet an xmm instruction, we enforce that alignment in the data section.
.section .data
.align 16
A80001000: .byte 0x00 /* {xref:['A800000ff']} */
A80001001: .byte 0x11
A80001002: .byte 0x22
A800000ff: movsd A80001000, %xmm1
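One way to apply this rule when emitting the data section is sketched below. It assumes we already know which data labels are referenced by xmm instructions (for example, from the xref information); the function name and its parameters are hypothetical and only illustrate the idea.

```python
def emit_data(entries, xmm_targets):
    """Render data lines, forcing 16-byte alignment where xmm code needs it.

    entries     -- list of (label, directive) pairs in their original order
    xmm_targets -- labels used as memory operands by xmm instructions
    """
    lines = [".section .data"]
    for label, directive in entries:
        if label in xmm_targets:
            # The operand must stay 16-byte aligned after relocation.
            lines.append(".align 16")
        lines.append("%s: %s" % (label, directive))
    return lines

entries = [
    ("A80001000", ".byte 0x00"),
    ("A80001001", ".byte 0x11"),
    ("A80001002", ".byte 0x22"),
]

for line in emit_data(entries, {"A80001000"}):
    print(line)
```

Since every reference to the data goes through a label, any padding the assembler inserts for the alignment does not break other accesses.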
Binary Patching
In fact, the goal of this pipeline for producing relocatable,
reassemblable disassembly is to make binary patching easy (e.g., no
trampolines, no crafting to make room for patches, etc.) and, more
importantly, to keep the result human-readable (and handy for
analysis). For example, we could easily detect well-known code gadgets
(say, malloc()) and then symbolize them (adding a label func_malloc: to
the beginning of the gadget). Then a patch (e.g., CFI, safe stack,
etc.) can easily reuse such code or a library. Since the disassembly is
relocatable, inserting new code (a patch) or transforming existing code
does not require special care, and it opens up the possibility of
applying well-understood optimization techniques like dead-code/data
elimination or peephole optimization.