How to assemble, disassemble, and emulate machine code with Python

This article walks you through how to assemble, disassemble, and emulate machine code (ARM, x86-64, etc.) in Python using the Keystone, Capstone, and Unicorn engines.

A processor executes machine code: binary data that encodes low-level instructions operating directly on registers and memory. Assembly language is the human-readable form of those instructions. The processor's architecture manual specifies how each instruction is encoded into bytes, and assembling is the process of performing that encoding.

Disassembly is the inverse of assembly: the bytes are parsed and translated back into assembly instructions, which are much easier for a person to read.

Different processor architectures have different instruction sets, and a processor can only execute instructions from its own set. To run code built for another architecture, we need an emulator: a program that interprets or translates code for the unsupported architecture into code that can run on the host system.

In many scenarios it is useful to assemble, disassemble, or emulate code for different architectures. Common motivations include learning (many universities teach MIPS assembly), running and testing programs written for other devices such as routers (fuzz testing, etc.), and reverse engineering.

In this tutorial, we’ll assemble, disassemble, and emulate code using the Keystone, Capstone, and Unicorn engines. All three provide convenient Python bindings, support many architectures (x86, ARM, MIPS, SPARC, etc.), and run natively on the major operating systems, including Linux, Windows, and macOS.

First, let’s install these three frameworks:

pip3 install keystone-engine capstone unicorn

For this demonstration, we’ll take a factorial function implemented in ARM assembly, assemble it, and emulate it.

We’ll also disassemble and emulate an x86-64 function, to show how easily multiple architectures can be handled.

Assemble ARM

Let’s start by importing what we need to assemble and emulate ARM code:

# We need to emulate ARM
from unicorn import Uc, UC_ARCH_ARM, UC_MODE_ARM, UcError
# for accessing the R0 and R1 registers
from unicorn.arm_const import UC_ARM_REG_R0, UC_ARM_REG_R1
# We need to assemble ARM code
from keystone import Ks, KS_ARCH_ARM, KS_MODE_ARM, KsError

Let’s write our ARM assembly code, which computes factorial(r0), where r0 is an input register:

ARM_CODE = """
// n is r0, we will pass it from python, ans is r1
mov r1, 1       	// ans = 1
loop:
cmp r0, 0       	// while n > 0:
mulgt r1, r1, r0	//   ans *= n
subgt r0, r0, 1 	//   n = n - 1
bgt loop        	// 
                	// answer is in r1
"""
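To make the control flow easier to follow, here is a rough Python model of the same loop. It is only an illustration and not part of the toolchain: the function name factorial_model is made up for this sketch, and the 32-bit mask assumes the product is truncated to the register width, as it would be in a 32-bit ARM register.

```python
def factorial_model(n: int) -> int:
    """Mirror the ARM loop: r0 holds n, r1 accumulates the answer."""
    r0, r1 = n, 1                      # mov r1, 1
    while r0 > 0:                      # cmp r0, 0 ... bgt loop
        r1 = (r1 * r0) & 0xFFFFFFFF    # mulgt r1, r1, r0 (kept to 32 bits)
        r0 -= 1                        # subgt r0, r0, 1
    return r1                          # answer is in r1


print(factorial_model(5))  # 120
```

Because every instruction in the loop body is conditional on "greater than", an input of 0 falls straight through and leaves r1 at 1.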

Let’s assemble the assembly code above (convert it to bytecode):

print("Assembling the ARM code")
try:
    # initialize the keystone object with the ARM architecture
    ks = Ks(KS_ARCH_ARM, KS_MODE_ARM)
    # Assemble the ARM code
    ARM_BYTECODE, _ = ks.asm(ARM_CODE)
    # convert the array of integers into bytes
    ARM_BYTECODE = bytes(ARM_BYTECODE)
    print(f"Code successfully assembled (length = {len(ARM_BYTECODE)})")
    print("ARM bytecode:", ARM_BYTECODE)
except KsError as e:
    print("Keystone Error: %s" % e)
    exit(1)

The Ks class gives us an assembler in ARM mode. Its asm() method assembles the code and returns the encoded bytes (as a list of integers) together with the number of statements it assembled.

Bytecode can now be written to an area of memory and executed (or emulated, in our case) by an ARM processor:

# memory address where emulation starts
ADDRESS = 0x1000000

print("Emulating the ARM code")
try:
    # Initialize emulator in ARM mode
    mu = Uc(UC_ARCH_ARM, UC_MODE_ARM)
    # map 2MB memory for this emulation
    mu.mem_map(ADDRESS, 2 * 1024 * 1024)
    # write machine code to be emulated to memory
    mu.mem_write(ADDRESS, ARM_BYTECODE)
    # Set the r0 register in the code, let's calculate factorial(5)
    mu.reg_write(UC_ARM_REG_R0, 5)
    # emulate code in infinite time and unlimited instructions
    mu.emu_start(ADDRESS, ADDRESS + len(ARM_BYTECODE))
    # emulation has finished at this point
    print("Emulation done. Below is the result")
    # retrieve the result from the R1 register
    r1 = mu.reg_read(UC_ARM_REG_R1)
    print(">>  R1 = %u" % r1)
except UcError as e:
    print("Unicorn Error: %s" % e)

In the above code, we initialize the emulator in ARM mode, map 2MB of memory (2*1024*1024 bytes) at the specified address, write the assembled bytes to the mapped region, set the r0 register to 5, and then start emulating our code.

The emu_start() method also accepts an optional timeout and an optional maximum number of instructions, which is useful for sandboxing code or restricting the emulation to a certain part of the code.

Once the emulation is complete, we read the r1 register, which holds the result. Running the code produces the following output:

Assembling the ARM code
Code successfully assembled (length = 20)
ARM bytecode: b'\x01\x10\xa0\xe3\x00\x00P\xe3\x91\x00\x01\xc0\x01\x00@\xc2\xfb\xff\xff\xca'
Emulating the ARM code
Emulation done. Below is the result
>>  R1 = 120

We get the expected result: the factorial of 5 is 120.
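The bytecode printed above can also be inspected by hand, which helps connect it to the five instructions we wrote. The sketch below uses only the standard library. It assumes the usual ARM-mode layout: each instruction is a 32-bit little-endian word, the top four bits hold the condition code, and a branch stores a signed 24-bit word offset relative to the address of the branch plus 8.

```python
import struct

ARM_BYTECODE = (b"\x01\x10\xa0\xe3\x00\x00P\xe3\x91\x00\x01\xc0"
                b"\x01\x00@\xc2\xfb\xff\xff\xca")

# Five 32-bit little-endian instruction words
words = struct.unpack("<5I", ARM_BYTECODE)
for offset, word in zip(range(0, len(ARM_BYTECODE), 4), words):
    cond = word >> 28  # 0xE = always, 0xC = "greater than" (the gt suffix)
    print(f"offset {offset:2d}: {word:08x}  cond={cond:#x}")

# The last word is "bgt loop": decode its signed 24-bit offset
branch_offset = 16                 # byte offset of the branch instruction
imm24 = words[-1] & 0xFFFFFF
if imm24 & 0x800000:               # sign-extend the 24-bit field
    imm24 -= 1 << 24
target = branch_offset + 8 + imm24 * 4
print("branch target offset:", target)  # 4, the "loop" label
```

The decoded target is offset 4, the cmp instruction right after the initial mov, which is where the loop label sits in the source.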

Disassemble x86-64 code

Now suppose we have x86-64 machine code and want to disassemble it. The following code does so:

# We need to emulate x86-64 code
from unicorn import Uc, UC_ARCH_X86, UC_MODE_64, UcError
# for accessing the RAX and RDI registers
from unicorn.x86_const import UC_X86_REG_RDI, UC_X86_REG_RAX
# We need to disassemble x86_64 code
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CsError

X86_MACHINE_CODE = b"\x48\x31\xc0\x48\xff\xc0\x48\x85\xff\x0f\x84\x0d\x00\x00\x00\x48\x99\x48\xf7\xe7\x48\xff\xcf\xe9\xea\xff\xff\xff"
# memory address where emulation starts
ADDRESS = 0x1000000
try:
    # Initialize the disassembler in x86-64 mode
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    # iterate over each instruction and print it
    for instruction in md.disasm(X86_MACHINE_CODE, 0x1000):
        print("0x%x:\t%s\t%s" % (instruction.address, instruction.mnemonic, instruction.op_str))
except CsError as e:
    print("Capstone Error: %s" % e)

We initialize a disassembler in x86-64 mode, disassemble the provided machine code starting at base address 0x1000, and iterate over the resulting instructions, printing each instruction's address, mnemonic, and operands.

This produces the following output:

0x1000: xor     rax, rax
0x1003: inc     rax
0x1006: test    rdi, rdi
0x1009: je      0x101c
0x100f: cqo
0x1011: mul     rdi
0x1014: dec     rdi
0x1017: jmp     0x1006

Now let’s try to emulate it with the Unicorn engine:

try:
    # Initialize emulator in x86_64 mode
    mu = Uc(UC_ARCH_X86, UC_MODE_64)
    # map 2MB memory for this emulation
    mu.mem_map(ADDRESS, 2 * 1024 * 1024)
    # write machine code to be emulated to memory
    mu.mem_write(ADDRESS, X86_MACHINE_CODE)
    # Set the RDI register (the input) to 7
    mu.reg_write(UC_X86_REG_RDI, 7)
    # emulate code in infinite time & unlimited instructions
    mu.emu_start(ADDRESS, ADDRESS + len(X86_MACHINE_CODE))
    # now print out the RAX register, which holds the result
    print("Emulation done. Below is the result")
    rax = mu.reg_read(UC_X86_REG_RAX)
    print(">>> RAX = %u" % rax)
except UcError as e:
    print("Unicorn Error: %s" % e)

Output:

Emulation done. Below is the result
>>> RAX = 5040

We passed in 7 and got back 5040. Looking more closely at the x86-64 assembly, we can see that it computes the factorial of the value in the RDI register and leaves the result in RAX (5040 is the factorial of 7).

Summary

These three frameworks handle machine code in a uniform way: the code that emulates x86-64 is very similar to the ARM version, and assembling and disassembling work the same way for any supported architecture.

One thing to keep in mind is that Unicorn emulates raw machine code only. It does not emulate operating system services such as Windows API calls, nor does it parse and load file formats like PE and ELF.

In some cases it is useful to emulate an entire operating system environment, a program in the form of a kernel driver, or binaries built for a different operating system. The Qiling framework is built on top of Unicorn to address these limitations and also provides Python bindings. It additionally supports binary instrumentation (e.g., faking system call return values, file descriptors, etc.).

After trying these three frameworks, we conclude that manipulating assembly code from Python is straightforward. The simplicity of Python, combined with the convenient and consistent interfaces of Keystone, Capstone, and Unicorn, makes it approachable even for beginners to assemble, disassemble, and emulate code for different architectures.