[PROLOG] 0x000, In Assembler we trust (English version)
September 26, 2022 edito malware prolog
In this journey to the heart of binary files and executable code lies the emperor of all programming languages, the primary source of dialogue with our CPUs: assembly language.
I will not produce assembly language courses here on this blog. There are many excellent ones available on the internet. I will just lay down a few reminders that seem essential to me for the continuation of our journey into Reverse Engineering of binaries.
Nature of Assembly Language
Assembly language is a language. Since this language is specific and tied to the type of CPU it addresses, there are several types. The name of the assembly language takes the name of the CPU for which it is intended. For our learning purposes, we will limit our reverses to 2 CPU families: INTEL X64/32 and ARM (64 bit).
INTEL (aka x86): We will mainly read (and write a little) Intel 32-bit (x86_32) and Intel 64-bit code. The Intel 64-bit is found behind the following acronyms: ‘x64’, ‘x86_64’, ‘Intel64’, ‘AMD64’. The choice of this processor architecture will allow us to address PCs (under Windows and Linux with their different binary file formats: ELF for Linux, PE (32-bits) and PE+ (64-bit) for Windows).
ARM 64: The study of code running on ARM64 processors (often referred to as AArch64) will allow us to reverse and understand applications and malware compiled natively for Mac M1 in the MachO64 executable binary format.
Assembly language is the “last” grammar/abstraction/representation that a human can reasonably use to write the instructions they want the CPU to execute. This code is then translated into hexadecimal and binary. And yes, you could directly program in binary if you had infinite time ;-)
Sizes and Units
It seems interesting to me to recall here a few units on the information we will handle:
- BYTE - one Byte (8 bits) | Allows storing values between 0-255 or -128 to 127
- WORD - Word (16 bits) | Allows storing values between 0 - 65535 or -32768 to 32767
- DWORD - Double word (32 bits) | Allows storing values from 0 - 2^32
- QWORD - Quad word (64 bits) | Allows storing values from 0 - 2^64
x86 and x64 CPU Registers
Each CPU has a set of general-purpose registers, 8 for x86 and 16 for x86-64. A register is a particular memory area, integrated into the CPU, whose access is ultra-fast and allows storing untyped data temporarily. It is through these registers (but not only) that the CPU receives and “transfers” information, temporarily stores it, and transmits it according to the instructions from its control unit (ECU).
In 32-bit architecture, registers have a storage capacity of 4 bytes. On 64-bit CPUs, registers have a storage capacity of 8 bytes.
Register | Name | Sub-register |
---|---|---|
RAX | Accumulator | EAX(32), AX(16), AH(8), AL(8) |
RBX | Base | EBX(32), BX(16), BH(8), BL(8) |
RCX | Counter | ECX(32), CX(16), CH(8), CL(8) |
RDX | Data | EDX(32), DX(16), DH(8), DL(8) |
RSI | Source | ESI(32), SI(16), SL(8) |
RDI | Destination | EDI(32), DI(16), DL(8) |
RBP | Base pointer | EBP(32), BP(16), BPL(8) |
RSP | Stack pointer | ESP(32), SP(16), SPL(8) |
New registers | New registers | R8D-R15D(32), R8W-R15W(16), R8B-R15B(8) |
Note The suffixes used to address the low-order bits of the New registers are:
- B byte, 8 bits
- W word, 16 bits
- D double word, 32 bits
We will return to the registers very soon to present their usage convention, especially on Linux and Windows 64-bit OSes.
One Assembly Language, Two Syntaxes:
For historical reasons, there are 2 possible syntaxes for the same assembly code: AT&T syntax and INTEL syntax. Understand this well: it is the same assembly language (thus the same instructions). Only the writing conventions change.
Let’s take code that in C language would be:
int i = 62;
j = i;
INTEL Syntax
mov rax,0x3e
mov [ebp-8],rax
AT&T Syntax
movq $0x3e,%rax
movq %rax,-8(%ebp)
The main differences between the 2 syntaxes are summarized in the table below:
Personally, I prefer the Intel syntax. However, know that we will use the GDB debugger a lot, and it uses AT&T syntax by default. If, like me, you want it to generate Intel, it is possible.
set disassembly-flavor intel
At this stage, we have 2 essential notions with which you need to familiarize yourself: The stack and calling conventions
. This is precisely the subject of the next post.
Citations: [1] https://translate.google.com [2] https://translate.google.fr/?hl=fr [3] https://www.deepl.com/en/translator [4] https://www.collinsdictionary.com/translator [5] https://www.reverso.net/traduction-texte [6] https://frenchtogether.com/french-translation/ [7] https://www.easyhindityping.com/french-to-english-translation [8] https://www.linguee.com/english-french/translation/french%2Bto%2Benglish.html [9] https://www.reddit.com/r/French/comments/11a1g6k/what_is_the_most_accurate_french_translator_online/ [10] https://www.typingbaba.com/translator/french-to-english-translation.php [11] https://play.google.com/store/apps/details?hl=ln&id=com.anhlt.frentranslator [12] https://www.systransoft.com/lp/french-translation/ [13] https://www.reverso.net/text-translation [14] https://www.linguee.fr/anglais-francais/traduction/french%2Bto%2Benglish%2Btranslator.html [15] https://www.tradonline.fr/en/blog/six-best-online-translation-websites/