Hi, in this post I’ll go through what happens when you compile a program. It’s going to involve a little bit of C, Assembly, and some very basic linux tooling. The main idea is that you get out of here with a deeper understanding of what happens behind the scenes.

This type of knowledge is not very useful for most people, but if you like to know how things work, I think you’ll enjoy at least some of the things I’ll expose here.

I’ll do this with Linux in mind, but the same ideas can be transposed to how it works in Windows.

I have taken a lot of inspiration from the Practical Binary Analysis book by No Starch Press, written by Dennis Andriesse, mainly on how to structure these ideas in a concise way. I highly recommend the book if you want to go deep into binary analysis, but keep in mind that it’s not a book you’ll go through in a weekend. It takes years to master everything in there, but that’s beside the point of this post. Let’s go!

Compilation?

Ok, the first step here is to actually understand what compiling a program means.

Computers don’t understand code the way we write it. At a deep level, it’s all about electrical signals and flip-flops. However, we can go one layer above this and understand that this language can be represented by a series of 0s and 1s, or in other words, binary.

We call the series of 0s and 1s that the computer understands machine code. As the name implies, this is not something humans can easily read, except for a few wizards who started doing it when dinosaurs were still around.

So, how is it possible that we can write code in a human-readable form and have the computer run it? Compilation is the key here. This is essentially the process of converting our hopefully bug-free code into something the machine can understand, i.e., machine code, or as it’s commonly referred to in the streets, “the binary”.

Now, compilation is normally not as simple as turning some C code (or code in another language) directly into 0s and 1s. We could do this, but it would be incredibly difficult to maintain the level of progress we have today because we would have to create very complicated software for every single language out there, which is not feasible. So, compiling code is a multi-step process, which I’ll go through in this post.

Let’s take advantage of a good compiler that comes pre-installed on pretty much all Linux distributions, gcc, which stands for the “GNU Compiler Collection”. By the way, GNU is an actual operating system project. It’s Unix-like (the name stands for “GNU’s Not Unix”), and what we usually call “Linux” is typically the GNU tools running on top of the Linux kernel!

Compiling C code

Take this code as an example:

#include <stdio.h>

#define FORMAT "%s"
#define PRINTSTRING "Hello World, how's it going?\n"

int main() {
    printf(FORMAT, PRINTSTRING);
}

The code above will just print the string “Hello World, how’s it going?” to the console.

The line

#include <stdio.h>

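just means “I want you to include the declarations from this header, since I’ll use them in my program”. In this case, the declaration of the printf() function comes from here. The actual code of printf() lives in the C library and only gets hooked up at the linking stage.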
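To see that the header mostly supplies declarations, here is a minimal sketch that skips stdio.h entirely and declares printf() by hand. The say_hello() name is made up for this example; the prototype is the standard one for printf().

```c
/* No #include <stdio.h> here: we hand-write the one declaration we need.
 * This is the standard prototype of printf(). The function body itself
 * lives in the C library and is only hooked up later, when linking. */
int printf(const char *format, ...);

/* say_hello() is a hypothetical helper name used only for this sketch.
 * It returns whatever printf() returns: the number of characters printed. */
int say_hello(void) {
    return printf("%s", "Hello World, how's it going?\n");
}
```

Calling say_hello() prints the same greeting as the original program, even though the header was never included. In real code you should still include the header, since it keeps the declaration in sync with your C library.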

These lines

#define FORMAT "%s"
#define PRINTSTRING "Hello World, how's it going?\n"

are called macros. The important thing is to understand that they are not actual code that will be run, but rather things that allow programmers to have a little bit more flexibility when coding. In this case, we’re defining a format and the string we want to print as macros. I’ll explain why I put those there soon.
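One way to convince yourself that macros are plain text replacement is to turn them into strings. The sketch below does that with two helper macros, STR() and XSTR(), which are names I made up for this example (the `#` operator they rely on is standard C).

```c
#include <string.h>

#define FORMAT "%s"

/* STR() and XSTR() are hypothetical helper macros for this sketch.
 * STR() turns its argument into a string literal as-is; XSTR() lets the
 * argument be expanded first and then hands the result to STR(). */
#define STR(x) #x
#define XSTR(x) STR(x)

/* Returns the text FORMAT: the macro name itself, not expanded. */
const char *macro_name(void) { return STR(FORMAT); }

/* Returns the text "%s" with the quotes included: exactly what the
 * preprocessor pastes in wherever FORMAT appears. */
const char *macro_body(void) { return XSTR(FORMAT); }

/* Length of that replacement text: quote, percent, s, quote. */
size_t macro_body_length(void) { return strlen(XSTR(FORMAT)); }
```

macro_body() gives back the four characters `"%s"`, which is literally what ends up in the source before the compiler proper ever looks at it.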

The rest of the code

int main() {
    printf(FORMAT, PRINTSTRING);
}

is just a normal print to the console. Every language out there has this, so even if you don’t know C, you can understand the idea (I hope).

If we want to compile this code, all we have to do is (assuming the file name is main.c):

$ gcc main.c -o main

$ ./main

Hello World, how's it going?

That’s cool, but if you’re reading this post I bet you’ve already done similar things and this doesn’t impress you at all. So let’s get deeper in the next sections, to see what’s happening under the hood.

Compilation Steps

When you run that gcc main.c -o main command, what happens inside is actually quite complicated and involves multiple steps.

The high-level view is as follows:

1. Your .c files, alongside any headers, are fed to something called a Preprocessor.
2. The output of the Preprocessor is passed through to the Compiler, which will produce assembly code.
3. The output of the Compiler, that is, the assembly code, is passed to something called an Assembler. Here, the output will be one or more object files. These are binaries with executable code (not quite, but close enough).
4. Because we may have multiple object files and/or some external dependencies, these are passed to a Linker.
5. After linking, you’ve got your final executable!
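To make the “multiple object files” point a bit more concrete, here is a small sketch of code that would normally be split across two source files. Both halves are shown in one block for brevity, and the names add(), compute(), and math.c are made up for this example.

```c
/* ---- what would live in main.c ---- */

/* Declaration only: a promise that add() exists somewhere.
 * add() and compute() are hypothetical names for this sketch. */
extern int add(int a, int b);

int compute(void) {
    /* If main.c is assembled on its own, the address of add() is still
     * unknown at this point; the object file just records "call add". */
    return add(2, 3);
}

/* ---- what would live in a second file, say math.c ---- */

/* The definition. The Linker is what matches the call above to this code. */
int add(int a, int b) {
    return a + b;
}
```

With the two halves in separate files, each one goes through the Preprocessor, Compiler, and Assembler independently, and only the Linker gets to see both at once.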

As an image is worth more than 0b1111101000 words, here’s a diagram of this process.

[Diagram: the compile process]

Preprocessor

In this step, you feed the Preprocessor .c code and the output will still be .c code. So, what changes? Well, remember these lines?

#include <stdio.h>

#define FORMAT "%s"
#define PRINTSTRING "Hello World, how's it going?\n"

The Preprocessor will expand these into the output. Basically, it will insert all the code in the stdio.h header file into the output, as well as replace every single usage of FORMAT and PRINTSTRING with "%s" and "Hello World, how's it going?\n", respectively.

We can see this in action using GCC, which is why I chose it! I urge you to RTFM with man gcc, but all you have to do here is pass the -E flag to the compiler, alongside an optional -P so the output isn’t super verbose.

$ gcc -E -P main.c

Running the line above will print a lot of code to the console. It should look something like this:

...
...
extern int getchar_unlocked (void);
extern int fgetc_unlocked (FILE *__stream);
extern int fputc (int __c, FILE *__stream);
extern int putc (int __c, FILE *__stream);
extern int putchar (int __c);
...
...

int main() {
    printf("%s", "Hello World, how's it going?\n");
}

Basically, at the top of the file, you will find a lot of function signatures and type definitions. Those are all from the stdio.h header file. You can easily check this by looking into the actual stdio.h file on your machine. If you’re using linux (which you should), you can find it at /usr/include/stdio.h.

Next, remember those macros I mentioned? Look at the printf() statement after the Preprocessor did its thing and compare it to before. It replaced our FORMAT and PRINTSTRING with the right stuff, as expected.

So, now you should have a grasp of what preprocessing is, although it does much more, such as taking care of your “preprocessor conditionals”, etc… you can read more about this here.
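As a small illustration of those “preprocessor conditionals”, here is a sketch where the greeting depends on whether a macro is defined. VERBOSE is a made-up switch for this example, and so are the greeting() and greeting_length() helpers.

```c
#include <string.h>

/* VERBOSE is a hypothetical switch for this sketch. It could be set in
 * the source with #define VERBOSE, or from the command line with
 * gcc -DVERBOSE main.c */
#ifdef VERBOSE
#define PRINTSTRING "Hello World, how's it going? (verbose build)\n"
#else
#define PRINTSTRING "Hello World, how's it going?\n"
#endif

/* Only one of the two definitions above survives preprocessing,
 * so the Compiler never even sees the other string. */
const char *greeting(void) { return PRINTSTRING; }

size_t greeting_length(void) { return strlen(greeting()); }
```

Running gcc -E -P on a file like this, with and without -DVERBOSE, is a nice way to watch the Preprocessor pick one branch and throw the other away.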

Compilation

In this step the Compiler will take the output of the Preprocessor and turn it into assembly code. Please note that although we call GCC “The Compiler”, there’s an actual step in there that’s technically called compilation. I know it’s confusing but it is what it is.

We can tell the compiler to give us our assembly code. To do this, just run GCC with the -S flag, as such:

$ gcc -S -masm=intel main.c

The -masm=intel means we want the output to use the Intel syntax, as opposed to the AT&T syntax, which is what GCC uses by default. The Intel syntax is the simplest to read in my opinion, so I’m using it to illustrate this process.

After running the command above, a new file should emerge, called main.s. The .s means it’s an assembly file, by convention (remember, extensions don’t mean much in linux, it’s all files).

The content of main.s is:

        .file   "main.c"
        .intel_syntax noprefix
        .text
        .section        .rodata
.LC0:
        .string "Hello World, how's it going?"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        endbr64
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        mov     rbp, rsp
        .cfi_def_cfa_register 6
        lea     rdi, .LC0[rip]
        call    puts@PLT
        mov     eax, 0
        pop     rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
        .section        .note.GNU-stack,"",@progbits
        .section        .note.gnu.property,"a"
        .align 8
        .long    1f - 0f
        .long    4f - 1f
        .long    5
0:
        .string  "GNU"
1:
        .align 8
        .long    0xc0000002
        .long    3f - 2f
2:
        .long    0x3
3:
        .align 8
4:

Now we’re talking! There’s quite a lot to unpack there. For a very simple primer on assembly you can read a previous blog post of mine on how to write a file like a maniac. There’s a section there dedicated to assembly that may be useful. You can find it here.

The first two lines

.file   "main.c"
.intel_syntax noprefix

are kind of useless to us. The .file one records the original source file name, which is used by debuggers, and the .intel_syntax noprefix one makes it so the registers don’t require the ’%’ prefix.

I’ll briefly go over the rest now, picking the most relevant blocks and explaining them.

        .section        .rodata
.LC0:
        .string "Hello World, how's it going?"

This one is where our constants will reside. The .section .rodata stands for “Section Read Only Data”, which is where you have data that cannot be changed at runtime (trying to write to it typically crashes the program with a segmentation fault).

The .LC0 is a compiler-given label, and it’s where we’re storing said data, in this case, a .string with our “Hello …” value.
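From the C side, the read-only versus writable split looks roughly like the sketch below. Where each array ends up is the typical behaviour of GCC on Linux rather than something the C language guarantees, and shout() and original() are names I made up for this example.

```c
#include <ctype.h>
#include <string.h>

/* Read-only: with GCC on Linux this typically ends up in .rodata. */
static const char greeting[] = "hello world";

/* Writable: an array initialised from a literal typically lands in .data. */
static char scratch[] = "hello world";

/* shout() is a hypothetical helper: it upper-cases the writable copy in
 * place and returns it. Doing the same to greeting would not even
 * compile, because greeting is const. */
char *shout(void) {
    size_t n = strlen(scratch);
    for (size_t i = 0; i < n; i++) {
        scratch[i] = (char)toupper((unsigned char)scratch[i]);
    }
    return scratch;
}

/* The read-only original is untouched by shout(). */
const char *original(void) { return greeting; }
```

Compiling something like this with gcc -S is a quick way to check which section each array was actually given on your machine.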

main:
.LFB0:
        .cfi_startproc
        endbr64
        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        mov     rbp, rsp
        .cfi_def_cfa_register 6
        lea     rdi, .LC0[rip]
        call    puts@PLT
        mov     eax, 0
        pop     rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

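This is the actual code we wrote in C. We see the main: label marking where our main() function starts, and the rest of the code below. Pay special attention to that call puts@PLT instruction. GCC noticed that our printf() call just prints a fixed string ending in a newline, so it swapped it for the simpler puts(), which adds the newline itself (that’s why the “\n” is gone from the .string above). Ignore the @PLT part for now, I’ll get to it soon.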

The rest of the code can be ignored at this stage, but by all means, try to understand it if you want, although that’s out of scope for this post.

Assembler

At this point, after we have our assembly code, we’re ready to pass it through the Assembler, which will convert that into machine code. The reason we have this step and don’t go straight from C to Machine Code is that assemblers are incredibly difficult, and although converting C to Assembly is already a daunting task, it makes it easier to have this separated. Why? You may ask… Well, let’s say you’re creating a new programming language, like Go, or Rust, or C++, all you have to do is to write the means to convert that code into Assembly, which is difficult but doable since assembly is “human readable”; then you’d use an assembler to do the rest of the heavy lifting for you.

Remember we still have one step after this, the Linker, so that means we can still make the compiler stop at an early stage and spit out what we need. Call GCC like this:

$ gcc main.c -c

This will generate a new file called main.o, where the .o stands for “object file”. If you try to read this file you won’t be able to, as it’s already in a non-readable form (unless you’re a machine). You can read its bytes if you’re curious, by doing xxd -l 256 main.o (on windows, I like HxD to read binaries), which will yield the first 256 bytes of the file as hexadecimal output, as such:

$ xxd -l 256 main.o

00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0100 3e00 0100 0000 0000 0000 0000 0000  ..>.............
00000020: 0000 0000 0000 0000 2003 0000 0000 0000  ........ .......
00000030: 0000 0000 4000 0000 0000 4000 0e00 0d00  ....@.....@.....
00000040: f30f 1efa 5548 89e5 488d 3d00 0000 00e8  ....UH..H.=.....
00000050: 0000 0000 b800 0000 005d c348 656c 6c6f  .........].Hello
00000060: 2057 6f72 6c64 2c20 686f 7727 7320 6974   World, how's it
00000070: 2067 6f69 6e67 3f00 0047 4343 3a20 2855   going?..GCC: (U
00000080: 6275 6e74 7520 392e 342e 302d 3175 6275  buntu 9.4.0-1ubu
00000090: 6e74 7531 7e32 302e 3034 2e31 2920 392e  ntu1~20.04.1) 9.
000000a0: 342e 3000 0000 0000 0400 0000 1000 0000  4.0.............
000000b0: 0500 0000 474e 5500 0200 00c0 0400 0000  ....GNU.........
000000c0: 0300 0000 0000 0000 1400 0000 0000 0000  ................
000000d0: 017a 5200 0178 1001 1b0c 0708 9001 0000  .zR..x..........
000000e0: 1c00 0000 1c00 0000 0000 0000 1b00 0000  ................
000000f0: 0045 0e10 8602 430d 0652 0c07 0800 0000  .E....C..R......

You can see that our “Hello World” string is there somehow, which makes sense, but the rest is a bunch of gibberish, impossible to read like this.

When you assemble a file like we just did, you’re creating an ELF file, which stands for “Executable and Linkable Format”. This is the standard linux format for executables, as opposed to the PE-Format of Windows. You can see that at the top of the xxd output, there’s a string specifying “ELF” as the format, but that’s not enough. If you want to make sure, you can run the file tool, which you should have on any linux distribution, and see what it tells you:

$ file main.o

main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

Ok, so this tells us that the Assembler created an ELF file, with a 64-bit architecture. The LSB means “Least Significant Byte”, which tells us the file is little-endian: multi-byte numbers are stored with their least significant byte first. I’ll explain the “not stripped” part in the next section.

Finally, you’ll see the word “relocatable”. When you run your code, those instructions will be stored in memory. This word just means that you’re dealing with something that can be relocated to any memory address and everything will remain within context, i.e., you’re not breaking any assumptions in the code. It also tells you that you’re dealing with an object file that isn’t executable per se. Why are these files, well, relocatable? It’s quite simple: object files are all assembled independently, which means that if the code in one object file calls a function (or uses something else) from another object file, it has no way of knowing where that address is in memory. Therefore, at this stage, these files are all marked as relocatable.
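The first few bytes of the xxd output are enough to recover the “ELF”, “64-bit”, and “LSB” parts of what file printed. Here is a minimal sketch of that check; the function names are made up for this example, and the byte meanings come from the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with the 4-byte ELF magic: 0x7f 'E' 'L' 'F'. */
int is_elf(const unsigned char *buf, size_t len) {
    static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};
    assert(buf != NULL);
    return len >= 4 && memcmp(buf, magic, 4) == 0;
}

/* Byte 4 is the class: 1 means 32-bit, 2 means 64-bit.
 * Returns 0 when the input is not a recognisable ELF header. */
int elf_bits(const unsigned char *buf, size_t len) {
    if (!is_elf(buf, len) || len < 5) return 0;
    if (buf[4] == 1) return 32;
    if (buf[4] == 2) return 64;
    return 0;
}

/* Byte 5 is the data encoding: 1 is LSB (little-endian),
 * 2 is MSB (big-endian). */
const char *elf_byte_order(const unsigned char *buf, size_t len) {
    if (!is_elf(buf, len) || len < 6) return "unknown";
    if (buf[5] == 1) return "LSB";
    if (buf[5] == 2) return "MSB";
    return "unknown";
}
```

Feeding it the first row of the dump (7f 45 4c 46 02 01 ...) gives 64 bits and LSB, matching what file reported for main.o.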

The idea here is that at the next stage, we can link all of these files together to build something executable. This includes having all the addresses ready, and so on…
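The “relocatable” label is also sitting right there in the hex dump: it is the two bytes at offset 0x10, stored least significant byte first. The sketch below decodes them. The helper names are made up for this example, and the numeric values are the ones defined by the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 16-bit little-endian number: least significant byte first. */
uint16_t read_le16(const unsigned char *p) {
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* In an ELF header, the 2-byte file type sits at offset 0x10.
 * Values from the ELF specification: 1 is relocatable, 2 is executable,
 * 3 is shared object. elf_type_name() is a hypothetical helper name,
 * and it assumes a little-endian (LSB) file like the ones in this post. */
const char *elf_type_name(const unsigned char *header, size_t len) {
    assert(header != NULL);
    if (len < 0x12) return "too short";
    switch (read_le16(header + 0x10)) {
    case 1:  return "relocatable";
    case 2:  return "executable";
    case 3:  return "shared object";
    default: return "other";
    }
}
```

In the main.o dump, offset 0x10 holds 01 00, which reads as 1, hence “relocatable”. The two bytes right after it, 3e 00, read as 62, which is the machine code for x86-64.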

Linker

Now’s the final part, the grand finale where you turn your code into something the world can execute. As this is the last part, we can just run gcc main.c -o main to complete everything and give us the final result.

So, let’s compile it and see what kind of file we actually got:

$ gcc main.c -o main
$ file main

main: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=0dede4223b251bc42cd38d0e83891f06443a0d43, for GNU/Linux 3.2.0, not stripped

Ok, now we have new things to know about. “dynamically linked” is related to the fact we’re linking with shared libraries that are usable by everybody. Remember that assembly line we saw above with call puts@PLT? Well, that’s related to this stage and what I’ve mentioned about the object files being relocatable. PLT stands for Procedure Linkage Table. At the assembly stage, our object file had no way of knowing where the puts() function (declared in stdio.h) would live, so the call was aimed at a PLT entry instead. When linking, the Linker creates a small stub for puts() in that table and makes our call jump to it. The real address of puts() inside the shared C library is only filled in when the program runs.

TLDR on the PLT thing: It’s there as a stub. The Linker points our call at the stub, and the dynamic linker fills in the right address at run time.

The interpreter /lib64/ld-linux-x86-64.so.2 just tells us what dynamic linker is used to resolve the shared library code addresses. Remember, this is code that’s accessible by everything and everyone on the operating system.
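The “resolve it on first use, then remember the address” idea can be sketched in plain C with a function pointer. This is only an analogy and not how the real PLT or the dynamic linker is implemented; all the names below are made up for this example, and strlen() stands in for the library function being looked up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*len_fn)(const char *);

/* Plays the role of the dynamic linker's resolver. */
static size_t resolve_then_call(const char *s);

/* One table slot: it starts out pointing at the resolver. */
static len_fn slot = resolve_then_call;
static int resolver_runs = 0;

static size_t resolve_then_call(const char *s) {
    resolver_runs++;
    slot = strlen;      /* patch the slot with the real function's address */
    return slot(s);     /* then finish the call the caller asked for */
}

/* Plays the role of the stub: callers always jump through the slot
 * and never need to know whether it has been resolved yet. */
size_t call_via_stub(const char *s) {
    assert(s != NULL);
    return slot(s);
}

int resolver_run_count(void) { return resolver_runs; }
```

The first call_via_stub() goes through the resolver, and every later call lands directly on strlen(), so resolver_run_count() stays at 1 no matter how many calls follow.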

Finally, you’ll see that “not stripped” part as well. This one is very interesting to anyone who wants to learn reverse engineering. When you don’t have your executables stripped, all the symbol information will remain intact, which means the names of functions and global variables are kept!! This is a godsend when you’re trying to figure out what something does. Let me illustrate.

Run xxd on the executable first, just to see how useless it looks :)

$ xxd -l 256 main

00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0300 3e00 0100 0000 6010 0000 0000 0000  ..>.....`.......
00000020: 4000 0000 0000 0000 7839 0000 0000 0000  @.......x9......
00000030: 0000 0000 4000 3800 0d00 4000 1f00 1e00  ....@.8...@.....
00000040: 0600 0000 0400 0000 4000 0000 0000 0000  ........@.......
00000050: 4000 0000 0000 0000 4000 0000 0000 0000  @.......@.......
00000060: d802 0000 0000 0000 d802 0000 0000 0000  ................
00000070: 0800 0000 0000 0000 0300 0000 0400 0000  ................
00000080: 1803 0000 0000 0000 1803 0000 0000 0000  ................
00000090: 1803 0000 0000 0000 1c00 0000 0000 0000  ................
000000a0: 1c00 0000 0000 0000 0100 0000 0000 0000  ................
000000b0: 0100 0000 0400 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: f805 0000 0000 0000 f805 0000 0000 0000  ................
000000e0: 0010 0000 0000 0000 0100 0000 0500 0000  ................
000000f0: 0010 0000 0000 0000 0010 0000 0000 0000  ................

Well, it’s an ELF file alright, but I can’t figure out anything about what it does from here.

Let’s introduce a new tool, called readelf. This tool essentially parses ELF files and allows you to get certain information in a structured way. Let’s run this with the --syms argument to see the symbols of our executable:

$ readelf --syms main

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)

Symbol table '.symtab' contains 65 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000318     0 SECTION LOCAL  DEFAULT    1
     2: 0000000000000338     0 SECTION LOCAL  DEFAULT    2
     ...
     ...
    27: 0000000000000000     0 SECTION LOCAL  DEFAULT   27
    28: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS crtstuff.c
    29: 0000000000001090     0 FUNC    LOCAL  DEFAULT   16 deregister_tm_clones
    30: 00000000000010c0     0 FUNC    LOCAL  DEFAULT   16 register_tm_clones
    31: 0000000000001100     0 FUNC    LOCAL  DEFAULT   16 __do_global_dtors_aux
    32: 0000000000004010     1 OBJECT  LOCAL  DEFAULT   26 completed.8061
    33: 0000000000003dc0     0 OBJECT  LOCAL  DEFAULT   22 __do_global_dtors_aux_fin
    34: 0000000000001140     0 FUNC    LOCAL  DEFAULT   16 frame_dummy
    35: 0000000000003db8     0 OBJECT  LOCAL  DEFAULT   21 __frame_dummy_init_array_
    36: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS main.c
    37: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS crtstuff.c
    38: 000000000000216c     0 OBJECT  LOCAL  DEFAULT   20 __FRAME_END__
    39: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS
    40: 0000000000003dc0     0 NOTYPE  LOCAL  DEFAULT   21 __init_array_end
    41: 0000000000003dc8     0 OBJECT  LOCAL  DEFAULT   23 _DYNAMIC
    42: 0000000000003db8     0 NOTYPE  LOCAL  DEFAULT   21 __init_array_start
    43: 0000000000002024     0 NOTYPE  LOCAL  DEFAULT   19 __GNU_EH_FRAME_HDR
    44: 0000000000003fb8     0 OBJECT  LOCAL  DEFAULT   24 _GLOBAL_OFFSET_TABLE_
    45: 0000000000001000     0 FUNC    LOCAL  DEFAULT   12 _init
    46: 00000000000011e0     5 FUNC    GLOBAL DEFAULT   16 __libc_csu_fini
    47: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
    48: 0000000000004000     0 NOTYPE  WEAK   DEFAULT   25 data_start
    49: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@@GLIBC_2.2.5
    50: 0000000000004010     0 NOTYPE  GLOBAL DEFAULT   25 _edata
    51: 00000000000011e8     0 FUNC    GLOBAL HIDDEN    17 _fini
    52: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@@GLIBC_
    53: 0000000000004000     0 NOTYPE  GLOBAL DEFAULT   25 __data_start
    54: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
    55: 0000000000004008     0 OBJECT  GLOBAL HIDDEN    25 __dso_handle
    56: 0000000000002000     4 OBJECT  GLOBAL DEFAULT   18 _IO_stdin_used
    57: 0000000000001170   101 FUNC    GLOBAL DEFAULT   16 __libc_csu_init
    58: 0000000000004018     0 NOTYPE  GLOBAL DEFAULT   26 _end
    59: 0000000000001060    47 FUNC    GLOBAL DEFAULT   16 _start
    60: 0000000000004010     0 NOTYPE  GLOBAL DEFAULT   26 __bss_start
    61: 0000000000001149    27 FUNC    GLOBAL DEFAULT   16 main
    62: 0000000000004010     0 OBJECT  GLOBAL HIDDEN    25 __TMC_END__
    63: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
    64: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@@GLIBC_2.2

The purpose of this post is not to teach you how to read all of this, because that’s most likely reserved for another post. However, you can see that now we have some “text” we can interpret. Look at entry 36 (the Num column) of .symtab: it clearly shows the name of the source file that originated the main executable file. Also, you can see, in entry 61, that we have an entry point we’re familiar with, the main() function, which you can infer from:

61: 0000000000001149    27 FUNC    GLOBAL DEFAULT   16 main

This means that there’s a function (FUNC), of size 27 bytes, that will reside at the address 0x1149 (relative to wherever the program ends up being loaded in memory). Besides this entry point, we can see familiar things like the puts() symbol in entry 49, which tells us that there’s probably something being printed out somewhere.
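The Type and Bind columns that readelf prints are packed into a single byte of each symbol entry. Here is a rough sketch of how that byte is split. The struct is a cut-down stand-in I made up for this example (the real layout is Elf64_Sym in elf.h), while the numeric values come from the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A cut-down, hypothetical stand-in for one 64-bit symbol table entry.
 * It keeps only the fields discussed here. */
struct sym_entry {
    unsigned char info;   /* bind in the high 4 bits, type in the low 4 */
    uint64_t value;       /* the Value column */
    uint64_t size;        /* the Size column */
};

/* Type and bind values from the ELF specification. */
enum { TYPE_NOTYPE = 0, TYPE_OBJECT = 1, TYPE_FUNC = 2 };
enum { BIND_LOCAL = 0, BIND_GLOBAL = 1, BIND_WEAK = 2 };

int sym_type(const struct sym_entry *s) {
    assert(s != NULL);
    return s->info & 0x0f;
}

int sym_bind(const struct sym_entry *s) {
    assert(s != NULL);
    return s->info >> 4;
}

/* The entry readelf showed for main: FUNC, GLOBAL, 27 bytes at 0x1149. */
struct sym_entry example_main_symbol(void) {
    struct sym_entry s = { (BIND_GLOBAL << 4) | TYPE_FUNC, 0x1149, 27 };
    return s;
}
```

For the main entry, the packed byte works out to 0x12: GLOBAL (1) in the high half and FUNC (2) in the low half.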

Now, the thing is, when you write a program, you may want to make it hard for people to figure your stuff out and for that, there’s “stripping”, which will strip the output of these symbols. Without the symbols, things get quite complicated to reverse engineer sometimes, especially on large applications.

If you want to strip this binary, you can run the following:

$ strip --strip-all main

$ readelf --syms main

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)

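Yep, good luck with this one, you see it has WAY less information now. You can still see that it uses puts() and __libc_start_main() though, because the dynamic linker needs those names at run time. By the way, we usually say main() is our entry point, but who calls main()? The real entry point is the _start() function we saw in the unstripped symbol table: it calls __libc_start_main(), which in turn calls our main()!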

Linking is way more complicated and contains more “cool things” than what I’ve shown, but for now, I think you get the gist of it.

Final product

After linking there’s nothing else to do, you now have an executable that you can run, as such:

$ ./main

Hello World, how's it going?

This is it for now, I may do another post in the future where we go through the insides of the object files and see how they “merge” together when linking, but for this one, the idea was to give a brief overview of what happens behind the scenes, at a semi-high level.

If you’ve made it here, thank you and bye.

What happens when you compile a program?