NP-Incompleteness

Shared Libraries

2025-04-25T00:00:00+00:00

Previously we learned about the ELF file used by Linux operating systems. There, we briefly covered the dynamic linker which is responsible for loading shared libraries and that they’re also encoded as ELF files.

In this post we’ll study shared libraries in more detail. Knowing about ELF is not strictly necessary for this, but I would recommend reading the ELF post to fully follow some of the discussions.

Shared libraries can be thought of as code that was previously compiled separately. This can be system code that we don’t want included by default (e.g. math) or a third-party library (e.g. boost).

They also help with compile time: since the code from the library is already compiled into object code, we don’t need to compile it again when including it in our binary. We just need to link it. There are two ways of linking a shared library with our binary.

Static vs Dynamic Linking

Static linking is when the shared library is linked during the overall compilation process. It’s no different from other object code. For example, if we have some code like:

// main.cpp
#include "my_file_1.h"
#include "my_file_2.h"

int main() {
    f();
    g();
    return 0;
}

And a set of .cpp/.h as:

// my_file_1.h
void f();

// my_file_1.cpp
#include <stdio.h>
void f() {
    printf("hello\n");
}

// my_file_2.h
void g();

// my_file_2.cpp
#include <stdio.h>
void g() {
    printf("world\n");
}

We can compile it as

clang main.cpp my_file_1.cpp my_file_2.cpp

In the process, clang will compile each .cpp into a corresponding object code .o. Then the static linker will combine the object files into one, the final binary (see Figure 1).

Figure 1: Compilation in which all `.cpp` files are provided. Blue boxes indicate the source code. Green boxes are ELF files.

A similar process happens with a statically linked library: we package multiple object code files into a .a archive. When we do static linking, the linker will unarchive those .o files and link them as if they were compiled just now.

We can actually do this with our example above. First we compile my_file_1.cpp and my_file_2.cpp without running the linker:

clang -c my_file_1.cpp my_file_2.cpp

This produces my_file_1.o and my_file_2.o. We can package them into a .a archive called libmy.a:

ar rcs libmy.a my_file_1.o my_file_2.o

Linux systems use the convention that library file names are prefixed with lib, but when passing the name to the compiler driver we can omit the prefix. For example, we can statically link our library with main.cpp via:

$ clang main.cpp -L. -lmy
$ ./a.out
hello
world

The -L. tells the linker to search for libraries in the current directory (.). The -lmy tells the linker to link a library called libmy.a.

Figure 2: Compilation in which dependencies are specified via a statically linked shared library, which is just a wrapper around multiple `.o` files.

Dynamic linking happens at runtime. We can think of it as a deferred or lazy linking. When compiling with dynamic linking, the static linker won’t include the code from the library in the binary. It will just note down which symbols are to be resolved at runtime (see Figure 3).

Figure 3: Compilation in which dependencies are specified via a dynamically linked shared library. The object code is generated differently from the static case. Nothing is added to the main binary at this point.

During program startup, as we saw in our last post [1], the dynamic linker, or interpreter, will do the actual linking. We can do dynamic linking with our previous example. First we compile it into object code:

clang -fPIC -c my_file_1.cpp my_file_2.cpp

The main difference with static linking is the flag -fPIC. This tells the compiler to generate the object code as Position-Independent Code. Recall from our ELF post [1] that the ELF file for the binary can specify the exact virtual memory address where code and data are to be loaded. But because dynamic libraries can be linked against any binary, we don’t know a proper address to use upfront, so position-independent code lets the dynamic linker decide.

We can then generate the shared library file:

clang -shared -o libmy.so my_file_1.o my_file_2.o

This combines multiple ELF files (my_file_1.o and my_file_2.o) into another one, libmy.so. Note: .so stands for shared object; macOS uses .dylib, which stands for dynamic library. The corresponding ELF file has the type ET_DYN and contains sections related to dynamic linking such as .dynsym, .dynamic, .got, and .plt, which we’ll explore later.

Figure 4: Object code gets compiled into a single ELF file, corresponding to a shared library.

Finally, we can compile the main binary with dynamic linking:

$ LD_LIBRARY_PATH=. clang main.cpp -L. -lmy
$ LD_LIBRARY_PATH=. ./a.out
hello
world

Note: The environment variable LD_LIBRARY_PATH is the list of directories the dynamic linker looks in when searching for shared libraries. We need to tell it to also look in the current directory.

Figure 5: Execution of a binary with dynamic linking. The dynamic linker will load the dependencies at the start up of the program.

If we compile with dynamic linking, the code from my_file_1.cpp and my_file_2.cpp won’t be included in the compiled binary, but metadata will be added for the dynamic linker to know what to link. This also tells the static linker which symbols are OK to leave without definitions because they will be provided at runtime.

To demonstrate that the a.out binary does not contain the code from my_file_1.cpp, we can change that file to print “dynamic”:

// my_file_1.cpp
#include <stdio.h>
void f() {
    printf("dynamic\n");
}

We then recompile the dynamic library:

clang -fPIC -c my_file_1.cpp my_file_2.cpp
clang -shared -o libmy.so my_file_1.o my_file_2.o

and re-run our binary without re-compiling it:

$ LD_LIBRARY_PATH=. ./a.out
dynamic
world

Some advantages of dynamic linking over static linking:

The binary is smaller because it does not include the dependencies’ code.
The kernel can share the code from the shared library across processes by loading it into memory once and mapping virtual addresses from different processes to the same physical one.
You can update the library without recompiling all the dependents.

Some downsides:

Overhead in program start-up to perform the dynamic linking.
Dependencies must be properly installed and paths set up correctly when running a binary with dynamic linking.
If a backward-incompatible change is made to the shared library, the binary can crash or misbehave at runtime.

For the rest of this post we’ll assume dynamic linking. The shared libraries that are dynamically linked are also called Dynamic Shared Objects or DSOs.

Recursive Dependencies

What if our DSO depends on other DSOs? How do we encode this dependency? For example, let’s suppose our my_file_2.cpp requires some other external dependency, say my_file_3.h:

// my_file_2.cpp (MODIFIED)
# include "my_file_3.h"
void g() {
    h();
}

// my_file_3.h
void h();

// my_file_3.cpp
#include <stdio.h>
void h() {
    printf("indirect\n");
}

And suppose we compile them into separate DSOs (libmy.so and libmy2.so):

clang -fPIC -c my_file_1.cpp my_file_2.cpp my_file_3.cpp
clang -shared -o libmy.so my_file_1.o my_file_2.o

# my_file_3 is compiled into its own library
clang -shared -o libmy2.so my_file_3.o

When we try to compile main.cpp as before:

$ LD_LIBRARY_PATH=. clang main.cpp -L. -lmy
/usr/bin/ld: ./libmy.so: undefined reference to `h()'

We need to provide the dependencies of libmy as well, not just of main.cpp:

$ LD_LIBRARY_PATH=. clang main.cpp -L. -lmy -lmy2

Ideally libmy should declare its own dependencies so we don’t need to know about them ourselves. We can do this via:

# compile libmy2 beforehand:
clang -shared -o libmy2.so my_file_3.o

# link libmy with libmy2
clang -shared -o libmy.so my_file_1.o my_file_2.o -L. -lmy2

Now we can compile the binary as before, only providing direct dependencies:

$ LD_LIBRARY_PATH=. clang main.cpp -L. -lmy

Dynamic Linking Process

As we discussed in the ELF post, the dynamic linker is a binary that is loaded in memory before our main binary starts and its main function is to load the DSOs our main binary depends on. We discuss a few steps the linker performs both before yielding control to the main binary and afterwards.

It first loads the dependencies (DSOs) in memory. Depending on whether symbol lookup is configured to be done eagerly or lazily, the dynamic linker will perform it before the binary starts or on demand, when the program is run. We’ll assume lazy evaluation.

The symbol lookup has two parts to it: from the main binary perspective and from the perspective of each of the DSOs. Let’s go over each step in detail.

Loading Dependencies

The dynamic linker knows which libraries to load based on the .dynamic section of our binary, which can be inspected via:

$ readelf -d ./a.out | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libmy.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

Here we can see our custom library libmy.so and libstdc++ which implements the STL. It also includes libm.so (for floating point functions), libgcc_s.so.1 (for stack unwinding, used by exception handling) and finally libc.so.6 (for functions like printf()).

The linker will go over these dependencies in this order and process them recursively, in breadth-first search fashion. We can check the dependencies of our DSO:

$ readelf -d libmy.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libmy2.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

We can see our other DSO, libmy2.so, there, along with the same dependencies as before. Of course the linker will not load a shared dependency like libm.so.6 multiple times, only the first time it’s encountered.

We can check all recursive dependencies of a binary via ldd:

$ LD_LIBRARY_PATH=. ldd ./a.out

Which also includes the path of the library the dynamic linker found (this is very useful when there are multiple possible copies of the same library available in different directories).

Symbol Lookup: Main Binary

When the binary is loaded into memory by the kernel, it also loads sections used by the dynamic linker: .rel.plt, .got, .got.plt and .plt. GOT stands for Global Offset Table and PLT for Procedure Linkage Table. .got is used for global variables and .got.plt is used for function calls.

We can think of the GOT as a memoization for the symbol lookup. Once a symbol lookup is performed, the result is added to the GOT so it doesn’t need to be done again. The .rel.plt table lists the symbols that are to be looked up and also the index of the GOT where they should be written to.

The PLT serves as an indirection layer. The .plt section is loaded into a read-only segment of memory and we can inspect it via:

$ objdump -d -C --section=.plt ./a.out

0000000000401020 <.plt>:
  401020:       ff 35 e2 2f 00 00       push   0x2fe2(%rip)
  401026:       ff 25 e4 2f 00 00       jmp    *0x2fe4(%rip)
  40102c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000401030 <f()@plt>:
  401030:       ff 25 e2 2f 00 00       jmp    *0x2fe2(%rip)
  401036:       68 00 00 00 00          push   $0x0
  40103b:       e9 e0 ff ff ff          jmp    401020 <.plt>

0000000000401040 <g()@plt>:
  401040:       ff 25 da 2f 00 00       jmp    *0x2fda(%rip)
  401046:       68 01 00 00 00          push   $0x1
  40104b:       e9 d0 ff ff ff          jmp    401020 <.plt>
...

The interesting thing is that this is actual machine code that is executed by the CPU, not some array-like data structure like the GOT is. In the binary, when we call f(), instead of jumping to a label f(): as it would for a statically linked symbol, it actually jumps to the label f()@plt: shown above (address 401030).

The first instruction then jumps to the address stored at %rip + 0x2fe2. On x86-64, %rip holds the address of the next instruction (0x401036) while the current one executes, so the operand is the GOT entry at 0x401036 + 0x2fe2 = 0x404018. Initially that entry holds the address of the second instruction of f()@plt, i.e. 401036. We can check that by first grabbing the address of f() in the GOT:

$ objdump -R -C ./a.out
...
0000000000404018 R_X86_64_JUMP_SLOT  f()
0000000000404020 R_X86_64_JUMP_SLOT  g()

So the entry for f() is at address 0x404018. Then we peek at that location using a debugger:

$ LD_LIBRARY_PATH=. gdb ./a.out
(gdb) start
(gdb) x/a 0x0000000000404018
0x404018 : 0x401036

So the effect is that it moves on to the next instruction (as if the jump was a no-op). The next instruction is push $0x0, which pushes the index 0 to the stack (it identifies the relocation entry for f(), and thus the GOT entry the linker must write back to). It then jumps to .plt (401020), which pushes the GOT entry at 0x404008 to the stack (a value the dynamic linker sets up to identify our binary) and then jumps to the address held by the GOT entry at 0x404010, which is the linker code responsible for doing the actual symbol lookup. This example is depicted in Figure 6.

Figure 6: "Cold" lookup of `f()`'s address. It has to go through many hoops and eventually invoke the dynamic linker code to search for `f()` among the DSOs.

Once it finds the address of f() in the corresponding DSO, it will write it to the GOT entry of f() (0x404018), so that next time we jump into f()@plt, we’ll jump directly to the real address of f()! Figure 7 illustrates this flow.

Figure 7: "Warmed up" lookup of `f()`'s address. When the PLT looks up on the GOT and jumps, it goes straight to `f()`'s address.

We note a few things from this clever process:

Neither the original binary code nor the PLT code is modified when the linker resolves the address of f(). Only the GOT is.
The address resolution is lazy, so for functions that are never invoked we never pay the cost of resolving their addresses.
There’s always a level of indirection when we invoke f()@plt: it first needs to jump to the PLT and then to the actual address of f().
There are no conditionals in the lookup process aside from the linker code. That is, we don’t check if the entry is set in the GOT, we always unconditionally jump to the address stored in there.

Symbol Lookup: DSO

The linker loads a few sections from a DSO’s ELF including .dynsym, .dynstr and .gnu.hash. The .dynsym contains the symbols that can be used by dependents. This table is normalized: it doesn’t include the text of the symbol names. Those are stored in .dynstr and each .dynsym entry only includes an offset into that table.

The .gnu.hash is a hashtable for looking up symbols in the DSO. It is meant to replace an old, less efficient hashtable under the .hash section. The details on the differences and optimizations are well described in [2]. This hash table contains references to entries in .dynsym.

So the process of looking up a symbol (on demand or otherwise) is: the linker visits each DSO in order and does a lookup in its hash table. If it finds a matching hash, it might have to check a list of entries (due to hash collisions), and for each entry it does a string comparison against the symbol name.

If it finds the symbol, it returns the absolute address of the symbol, which is the relative address within the DSO + the address of the DSO itself. If it doesn’t find it, then the next DSO is looked up.

Let $D$ be the number of DSOs, $R$ the number of “external” symbols to be looked up, $m$ the number of entries for a given hash due to collisions and $s$ the length of a symbol name. The complexity of the lookup process is $O(DRms)$.

So a high number of external symbols and shared libraries, the quality of the hash function and the length of a function name can all affect the performance of the lookup process. In C++ in particular the length of a symbol makes things worse because the namespace is part of the mangled name. So if there are a lot of nested namespaces shared across symbols, they become shared prefixes, and the string comparison will take longer in most cases.

We can get a sense of how many symbol lookups ($R$) are taking place via:

LD_LIBRARY_PATH=. LD_DEBUG=symbols ./a.out 2>&1 | grep -o "symbol=" | wc -l

For the simple example above, this returns around 10,000 instances! For a bloated binary I work with on a regular basis it showed 8,000,000 lookups.

ABI

Another aspect to consider for shared libraries is ABI-compatibility. ABI stands for Application Binary Interface.

The most obvious compatibility issue is regarding symbol names and signatures: if at compilation time we rely on a given DSO but then at runtime we provide a DSO that has changed the signature of one of its exported functions, we’ll get a runtime error.

There are more subtle ways to make the binary and DSOs incompatible: for example, if the way structs are organized in memory changes. The binary code might be expecting a specific order of member variables. For example, suppose we have a point struct defined as:

// point.h
struct Point {
  int x;
  int y;

  void print();
};

// point.cpp
#include <stdio.h>
#include "point.h"

void Point::print() {
  printf("(%d, %d)", x, y);
}

And we use it in a main function:

#include "point.h"

int main() {

  Point p {
    .x = 1,
    .y = 2
  };
  p.print();

  return 0;
}

We can make point.cpp into a DSO and link dynamically with main.cpp. It will work just fine and print (1, 2). Now suppose during a refactoring we swapped the order of x and y in point.h:

// point.h
struct Point {
  int y;
  int x;

  void print();
};

If we re-compile the DSO (but not the binary), it will print (2, 1)! This is because the binary and the DSO are working with different Point structs. So one must be very careful in changing .h files in DSOs.

To avoid such problems, if the DSO author needs to make a backward-incompatible change, they can append a version number at the end of the library name, e.g. the libc.so.6 we saw above. Neither the compiler nor the linker has the notion of versioning though. It’s up to the user or the OS packaging system to symlink libc.so to the appropriate version.

Conclusion

The post ended up becoming much longer than I initially envisioned, but that’s because there was so much I learned about shared libraries.

I had this misconception for a long while: I thought statically linked libraries were those that are self-contained, meaning they do not depend on dynamic libraries. As we learned, it’s just a list of .o objects that are linked with the main binary at compile time. In fact, statically linked libraries can themselves depend on dynamic libraries which then become dependencies of the main binary.

It was also great to dig into the symbol lookup process of the dynamic linker and learn about many of the different sections I had seen while studying a sample ELF file in my previous post.

I had no idea that calling functions from DSOs involved a level of indirection and that the symbol lookup process can be so expensive! Finally, trying to come up with examples for ABI incompatibility gave me a much more solid grasp of it.

In Local Inter-Process Communication, we covered shared memory segments as a way to communicate between processes, and we also raised the issue related to ABI compatibility if the processes are using incompatible binaries.

References

[1] NP-Incompleteness: ELF: Executable and Linkable Format
[2] How To Write Shared Libraries - Ulrich Drepper
[3] ChatGPT

The Residue Theorem

2025-04-16T00:00:00+00:00

“Why does a mathematician call their dog Cauchy?” (answer in the Conclusion)

This is a post with my notes on complex integration, corresponding to Chapter 4 in Ahlfors’ Complex Analysis.

In the last post of the series, we studied The General Form of Cauchy’s Theorem. One of the outcomes in generalizing it for multiply connected regions was the concept of the modules of periodicity. In this post we’ll explore this idea further and see how it can be used as a tool for solving integrals, via Cauchy’s Residue Theorem.

Residues

Suppose a function $f(z)$ is holomorphic in a (simply connected) region $\Omega$ except at a finite number of singularities $a_1, \cdots, a_n$. We can obtain a multiply connected region $\Omega'$ by removing these points, so that $f(z)$ becomes holomorphic in $\Omega'$.

We can choose our “canonical” closed curve (see Multiply Connected Region in [2]) for each of the holes in $\Omega'$ as a circle $C_j$ centered in $a_j$, sufficiently small that it’s contained in $\Omega'$.

So the corresponding module of periodicity is

\[P_j = \int_{C_j} f(z)dz\]

Now suppose we plug in the function $1/(z - a_j)$ instead. Its period is

\[\int_{C_j} \frac{1}{z - a_j} dz\]

From our winding number post [3] we have that:

\[n(C_j, a_j) 2\pi i = \int_{C_j} \frac{1}{z - a_j} dz\]

Since $C_j$ winds around $a_j$ exactly once, $n(C_j, a_j) = 1$ and we conclude that the period for $1/(z - a_j)$ is exactly $2\pi i$.

Because it’s an integral, the period is linear in the function. So the period of $f(z) - \alpha / (z - a_j)$ is the period of $f(z)$ minus $\alpha$ times the period of $1/(z - a_j)$. We want to choose $\alpha_j$ so that the period of $f(z) - \alpha_j / (z - a_j)$ is 0. We can do:

\[\int_{C_j} \left( f(z) - \frac{\alpha_j}{z - a_j} \right) dz = P_j - \alpha_j 2\pi i = 0\]

So that

\[\alpha_j = \frac{P_j}{2 \pi i}\]

The scalar $\alpha_j$ is defined as the residue. So the function $g(z) = f(z) - \alpha_j / (z - a_j)$ is such that:

\[\int_{C_j} g(z) dz = 0\]

By Corollary 1 in [4] we conclude that $g(z)$ is the derivative of a holomorphic function, provided we avoid the other holes of $\Omega'$. To do so, we can shrink the domain to the annulus $0 \lt \abs{z - a_j} \lt \delta$. This lets us define the residue in simpler terms:

Definition. Let $f(z)$ be a function that is holomorphic in $0 \lt \abs{z - a} \lt \delta$, that is, with an isolated singularity at $a$. The residue of $f(z)$ at $a$ is the scalar $R$ for which the function $f(z) - R / (z - a)$ is the derivative of a holomorphic function in $0 \lt \abs{z - a} \lt \delta$.

The residue can be denoted by $\mbox{Res}_{z = a} f(z)$. By this definition, $f(z) - R / (z - a)$ is not necessarily holomorphic at $a$, but by Corollary 1 in [4] we can claim that, for any closed curve $\gamma$ in $0 \lt \abs{z - a} \lt \delta$:

\[(1) \quad \int_{\gamma} \left( f(z) - \frac{\mbox{Res}_{z = a} f(z)}{z - a} \right) dz = 0\]

We’ll come back to this equation later.

The Residue Theorem

In [2] we learned that the integral of a holomorphic function in a multiply connected region can be expressed as a linear combination of the modules of periodicity. For $\Omega'$, which has one hole per singularity, this reads:

\[(2) \quad \int_\gamma f(z) dz = \sum_{j = 1}^{n} c_j P_j\]

where $c_j$ is the number of times $\gamma$ winds around the hole, or equivalently, the singularity $a_j$, or in short, $n(\gamma, a_j)$. Replacing these in $(2)$:

\[\int_\gamma f(z) dz = \sum_{j = 1}^{n} n(\gamma, a_j) P_j\]

And using the residue definition, $P_j = 2 \pi i \, \mbox{Res}_{z = a_j} f(z)$:

\[\frac{1}{2\pi i} \int_\gamma f(z) dz = \sum_{j = 1}^{n} n(\gamma, a_j) \mbox{Res}_{z = a_j} f(z)\]

This is known as Cauchy’s Residue Theorem. Summarizing:

Theorem 1. Let $f(z)$ be a function that is holomorphic in a region $\Omega$ except for isolated singularities $a_1, \cdots, a_n$. Then:

\[(3) \quad \frac{1}{2\pi i} \int_\gamma f(z) dz = \sum_{j = 1}^{n} n(\gamma, a_j) \mbox{Res}_{z = a_j} f(z)\]

for any cycle $\gamma$ that is homologous to 0 in $\Omega$ and does not pass through any of the singularities.

Application: Poles

One question we might ask ourselves is: why are residues useful? Couldn’t we just use the definition of modules of periodicity directly? Why do we need residues, which are just modules of periodicity divided by $2 \pi i$?

The advantage is not in the definition of the residue itself but rather in equation $(1)$:

\[\int_{\gamma} \left( f(z) - \frac{\mbox{Res}_{z = a} f(z)}{z - a} \right) dz = 0\]

Suppose we can write $f(z)$ as:

\[f(z) = \frac{\alpha}{z - a} + g(z)\]

where $g(z)$ is the derivative of a holomorphic function. Then $\alpha$ is the residue! This is particularly useful if the singularity is a pole. In Lemma 4 in [5], we showed that if $f(z)$ has a pole of order $m$, it can be written as the Laurent series:

\[f(z) = \sum_{n=-m}^{\infty} c_n (z - a)^n\]

Where $c_{-m} \ne 0$. The analytic part of the series (i.e. the terms with non-negative values of $n$) is a standard Taylor series, so we know by [6] that it forms a holomorphic function, which we’ll call $g(z)$. We can then write $f(z)$ as:

\[f(z) = g(z) + \sum_{n=1}^{m} c_{-n} (z - a)^{-n}\]

Isolating the term $n = 1$:

\[f(z) = g(z) + \frac{c_{-1}}{z - a} + \sum_{n=2}^{m} c_{-n} (z - a)^{-n}\]

We claim that $f(z) - c_{-1} / (z - a)$ is the derivative of a holomorphic function. To do so, we can analyze each term on the right hand side in turn: $g(z)$ is holomorphic, so by [4] it has a holomorphic anti-derivative. For $n \ge 2$, the term $c_{-n} (z - a)^{-n}$ is the derivative of $c_{-n} (z - a)^{-n + 1} / (-n + 1)$. Thus the definition of residue applies and we conclude that $c_{-1}$ is exactly the scalar we’re looking for.
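As a quick sanity check of this recipe, consider $e^z / z^2$, which has a pole of order 2 at $z = 0$ (the function is chosen here only for illustration). Dividing the Taylor series of $e^z$ by $z^2$ gives its Laurent series:

\[\frac{e^z}{z^2} = \frac{1}{z^2} + \frac{1}{z} + \frac{1}{2!} + \frac{z}{3!} + \cdots\]

Here $c_{-2} = 1$ and $c_{-1} = 1$. The term $1/z^2$ is the derivative of $-1/z$ and the non-negative powers form a holomorphic function, so only the term $1/z$ is left over and $\mbox{Res}_{z = 0} \, e^z / z^2 = c_{-1} = 1$.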

Examples

Example 1. Compute the residues of

\[\frac{\alpha}{z - a} + \frac{\beta}{z - b}\]

It has poles $a$ and $b$. The residue for $z = a$ is $\alpha$ because $\beta/(z - b)$ is holomorphic in the annulus $0 \lt \abs{z - a} \lt \delta$. By analogous reasoning $\beta$ is the residue for $z = b$.

Example 2. Compute the residues of

\[\frac{e^{z}}{(z - a)(z - b)}\]

For $a \ne b$. We can compute the Laurent series expansion around $a$ to conclude the residue for that pole is $e^a / (a - b)$ (Lemma 4) and similarly for $b$ that the residue is $e^b / (b - a)$.

Connections

We can obtain Cauchy’s Integral formula from $(3)$, by using the function $f(z) / (z - a)$, where $f(z)$ is holomorphic in $\Omega$. Since $f(z)$ is holomorphic, it’s analytic [6] and can be written as the convergent series:

\[(4) \quad f(z) = \sum_{j = 0}^{\infty} c_j (z - a)^j\]

Dividing by $(z - a)$ and isolating the first term we get:

\[\frac{f(z)}{z - a} = \frac{c_0}{z - a} + \sum_{j = 1}^{\infty} c_j (z - a)^{j - 1}\]

The sum on the right hand side is a power series corresponding to a holomorphic function, and as we’ve seen in the Application: Poles section, that means that $c_0$ is the residue of $f(z) / (z - a)$ at $a$. The coefficient $c_0$ can be found via $(4)$ by setting $z = a$, which gives us $f(a)$, thus we can replace those in $(3)$ to obtain:

\[\frac{1}{2\pi i} \int_\gamma \frac{f(z)}{z - a} dz = n(\gamma, a) f(a)\]

If we assume $n(\gamma, a) = 1$ we obtain exactly Cauchy’s Integral Formula (Lemma 1 in [7]).

The Argument Principle

In The Open Mapping Theorem [8] we proved the following Lemma (Lemma 10 in the Appendix):

Let $f(z)$ be a holomorphic function in $\Omega$ and $z_1, z_2, \cdots, z_n$ be its zeros, and $m_1, m_2, \cdots, m_n$ their order. Let $\gamma$ be a closed curve in $\Omega$ and $n(\gamma, a)$ the winding number of a point $a$.

Then:
\[\sum_{i = 1}^n n(\gamma, z_i) m_i = \frac{1}{2\pi i} \int_{\gamma} \frac{f'(z)}{f(z)}dz\]

We can prove it using the residue theorem. If $f(z)$ has zeros $z_1, z_2, \cdots, z_n$ of respective order $m_1, m_2, \cdots, m_n$, then from Lemma 1 in [8], we can write it as:

\[f(z) = (z - z_1)^{m_1}(z - z_2)^{m_2} \cdots (z - z_n)^{m_n} g(z) = g(z) \prod_{j = 1}^{n} (z - z_j)^{m_j}\]

where $g(z)$ is a holomorphic function with $g(z) \ne 0$. Differentiating it gives us the equation:

\[f'(z) = g'(z) \prod_{j = 1}^{n} (z - z_j)^{m_j} + \\ g(z) \sum_{j = 1}^{n} m_j (z - z_j)^{m_j - 1} \prod_{k = 1, k \ne j}^{n} (z - z_k)^{m_k}\]

We can replace the definition of $f(z) / g(z) = \prod_{j = 1}^{n} (z - z_j)^{m_j}$ back here and get:

\[f'(z) = g'(z) \frac{f(z)}{g(z)} + \\ g(z) \sum_{j = 1}^{n} m_j (z - z_j)^{m_j - 1} \frac{f(z)}{g(z) (z - z_j)^{m_j}}\]

Cancelling terms:

\[f'(z) = g'(z) \frac{f(z)}{g(z)} + \sum_{j = 1}^{n} m_j \frac{f(z)}{(z - z_j)}\]

Dividing by $f(z)$:

\[\frac{f'(z)}{f(z)} = \frac{g'(z)}{g(z)} + \sum_{j = 1}^{n} \frac{m_j}{(z - z_j)}\]

The first term is a holomorphic function because $g(z) \ne 0$. The other terms have poles at $z_j$. As we saw in Example 1, the residue for pole $z_j$ is $m_j$. Plugging these into $(3)$ gives us, for the function $f'(z) / f(z)$:

\[(5) \quad \frac{1}{2\pi i} \int_\gamma \frac{f'(z)}{f(z)} dz = \sum_{j = 1}^{n} n(\gamma, z_j) m_j\]

Note that nowhere in our calculation did we require $m_j$ to be positive. We can thus generalize Lemma 1 in [8] with Lemma 2:

Lemma 2. Let $f(z)$ be a function with zeroes $z_1, z_2, \cdots, z_{n_z}$ of order $m_1, m_2, \cdots, m_{n_z}$ and poles $p_1, p_2, \cdots, p_{n_p}$ of order $n_1, n_2, \cdots, n_{n_p}$. Then $f(z)$ can be written as:

\[f(z) = \frac{\prod_{j = 1}^{n_z} (z - z_j)^{m_j}}{\prod_{j = 1}^{n_p} (z - p_j)^{n_j}} g(z)\]

Where $g(z)$ is a non-zero holomorphic function.

Consider the pole $p_1$ of order $n_1$. From Lemma 2 in [5], we can write $f(z)$ as a Laurent series around $p_1$ as:

\[f(z) = \sum_{j = -n_1}^{\infty} c_j (z - p_1)^j\]

If we multiply it by $(z - p_1)^{n_1}$, we get rid of the terms with $(z - p_1)$ in the denominator and thus obtain a regular Taylor series, so if we define $h_1(z) = (z - p_1)^{n_1} f(z)$, then $h_1(z)$ is holomorphic at $p_1$.

The function $h_1(z)$ is not holomorphic everywhere though. In fact, we now claim that $p_2$ is a pole of order $n_2$ of $h_1(z)$. Since $p_2$ is a pole of order $n_2$ of $f(z)$, the same argument shows that $f_2(z) = (z - p_2)^{n_2} f(z)$ is holomorphic at $p_2$, and $f_2(p_2) \ne 0$ because it equals the leading coefficient of the Laurent series of $f(z)$ around $p_2$. We have:

\[h_1(z) = \frac{(z - p_1)^{n_1} f_2(z)}{(z - p_2)^{n_2}}\]

and since $p_2 \ne p_1$, the numerator is holomorphic and non-zero at $p_2$. We conclude that $p_2$ is a pole of order $n_2$ of $h_1(z)$.

We can thus define $h_2(z) = (z - p_2)^{n_2} h_1(z)$ and repeat the process for the other poles until we arrive at:

\[\prod_{j = 1}^{n_p} (z - p_j)^{n_j} f(z) = h(z)\]

Where $h(z)$ is holomorphic at every pole of $f(z)$. Now consider a zero $z_1$ of order $m_1$ of $f(z)$. Since it’s different from all the poles, the product multiplying $f(z)$ is holomorphic and non-zero at $z_1$, so $f(z_1) = 0$ implies $h(z_1) = 0$ and $z_1$ is indeed a zero of order $m_1$ of $h(z)$. We can thus use Lemma 1 of [8] to write $h(z)$ as:

\[h(z) = \prod_{j = 1}^{n_z} (z - z_j)^{m_j} g(z)\]

Where $g(z)$ is a non-zero holomorphic function. Putting it all together:

\[\prod_{j = 1}^{n_p} (z - p_j)^{n_j} f(z) = \prod_{j = 1}^{n_z} (z - z_j)^{m_j} g(z)\]

or

\[f(z) = \frac{\prod_{j = 1}^{n_z} (z - z_j)^{m_j}}{\prod_{j = 1}^{n_p} (z - p_j)^{n_j}} g(z)\]

QED.

Which we can simplify as a single product where some of the exponents might be negative. This allows us to generalize $(5)$ to Theorem 3, known as the Argument Principle:

Theorem 3. Let $f(z)$ be a function with zeroes $z_1, z_2, \cdots, z_{n_z}$ of order $m_1, m_2, \cdots, m_{n_z}$ and poles $p_1, p_2, \cdots, p_{n_p}$ of order $n_1, n_2, \cdots, n_{n_p}$ in a region $\Omega$. Then:

\[(6) \quad \frac{1}{2\pi i} \int_\gamma \frac{f'(z)}{f(z)} dz = \sum_{j = 1}^{n_z} n(\gamma, z_j) m_j - \sum_{j = 1}^{n_p} n(\gamma, p_j) n_j\]

For any cycle $\gamma$ homologous to 0 in $\Omega$ that does not pass through any of the zeros or poles.

Why is it called the argument principle? We have that $f(z)$ is a complex number so we can write it as:

\[f(z) = \abs{f(z)}e^{i \mbox{arg}f(z)}\]

Taking the logarithm:

\[\ln f(z) = \ln \abs{f(z)} + i \mbox{arg} f(z)\]

Differentiating

\[\frac{d \ln f(z)}{dz} = \frac{d}{dz} \ln \abs{f(z)} + i \frac{d}{dz} \mbox{arg} f(z)\]

So the change in $f(z)$ for a small delta $dz$ corresponds to a change in magnitude ($\abs{f(z)}$) and in argument ($\mbox{arg} f(z)$). We also have the identity:

\[\frac{d \ln f(z)}{dz} = \frac{f'(z)}{f(z)}\]

So when we integrate the right hand side over $\gamma$, we’re computing the overall change in magnitude and argument. However, by $(6)$ the integral is $2 \pi i$ times a real number (multiply both sides by $2 \pi i$ to see it), so it only has an imaginary part. This means the net change in $\ln \abs{f(z)}$, and hence in the magnitude of $f(z)$, is 0. The net change in argument, let’s call it $\Delta_{\mbox{arg}}$, can be obtained via:

\[\int_\gamma \frac{f'(z)}{f(z)} dz = i \Delta_{\mbox{arg}}\]

Or that,

\[\Delta_{\mbox{arg}} = 2 \pi \left(\sum_{j = 1}^{n_z} n(\gamma, z_j) m_j - \sum_{j = 1}^{n_p} n(\gamma, p_j) n_j\right)\]

If $\gamma$ is a simple curve and we only count zeros and poles contained inside the curve, we get:

\[\Delta_{\mbox{arg}} = 2 \pi \left(\sum_{j = 1}^{n_z} m_j - \sum_{j = 1}^{n_p} n_j\right)\]

Note that $m_j$ and $n_j$ are integers so $\Delta_{\mbox{arg}} = 2\pi k$ for some integer $k$, corresponding to how many revolutions $f(z)$ performed around the origin as $z$ went around $\gamma$. It makes intuitive sense since $z$ is travelling around a closed curve, so it starts and stops at the same place.
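As a sanity check, we can verify the revolution count numerically. The sketch below is not part of the original derivation: it picks a hypothetical function with a zero of order 2 and a pole of order 1 inside the unit circle (plus a zero outside it), walks $z$ around the circle and accumulates the small changes in the argument of $f(z)$. The names `argument_change` and `f` are made up for this illustration.

```python
import cmath
import math


def argument_change(f, center=0j, radius=1.0, steps=10000):
    """Accumulate the change in arg f(z) as z goes once around a circle."""
    total = 0.0
    prev = f(center + radius)
    for k in range(1, steps + 1):
        theta = 2 * math.pi * k / steps
        cur = f(center + radius * cmath.exp(1j * theta))
        # The phase of the ratio is the small change in argument for this step.
        total += cmath.phase(cur / prev)
        prev = cur
    return total


# Hypothetical example: zero of order 2 at 0.5 and pole of order 1 at -0.25,
# both inside the unit circle. The zero at 3 lies outside and is not counted.
def f(z):
    return (z - 0.5) ** 2 * (z - 3) / (z + 0.25)


revolutions = argument_change(f) / (2 * math.pi)
print(round(revolutions, 6))  # sum of zero orders (2) minus sum of pole orders (1)
```

The step count only needs to be large enough that the argument changes by less than $\pi$ between consecutive samples, otherwise `cmath.phase` would pick the wrong branch.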

Conclusion

The answer to the question at the start is: “Because it leaves residues on every pole!”

Coincidentally, I heard this joke recently on the Oxford Mathematics Instagram account as I was finishing up this post! I would not have gotten the joke before studying this topic.

In this post we saw that residues are connected to modules of periodicity but they’re easier to compute, and like modules of periodicity they’re useful because the integral of a function can be expressed as a linear combination of them. Residues are particularly easy to find at poles.

We learned about some connections between residues and Cauchy’s Integral Formula and also that the change in argument of a function over a curve can be computed from the difference of zeros and poles of a function contained inside it.

Appendix

Lemma 4. In the Laurent series expansion of $e^z / ((z - a)(z - b))$ around the pole $a$, the coefficient $c_{-1}$ is $e^a / (a - b)$.

Let $w = z - a$, so $$ f(z) = \frac{e^z}{(z - a)(z - b)} = \frac{e^{w + a}}{w (w + a - b)} = \frac{e^w e^a}{w (w + a - b)} $$ We can write $$ \frac{1}{w + a - b} = \frac{1}{a - b} \cdot \frac{1}{1 + \frac{w}{a - b}} $$ The second factor can be expanded as a convergent geometric series, because we assume a neighborhood around $a$, so $\abs{z - a} = \abs{w} \lt \abs{b - a}$. So we have: $$ \frac{1}{1 + \frac{w}{a - b}} = \sum_{n = 0}^{\infty} \left(\frac{w}{b - a}\right)^n $$ Using the Taylor expansion of $e^w$: $$ e^w = \sum_{m = 0}^{\infty} \frac{w^m}{m!} $$ Putting it together: $$ f(z) = \frac{e^a}{a - b} \cdot \frac{1}{w} \left( \sum_{n = 0}^{\infty} \left(\frac{w}{b - a}\right)^n \right) \left( \sum_{m = 0}^{\infty} \frac{w^m}{m!} \right) $$ Multiplying each term: $$ = \frac{e^a}{a - b} \sum_{n = 0}^{\infty} \sum_{m = 0}^{\infty} \frac{1}{m!} \left(\frac{1}{b - a}\right)^n w^{n + m - 1} $$ Replacing back the term $w = z - a$: $$ = \frac{e^a}{a - b} \sum_{n = 0}^{\infty} \sum_{m = 0}^{\infty} \frac{1}{m!} \left(\frac{1}{b - a}\right)^n (z - a)^{n + m - 1} $$ And this is our expanded Laurent series. To obtain the coefficient $c_{-1}$ we need to consider all terms for which $n + m - 1 = -1$. Since $n$ and $m$ are non-negative, the only combination that yields $-1$ is $n = m = 0$, which makes our life easier. We have: $$c_{-1} (z - a)^{-1} = \frac{e^a}{a - b} \frac{1}{0!} \left(\frac{1}{b - a}\right)^0 (z - a)^{-1}$$ So we conclude that $c_{-1} = e^a / (a - b)$.
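Lemma 4 can also be checked numerically, since $c_{-1}$ equals $\frac{1}{2\pi i}$ times the integral of the function over a small circle around $a$. The following is a minimal sketch, assuming the arbitrary values $a = 1$ and $b = 3$ and a circle of radius $0.5$ (so that $b$ stays outside). The helper `residue_at` is a made-up name.

```python
import cmath
import math


def residue_at(f, a, radius=0.5, steps=2000):
    """Approximate (1 / (2*pi*i)) times the integral of f over |z - a| = radius."""
    total = 0j
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        w = radius * cmath.exp(1j * theta)
        # On the circle z = a + w, we have dz = i * w * dtheta.
        total += f(a + w) * 1j * w * (2 * math.pi / steps)
    return total / (2j * math.pi)


a, b = 1.0, 3.0  # arbitrary values chosen for this check


def f(z):
    return cmath.exp(z) / ((z - a) * (z - b))


numeric = residue_at(f, a)
expected = cmath.exp(a) / (a - b)
print(numeric, expected)
```

Both values should agree up to floating point error, with a negligible imaginary part.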

References

[1] Complex Analysis - Lars V. Ahlfors
[2] NP-Incompleteness: The General Form of Cauchy’s Theorem
[3] NP-Incompleteness: The Winding Number
[4] NP-Incompleteness: Cauchy Integral Theorem
[5] NP-Incompleteness: Zeros and Poles
[6] NP-Incompleteness: Holomorphic Functions are Analytic
[7] NP-Incompleteness: Cauchy’s Integral Formula
[8] NP-Incompleteness: The Open Mapping Theorem

ELF: Executable and Linkable Format

2025-04-12T00:00:00+00:00

Recently we learned a bit about how LLVM works, in particular that it produces an artifact called object code.

In this post I'd like to delve into it, in particular into the Executable and Linkable Format or ELF file format (used by most Linux systems). We'll cover what ELF is used for, how its contents are organized and finally how the operating system "executes" an ELF file.

Object Code

ELF is a file storing the object code produced by the compiler and it contains metadata for the operating system to load it into memory and actual data for the CPU to run the code.

Different families of operating systems use different formats. ELF is used for Linux while macOS uses the Mach-O format and Windows uses the Portable Executable (PE).

Both ELF and PE are based on an older format called Common Object File Format (COFF). COFF in turn replaced a format called a.out. We can see references to it when we compile code without specifying an output name (the default filename is a.out).

We’ll be using this simple C++ program to follow along with an example:

#include <stdio.h>

int main() {
  printf("hello\n");
  return 0;
}

To compile it, I did:

clang example.cpp -o example

Libraries are also represented using ELF: a shared library (.so) is itself an ELF file, while a static library (.a) is an archive of ELF object files.

ELF File Layout

The ELF layout consists of basically five parts: ELF Header, Program Header, Other, Section Data and Section Header, typically in this order:

ELF Header

Program Header

Other

Section Data

Section Header

The ELF header contains metadata about the ELF file itself and the intended architecture it’s meant to be run in. The Program Header is a table of contents containing information on what to load from the file into memory. Each row in this table is called a segment.

The Section Header is a table of contents for the more granular chunks of memory called sections (a segment typically includes one or more sections). This header is mainly used by the linker and debugger, but not used during runtime, so it’s not loaded into memory.

All the headers discussed so far are tables of contents: they contain indexes. The actual data is located under the Section Data. Note that not all sections are loaded into memory (e.g. debug information is left out).

The Other contains information that is not part of any header but is not section data either. One example is the dynamic linker path. It’s indexed by the program header (via INTERP as we’ll see later) but not loaded into memory.

Why does the section header appear after the section data? That’s because the section header doesn’t need to be loaded into memory by the loader, so putting it between the program header and the section data that are loaded could lead to unnecessary disk seeking.

ELF Header

The ELF header describes some metadata of the ELF file itself, such as whether it’s meant for 64-bit architectures (e.g. Class = ELF64), little-endian vs big-endian (e.g. Data = 2's complement, little endian), the instruction set (e.g. Machine = x86-64) and whether it’s an executable or a shared library (e.g. Type = EXEC). This header also tells the kernel where to find the other two headers: program and section headers.

To inspect the ELF header we can do:

readelf -h example
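To make the header fields more concrete, here is a minimal sketch of a parser for the 64-byte header of a little-endian ELF64 file, using the field order from the ELF specification. It only decodes a handful of type and machine codes, and it runs on a synthetic header built in the snippet itself (the values are made up) rather than on the real `example` binary. The function name `parse_elf64_header` is hypothetical.

```python
import struct

# e_ident (16 bytes) followed by the remaining ELF64 header fields.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}
MACHINES = {62: "x86-64", 183: "AArch64"}


def parse_elf64_header(data):
    """Decode the ELF header of a little-endian ELF64 file."""
    if len(data) < ELF64_HEADER.size:
        raise ValueError("not enough bytes for an ELF64 header")
    (ident, e_type, e_machine, _version, e_entry, e_phoff, e_shoff, _flags,
     _ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     _shstrndx) = ELF64_HEADER.unpack_from(data)
    if ident[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic number")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    return {
        "type": TYPES.get(e_type, str(e_type)),
        "machine": MACHINES.get(e_machine, str(e_machine)),
        "entry": e_entry,
        "phoff": e_phoff, "phentsize": e_phentsize, "phnum": e_phnum,
        "shoff": e_shoff, "shentsize": e_shentsize, "shnum": e_shnum,
    }


# Synthetic header: ELF64, little endian, EXEC, x86-64, 13 program header rows.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 13, 64, 0, 0)
header = parse_elf64_header(sample)
print(header)
```

To try it on a real binary, one could read the first 64 bytes of the file in binary mode and pass them to the same function.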

Program Header

The program header describes what portions from the ELF file (segments) to load into memory. It’s an index represented by a table with the columns:

Type

Offset

VirtAddr

PhysAddr

FileSiz

MemSiz

Flags

Align

We’ll cover types shortly. Offset indicates where in the ELF file the segment is located. FileSiz is the length of the segment.

VirtAddr is the location in the virtual memory address space where this segment will be loaded. Recall that the OS allocates a dedicated virtual memory address space for each program, so programs can be very explicit about where each segment is loaded into that space. Of course the OS will map these addresses into arbitrary physical memory addresses. The attribute PhysAddr is ignored unless this binary is for firmware or kernel programs.

MemSiz indicates how much space the segment will take when loaded into memory. It can be larger than FileSiz because the file representation might be more compact than the one at runtime. For example:

// uninitialized.cpp

char v[100];

This array requires allocating 100 bytes in memory but it can be represented with fewer bytes in the file. In this case FileSiz < MemSiz.

Finally Flags controls access to the segment and Align is the byte alignment in memory. In general Align should not matter because the virtual memory address is specified, but there’s this thing called Position-Independent Executable (PIE) where the OS has the freedom to choose a different address, in which case it must honor the alignment.

Now on to types. The main types are:

PHDR

INTERP

DYNAMIC

LOAD

PHDR, short for Program HeaDeR, is a segment representing the table itself. The table needs to be loaded into memory too, so we need to specify the memory address + size. This table will be used by the dynamic linker, which loads the dynamic libraries into memory. It needs to know where the information about dynamic libraries is loaded in memory, and it doesn’t have access to the ELF file.

The INTERP, short for INTERPreter, points to the path to the dynamic linker (also known as interpreter) executable, which will be invoked to link the libraries with the program. For my Linux system the path is /lib64/ld-linux-x86-64.so.2.

DYNAMIC are segments containing information about the dynamic libraries to be linked with the program. When code is compiled, it doesn’t include the code of the libraries on the ELF file, so they must be loaded in during runtime.

LOAD are segments that represent data or code to be loaded in memory. This includes the machine instructions that will be read by the CPU and also static variables.

To inspect the program header we can do:

readelf -l example
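The columns above map directly onto the fixed-size rows of the table. The sketch below decodes rows using the ELF64 little-endian layout (56 bytes per row, with the flags stored right after the type). It runs on two synthetic rows with made-up addresses, and `parse_program_headers` is a hypothetical helper name.

```python
import struct

# p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
PHDR64 = struct.Struct("<IIQQQQQQ")
SEGMENT_TYPES = {0: "NULL", 1: "LOAD", 2: "DYNAMIC", 3: "INTERP", 4: "NOTE", 6: "PHDR"}


def parse_program_headers(data, phoff, phentsize, phnum):
    """Decode phnum rows of the program header table starting at phoff."""
    segments = []
    for index in range(phnum):
        (p_type, p_flags, p_offset, p_vaddr, p_paddr,
         p_filesz, p_memsz, p_align) = PHDR64.unpack_from(data, phoff + index * phentsize)
        # Flag bits: 4 = read, 2 = write, 1 = execute.
        flags = "".join(c for bit, c in ((4, "R"), (2, "W"), (1, "E")) if p_flags & bit)
        segments.append({
            "type": SEGMENT_TYPES.get(p_type, hex(p_type)),
            "offset": p_offset, "vaddr": p_vaddr, "paddr": p_paddr,
            "filesz": p_filesz, "memsz": p_memsz,
            "flags": flags, "align": p_align,
        })
    return segments


# Synthetic table: an INTERP row and a writable LOAD row with FileSiz < MemSiz.
table = (PHDR64.pack(3, 4, 0x318, 0x400318, 0x400318, 0x1C, 0x1C, 1)
         + PHDR64.pack(1, 6, 0x2000, 0x403000, 0x403000, 0x10, 0x80, 0x1000))
segments = parse_program_headers(table, phoff=0, phentsize=PHDR64.size, phnum=2)
for segment in segments:
    print(segment)
```

For a real file, `phoff`, `phentsize` and `phnum` would come from the ELF header.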

Section Header

As we discussed in ELF File Layout, the section header is not used in runtime. It’s only used by debuggers (running the binary with gdb or lldb) and linkers (as part of the overall compilation process). Recall that the job of the linker is to take multiple independent ELF files and stitch them together into one.

The columns of the section headers are:

Nr

Name

Type

Address

Offset

Nr is a sequence number to uniquely identify a section. Name can be used to indicate what a section does. For example, some common names are:

.text - contains machine instructions
.data - contains initialized global and static variables
.bss - contains uninitialized global and static variables (e.g. see uninitialized.cpp)
.rodata - read-only data (constant globals)
.interp - the path to the dynamic linker
.debug_info - to be used by the debugger (not loaded into memory)

To inspect the section header we can do:

readelf -S example

Execution of an ELF file

We now cover the steps that take place when we run an ELF file. We start by executing our binary in a shell:

./example

The shell forks a child process, which runs the system call execve("example", argv, envp). This instructs the kernel to replace the program running in that process with the new one.

The kernel starts by reading the ELF header of the file to extract metadata about the program. It also finds the offset of the program header in the file.

The kernel goes over each row of the program header and for each LOAD segment it maps the file contents into memory (recall that some segments take more space in memory, such as those with uninitialized variables).
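The loading step can be sketched as follows. This is a simplified model: a real kernel maps pages with mmap and honors alignment and permissions, whereas this sketch just copies bytes into a dictionary keyed by virtual address and zero-fills the gap between FileSiz and MemSiz. The segment dictionaries and the function `map_load_segments` are hypothetical.

```python
def map_load_segments(file_bytes, segments):
    """Return {vaddr: bytes} for every LOAD segment, zero-filling up to memsz."""
    memory = {}
    for segment in segments:
        if segment["type"] != "LOAD":
            continue  # only LOAD rows are mapped in this sketch
        if segment["memsz"] < segment["filesz"]:
            raise ValueError("memsz must be at least filesz")
        start = segment["offset"]
        content = file_bytes[start:start + segment["filesz"]]
        # The extra bytes (e.g. uninitialized variables) start out as zeros.
        memory[segment["vaddr"]] = content + bytes(segment["memsz"] - segment["filesz"])
    return memory


# Made-up file contents: 8 bytes of "headers" followed by 4 bytes of data.
file_bytes = b"HEADER.." + b"\x01\x02\x03\x04"
segments = [
    {"type": "LOAD", "offset": 8, "vaddr": 0x403000, "filesz": 4, "memsz": 16},
    {"type": "INTERP", "offset": 0, "vaddr": 0x400000, "filesz": 8, "memsz": 8},
]
memory = map_load_segments(file_bytes, segments)
print({hex(address): data for address, data in memory.items()})
```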

If the binary uses shared libraries, then the kernel reads the INTERP row to find the path of the dynamic linker, e.g. /lib64/ld-linux-x86-64.so.2. This is a shared library that is also represented as an ELF file. The kernel loads it in memory and passes control to it.

The dynamic linker is passed the location of the DYNAMIC segments so it can determine which shared libraries to load. We won’t go over the details here since I want to write a post specifically for shared libraries, but in overall terms: each shared library is also an ELF file, so part of the work of the dynamic linker is to map it to memory, the same way the kernel did initially.

The main difference is that the virtual address space belongs to the main program, not to the shared library, so the dynamic linker will ignore the VirtAddr of the program header. Shared libraries might also depend on other libraries, so this whole process is recursive. It also doesn’t need to pass control to a dynamic linker, since it’s already running.

The dynamic linker also needs to resolve symbols. If you have a function call such as:

f(x);

that is not defined in your binary, the dynamic linker must find which shared library implements it. After that it needs to update the instruction on that function call (this process is called relocation).
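As a toy model of symbol resolution and relocation, consider the sketch below. It assumes each library is just a dictionary from symbol name to address and that libraries are searched in load order. Real dynamic linkers use symbol hash tables and indirection tables, so this only illustrates the idea. All names and addresses are made up.

```python
def resolve_symbols(undefined, libraries):
    """Find, for each undefined symbol, the first library that exports it.

    libraries is a list of (library name, {symbol: address}) in search order.
    """
    resolved = {}
    for symbol in undefined:
        for library_name, exports in libraries:
            if symbol in exports:
                resolved[symbol] = (library_name, exports[symbol])
                break
        else:
            raise LookupError(f"undefined symbol: {symbol}")
    return resolved


def relocate(call_sites, resolved):
    """Patch each call site (offset -> symbol) with the resolved address."""
    return {offset: resolved[symbol][1] for offset, symbol in call_sites.items()}


# Hypothetical libraries and addresses.
libraries = [
    ("libfoo.so", {"f": 0x7F0000001000}),
    ("libbar.so", {"f": 0x7F0000202000, "g": 0x7F0000202040}),
]
resolved = resolve_symbols(["f", "g"], libraries)
patched = relocate({0x1139: "f", 0x1140: "g"}, resolved)
print(resolved)
print({hex(offset): hex(address) for offset, address in patched.items()})
```

Note that `f` is exported by both libraries and the first one in search order wins in this model.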

Once the dynamic linker is done, it passes the control back to the main() function of your binary.

Conclusion

The motivation for this post is that I wanted to know more about the ELF format. I had some very vague understanding of it after reading [1] a while back but lacked a good mental model.

I used mostly ChatGPT to write this post and I felt like a journalist interviewing an expert, asking questions and follow-ups where I needed more detail. It was a fun process and I find the word journalist appropriate with respect to considering my blog a diary or journal.

I learned about the overall structure of ELF files but I was also able to understand a lot more clearly how a program is executed! I had learned in college how a CPU might execute a program once it is initialized but I don’t recall learning about the initialization process itself.

Things I learned that I find particularly interesting:

a.out is the name of a defunct binary format.
Not all contents of the ELF file are loaded in memory.
The segments specify explicitly their address in the virtual memory space.
Uninitialized variables are placed in different sections and handled differently from initialized ones.
The kernel does the initial loading of the ELF file but delegates the dynamic linking to a “third-party” program in user space.

References

[1] Ray - Analyzing The Simplest C++ Program
[2] How To Write Shared Libraries - Ulrich Drepper
[3] ChatGPT

Queues

2025-03-29T00:00:00+00:00

We recently visited Vietnam and on the way back, departing from Hanoi, we had to stand in multiple queues: one for ticketing and luggage, one for immigration, one for X-ray screening, and finally one for boarding. We could even count the last queue inside the plane to reach our seats.

This got me thinking about queues in data processing and reasons to use (or avoid) them. In this post I’d like to explore some of those trade-offs.

Benefits of Queues

Amortizing Irregular Throughput

In a producer-consumer pattern, the producer may emit data at an irregular rate, sometimes all at once, sometimes barely at all. The consumer might not be able to handle a sudden burst and could stall or drop some data.

One solution is to introduce a queue as a buffer. If the consumer can’t process everything immediately, the data gets queued and handled later.

Of course, this only works if the consumer’s average throughput exceeds the producer’s in the long run. If the producer is consistently faster, the queue will eventually fill up.
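A small simulation can illustrate both points. The sketch below assumes a discrete-time model where, at each tick, some items arrive and the consumer can process up to `capacity` items. Anything that fits neither in the consumer nor in the queue is dropped. The function name and the numbers are made up for illustration.

```python
def simulate(arrivals, capacity, queue_limit):
    """Return (processed, dropped, max_backlog) for a bursty arrival pattern."""
    backlog = processed = dropped = max_backlog = 0
    for arriving in arrivals:
        pending = backlog + arriving
        done = min(capacity, pending)  # the consumer handles what it can
        processed += done
        pending -= done
        backlog = min(pending, queue_limit)  # the rest waits in the queue
        dropped += pending - backlog  # whatever does not fit is lost
        max_backlog = max(max_backlog, backlog)
    return processed, dropped, max_backlog


# Bursty producer: 20 items over 8 ticks (2.5 per tick on average).
bursty = [10, 0, 0, 0, 10, 0, 0, 0]
print(simulate(bursty, capacity=3, queue_limit=0))   # no queue: (6, 14, 0)
print(simulate(bursty, capacity=3, queue_limit=10))  # with a queue: (20, 0, 7)

# Producer consistently faster than the consumer: the queue fills up anyway.
print(simulate([5, 5, 5, 5], capacity=3, queue_limit=4))  # (12, 4, 4)
```

With the queue, the same consumer handles every item because its capacity of 3 per tick exceeds the average arrival rate of 2.5.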

Digression: The first time I heard of the term amortization was in analysis of algorithms. An algorithm might be slow in one iteration, but when looking at the entire sequence, the average (amortized) complexity is much better.

The word amortize comes from the Latin ad mortire [1], meaning “to kill” and was used in the context of extinguishing debt. It’s an apt metaphor in algorithm analysis if you think of computational complexity as cost. In fact, some amortization analysis proofs explicitly use the concept of “debt”.

Checkpointing

Another reason to use queues is to persist intermediate data and avoid recomputation. Suppose we have a multi-stage data processing pipeline where the first stage is very expensive and the final stage is unreliable.

If we computed everything in a single pass, a failure in a later stage would require recomputing the more costly first stage. Technically, we don’t need a queue here, just persistence, but many queue implementations provide built-in durability.

Decoupling Parallelism

Continuing with the multi-stage pipeline example, suppose one of the stages becomes a bottleneck. Then we can increase the number of threads or machines allocated to processing it, while keeping the resources on the cheaper stages the same.

This has a clear analogy with the airport queues: ticketing and luggage tagging are slower processes than X-ray screening, so you’d expect more agents handling the former. Queues make that kind of decoupling possible.

Observability

In a complex data processing pipeline it might be tricky to determine which stage is the bottleneck. We can measure how long it takes for a stage from receiving the data to sending it downstream but it might not be accurate if async computation is involved.

Queues can help. For example, if we size a queue to buffer occasional throughput spikes (see Amortizing Irregular Throughput), we can double that size estimate and add monitoring. If the queue is ever more than half-full, it indicates the consumer cannot process data quickly enough and is a bottleneck.
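The half-full heuristic could look like the sketch below. It assumes we have an estimate of the backlog an occasional spike can cause (`expected_peak`), allocates twice that capacity and counts an alert whenever the depth goes past the estimate. The class and its attributes are hypothetical, not taken from any particular queue library.

```python
from collections import deque


class MonitoredQueue:
    """Bounded queue sized at twice the expected peak, alerting past half-full."""

    def __init__(self, expected_peak):
        self.capacity = 2 * expected_peak
        self.alert_threshold = expected_peak
        self.items = deque()
        self.alerts = 0

    def put(self, item):
        if len(self.items) >= self.capacity:
            return False  # queue is full, the caller must handle the overflow
        self.items.append(item)
        if len(self.items) > self.alert_threshold:
            self.alerts += 1  # more than half-full: the consumer is lagging
        return True

    def get(self):
        return self.items.popleft() if self.items else None


queue = MonitoredQueue(expected_peak=2)
accepted = [queue.put(n) for n in range(5)]
print(accepted, queue.alerts)
```

In a real system the alert counter would be replaced by whatever metrics or logging mechanism is already in place.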

Downsides

Now that we’ve covered the benefits of using queues, let’s consider some downsides.

Processing Lag

Before filling up, queues might mask a bottlenecked consumer by absorbing data. Events will take longer to reach their destination because they stay parked in the queue. A more costly alternative would be to avoid queues and scale the whole processing system to handle peak load.

Memory

When the queue is between processing units on the same machine, it uses memory. Memory usage is unpredictable and it’s possible that a burst in throughput can cause OOMs and make processing stall.

When the machine is restarted there’s an ever-increasing backlog to catch up on, and it can become a vicious cycle. Overprovisioning for peak memory helps but it can be wasteful.

Conclusion

We covered some pros and cons of using queues. In my experience, using queues between machines is often beneficial (e.g. in a MapReduce architecture), while for local queues the trade-off is harder. I find the observability part very useful.

I don’t know much about Kafka but I suspect learning more about it might shed light on additional trade-offs, especially for distributed queues. Queueing theory is a topic on its own, so studying it might also yield other insights.

References

[1] Etymonline: amortize

Review: Getting Started with LLVM Core Libraries

2025-03-22T00:00:00+00:00

In this post I’ll share my notes on the book Getting Started with LLVM Core Libraries by Bruno Cardoso Lopes and Rafael Auler.

The book is aimed at people interested in developing code using LLVM internal libraries. It’s very practical in the sense that it provides lots of code examples, installation steps and command lines.

My main objective in reading this book was to get a better understanding on how LLVM worked and so I skimmed most of the parts around setting up and running things. I’ll focus instead more on the high level / conceptual parts of the book in my summary.

Book Organization

This book has about 285 pages and 11 chapters. Chapters 1, 2 and 3 explain how to set up the project and get a sense of the codebase. I’ll skip these.

Chapters 4, 5 and 6 cover the main architecture of LLVM (Frontend, IR, Backend) and are the ones I was mostly interested in.

Chapter 7 is about the JIT (Just-in-time compilation) feature LLVM provides. I’m not particularly interested in JIT as of now so I’ll just provide a summary.

Chapter 8 is about cross-platform compilation. This means compiling code in an architecture/system that is different from where it will be executed. I’m not interested in this at the moment, so I’ll skip.

Chapters 9 and 10 cover non-compilation features LLVM provides: static analysis and code transformation. I’m actually very interested in these, but until I get to try them out myself I don’t feel like I can provide a good summary, so I’ll try to provide a high-level overview instead.

Table of Contents
Overview
The Frontend
The LLVM IR
The Backend
The Just-in-Time Compiler
The Clang Static Analyzer
Clang Tools
Conclusion

Overview

For the sake of simplicity and self-interest, I’ll throughout assume the language being compiled is C++, even though LLVM is able to compile other languages. The C++ compiler is known as Clang which I had trouble differentiating from LLVM, but reading this book made things clearer.

LLVM stands for Low Level Virtual Machine. It started as a virtual machine rivaling the Java VM but used a lower level intermediate representation, which is now the LLVM IR. The Java IR is called Java bytecode, and as a jab at how verbose that format is, LLVM’s IR is sometimes called bitcode. LLVM today is not used as a virtual machine but the name stuck.

In very high-level and simplistic terms, LLVM is composed of two parts: the frontend and the backend. The frontend is coupled with a specific programming language (e.g. C++, Rust), while the backend is coupled with the target where the code is run (e.g. Linux or Windows, x86 or ARM64).

To avoid having to handle every single combination of language and target, the frontend writes to an intermediate representation (IR) which in theory is agnostic to both programming language and target (in practice, I learned from the book, this is not true).

Figure 1: Relationship between frontend, backend, IR and linker.

This modularity is useful not only from the maintainability perspective of the code, but it also allows partial use of LLVM. For example, the Rust compiler (rustc) has its own frontend component that is capable of compiling to LLVM’s IR and then only uses LLVM’s backend component for general optimizations.

Modularity is a key design principle of LLVM and it goes beyond just the frontend and backend. Each of these components is composed of substeps which are implemented by libraries with public interfaces, so it’s possible to only use a subset of steps from either of these components.

The process of converting a source file into object code is called compilation. This process generates one such artifact for each .cpp file, so it is still necessary to combine them into a single binary. This is done by the linker. At the time the book was written, LLVM relied on the GNU linker (ld) because its own linker was not mature yet.

The Frontend

The frontend component that handles C++ (and also C and Objective-C) is called Clang. The terminology is confusing because Clang can also refer to the full suite of compilation and linking, called compiler driver (via the clang command) or just the compiler (via the clang -cc1 option).

In this post when we mention Clang we’ll be referring to the frontend of LLVM for C++.

As we mentioned in Overview, the frontend is composed of a few steps, shown in Figure 2:

Figure 2: Steps of the frontend. The blue boxes and blue text represent artifacts.

We’ll cover each of the steps briefly:

Lexical Analysis

This step is also known as lexing, and is responsible for splitting the source code (text) into tokens, which have an associated type.

There’s also the Preprocessing step which runs interleaved with the Lexical Analysis and hence is not depicted in the diagram of Figure 2. This step is responsible for replacing C macros.

This part was a learning for me: I always assumed the preprocessing happened before lexing.

Syntactic Analysis

This step is also known as parsing, and it structures the tokens into a tree, the Abstract Syntax Tree or AST.

Semantic Analysis

This step is essentially a type checker. It traverses the AST and keeps type information about variables to detect type inconsistencies.

Clang runs the semantic analysis while constructing the AST.

LLVM IR Generator

This step consists in transforming the C++-specific AST into the generic AST of LLVM, called LLVM IR.

Following the principle of modularity, we can inspect the result of each of these steps individually through the invocation of clang. For example, if we want to see the output of the parser, we can do

clang -fsyntax-only -Xclang -ast-view min.c

The LLVM IR

The intermediate representation has to strike a fine balance to avoid coupling too much with specific languages or specific targets. In practice there is not a single format of the LLVM IR due to being unable to account for all possible languages and targets.

Target-independence is challenging to achieve, since even the source code might be target-dependent (e.g. C++ can make Linux syscalls).

The LLVM IR can be represented on disk either as bitcode (extension .bc) or LLVM assembly (extension .ll). These are analogous to object code (extension .o) and target assembly (extension .s), respectively.

The bitcode is a binary format, while the LLVM assembly is human readable. Let’s explore an example for the code sum.cpp:

int sum(int a, int b) {
  return a + b;
}

We can generate the LLVM assembly via:

clang sum.cpp -emit-llvm -S -c -o sum.ll

and obtain:

; ModuleID = 'sum.cpp'
source_filename = "sum.cpp"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx15.0.0"

; Function Attrs: noinline nounwind optnone ssp uwtable(sync)
define i32 @_Z3sumii(i32 noundef %0, i32 noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  store i32 %0, ptr %3, align 4
  store i32 %1, ptr %4, align 4
  %5 = load i32, ptr %3, align 4
  %6 = load i32, ptr %4, align 4
  %7 = add nsw i32 %5, %6
  ret i32 %7
}

Some observations about the format:

It uses virtual registers, e.g. %3, which are mapped into physical ones by the backend step.
It uses Static Single Assignment (SSA) which means a variable is never reassigned. SSA requires phi instructions to handle merging flow paths, for example, if the original code was:

int x;
if (condition) {
    x = 1;
} else {
    x = 2;
}

In SSA form it becomes:

if (condition) {
    x1 = 1;
} else {
    x2 = 2;
}
x3 = φ(x1, x2)

The special function φ tells us that only one of x1 and x2 is valid, depending on which branch was taken.

It uses three-address instructions, meaning an instruction uses at most 3 addresses, so an expression such as x = (-b + sqrt(b^2 - 4*a*c)) / (2*a) has to be broken down into multiple ones using intermediate variables.
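To illustrate the three-address breakdown, here is a minimal sketch that uses Python's own `ast` module to parse an assignment and emit one operation per line with temporaries `t1`, `t2`, and so on. It is only an illustration of the idea, not how Clang lowers expressions. Note that the exponent is written `b**2` because `^` means XOR in Python. The function `to_three_address` is a made-up name.

```python
import ast
import itertools

OPERATORS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/", ast.Pow: "**"}


def to_three_address(source):
    """Lower one assignment, e.g. 'x = a + b * c', into three-address steps."""
    statement = ast.parse(source).body[0]
    if not (isinstance(statement, ast.Assign) and len(statement.targets) == 1
            and isinstance(statement.targets[0], ast.Name)):
        raise ValueError("expected a single assignment to a plain name")
    counter = itertools.count(1)
    steps = []

    def emit(expression):
        # Each emitted step defines a fresh temporary exactly once.
        temp = f"t{next(counter)}"
        steps.append(f"{temp} = {expression}")
        return temp

    def lower(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            left = lower(node.left)
            right = lower(node.right)
            return emit(f"{left} {OPERATORS[type(node.op)]} {right}")
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return emit(f"-{lower(node.operand)}")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and len(node.args) == 1 and not node.keywords):
            return emit(f"{node.func.id}({lower(node.args[0])})")
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    result = lower(statement.value)
    steps.append(f"{statement.targets[0].id} = {result}")
    return steps


for step in to_three_address("x = (-b + sqrt(b**2 - 4*a*c)) / (2*a)"):
    print(step)
```

Since every temporary is assigned exactly once, the output is also in SSA form.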

The lines

target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx15.0.0"

specify information about the target architecture. In this example, the arm64 architecture running the macOS operating system.

In datalayout the - separates attributes, so we have e, m:o, i64:64, etc.

The e indicates little-endian addressing (big endian would be E. Neat!). The m:o entry indicates the name mangling convention of the binary format (o is for Mach-O). The entry i64:64 indicates the alignment of 64-bit integers (it could be more). S128 means the stack is aligned at 128 bits.
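The attribute syntax is simple enough that a few lines of code can decode the entries discussed above. The sketch below only understands the entries that appear in this example (`e`/`E`, `m:`, `i<size>:<alignment>`, `n<widths>` for native integer widths and `S<bits>`) and collects everything else under `other`. The function `describe_datalayout` is a hypothetical helper, not an LLVM API.

```python
def describe_datalayout(layout):
    """Decode a few well-known entries of an LLVM datalayout string."""
    info = {"int_alignment_bits": {}, "other": []}
    for entry in layout.split("-"):
        if entry == "e":
            info["endianness"] = "little"
        elif entry == "E":
            info["endianness"] = "big"
        elif entry.startswith("m:"):
            info["mangling"] = entry[2:]
        elif entry.startswith("S"):
            info["stack_alignment_bits"] = int(entry[1:])
        elif entry.startswith("i"):
            size, alignment = entry[1:].split(":")[:2]
            info["int_alignment_bits"][int(size)] = int(alignment)
        elif entry.startswith("n"):
            info["native_int_widths"] = [int(width) for width in entry[1:].split(":")]
        else:
            info["other"].append(entry)
    return info


layout = describe_datalayout("e-m:o-i64:64-i128:128-n32:64-S128")
print(layout)
```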

The Backend

Like the frontend, the backend is composed of multiple steps. Its aim is to convert the LLVM IR into either object code or assembly.

Figure 3 depicts the main steps, which are called passes. We can see there are multiple passes and they look independent. With this design LLVM achieves more modularity which helps with reuse and testing. It might sacrifice performance in case passes could be combined to avoid repeated operations.

Figure 3: Steps of the backend. The gray boxes are mostly optimizations and not critical to the correctness of the process. The blue boxes and blue text represent artifacts.

We’ll briefly go over the green boxes of Figure 3.

Instruction Selection

The instruction selection converts the LLVM IR into a structure called Selection DAG, a DAG where each node is an operand or instruction of the target architecture. The process of converting generic instructions into target-specific ones is called lowering.

This pass is also responsible for legalization, making sure that the instructions only use existing types of the target architecture. For example, if the target architecture only supported 32-bit integers, operations involving 64-bit integers need to be split.

The instruction selection is the most expensive step of the backend, taking up to half of the time of the entire process. There are less optimized versions of this step that run faster and can be used with different compiler optimization levels such as -O0.

Pre Instruction Scheduling

At this point we still have a DAG which doesn’t fully specify the order in which we execute instructions. This is what the instruction scheduling does.

This step uses prior target-specific information to better guide the decision of how to order instructions. This information is called instruction itineraries. One example is data about how many CPU cycles a given instruction takes.

This step also computes hazards, situations which might lead to poor performance, for example an instruction that might depend on the result of some slow computation (data hazard). In this case the compiler might opt to insert other instructions in between to avoid idleness.

Register Allocation

The overall idea of register allocation is to assign virtual registers (unbounded) to physical ones (constrained by the CPU architecture).

This process can be modeled as the graph coloring problem, which is NP-Complete. The graph coloring problem consists in assigning a color to each node such that adjacent vertices don’t share the same color.

We can reduce the register allocation to graph coloring as follows: each node is a virtual register and there’s an edge between two nodes if their lifetimes overlap. Registers with the same color can use the same physical register.
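The reduction can be sketched in a few lines. The example below assumes each virtual register has a single live range given as a half-open interval of instruction indices, builds the interference graph and colors it with a simple greedy heuristic. Greedy coloring is not optimal in general and is not the algorithm LLVM uses. It only illustrates the modeling. The register names and ranges are made up.

```python
def build_interference(live_ranges):
    """live_ranges maps a virtual register to (start, end), end exclusive."""
    names = sorted(live_ranges)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = live_ranges[a]
            b_start, b_end = live_ranges[b]
            if a_start < b_end and b_start < a_end:  # the lifetimes overlap
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Assign each node the smallest color not used by its colored neighbors."""
    colors = {}
    # Heuristic: visit the most constrained (highest degree) nodes first.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[n] for n in graph[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


live_ranges = {"%1": (0, 4), "%2": (1, 3), "%3": (3, 6), "%4": (5, 8)}
graph = build_interference(live_ranges)
colors = greedy_coloring(graph)
print(colors)  # each color stands for one physical register
```

In this example two colors suffice, so the four virtual registers fit in two physical ones.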

During register allocation, phi-expressions are “expanded” so they become valid instructions:

L1:
  x1 = 1
  goto L3
L2:
  x2 = 2
  goto L3
L3:
  x3 = φ(x1, x2)

Becomes

L1:
  x1 = 1
  mov x3, x1
  goto L3
L2:
  x2 = 2
  mov x3, x2
  goto L3
L3:

Noting that this requires “pushing” the mov instructions into the respective branches. (I wonder how this is actually implemented in practice).
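I don’t know how LLVM implements this, but a naive version is easy to sketch if each φ operand is tagged with the block it comes from (LLVM’s phi instruction carries such value/label pairs). The code below assumes a toy representation where a block is a list of tuples and the last instruction of a block may be a `goto`. It inserts the copy right before that terminator. This simple approach is known to need extra care in some control-flow shapes, so treat it as an illustration only.

```python
def eliminate_phis(blocks):
    """Replace each phi with mov instructions at the end of its predecessors.

    blocks maps a label to a list of tuples such as ("const", dst, value),
    ("goto", label) or ("phi", dst, {predecessor label: source}).
    """
    result = {label: list(instructions) for label, instructions in blocks.items()}
    for label, instructions in blocks.items():
        for instruction in instructions:
            if instruction[0] != "phi":
                continue
            _, dst, incoming = instruction
            for predecessor, src in incoming.items():
                body = result[predecessor]
                # Insert the copy just before the terminator, if there is one.
                ends_with_goto = bool(body) and body[-1][0] == "goto"
                position = len(body) - 1 if ends_with_goto else len(body)
                body.insert(position, ("mov", dst, src))
        result[label] = [i for i in result[label] if i[0] != "phi"]
    return result


blocks = {
    "L1": [("const", "x1", 1), ("goto", "L3")],
    "L2": [("const", "x2", 2), ("goto", "L3")],
    "L3": [("phi", "x3", {"L1": "x1", "L2": "x2"})],
}
lowered = eliminate_phis(blocks)
for label, instructions in lowered.items():
    print(label, instructions)
```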

Before the register allocation step is run, LLVM does register coalescing, a process in which unnecessary register copies are removed. For example:

mov r1, r2
add r3, r1, r4

If r1 is never used again after the add instruction we can directly use r2 there and avoid a copy:

add r3, r2, r4

The register allocation process relabels the virtual registers, but the actual assignment to physical ones is its own step, the virtual register rewrite. In the coloring process it’s possible to end up with identity copying (e.g. mov x1, x1) which could not have been coalesced beforehand. This last step removes them as well.

Post Instruction Scheduling

The book doesn’t say much about this step except that it takes in a MachineInstr as input (the pre instruction scheduling takes in a SelectionDAG).

I’m guessing that after registers are allocated and instructions are added and deleted there’s a need to re-schedule these instructions.

Code Emission

This last step consists of converting an instance of MachineInstr into either an assembly instruction or object code. One interesting detail is that this step uses streaming, meaning one instruction is processed at a time, suggesting this step is a simple 1:1 mapping.

The Just-in-Time Compiler

LLVM offers a JIT compiler that operates at a function granularity, i.e. if a function is invoked, the compiler will process the entire function body. An alternative is trace granularity: only processing specific code paths of a function (e.g. a branch).

Function-based granularity is slower but supports better optimizations.

The Clang Static Analyzer

The static analyzer is essentially a linter. It relies on a symbolic execution engine which in theory computes all possible paths a program can take, but it might run forever due to combinatorial explosion. In practice it employs heuristics and limits how many paths it actually searches.

The following example is provided by the book:

#include <stdio.h>
void f(int x) {
  int y;
  if (x) {
    y = 5;
  }
  if (!x) {
    printf("%d\n", y);
  }
}

Here y might be uninitialized depending on the value of x. It compiles without warnings but we can run the analyzer:

$ clang --analyze -Xanalyzer -analyzer-checker=core bug.cpp
bug.cpp:8:5: warning: 2nd function call argument is an uninitialized value [core.CallAndMessage]
    8 |     printf("%d\n", y);
      |     ^~~~~~~~~~~~~~~~~
1 warning generated.

The static analyzer can be extended with custom checkers. It works via a visitor pattern: by subclassing a checker class and implementing visitors for specific AST nodes.

Clang Tools

The other non-compiling features LLVM/Clang provides are code-related tools. One of them is a linter called clang-tidy. It also provides a tool for modernizing C++ code (for example converting)

A particularly relevant tool is clang-query, a CLI for testing AST matchers, which is useful when writing code refactoring tools. The book explains how to write one such tool.

Conclusion

Overall the book is very well written and detailed. As I mentioned at the start, my main goal was to better understand how LLVM worked so I didn’t find the code details very useful.

I did learn a few things about LLVM:

The frontend and the backend architecture.
LLVM IR can be represented on disk as bitcode or LLVM assembly.
LLVM relies on modular libraries which can be used by applications, and they can serialize their outputs to disk for debugging.
LLVM started as a virtual machine but it’s not used as such anymore.
LLVM has a JIT compiler.
Clang is the frontend for C++, but can be used as a driver to not only carry out the full compilation (i.e. invoke the backend) but also the linking step.
rustc uses the LLVM backend but not the frontend.

It was also a good refresher on compilers, which I studied in college a while back.

The General Form of Cauchy’s Theorem

2025-03-15T00:00:00+00:00

This is a post with my notes on complex integration, corresponding to Chapter 4 in Ahlfors’ Complex Analysis.

We previously studied the Cauchy Integral Theorem, which says that integrating a holomorphic function $f(z)$ over a closed curve results in 0. The results we obtained only applied if the domain of $f(z)$ is a disk or a rectangle.

In this post we want to generalize that result for other domains.

Chains and Cycles

We define a chain as a “union” of arcs. Figure 1 shows an example. More formally we have:

Figure 1. A chain where each arc has a different color. The blue and pink arcs overlap and cancel each other.

Definition. A chain is a collection of arcs $\gamma_1, \gamma_2, \cdots, \gamma_n$, denoted by $\gamma_1 + \gamma_2 + \cdots + \gamma_n$, such that

\[\int_{\gamma_1 + \gamma_2 + \cdots + \gamma_n} f(z)dz = \int_{\gamma_1} f(z)dz + \int_{\gamma_2} f(z)dz + \cdots + \int_{\gamma_n} f(z)dz\]

Note that $\gamma_1, \gamma_2, \cdots, \gamma_n$ need not be connected. This notation is not particularly compact, nor is the definition illuminating. However, the concept of equivalence of chains is useful.

Definition. We say that two chains $C_1$ and $C_2$ are equivalent if:

\[\int_{C_1} f(z)dz = \int_{C_2} f(z)dz\]

for all functions $f$.

Here are some operations that preserve equivalence of chains:

Permuting two arcs
Subdividing an arc into multiple ones
Merging multiple subarcs into one
Removal of opposite arcs (i.e. cancelling them out)

A chain can also have multiple copies of the same arc, which we can denote via an appropriate coefficient, so a general form of a chain $\gamma$ is:

\[\gamma = \alpha_1 \gamma_1 + \alpha_2 \gamma_2 + \cdots + \alpha_n \gamma_n\]

for non-negative integers $\alpha_i$ (0 might be useful for ease of notation) and distinct $\gamma_i$.
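The additivity in the definition can be checked numerically. The following is a minimal sketch, assuming arcs are parametrized over $[0, 1]$ and integrals are approximated with the midpoint rule. The type aliases and the helpers `integrateArc()`, `integrateChain()` and `segment()` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <complex>
#include <functional>
#include <utility>
#include <vector>

using Complex = std::complex<double>;
// An arc is a parametrization gamma: [0, 1] -> C.
using Arc = std::function<Complex(double)>;
using Function = std::function<Complex(Complex)>;
// A chain alpha_1 gamma_1 + ... + alpha_n gamma_n as (coefficient, arc) pairs.
using Chain = std::vector<std::pair<int, Arc>>;

// Approximates the integral of f along one arc with the midpoint rule.
Complex integrateArc(const Function& f, const Arc& arc, int steps = 2000) {
  assert(steps > 0);
  Complex total = 0.0;
  for (int k = 0; k < steps; ++k) {
    const double t0 = static_cast<double>(k) / steps;
    const double t1 = static_cast<double>(k + 1) / steps;
    total += f(arc(0.5 * (t0 + t1))) * (arc(t1) - arc(t0));
  }
  return total;
}

// The integral over a chain is the weighted sum of the integrals over its arcs.
Complex integrateChain(const Function& f, const Chain& chain) {
  Complex total = 0.0;
  for (const auto& [alpha, arc] : chain) {
    total += static_cast<double>(alpha) * integrateArc(f, arc);
  }
  return total;
}

// Straight segment from a to b. Swapping a and b gives the opposite arc.
Arc segment(Complex a, Complex b) {
  return [a, b](double t) { return a + t * (b - a); };
}
```

For example, a chain made of `segment(a, b)` and `segment(b, a)` integrates to approximately zero for any `f`, which is the "removal of opposite arcs" operation listed above.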

A cycle is analogous to a chain, except that the arcs must be closed curves.

Definition. A cycle is a chain in which all its arcs are closed curves.

Another way to define a cycle is as a chain in which each of its arcs has coinciding start and end points.

Simple Connectivity

A region is simply connected if it doesn’t have holes. More formally:

Definition. A region is simply connected if its complement with respect to the extended plane is connected.

Note the “with respect to the extended plane”, which means the complement always includes the point at infinity. This lets us consider a parallel strip as simply connected: in the unextended plane its complement consists of two disjoint half-planes, but in the extended plane they are joined at $\infty$.

We can provide an alternative characterization of simple connectivity via winding numbers:

Theorem 1. A region $\Omega$ is simply connected if and only if $n(\gamma, a) = 0$ for all cycles $\gamma$ in $\Omega$ and all points $a$ not in $\Omega$.

Let $H_1$ be the statement "A region $\Omega$ is simply connected" and $H_2$ be "$n(\gamma, a) = 0$ for all cycles $\gamma$ in $\Omega$ and all points $a$ not in $\Omega$". We wish to prove $H_1 \iff H_2$.

First we prove that $H_1 \implies H_2$. Assume $H_1$, which means that the complement of $\Omega$ is connected. Let $\gamma$ be a cycle in $\Omega$ and consider the regions defined by it. Let $\Gamma$ be the one containing $\infty$. Corollary 5 in [5] shows that $n(\gamma, a) = 0$ for $a \in \Gamma$. The complement of $\Omega$ is contained in $\Gamma$, so we arrive at statement $H_2$.

We now prove that $H_2 \implies H_1$ via the contrapositive ($\neg H_1 \implies \neg H_2$). This is almost obvious: if a region is not simply connected it contains a hole. There exists a cycle in $\Omega$ surrounding the hole and for any point $a$ in the hole we have $n(\gamma, a) = 1$ which implies $\neg H_2$ is true. But we need to provide a more precise way to find such a cycle.

More formally, if the complement of $\Omega$ is not connected, it has multiple components, one of them containing $\infty$. Let $A$ be one of the components not containing $\infty$ and $B$ the union of the other components. Let $\delta \gt 0$ be the shortest distance between a point in $A$ and $B$. We then tessellate the plane with a net of squares $Q$ with side $\lt \delta / 2$. Let $a$ be any point in $A$. We choose the tessellation so that $a$ lies in the center of a square. Notice that by our choice of the square size, a given square cannot contain points from both $A$ and $B$.

Let $\partial Q$ be the curve corresponding to the boundary of the square $Q$ and oriented counter-clockwise. Consider the set of indices $S$ of the squares that contain at least one point of $A$. We define the curve: $$\gamma = \sum_{j \in S} \partial Q_j$$ Because $A$ is a component, the internal sides of the squares cancel out and we're left with a curve that is a polygon with orthogonal sides.

Figure 1.1. A region with a hole. The green polygonal curve is contained within the region and wraps around the hole exactly once.

Since exactly one square in $S$ must contain $a$ by construction, we have that $n(\gamma, a) = 1$. As we claimed, a given square cannot contain points from both $A$ and $B$, so no square in $S$ intersects $B$, and neither does $\gamma$.

We also claim that no point in $\gamma$ belongs to $A$: if it did, it would have to exist on the boundary of at least two squares and they would have cancelled out. This means that $\gamma$ is contained entirely within $\Omega$.

Homology

Definition. A cycle $\gamma$ in an open set $\Omega$ is homologous to zero with respect to $\Omega$, denoted by $\gamma \sim 0 \, (\mbox{mod } \Omega)$, if $n(\gamma, a) = 0$ for all points $a$ in the complement of $\Omega$.

We can simply say $\gamma \sim 0$ if the open set $\Omega$ is implied from context. We can also define the notation $\gamma_1 \sim \gamma_2$ to mean $\gamma_1 - \gamma_2 \sim 0$.

Note that if $\Omega$ is a simply connected region, by Theorem 1 all cycles in it are homologous to zero. In this sense homology is a more general property of cycles, which will allow us to extend some results to non-simply connected regions.
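To see a cycle that is not homologous to zero, here is a small numerical sketch. It assumes the annulus $1 \lt \abs{z} \lt 3$ as $\Omega$ and the circle $\abs{z} = 2$ as $\gamma$ (both chosen just for illustration), and approximates the winding number from the total change of argument along a sampled polygon. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Samples the circle |z - center| = radius counter-clockwise as a closed polygon.
std::vector<Complex> sampleCircle(Complex center, double radius, int samples = 360) {
  const double twoPi = 2.0 * std::acos(-1.0);
  std::vector<Complex> points;
  for (int k = 0; k < samples; ++k) {
    points.push_back(center + std::polar(radius, twoPi * k / samples));
  }
  return points;
}

// Winding number of a closed polygon around a, computed by summing the change
// of argument over each edge. Assumes a is not on the curve and that the
// sampling is dense enough for each edge to turn by less than pi around a.
int windingNumber(const std::vector<Complex>& closedCurve, Complex a) {
  assert(!closedCurve.empty());
  const double twoPi = 2.0 * std::acos(-1.0);
  double totalAngle = 0.0;
  for (std::size_t k = 0; k < closedCurve.size(); ++k) {
    const Complex from = closedCurve[k] - a;
    const Complex to = closedCurve[(k + 1) % closedCurve.size()] - a;
    totalAngle += std::arg(to / from);
  }
  return static_cast<int>(std::lround(totalAngle / twoPi));
}
```

With `gamma = sampleCircle(0.0, 2.0)`, the point $0$ lies in the hole of the annulus and `windingNumber(gamma, 0.0)` is 1, so this $\gamma$ is not homologous to zero with respect to the annulus. For a point outside, such as $4$, the winding number is 0.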

Cauchy Integral Theorem

We’re ready to state the first generalization of Cauchy Integral Theorem:

Theorem 2. If $f(z)$ is holomorphic in $\Omega$, then

\[(1) \quad \int_\gamma f(z) dz = 0\]

For any cycle $\gamma$ satisfying $\gamma \sim 0 \, (\mbox{mod } \Omega)$.

Notice that in this result we don’t assume anything about $\Omega$. However, if it’s a simply connected region, then all cycles in it are homologous to zero, so we can claim that:

Corollary 3. If $f(z)$ is holomorphic in a simply connected region $\Omega$, then

\[\int_\gamma f(z) dz = 0\]

For any cycle $\gamma$ in $\Omega$.

So until now, we knew that $(1)$ held as long as $\Omega$ was a disk or rectangle, but now we have relaxed the condition to any simply connected region!

This is the main result we wanted to prove in this post. We now consider some variants and further generalizations.

Locally Exact Differential

In our post Path-Independent Line Integrals [3], we said that a line integral $\int_\gamma f(z)dz$ can be expressed as a function of its real and imaginary parts: $\int_\gamma p(x, y)dx + q(x, y)dy$ or $\int_\gamma pdx + qdy$ for short. This form of the integrand is defined as a differential form. In this case, we can call its integrand a differential.

The differential $p dx + q dy$ is called an exact differential in $\Omega$ if there exists a function $U(x, y): \Omega \rightarrow \mathbb{R}$ such that $\partial U(x, y)/\partial x = p(x, y)$ and $\partial U(x, y)/\partial y = q(x, y)$ or more concisely, that $dU = pdx + qdy$ [3].

Definition. A differential $p dx + q dy$ is called a locally exact differential if it’s exact in some neighborhood of every point in the domain $\Omega$.

More precisely, suppose $p dx + q dy$ is a locally exact differential. Then for each $a \in \Omega$, there must exist a neighborhood $N(a)$ such that there exists $U(x, y)$ with $dU = pdx + qdy$ for each $(x, y)$ in $N(a)$.

Note that being an exact differential is a stricter condition than being a locally exact differential. In an exact differential, a single $U(x, y)$ with $dU = pdx + qdy$ must work for all points in its domain. Thus an exact differential is also a locally exact differential.

We’ve seen in Theorem 1 in [3] that the integral of an exact differential, $\int_\gamma p dx + q dy$, only depends on the endpoints of $\gamma$. This means that if $\gamma$ is a closed curve, we can split it into $\gamma = \gamma_1 + \gamma_2$ where $\gamma_1$ and $\gamma_2$ share endpoints. Thus we have that $\int_{\gamma_1} p dx + q dy = -\int_{\gamma_2} p dx + q dy$ implying that $\int_\gamma p dx + q dy = 0$. In other words, exact differentials satisfy $(1)$ without any specific constraints on the domain, and $f$ doesn’t need to be holomorphic either.
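
A standard example of a differential that is locally exact but not exact (it is not taken from [3], I'm adding it for illustration) is

\[\omega = \frac{-y \, dx + x \, dy}{x^2 + y^2}\]

in the punctured plane $\Omega = \mathbb{C} \setminus \{0\}$. Around any point of $\Omega$ we can pick a small disk not containing the origin and a continuous branch of the argument $\theta(x, y)$ in it, for which $d\theta = \omega$, so $\omega$ is locally exact. It is not exact in $\Omega$ though: parametrizing the unit circle $C$ by $x = \cos t$, $y = \sin t$ for $0 \le t \le 2\pi$ gives

\[\int_C \omega = \int_0^{2\pi} (\sin^2 t + \cos^2 t) \, dt = 2\pi \ne 0\]

whereas the integral of an exact differential over a closed curve must be zero.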

We can obtain an analogous result for locally exact differentials:

Theorem 4. If $p dx + q dy$ is a locally exact differential in $\Omega$, then

\[\int_{\gamma} p dx + q dy = 0\]

For every cycle $\gamma \sim 0$ in $\Omega$.

So in [3] we showed that an exact differential satisfies $(1)$ if $\gamma$ happens to be a Jordan curve. Now we have relaxed the conditions: the differential only needs to be locally exact, and $\gamma$ can be any cycle homologous to $0$ (which is more general than a Jordan curve).

Multiply Connected Regions

Now we consider regions with holes. The idea is to decompose the problem and express the integral as a linear combination of one integral per hole.

A region that is not simply connected is multiply connected, in other words, it contains holes. Here we restrict ourselves to finite connectivity, where the number of holes is finite.

A precise definition of finite connectivity is that the complement of a region $\Omega$ with respect to the extended plane consists of $n$ components $A_1, \cdots, A_n$, where by convention $A_n$ is the one “external” to $\Omega$, the one containing $\infty$.

Consider a cycle $\gamma$ in $\Omega$. As we proved in [5], points within the same component $A_i$ have the same winding number with respect to $\gamma$. Intuitively, since $\gamma$ cannot cut through $A_i$, it must wind around it, and thus around each of its points, the same number of times. We can thus associate a winding number to each of the components $A_i$, and call them $c_i$.

We can then decompose the cycle into multiple closed curves, such that each one winds around a component at most once. See Figure 2 for an example. If a closed curve $\delta$ does not wind around any region, then we have $n(\delta, a) = 0$ for all points in the complement of $\Omega$ and thus $\delta \sim 0$.

Figure 2. (left) A cycle in a region with two holes with points of self-intersection in red, which allows us to decompose it into simpler closed curves surrounding each hole at most once (right).

If $\delta$ does wind around a region $A_i$, it does so exactly once, because otherwise it would self-intersect and we would have split it into multiple curves. We can choose a “canonical” closed curve for each $A_i$, which we call $\gamma_i$, that winds around it exactly once. Let $a$ be a point in $A_i$, we then have $n(\delta, a) = n(\gamma_i, a)$ or that $\delta - \gamma_i \sim 0$.

So for each closed curve in the original $\gamma$, we can find a corresponding $\gamma_i$ to “neutralize” it, so we arrive at:

\[\gamma - \sum_{i = 1}^{n-1} c_i \gamma_i \sim 0\]

Note that we don’t have $\gamma_n$ because that’s the “external” region. Apply Theorem 2 to this curve:

\[\int_{\gamma - \sum_{i = 1}^{n-1} c_i \gamma_i} f(z)dz = 0\]

Splitting into separate integrals (from the definition of chains):

\[\int_{\gamma} f(z)dz - \sum_{i = 1}^{n-1} c_i \left(\int_{\gamma_i} f(z)dz \right) = 0\]

or that

\[\int_{\gamma} f(z)dz = \sum_{i = 1}^{n-1} c_i \left(\int_{\gamma_i} f(z)dz \right)\]

This allows us to compute the integral over a complicated curve from a linear combination of the integrals over simpler curves. Let’s define

\[P_i = \int_{\gamma_i} f(z)dz\]

as the module of periodicity of $f dz$. We note that $\gamma_i$ is dependent only on $\Omega$ (more specifically the region $A_i$ of its complement) and not on the curve $\gamma$ and thus $P_i$ is only dependent on $f$ and $\Omega$.

Example

Let’s look at an example of the application of the result above. Consider the region $\Omega$ defined as

\[r_1 \lt \abs{z} \lt r_2\]

that is, an annulus. The complement of this region is $A_1: \abs{z} \le r_1$ and $A_2: \abs{z} \ge r_2$. A possible canonical closed curve winding around $A_1$ is $C: \abs{z} = r_1 + \epsilon$ for some $\epsilon \gt 0$.

From our results above, we have

\[\int_{\gamma} f(z)dz = c_1 \int_C f(z) dz\]

Where $c_1$ is the number of times a given cycle $\gamma$ winds around $A_1$.
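
For instance, taking $f(z) = 1/z$, which is holomorphic in the annulus, the module of periodicity is the familiar

\[P_1 = \int_C \frac{dz}{z} = 2 \pi i\]

so for any cycle $\gamma$ in $\Omega$ we get

\[\int_{\gamma} \frac{dz}{z} = 2 \pi i \, c_1\]

On the other hand, $f(z) = 1/z^2$ has the antiderivative $-1/z$ in $\Omega$, so $P_1 = 0$ and its integral over every cycle in the annulus is zero.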

Conclusion

In this post we learned about a generalization of the Cauchy Integral Theorem for other domains, in particular simply connected regions.

We went a step further and generalized the theorem for the so-called curves that are homologous to 0. We also generalized the results from path independent integrals [3] for this type of curve.

We finally considered multiply connected regions (those with holes) and found a way to make any cycle homologous to zero via modules of periodicity and that allowed us to apply the general form of the Cauchy Integral Theorem. This in turn led to a formula for computing the integral of $f(z)$ over an arbitrary cycle as a linear combination of the modules of periodicity of $f(z)$.

The modules of periodicity will be useful in our next topic of study, The Calculus of Residues.

References

[1] Complex Analysis - Lars V. Ahlfors
[2] NP-Incompleteness: Cauchy Integral Theorem
[3] NP-Incompleteness: Path-Independent Line Integrals
[4] NP-Incompleteness: The Open Mapping Theorem
[5] NP-Incompleteness: The Winding Number

Computer History Museum

2025-02-07T00:00:00+00:00

I’ve visited the Computer History Museum in Mountain View, California several times over the years. After my last visit, I decided to write about it.

The museum is organized in multiple sections, ordered more or less chronologically. For example, the first sections cover primitive computing machines (physical calculators, punch cards), then come analog computers, mainframes, super computers, personal computers, computer games, mobile devices and finally the web.

In between there are sections dedicated to storage, computer graphics, AI and robotics.

Gallery

At the entrance there’s the early version of Waymo’s self-driving car:

The early version of Waymo's self-driving car (photo taken in 2025).

The first section is about mechanical computing devices. It showcases Napier’s bones, a device from the 17th century for performing multiplication of large numbers. It was invented by and named after the Scottish John Napier, who also discovered (among others) the logarithm.

Napier's Bones (photo taken in 2024).

The Curta calculator is a neat device from 1948, developed by Curt Herzstark. This device also has an intriguing backstory: Curt was an Austrian of partial Jewish ancestry who was sent to a concentration camp, but the Nazis spared his life because he was known to be working on this device. The workings of the Curta are well described in this short Numberphile video.

Curta Calculator (photo taken in 2024).

The Jacquard Machine is a machine that can be attached to a loom to manufacture patterns specified by a punchcard. By itself it is unrelated to computers, but the idea of using punchcards as input inspired early computers.

Jacquard fabric sample, 19th century (photo taken in 2024).

Herman Hollerith developed an electromechanical machine to puncture holes in paper cards for the US Census. I had never heard of this machine, but growing up in Brazil, hollerith (spelled holerite in Portuguese) was what we called paper slips.

Hollerith (photo taken in 2024).

Another cool application of puncturing holes on paper is for data visualization. This Atlas from the Botanical society used punch card technology for distribution maps of species:

The Botanical Society of the British Isles Atlas of the British Flora (photo taken in 2024).

ENIAC, from 1945, stands for Electronic Numerical Integrator and Computer and is considered the first general-purpose digital computer. The museum has parts of it on display.

ENIAC. This photo has a lot of glare, but we can see the knobs and also the vacuum tubes from behind (photo taken in 2024).

The Enigma machine was used by the German military during WWII. The messages were decoded by the Allies with the help of Polish mathematicians. The deciphering of the messages is credited with the substantial shortening of the war.

Enigma machine (photo taken in 2024).

The IBM 305 RAMAC, standing for Random Access Method of Accounting and Control, is considered the first hard-disk drive and consisted of fifty 24-inch disks, adding up to a capacity of 3.75MB!

The IBM 305 RAMAC (photo taken in 2025).

The IBM System/360 was a mainframe first released in 1964. The museum has the Model 30 on display. These mainframes are used as case studies in the book The Mythical Man-Month, since the author, Fred Brooks, was a project manager for this system.

IBM System/360 Model 30 (photo taken in 2024).

The Cray-1 was a super-computer designed by Seymour Cray and released in 1975. It was capable of 160MFLOPS (for comparison, nowadays a personal computer such as an Apple Macbook can do up to 320GFLOPS). It cost about 8 million USD in 1977 (equivalent to 40 million USD in 2023).

Cray-1 (photo taken in 2024).

The Utah teapot is a very famous 3D object in Computer Graphics. The museum has the original, which was bought by Martin Newell in a Utah department store.

The Utah Teapot (photo taken in 2024).

The company Psion started as a software company but later entered the hardware market with the Psion Organizer. The Psion Series 3 Organizer is on display at the museum. The OS for the Psion Organizer was called EPOC and it was eventually renamed to Symbian, being adopted by many early smartphones.

I find it pretty interesting to see how things played out, with Symbian eventually losing to Android as the “generic” OS, and how things could have turned out differently.

The Psion Series 3 Organizer (photo taken in 2025).

There’s a section of the museum on the opposite side of the reception which I think I have missed in all my previous visits. It has a gallery of random exhibits on technology, including more recent games such as World of Warcraft.

I was an avid (to put it nicely) player of Blizzard games such as Diablo, Starcraft and Warcraft (including DotA), but never got into World of Warcraft (luckily I guess). The museum has the actual hardware from a World of Warcraft server.

The hardware powering one of the servers of World of Warcraft (photo taken in 2025).

This section also has a copy of the robot Ameca, which I recall seeing online a few years back. It’s able to understand speech pretty well.

Ameca (photo taken in 2025).

Conclusion

I believe this is my first post on what I call Nerd Tourism. I love visiting museums but most of them don’t fit in the usual theme of my blog, so I tend to document them in https://www.kuniga.me/amuseum/ instead.

I have visited some science museums such as the Lawrence Berkeley National Laboratory, the MIT Museum, Museum of Science and Industry in Chicago, The Tech Interactive in San Jose, the Exploratorium and the California Academy of Sciences in San Francisco. Each could be worth a post, but I never took the time to write one.

I also have other nerd spots in my to-go list, mainly in Europe. I hope to visit and post about them some day.

Vector Views in C++

2025-01-25T00:00:00+00:00

A view can be thought of as an object that is derived from another without representing it explicitly.

In programming, a classic example is a string view: it can be used to represent a substring of a string without actually storing the whole substring: it only needs two indexes representing the start and end of the interval.

Another classic example are views in SQL. A view represents a table but it doesn’t actually store the rows explicitly. It’s a query to another table and it can be materialized on demand.

In this post I’d like to explore views but for std::vector.

Context

Recently at work I wrote a function that takes in a vector of a given type, then groups the entries by some key. For example, suppose we have a class Person:

struct Person {
  std::string name;
  int age;
};

and that we want to group them by age first:

std::unordered_map<int, std::vector<Person>> peopleByAge;
for (const auto &person : people) {
    peopleByAge[person.age].emplace_back(person);
}

so that we can process them grouped by age:

void processPeopleForAge(const std::vector<Person>& people, int age);

for (const auto& [age, people] : peopleByAge) {
    processPeopleForAge(people, age);
}

It was then pointed out to me that this code is inefficient, because Person is copied when assigning to the std::vector inside std::unordered_map.

Vector of references

My first attempt was to turn std::unordered_map<int, std::vector<Person>> into std::unordered_map<int, std::vector<Person&>>, i.e. have the Person inside the inner vector be a reference.

It turns out std::vector does not allow references. The reason is that it requires its elements to be copyable and assignable.

A vector stores its data in a contiguous segment of memory. It starts with a pre-allocated segment, but if it grows beyond this initial size, it must move to a bigger segment of memory.

When it does so, it actually needs to move or copy the elements over and it will do so using the copy/move constructors. So adding elements to a vector or resizing it can actually cause the constructor of its existing elements to be called! I’ve been working with C++ for many years and never realized that! In my mind I thought std::vector would simply do memcpy to a new destination, to copy the bytes as is.

To reduce these extraneous copies, we can explicitly reserve a size if we have a good estimate of the vector size via .reserve().
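To see this effect in isolation, here is a minimal sketch that counts copy-constructor calls while filling a vector, with and without `.reserve()`. The `Tracked` type and `copiesWhileFilling()` are hypothetical names made up for this experiment. Since `Tracked` declares a copy constructor, it has no implicit move constructor, so any reallocation has to copy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct Tracked {
  static inline int copies = 0;
  Tracked() = default;
  Tracked(const Tracked&) { ++copies; }
  Tracked& operator=(const Tracked&) = default;
};

// Appends n elements constructed in place (so insertions themselves do not
// copy) and returns the number of copies caused by reallocations.
int copiesWhileFilling(std::size_t n, bool reserveFirst) {
  Tracked::copies = 0;
  std::vector<Tracked> v;
  if (reserveFirst) {
    v.reserve(n);
  }
  for (std::size_t i = 0; i < n; ++i) {
    v.emplace_back();
  }
  return Tracked::copies;
}
```

With the reservation, `copiesWhileFilling(100, true)` reports no copies at all. Without it the count is positive, and the exact number depends on the growth strategy of the standard library implementation.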

The need for it to be assignable is because when we do vec[i] = x, the existing object at index i would be re-assigned a different value.

Using `std::reference_wrapper`

A way to make a reference into an actual copyable and assignable object is to wrap it into one. This is essentially what std::reference_wrapper does. The key to turning a reference into something we can manipulate is converting it into a pointer!

A pointer is just an integer, so it can be easily copied and assigned to. Here’s a simple version of std::reference_wrapper:

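template <typename T>
struct reference_wrapper {
  // Store the address of the referenced object.
  reference_wrapper(T& t) : value_(std::addressof(t)) {}

  // const, so it can also be called on wrappers stored in a const vector.
  T& get() const { return *value_; }

  T* value_;
};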

The expression std::addressof() is the one “converting” from a reference to a raw pointer. To get a reference back to the original variable we can first dereference the pointer and then return a reference.

This enables us to have a vector of references in our original example:

std::unordered_map<
  int,
  std::vector<reference_wrapper<Person>>
> peopleByAge;

for (auto &person : people) { // non-const: reference_wrapper<Person> needs a Person&
  peopleByAge[person.age].emplace_back(person);
}

With the caveat that now processPeopleForAge() will need to take in these references and call .get() on them.

void processPeopleForAge(
  const std::vector<reference_wrapper<Person>>& people,
  int age
);

The STL implementation of reference_wrapper also has this interesting operator:

operator T&() {
  return *value_;
}

Which is invoked when we assign that class to a reference type, so it’s just a syntax sugar for calling .get():

Person person;
reference_wrapper<Person> p(person);

Person& pRef1 = p.get();
Person& pRef2 = p; // calls that operator
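
Putting it together with the standard library version, here is a minimal sketch of the grouping using `std::reference_wrapper` from `<functional>`. The `PersonRef` alias and the `groupByAge()` function are names I'm making up for this example. It uses `const Person` so that it also works when the input vector is const.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Person {
  std::string name;
  int age;
};

using PersonRef = std::reference_wrapper<const Person>;

// Groups people by age without copying any Person. The result refers into
// `people`, so `people` must outlive it and must not be resized meanwhile.
std::unordered_map<int, std::vector<PersonRef>> groupByAge(
    const std::vector<Person>& people) {
  std::unordered_map<int, std::vector<PersonRef>> peopleByAge;
  for (const auto& person : people) {
    peopleByAge[person.age].push_back(std::cref(person));
  }
  return peopleByAge;
}
```

Each entry in a group is the size of a pointer, and `group[i].get()` gives back a reference to the original element of `people`.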

Using Raw Pointers

Another option is to use raw pointers instead of reference_wrapper, since that’s what it does under the hood anyway. One advantage of reference_wrapper, besides being more readable, is that it cannot be null (even when it’s moved away from).

Vector View

Instead of having each element be a reference, I was wondering if it would be helpful to have the concept of a vector view, that is, a data structure that can represent a subset of a vector without incurring copies.

Span

A special case of this already exists via the C++20 structure std::span. It represents a contiguous range and can be used as a view for a vector, for example:

#include <iostream>
#include <span>
#include <vector>

std::vector<int> vec = {1, 2, 3, 4, 5};
std::span<int> subVec(vec.begin() + 1, vec.begin() + 3);
for (auto& x : subVec) {
  std::cout << x << std::endl;
}

The span doesn’t copy the elements, but only works with contiguous intervals, so it wouldn’t be useful for the case I had.

Custom Class

We can implement our own version of a vector view by storing a reference to the original object and another vector of indices:

template <typename T>
class vector_view {
public:
  using TData = typename std::vector<T>::iterator;

  vector_view() = default;

  vector_view(TData data) : data_(data) {}

  // Adds the element at position `index` of the underlying vector to the view.
  void push_back(size_t index) {
    indices_.push_back(index);
  }

  // Returns the index-th element of the view (not of the underlying vector).
  T& operator[](size_t index) const {
    return data_[indices_[index]];
  }

  size_t size() const {
    return indices_.size();
  }

private:
  std::vector<size_t> indices_;
  TData data_;
};

Then our code could be changed to:

std::unordered_map<int, vector_view<Person>> peopleByAge;

for (size_t i = 0; i < people.size(); i++) {
  Person& person = people[i];
  if (!peopleByAge.contains(person.age)) {
    peopleByAge[person.age] = vector_view<Person>(people.begin());
  }
  peopleByAge[person.age].push_back(i);
}

It feels pretty clunky that the view starts uninitialized and that we have to check for it inside the loop. Another undesired semantic is that the elements can appear in a different order inside the vector_view, which seems wrong: a vector is an ordered list of elements, so we’d expect a subset to preserve the relative order.

Another implementation could be to use a bit vector to indicate the presence of the element in the view, but this could be pretty wasteful and inefficient if the vector view was very sparse. Yet another approach would be to keep the indices sorted, which has its own set of downsides.

For this particular case we might encapsulate this use into its own structure, a map of vector views. It takes the original vector and a function to compute the key:

template <typename T, typename K>
class vector_partition {
public:
  vector_partition(std::vector<T>& data, std::function<K(T&)> getKey) {
    for (size_t i = 0; i < data.size(); ++i) {
      K key = getKey(data[i]);
      if (!dataByKey_.contains(key)) {
        dataByKey_[key] = vector_view<T>(data.begin());
      }
      dataByKey_[key].push_back(i);
    }
  }

  const vector_view<T>& operator[] (K key) {
    return dataByKey_[key];
  }

private:
  std::unordered_map<K, vector_view<T>> dataByKey_;
};

This avoids the issues with using vector_view directly since we can guarantee internally that indices preserve relative order and we hide the ugly initialization inside the loop. It has the obvious downside of having a much narrower application. The code using it would be very simple however:

vector_partition<Person, int> peopleByAge(
  people,
  [](Person& p) {
    return p.age;
  }
);

Conclusion

When I started looking into how to solve my original problem, i.e. avoid copies when grouping a vector by a key, I imagined there would be an existing data structure that would model this use case neatly.

It turns out there isn’t one, and after trying my hand at coming up with one, I arrived at implementations that are too clunky or too narrow to be useful. In the end, I went with std::vector<std::reference_wrapper<Person>>.

It was a fun exercise though, and I learned things I feel like I should have known! One lesson learned for me is that .reserve() is much more important than I thought!

The Maximum Principle

2025-01-18T00:00:00+00:00

This is a post with my notes on complex integration, corresponding to Chapter 4 in Ahlfors’ Complex Analysis.

In today’s post we’ll go over the Maximum Principle in Complex Analysis which states that a non-constant holomorphic function over an open set does not have a maximum value.

The maximum principle is one of those results that is counter-intuitive because of infinity (see Hilbert’s Hotel). How is it possible for a set to not have a maximum element? If we picture a set as a discrete and finite collection of numbers, then it’s indeed hard to “see” it.

For infinite sets, we have to rely on contradiction to prove it. For example, with $f(x) = 1 - 1/x$, for $x \in \mathbb{R}, x \gt 0$. What is the maximum value $f(x)$ can attain?

No matter how large $x$ is, $1/x$ is never 0, so $f(x) \lt 1$. Now suppose there is $x'$ for which $f(x')$ is the maximum. It’s not hard to see that $f(x' + 1) \gt f(x')$, so we have a contradiction.
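
Spelling out that last inequality, for any $x' \gt 0$:

\[f(x' + 1) - f(x') = \left(1 - \frac{1}{x' + 1}\right) - \left(1 - \frac{1}{x'}\right) = \frac{1}{x'} - \frac{1}{x' + 1} = \frac{1}{x'(x' + 1)} \gt 0\]

So the values of $f$ get arbitrarily close to $1$ but never reach it.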

That’s why we need the concept of infimum and supremum (when minimum and maximum don’t exist).

I first learned about the max principle while studying real analysis, but it extends to complex analysis.

Formal Statement

Theorem 1. If $f(z)$ is a holomorphic and non-constant function in an open set $\Omega$, then $\abs{f(z)}$ has no maximum value.

The proof follows from the Open Mapping Theorem [2]. Since $\Omega$ is open, by the Open Mapping Theorem, $f(\Omega)$ is also open.

Now suppose there is $z'$ such that $f(z') = w'$ has maximum modulus, i.e. $\abs{w'} \ge \abs{w}$ for all $w \in f(\Omega)$. Then there exists a disk $B = \curly{\abs{w - w'} \lt \delta}$ with $\delta \gt 0$ in $f(\Omega)$, since it's open.

We can write $w'$ in polar form as $\abs{w'}e^{i\theta}$. Let $a = \abs{w'}e^{i\theta} + \epsilon e^{i\theta}$ with $0 \lt \epsilon \lt \delta$, as depicted in Figure 1. Then $a \in B \subseteq f(\Omega)$ and $\abs{a} = \abs{w'} + \epsilon \gt \abs{w'}$, which is a contradiction.

Figure 1. For every point $w'$ in an open set, we can find an open disk centered at $w'$ in that open set. Since the disk is not empty, there exists some $a$ with value (modulus) larger than $w'$.

A variant of Theorem 1, which we can state as a corollary is:

Corollary 2. If $f(z)$ is continuous in a compact set $\Omega$ (closed and bounded) and holomorphic in the interior of $\Omega$, then $\abs{f(z)}$ attains a maximum for $z$ in the boundary of $\Omega$.

First suppose that $f(z)$ is constant. Then it has a maximum everywhere, so the corollary is trivially true. So assume $f(z)$ is non-constant henceforth.

The Extreme value theorem states that a continuous function on a compact set has a maximum value.

Now we claim that this maximum value is attained for a point $z'$ at the boundary of $\Omega$. Suppose that it's not, that $z'$ is in the interior of $\Omega$. Since the interior of $\Omega$ is open, and by hypothesis $f(z)$ is holomorphic and non-constant there, this would contradict Theorem 1.

The proofs rely on the Open Mapping Theorem which we studied in the last post of the series.

Schwarz Lemma

We can use Theorem 1 and Corollary 2 to prove the Schwarz Lemma:

Theorem 3. (Schwarz Lemma) Let $f(z)$ be a holomorphic function in $\abs{z} \lt 1$, with $\abs{f(z)} \le 1$ and $f(0) = 0$. Then

\[\abs{f(z)} \le \abs{z} \quad \mbox{and} \quad \abs{f'(0)} \le 1\]

Moreover, if equality is attained in either of these inequalities, i.e., $\abs{f(z)} = \abs{z}$ for some $z \ne 0$ or $\abs{f'(0)} = 1$, then

\[f(z) = cz\]

Where $c$ is a constant with $\abs{c} = 1$.

We define the function $g(z) = f(z)/z$. Since $f(z)$ is holomorphic, $g(z)$ is also holomorphic except at $z = 0$, where it has a singularity. Luckily it is a removable one, since $\lim_{z \rightarrow 0} g(z)z = f(0)$, and $f(0) = 0$ by hypothesis.

We can do a holomorphic extension [3] of $g(z)$ to include $0$ via $g(0) = \lim_{z \rightarrow 0} g(z)$:

\[\lim_{z \rightarrow 0} g(z) = \lim_{z \rightarrow 0} \frac{f(z)}{z} = \lim_{z \rightarrow 0} \frac{f(z) - f(0)}{z - 0}\]

The last expression is the definition of $f'(0)$, so

\[g(z) = \begin{cases} f'(0), & z = 0 \\ \frac{f(z)}{z}, & z \ne 0 \end{cases}\]

is holomorphic in $\abs{z} \lt 1$.

Now consider the points on a circle of radius $0 \lt r \lt 1$ inside $\abs{z} \lt 1$. For these points, we have that $\abs{z} = r$, so:

\[\abs{g(z)} = \frac{\abs{f(z)}}{r} \le \frac{1}{r}\]

The last inequality is from the hypothesis $\abs{f(z)} \le 1$. Consider the closed disk $\abs{z} \le r$. Since $g(z)$ is holomorphic and the closed disk is a compact set, by Corollary 2 $\abs{g(z)}$ attains its maximum value at the boundary $\abs{z} = r$, so $\abs{g(z)} \le 1/r$ on the whole closed disk $\abs{z} \le r$ (including $0$).

As we increase $r$, the upper bound $1/r$ on $\abs{g(z)}$ tightens and approaches $1$, so we can make $r \rightarrow 1$ to obtain $\abs{g(z)} \le 1$ in $\abs{z} \lt 1$. For $z \ne 0$ we get $\abs{f(z)} \le \abs{z}$. For $z = 0$ we get $\abs{f'(0)} \le 1$.

If $\abs{g(z)} = 1$ for some $\abs{z} \lt 1$, i.e. $\abs{g(z)}$ attains its maximum inside the open disk, then $g(z)$ has to be constant. Otherwise it contradicts Theorem 1. Thus $g(z) = c$ for all $\abs{z} \lt 1$, where $c$ is a constant with $\abs{c} = 1$. Then $g(0) = f'(0) = c$ and $f(z) = cz$, so both $\abs{f'(0)} = 1$ and $\abs{f(z)} = \abs{z}$ hold.

One way to interpret $f(z)$ with the above conditions is that it maps the open unit disk centered at the origin into the closed unit disk centered at the origin, keeping the origin fixed.

The theorem then says that we cannot stretch the disk. If we look at the set of points with $\abs{z} = r$ in the domain, they’ll be mapped to points with $\abs{w} \le r$ in the image. We can rotate these points around the origin or move them closer to the origin, but they cannot be moved further away. This idea is captured in the post’s thumbnail.
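The "no stretching" reading can be checked numerically. The sketch below is only an illustration under assumptions: the function `f` (the identity times a Blaschke factor with a hypothetical parameter `A`) is one example satisfying the hypotheses of the Schwarz Lemma, and `rotation` is an example of the equality case $f(z) = cz$.

```python
import cmath
import math

A = 0.5  # hypothetical real parameter with |A| < 1

def f(z):
    # z times a Blaschke factor: holomorphic in |z| < 1, |f(z)| <= 1, f(0) = 0.
    return z * (z - A) / (1 - A * z)

def rotation(z):
    # The equality case f(z) = c z with |c| = 1.
    return cmath.exp(1j) * z

def max_ratio(func, radii, angles=360):
    """Largest |func(z)| / |z| over sampled points with 0 < |z| < 1."""
    return max(
        abs(func(z)) / abs(z)
        for r in radii
        for k in range(angles)
        for z in [cmath.rect(r, 2 * math.pi * k / angles)]
    )

radii = [k / 20 for k in range(1, 20)]  # 0.05, ..., 0.95
print(max_ratio(f, radii))         # stays below 1
print(max_ratio(rotation, radii))  # equals 1 up to rounding

# Finite-difference estimate of f'(0); the exact value is -A.
h = 1e-6
derivative_at_0 = (f(h) - f(0)) / h
print(abs(derivative_at_0))
```

The first function moves every sampled point strictly closer to the origin and has $\abs{f'(0)} = 0.5$, while the rotation keeps every modulus unchanged, as the equality case predicts.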

It’s possible to relax the constraints of this theorem by composing $f(z)$ with Möbius transformations [4]. This would allow us to obtain an analogous result for holomorphic functions $f(z)$ mapping one open disk into another.

References

[1] Complex Analysis - Lars V. Ahlfors
[2] NP-Incompleteness: The Open Mapping Theorem
[3] NP-Incompleteness: Removable Singularities
[4] NP-Incompleteness: Möbius Transformation

Neomania

2025-01-11T00:00:00+00:00

In his book, The Art of Thinking Clearly, Rolf Dobelli describes 99 psychological traps that humans tend to fall into. One of them is what he calls Neomania, the mania for the new.

In this post I'd like to share my thoughts on it.

The reason for it being a trap (or more generally a negative thing) is that new things have a high chance of being irrelevant. We touched on this in On Lifetime, where we mention the book Algorithms to Live By [1], which says: an estimate for the lifespan of something we have no information about is twice its current age.

Thus, it’s risky to invest time and energy in new things. Examples of neomania I can think of:

Reading or watching the news
Paying more for premieres
Buying the latest Apple device or changing cars every few years
Staying on top of the latest technology trends

Neotrality

I believe I’m less affected by neomania than average. It’s not that I make an effort to avoid new things nor that I think it’s always best to do so, it’s just my default mode. I decided to call it neotrality, a word play between the neo prefix (meaning new) and neutrality, as in being neutral to novelty.

If I had a more negative view towards it, I could have used the term neophobia.

News

In my review of Real Analysis by Jay Cummings [2] in 2023, I included a quote from the book:

Thomas Jefferson revered Isaac Newton and once wrote to John Adams: “I have given up newspapers in exchange for Tacitus and Thucydides, for Newton and Euclid; and I find myself much happier.”

And I mentioned:

I haven’t paid much attention to the news for a long time (well, except Hacker news) and in the past year or so found I enjoy reading history and math/physics a lot, so I have been focusing on those.

A big part of my neotrality to news is that I have a weird range of attention span: I can focus on texts that are one sentence (think of tweets or headlines) and books of hundreds of pages, but struggle reading things of length in between: I cannot recall the last time I read an entire article in a magazine or newspaper.

I also struggle to read blog posts (ironically, I’d have trouble reading my own posts). This is one medium I wish I could have more focus for.

Another downside of not reading the news is not having much material for small talk. I personally don’t like small talk but sometimes it’s useful for breaking awkward silences with strangers or acquaintances.

The Latest Gadget

I got my first smartphone in 2012, the work phone provided by my company. It wasn’t until many years later (2017?) that I bought my own smartphone but mostly because I wanted to develop on Android.

I wonder how long it would have taken me to get a smartphone had I not gotten one from my company.

I do save a lot of money by not buying the latest gadgets but there are also opportunity costs: I bought an iPad last year after visiting an Apple Store (it’s such effective marketing!) and wished I had done it sooner.

The Latest Technology

In the book The Code Breaker [3], Walter Isaacson quotes Jack Szostak, Jennifer Doudna’s PhD advisor:

Never do something that a thousand other people are doing

I tend to agree with this and find that it’s a lot harder to make a meaningful contribution in crowded areas.

Also, as I mentioned in On Lifetime:

I have a strong preference for building things that last

and relying on unproven technology often leads to throw-away exploratory projects, which are not my forte.

Here too there are downsides, especially at work: I often have trouble staying on top of what other people are working on, which limits my ability to find new opportunities and project ideas.

On the other hand, I found that if something is really important or worth it, it eventually reaches me.

Conclusion

Overall I still find neotrality a net positive. It’s easy to regret things in hindsight (I wish I had bought the iPad sooner) but it helps to remember there’s a survivorship bias at play: for every iPad I didn’t buy, I also didn’t buy other devices that would not have been useful. For each tech like LLMs there’s web3 and blockchain.

The Innovator’s Dilemma reminds me of the tradeoffs between neomania and neotrality. It’s easy to blame companies that failed to innovate because they didn’t invest enough in X like startup ABC did.

On the other hand, there are many startups that invested in Y instead and disappeared from the map, and conversely many other promising technologies that the company did invest in and that didn’t pan out. Survivorship bias again.

References

[1] Algorithms to Live By, Brian Christian and Tom Griffiths
[2] Real Analysis: A Long-Form Mathematics Textbook, Jay Cummings
[3] The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race, Walter Isaacson
