People always told me that Unix is all about files, but now at the Christmas village, the ELFs taught me that this is actually not true. In Unix systems, and especially in Linux, everything is either a file, a process, or something that hides behind the file interface but is actually a gremlin in disguise. Today, we want to look a little bit deeper into the nature of processes and threads and the muddy middle ground between them. Or, to say it more clearly, we will look at the almighty clone(2) system call, which is the modern way to create processes, threads, and namespaces. Namespaces are also the basis of all container techniques, like Docker or LXC.
First, we have to understand that within the Linux kernel (kernel-level) threads and processes are the same object: struct task_struct.
A task_struct is the unit of scheduling: every time the kernel returns to user space, it returns to a specific task_struct. For example, the field struct mm_struct *mm points to the address space (mm = memory management) that is activated when we dispatch to this task.
Taken with a grain of salt, this explains the relationship between threads and processes. If the mm pointers of two tasks point to the same mm_struct, they are within the same process. If they point to different mm_structs, they are located in different processes.
Actually, as we will see, the distinction is a little bit finer than this as thread groups exist and process-affiliation is actually managed through these.
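To make the two criteria concrete, here is a small toy model in C. It is only a sketch: toy_task and toy_mm are made-up stand-ins and not the real kernel definitions, and the helper names are invented for this illustration. The only facts borrowed from the kernel are that a task carries an mm pointer and a thread group ID (tgid), and that getpid() reports the latter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mm_struct (hypothetical, heavily simplified). */
struct toy_mm {
    int unused;
};

/* Toy stand-in for struct task_struct (hypothetical, heavily simplified). */
struct toy_task {
    int tid;            /* unique per task, what gettid() would report */
    int tgid;           /* thread group ID, what getpid() would report */
    struct toy_mm *mm;  /* address space that is active while the task runs */
};

/* Two tasks share an address space if their mm pointers are equal. */
bool shares_address_space(const struct toy_task *a, const struct toy_task *b)
{
    assert(a != NULL && b != NULL);
    return a->mm == b->mm;
}

/* Process affiliation is decided by the thread group, not by the mm pointer. */
bool same_process(const struct toy_task *a, const struct toy_task *b)
{
    assert(a != NULL && b != NULL);
    return a->tgid == b->tgid;
}
```

In this model, a classic thread matches its creator in both checks, a forked child matches in neither, and a task that shares only the mm pointer sits in between.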
But this also raises the question: if two tasks can differ or match in their mm_struct pointer, why can't they match or differ in other important aspects of resource usage?
For example, can two tasks work in a different file-system tree?
And indeed, there is a struct fs_struct *fs pointer for exactly this purpose (see chroot(2))!
Many of you are familiar with the fork(2) system call, which copies the current process and returns twice(!): once in the parent process, and once in the child process. From ten miles away, it creates a new address space, copies the parent's virtual memory and its task_struct, and hands the newly cloned child task_struct to the scheduler. Plain and simple!
However, fork() is quite coarse-grained, as we cannot control which resources the parent and the child task should share.
They will live in independent address spaces, they will have the same file system, and they will see the same list of running processes.
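The independent address spaces are easy to observe. The following sketch (the function name fork_keeps_memory_private is made up for this example) lets the child overwrite a variable and then checks that the parent still sees the old value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's copy of `value` after the child changed its own copy,
 * or -1 if fork() or waitpid() failed. */
int fork_keeps_memory_private(void)
{
    int value = 1;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* We are the child: this write only touches the child's copy. */
        value = 2;
        _exit(value == 2 ? 0 : 1);
    }

    /* We are the parent: wait until the child is done. */
    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

    return value; /* unchanged in the parent */
}
```

Calling fork_keeps_memory_private() from main() should hand back 1, although the child has set its own copy to 2.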
However, with the more elaborate clone(2) system call, we have a much higher degree of control and can even emulate the behavior of fork():
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);
Unlike fork(), we have to explicitly provide a stack pointer (stack) and a function pointer (fn), where the newly created task should start its user-space execution.
However, we are also able to pass an argument (arg) to the new task and control the behavior of clone() with different flags.
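As a first experiment, we can use clone() in a fork()-like fashion: no sharing flags at all, only SIGCHLD so that the parent can wait for the child as usual. This is a minimal sketch using the glibc wrapper; the names run_in_clone_child and return_arg_as_status as well as the 1 MiB stack size are choices made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Entry point of the new task; its return value becomes the exit status. */
static int return_arg_as_status(void *arg)
{
    return *(int *)arg;
}

/* Starts a fork()-like child with clone() and returns the exit status that
 * the parent observes, or -1 on error. */
int run_in_clone_child(int exit_code)
{
    assert(exit_code >= 0 && exit_code <= 255);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downwards on common architectures, so we pass its top. */
    pid_t child = clone(return_arg_as_status, stack + CHILD_STACK_SIZE,
                        SIGCHLD, &exit_code);
    if (child < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(child, &status, 0);
    free(stack);
    if (waited != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Without CLONE_VM, the child works on a copy of the parent's memory, so the pointer to exit_code stays valid in the child even though it refers to the child's private copy.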
For each resource (address space, file system, UID namespace, ...), we can decide whether it is shared between parent and child. For example, with CLONE_VM, we instruct clone() to share the virtual memory between both tasks (i.e., they have the same mm_struct). With CLONE_THREAD, the new task is placed in the same thread group as the calling thread. Thereby, they are actually part of the same process.
But this also means that we can have two actual processes that share the same address space! A chimera between process and thread!
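Such a chimera can be sketched in a few lines: we pass CLONE_VM but not CLONE_THREAD, so the child writes directly into the parent's memory while still reporting its own process ID. The struct shared_state and the function names are invented for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

struct shared_state {
    volatile int value;       /* written by the child, read by the parent */
    volatile pid_t child_pid; /* getpid() as seen by the child */
};

/* Runs in the child; it writes into memory that the parent also sees. */
static int write_shared_state(void *arg)
{
    struct shared_state *state = arg;
    state->value = 2;
    state->child_pid = getpid();
    return 0;
}

/* Returns 0 on success and -1 on error. */
int run_chimera(struct shared_state *state)
{
    assert(state != NULL);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* CLONE_VM without CLONE_THREAD: same address space, own thread group. */
    pid_t child = clone(write_shared_state, stack + CHILD_STACK_SIZE,
                        CLONE_VM | SIGCHLD, state);
    if (child < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(child, &status, 0);
    free(stack); /* the child has exited and no longer uses its stack */
    if (waited != child || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}
```

After a successful run_chimera(), the parent should find value changed by the child, while child_pid differs from the parent's own getpid().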
Namespaces are sets of name-to-object mappings.
For example, a file system is a namespace as it maps human-readable filenames to inodes/file contents.
Actually, address spaces are also namespaces as the virtual addresses are translated by the MMU (Memory Management Unit) to physical addresses.
And by using different clone() flags, we can control whether the new task uses the same (shared) namespaces as the calling task or whether the kernel creates new ones (by cloning).
One of the simplest "namespaces" that we can create for our new task is a User ID namespace.
A UID namespace is a translation table between inside user IDs and outside user IDs.
Within a UID namespace, you can have uid=0 (root), which is translated to a normal user ID (e.g., 1000) outside the namespace:
# from_start to_start length
0 1000 1
In Linux, the mapping of a UID namespace is exposed as the file /proc/self/uid_map. It can be altered by writing into this file with the above format (without the line starting with #).
UID namespaces were one of the enablers of Docker containers, as they allow us to have a full-blown system, including daemons, within a container. Actually, all threads/processes/tasks within a Docker container share the same UID namespace.
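To get a feeling for the three-column mapping format, the following sketch parses one mapping line and translates an inside UID to its outside counterpart. The struct and the function names are made up for this example, and only the first line of /proc/self/uid_map is considered, although the file may contain several.

```c
#include <assert.h>
#include <stdio.h>

struct uid_map_entry {
    unsigned long inside_start;  /* first UID inside the namespace */
    unsigned long outside_start; /* first UID it maps to outside */
    unsigned long length;        /* number of consecutive UIDs mapped */
};

/* Parses one "from_start to_start length" line; returns 0 on success. */
int parse_uid_map_line(const char *line, struct uid_map_entry *entry)
{
    assert(line != NULL && entry != NULL);
    int fields = sscanf(line, "%lu %lu %lu", &entry->inside_start,
                        &entry->outside_start, &entry->length);
    return fields == 3 ? 0 : -1;
}

/* Translates an inside UID to the outside UID; returns -1 if it is unmapped. */
long map_uid_to_outside(const struct uid_map_entry *entry,
                        unsigned long inside_uid)
{
    assert(entry != NULL);
    if (inside_uid < entry->inside_start ||
        inside_uid - entry->inside_start >= entry->length)
        return -1;
    return (long)(entry->outside_start + (inside_uid - entry->inside_start));
}

/* Reads the first mapping of the calling process; returns 0 on success. */
int read_own_uid_map(struct uid_map_entry *entry)
{
    char line[128];
    FILE *file = fopen("/proc/self/uid_map", "r");
    if (file == NULL)
        return -1;
    char *got = fgets(line, sizeof line, file);
    fclose(file);
    return got != NULL ? parse_uid_map_line(line, entry) : -1;
}
```

With the example mapping from above, inside UID 0 translates to outside UID 1000, and every other inside UID is unmapped.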
clone() that shares the address space and that is put in the same thread group.
getuid() will show you your success.
getpid() returns the same value (returns the process ID or the parent process).
gettid() returns the thread/task id.
For setuid() to work within your UID namespace, you require a simple uid_map that can look like: sprintf("0 %d 1", outside_uid).
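One possible way to try out such a uid_map is sketched below. It is not a solution to the tasks above, only an experiment under several assumptions: the kernel allows unprivileged user namespaces, the child installs the mapping itself through /proc/self/uid_map, and the real and effective UIDs of the caller are equal. All function names are made up, and the child runs as a separate process (CLONE_NEWUSER cannot be combined with CLONE_THREAD).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Formats a one-line uid_map that maps inside UID 0 to outside_uid. */
int format_root_uid_map(char *buf, size_t size, uid_t outside_uid)
{
    assert(buf != NULL);
    int len = snprintf(buf, size, "0 %u 1", (unsigned)outside_uid);
    return (len > 0 && (size_t)len < size) ? 0 : -1;
}

/* Child: install the mapping for its own namespace, then check getuid().
 * Exit status 0 means "I am root inside", 1 "not root", 2 "setup failed". */
static int become_root_inside(void *arg)
{
    uid_t outside_uid = *(uid_t *)arg;
    char map[32];
    if (format_root_uid_map(map, sizeof map, outside_uid) != 0)
        return 2;

    int fd = open("/proc/self/uid_map", O_WRONLY);
    if (fd < 0)
        return 2;
    ssize_t written = write(fd, map, strlen(map));
    close(fd);
    if (written != (ssize_t)strlen(map))
        return 2;

    return getuid() == 0 ? 0 : 1;
}

/* Returns the child's exit status (see above) or -1 if clone() is refused,
 * which happens on systems that forbid unprivileged user namespaces. */
int try_uid_namespace(void)
{
    /* Must be read before the clone: inside the fresh namespace, the
     * still-unmapped UID would show up as the overflow UID. */
    uid_t outside_uid = geteuid();

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t child = clone(become_root_inside, stack + CHILD_STACK_SIZE,
                        CLONE_NEWUSER | SIGCHLD, &outside_uid);
    if (child < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(child, &status, 0);
    free(stack);
    if (waited != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Whether try_uid_namespace() succeeds depends on the system configuration, so a result of -1 or 2 is not necessarily a bug in the sketch.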