Virtual Memory on NUMA Systems

1 Virtual Memory and NUMA

The basic abstraction for storing data in a computing system is volatile memory (such as DRAM). Typically available in quantities ranging from a few to a few hundred gigabytes, it is plugged into our machines in the form of DIMM modules (see Wikipedia: Dual Inline Memory Module). At the application level, we use pointers (as discussed in our introduction to pointers and arrays in C) to address individual memory cells for reading or writing.

However, user-space applications do not interact directly with the physical addresses that the memory module exposes. Processor designers place a memory management unit (MMU) between the CPU pipeline and the memory interface, forcing user-space applications to use virtual addresses. The MMU oversees all memory accesses and translates virtual addresses, which are only valid within the user-space address space, into physical addresses that the DRAM module can interpret. To accomplish this, the MMU consults the page tables, a sparsely-populated data structure encoding the virtual-to-physical mapping, and caches the results in the translation-lookaside buffer (TLB).

While it might be theoretically possible to use bytes or words as the mapping granularity, such an approach would necessitate extremely large page tables. Hence, hardware vendors generally opt for a mapping granularity of at least 4 KiB or 16 KiB. As for nomenclature, we refer to a 4-KiB chunk of physical memory as a page frame (or just frame), while we use the term page for the "places" in the virtual-address space where we make frames visible.
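
To make this concrete, the following minimal sketch (plain C, assuming 4-KiB pages and 64-bit addresses; the example address is arbitrary) shows how a virtual address decomposes into a virtual page number, which the page tables translate, and an offset, which is carried over unchanged into the physical address:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4 KiB = 2^12 bytes */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void) {
    uintptr_t vaddr  = 0x7f3a12345678;           /* an arbitrary example address */
    uintptr_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number */
    uintptr_t offset = vaddr & (PAGE_SIZE - 1);  /* byte offset within the page */

    // The page tables map the VPN to a frame number; the offset is copied verbatim.
    printf("vpn=0x%lx offset=0x%lx\n", (unsigned long)vpn, (unsigned long)offset);
    return 0;
}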

In general-purpose operating systems (such as Linux, Windows, Mac OS X), each process has its own private virtual-address space. This means that the same numerical virtual-address value can represent different locations in different processes. For example, dereferencing the same pointer value (e.g., 0x1000) in different processes may access different memory cells located in distinct physical page frames. From a functional standpoint, this encapsulates the core concept of virtual memory. However, the non-functional aspects of virtual memory are considerably more complex on modern machines: behind the apparent simplicity of address translation, the underlying hardware has become much harder to handle.

While simpler machines exhibit uniform memory access (UMA) behavior, where every memory access takes roughly the same amount of time, modern multi-core systems usually display non-uniform memory access (NUMA) behavior: On the hardware side, a NUMA system is composed of multiple nodes (or sometimes cells), each consisting of several CPUs, memory interfaces, and DRAM chips. Unlike UMA systems, which connect all CPUs directly with each other, NUMA systems only interconnect the NUMA nodes, significantly reducing hardware complexity and making it easier to scale to higher CPU counts. As a result, one CPU in a NUMA system can access two types of memory:

  • Local memory, directly attached to its own NUMA node.
  • Remote memory, attached to another NUMA node and accessed through the NUMA interconnect.

Hence, local and remote memory have different access latencies (local-memory latency is currently around 90-100 ns, while remote-memory latency is around 150-300 ns). This disparity in access latency has opened up a whole new field of research for application developers and operating-system designers: How can we write software and manage our systems in a way that ensures most (or all?) memory accesses only encounter the faster local access latency, while avoiding costly remote off-node accesses? In the following sections, we will explore operating-system techniques for handling NUMA systems through virtual-memory management. This task is complicated by the fact that the OS cannot predict an application's future access patterns, making application-agnostic NUMA optimization particularly challenging.
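
To get a feeling for this asymmetry on a concrete machine, one can query the kernel's view of the topology. The following sketch uses libnuma (link with -lnuma); numa_distance() reports relative distances in which the value 10 denotes local memory and larger values denote increasingly remote memory:

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    // Print the relative distance matrix (10 = local, larger = farther away)
    for (int from = 0; from < nodes; from++) {
        for (int to = 0; to < nodes; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}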

2 NUMA-Aware Allocation of Memory

With virtual memory, the OS has a powerful tool at hand to hide the fact that we are on a NUMA system. As the MMU introduces an additional level of indirection, we can use the page tables to map every page to any (local or remote) page frame, making it possible to redirect memory accesses without changing the application (or its pointers). However, things are not that easy: handling NUMA machines well involves not only memory handling and allocation but also NUMA-aware thread scheduling. In the end, it is the combination of a control flow (thread) executing on a CPU and the memory accesses it issues to virtual addresses that determines whether an access is local or remote. Therefore, we have to schedule two resources: CPUs and memory. Nevertheless, for this short discussion, we will focus on virtual memory and assume that threads "stick" to their CPU and are rarely migrated to another CPU/NUMA node.

The simplest way to handle a NUMA machine is to largely ignore the existence of the interconnect and manage the machine as a partitioned system. For this, we pin every process to one NUMA node and allocate memory only from that node, no matter what. Thereby, all memory accesses are always local, as remote memory is never used, and we end up with perfect memory-access latencies. However, this comes at the price of an artificial cap on the resources (CPU and memory) available to a process: if we run out of local memory, allocations in a partitioned system will fail although memory might still be available on other NUMA nodes. On the CPU side, a partitioned system cannot compensate for an imbalance in CPU demand by migrating some threads to an underutilized node. In the end, the partitioned-system solution is a very poor fit for a general-purpose operating system with dynamically-changing memory and CPU demand.
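
In practice, such a partitioning can be approximated from user space, for instance by starting a process with numactl --cpunodebind=0 --membind=0 ./app. The following is a minimal sketch of the programmatic variant using libnuma's numa_bind(), which restricts both the CPUs and the memory allocations of the calling process (and its children) to the given nodes; node 0 is an arbitrary example:

#include <numa.h>
#include <stdlib.h>

int main(void) {
    // Restrict this process to NUMA node 0: it will only run on node-0 CPUs
    // and only allocate memory from node-0 frames (strict bind policy).
    struct bitmask *nodes = numa_parse_nodestring("0");
    numa_bind(nodes);
    numa_bitmask_free(nodes);

    // From here on, allocations are backed by node-0 frames; if node 0 runs
    // out of memory, allocations fail even though other nodes may still have
    // plenty of free frames.
    void *buf = malloc(64 * 1024 * 1024);
    return buf == NULL ? 1 : 0;
}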

Another method to distribute memory to users is to use a statistical process to even out the access latencies that arise from NUMA. With an interleaved allocation, we either assign each page to a NUMA node (as depicted on the slides) or cycle through all NUMA nodes for every allocation. With this interleaved allocation, we balance the allocation of page frames across all nodes. Therefore, with N NUMA nodes, an access has an average probability of 1/N of being a local access, if we can assume that memory accesses are uniformly distributed over our virtual-address space (in reality, memory accesses are not uniformly distributed but usually show a highly localized distribution with a few hot spots and many cold areas). So, for a low number of NUMA nodes, this policy utilizes the full system (memory and CPU), but at the cost of a certain number of off-node memory accesses. Worse still, interleaved allocation is a policy that cannot incorporate any application knowledge; even if we knew something about the access patterns, this policy would ignore it.

To harness this knowledge, we can use the first-touch allocation policy. As memory accesses originate from threads currently running on a CPU that is part of a NUMA domain, we know for every page fault (and also for explicit allocation requests) from which NUMA node the memory was requested. The first-touch policy leverages this information to allocate memory from that specific node. Consequently, subsequent requests from that thread will be local and remain local as long as the scheduler doesn't migrate the thread to another NUMA node. Furthermore, as long as memory is not used by different threads, most memory accesses remain local. Up until Linux 3.8, the first-touch policy, in conjunction with NUMA-aware scheduling (in essence, the scheduler tried to keep threads on their current NUMA node), was at the core of Linux's NUMA handling.
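
The effect of first touch can be made visible with a small experiment, sketched below under the assumption that libnuma is available: we pin ourselves to node 0, map anonymous memory, and only the first write allocates a frame, on exactly that node. Passing NULL as the node vector makes numa_move_pages() merely report on which node a page currently resides:

#include <numa.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    numa_run_on_node(0);                       // stay on node 0 for the first touch

    size_t size = 4096;
    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    // At this point, no physical frame is allocated yet; the mapping is
    // only populated lazily, on the first page fault.

    buf[0] = 42;                               // the first touch triggers the allocation

    // Query (not move) the page's current node: a NULL node vector turns
    // numa_move_pages() into a pure status query.
    void *pages[1] = { buf };
    int   status[1];
    numa_move_pages(0, 1, pages, NULL, status, 0);
    printf("page resides on node %d\n", status[0]);
    return 0;
}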

3 Manual and Automatic Management on NUMA Systems

As we have explored, when mapping memory into the virtual address space, NUMA machines demand more attention than UMA systems. The operating system, with its allocation policies, endeavors to predict the future access patterns of the application. Yet, it is worth contemplating whether we should empower the application to explicitly manage its NUMA resources. After all, the application is best positioned to anticipate its own usage patterns. Who else would be more knowledgeable?

Operating systems are designed to offer explicit management of CPU assignments and to let applications request memory according to specific NUMA nodes and allocation guidelines. In Linux, for instance, the libnuma library (manpage: numa(3)) offers these explicit controls. An application can use it to instruct the kernel on a preferred NUMA policy.
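
As a first taste of this interface, the following minimal sketch checks for NUMA support and then asks the kernel to prefer (but not strictly require) node 0 for future allocations of the calling thread; numa_set_preferred() falls back to other nodes when the preferred one is full:

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {          // -1 means: no NUMA support in this kernel
        fprintf(stderr, "NUMA is not available\n");
        return 1;
    }
    // Preferred policy: allocate from node 0 if possible, otherwise fall back.
    numa_set_preferred(0);

    // ... all subsequent allocations of this thread now favor node 0 ...
    return 0;
}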

Regarding CPU and scheduling, the "CPU set" is the central component in NUMA handling. Through it, one can limit a thread (and its offspring) to a certain subset of CPUs. The function numa_run_on_node() is particularly useful here, as it pins the invoking thread to a designated NUMA node. This means that the scheduler, when choosing a CPU for the thread, only considers CPUs on that specified node. Although this typically suffices for addressing NUMA considerations, there are also system calls (manpage: sched_setaffinity(2)) that pin a thread directly to a particular CPU, should the need arise. Furthermore, numa_run_on_node_mask() offers a more flexible interface, letting users designate multiple NUMA nodes via a bitmask. However, it is important to note that many system calls related to threads and processes allow a thread to adjust only its own settings; system calls affecting other threads are a rarity.
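
A minimal sketch of these calls is shown below; the node numbers are arbitrary examples, and numa_run_on_node_mask() expects a struct bitmask that we build with the libnuma helper functions:

#include <numa.h>

int main(void) {
    // Restrict the calling thread to the CPUs of NUMA node 0.
    numa_run_on_node(0);

    // More flexible: allow the thread to run on nodes 0 and 1.
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, 0);
    numa_bitmask_setbit(mask, 1);
    numa_run_on_node_mask(mask);
    numa_bitmask_free(mask);

    // Undo all restrictions: the scheduler may again use every node.
    numa_run_on_node_mask(numa_all_nodes_ptr);
    return 0;
}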

On the memory front, libnuma offers alternatives to the familiar memory-allocation functions (manpage: malloc(3)). Using the numa_alloc_*() functions, we can request a new virtual-memory region that draws its physical frames either from the local NUMA node (numa_alloc_local()) or interleaved from multiple nodes (numa_alloc_interleaved()). This means applications can determine the binding of NUMA nodes to virtual addresses at allocation time. One application could, for example, allocate a large memory segment that is accessed uniformly from all cores in an interleaved fashion, thus balancing access latencies.
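
A minimal sketch of how such allocations could look; the sizes and node numbers are arbitrary examples, and memory obtained from numa_alloc_*() must be released with numa_free():

#include <numa.h>
#include <stddef.h>

int main(void) {
    size_t size = 16 * 1024 * 1024;

    // Frames come from the NUMA node the calling thread currently runs on.
    void *local = numa_alloc_local(size);

    // Frames are distributed round-robin over all allowed nodes, so that
    // uniform accesses from all cores see balanced (average) latencies.
    void *shared = numa_alloc_interleaved(size);

    // Explicitly place the region's policy on node 1 (an arbitrary example node).
    void *on_node1 = numa_alloc_onnode(size, 1);

    /* ... use the memory ... */

    numa_free(local, size);
    numa_free(shared, size);
    numa_free(on_node1, size);
    return 0;
}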

However, there are instances where decisions made at allocation time, considered an "early binding", are insufficient. This is especially true when data needs to migrate dynamically across cores. Take, for example, a work queue serviced by a pool of worker threads. When submitting work, it remains uncertain which core will eventually process it. It would be beneficial if a worker thread could ensure that a work package resides on its own NUMA node. For such scenarios, numa_move_pages() is a viable route as it facilitates the migration of data between NUMA nodes, ensuring efficient processing. A possible usage scenario could look like this:

// Required headers: <numa.h>, <numaif.h> (MPOL_MF_MOVE), <sched.h> (sched_getcpu)
void worker(void) {
    while (1) {
        // In this pop(), the worker might block on new work.
        work_t *work = queue_pop();

        // Get our NUMA node and pin ourselves to that node
        int cpu  = sched_getcpu();
        int node = numa_node_of_cpu(cpu);
        numa_run_on_node(node);

        // Move the workload (work->data is a void*) to our NUMA node
        void *pages[1]  = { work->data };
        int   nodes[1]  = { node };
        int   status[1];
        numa_move_pages(/* 0 = calling process */ 0,
                        /* number of pages N   */ 1,
                        /* N page pointers, N node numbers, N status values */
                        pages, nodes, status,
                        MPOL_MF_MOVE);

        // Execute the job!
        work->doit(work);

        // Undo the NUMA-node pinning so that we may wake up anywhere.
        numa_run_on_node_mask(numa_all_nodes_ptr);
    }
}

When a worker thread retrieves a work item from a global queue, it might block if no work is immediately available. During this waiting period, we ensure that the thread is not pinned to a specific NUMA node. Once the pop() function yields a work package, the thread pins itself to the NUMA node it was activated on. This pinning ensures the scheduler will not relocate the thread while it is processing the task.

To optimize memory access, the thread employs numa_move_pages() to transfer the data page of the task to its current NUMA node. The numa_move_pages() function operates at the granularity of individual pages and requires three vectors of identical length: pages specifies which pages to migrate, nodes indicates the target NUMA node for each of these pages, and status provides space for the per-page result of the move operation. For our purposes, we only move the single memory page that work->data points to.
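
As a side note, the call also reports whether each individual page could be moved: after the call, every entry of status contains either the node the page now resides on or a negative errno value (for example, -EFAULT for an unmapped address). A hedged sketch of such a check, with move_one_page() standing in for the corresponding part of the worker:

#include <numa.h>
#include <numaif.h>   /* MPOL_MF_MOVE */
#include <stdio.h>
#include <string.h>

// Try to move one page to 'node' and report what happened.
// 'addr' and 'node' stand in for work->data and the worker's node.
static void move_one_page(void *addr, int node) {
    void *pages[1]  = { addr };
    int   nodes[1]  = { node };
    int   status[1] = { 0 };

    if (numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0) {
        perror("numa_move_pages");
        return;
    }
    if (status[0] < 0)
        // e.g. -ENOENT: page not present, -EFAULT: address not mapped
        fprintf(stderr, "page not moved: %s\n", strerror(-status[0]));
    else
        printf("page now resides on node %d\n", status[0]);
}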

Once the data has been migrated, the thread can proceed with executing its main workload. After this, we unpin the thread from the NUMA node to return a degree of freedom to the scheduler. By adopting this approach, we ensure that all accesses from doit() to the 4096 bytes that work->data points to occur locally on the NUMA node.

As observed in the example, manual NUMA handling can be particularly cumbersome, and it is not advisable to undertake it manually. Yet, merely relying on the initial allocation decision (i.e., first touch) is not sufficient either. We require a mechanism that dynamically adapts to the application's usage patterns. Over recent years, operating systems have incorporated such features and now offer automatic NUMA rebalancing. This mechanism aims to automatically migrate physical frames to the NUMA nodes that access the data. As a result, access patterns that exhibit both spatial and temporal locality become local NUMA accesses over time.

For automatic NUMA rebalancing to function effectively, the OS must perform two tasks for a page: (1) identify which node accesses the page most frequently and (2) relocate the page to that particular node. The latter is the more straightforward mechanism: the OS simply allocates a page frame from the target node, transfers the data from the source to the destination frame, and modifies the virtual-memory mapping (it should also perform a TLB shootdown, but let us not delve into shootdowns at this juncture).

However, the primary challenge lies in the access detection. To truly find the node with the most accesses, we would need specific hardware support that individually records the access statistics for each page frame. However, one might argue that storing a counter for each frame and each NUMA node would consume vast amounts of memory: consider a scenario with 4 NUMA nodes, 1 TiB of memory in 4-KiB frames, and 32-bit access counters. That is 2^28 frames with 16 bytes of counters each, or 4 GiB in total, roughly 0.4 percent of our memory, just for storing these counters. Yet, a pertinent question arises: Is it essential to possess a complete statistic to determine the most frequent accessor of a page? Or might it suffice to ascertain, with high probability, the node accessing the page most frequently?

Envisioning memory accesses to a page frame as a stream of NUMA-node numbers (e.g., [1,1,1,0,1,0,1,1,1]), one could sample a few of these numbers to identify the most frequent accessor. In an even simpler approach, we might sample just one access, leading us to conclude that the sampled NUMA node is the predominant accessor. For frames accessed exclusively by one node, this assessment would be accurate. Moreover, for pages primarily accessed by one node, there's a high likelihood that our guess would be correct. This technique is cost-effective as it only requires observing a single memory access to decide whether to migrate a page or retain its current location.
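
To see why a single sample is often good enough, consider a tiny simulation with hypothetical numbers: a page whose accesses come from node 1 with a probability of 80 percent. Sampling one access and guessing that node is then correct about 80 percent of the time, which the following sketch estimates empirically:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Access-stream model: node 1 issues 80 % of the accesses, node 0 the rest.
    const double p_node1 = 0.8;
    const int    trials  = 100000;

    srand(42);
    int correct = 0;
    for (int i = 0; i < trials; i++) {
        // Sample a single access from the stream ...
        int sampled_node = ((double)rand() / RAND_MAX < p_node1) ? 1 : 0;
        // ... and guess that this node is the most frequent accessor.
        if (sampled_node == 1)
            correct++;
    }
    printf("single-sample guess was correct in %.1f %% of the trials\n",
           100.0 * correct / trials);
    return 0;
}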

While some processors directly support the observation of memory-access patterns, a more generally applicable method for rebalancing exists: We employ the memory-management unit (MMU) to prohibit all memory accesses within a specific address range (red bars on the slide). When an application accesses a page within this range, the processor notifies the operating system by invoking the page-fault handler on the accessing CPU. Consequently, the OS can directly deduce from the current CPU id on which NUMA node the application thread is running. A concise page-fault handler could look as follows:

// Simplified, kernel-style pseudocode; the helper functions are placeholders.
void page_fault_handler(void *vaddr) {
    // Get a pointer to the page-table entry
    // that controls the mapping at vaddr
    pte_t *pte = resolve_vaddr(vaddr);
    //...

    if (numa_is_rebalancing(pte)) {
        int cpu  = sched_getcpu();
        int node = numa_node_of_cpu(cpu);
        if (pte_get_node(pte) != node) {
            // Allocate a frame on the local node
            int frame_number = numa_alloc_local();
            // Migrate the page by copying the frame contents
            copy_physical(pte->frame_number, frame_number);
            // Redirect the mapping (freeing the old frame and the
            // TLB shootdown are omitted here)
            pte->frame_number = frame_number;
        }
        // Remove the access protection from THIS page
        pte->present = 1;
    }
}

The page-fault handler initially determines whether the fault was truly instigated by the rebalancing mechanism and not by an actual access fault. It then retrieves the current NUMA node (node) and verifies whether the page-table entry pte points to a frame on the current NUMA node. If this is not the case, the page is migrated by copying. In all scenarios, the access protection is lifted from the page-table entry, making this fault a one-shot operation.

This mechanism offers several notable advantages: (1) Automatic NUMA rebalancing can be activated in a precise manner, targeting specific segments of the address space. (2) Only the initial access to an access-protected page incurs the cost associated with the page-fault handler. (3) There is no dependence on exotic hardware features; the process solely employs the standard ability to mark a page as not present (despite its actual presence). (4) Unaccessed pages remain access-protected, and this status can be utilized to inform the page-replacement strategy.

In the Linux environment, automatic NUMA rebalancing operates as a continuous procedure: every second, the Linux kernel marks a particular segment of the virtual-address space (for instance, 256 MiB) as not present. When this incremental marking process reaches the boundary of the mapped virtual memory, Linux restarts from the beginning, following a cyclical approach. The code for this procedure might look as follows:

// Simplified pseudocode; the required TLB invalidations are omitted (see below).
void rebalance_marker_thread(void) {
    void *vaddr = 0;
    while (true) {
        // Mark 256 MiB of memory (65536 pages of 4 KiB) as not accessible
        for (unsigned i = 0; i < (256 * 1024 * 1024) / (4 * 1024); i++) {
            vaddr = next_mapped_page(vaddr);
            pte_t *pte = resolve_vaddr(vaddr);
            pte->present = 0;
        }
        // Wait for another second before marking more memory.
        sleep(1);
    }
}

In the provided example, the intricate details of locating the next mapped page in a circular manner have been encapsulated within the function next_mapped_page(). Furthermore, all the TLB invalidations have been omitted for simplicity. However, I hope the primary concept becomes evident: setting the present bit to 0 denotes the juncture at which we prohibit accesses. Upon revisiting the page-fault–handler code, you can pinpoint the line where this access protection is lifted.

Stepping away from these granular technical specifics, the portrayed automatic NUMA rebalancing method evokes a sense of familiarity: It operates akin to a first-touch policy, perpetually applied across our virtual-memory space. Rather than exclusively making a NUMA node determination at the time of allocation, this approach periodically revisits such decisions, allowing for dynamic adaptation to changing access patterns without necessitating intricate application insights.

In conclusion, we can say that knowing about NUMA and the OS mechanisms to tackle NUMA effects is important. The difference between local and remote NUMA accesses can make your application perform poorly. You can either do NUMA management on your own, which is not recommended, or rely on the mechanisms the operating system provides. But in either case, you have to know that, on modern machines, not all memory is created equal.


Short talks on various topics from the field of operating systems, Christian Dietrich, 2020-2022.
Unless declared otherwise, the materials for these short talks are published under the Creative Commons Attribution International License (CC-BY 4.0).