
Managing Memory Through Address Sanitizer and Observing Virtual Memory

· 46 min read
Pranav Ram Joshi
Software Engineer — Systems & Networks

Virtual Memory Management provides the dream of infinite space. Address Sanitizer is the precipice hidden in the fog - the silent pitfall that claims the soul the moment the dream wanders into the void.

Preamble

I have recently started learning Rust. While learning about Rust's types and programming model, I couldn't help but compare Rust and C. I also revisited the flexible array member in C structures. While doing that, I poked around memory using various sanitizers. Until now, I had naively used the address sanitizer, so I decided to understand its output and how the algorithm works. Initially, I looked into compiler-specific documentation; both GCC's and Clang's documentation ultimately refer to [GOOGLE_ASAN]. I plan to discuss common programming errors that cause memory-related issues, how the address sanitizer can be used, why it's not always reliable, and how to read the output of the vmmap(1) utility on Darwin.

Various memory-related bugs have been discovered to date. [GOOGLE_ASAN] provides a definition for each of these problems. In essence, we can classify them as:

  1. Buffer overflow. Stack, heap, and global buffer overflows.
  2. Use-after bugs. free(3) (dangling pointers), return (returning reference to automatic variables local to function), and scope (example below).
  3. Leak. A region of memory is allocated but no variable references it, thereby making it inaccessible.

Buffer overflow

Buffer overflow is a common programming mishap. Interfaces such as gets(3), strcpy(3), and other string-handling functions that do not consider the length of the buffer cause buffer overflows. Consider a simple program fragment which shows all the overflows mentioned above:

#include <stdlib.h>
#include <string.h>

char global_buff[10]; /* stored in BSS (or DATA) section */

int
main (void)
{
    char stack_buff[5];
    strcpy(stack_buff, "This strcpy will cause stack overflow");

    char *heap_buff = malloc(8);
    if (!heap_buff) exit(1); /* malloc failed */
    strcpy(heap_buff, "This strcpy will cause heap overflow");

    strcpy(global_buff, "This strcpy will cause global overflow");
}

Initialized global variables are kept in the DATA section of the program, whereas uninitialized global variables are kept in the BSS section. Variables declared inside a function are automatic by default (hence stored on the stack) unless the storage class static or extern is specified. The library function malloc(3) and the system calls mmap(2), brk(2) (deprecated), and sbrk(2) (deprecated) allocate memory in the heap region.

brk(2) and sbrk(2)

Before the virtual memory management scheme was introduced as an abstraction layer between processes and physical memory, a typical process had its memory laid out such that the uninitialized data segment (BSS segment) was just below the heap. Do note that the heap--which is above the BSS--grows upward. The program break, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment), can be consulted via sbrk(2) with an argument of 0, on both Linux and Darwin. The end(3) interface is available too, though the manual (on Darwin) suggests avoiding it: the end(3) interface assumes that the program has the memory layout of a classic UNIX program, which is not true on Darwin.

Increasing the program break is equivalent to allocating memory to the process, and conversely, decreasing the program break is equivalent to deallocating memory of the process.

Most implementations suggest using functions such as malloc(3) (or mmap(2)) for the purpose of memory allocation. brk(2) on Darwin even provides a useful remark:

The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management.

Modifiable global variables are stored in a region that has read and write permissions. Constant data are stored in a region where only read permission is present. An attempt to write to such a region causes a segmentation violation, immediately terminating the program through a signal (program termination can be bypassed if a handler is installed for the signal). Since the global variable is modifiable, such a global overflow could result in memory corruption, and the program would no longer be in a reliable state.

Variables stored on the stack have been abused to such an extent that remedies such as stack canaries, non-executable stacks, and others have been introduced. A detailed introduction to such schemes is given in [Wikipedia: Stack buffer overflow (Protection schemes)]. If we set aside such protections and assume a simple stack structure, corruption of a stack variable can lead to disastrous outcomes. For example, the return address of the caller is stored on the stack (ARM uses a dedicated register to store the return address), and if it is corrupted, the program could branch or return to some other routine and cause havoc. Even if such problems have been largely solved by now, the stack is still prone to memory corruption if a stack buffer overflow were to occur.

Heap corruption is caused by heap overflow. When you request memory for your application, it is usually allocated in units called pages; mmap(2) does that. The library function malloc(3) uses an implementation-specific algorithm to return chunks of memory instead of whole pages. Memory allocation is a complicated topic and various studies have been done in this area. If you want to learn about the earlier implementations, check out [DLMALLOC].

Use after errors

I've previously discussed the use-after-free bug in C and how Rust is able to statically locate such errors. Despite that, let's see all three types of bugs in the code fragment below:

#include <stdlib.h>

int *
use_after_ret_routine (void)
{
    int num = 0x45;

    /*
     * Local variables are stored in the stack. Once
     * we return from this function, the address
     * of variable 'num' will no longer be valid.
     * This is the use-after-return bug.
     */
    return (&num);
}

int
main (void)
{
    int *alloc_num;
    alloc_num = malloc(sizeof(int));
    if (!alloc_num) exit(1); /* malloc failed */
    /* free the memory region */
    free(alloc_num);
    /* use it, hence use-after-free */
    *alloc_num = 0105;

    /* use-after-scope */
    int *uas;
    {
        /* different scope, 'scope_var' is only valid in this scope */
        int scope_var = 69;
        uas = &scope_var;
    }
    /* still points to 'scope_var' even though its lifetime has finished */
    *uas = 0;
}

As complicated as malloc(3) is, so is free(3). In contrast, mmap(2) and munmap(2) are rather simple, as discussed in a previous blog post. I've briefly discussed why use-after-free occurs previously, so I won't repeat it here.

Regarding the use-after-return bug, notice the use_after_ret_routine function. If a function returns a pointer instead of an object, we must be careful not to return a reference to a variable local to the function. It's okay to return a reference obtained from a memory allocation function such as malloc(3), since such references point to the heap region instead of the stack. Once the function returns, its stack frame is destroyed, so any reference to a variable local to that function will no longer be valid. In this example, the reference to num would no longer be valid once use_after_ret_routine returns.

The lifetime of an object inside a block is as follows:

int
main (void)
{
int x;<---------------------------------------------------------+
| /* lifetime of x */
{ |
int y;<---------------------------------+ |
| /* lifetime of y */ |
{ | |
int z;<-------+ | |
| /* lifetime of z */ | |
}<--------------+ | |
| |
}<----------------------------------------+ |
|
}<----------------------------------------------------------------+

Block scope is a compile-time concept enforced by the language for readability and safety. It does not exactly correspond to how the compiler manages memory; all local variables within a function are typically allocated together on the stack at function entry, regardless of which (inner) block they were declared in.

In the above example, uas is assigned--in the inner block--a reference to scope_var, which is local to that inner block. After exiting the inner block, the lifetime of scope_var ends; since uas is then dereferenced in the outer block and its content modified, the behavior is undefined.

Leaks

The leaks(1) manual on Darwin provides the following definition of leaked memory:

leaks identifies leaked memory -- memory that the application has allocated, but has been lost and cannot be freed.

Let's create a simple program to illustrate what a memory leak is:

#include <stdlib.h>

int
main (void)
{
    {
        char *leak_me;
        leak_me = malloc(64);
        if (!leak_me) exit(1);
    }
    /* notice that 'free' was not called above */
}

The inner block has a variable leak_me which holds the memory region returned by malloc(3). Since free(3) was not called, the memory cannot be reclaimed. Furthermore, since the lifetime of leak_me ends once we leave the inner block, the program no longer has a means to deallocate this region. This is what a memory leak is, as stated in leaks(1):

Specifically, leaks examines a specified process's memory for values that may be pointers to malloc-allocated buffers. Any buffer reachable from a pointer in writable global memory (e.g., __DATA segments), a register, or on the stack is assumed to be memory in use. Any buffer reachable from a pointer in a reachable malloc-allocated buffer is also assumed to be in use. The buffers which are not reachable are leaks; the buffers could never be freed because no pointer exists in memory to the buffer, and thus free() could never be called for these buffers.

Leaks are not as lethal as memory corruption, but in the long run the application will eventually run out of memory, since it failed to deallocate or reclaim those memory regions.

Memory Management Unit

caution

This section is mostly unrelated to the topic we're discussing. I've always wanted to understand memory from a low-level perspective. While learning about the workings of the address sanitizer, I was also trying to understand the implementation details of memory, which are documented in this section. Feel free to skip it entirely if you just want to understand the address sanitization process.

The process of converting a virtual address to its corresponding physical address has some additional intermediate steps. Needless to say, the information required by the intermediate steps is contained within the virtual address itself.

Since the advent of virtual memory addressing, each process views the memory address space in its entirety, i.e., a process thinks it has all the available addresses even when that's not the case. A hardware component known as the Memory Management Unit (MMU) sits between the processor and the actual memory device and acts as a translator for such virtual memory addresses.

Virtual Address Width

Even though the address space on most modern machines is represented by a 64-bit value, not all of the bits are used for addressing. On most implementations, 4 virtual address widths are used:

  1. 32-bit. Used in older 32-bit systems but is compatible with 64-bit architectures as well. Can represent memory up to 4 GiB. It typically uses 2 levels of paging. Do note that unlike 64-bit extensions, 10 bits are used for indexing, supporting 1024 entries.
  2. 39-bit. Introduced for embedded and mobile devices. Such addressing uses 3 levels of paging, where each level of paging holds 512 entries. Implementation may provide support for larger page size, up to 1 GiB. Can represent memory up to 512 GiB.
  3. 48-bit. Available on most general purpose computing systems. An extra level of paging is used--with total of 4 level of paging--while each level of paging holds 512 entries. Can represent memory up to 256 TiB.
  4. 57-bit. Introduced for high-end servers. It allows 5 level of paging. Can represent memory up to 128 PiB.
Architecture and Implementation Specific Attributes

I can't stress enough how deep this topic is. Even though I've tried to list the widths used in most common general computing systems, the list is not always entirely accurate. For example, the x86 architecture requires a 32-bit virtual address width and, with subsequent extensions to the architecture, supports memory up to 1 TiB, or 40-bit physical addressing, as mentioned in section 4.3 32-BIT PAGING of [INTEL_ARCH_SDM]:

A logical processor uses 32-bit paging if CR0.PG = 1 and CR4.PAE = 0. 32-bit paging translates 32-bit linear addresses to 40-bit physical addresses

ARM also has its own implementation-specific attributes and, to further complicate matters, requires knowledge of the ARM version and profile (usually Application (A), Real-time (R), Microcontroller (M), or Classic). Furthermore, the translation scheme for ARMv7 and ARMv8 (more precisely, with the introduction of the AArch64 execution context) may have its own specification. Interested readers can check out [ARMV8_9_AARCH64_VMSA] and [ARMV8_9_AARCH32_VMSA], which are provided in References. The process of translating a virtual address to a physical address involves factors such as:

  1. Base address to locate the first table (ARM uses Translation Table Base Register (TTBR), while Intel uses Control Register 3 (CR3)) in conjunction with bits from other system registers to advance to next table or locate the larger page.
  2. Depending on the capability and extensions to the architecture, setting and resetting of appropriate bits in system registers changes the behavior of table lookup.
  3. Extensions to older architectures also require fine-grained handling of table entries. For example, enabling Physical Address Extension (PAE) on the Intel architecture, using 32-bit paging, causes the Page Directory Entry and Page Table Entry to be 64 bits wide. This is briefly described in Wikipedia: Physical Address Extension § 32-bit Paging, 4 KiB pages, with PAE.
  4. Entries in the table structures convey more than the pointer to another table or to the actual physical address. For instance, Table 4-4 of [INTEL_ARCH_SDM] shows that certain bits within the Page Directory Entry are used for distinct purposes, notably whether the referenced 4 MiB page has been written to, whether it has been accessed by the user, and so on. On ARM, the entries are called descriptors. Two formats are normally found: the Short-descriptor format and the extended Long-descriptor format. The definition of these descriptors for ARMv7 can be found at Armv7-A § Translation tables, while AArch64 is at Armv8-A § D8.3 Translation table descriptor formats and AArch32 is at Armv8-A § G5.3 Translation tables.

Each additional page level accounts for 9 extra bits used for indexing into the respective page table. The standard page size is 4 KiB, so 12 bits of the virtual address must be reserved for the page offset. Since each process also has its own distinct virtual address space, Address Space Identifiers (ASIDs) are used to distinguish the addresses of the respective processes.

In most 64-bit systems which employ a 48-bit virtual address width, bits [63:48] must be the same as bit 47. The address range [0000800000000000, FFFF7FFFFFFFFFFF] is not used. This is often referred to as canonical addressing.

caution

For the sake of brevity, 32-bit and 57-bit virtual address widths are not discussed below unless explicitly stated. The concepts below use the 39-bit virtual address width for illustration, but also contain information regarding the 48-bit virtual address width. As mentioned above, larger virtual address widths add extra levels of table lookup, and most implementations do not allow pages larger than 1 GiB, which is already available with the 39-bit virtual address width.

Page, Frame, and Translation Granule

A page is a contiguous block of memory which has some fixed length. The page table is a data structure which contains multiple entries of such pages. A page frame is the smallest fixed-length contiguous block of physical memory into which memory pages are mapped by the operating system. A system's page size can be queried using the sysconf(3) library function:

long pagesize;

pagesize = sysconf(_SC_PAGESIZE);

On most systems, the size of a page is 4 KiB. However, ARM-based macOS uses a page size of 16 KiB.

Let's now consider the page table size. Assume a naive implementation which has only one level of paging. This means that all the address space entries are found in the same page table. Suppose that the machine uses a 64-bit virtual address space. Given that an entry in the page table corresponds to 4 KiB of memory, the page table must have:

number of entries = 2^64 / 2^12 = 2^52

This really wasn't an issue when working with physical memory of 4 GiB, as we only needed 2^20 entries in the single-level paging scheme. Also, since each entry was stored as 4 bytes of information, only 4 MiB was required to map the entire physical memory. For most 64-bit systems (or extensions to old 32-bit architectures), it has been the norm for entries to use 8 bytes to store paging information, and with the required number of entries for a 64-bit address space, the total storage required for the single-level paging scheme would be 32 PiB. We've accepted the need for multilevel paging, which not only uses layers of indirection to reach the main memory, but also allows accessing larger pages if the underlying hardware supports it.

Needless to say, it's impractical to store such a large page table for one process, let alone for the multiple processes which exist simultaneously in the system. As an additional restriction, at least on ARM, the translation table adheres to the hardware page size (frame size) as well, so 4 KiB of translation information can be provided. On ARM, a Translation Table Entry (TTE) is 8 bytes, so a translation table can contain 4096 / 8 = 512 entries. We can decrease the number of entries in the page table by:

  1. Increasing the page size. For example, if the page size is 16 KiB, the number of entries would be 2^50. This doesn't really improve the situation, but the number of entries has been reduced by a factor of 4. Another drawback is that the page size usually matches the frame size of main memory, so having different page and frame sizes would introduce fragmentation problems.

  2. Introducing multilevel paging. A page table will now contain a fixed number of entries. An entry may either refer to physical memory, as previously mentioned, or refer to another page table. This allows a smaller data structure to be maintained by the kernel, but introduces a performance penalty when we have to traverse the page tables.

    Translation Lookaside Buffer (TLB)

    Although multilevel paging is a better approach and is used in most implementations today, it has a drawback: the table walk. Walking through multiple page tables can be an expensive operation and defeats the purpose of using page tables, not by the size factor but by the time factor. To overcome this problem, the MMU has a dedicated Translation Lookaside Buffer which stores recent table walk results to allow quick access without walking the tables.

    The TLB is a cache of the page table, representing only a subset of its contents. Just as cache coherency issues exist, so do TLB and pagewalk coherence issues. I won't be discussing them in this blog, but interested readers can refer to [TLB and Pagewalk Coherence in x86 Processor].

The page size usually corresponds to the capability of the actual memory hardware, i.e., a page size of 4 KiB is used because the frame size is usually 4 KiB. The term frame refers to physical memory. An illustration of multilevel paging can be seen in Figure 1 (excerpt from ARM: Multilevel translation). One thing to notice is that a page table (ARM uses the term translation table) may contain:

  1. Reference to another page table.
  2. Entry to physical memory.

ARM, on which the figure below is based, uses the term Translation Granule, defined as the smallest block of memory that can be described--a page size. The last table--whose entries refer only to physical memory--contains entries for a given page size, but prior tables can hold entries to physical memory larger than the page size (a multiple of the page size). This is stated in the ARM documentation as:

The selected granule is the smallest block that can be described in the last level table. Larger blocks can also be described.

This can be seen in the figure itself (Levels 0, -1, and -2 are not present), where one entry of the Level 1 table (the figure does not label Level 1) refers to a large block in physical memory. For example, given a translation granule (page size) of 4 KiB:

  1. The Level 3 table contains entries of frame size 4 KiB only.
  2. The Level 2 table contains entries referring either to a Level 3 table or to a 2 MiB block.
  3. The Level 1 table contains entries referring either to a Level 2 table or to a 1 GiB block.
  4. The Level 0 table is used for 48-bit (or 52-bit) virtual addressing width.
  5. The Level -1 table is used for 57-bit virtual addressing width.
  6. The Level -2 table is planned as a future extension.
Intel Alternative

In addition to paging, which is used to convert a linear address to a physical address, there are some distinct segment registers which hold logical addresses. [OSDev: CPU Registers x86 § Segment Registers] lists them. Notable ones are Code Segment (CS), Data Segment (DS), Extra Segment (ES), Stack Segment (SS), and two other segment registers. The answer on [StackOverflow: What is the difference between linear, physical, logical and virtual memory address?] discusses these terminologies in brief. It should be noted that converting a logical address to a linear address is a different procedure from converting a linear address to a physical address; the latter is done by paging. The procedure for the former is provided in 3.4 LOGICAL AND LINEAR ADDRESSES of [INTEL_ARCH_SDM].

Intel provides three paging modes, excluding the no-paging mode. If no paging mode is selected (the Paging (PG) bit in Control Register 0 (CR0) is set to 0), the linear address width is 32 and the corresponding physical address width is 32. The three paging modes are:

  1. 32-bit paging. This paging mode is used when the PG bit in CR0 is set to 1, the PAE bit in Control Register 4 (CR4) is set to 0, and the Long Mode Enable (LME) bit in the Intel Architecture 32 Extended Feature Enable Register (IA32_EFER) is set to 0. The linear address width is limited to 32 bits, but the physical address width can be up to 40 bits, subject to two prerequisites: the CPU must support Page-Size Extension with 40-bit physical addresses (PSE-36), and for physical addresses wider than 32 bits, the Page Directory Entry (PDE) must have the Page Size (PS) bit set to 1, which makes the entry map a 4 MiB page. If the PS bit is set to 0, the PDE maps to a Page Table Entry (PTE), which then contains an entry for a 4 KiB page. Support for PSE-36 can be determined with the CPUID instruction. This paging mode supports two page sizes: 4 KiB and 4 MiB (if the processor supports it).
  2. Physical Address Extension (PAE). This paging mode is used when the PG bit in CR0 is set to 1, the PAE bit in CR4 is set to 1, and the LME bit in IA32_EFER is set to 0. A linear address width of 32 bits is used to fetch physical addresses of up to 52 bits. It supports pages of size 4 KiB and 2 MiB. Additionally, if the No-Execute Enable (NXE) bit of IA32_EFER is set to 1, this paging mode supports the execute-disable functionality.
  3. Intel Architecture 32 extensions (IA-32e). This paging mode is used when the PG bit in CR0 is set to 1, the PAE bit in CR4 is set to 1, and the LME bit in IA32_EFER is set to 1. The linear address width is 48 bits, while the physical address width can be up to 52 bits. The page sizes supported by this paging mode are 4 KiB, 2 MiB, and 1 GiB. Do note that we need to check for the Page1GB feature through the CPUID instruction to affirm the use of 1 GiB pages. Like the PAE paging mode, this paging mode supports the execute-disable functionality, and additionally supports Process-Context Identifiers (PCIDs) and protection keys. Support for PCIDs is only realized if the PCID Enable (PCIDE) bit of CR4 is set to 1. Similarly, the protection key feature is only used if the Protection Key Enable (PKE) bit of CR4 is set to 1.

Based on the paging mode, various paging structures are selected. [OSDev: Paging] provides the structure of the entries that are discussed below.

  1. Page Table (PT). Contains Page Table Entries (PTEs). Entries in a PT map the physical address of a 4 KiB page frame. Not used for 2 MiB, 4 MiB (on PSE-36-supporting processors using 32-bit paging), or 1 GiB pages. In 32-bit paging, it is functionally similar to the Level 3 table on ARM. PAE allows 64-bit PTEs.
  2. Page Directory (PD). Contains Page Directory Entries (PDEs). Depending on the page size, a PDE may point to a page table or map physical memory directly. Not used for 1 GiB pages. It is functionally similar to the Level 2 table on ARM.
  3. Page Directory Pointer Table (PDPT). Contains Page Directory Pointer Table Entry (PDPTE). Not available in 32-bit paging and was introduced for 36-bit paging (64 GiB). It is functionally similar to Level 1 table on ARM.
  4. Page-Map Level 4 (PML4). Contains Page-Map Level 4 Entry (PML4E). It is functionally similar to Level 0 table on ARM.
  5. 5-level Paging. Enables 56-bit userspace virtual address space. It is functionally similar to Level -1 table on ARM. Refer to [Linux Kernel: 5-level Paging] for more information.

A detailed information of Paging on Intel architecture can be found in [Github:dreamportdev/Osdev-Notes/04_Memory_Management/03_Paging.md]. The following table is an excerpt from 4.2 Hierarchical Paging Structures: An Overview of [INTEL_ARCH_SDM].

| Paging Structure | Entry Name | Paging Mode | Physical Address of Structure | Bits Selecting Entry | Page Mapping |
|---|---|---|---|---|---|
| PML4 table | PML4E | 32-bit, PAE | N/A | N/A | N/A |
| PML4 table | PML4E | IA-32e | CR3 | 47:39 | N/A (PS must be 0) |
| Page-directory-pointer table | PDPTE | 32-bit | N/A | N/A | N/A |
| Page-directory-pointer table | PDPTE | PAE | CR3 | 31:30 | N/A (PS must be 0) |
| Page-directory-pointer table | PDPTE | IA-32e | PML4E | 38:30 | 1-GByte page if PS=1 |
| Page directory | PDE | 32-bit | CR3 | 31:22 | 4-MByte page if PS=1 |
| Page directory | PDE | PAE, IA-32e | PDPTE | 29:21 | 2-MByte page if PS=1 |
| Page table | PTE | 32-bit | PDE | 21:12 | 4-KByte page |
| Page table | PTE | PAE, IA-32e | PDE | 20:12 | 4-KByte page |

This can be seen in the table provided in the Translation Granule section of ARM documentation. The table is as follows:

| Level of table | 4KB granule (Size per entry) | 4KB granule (Bits used to index) | 16KB granule (Size per entry) | 16KB granule (Bits used to index) | 64KB granule (Size per entry) | 64KB granule (Bits used to index) |
|---|---|---|---|---|---|---|
| 0 | 512GB | 47:39 | 128TB | 47 | | |
| 1 | 1GB | 38:30 | 64GB | 46:36 | 4TB | 51:42 |
| 2 | 2MB | 29:21 | 32MB | 35:25 | 512MB | 41:29 |
| 3 | 4KB | 20:12 | 16KB | 24:14 | 64KB | 28:16 |

I'd like to point out one trivial but useful observation here. With a 4 KiB granule, bits [11:0] are used for the page offset, which selects an individual byte within a page. The same pattern holds for the other granules and table levels.

Multilevel paging
Figure 1: Multilevel paging

MMU Conclusion

Memory management is a critical task in any computing system. Unlike in embedded and application-specific systems, using virtual memory addresses and mapping them onto actual physical memory requires deep knowledge of the architecture and of how the architecture implements the scheme.

I made an attempt to briefly describe what an MMU is, the various virtual address widths supported by architectures, the additional complexity introduced by extensions to existing architectures, and terms such as page, frame, and translation granule. I also described the page table data structure, used to store information regarding pages, from a high-level view. Two major architectures were discussed, Intel and ARM, and an attempt was made to map the similarities between these two different architectures, with more emphasis on Intel. Again, a simple blog post like this one is not sufficient to properly explain the implementations employed by architectures. For interested readers, various references are provided in the subsections above for further reading.

Address Sanitizer

Address Sanitizer (or ASan) is a memory error detector, primarily for C and C++. The memory-related bugs discussed above are detected by ASan. Unlike some other similar tools, ASan is preferred because of how fast it is. The AddressSanitizerComparisonOfMemoryTools page of [GOOGLE_ASAN] compares various tools and states that ASan has a ~2x slowdown compared with the uninstrumented application. In contrast, Valgrind has a ~20x slowdown.

GCC version 4.8 and above, and LLVM version 3.1 and above, ship ASan as part of the toolchain. If your compiler toolchain supports ASan, the -fsanitize=address flag will be supported. GCC also mentions that using this flag implicitly enables -fsanitize-address-use-after-scope.

To use ASan, you need to compile and link your program with the -fsanitize=address flag. Unless you use the -static switch during compilation (Darwin does not support completely static executables), the ASan runtime will be linked dynamically. Once the executable is prepared, you can use otool(1) on Darwin or ldd(1) on Linux to check whether your program has been properly instrumented.

On Darwin:

$ otool -L ./fam_struct             
./fam_struct:
@rpath/libclang_rt.asan_osx_dynamic.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)

$ otool -l ./fam_struct
...
Load command 14
cmd LC_LOAD_DYLIB
cmdsize 72
name @rpath/libclang_rt.asan_osx_dynamic.dylib (offset 24)
time stamp 2 Thu Jan 1 05:30:02 1970
current version 0.0.0
compatibility version 0.0.0
...

And on Linux:

$ ldd ./fam_struct
linux-vdso.so.1 (0x0000ffff9cbbd000)
libasan.so.6 => /lib/aarch64-linux-gnu/libasan.so.6 (0x0000ffff9c0b0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff9bf00000)
/lib/ld-linux-aarch64.so.1 (0x0000ffff9cb84000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff9be60000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff9be30000)

Demystifying the Shadow Byte

When ASan is triggered, it reports verbose output. We'll look into one later, but let's first understand the shadow byte and the legend provided in the output:

Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc

For a program, the Addressable and Partially addressable regions are the only ones which are valid memory addresses. In ASan terms, such memory regions are unpoisoned. The redzone regions are the poisoned memory regions where memory access triggers ASan. The Shadow gap region is a region of memory that is not addressable; an attempt to access it results in a crash.

Partially Addressable

[AddressSanitizerAlgorithm#Mapping] states that there are only 9 different values for any 8 bytes of the application memory. We've already seen that if the byte value is 0, it indicates fully addressable application memory address, and a negative value indicates that corresponding application memory is unaddressable.

However, if the first k bytes are unpoisoned and the remaining 8-k bytes are poisoned, the shadow value is k. This is guaranteed by the fact that malloc(3) returns 8-byte-aligned chunks of memory.

The virtual address space for a process is divided into two disjoint classes: main application memory and shadow memory. I'll consider the 64-bit architecture, where the address range and the shadow offset differ from the 32-bit architecture. By virtue of virtual memory management, a single process is not limited to a small portion of the memory address space. This means a process theoretically has access to addresses ranging from 0 to 2^64 - 1. Recall that the smallest addressable unit is a byte. Theoretically, a 64-bit address space allows addressing of:

  • 18,446,744,073,709,551,616 Bytes
  • 18,014,398,509,481,984 Kibi Bytes (KiB)
  • 17,592,186,044,416 Mebi Bytes (MiB)
  • 17,179,869,184 Gibi Bytes (GiB)
  • 16,777,216 Tebi Bytes (TiB)
  • 16,384 Pebi Bytes (PiB)
  • 16 Exbi Bytes (EiB)

In practice, not all of the address space can be used by the process. A simple example: on 64-bit Darwin, the address range [0x0000000000000000, 0x00000000FFFFFFFF] (4 GiB) is marked as the __PAGEZERO segment, and dereferencing any address within this range results in a segmentation violation:

$ otool -l ./fam_struct
Load command 0
cmd LC_SEGMENT_64
cmdsize 72
segname __PAGEZERO
vmaddr 0x0000000000000000
vmsize 0x0000000100000000
fileoff 0
filesize 0
maxprot 0x00000000
initprot 0x00000000
nsects 0
flags 0x0
...

Why is __PAGEZERO segment needed

One of the biggest transitions in software has been the shift from 32-bit to 64-bit architectures. Systems programming languages like C allow users to write code such as:

char *p = NULL;
p[0xfeed] = 0x01; /* *(0 + 0xfeed) = 0x01 */

which is often called a NULL pointer dereference. When software written for a 32-bit architecture is ported to a 64-bit architecture without being 64-bit safe, the virtual __PAGEZERO segment allows catching such bugs. This is why the segment covers the entire 32-bit memory address space (4 GiB) on 64-bit architectures. On 32-bit architectures, this segment is 4 KiB.

Even on a 64-bit architecture, indexing off a NULL pointer can still reach valid memory, as seen in the program below:

#include <stdio.h>
#include <stdint.h> /* for uintptr_t */

int
main (void)
{
char str[5] = {'t', 'a', 'l', 'e', '\0'};
char *p = NULL;
uintptr_t straddr;

straddr = (uintptr_t) str;
p[straddr + 1] = 'e';
p[straddr + 3] = 'l';

printf("str: %s\n", str);
}
$ gcc -Wall -std=c99 -pedantic -fsanitize=address -O0 -o null_dereference ./null_dereference.c 
$ ./null_dereference
str: tell

Also notice the upper bound of the HighMem region (shown below). With canonical addressing, the upper 17 bits of the 64-bit user address space are unused. The LowMem and HighMem regions are the only valid regions that can be dereferenced without ASan raising a runtime error.

ASan marks a portion of this memory region as shadow memory, which is further divided. Typically, there are three distinct regions within the shadow memory: LowShadow, HighShadow, and ShadowGap. [AddressSanitizerAlgorithm#64-bit] lists the division of the address space. A more architecture-specific mapping can be found in [compiler-rt/lib/asan/asan_mapping.h] of the LLVM project. On my machine, the documented address ranges do not correspond to the actual ones; the ASAN_OPTIONS environment variable can influence the behavior of ASan. To fetch the actual address ranges, I did:

$ ASAN_OPTIONS=verbosity=1 ./fam_struct          
==80466==AddressSanitizer: libc interceptors initialized
|| `[0x107000020000, 0x7fffffffffff]` || HighMem ||
|| `[0x027e00024000, 0x10700001ffff]` || HighShadow ||
|| `[0x007e00024000, 0x027e00023fff]` || ShadowGap ||
|| `[0x007000020000, 0x007e00023fff]` || LowShadow ||
|| `[0x000000000000, 0x00700001ffff]` || LowMem ||
MemToShadow(shadow): 0x007e00024000 0x007fc00247ff 0x00bfc0024800 0x027e00023fff
...
SHADOW_SCALE: 3
SHADOW_GRANULARITY: 8
SHADOW_OFFSET: 0x7000020000
...

Of the 140,737,488,355,328 (128 TiB) addressable bytes for a process (range [0, 0x7fffffffffff]), the shadow memory address space spans 0x007000020000 to 0x10700001ffff. This means the shadow memory consumes 17,592,186,044,416 addressable bytes (16 TiB). Notice that the shadow memory is one-eighth of the total addressable bytes. This is because ASan maps 8 addressable bytes of main application memory onto 1 addressable byte of shadow memory (as mentioned in the legend).

One might wonder why the address ranges are laid out the way they are, specifically the HighShadow and LowShadow regions. There is an interesting reason behind this: the operation used to convert a non-shadow address to a shadow address, when applied to a shadow address, yields an address that lies in the ShadowGap region. To calculate the shadow address for an address, the following operation is applied:

/* Instrumentation used for each memory access */
shadow = (memory >> 3) + SHADOW_OFFSET;

As described previously, since 8 application bytes correspond to 1 shadow byte, we perform the bitwise shift operation to clear out the lower 3 bits. In practice, the SHADOW_OFFSET value is the same as the lower bound of the LowShadow region, as can be seen above. Consider the memory address 0x700001FFBA, which lies in the LowMem region. The shadow address of this address would be:

0x700001FFBA >> 3 -> 0xE00003FF7
0xE00003FF7 + SHADOW_OFFSET (0x7000020000) -> 0x7E00023FF7

Notice that 0x7E00023FF7 lies in the LowShadow region. Now let's try to apply the same operation to this address:

0x7E00023FF7 >> 3 -> 0xFC00047FE
0xFC00047FE + SHADOW_OFFSET (0x7000020000) -> 0x7FC00247FE

The resulting address lies in the ShadowGap region. The ShadowGap region is unaddressable, and is hit when the application probed by ASan tries to dereference memory that already lies in the shadow region. Since every memory access is instrumented with the above operation to check the validity of the address, the ShadowGap region ensures the application errors out when dereferencing a memory address that lies in the shadow region.

Reading Shadow Byte Output by ASan

To look into the output of ASan, let's work on a simple program which is intentionally buggy:

/* fam_struct.c */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct fam {
int i;
long j;
int arr[];
};

int
main (void)
{
struct fam *a, *b;

a = malloc(sizeof(struct fam) + 10 * sizeof(int));
b = malloc(sizeof(struct fam) + 2 * sizeof(int));

if (!a || !b) exit(1);

a->arr[7] = 100;

printf("The address of a is %p and b is %p\n",
a, b);

printf("The addresses of a are:\n"
"&i: %p\n"
"&j: %p\n"
"arr: %p\n",
&a->i, &a->j, a->arr);

printf("The addresses of b are:\n"
"&i: %p\n"
"&j: %p\n"
"arr: %p\n",
&b->i, &b->j, b->arr);

*b = *a;

printf("The content of the tenth element is: %d\n", b->arr[7]);

return (0);
}

The program uses a Flexible Array Member (FAM) within the structure, a feature introduced in C99. The structure pointers a and b are given memory at runtime using the malloc(3) function. Needless to say, memory allocation for FAM structures isn't usually done like this; typically, a member within the structure is used as a size property of the array. Since b's array is not large enough to index element 7, accessing it results in undefined behavior. Furthermore, the assignment *b = *a is also undefined since the underlying objects of the two pointers have different sizes.

Structure Assignment

Pre-ANSI K&R C lacked support for using the assignment operator on structures. The C Programming Language (1978) § 6.2: Structures and Functions states the following:

There are a number of restrictions on C structures. The essential rules are that the only operations that you can perform on a structure are take its address with &, and access one of its members. This implies that structures may not be assigned to or copied as a unit, and that they can not be passed to or returned from functions. (These restrictions will be removed in forthcoming versions.) Pointers to structures do not suffer these limitations, however, so structures and functions do work together comfortably. Finally, automatic structures, like automatic arrays, cannot be initialized; only external or static structures can.

ANSI C did standardize assignment of structures, but how the copy is carried out is left for compiler writers to decide. For example, when assigning a structure as simple as:

struct basic_structure {
int i, j, k;
char a, b, c;
};

the compiler can use plain LOAD/STORE instructions to perform the assignment. In practice, however, structures are complicated objects whose members can be primitive or other derived types. For such structures, the compiler will often prefer calling memcpy(3) to copy the bits from one structure to the other.

When talking about complex structures, we also need to consider two kinds of copying: shallow copy and deep copy. I will not explain this in detail, but the basic gist is: for a structure which contains reference members (pointers), a shallow copy copies the pointer value and not the underlying object--both pointer variables end up holding the same address--whereas a deep copy creates a new reference whose underlying object is identical. For example, when using the assignment operator on a structure:

struct my_string {
size_t length;
size_t capacity;
char *str;
};

struct my_string name1, name2;

/* Assume name2 has been populated */

/* Assign to name1 */
name1 = name2;

The str member of name1 holds the same value as the str member of name2. This is usually undesired: both structures now refer to the same buffer, so if both were freed at some point later on, the program would incur the double free problem.

In contrast, a deep copy would create a new reference and copy the underlying objects. The assignment operator is avoided and instead a dedicated function is created:

int
my_string_deep_copy (struct my_string *dst, struct my_string *src)
{
/* primitive data types can be safely assigned */
dst->length = src->length;
dst->capacity = src->capacity;

/*
* request address range, which is identical to that of src.
* Here, we could also use 'strlen(3)' on 'src->str' but since
* 'capacity' member is present, we can use it.
*/
dst->str = malloc(src->capacity);
if (!dst->str) return 1; /* malloc failed */

/*
* The length member is the length of 'str' whereas 'capacity'
* describes the actual capacity of 'str'. 'length' must be
* equal or less than 'capacity'.
*/
memcpy(dst->str, src->str, src->length);

return 0;
}

Copying structures is a non-trivial task, and the implicit operation may not always align with the programmer's needs. This can be seen in other languages such as C++ and Rust, where the concept of moving objects is a feature provided by the language.

Let's run the program and see the output:

$ gcc -Wall -Wextra -std=c99 -O0 -fsanitize=address -g -o fam_struct ./fam_struct.c 

$ ./fam_struct
The address of a is 0x105001460 and b is 0x105101ea0
The addresses of a are:
&i: 0x105001460
&j: 0x105001468
arr: 0x105001470
The addresses of b are:
&i: 0x105101ea0
&j: 0x105101ea8
arr: 0x105101eb0
=================================================================
==58730==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x000105101ecc at pc 0x000102e47cf4 bp 0x00016cfbae70 sp 0x00016cfbae68
READ of size 4 at 0x000105101ecc thread T0
#0 0x102e47cf0 in main fam_struct.c:115
#1 0x18aa17f24 (<unknown module>)
...
SUMMARY: AddressSanitizer: heap-buffer-overflow fam_struct.c:115 in main
Shadow bytes around the buggy address:
0x007020a40380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020a40390: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020a403a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020a403b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x007020a403c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x007020a403d0: fa fa fa fa 00 00 00 fa fa[fa]00 00 00 00 fa fa
0x007020a403e0: 00 00 00 00 fa fa 00 00 06 fa fa fa 00 00 00 00
0x007020a403f0: fa fa 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00
0x007020a40400: 00 00 fa fa 00 00 00 00 fa fa 00 00 00 00 fa fa
0x007020a40410: 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00 00 00
0x007020a40420: fa fa 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00
...
tip

The address on the left side is the address of the shadow memory region. Recall that each shadow byte corresponds to 8 bytes of application memory. Each row has 16 columns, so one row describes 128 application memory addresses. The byte values correspond to the legend. Here, fa means Heap left redzone, which is expected since we're requesting memory using malloc(3).

On a terminal, the output is colorized so the user can inspect it more efficiently. But let's focus on the report at hand. Since we're inspecting the object referenced by b, let's first get the shadow memory addresses for its address, specifically the address of the array member:

shadow(&b->arr[0]) -> shadow(0x105101eb0) -> 0x7020A403D6
shadow(&b->arr[1]) -> shadow(0x105101eb4) -> 0x7020A403D6
shadow(&b->arr[2]) -> shadow(0x105101eb8) -> 0x7020A403D7
shadow(&b->arr[3]) -> shadow(0x105101ebc) -> 0x7020A403D7
shadow(&b->arr[4]) -> shadow(0x105101ec0) -> 0x7020A403D8
shadow(&b->arr[5]) -> shadow(0x105101ec4) -> 0x7020A403D8
shadow(&b->arr[6]) -> shadow(0x105101ec8) -> 0x7020A403D9
shadow(&b->arr[7]) -> shadow(0x105101ecc) -> 0x7020A403D9
shadow(&b->arr[8]) -> shadow(0x105101ed0) -> 0x7020A403DA
shadow(&b->arr[9]) -> shadow(0x105101ed4) -> 0x7020A403DA

Notice that the ASan report encloses the faulting address inside square brackets. If you work out the corresponding shadow memory address, it is the one that corresponds to &b->arr[7] (and &b->arr[6]). If you're wondering why the address is incremented by 4 on each iteration, it's because an int object on my machine takes 4 bytes of storage, which is also why two consecutive array elements share one shadow byte.

ASan Can Miss

The above report reveals something interesting. Instead of accessing the element at index 7, let's try accessing the element at index 8. Only the printf(3) statement is modified:

printf("The content of the tenth element is: %d\n", b->arr[8]);

which then gives the output as:

$ gcc -Wall -Wextra -std=c99 -O0 -fsanitize=address -g -o fam_struct ./fam_struct.c

$ ./fam_struct
The address of a is 0x106c01460 and b is 0x106d01ea0
The addresses of a are:
&i: 0x106c01460
&j: 0x106c01468
arr: 0x106c01470
The addresses of b are:
&i: 0x106d01ea0
&j: 0x106d01ea8
arr: 0x106d01eb0
The content of the tenth element is: -1969390120

To understand this, let's review the report from above. The shadow address corresponding to &b->arr[8] is 0x7020A403DA. If you look at the shadow bytes above, the value at that shadow memory address is 00, which indicates that the corresponding application address is fully addressable. By default, the memory allocation functions provided by ASan poison the boundary addresses around each allocated region. Since we only requested 2 integers' worth of memory for b's arr member, ASan did poison the addresses next to it. But the redzone is finite: &b->arr[8] lands just past it, in a region that is addressable again (here, a neighboring allocation)!

Luckily, in most cases what ASan does is sufficient. Off-by-one is a famous phrase in computer science for a reason: most memory-related problems occur because we access memory just outside the valid range. The people who designed ASan were aware of this and designed the tool accordingly. But I found this bit interesting and worth sharing.

vmmap(1) to Inspect Memory Properties

Let's now inspect the memory regions using the vmmap(1) utility. Since the fam_struct program terminates almost immediately, we need a way to pause the process so we can inspect its memory. We can achieve this with the ASAN_OPTIONS environment variable itself:

$ ASAN_OPTIONS=sleep_before_dying=60 ./fam_struct
The address of a is 0x102201460 and b is 0x102301ea0
The addresses of a are:
&i: 0x102201460
&j: 0x102201468
arr: 0x102201470
The addresses of b are:
&i: 0x102301ea0
&j: 0x102301ea8
arr: 0x102301eb0
=================================================================
==54610==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x000102301ecc at pc 0x00010006fcf4 bp 0x00016fd92e70 sp 0x00016fd92e68
READ of size 4 at 0x000102301ecc thread T0
#0 0x10006fcf0 in main fam_struct.c:115
#1 0x18aa17f24 (<unknown module>)
...
==54610==ABORTING
==54610==Sleeping for 60 second(s)

On another terminal, call vmmap(1) to inspect the memory region:

$ sudo vmmap 54610
Can't examine target process's malloc zone asan_0x100edc950, so memory analysis will be incomplete or incorrect.
Reason: for security, cannot load non-system library /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
Process: fam_struct [54610]
Path: /Users/USER/Downloads/*/fam_struct
Load Address: 0x10006c000
Identifier: fam_struct
Version: ???
Code Type: ARM64
Platform: macOS
...
==== Non-writable regions for process 54610
REGION TYPE START - END [ VSIZE RSDNT DIRTY SWAP] PRT/MAX SHRMOD PURGE REGION DETAIL
...
Sanitizer 7e00024000-27e00024000 [ 2.0T 0K 0K 0K] ---/rwx SM=NUL
...
==== Writable regions for process 54610
REGION TYPE START - END [ VSIZE RSDNT DIRTY SWAP] PRT/MAX SHRMOD PURGE REGION DETAIL
Sanitizer 102300000-102400000 [ 1024K 0K 0K 16K] rw-/rwx SM=PRV
...
Sanitizer (reserved) 700001c000-702001c000 [512.0M 0K 0K 0K] rw-/rwx SM=NUL reserved VM address space (unallocated)
Sanitizer 702001c000-702801c000 [128.0M 0K 0K 2688K] rw-/rwx SM=PRV
Sanitizer 702801c000-702ded4000 [ 94.7M 0K 0K 16K] rw-/rwx SM=ZER
Sanitizer (reserved) 702ded4000-702dfd0000 [ 1008K 0K 0K 0K] rw-/rwx SM=NUL reserved VM address space (unallocated)
Sanitizer 702dfd0000-703001c000 [ 32.3M 0K 0K 16K] rw-/rwx SM=ZER
Sanitizer 703001c000-703801c000 [128.0M 0K 0K 208K] rw-/rwx SM=PRV
Sanitizer 703801c000-704001c000 [128.0M 0K 0K 48K] rw-/rwx SM=PRV
Sanitizer 704001c000-704801c000 [128.0M 0K 0K 16K] rw-/rwx SM=PRV
Sanitizer (reserved) 704801c000-7e00024000 [ 54.9G 0K 0K 0K] rw-/rwx SM=NUL reserved VM address space (unallocated)
Sanitizer (reserved) 27e00024000-107000020000 [ 13.9T 0K 0K 0K] rw-/rwx SM=NUL reserved VM address space (unallocated)
MALLOC_NANO (empty) 600000000000-600008000000 [128.0M 16K 16K 0K] rw-/rwx SM=PRV
MALLOC_NANO (empty) 600008000000-600020000000 [384.0M 0K 0K 0K] rw-/rwx SM=NUL

==== Legend
SM=sharing mode:
COW=copy_on_write PRV=private NUL=empty ALI=aliased
SHM=shared ZER=zero_filled S/A=shared_alias
PURGE=purgeable mode:
V=volatile N=nonvolatile E=empty otherwise is unpurgeable

==== Summary for process 54610
ReadOnly portion of Libraries: Total=807.8M resident=24.0M(3%) swapped_out_or_unallocated=783.7M(97%)
Writable regions: Total=14.0T written=18.9M(0%) resident=112K(0%) swapped_out=18.9M(0%) unallocated=14.0T(100%)
...

The output from vmmap(1) is verbose, to say the least. But notice the entry in the non-writable regions: the address range is identical to that of the ShadowGap region we saw previously. This range has a protection of ---, meaning any attempt to read, write, or execute bytes within it results in a segmentation violation. But the maximum protection of this region is rwx, meaning that an application such as a debugger can request write access (or other access) to pages in this region. In the sharing mode column (SHRMOD), this region is marked NUL, meaning the pages do not actually exist in physical memory. The output of vmmap(1) is described in the EXPLANATION OF OUTPUT section of its manual page.

The address range 700001c000-7e00024000 encapsulates the LowShadow region, while the range 27e00024000-107000020000 covers the HighShadow region. As expected, these regions are readable and writable, but they have varying sharing modes. Private pages (marked PRV) are visible only to the process; such pages are allocated as they are written to, and can be paged out to disk. The zero-filled sharing mode (marked ZER) indicates a private, anonymous mapping that is initialized to zero and isn't yet backed by physical memory.

The bracketed columns describe the state of the memory region. Do note that vmmap(1) can also show submaps, which generate finer-grained output. The virtual size (VSIZE) is the amount of virtual address space the region spans. The resident size (RSDNT) is the amount of memory physically present. The dirty size (DIRTY) is the amount of memory that has been modified by the process. Finally, the swapped size (SWAP) is memory that was once resident but has been paged out to disk.

The faulting address is in the range 102300000-102400000, which is marked with the Sanitizer region type. ASan translates the faulting address to its shadow address and checks whether that shadow byte marks the address as valid or poisoned. Since the byte value at the corresponding shadow memory address is neither 0 (fully addressable) nor a partial-addressability value, ASan triggers a runtime error and produces the report we discussed above.

Linux Virtual Memory Management Reporting

We have discussed the vmmap(1) utility so far. Unfortunately, Linux does not provide this utility; an alternative is the pmap(1) utility. Linux also allows viewing memory mappings directly through the /proc pseudo-filesystem. [The /proc Filesystem] document describes this pseudo-filesystem at great length. For a process with PID 1234, the file /proc/1234/maps lists the memory mappings of the process, including mapped executables and library files. smaps is an extension of maps: /proc/1234/smaps reports the memory consumption of each mapping and the flags associated with the memory region. The documentation linked above explains the output of these files in detail, so interested readers can refer to it for further understanding.

References

[GOOGLE_ASAN] https://github.com/google/sanitizers/wiki/AddressSanitizer (archive)

[DLMALLOC] https://gee.cs.oswego.edu/dl/html/malloc.html (archive) (github)

[INTEL_ARCH_SDM] https://web.archive.org/web/20151025081259/http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

[ARMV7_VMSA] https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-?lang=en

[ARMV8_9_AARCH64_VMSA] https://developer.arm.com/documentation/ddi0487/maa/-Part-D-The-AArch64-System-Level-Architecture/-Chapter-D8-The-AArch64-Virtual-Memory-System-Architecture?lang=en

[ARMV8_9_AARCH32_VMSA] https://developer.arm.com/documentation/ddi0487/maa/-Part-G-The-AArch32-System-Level-Architecture/-Chapter-G5-The-AArch32-Virtual-Memory-System-Architecture?lang=en