Static Analysis
Introduction
Memory is a crucial topic when it comes to building software. Programs running on historical devices referenced physical memory locations directly; this has been superseded by a separate component, the Memory Management Unit (MMU). We now work with virtual memory addresses that the MMU translates for us[0].
I recently stumbled across a video regarding the Rust™ language. The intent of this post is not to look down on the Rust™ language but to observe the behaviour of a trivial C program. A program fragment was shown in the video, similar to the one shown in Listing 1. After building the executable, we notice that the program exits normally, even though it writes to a region of memory that was already "freed". In contrast, the Rust™ compiler reports this issue during compilation.
#include <stdlib.h>

int
main (void)
{
    int *x = malloc(sizeof(int));
    free(x);
    *x = 0xA455;    /* use-after-free: x no longer points to a live object */
    exit(0);
}
One would assume that the program from Listing 1 (executed in Listing 2) should have
received the segmentation violation signal (SIGSEGV), as it attempted to access a freed memory
region. The reality is more complex than described here. In essence, malloc-like memory allocation
library functions do more than just allocate memory. Demystifying malloc is not the purpose for
now. Interested readers can browse link [1] provided at the end. To request any operation from the
system, we use system calls. To request more memory for a process, we use the mmap(2) system call.
Script started on Wed Aug 27 16:13:10 2025
bash-3.2$ ./listing1
bash-3.2$ echo $?
0
bash-3.2$ exit
Script done on Wed Aug 27 16:13:16 2025
The output of the program listing1 (and others) is captured using the script(1) utility. The output from this utility may contain control characters, which are then removed using the col(1) command. The commands are used as follows:
$ SHELL=/bin/bash script <output-file-name>
...
$ col -b < <input-file-name> > <output-file-name>
My default shell is zsh. By default, script(1) starts the shell named by the SHELL environment variable. My zsh configuration contains coloring and other "special" characters that end up in the output file. Unfortunately, even col (with the given flag) is not able to clean up the terminal output. For simplicity, I chose to show the output from the bash shell.
mmap(2) System Call
As the name suggests, mmap(2) is used to map a file described by a file descriptor into memory. But it also
allows anonymous mappings. It is more general than malloc-like library functions. For instance,
we can specify the protection of the memory region. By default, this system call assumes that the caller wishes
to map a region of a file into memory. We need to explicitly state that we intend to map anonymous memory and
not a file. The return value from this system call is the starting address of the mapped memory. Where the
mapping is placed is implementation defined; the address typically lies somewhere between the heap and the stack of a process.
Listing 3 shows a program similar to the one shown in Listing 1, using mmap(2) and munmap(2) in place of
malloc and free. The first argument to mmap(2) takes an address that the kernel will use as a "hint" as to
where the starting address of the mapped region will be placed. Unless the MAP_FIXED flag is used, any
previous mapping at the requested address is not replaced. If mmap(2) is called with the MAP_FIXED flag
and the first argument is an address that already contains a previous mapping, then upon successful
return the previous mapping is replaced. The use of MAP_FIXED is discouraged if portability is a consideration.
#include <sys/mman.h>
#include <stdlib.h>

int
main (void)
{
    int *x = mmap(NULL,
                  sizeof(int),
                  PROT_WRITE,
                  MAP_ANONYMOUS | MAP_PRIVATE,
                  /* ignored */ -1,
                  /* ignored */ 0);
    munmap(x, sizeof(int));
    *x = 0xA455;    /* the page is unmapped, so this write faults */
    exit(0);
}
When the program from Listing 3 is compiled and executed, we see the behavior shown in Listing 4. The process terminates due to a segmentation violation. When this program is executed inside a debugger, you'll notice that the signal is received when the memory address is dereferenced for the assignment of a value.
Script started on Wed Aug 27 16:18:50 2025
bash-3.2$ ./listing3
Segmentation fault: 11
bash-3.2$ echo $?
139
bash-3.2$ exit
Script done on Wed Aug 27 16:19:01 2025
As mentioned earlier, malloc(3) does not simply allocate memory and return the address of the allocated
memory region. This function internally performs various memory management operations (as can be seen in
musl's implementation). As we've seen in Listing 3, a segmentation violation occurs if the memory
region received from mmap(2) was unmapped using munmap(2). I have yet to explore the actual implementation of the
free function, but at a glance, it looks like the process "advises" the system that the information
contained in the memory region is not needed and can be reused right away. A process provides such advice
to the system through the madvise system call. The free(3) function may eventually invoke the
munmap(2) system call as well, but not necessarily at the moment of the call; until then the freed region
stays mapped, which is why the write in Listing 1 does not fault.
The Rust™ compiler is able to detect such errors by virtue of static analysis of the source file. During
compilation, the Rust™ compiler performs the various checks that most compilers perform for their respective
language. In addition to this, the Rust™ compiler statically analyzes the source file for memory-related
issues such as this one (use-after-free) along with other ownership model checks. This does not mean that
compilers for the C language do not support this feature. For instance, gcc provides the -fanalyzer
option that can be used during compilation to perform inter-procedural analysis. clang from llvm also
provides a similar feature, but the option is called --analyze. clang also provides a command-line utility
called scan-build that is used during the build process.
As we can see in Listing 5, the static analyzer reports that the program suffers from a use-after-free issue. I
usually prefer the runtime analyzers over the static analyzer as their reporting is more verbose. A compiler flag
such as -fsanitize=address instruments the program with the Address Sanitizer (ASan) so that issues
like use-after-free are detected at runtime. It also provides a stack backtrace. Another one I frequently use
is -fsanitize=undefined, which detects undefined behavior at runtime (UBSan). Some of the
potential uses of UBSan are detecting array subscripts out of bounds where the bounds can be statically
determined [2], signed integer overflow, and dereferencing misaligned or null pointers [3].
Script started on Wed Aug 27 16:28:49 2025
bash-3.2$ clang --analyze -DLISTING1 segfault.c
segfault.c:107:6: warning: Use of memory after it is freed [unix.Malloc]
*x = 0xA455;
~~ ^
warning generated.
bash-3.2$ exit
Script done on Wed Aug 27 16:29:15 2025
Static analysis has its pros and cons. It helps ensure that the program being built is hardened against some commonly found bugs. But it does come with a tradeoff: compilation time. Both gcc and clang mention that static analysis is expensive compared to other warning flags [4].
References
- [0] Before the advent of the MMU, programs used physical memory locations in RAM for various operations. This was an issue that would require its own blog post. LaurieWired made a video discussing virtual memory addressing called How a Clever 1960s Memory Trick Changed Computing. It's safe to say that most general purpose computers have a dedicated MMU that handles the required translation of virtual memory addresses to physical addresses in RAM. The virtual addresses used by one process are not necessarily distinct from those used by another process. In fact, if two programs were to use the same standard C library, those processes could load the library at identical virtual addresses, although it is not a requirement.
- [1] https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/malloc.c
- [2] One might assume that this issue should be a consideration for ASan. Consider a function variable foo that is an array of characters. Such function variables are stored on the stack, not in the data or the bss section, during runtime. For a program under execution, there probably won't be any segmentation violation if an attempt is made to access a subscript of foo that exceeds the declared size. This is because a function's stack frame contains more than that array: the values saved by the function prologue and restored by the epilogue as required by the architecture's ABI (typically including the return address), a potential stack canary (that must not be tampered with), and the local variables that are declared for the function. On most architectures the stack grows toward lower addresses, so there's a chance that a buffer overflow could modify the return address stored above the local variables and cause the function to return to some other location, causing arbitrary code to be executed later.
- [3] When I tried to run the programs from Unix Network Programming, by W.R. Stevens, some of them caused this runtime error. The buffered data from the network does not only contain ASCII text. For example, if the server sends binary data for struct timeval to the client, the client can't interpret the raw bytes directly and we need to assign them to a variable of type struct timeval. A character variable can be placed at any address (1-byte alignment), but that is not the case for a struct timeval variable. If this structure is 8-byte aligned, then accessing it through any address whose 3 least significant bits (LSbs) are not all zero causes a memory alignment issue.
- [4] https://gcc.gnu.org/onlinedocs/gcc/Static-Analyzer-Options.html and https://clang-analyzer.llvm.org/