Computing Writesets of C Functions

Problem Statement

The writeset is the set of things written (created, updated or deleted), commonly in the context of databases: given a specific transaction, it is the set of the things that will be affected by its execution.

During my PhD, when working on the Bandwidth-adaptive Page Placement in NUMA publication, one interesting question arose - could we do the same for regular code?

Our Solution

Could we write a function computeWriteSet that receives an arbitrary function to be executed and returns the set of writes (including from other functions called from within)? That’s what I asked on StackOverflow.

We worked on a Linux-specific mechanism that relies on playing with mprotect, a function that controls the permissions/protections of regions of memory. In Linux, all memory is split into pages (usually of 4KB each). The minimum granularity that is possible to specify these protections are in page increments.

Our solution simply protects the whole memory from being written (which causes dreadful segmentation faults every time a write is attempted). We then use a Signal Handler to catch the signal, unprotect the memory and record the address accessed.

This is the most compact version I was able to devise:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <ucontext.h>
#include <fcntl.h>
#include <execinfo.h>
#include <sys/mman.h>

#include <set>
#include <functional>
#include <cassert>

extern "C" {
extern int __data_start;
extern int _end;
}

#define PAGE_SIZE sysconf(_SC_PAGESIZE)
#define PAGE_MASK (PAGE_SIZE - 1)
#define PAGE_ALIGN_DOWN(x) (((intptr_t) (x)) & ~PAGE_MASK)
#define PAGE_ALIGN_UP(x) ((((intptr_t) (x)) + PAGE_MASK) & ~PAGE_MASK)
#define GLOBALS_START PAGE_ALIGN_DOWN((intptr_t) &__data_start)
#define GLOBALS_END   PAGE_ALIGN_UP((intptr_t) &_end - 1)
#define GLOBALS_SIZE  (GLOBALS_END - GLOBALS_START)

std::set<void*> *addresses = new std::set<void*>();

void sighandler(int signum, siginfo_t *siginfo, void *ctx) {
    void *addr = siginfo->si_addr;
    void *aligned_addr = reinterpret_cast<void*>(PAGE_ALIGN_DOWN(addr));
    switch(siginfo->si_code) {
    case SEGV_ACCERR:
        mprotect(aligned_addr, PAGE_SIZE, PROT_READ | PROT_WRITE);
        addresses->insert(aligned_addr);
        break;
    default:
        exit(-1);
    }
}

void computeWriteSet(std::function<void()> f) {
    static bool initialized = false;
    if (!initialized) {
        // install signal handler
        stack_t sigstk;
        sigstk.ss_sp = malloc(SIGSTKSZ);
        sigstk.ss_size = SIGSTKSZ;
        sigstk.ss_flags = 0;
        sigaltstack(&sigstk, NULL);
        struct sigaction siga;
        sigemptyset(&siga.sa_mask);
        sigaddset(&siga.sa_mask, SIGSEGV);
        sigprocmask(SIG_BLOCK, &siga.sa_mask, NULL);
        siga.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_RESTART | SA_NODEFER;
        siga.sa_sigaction = sighandler;
        sigaction(SIGSEGV, &siga, NULL);
        sigprocmask(SIG_UNBLOCK, &siga.sa_mask, NULL);
        initialized = true;
    }
    addresses->clear();
    printf("\nexecuting function\n");
    printf("--------------\n");
    mprotect(reinterpret_cast<void*>(GLOBALS_START), GLOBALS_SIZE, PROT_READ);
    f();
    mprotect(reinterpret_cast<void*>(GLOBALS_START), GLOBALS_SIZE, PROT_READ | PROT_WRITE);
    printf("--------------\n");
    printf("pages written:\n");
    for (auto addr : *addresses) {
        printf("%p\n", addr);
    }
}

We can then use it to check the writes done by a given function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
void f() {
    static int x[1024] = {0};
    static int y[1024] = {0};
    static int z[1024] = {0};
    static bool firsttime = true;
    if (firsttime) {
        printf("&x[0] = %p\n&y[0] = %p\n&z[0] = %p\n", x, y, z);
        firsttime = false;
    }
    if (y[0]) z[0]++;
    if (x[0]) y[0]++;
    x[0] = (x[0] + 1) % 2;
    printf("{ x, y, z } = { %d, %d, %d }\n", x[0], y[0], z[0]);
}

int main() {
    computeWriteSet(f);
    computeWriteSet(f);
    computeWriteSet(f);
    return 0;
}

When executed, it produces the following output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
executing function
--------------
&x[0] = 0x6041c0
&y[0] = 0x6051c0
&z[0] = 0x6061c0
{x, y, z} = {1, 0, 0}
--------------
pages written:
0x604000

executing function
--------------
{x, y, z} = {0, 1, 0}
--------------
pages written:
0x604000
0x605000

executing function
--------------
{x, y, z} = {1, 1, 1}
--------------
pages written:
0x604000
0x606000

Limitations

Conclusion

In the end, the answer is yes, we can do the same to functions as we can do to database transactions, albeit within the limitations of the operating system and with a significant performance penalty, which prohibits its frequent/extensive usage.