AFX - Asynchronous Function Execution

Preemptive function scheduling at user level is a very interesting concept. I get to know about it when I started learning Go. And I tried to implement it in C.

How Go routines works

The Go runtime creates a separate thread called sysmon. This thread monitors and schedules other go routines. These go routines runs on a pool of threads. Before Go 1.14, the Go compiler used to put checks inside a go routine at specific places like before a function call. When the control reaches these checks, it stops the thread and gives control back to the sysmon. The sysmon then schedules another go routines to run on that thread. This approach is known as Co-operative scheduling.

This works great for tasks in which a lot of I/O calls are made, but fails to work when there is a tight loop. Hence in Go 1.14, Preemptive scheduling was added. In Preemptive scheduling, the sysmon checks whether there is a Go routine that is running for more than a specific time interval (10ms), and then forcefully pauses it by sending a signal.

More on Go scheduling: How go routines works

Goals

Minimal: I don't want the user to care about how the scheduling is happening i.e. I don't want to expose too many functions.
Function Signature agnostic: The scheduler should be able to schedule any type of function, without caring about the number/type of arguments.

Initial design

Monitor: A thread that will periodically halts the execution of current async function and schedules the next.
Executor: A thread on which the actually execution of async functions will happen.
Custom stacks: I am thinking of allocating a dedicated chunk of memory on heap for each async function, which will serves as its stack.
Scheduling: The monitor will preempt the executor, on which we need to store the entire state of cpu somewhere and load the state of next function on cpu.

How to preempt a thread

We can send a signal to a thread and we can define a function that will be called when a thread receives a signal. This pauses the execution of that thread, and the kernel also provides us the cpu state at which the signal was received. We can change the cpu state and the program counter to start executing another part of code. I have used the SIGURG signal, because Go also uses it.

Storing function data without corruption

NOTE: To understand what is really going on, you need to have some knowledge of how a function call works under the hood.

Context is the data in CPU registers and the data in stack at a moment. Every thread has its own stack, which expands and shrinks dynamically. Because we are using the same stack size for each async function, there is a risk of stack overflow. And since we cannot know at compile time how much stack a function will need, we cannot precisely predict the maximum stack size that should be allocated for that function (Zig is working in this direction, link). We need to figure a way to extend the stack, but that's for later.

How a function calls happens under the hood

When a function is called, these many things happens

The caller function (the function that is calling another function) stores first 6 args in rdi, rsi, rdx, rcx, r8 and r9 registers. All the next arguments are stored on stack. Then the caller stores its own return address and the value of rbp (base pointer) register on the stack.
The caller calls the callee (the function that is being called). Callee function stores the value of rdi, rsi, rdx, rcx, r8 and r9 registers on the stack. Sometimes this does not happens, as if there is no need to use these registers for any other purposes (example if the function is very small and is performing only some operations). Compiler decides which args should be stored on stack, and which should not. This is known as a function prologue. And after that the code that the programmer has written starts executing.
After the execution, the callee resets the stack pointer (rsp) (if it has used it to store any of its own variables). And using the return address, the execution goes back to the caller.

Storing and switching context

Switching between functions is not so complicated. When the signal is handled, the kernel provides us a pointer to all the registers and we can copy them, and we can also change them. What is complicated is to start a function as an async function. We need to store the value of registers right after the function prologue, and we need to store the value of the args(>7) that are stored on stack. I have used inline assembly to store the register values, and I store the rbp register value (rbp_caller) right before the caller calls the callee and right after the callee function's prologue (rbp_callee). The space between these two addresses is where the return address, base pointer and arguments (>7) are stored.

For every async function, an 8KB(can be changed during compilation) memory space is allocated as its own custom stack, and the data between rbp_caller and rbp_callee is copied. The initial idea was that we can embed some assembly code at the starting of the function to save the state of the registers and the stack and then we'll stop the execution of the function after storing its state by calling the leave and ret instruction. And then put a label (1:) just after these instruction. And We can store the rip register (address of the next instruction) at the address of the label 1. When any function calls the an async function, we store the context and then return without executing any of the code. And when the executor will put this function in its own context, it will skip the register and stack copying logic and jump to the label 1, and start executing the body of the function. That did not worked.

Accurate measure of stack

When a function is called, the compiler sometimes writes some instructions in the function prologue that saves the value of the first 6 args on the stack(This is done by callee and is different from saving the args > 6 to the stack, which is done by caller). The problem is we cannot know at compile time how many arguments are being passed to a function, so we don't know how much stack after the rbp_callee should be saved (keep in mind that the stack is growing downwards, so its more like how much space before rbp_callee should we save). So we cannot just resume the execution of that function from there because we have not yet saved the data for first 6 args on the stack which is necessary.

To get around it, what I've done is I have made a wrapper around the original function that saves the registers and stack data and then returns using the ret instruction. But I have placed a label (1:) just after the ret instruction and stored its address int the rip. So for the initial caller it looks like the async function is complete and it goes on to do the next thing. But because we have saved the state, we now know that what data will be needed in the cpu at the time of calling, to call the original function i.e. the register values and the data stored on stack for args after the 6th arg. When a function is defined it has to be wrapped inside this async macro(details at the end).

          
#define async(ret_type, fn, args, body)\
    ret_type fn args{\
        body\
    }\
    __attribute__((noinline)) ret_type __AFX_PREFIX_##fn args {\
        // saving cpu context and returning to the caller
        save_cpu_context\

        // scheduler starts the execution from here
        // calling the original function
        asm volatile(\
            "callq  %0\n\t"\
            ::"r"(fn)\
        );\

        // after the original function completes, deleting its stack
        // and jumping to executor thread
        afx_delete_context();\
        afx_yield();\
    }

When the scheduler schedules the async function to execute, the rip at this point is pointing to the 1: label and at this place instead of executing the code in the body of the function, we calls the original function. At this point, the state of the cpu is the same as when the the original caller called the async function. Now when we call the original function again, its like we are doing the compiler's job of saving the args in registers and stack and then executing the callq instruction. The image below describe what is being saved on the custom stack when the async function starts. The return address and base pointer are that of the wrapper function and not of the original caller function. And Callee is the original function which has been called again.

Syntax is ugly, but this is the best I could do.

          
//declaration
async_dec(return_type, function_name(arg_type1, arg_type2, ...))

//definition
async(
    return_type, function_name, (arg1, arg2, ...), {
        //body;
    }
)

//call
function_name(arg1, arg2, ...); // for normal execution
afx(function_name(arg1, arg2, ...)); // for async execution
          
          
//Example
async_dec(void, print_sum(int, int))

async(
    void, print_sum, (int x, int y), {
        print("Sum is %d\n", x + y);
    }
)

print_sum(10, 20);
afx(print_sum(10, 20)):

This implementation will not work at all in anything that will make the thread go to sleep which is most things i.e. an I/O call, sleep timer etc (and that's why this is actually useless). Because when a signal is received by a thread, it does care how much time is left in the timer or how much data is left to receive/send before it should wake up, instead it wakes up immediately and start executing the next line of code. So if a thread is blocked on an I/O call, and the monitor sends SIGURG to it, that I/O call is sort of cancelled. Maybe I'll add support for that in future.

Source code and references: https://github.com/vanshjangir/afx