Discussion:
Write x86_64 thunk to invoke any lambda function
(too old to reply)
Frederick Virchanza Gotham
2023-04-16 00:22:19 UTC
Permalink
In the C++ programming language, there are lambda functions. Some of them have captures, and some of them don't have captures. The lambda functions without captures can implicitly convert to a normal function pointer. The lamda functions with captures cannot implicitly convert to a normal function pointer.

I started a thread yesterday on comp.lang.c++ in which I wanted to write a thunk in x86_64 assembler that would be able to invoke any lambda. I will copy-past that post here and then continue writing:

Since C++11, there has been an implicit conversion from a lambda to a
function pointer so long as the lambda has no captures. If the lambda
has captures, the implicit conversion is disabled. However it's easy to
get a function pointer from a lambda-with-captures if we use global
variables or the heap, something like:

std::function<void(void)> f; // global object

void Func(void)
{
f(); // invokes the global object
}

void Some_Library_Func(void (*const pf)(void))
{
pf();
}

int main(int argc, char **argv)
{
auto mylambda = [argc](void) -> void
{
cout << "Hello " << argc << "!" << endl;
};

f = mylambda;

Some_Library_Func(Func);
}

It is possible though to procure a normal function pointer from a
lambda-with-captures without making use of global variables or the heap
-- it can all be kept on the stack.

To invoke a capture lambda, we need two pieces of data:
Datum A: The address of the lambda object
Datum B: The address of the 'operator()' member function

Datum A is a pointer into data memory.
Datum B is a pointer into code memory.

The technique described in this post will only work on CPU's where the
program counter can be set to an address in data memory, and therefore
we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
to correction here but I think this technique will work on every
implementation of C++ in existence today, even on microcontrollers such
as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.

We will define a simple POD struct to hold these two pieces of data:

struct LambdaInfo {
void *data, *code;
};

Let's write a function that invokes a capture lambda, passing the 'this'
pointer as the first argument to the member function:

void InvokeLambda(LambdaInfo const *const p)
{
void (*pf)(void*) = (void (*)(void*))p->code;

return pf(p->data);
}

And now let's check what this got compiled to on an x86_64 computer:

mov rdx,QWORD PTR [rdi]
mov rax,QWORD PTR [rdi+0x8]
mov rdi,rdx
jmp rax

What we've got here is four instructions. To see a little more clearly
what's going on here, I'm going to replace the function arguments with
numerical constants:

void InvokeLambda(void)
{
void (*pf)(void*) = (void (*)(void*))0x1122334455667788;

return pf( (void*)0x99aabbccddeeffee );
}

gets compiled to:

movabs rdi,0x99aabbccddeeffee
movabs rax,0x1122334455667788
jmp rax

What we've got here now is three simple instructions. Here's the
assembler alongside the machine code:

movabs rdi,0x99aabbccddeeffee ---- 48 bf ee ff ee dd cc bb aa 99
movabs rax,0x1122334455667788 ---- 48 b8 88 77 66 55 44 33 22 11
jmp rax ---- ff e0

What we have here is 22 bytes worth of CPU instructions, which we can
put into a byte array as follows:

char unsigned instructions[22u] = {
0x48, 0xBF,
0xEE, 0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99,
0x48, 0xB8,
0x88, 0x77, 0x66, 0x55, 0x44, 0x33, 0x22, 0x11,
0xFF, 0xE0,
};

This 22-byte array can be our thunk. I've written C++ code to manage
this thunk up on GodBolt, and it works fine:

https://godbolt.org/z/r84hEsG1G

This works fine for a function with no return value and no parameters. So next I wanted to work on a solution that would work for _any_ lambda function even it had 27 parameters and a return value greater than 87 bytes. So I followed up my post to comp.lang.c++ with the following:

The first thunk that I wrote accommodated a function that returned void
and took no arguments. The machine code of the thunk was 22 bytes.

Now I want to write a thunk that will work for _any_ lambda,
irrespective of how many parameters it has or how big the return type
is. On an x86_64 computer, here's how a function call works:

The return address is pushed onto the stack
Int/Ptr arguments 1-6 go in registers RDI, RSI, RDX, RCX, R8, R9
Int/Ptr arguments 7 and above go onto the stack (right to left)
Floating point arguments 1-8 go in registers XMM0 - XMM7
Floating point arguments 9 and above go onto the stack
The callee stores its return value in RAX
If the return value is > 8 bytes, the next 8 bytes go in RDX
If the return value is > 16 bytes, extra bytes go onto the stack
The callee jumps back to the return address

So let's say we have a lambda function whose signature is something like
the following:

bool Func(int a, long b, float c, int *p);

If this lambda function has captures, then there will need to be a
hidden 'this' pointer:

bool Func(void *this, int a, long b, float c, int *p);

The first function signature without the 'this' pointer would be called
as follows:
Store a in RDI
Store b in RSI
Store c in XMM0
Store d in RDX

However the second function signature would be called as follows:
Store the 'this' pointer in RDI
Store a in RSI
Store b in RDX
Store c in XMM0
Store d in RCX

So in effect, when we introduce a hidden 'this' pointer as the first
parameter, all of the other registers get moved down one, so inside the
our thunk, we need to write assembler to do the following:

push R9 onto stack
move R8 -> R9
move RCX -> R8
move RDX -> RCX
move RSI -> RDX
move RDI -> RSI
store the 'this' pointer in RDI
invoke the lambda
pop r9 off the stack
jump back to the return address

So the Assembler for the thunk should be something like:

push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 // the 'this' pointer
movabs rax, 0x99aabbccddeeff22 // address of lambda code
call rax
add rsp,8 // pop r9 off stack
ret

This new and improved thunk is 44 bytes. This will work fine for a
function that has any number of parameters, so long as the return
value isn't more than 16 bytes. See it working up on GodBolt:

https://godbolt.org/z/xTfEYo5jE

So next what we need to do is try get it to work with a function that
returns a value greater than 16 bytes. The thunk we've written so far
is problematic in this situation because:
Step 1) We push r9 onto the stack
Step 2) We invoke the lambda function
Step 3) We pop r9 off the stack

Step 3 will cause a problem because instead of popping r9 off the stack,
it will pop the extra bytes of the return value off the stack. In order
to get around this, here's what we need to do:
Step 1) Before invoking the lambda, save the stack pointer
Step 2) After invoking the lambda, check if stack pointer has changed
Step 3) If the stack pointer is changed, do a 'memmove' to move
everything on the stack upwards by 8 bytes

Our 'memmove' operation would look like:

void move_stack(char *oldval, char *const newval)
{
char *p = oldval,
*q = oldval - 8u;

while ( oldval-- > newval )
{
*p-- = *q--;
}
}

If we compile that to assembler and use r15,rsp instead of rdi,rsi, here
is what it looks like:

start:
cmp rsp,r15
jae end
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start
end:

So now I'm going to try to append this onto the end of the thunk we've
already written, here goes:

push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 (the 'this' pointer)
movabs rax, 0x99aabbccddeeff22 (address of lambda code)
mov r15, rsp
call rax
start_of_loop:
cmp rsp,r15
jae out_of_loop
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start_of_loop
out_of_loop:
add rsp,8 ; move stack pointer back 8 bytes
ret

This latest thunk comes to 67 bytes. Here is my attempt to implement
it on GodBolt but it's not working yet:

https://godbolt.org/z/nzWvE7q34

I haven't got it working yet but I'm very very close. I realise we'll
have to figure out how to restore the values of r15 and r13 but sure
we'll worry about that after we've got this much working.

Can anyone see what's wrong with my latest thunk? There's something
not right about where I compare r15 and rsp and then try to move the
entire stack down 8 bytes.

Anyone?
Frederick Virchanza Gotham
2023-04-16 13:22:14 UTC
Permalink
Post by Frederick Virchanza Gotham
The callee stores its return value in RAX
If the return value is > 8 bytes, the next 8 bytes go in RDX
If the return value is > 16 bytes, extra bytes go onto the stack
It looks like it isn't actually this simple. First let's see a function
that returns a 128-Bit value:

struct ReturnTypeB {
long long unsigned a,b;
};

ReturnTypeB FuncB(void)
{
return { 0x1111111111111111, 0x2222222222222222 };
}

This gets compiled to:

mov rax,0x1111111111111111
mov rdx,0x2222222222222222
ret

This is exactly what I expected. But let's now make it return 24 bytes:

struct ReturnTypeA {
long long unsigned a,b,c;
};

ReturnTypeA FuncA(void)
{
return { 0x1111111111111111, 0x2222222222222222,
0x3333333333333333 };
}

This........ rather unexpectedly...... gets compiled to:

mov rax,rdi
mov rdx,0x1111111111111111
mov QWORD [rdi],rdx
mov rcx,0x2222222222222222
mov QWORD [rdi+0x8],rcx
mov rsi,0x3333333333333333
mov QWORD [rdi+0x10],rsi
ret

At first glance, it looks like this function is receiving a pointer in
RDI, and that it's storing the three 64-Bit numbers sequentially at the
address specified by RDI. To be totally sure about this, I took this
machine code and ran it through a decompiler, which gave me:

uint64_t *FuncA(uint64_t *const p)
{
p[0] = 0x1111111111111111u;
p[1] = 0x2222222222222222u;
p[2] = 0x3333333333333333u;
return p;
}

What I've learned here I think can be summed up as follows:
(a) If the return value <= 8 bytes, it goes in RAX
(b) If the return value <= 16 bytes, the first 8 go in RAX and the
second 8 go in RDX
(c) If the return value >= 16 bytes, RAX and RDX are used differently.
There is a hidden first parameter in RDI to the function which acts
as a pointer indicating where the return value is to be stored. This
hidden pointer is returned from the function in RAX. RDX isn't used.

So now let's see what happens if I turn this into a member function,
because I don't know if RDI will be used for the 'this' pointer or for
the address to store the return value. Let's see, I wrote the
following:

void *volatile global;

struct ReturnTypeA {
long long unsigned a,b,c;
};

struct MyClass {
ReturnTypeA FuncA(void)
{
global = this;
return { 0x1111111111111111, 0x2222222222222222,
0x3333333333333333 };
}
};

And compiled it to the following assembler:

// Next two lines set up the stack frame
push rbp
mov rbp,rsp
// The next line sets return value = first argument
mov rax,rdi
// The next line 3 lines appear to get the 'this'
// pointer from RSI and then store it in 'global'
mov QWORD PTR [rbp-0x8],rsi
mov rcx,QWORD PTR [rbp-0x8]
mov QWORD PTR [rip+0x0],rcx # R_X86_64_PC32 global-0x4
// The next 2 lines store a 64-Bit number in rdi+0
mov rcx,0x1111111111111111
mov QWORD PTR [rdi],rcx
// The next 2 lines store a 64-Bit number in rdi+8
mov rcx,0x2222222222222222
mov QWORD PTR [rdi+0x8],rcx
// The next 2 lines store a 64-Bit number in rdi+16
mov rcx,0x3333333333333333
mov QWORD PTR [rdi+0x10],rcx
// The next 2 lines restore the frame pointer and return
pop rbp
ret

So I think I've solved the mystery here. If the return type is more than
16 bytes in size, the hidden 'this' pointer is put inside RSI rather
than RDI, because RDI is used for the address of where the return value
should be stored.

So let's go back and take another look at our original thunk code for
moving all the registers down:

push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
mov rdi, 0x1122334455667788 (the 'this' pointer)
mov rax, 0x99aabbccddeeff22 (address of lambda code)

Let's try to edit this code to put the 'this' pointer in RSI instead of
RDI, maybe something like:

push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
// REMOVE THIS LINE : mov rsi,rdi
mov rsi, 0x1122334455667788 (the 'this' pointer)
mov rax, 0x99aabbccddeeff22 (address of lambda code)

Ok this might be the correct thunk code, and here it is up on GodBolt:

https://godbolt.org/z/65YEsaT8o

It works. Ok so now there's only one thing left to do. We need to
have two separate thunk templates, Thunk Type A and Thunk Type B.
Thunk Type A = for any function whose return type <= 16 bytes
Thunk Type B = for any function whose return type > 16 bytes

At compile-time, we can try to make this distinction by using 'sizeof'
on the lambda's return type.... which is what I'm going to try do to
now, see up on GodBolt:

https://godbolt.org/z/MzYGxfz9Y

Take a look yourself.
Frederick Virchanza Gotham
2023-04-16 15:16:31 UTC
Permalink
Post by Frederick Virchanza Gotham
https://godbolt.org/z/MzYGxfz9Y
Take a look yourself.
https://godbolt.org/z/MzYGxfz9Y
Take a look yourself.
And finally, if the implementation I've described so far isn't available yet on your target architecture, here's a less-efficient implementation that will 'just work' on any system:

https://godbolt.org/z/sG1rTbWE6

And here it is copy-pasted:

#include <cassert> // assert
#include <cstddef> // size_t
#include <utility> // forward

template<typename LambdaType>
class LambdaThunk {
protected:

static thread_local LambdaType *p_lambda_object;

template <typename ReturnType, typename... Params>
static ReturnType Actual_Thunk(Params... args)
{
assert( nullptr != p_lambda_object );
return (*p_lambda_object)(std::forward<Params>(args)...);
}

template <typename ReturnType, typename... Params>
static ReturnType (*Get_Thunk_Address(ReturnType (LambdaType::*)(Params...) const))(Params...)
{
return Actual_Thunk<ReturnType,Params...>;
}

public:

LambdaThunk(LambdaType &obj)
{
p_lambda_object = &obj;
}

auto thunk(void) const volatile // yes this could be a static function
{
return Get_Thunk_Address(&LambdaType::operator());
}
};

template<typename LambdaType>
thread_local LambdaType *LambdaThunk<LambdaType>::p_lambda_object = nullptr;

int Some_Library_Func(int (*const pf)(char const*))
{
return pf("monkey");
}

#include <iostream>
using std::cout;
using std::endl;

int main(int argc, char **argv)
{
auto mylambda = [argc](char const *const p) -> int
{
cout << "Hello " << argc << " " << p << "!" << endl;
return 77;
};

int const z = Some_Library_Func( LambdaThunk(mylambda).thunk() );

cout << "z = " << z << endl;
}

I think the only place where this inefficient implementation could fall
down is if you have a lambda defined inside a recursive function... but
you'd just need to make sure that the thunk is invoked before the
function is re-entered (unless of course, upon re-entry, you account
for the thunk not having been invoked yet).
Anton Ertl
2023-04-16 17:04:14 UTC
Permalink
Post by Frederick Virchanza Gotham
The technique described in this post will only work on CPU's where the
program counter can be set to an address in data memory, and therefore
we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
to correction here but I think this technique will work on every
implementation of C++ in existence today, even on microcontrollers such
as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.
On MacOS on Apple Silicon an mmap() with RWX does not work. I read
some page where Apple describes the hoops they want you to jump though
to make JIT compilers work, but I forgot the details.

From some web page I got the impression that some BSD has the same
misfeature, but I don't have first-hand experience wrt that.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Thomas Koenig
2023-04-17 05:06:21 UTC
Permalink
Post by Frederick Virchanza Gotham
I started a thread yesterday on comp.lang.c++ in which I wanted to
write a thunk in x86_64 assembler that would be able to invoke any
Without going too much into the details (or even knowing much
about C++): This will be platform-dependent.

If you are using Linux or another Unix variant that uses the C++
Itanium ABI, you should have a look at it it and its platform-specific
supplement, specifically https://itanium-cxx-abi.github.io/cxx-abi/ and
https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf
Loading...