Frederick Virchanza Gotham
2023-04-16 00:22:19 UTC
In the C++ programming language, there are lambda functions. Some of them have captures, and some of them don't have captures. The lambda functions without captures can implicitly convert to a normal function pointer. The lamda functions with captures cannot implicitly convert to a normal function pointer.
I started a thread yesterday on comp.lang.c++ in which I wanted to write a thunk in x86_64 assembler that would be able to invoke any lambda. I will copy-past that post here and then continue writing:
Since C++11, there has been an implicit conversion from a lambda to a
function pointer so long as the lambda has no captures. If the lambda
has captures, the implicit conversion is disabled. However it's easy to
get a function pointer from a lambda-with-captures if we use global
variables or the heap, something like:
std::function<void(void)> f; // global object
void Func(void)
{
f(); // invokes the global object
}
void Some_Library_Func(void (*const pf)(void))
{
pf();
}
int main(int argc, char **argv)
{
auto mylambda = [argc](void) -> void
{
cout << "Hello " << argc << "!" << endl;
};
f = mylambda;
Some_Library_Func(Func);
}
It is possible though to procure a normal function pointer from a
lambda-with-captures without making use of global variables or the heap
-- it can all be kept on the stack.
To invoke a capture lambda, we need two pieces of data:
Datum A: The address of the lambda object
Datum B: The address of the 'operator()' member function
Datum A is a pointer into data memory.
Datum B is a pointer into code memory.
The technique described in this post will only work on CPU's where the
program counter can be set to an address in data memory, and therefore
we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
to correction here but I think this technique will work on every
implementation of C++ in existence today, even on microcontrollers such
as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.
We will define a simple POD struct to hold these two pieces of data:
struct LambdaInfo {
void *data, *code;
};
Let's write a function that invokes a capture lambda, passing the 'this'
pointer as the first argument to the member function:
void InvokeLambda(LambdaInfo const *const p)
{
void (*pf)(void*) = (void (*)(void*))p->code;
return pf(p->data);
}
And now let's check what this got compiled to on an x86_64 computer:
mov rdx,QWORD PTR [rdi]
mov rax,QWORD PTR [rdi+0x8]
mov rdi,rdx
jmp rax
What we've got here is four instructions. To see a little more clearly
what's going on here, I'm going to replace the function arguments with
numerical constants:
void InvokeLambda(void)
{
void (*pf)(void*) = (void (*)(void*))0x1122334455667788;
return pf( (void*)0x99aabbccddeeffee );
}
gets compiled to:
movabs rdi,0x99aabbccddeeffee
movabs rax,0x1122334455667788
jmp rax
What we've got here now is three simple instructions. Here's the
assembler alongside the machine code:
movabs rdi,0x99aabbccddeeffee ---- 48 bf ee ff ee dd cc bb aa 99
movabs rax,0x1122334455667788 ---- 48 b8 88 77 66 55 44 33 22 11
jmp rax ---- ff e0
What we have here is 22 bytes worth of CPU instructions, which we can
put into a byte array as follows:
char unsigned instructions[22u] = {
0x48, 0xBF,
0xEE, 0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99,
0x48, 0xB8,
0x88, 0x77, 0x66, 0x55, 0x44, 0x33, 0x22, 0x11,
0xFF, 0xE0,
};
This 22-byte array can be our thunk. I've written C++ code to manage
this thunk up on GodBolt, and it works fine:
https://godbolt.org/z/r84hEsG1G
This works fine for a function with no return value and no parameters. So next I wanted to work on a solution that would work for _any_ lambda function even it had 27 parameters and a return value greater than 87 bytes. So I followed up my post to comp.lang.c++ with the following:
The first thunk that I wrote accommodated a function that returned void
and took no arguments. The machine code of the thunk was 22 bytes.
Now I want to write a thunk that will work for _any_ lambda,
irrespective of how many parameters it has or how big the return type
is. On an x86_64 computer, here's how a function call works:
The return address is pushed onto the stack
Int/Ptr arguments 1-6 go in registers RDI, RSI, RDX, RCX, R8, R9
Int/Ptr arguments 7 and above go onto the stack (right to left)
Floating point arguments 1-8 go in registers XMM0 - XMM7
Floating point arguments 9 and above go onto the stack
The callee stores its return value in RAX
If the return value is > 8 bytes, the next 8 bytes go in RDX
If the return value is > 16 bytes, extra bytes go onto the stack
The callee jumps back to the return address
So let's say we have a lambda function whose signature is something like
the following:
bool Func(int a, long b, float c, int *p);
If this lambda function has captures, then there will need to be a
hidden 'this' pointer:
bool Func(void *this, int a, long b, float c, int *p);
The first function signature without the 'this' pointer would be called
as follows:
Store a in RDI
Store b in RSI
Store c in XMM0
Store d in RDX
However the second function signature would be called as follows:
Store the 'this' pointer in RDI
Store a in RSI
Store b in RDX
Store c in XMM0
Store d in RCX
So in effect, when we introduce a hidden 'this' pointer as the first
parameter, all of the other registers get moved down one, so inside the
our thunk, we need to write assembler to do the following:
push R9 onto stack
move R8 -> R9
move RCX -> R8
move RDX -> RCX
move RSI -> RDX
move RDI -> RSI
store the 'this' pointer in RDI
invoke the lambda
pop r9 off the stack
jump back to the return address
So the Assembler for the thunk should be something like:
push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 // the 'this' pointer
movabs rax, 0x99aabbccddeeff22 // address of lambda code
call rax
add rsp,8 // pop r9 off stack
ret
This new and improved thunk is 44 bytes. This will work fine for a
function that has any number of parameters, so long as the return
value isn't more than 16 bytes. See it working up on GodBolt:
https://godbolt.org/z/xTfEYo5jE
So next what we need to do is try get it to work with a function that
returns a value greater than 16 bytes. The thunk we've written so far
is problematic in this situation because:
Step 1) We push r9 onto the stack
Step 2) We invoke the lambda function
Step 3) We pop r9 off the stack
Step 3 will cause a problem because instead of popping r9 off the stack,
it will pop the extra bytes of the return value off the stack. In order
to get around this, here's what we need to do:
Step 1) Before invoking the lambda, save the stack pointer
Step 2) After invoking the lambda, check if stack pointer has changed
Step 3) If the stack pointer is changed, do a 'memmove' to move
everything on the stack upwards by 8 bytes
Our 'memmove' operation would look like:
void move_stack(char *oldval, char *const newval)
{
char *p = oldval,
*q = oldval - 8u;
while ( oldval-- > newval )
{
*p-- = *q--;
}
}
If we compile that to assembler and use r15,rsp instead of rdi,rsi, here
is what it looks like:
start:
cmp rsp,r15
jae end
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start
end:
So now I'm going to try to append this onto the end of the thunk we've
already written, here goes:
push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 (the 'this' pointer)
movabs rax, 0x99aabbccddeeff22 (address of lambda code)
mov r15, rsp
call rax
start_of_loop:
cmp rsp,r15
jae out_of_loop
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start_of_loop
out_of_loop:
add rsp,8 ; move stack pointer back 8 bytes
ret
This latest thunk comes to 67 bytes. Here is my attempt to implement
it on GodBolt but it's not working yet:
https://godbolt.org/z/nzWvE7q34
I haven't got it working yet but I'm very very close. I realise we'll
have to figure out how to restore the values of r15 and r13 but sure
we'll worry about that after we've got this much working.
Can anyone see what's wrong with my latest thunk? There's something
not right about where I compare r15 and rsp and then try to move the
entire stack down 8 bytes.
Anyone?
I started a thread yesterday on comp.lang.c++ in which I wanted to write a thunk in x86_64 assembler that would be able to invoke any lambda. I will copy-past that post here and then continue writing:
Since C++11, there has been an implicit conversion from a lambda to a
function pointer so long as the lambda has no captures. If the lambda
has captures, the implicit conversion is disabled. However it's easy to
get a function pointer from a lambda-with-captures if we use global
variables or the heap, something like:
std::function<void(void)> f; // global object
void Func(void)
{
f(); // invokes the global object
}
void Some_Library_Func(void (*const pf)(void))
{
pf();
}
int main(int argc, char **argv)
{
auto mylambda = [argc](void) -> void
{
cout << "Hello " << argc << "!" << endl;
};
f = mylambda;
Some_Library_Func(Func);
}
It is possible though to procure a normal function pointer from a
lambda-with-captures without making use of global variables or the heap
-- it can all be kept on the stack.
To invoke a capture lambda, we need two pieces of data:
Datum A: The address of the lambda object
Datum B: The address of the 'operator()' member function
Datum A is a pointer into data memory.
Datum B is a pointer into code memory.
The technique described in this post will only work on CPU's where the
program counter can be set to an address in data memory, and therefore
we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
to correction here but I think this technique will work on every
implementation of C++ in existence today, even on microcontrollers such
as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.
We will define a simple POD struct to hold these two pieces of data:
struct LambdaInfo {
void *data, *code;
};
Let's write a function that invokes a capture lambda, passing the 'this'
pointer as the first argument to the member function:
void InvokeLambda(LambdaInfo const *const p)
{
void (*pf)(void*) = (void (*)(void*))p->code;
return pf(p->data);
}
And now let's check what this got compiled to on an x86_64 computer:
mov rdx,QWORD PTR [rdi]
mov rax,QWORD PTR [rdi+0x8]
mov rdi,rdx
jmp rax
What we've got here is four instructions. To see a little more clearly
what's going on here, I'm going to replace the function arguments with
numerical constants:
void InvokeLambda(void)
{
void (*pf)(void*) = (void (*)(void*))0x1122334455667788;
return pf( (void*)0x99aabbccddeeffee );
}
gets compiled to:
movabs rdi,0x99aabbccddeeffee
movabs rax,0x1122334455667788
jmp rax
What we've got here now is three simple instructions. Here's the
assembler alongside the machine code:
movabs rdi,0x99aabbccddeeffee ---- 48 bf ee ff ee dd cc bb aa 99
movabs rax,0x1122334455667788 ---- 48 b8 88 77 66 55 44 33 22 11
jmp rax ---- ff e0
What we have here is 22 bytes worth of CPU instructions, which we can
put into a byte array as follows:
char unsigned instructions[22u] = {
0x48, 0xBF,
0xEE, 0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99,
0x48, 0xB8,
0x88, 0x77, 0x66, 0x55, 0x44, 0x33, 0x22, 0x11,
0xFF, 0xE0,
};
This 22-byte array can be our thunk. I've written C++ code to manage
this thunk up on GodBolt, and it works fine:
https://godbolt.org/z/r84hEsG1G
This works fine for a function with no return value and no parameters. So next I wanted to work on a solution that would work for _any_ lambda function even it had 27 parameters and a return value greater than 87 bytes. So I followed up my post to comp.lang.c++ with the following:
The first thunk that I wrote accommodated a function that returned void
and took no arguments. The machine code of the thunk was 22 bytes.
Now I want to write a thunk that will work for _any_ lambda,
irrespective of how many parameters it has or how big the return type
is. On an x86_64 computer, here's how a function call works:
The return address is pushed onto the stack
Int/Ptr arguments 1-6 go in registers RDI, RSI, RDX, RCX, R8, R9
Int/Ptr arguments 7 and above go onto the stack (right to left)
Floating point arguments 1-8 go in registers XMM0 - XMM7
Floating point arguments 9 and above go onto the stack
The callee stores its return value in RAX
If the return value is > 8 bytes, the next 8 bytes go in RDX
If the return value is > 16 bytes, extra bytes go onto the stack
The callee jumps back to the return address
So let's say we have a lambda function whose signature is something like
the following:
bool Func(int a, long b, float c, int *p);
If this lambda function has captures, then there will need to be a
hidden 'this' pointer:
bool Func(void *this, int a, long b, float c, int *p);
The first function signature without the 'this' pointer would be called
as follows:
Store a in RDI
Store b in RSI
Store c in XMM0
Store d in RDX
However the second function signature would be called as follows:
Store the 'this' pointer in RDI
Store a in RSI
Store b in RDX
Store c in XMM0
Store d in RCX
So in effect, when we introduce a hidden 'this' pointer as the first
parameter, all of the other registers get moved down one, so inside the
our thunk, we need to write assembler to do the following:
push R9 onto stack
move R8 -> R9
move RCX -> R8
move RDX -> RCX
move RSI -> RDX
move RDI -> RSI
store the 'this' pointer in RDI
invoke the lambda
pop r9 off the stack
jump back to the return address
So the Assembler for the thunk should be something like:
push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 // the 'this' pointer
movabs rax, 0x99aabbccddeeff22 // address of lambda code
call rax
add rsp,8 // pop r9 off stack
ret
This new and improved thunk is 44 bytes. This will work fine for a
function that has any number of parameters, so long as the return
value isn't more than 16 bytes. See it working up on GodBolt:
https://godbolt.org/z/xTfEYo5jE
So next what we need to do is try get it to work with a function that
returns a value greater than 16 bytes. The thunk we've written so far
is problematic in this situation because:
Step 1) We push r9 onto the stack
Step 2) We invoke the lambda function
Step 3) We pop r9 off the stack
Step 3 will cause a problem because instead of popping r9 off the stack,
it will pop the extra bytes of the return value off the stack. In order
to get around this, here's what we need to do:
Step 1) Before invoking the lambda, save the stack pointer
Step 2) After invoking the lambda, check if stack pointer has changed
Step 3) If the stack pointer is changed, do a 'memmove' to move
everything on the stack upwards by 8 bytes
Our 'memmove' operation would look like:
void move_stack(char *oldval, char *const newval)
{
char *p = oldval,
*q = oldval - 8u;
while ( oldval-- > newval )
{
*p-- = *q--;
}
}
If we compile that to assembler and use r15,rsp instead of rdi,rsi, here
is what it looks like:
start:
cmp rsp,r15
jae end
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start
end:
So now I'm going to try to append this onto the end of the thunk we've
already written, here goes:
push r9
mov r9,r8
mov r8,rcx
mov rcx,rdx
mov rdx,rsi
mov rsi,rdi
movabs rdi, 0x1122334455667788 (the 'this' pointer)
movabs rax, 0x99aabbccddeeff22 (address of lambda code)
mov r15, rsp
call rax
start_of_loop:
cmp rsp,r15
jae out_of_loop
movzx r13d,BYTE PTR [r15-0x8]
sub r15,0x1
mov BYTE PTR [r15+0x1],r13b
jmp start_of_loop
out_of_loop:
add rsp,8 ; move stack pointer back 8 bytes
ret
This latest thunk comes to 67 bytes. Here is my attempt to implement
it on GodBolt but it's not working yet:
https://godbolt.org/z/nzWvE7q34
I haven't got it working yet but I'm very very close. I realise we'll
have to figure out how to restore the values of r15 and r13 but sure
we'll worry about that after we've got this much working.
Can anyone see what's wrong with my latest thunk? There's something
not right about where I compare r15 and rsp and then try to move the
entire stack down 8 bytes.
Anyone?