Optimizing the speed of the 8086 instructions "rep movsb" and "rep stosb"

wolfgang kern

2022-07-22 13:37:33 UTC

Post by Phu Tran Hoang
;Replace "rep movsb" by the following code
test di,1 ; align to a word boundary
jz $+4
movsb
dec cx
shr cx,1
rep movsw
jnc $+3
movsb
;Replace "rep stosb" by the following code
mov ah, al
test di,1 ; align to a word boundary
jz $+4
stosb
dec cx
shr cx,1
rep stosw
jnc $+3
stosb

[JNC +1? STOSB/STOSW are single-byte opcodes, "AA/AB"]
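
The quoted "rep movsb" replacement can be modelled in C to check the head/bulk/tail arithmetic. This is only a sketch, not the poster's code: the name copy_word_aligned is made up for illustration, and the zero-length guard is an addition. Without such a guard, the quoted sequence appears to decrement CX from 0 to FFFFh when DI is odd and CX is zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes as an optional byte head,
   a run of 16-bit words, and an optional byte tail. */
void copy_word_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n == 0)
        return;                        /* added guard, not in the quoted code */

    if ((uintptr_t)dst & 1) {          /* test di,1 */
        *dst++ = *src++;               /* movsb */
        n--;                           /* dec cx */
    }

    size_t words = n >> 1;             /* shr cx,1 */
    int odd = (int)(n & 1);            /* bit shifted out, i.e. the carry flag */

    for (size_t i = 0; i < words; i++) {   /* rep movsw */
        uint16_t w;
        memcpy(&w, src, sizeof w);     /* memcpy avoids unaligned access in C */
        memcpy(dst, &w, sizeof w);
        src += 2;
        dst += 2;
    }

    if (odd)                           /* jnc skips this when carry is clear */
        *dst = *src;                   /* trailing movsb */
}
```

The carry from the shift survives the word loop because REP MOVSW does not modify flags, which is why the assembly can test it afterwards.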

Yes, pre- and post-aligning string operations is
the main speed gain in my OS. The same approach works with
32-bit reduction/extension for any odd start and size.

But I also align source or destination to quad (4-byte) bounds.

TEST esi,3
JZ isAligned
... ;adjust for an aligned loop start here
isAligned:
SHR ecx,1 ;no action at all if ecx=0
JNC +1
LODSB
SHR ecx,1
JNC +2 ; +2 in use32, because LODSW
LODSW ; needs an operand-size prefix (66h) there
REP LODSD ;falls through if ECX=Zero
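
The two SHR/JNC pairs above split the byte count into at most one byte, at most one word, and a dword count for the REP. A small C sketch of that arithmetic follows. The struct and function names are hypothetical, and it models only the count, not the alignment adjustment elided by "...".

```c
#include <stddef.h>

/* Hypothetical record of how a byte count splits into transfer units. */
struct split {
    unsigned bytes;   /* 0 or 1: the single LODSB */
    unsigned words;   /* 0 or 1: the single LODSW */
    size_t dwords;    /* count left in ECX for REP LODSD */
};

struct split split_count(size_t n)
{
    struct split s;

    s.bytes = (unsigned)(n & 1);   /* first SHR ecx,1: carry selects LODSB */
    n >>= 1;
    s.words = (unsigned)(n & 1);   /* second SHR ecx,1: carry selects LODSW */
    n >>= 1;
    s.dwords = n;                  /* zero here means REP LODSD does nothing */

    return s;
}
```

For example, a count of 7 splits into one byte, one word and one dword, and a count of 0 takes none of the three paths.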

With similar dummy reads up front and at the end, it
can partially read disk sectors at any offset and size.
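
One possible reading of the partial sector read, sketched in C: work out which whole sectors cover the requested range and how many bytes to discard before and after it. Everything here is an assumption for illustration, including the 512-byte sector size, the names, and the idea that the "dummy reads" are these discarded leading and trailing bytes.

```c
#include <stddef.h>

#define SECTOR_SIZE 512u   /* assumption: 512-byte sectors */

/* Hypothetical description of a partial read in whole-sector terms. */
struct span {
    size_t first_sector;   /* first sector to read */
    size_t sector_count;   /* number of whole sectors to read */
    size_t skip_front;     /* bytes to discard before the wanted data */
    size_t skip_back;      /* bytes to discard after the wanted data */
};

struct span sector_span(size_t offset, size_t size)
{
    struct span s = {0, 0, 0, 0};

    s.first_sector = offset / SECTOR_SIZE;
    if (size == 0)
        return s;                          /* nothing to read */

    s.skip_front = offset % SECTOR_SIZE;   /* dummy reads up front */

    size_t end = offset + size;            /* one past the last wanted byte */
    size_t last = (end + SECTOR_SIZE - 1) / SECTOR_SIZE;  /* one past last sector */

    s.sector_count = last - s.first_sector;
    s.skip_back = last * SECTOR_SIZE - end;   /* dummy reads at the end */

    return s;
}
```

Reading 1000 bytes at offset 100 would then touch sectors 0 to 2, discarding 100 bytes in front and 436 at the end.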
__
wolfgang