Discussion:
Speeding up the 8086 instructions "rep movsb" and "rep stosb"
Phu Tran Hoang
2022-07-22 05:19:55 UTC
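;Replace "rep movsb" with the following code
;(assumes cx >= 1: with cx = 0 and an odd di, "dec cx" wraps to 0FFFFh)
test di,1 ; is the destination word-aligned?
jz $+4 ; yes: skip movsb and dec cx ($ = this jz: 2 bytes + 1 + 1)
movsb ; no: copy one byte so that di becomes even
dec cx

shr cx,1 ; cx = word count, CF = leftover byte
rep movsw ; string instructions leave CF unchanged
jnc $+3 ; no leftover byte: skip movsb ($ = this jnc: 2 bytes + 1)
movsb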



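;Replace "rep stosb" with the following code
;(assumes cx >= 1, since "dec cx" would wrap a zero count)
mov ah, al ; duplicate the fill byte so that stosw stores it twice
test di,1 ; is the destination word-aligned?
jz $+4 ; yes: skip stosb and dec cx
stosb ; no: store one byte so that di becomes even
dec cx

shr cx,1 ; cx = word count, CF = leftover byte
rep stosw ; string instructions leave CF unchanged
jnc $+3 ; no leftover byte: skip stosb
stosb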
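
The head/bulk/tail split used by both replacements can be modelled in C, which makes the edge cases easier to check. This is only a sketch, not the poster's code: the function names copy_word_aligned and fill_word_aligned are hypothetical, and the n != 0 guard on the head byte is an addition that the assembly above does not have.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the "rep movsb" replacement:
   one head byte if dst is odd, then 16-bit words, then one tail byte. */
static void copy_word_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* Head: "test di,1 / movsb / dec cx". The n != 0 check is an added
       guard; without it a zero count would wrap around when decremented. */
    if (n != 0 && ((uintptr_t)dst & 1u)) {
        *dst++ = *src++;
        n--;
    }
    /* Bulk: "shr cx,1 / rep movsw" copies n / 2 words. */
    for (size_t words = n >> 1; words != 0; words--) {
        uint16_t w;
        memcpy(&w, src, sizeof w);  /* src may still be odd */
        memcpy(dst, &w, sizeof w);
        src += 2;
        dst += 2;
    }
    /* Tail: the bit shifted out by "shr cx,1" (the carry flag). */
    if (n & 1u)
        *dst = *src;
}

/* Hypothetical model of the "rep stosb" replacement. */
static void fill_word_aligned(uint8_t *dst, uint8_t value, size_t n)
{
    /* "mov ah, al": the same byte in both halves of the word. */
    uint16_t pattern = (uint16_t)(value | (value << 8));

    if (n != 0 && ((uintptr_t)dst & 1u)) {  /* head byte, guarded */
        *dst++ = value;
        n--;
    }
    for (size_t words = n >> 1; words != 0; words--) {  /* "rep stosw" */
        memcpy(dst, &pattern, sizeof pattern);
        dst += 2;
    }
    if (n & 1u)  /* tail byte */
        *dst = value;
}
```

Calling copy_word_aligned(buf + 1, src, 9) and copy_word_aligned(buf, src, 9) exercises both the odd and the even destination path, whichever way buf itself happens to be aligned.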
wolfgang kern
2022-07-22 13:37:33 UTC
Post by Phu Tran Hoang
;Replace "rep movsb" with the following code
test di,1 ; align to word
jz $+4
movsb
dec cx
shr cx,1
rep movsw
jnc $+3
movsb
;Replace "rep stosb" with the following code
mov ah, al
test di,1 ; align to word
jz $+4
stosb
dec cx
shr cx,1
rep stosw
jnc $+3
stosb
[jnc +1? stosb/stosw are single-byte opcodes, AA/AB.]

Yes, aligning before a string operation and handling the
remainder after it is the main speed gain in my OS. The same
scheme works with 32-bit operands for any odd start address and size.

I also align the source or destination to 4-byte (dword) boundaries:

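TEST esi,3 ;is the source dword-aligned?
JZ isAligned
... ;adjust for an aligned loop start here
isAligned:
SHR ecx,1 ;CF = bit 0 of the count; no action at all if ecx=0
JNC +1 ;even count: skip the one-byte LODSB
LODSB
SHR ecx,1 ;CF = bit 1 of the count
JNC +2 ;+2 in use32 code, because LODSW needs
LODSW ;an operand-size prefix (66h) there
REP LODSD ;falls through if ECX is zero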

With similar dummy reads at the start and at the end, it
can read part of a disk sector at any offset and size.
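
The two SHR steps in the snippet above peel the two low bits off the byte count, one per shift, and leave the dword count in ECX. A small C model of just that arithmetic, as a sketch: struct split and split_count are hypothetical names, and the elided alignment step before isAligned is deliberately not modelled.

```c
#include <stdint.h>

/* Hypothetical record of how a byte count is split up. */
struct split {
    uint32_t bytes;   /* 0 or 1: bit 0 of the count, carry of the first SHR  */
    uint32_t words;   /* 0 or 1: bit 1 of the count, carry of the second SHR */
    uint32_t dwords;  /* count / 4: what remains in ECX for REP LODSD        */
};

static struct split split_count(uint32_t count)
{
    struct split s;
    s.bytes = count & 1u;   /* first "SHR ecx,1": carry set -> one LODSB  */
    count >>= 1;
    s.words = count & 1u;   /* second "SHR ecx,1": carry set -> one LODSW */
    count >>= 1;
    s.dwords = count;       /* REP LODSD runs this many times, maybe zero */
    return s;
}
```

For example, a count of 11 (binary 1011) splits into one byte, one word and two dwords, and 1 + 2 + 8 = 11.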
__
wolfgang
