Working Boyer-Moore search

Discussion:

(too old to reply)

Robert Prins

2020-10-31 20:01:07 UTC

Hi all,

There's one on the masm32 site, I found a Pascal version @
http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html
but neither of them seems to work correctly, the masm32 version is unable to
find one particular string, the Pascal version only finds 9 occurrences of one
particular (other) string.

Does anyone have the mixed source/assembler output of a working 32-bit version
written in C?

Robert

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Terje Mathisen

2020-11-01 14:35:40 UTC

Permalink

Post by Robert Prins
Hi all,
http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html
but neither of them seems to work correctly, the masm32 version is
unable to find one particular string, the Pascal version only finds 9
occurrences of one particular (other) string.
Does anyone have the mixed source/assembler output of a working 32-bit
version written in C?
Robert

Here's some random code I wrote in 1998, its main calim to fame is that
it should allow you to search in a case-independent way inside unicode
strings, while still using very little setup/&table memory.

It works by generating two copies of the search string, forcing one to
UPPERCASE_ONLY and the other to lowercase_only, then when verifying a
match is is OK if at least one of the forms fits at any given position.

This means that it works OK even if the function convert to/from
upper/lower case is complicated & slow.

Terje

#include <stddef.h> // wchar_t definition
#include <windows.h> // To get the monocasing functions!

#define _memuprw(buf, len) CharUpperBuffW(buf, len)
#define _memlwrw(buf, len) CharLowerBuffW(buf, len)
#define _hashw(h) (h & 255) // _very_ simplistic hash from Unicode to
[0..255]

#define _memupra(buf, len) CharUpperBuffA((char *) buf, len)
#define _memlwra(buf, len) CharLowerBuffA((char *) buf, len)

typedef unsigned char achar_t;

class wboyer_moore {
private:
wchar_t *lPattern;
wchar_t *uPattern;
wchar_t *lPatEnd;
wchar_t *uPatEnd;
unsigned plen;
bool ignorecase;
unsigned char skiptable[256];

bool init(wchar_t *pattern, unsigned len, bool icase);
public:
wboyer_moore(wchar_t *pattern, unsigned len, bool icase);
~wboyer_moore(void);
wchar_t *search(wchar_t *buf, unsigned buflen);
};

bool wboyer_moore::init(wchar_t *pattern, unsigned len, bool icase)
{
plen = len;

ignorecase = icase;
lPattern = uPattern = NULL;
if (len == 0) return true;

lPattern = new wchar_t [len];
if (!lPattern) return false;

memcpy(lPattern, pattern, len * sizeof(wchar_t));
uPattern = lPattern;
if (icase) {
uPattern = new wchar_t [len];
if (!uPattern) return false;

memcpy(lPattern, pattern, len * sizeof(wchar_t));
_memlwrw(lPattern, len);
_memuprw(uPattern, len);
}
if (len > 1) {
unsigned l, skip, p;

lPatEnd = lPattern + len - 1;
uPatEnd = uPattern + len - 1;

l = min(len, 255);
memset(skiptable, l, sizeof(skiptable));

for (skip = l - 1, p = plen - len; skip--; p++) {
skiptable[_hashw(lPattern[p])] = skiptable[_hashw(uPattern[p])] = skip;
}
}
return true;
}

wboyer_moore::wboyer_moore(wchar_t *pattern, unsigned len, bool icase)
{
init(pattern, len, icase);
}

wboyer_moore::~wboyer_moore(void)
{
if (lPattern) delete lPattern;
if (ignorecase && uPattern) delete uPattern;
}

wchar_t *wboyer_moore::search(wchar_t *buf, unsigned buflen)
{
if (buflen < plen) return NULL; // Buffer too shart, cannot match

wchar_t *bufend = buf + buflen;

if (plen <= 1) {
if (plen == 0) return buf; // An empty search pattern will always match

wchar_t l = lPattern[0];
if (ignorecase) {
wchar_t u = uPattern[0];

do {
wchar_t w;

if (((w = *buf) == l) || (w == u)) return buf;
} while (++buf < bufend);
return NULL;
}
// case-sensitive search for a single char, could use _wmemchr()!
do {
if (*buf == l) return buf;
} while (++buf < bufend);
return NULL;
}

// The search pattern is at least two characters long, so use the skip
table:

wchar_t *b = buf + plen - 1; // Starting position for end of pattern
do { // Repeat until the end of the search buffer
unsigned skip;
do { // repeat until we find a possible matching endpoint
skip = skiptable[_hashw(*b)];
b += skip; // Guaranteed safe skip length
if (b >= bufend) return NULL;
} while (skip);

// We have a possible match on the endpoint, check the full pattern:

wchar_t *pl = lPatEnd, *pu = uPatEnd, *pb = b, w;
for (unsigned l = plen; ( (w = *pb) == *pl) || (w == *pu); pb--, pl--,
pu--) {
if (--l == 0) return pb;
}
} while (++b < bufend); // Increment end pointer and try again
return NULL;
}

class aboyer_moore {
private:
achar_t *lPattern;
achar_t *uPattern;
achar_t *lPatEnd;
achar_t *uPatEnd;
unsigned plen;
bool ignorecase;
unsigned char skiptable[256];

bool init(achar_t *pattern, unsigned len, bool icase);
public:
aboyer_moore(achar_t *pattern, unsigned len, bool icase);
~aboyer_moore(void);
achar_t *search(achar_t *buf, unsigned buflen);
};

bool aboyer_moore::init(achar_t *pattern, unsigned len, bool icase)
{
plen = len;

ignorecase = icase;
lPattern = uPattern = NULL;
if (len == 0) return true;

lPattern = new achar_t [len];
if (!lPattern) return false;

memcpy(lPattern, pattern, len * sizeof(achar_t));
uPattern = lPattern;
if (icase) {
uPattern = new achar_t [len];
if (!uPattern) return false;

memcpy(lPattern, pattern, len * sizeof(achar_t));
_memlwra(lPattern, len);
_memupra(uPattern, len);
}
if (len > 1) {
unsigned l, skip, p;

lPatEnd = lPattern + len - 2;
uPatEnd = uPattern + len - 2;

l = min(len, 255);
memset(skiptable, l, sizeof(skiptable));

for (skip = l - 1, p = plen - len; skip--; p++) {
skiptable[lPattern[p]] = skiptable[uPattern[p]] = skip;
}
}
return true;
}

aboyer_moore::aboyer_moore(achar_t *pattern, unsigned len, bool icase)
{
init(pattern, len, icase);
}

aboyer_moore::~aboyer_moore(void)
{
if (lPattern) delete lPattern;
if (ignorecase && uPattern) delete uPattern;
}

achar_t *aboyer_moore::search(achar_t *buf, unsigned buflen)
{
if (buflen < plen) return NULL; // Buffer too shart, cannot match

achar_t *bufend = buf + buflen;

if (plen <= 1) {
if (plen == 0) return buf; // An empty search pattern will allways match

achar_t l = lPattern[0];
if (ignorecase) {
achar_t u = uPattern[0];

do {
achar_t a;

if (((a = *buf) == l) || (a == u)) return buf;
} while (++buf < bufend);
return NULL;
}
// case-sensitive search for a single char, could use _memchr()!
do {
if (*buf == l) return buf;
} while (++buf < bufend);
return NULL;
}

// The search pattern is at least two characters long, so use the skip
table:

achar_t *b = buf + plen - 1; // Starting position for end of pattern
do { // Repeat until the end of the search buffer
unsigned skip;
do { // repeat until we find a matching endpoint
skip = skiptable[*b];
b += skip; // Guaranteed safe skip length
if (b >= bufend) return NULL;
} while (skip);

// We have a match on the endpoint, check the rest of the pattern:

achar_t *pl = lPatEnd, *pu = uPatEnd, *pb = b - 1, a;
for (unsigned l = plen; ( (a = *pb) == *pl) || (a == *pu); pb--, pl--,
pu--) {
if (--l == 0) return pb;
}
} while (++b < bufend); // Increment end pointer and try again
return NULL;
}

int main(int, char **)
{
return 0;
}

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Robert Prins

2020-11-02 01:08:14 UTC

Permalink

Post by Robert Prins
Hi all,
http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html but
neither of them seems to work correctly, the masm32 version is unable to find
one particular string, the Pascal version only finds 9 occurrences of one
particular (other) string.
Does anyone have the mixed source/assembler output of a working 32-bit version
written in C?
Robert

Hi Terje,

<very big snip>

That's way over my head.

I found a version in the "strutils.pp" unit in the FreePascal source archive,
and after a bit of fiddling, it compiles with Virtual Pascal, and unlike the two
versions mentioned above it works.

As for the code?

Setting up the bad-character table in all three is the same, the good suffix
generation code is different, and I've not compared the tables generated,
spending (a bit) more time on doing something about the generated code, which
is, as usual with VP, less than optimal, and although my code is better, it's
probably far from optimal.

As for the why of rolling my own BM, when "sed" can do what needs to be done,
adding RTF tags to files, way faster and easier (and without the silly 255
restriction on string-lengths of VP), I just like my programs to be free of
external dependencies. ;)

Robert

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Christian Hanné

2020-11-07 08:56:33 UTC

Permalink

Better in asm, that would be multiple times faster.

Robert Prins

2020-11-08 00:16:37 UTC

Permalink

Post by Christian HannÃ©
Better in asm, that would be multiple times faster.

I've converted the FreePascal version into assembler, and although it's likely a
(fair) bit slower than the version on masm32, it works. At some stage I might go
back to it and see if I can squeeze a few more nanoseconds out of it, but when
it comes back with 360 (all occurrences of "Oostende" in a 1.7Mb) file almost at
the moment I release the Enter key, it's probably fast enough, and I expect that
the RTF-ification, replacing the string to be replaced, the current year, with
"{\cf1\highlight2 2020}" in a 157Kb file will take rather a bit more time, I
have to think of a smart way of not moving that amount of data (or probably half
of it) 92 times. Then again, moving a total of 15Mb is probably peanuts...

Robert

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Bernhard Schornak

2020-11-08 12:06:19 UTC

Permalink

(...), I have to think of a smart way of not moving that amount of data
(or probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...

If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.

Greetings from Augsburg

Bernhard Schornak

Robert Prins

2020-11-08 18:50:18 UTC

Permalink

Post by Bernhard Schornak

(...), I have to think of a smart way of not moving that amount of data (or
probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...

If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.

Yes, that's the best way, but I might settle for a compromise and scan each line
before it's added to the output buffer, and go for a part copy, add replacement,
copy more, add another replacement, followed by a final copy. The added
replacement just consists of 22 characters, so a single YMM VMOVDQU.

As for the data, there are currently 92 occurrences of 2020, and 86 of them are
the only one on a line (in a files with 767 lines). Next year, 2021, will add
2+8 lines, 2022 2+8+50 and the years after that 2+8 again until 2029 or so.

The number of replacements (at the end of the calendar year) is right now maxed
out at 94, but as the data is spread over a multitude of tables sorted in
various orders, there's no easier way of adding the highlighting, other than of
course using a program like "sed".

Robert

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html

Robert Prins

2020-11-10 21:40:56 UTC

Permalink

Post by Robert Prins

Post by Bernhard Schornak

(...), I have to think of a smart way of not moving that amount of data (or
probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...

If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.

Yes, that's the best way, but I might settle for a compromise and scan each line
before it's added to the output buffer, and go for a part copy, add replacement,
copy more, add another replacement, followed by a final copy. The added
replacement just consists of 22 characters, so a single YMM VMOVDQU.
As for the data, there are currently 92 occurrences of 2020, and 86 of them are
the only one on a line (in a files with 767 lines). Next year, 2021, will add
2+8 lines, 2022 2+8+50 and the years after that 2+8 again until 2029 or so.
The number of replacements (at the end of the calendar year) is right now maxed
out at 94, but as the data is spread over a multitude of tables sorted in
various orders, there's no easier way of adding the highlighting, other than of
course using a program like "sed".

In the end I decided not to use BM to find the "202x" strings, and here's why:

The first two tables both contain a single occurrence, both on columns 3..6, so
I just write out the replacement with RTF highlighting string (including the
additional columns 1..2, "| ", and then use a pointer to virtually move the
start of the line to column 7 (with a reduced by 6 bytes) length, and write that
one out.

The next five pages contain a table in (up to) four columns, and the "202x" can
only be in specific columns, so I check those columns and if there's a "202x" I
write out the replacement string followed by the rest of the column, if not I
just write out the whole column, and given that it's only 51 characters wide,
it's just four "vmovdqu" instructions in both cases.

The next four pages each contain 2x6 tables (Jan-Jun/Jul-Dec) and the code is
nearly the same, just taking into account that the columns are a bit narrower,
and while I'm writing this, I realize that edx is not used. Using it would allow
me to merge routines:

Note that ebx contain the address line to be changed, esi the "202x"

@s0:
add ecx, 5

@s1:
cmp esi, [ebx + ecx - 4]
jne @s2

vmovdqu [eax], ymm7

vmovdqu ymm0, [ebx + ecx]
vmovdqu ymm1, [ebx + ecx + 32]
vmovdqu [eax + 24], ymm0
vmovdqu [eax + 24 + 32], ymm1
add eax, 69
jmp @s3

@s2:
vmovdqu ymm0, [ebx + ecx - 6]
vmovdqu ymm1, [ebx + ecx + 26]
vmovdqu [eax], ymm0
vmovdqu [eax + 32], ymm1
add eax, 51

@s3:
add ecx, 46
cmp byte ptr [ebx], cl
ret

@t0:
add ecx, 5

@t1:
cmp esi, [ebx + ecx - 4]
jne @t2

vmovdqu [eax], ymm7

vmovdqu ymm0, [ebx + ecx]
vmovdqu ymm1, [ebx + ecx + 32]
vmovdqu [eax + 24], ymm0
vmovdqu [eax + 24 + 32], ymm1
add eax, 60
jmp @t3

@t2:
vmovdqu ymm0, [ebx + ecx - 6]
vmovdqu ymm1, [ebx + ecx + 26]
vmovdqu [eax], ymm0
vmovdqu [eax + 32], ymm1
add eax, 42

@t3:
add ecx, 37
cmp byte ptr [ebx], cl
ret

by changing the last to "add"s in either by a "lea"s.

I'm also pretty sure that my code will be (quite) a bit faster than BM searching
each line for "202x".

Robert

--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html