Discussion: Problem with program (Pirko's RPSORT)
prino
2005-08-25 17:17:58 UTC
Hi all,

I've been using Robert Pirko's RPSORT (from the widely available
RPSRT102.ZIP) for many years, without ever encountering any problems.
However, a few days ago a colleague, who uses the program under XP to
sort rather large files, mentioned that the program is no longer
working for some files: in this case he's trying to sort two 170 MB files
into one output file, and the program just returns to the command prompt
without any messages. I've told him I would have a look at it, but
where do I start? The archive contains the full source, so I can
reassemble it and add some progress indicators to it, but I'm not sure
where. A friend has suggested tracing it using Bochs, but given that
the two problem files each contain in excess of 400,000 records, I
might not have enough disk space to store Bochs' log.

Any suggestions?

FWIW, he's also asked me to see if it can be sped up, as it is noticeably
slower than the M$ SORT included with XP, although that's not too
surprising given that this is an AD 1991 16-bit DOS program. I guess
replacing 'LODSB's with the equivalent 'MOV AL, [SI]/INC SI' in the sorting
code will make some difference, but there's also the matter of case.
Two of his sortfields are mixed-case strings of 20 characters that
have to be compared case-insensitively, and as RPSORT doesn't cache
sortfields, the translation has to be done over and over again... The
current program uses XLAT, which isn't too hard to change, but will it
make any real difference? (The program runs on a 2 GHz Athlon 64)
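
To make the caching question concrete, here's the kind of difference I mean,
sketched in Perl rather than assembler (the 20-character key at offset 0 and
the toy records are just assumptions for the example, nothing to do with
RPSORT's real layout). The first sort folds case on every comparison; the
second folds it once per record, which is what caching the translated
sortfields would amount to:

#!/usr/bin/perl
use strict;
use warnings;

# Toy records: a 20-character mixed-case key, then the rest of the record.
my @records = ("Smith, John         |rest of record",
               "ADAMS, jane         |rest of record");

# Folding case inside the comparator: uc() runs on every comparison.
my @slow = sort { uc(substr($a, 0, 20)) cmp uc(substr($b, 0, 20)) } @records;

# Caching the folded key (Schwartzian transform): uc() runs once per record.
my @fast = map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ uc(substr($_, 0, 20)), $_ ] } @records;

print "$_\n" for @fast;

Over 400,000+ records the folding work happens millions of times in the first
form and only once per record in the second, and I'd expect much the same to
hold for the XLAT loop.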

Finally, if you know of another sort program that has the same options
as RPSORT (i.e. it must be able to sort fixed- and variable-length records,
must handle FPU datatypes, must be able to eliminate duplicate records,
must handle at least 4 sort fields and must be able to sort different
fields ascending and descending), but is more of this day and age, feel
free to suggest that instead. ;)

Thanks,

Robert
--
Robert AH Prins
***@onetel.com
Terje Mathisen
2005-08-25 19:04:37 UTC
Post by prino
Hi all,
I've been using Robert Pirko's RPSORT (from the widely available
RPSRT102.ZIP) for many years, without ever encountering any problems.
However, a few days ago a colleague, who uses the program under XP to
sort rather large files, mentioned that the program is no longer
working for some files: in this case he's trying to sort two 170 MB files
into one output file, and the program just returns to the command prompt
without any messages. I've told him I would have a look at it, but
where do I start? The archive contains the full source, so I can
reassemble it and add some progress indicators to it, but I'm not sure
where. A friend has suggested tracing it using Bochs, but given that
the two problem files each contain in excess of 400,000 records, I
might not have enough disk space to store Bochs' log.
Any suggestions?
Yes.

170+170 MB = 340 MB, i.e. eminently doable with RAM only.

Perl with a user-defined sort function can probably do this in a 10-line
program. :-)

Something like this...

eval("sub cmp{$ARGV[1]}"); # Load the sort function
eval("$unpack = $ARGV[2]"); # Load the format spec
&loadfile($ARGV[3]);
&loadfile($ARGV[4]);
foreach(sort cmp @data) {print "$_\n";}
sub loadfile{...)

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
SevagK
2005-08-26 06:42:50 UTC
=Robert===================================
.. must be able to eliminate duplicate records,
..
==========================================

This implies that a record ID has to be kept in memory somewhere to check
for duplicates; this ID could be a variable-length string, perhaps. With
400_000 records, you'll probably run out of memory pretty fast.

This could also be the source of the speed problem. As the number of records
grows, the look-up hash table (if one is used) would grow as well (probably
resulting in large chunks of mem->mem copies), plus there is the concern of
swap usage.
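
To illustrate the mechanism I have in mind (a pure Perl sketch; the
20-character key at offset 0 is made up for the example and has nothing to do
with RPSORT's internals), every distinct key has to stay in the look-up table
for the whole run:

#!/usr/bin/perl
use strict;
use warnings;

# Look-up table dedup: every distinct key stays in memory until the end.
my %seen;
while (my $rec = <STDIN>) {
    my $id = substr($rec, 0, 20);   # assumed key field
    print $rec unless $seen{$id}++;
}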

-kain
Robert Redelmeier
2005-08-26 12:12:09 UTC
Post by SevagK
=Robert===================================
.. must be able to eliminate duplicate records,
..
==========================================
This implies that a record ID has to be kept in memory
somewhere to check for duplicates,
Not really. That is one advantage of fully sorted records.
Duplicates can be spotted through the comparison tests.
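
A sketch of what I mean, in Perl for brevity (whole-record comparison and
reading from stdin are just assumptions for the example):

#!/usr/bin/perl
use strict;
use warnings;

# Once the records are fully sorted, eliminating duplicates needs only
# the previous record, not a table of every key seen so far.
my @sorted = sort <STDIN>;
my $prev;
for my $rec (@sorted) {
    next if defined $prev and $rec eq $prev;   # the comparison spots the dup
    print $rec;
    $prev = $rec;
}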
Post by SevagK
this ID could be a variable-length string, perhaps.
If the records are only partially sorted (on one field
only) or unsorted, then it may be easier to calc a digest
on all relevant fields, sort this list, then compare the
actual fields of apparent dups (to rule out collisions).
Depending on likely data, it might be better to do the
packing (dup elimination and close-up) in a separate
pass afterwards.
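
Again only as a Perl sketch (the field offsets are invented for the example,
and Digest::MD5 merely stands in for whichever digest you prefer):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Assumed key: 20 characters at offset 0 plus 8 characters at offset 40.
sub key_of { substr($_[0], 0, 20) . substr($_[0], 40, 8) }

my @records = <STDIN>;

# One fixed-size digest per record; sorting on it brings candidate dups together.
my @keyed = sort { $a->[0] cmp $b->[0] or key_of($a->[1]) cmp key_of($b->[1]) }
            map  { [ md5(key_of($_)), $_ ] } @records;

# Output is in digest order here; the packing proper would be a separate pass.
my $prev;
for my $pair (@keyed) {
    my $k = key_of($pair->[1]);
    next if defined $prev and $k eq $prev;   # compare the real fields, not just digests
    print $pair->[1];
    $prev = $k;
}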
Post by SevagK
With 400_000 records, you'll probably run out of memory
pretty fast.
This is a definite risk. Still, on a modern machine with
512 MB, the records could still average over 1000 bytes each
(400_000 x 1000 bytes = 400 MB) before hitting swap.

-- Robert
