Parsing Progress

I am still working on a Perl script to parse the letterbox Superbase file. It has been tricky. Perl lets you read in a file using a specific delimiter as the record separator (by setting $/), but I couldn’t use this normally very useful feature because the record separator (0x80) sometimes appears in the data as part of a record field.
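For reference, this is a minimal sketch of how the record separator feature behaves on well-behaved data. The sample string and the in-memory filehandle are made up for illustration; the real file is the one where this breaks down, because a stray 0x80 inside a field would cut a record in half.

```perl
use strict;
use warnings;

# Hypothetical sample: three "records" separated by 0x80,
# with no stray 0x80 bytes inside the data.
my $sample = "alpha\x80beta\x80gamma";

# Read from an in-memory filehandle so the sketch is self-contained.
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
binmode($fh);

my @records;
{
    local $/ = "\x80";          # input record separator: the single byte 0x80
    while (my $rec = <$fh>) {
        chomp $rec;             # chomp strips the current value of $/
        push @records, $rec;
    }
}
close($fh);

print join(", ", @records), "\n";   # alpha, beta, gamma
```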

my $data = do { local $/; <> };    # slurp the whole file, not line by line
my @lboxes = split(/(?=\x80[^\0]{2})/, $data);

The regular expression for the split looks for the record separator followed by two characters that aren’t nulls. This works because the data field containing the false separators is a 2-byte field (a 16-bit unsigned int) with a null after it, so a 0x80 in either of those two bytes always has a null within the next two characters and is not counted as a separator. The only place where a record separator is valid is before the first field, a null-terminated string, so this hack will cause a problem if that string is less than 2 characters long. I will try to fix that later, but I cannot rely on the 16-bit unsigned int having non-character values for both of its bytes, i.e. the characters “st” can also be the number 29811 if treated as a little-endian 16-bit unsigned integer.
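Here is a small sketch of both points, under stated assumptions. The record layout is invented for illustration (separator, null-terminated name, 16-bit value, null) and is not the real Superbase layout. It shows the “st” arithmetic, a false separator being skipped, and the failure when the name is shorter than 2 characters.

```perl
use strict;
use warnings;

# The bytes "st" read as a little-endian 16-bit unsigned integer ('v' template).
my $as_number = unpack('v', 'st');
print "$as_number\n";    # 29811

# Hypothetical layout, for illustration only:
#   0x80, null-terminated name, 16-bit value, null
# The second record's 16-bit value (0x8001) contains a 0x80 byte
# that must not be treated as a separator.
my $data = "\x80Alpha\0" . pack('v', 5)      . "\0"
         . "\x80Beta\0"  . pack('v', 0x8001) . "\0";

my @records = split(/(?=\x80[^\0]{2})/, $data);
print scalar(@records), "\n";    # 2

# The weakness: a one-character name puts a null within two bytes of the
# real separator, so the second record is not split off.
my $short = "\x80Alpha\0" . pack('v', 5) . "\0"
          . "\x80B\0"     . pack('v', 5) . "\0";

my @missed = split(/(?=\x80[^\0]{2})/, $short);
print scalar(@missed), "\n";     # 1
```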

I will probably try to solve the problem by looking at the record size when I do the splitting. I’ve noticed that all of the records in the file are padded to a byte length that is a multiple of 128. By checking whether a delimiter falls on one of these boundaries, I can avoid false record separators.
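A rough sketch of that boundary check might look like the following. It assumes the first record starts at offset 0 (no file header) and that the separator is the first byte of each record; neither is confirmed yet. The helper name split_on_boundaries and the sample data are made up. A record longer than 128 bytes could still hold a data byte of 0x80 exactly on a boundary, so this reduces false separators rather than ruling them out.

```perl
use strict;
use warnings;

my $BLOCK = 128;    # records are padded to a multiple of this many bytes

# Return the records in $data, accepting a 0x80 byte as a separator only
# when its offset is a multiple of $BLOCK. (Hypothetical helper.)
sub split_on_boundaries {
    my ($data) = @_;

    # Collect the offsets of every boundary that holds a separator byte.
    my @starts;
    for (my $off = 0; $off < length($data); $off += $BLOCK) {
        push @starts, $off if substr($data, $off, 1) eq "\x80";
    }

    # Each record runs from its start to the next start (or end of data).
    my @records;
    for my $i (0 .. $#starts) {
        my $end = $i < $#starts ? $starts[$i + 1] : length($data);
        push @records, substr($data, $starts[$i], $end - $starts[$i]);
    }
    return @records;
}

# Made-up sample: a 128-byte record with a stray 0x80 in its 16-bit field,
# followed by a 256-byte record with a one-character name.
# pack('aN', ...) pads the string with nulls to N bytes.
my $data = pack('a128', "\x80Alpha\0" . pack('v', 0x8001) . "\0")
         . pack('a256', "\x80B\0"     . pack('v', 5)      . "\0");

my @records = split_on_boundaries($data);
print scalar(@records), "\n";                          # 2
print join(", ", map { length } @records), "\n";       # 128, 256
```

Unlike the lookahead regex, this does not care how long the first string field is, which is the case the current hack gets wrong.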

At the moment I can extract the Name, Catalogue Number and Grid Reference, but the clue, type and deleted status are in a big chunk of data filled with what seem to be random null characters.

When I’m finished, then comes the task of building it into something usable so I can track which letterboxes I’ve found already. My estimate is that I have around 200.

