I went to the St. Louis PowerShell meetup last night, and someone had an interesting issue. They were dealing with a 4.5 gigabyte LDIF file. The processing logic itself had already been written, but it turned out that PowerShell wasn’t cranking through the file fast enough for the application. That kind of work runs single-threaded by default, so the simple approach here is to split the file up and process the pieces in parallel.
What is LDIF, anyway? It’s a file format used to store LDAP entries. Happily, it’s simple in that entries are just separated by a double newline. The lowest-common-denominator approach for a straightforward text file like this is to use something like sed or awk. I don’t know a whole lot about actually using either of these, but I knew that awk was really meant for splitting up “fields” contained within “records”.
First things first: grab some sample data.
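If you don’t have a real directory export handy, a couple of toy entries are enough to play with. The attribute names below are standard LDAP ones, but the values are entirely made up- the important part is that each entry is a block of attribute: value lines, and entries are separated by a blank line. Save something like this as Example.ldif:

dn: uid=jdoe,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
uid: jdoe
cn: Jane Doe
mail: jdoe@example.com

dn: uid=bsmith,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
uid: bsmith
cn: Bob Smith
mail: bsmith@example.com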
My first concept was to use awk to grab some part of the first field and use that as a key. So for example, I might grab the fifth character from the first line- say it’s “q”. I’d then append that whole entry to a new file, say file-q.ldif. But you always want to iterate and start from what you can get working, so I started with an example I grabbed- probably from some Stack Overflow answer, but I didn’t keep track:
awk -v RS= '/^#/ {next} {print > ("whatever-" NR ".txt")}' Example.ldif
The concept here was to also just skip comments, which is why the /^#/ {next} is in there- it matches records that begin with a pound sign and skips them. Everything else gets dumped into its own file- whatever-1.txt, whatever-2.txt, and so on.
It turns out that I’m on a Mac, where awk is the BSD variant, and it wigs out about too many files being open. No matter; we’ll close the files as we go.
awk -v RS= '{print > ("whatever-" NR ".txt"); close("whatever-" NR ".txt")}' Example.ldif
Really though, GNU awk is where it’s at, so I’ll be calling gawk from here. More generally, note that sed, awk, lex, and yacc all have serious cross-platform compatibility issues. It’s easiest to standardize on gsed (GNU sed), gawk, flex, and bison for all of these commands. I’ve historically shied away from these tools because the syntax brings back bad memories of Perl, although in fact Perl grew out of these commands and not the other way around. I don’t know that I’ll use these tools every day, but it is certainly a handy approach when the “big guns” seem to be letting me down.
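If you’re on a Mac too, the easiest route I know of is Homebrew- assuming you have it installed, something like this should do it (the gnu-sed formula installs GNU sed as gsed so it doesn’t shadow the system sed):

brew install gawk gnu-sed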
Back to the command above. It has a bug that shows up as soon as more than one record lands in the same file: it omits the double newline between entries as it writes the output, so the LDIF entries get glommed into one big entry. To fix this, I need to fiddle with the output record separator variable, called ORS. This version grabs a character from the first line of each input record (the ninth, via substr) to produce the output filename:
gawk 'BEGIN{RS="\n\n";FS="\n";ORS="\n\n"}{print >> ("file-" substr($1,9,1) ".txt")}' Example.ldif
The above approach has some merit if I wanted to group entries by some content in the data. Here though, I only want to split the large file into smaller files of roughly equal sizes. The way to do it is to rotate systematically through the output files. Thankfully, this is EASY. The variable NR is the “record number”, so I can just take the modulus against however many files I want to split into. On this data sample, this approach results in much more even output file sizes than the substring hack above, and I don’t run the risk of getting weird garbage strings in the output filenames. I’m splitting into 6 files here, but salt to taste.
gawk 'BEGIN{RS="\n\n";FS="\n";ORS="\n\n"}{print >> ("file-" NR % 6 ".txt")}' Example.ldif
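To sanity-check that claim about even sizes, wc shows the byte count per output file and grep -c counts the entries in each one (every LDIF entry starts with a dn: line):

wc -c file-*.txt
grep -c '^dn:' file-*.txt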
There’s one final refinement- because the record separator RS and output record separator ORS are equal, we can write this as:
gawk 'BEGIN{ORS=RS="\n\n";FS="\n"}{print >> ("file-" NR % 6 ".txt")}' Example.ldif
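And that closes the loop on the original problem: with the file split into six pieces, the existing processing can run against all of them at once. Here’s a rough sketch with xargs, where process-chunk.sh is just a stand-in for whatever actually does the per-file work:

# process-chunk.sh is a placeholder for your real per-file processing
printf '%s\n' file-*.txt | xargs -n 1 -P 6 ./process-chunk.sh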
Useful reference: