More and more, lawyers are going to have to deal with digital data. I have three text dumps to deal with at the moment: two from Android devices, and one from a Windows phone. I started with five figures’ worth of messages and wound up with four figures of responsive messages to be provided to opposing counsel in discovery. If you are going to do this for your practice, you may want to consider something like an “attorneys’ eyes only” protective order. I’ve done IT for enough years to see how much sensitive information winds up on business computers, so there will almost inevitably be a lot of private information on personal devices. I recommend Ball in Your Court for deep discussion of the issues.
I had planned on sitting down for at least a day and writing crude parsers to pull the data, which was XML in two schemas. A friend has been trying to coax me into playing with Splunk, and pointed out this would be an excellent exercise. I installed Splunk Enterprise on my laptop (not entirely uneventfully), and I was off to the races with his help.
Splunk proved remarkably powerful, but also limited in some surprising ways. It was capable of directly ingesting the dump produced by SMS Backup and Restore, though how to do so wasn’t immediately obvious during the import process. It seems to have considerable difficulty “dividing records” at the input stage.
The Windows phone backup was produced by contacts+message backup, which Splunk proved utterly incapable of importing. Happily, someone wrote a conversion script, though I had to modify it to add a newline after each entry. Once I did that, Splunk was happy to ingest it just as it had the Android dump.
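For anyone who wants to see what the “newline after each entry” fix amounts to, here is a rough Python sketch. It is not the conversion script mentioned above. The function name, the element names, and the sample data are made up for illustration, and it assumes each message is a direct child of the root element with its data stored in attributes.

```python
import xml.etree.ElementTree as ET


def one_record_per_line(xml_text: str) -> str:
    """Return each top-level child element serialized on its own line."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root:
        # Drop any whitespace that trailed the element in the source file,
        # so it is not carried into the serialized output.
        record.tail = None
        lines.append(ET.tostring(record, encoding="unicode"))
    return "\n".join(lines) + "\n"


# Hypothetical two-message dump crammed onto a single line.
sample = (
    '<smses>'
    '<sms address="555-0100" date="1500000000000" body="hi"/>'
    '<sms address="555-0101" date="1500000060000" body="line one&#10;line two"/>'
    '</smses>'
)

print(one_record_per_line(sample), end="")
```

Newlines inside attribute values are written back out as character references, so a multi-line message body should still end up on a single line. Records that keep their text in child elements rather than attributes would need more care.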
One thing: the dates are stored in epoch milliseconds, but it was really easy to create a column with a useful date:
eval date2=strftime(date/1000, "%F %T")
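If you need the same conversion outside Splunk, say while cleaning up the exported CSV, a minimal Python sketch follows. The function name is hypothetical, and it renders times in UTC by default, which may differ from the time zone Splunk displays, so treat the time zone as an assumption to check against your own data.

```python
from datetime import datetime, timezone


def format_epoch_ms(epoch_ms, tz=timezone.utc):
    """Format epoch milliseconds as 'YYYY-MM-DD HH:MM:SS' in the given time zone."""
    # Integer-divide by 1000 to convert milliseconds to whole seconds.
    seconds = int(epoch_ms) // 1000
    # "%Y-%m-%d %H:%M:%S" is the portable spelling of "%F %T".
    return datetime.fromtimestamp(seconds, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


print(format_epoch_ms("1500000000000"))  # prints 2017-07-14 02:40:00
```

Passing a different tzinfo as the second argument gives local times instead of UTC.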
Then all I had to do was query the three data sets for responsive documents, export as CSV, and clean it up in a spreadsheet. All in all, MUCH easier than mucking around with XML parsing by hand, but a few observations:
- Epoch times are common enough that it would be nice if Splunk understood them natively.
- I tried and had no luck importing VCF contact data. It didn’t really matter, since the text dump already had some schema information, but this is a very common format and it seems like it should be supported.
- Splunk really should have an XML type selectable in its import and be able to split records intelligently based on that rather than relying on linefeeds.
- Splunk won’t run out of the box on macOS 10.13 (“High Sierra” for the marketing types), or at least not on APFS volumes. You must set OPTIMISTIC_ABOUT_FILE_LOCKING=1 in splunk-launch.conf. This is a fairly minor failure on their part, but the major failure is that it’s essentially undocumented on their site. They also prompt you to open a support ticket, but if you’re on a free trial... no support! Seems to me that’s a recipe for someone to have a poor initial experience and look elsewhere.