Learning Splunk by analyzing phone dumps

More and more, lawyers are going to have to deal with digital data. I have three text dumps to deal with at the moment: two from Android devices, and one from a Windows phone. I started with five figures’ worth of messages and wound up with four figures of responsive messages to be provided to opposing counsel in discovery. If you are going to do this for your practice, you may want to consider something like an “attorneys’ eyes only” protective order. I’ve done IT for enough years to have seen how much sensitive information winds up on business computers, so there will almost inevitably be a lot of private information on personal devices.  I recommend Ball in Your Court for deep discussion of the issues.

I had planned on sitting down for at least a day and writing crude parsers to pull the data, which was XML in two schemas. A friend has been trying to coax me into playing with Splunk, and pointed out this would be an excellent exercise.  I installed Splunk Enterprise on my laptop (not entirely uneventfully), and I was off to the races with his help.

Splunk proved remarkably powerful, but also limited in some surprising ways. It proved capable of directly ingesting the dump produced by SMS Backup and Restore, though how to do that wasn’t immediately obvious during the import process.  It seems to have considerable difficulty “dividing records” at the input stage.

The Windows phone backup was produced by contacts+message backup, which Splunk proved utterly incapable of importing.  Happily, someone wrote a conversion script, though I had to modify it to add a newline after each entry.  Once I did that, Splunk was happy to ingest it, just as it had the Android dump.
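The newline fix is easy to sketch. What follows is not the conversion script itself, just a minimal Python illustration. It assumes each message is a direct child of the XML root and carries its content in attributes (as an SMS Backup and Restore dump does); the function name one_record_per_line is made up for the example.

```python
import xml.etree.ElementTree as ET


def one_record_per_line(xml_text):
    """Re-serialize each child of the root element on its own line.

    Assumption: records are direct children of the root. Newlines inside
    attribute values are escaped on output, so they do not split a record.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for record in root:
        record.tail = None  # drop whitespace that followed the element
        lines.append(ET.tostring(record, encoding="unicode"))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = '<smses><sms address="555" body="hi"/><sms address="556" body="yo"/></smses>'
    print(one_record_per_line(sample), end="")
```

With one record per line, a line-oriented importer no longer has to guess where an entry ends.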

One thing: the dates are stored in epoch milliseconds, but it was really easy to create a column with a readable date:

eval date2=strftime(date/1000, "%F %T")
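Outside Splunk, the same conversion is nearly as short. A minimal Python sketch, assuming the timestamp arrives as an integer or numeric string of epoch milliseconds; format_epoch_ms is a hypothetical helper name, and it defaults to UTC, so the output may differ from Splunk’s if Splunk is configured for another time zone.

```python
from datetime import datetime, timezone


def format_epoch_ms(ms, tz=timezone.utc):
    """Format epoch milliseconds as 'YYYY-MM-DD HH:MM:SS' (what %F %T produces)."""
    seconds = int(ms) / 1000  # epoch milliseconds -> epoch seconds
    return datetime.fromtimestamp(seconds, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(format_epoch_ms("1400000000000"))  # 2014-05-13 16:53:20 (UTC)
```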

Then all I had to do was query the three data sets for responsive documents, export as CSV, and clean it up in a spreadsheet.  All in all, MUCH easier than mucking around with XML parsing by hand, but a few observations:

  • Epoch times are common enough that it would be nice if Splunk understood them out of the box.
  • I tried and had no luck importing VCF contact data.  It didn’t really matter, since the text dump already had some schema information, but this is a very common format and it seems like it should be supported.
  • Splunk really should have an XML type selectable in its import and be able to split records intelligently based on that rather than relying on linefeeds.
  • Splunk won’t run out of the box on macOS 10.13 (“High Sierra” for the marketing types), or at least not on APFS volumes.  You must set OPTIMISTIC_ABOUT_FILE_LOCKING=1 in splunk-launch.conf. This is a fairly minor failure on their part, but the major failure is that it’s essentially undocumented on their site. They also prompt you to open a support ticket, but if you’re on a free trial... no support! Seems to me that’s a recipe for someone to have a poor initial experience and look elsewhere.


I think it’s basically a trope that lawyers have some of their own language.  There’s a tendency toward double negatives (“I don’t disagree with you” and the like), but there are some ways in which the general language would benefit from upgrades.

My bad. → Mea culpa.
You’re a pain. → You’re quarrelsome.
This sucks. → This is not conducive to good trade.
This paperwork looks awful. → It’s inexpertly drawn.
Bad idea (coming from a judge). → Not favored.

Commentary on a healthcare proposal

I’ve long been critical of the Affordable Care Act as well as proposals from the GOP.  They simply do very little to address the fundamental reasons for high cost.  However, Karl Denninger has made a nice stab at something close to what I could agree with.  I’ve long held that part of any solution is basically a Uniform Commercial Code writ large, and Mr. Denninger’s plan misses a few important ideas.  A few suggestions:

  • Rather than merely posting prices on a website, require that they be published in some machine-readable format.  A microformat would probably do nicely.
  • Records should absolutely be the property of, and always accessible to, patients.  However, a series of flash sticks is not the right way.  Introduce the concept of a data broker, specify some data structures, and let Google, Microsoft, Amazon, and the insurance companies all figure out who can build the best experience for patients.  Patients can choose their portal.
  • For the love of Pete, standardize insurance cards, coding, and programs.  A simple scan of a barcode should be able to help patients and providers alike determine exactly what they have, whether it’s valid, and what it will help with.
  • Similarly, standardize the means and timing of insurance dispute, payment, and claims processing.
  • I’m not completely sold on the idea of lifestyle changes as a panacea.  It’s certainly one thing that should be looked at, but really that’s an issue of mental health, something that our current system basically refuses to deal with.

Security Basics: what should I protect?

In the first post of this series, I talked about what exactly a threat is.  It’s more interesting to delve in and look at what kinds of assets you actually have that you might want to protect.  Of course, this is really situation-specific, and there are two major types of protection: protection from theft and protection from destruction.

Information assets, such as customer lists, credit card numbers, and the like, aren’t “stolen” so much as they are “copied without authorization”.  Keep in mind that it’s easy to tell when your car has been stolen, but it may be very difficult to know if your social security number has been leaked.

There is a third consideration, which you might call integrity: that is, you want to make sure that something is genuine or hasn’t been modified.  This is important mostly for paperwork and records: it would probably be bad, for example, to have someone alter your car title without you knowing it.


Here are some example assets that you’d want to protect from destruction:

  • Family photos
  • Car, House
  • your person and health

And you’d want to protect these from theft/unauthorized copying:

  • Social Security number
  • Money, securities, etc.
  • Driver’s license: both the card itself and the identity it represents (to guard against identity theft)
  • Car

Note that some assets, such as the car, are listed in both categories above.  You hold insurance to protect your car from destruction (e.g., in a collision), and you use keys and a car alarm in order to make it harder to steal.

My stance on family photos may be somewhat controversial, but I think most people would much rather have their family photos copied over the internet than lose them completely.

Company/Business assets

The same approach as above applies, but keep in mind that there is a fair bit of regulation.  If you’re in healthcare or associated with healthcare, for example, HIPAA/HITECH may apply.  There are regulations for finance, insurance, law firms; the list goes on and on, so you should do some research and get familiar with your field.  There is also a whole set of industry rules for credit card handlers, called PCI DSS.

Generally, the government requirements revolve around “PII” (Personally Identifiable Information).  In healthcare, it’s called “PHI” (Protected Health Information).  These concepts tend to be high-level, and identifying exactly what you should do and how you should handle information can be complicated.


Before you can really get started thinking about security, identify the things you want to protect and why.  For example, I’ve seen businesses bring their computing services all in-house while not doing any audits on the custom software they’re developing.  They haven’t identified what it is they actually want to protect (presumably, customer data and their own control systems), or taken measures to protect those things.

Security Basics: What is a Threat?

The legal industry is rather seriously behind other industries on security, and our whole business is keeping secrets.  The bad news is that it’s easy to come to wrong conclusions in security.  The good news is that lawyers already have an ethics framework and most have some knowledge about risk assessment.  I want to take a look at how to avoid some of the bad logic.

We are all familiar with risk.  We choose to travel, exercise (or not), elect certain medical procedures, and so forth.  But I think lawyers might find it easiest to think in terms of the sorts of risks we help our clients evaluate, so let’s try a couple examples.

If I were to suggest to a personal injury lawyer that the realistic value of a case is $100,000, he would understand.  He would also understand that if the case isn’t filed before the statute of limitations runs, the value is then $0.

Similarly, if I took a criminal defense lawyer and suggested a plea deal for her client, she would weigh that plea deal against the uncertainty (“risk”) of going to trial.  She would also know that in certain circumstances, such as a double jeopardy situation, her client is not at risk at all from that case.

Security models follow a similar sort of pattern and balance.  In order for there to be a threat, all three legs of a stool must exist: capability, intent, and opportunity.  Colloquially, you can think of these as “means, motive, and opportunity”, just like in a crime novel.  In the examples above, statutes of limitation or double jeopardy preclusion would limit capability.

Let’s look at a couple of real-world examples of “security threats” that were negated because one or more legs of the stool were missing.

Not too long ago, a woman was killed after ramming part of the White House.  I’m rather a fan of Popehat’s coverage on the matter, but let’s think about this in terms of our three-legged stool.  The woman could not possibly have had the capability to harm the White House or any of the protected staff therein.  Nor did she have any capability to harm the Capitol complex.  She certainly may have posed a risk to pedestrians and officers trying to stop her, but she didn’t pose any risk of harm to the President.  It would behoove our media to gain some knowledge about our stool, eh?

More recently, a drone landed on the White House lawn. This incident exposed a notable hole in security, but the pilot did not have the intent to harm anyone, so there was no security risk posed by him.  Note, of course, that incidents like this are actually incredibly valuable if you’re trying to secure a system.  They show you weak points before something bad happens.

What’s the lesson here?  Applying this model is easy.  You can protect yourself against a threat just by negating any of the three legs.  The biggest part of security, though, is knowing what assets you have- something that we’ll cover next time.
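The three-legged stool reduces to a simple conjunction. A toy Python sketch follows; the Actor class, its field names, and the sample values are all invented for illustration, and real assessments are rarely this binary.

```python
from dataclasses import dataclass

LEGS = ("capability", "intent", "opportunity")


@dataclass(frozen=True)
class Actor:
    """Hypothetical record of one actor, judged against one specific asset."""
    name: str
    capability: bool
    intent: bool
    opportunity: bool


def is_threat(actor):
    """A threat exists only when all three legs are present."""
    return all(getattr(actor, leg) for leg in LEGS)


def missing_legs(actor):
    """List the legs that are absent, i.e. why the stool falls over."""
    return [leg for leg in LEGS if not getattr(actor, leg)]


if __name__ == "__main__":
    # Invented example: able and positioned to do harm, but with no wish to.
    example = Actor("example actor", capability=True, intent=False, opportunity=True)
    print(is_threat(example), missing_legs(example))
```

Negating any single leg is enough to make is_threat return False, which is the point of the model.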

On choosing open licenses

I’ve recently written a technical roadmap for one of my shadow projects.  I’ll quote what I have so far about software licensing, at least as it pertains to feedstock:

  • Open source feedstock preferred
    1. Licenses that permit us to keep the source internal, even when we distribute binaries externally (BSD, MIT, Apache) are approved for use without legal review.
    2. Licenses that require the distribution of source when binaries are distributed (GPL, EPL) are allowed when any modification we might need to make to that component can be distributed without harming our position in the market.  In other words, we don’t want any “special sauce” licensed under these terms, but it’s good karma and community to contribute back to these projects.  Legal and finance review required.
    3. Licenses that are “weird” (vim’s Uganda license) or require distribution of source for any modifications (AGPL, non-OSI) are disfavored except for specific, limited uses.  Legal review required.
    4. Proprietary software may be used after a marketplace study and legal review.
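The tiers above can also be written down as a lookup table, which makes them easy to check mechanically. A rough Python sketch: the tier wording paraphrases the list, while the identifiers (TIER_RULES, LICENSE_TIERS, review_for) and the choice to route unrecognized licenses to tier 3 as “weird” are assumptions made for illustration, not part of the roadmap.

```python
# Tier descriptions paraphrased from the roadmap list.
TIER_RULES = {
    1: "approved for use without legal review",
    2: "allowed if our modifications can be published without harming "
       "our market position; legal and finance review required",
    3: "disfavored except for specific, limited uses; legal review required",
    4: "usable after a marketplace study and legal review",
}

# License families named in the list; keys are upper-cased for lookup.
LICENSE_TIERS = {
    "BSD": 1, "MIT": 1, "APACHE": 1,
    "GPL": 2, "EPL": 2,
    "AGPL": 3, "VIM": 3,
    "PROPRIETARY": 4,
}


def review_for(license_name):
    """Return (tier, rule) for a license family name.

    Assumption: anything not listed is treated as "weird" (tier 3).
    """
    tier = LICENSE_TIERS.get(license_name.strip().upper(), 3)
    return tier, TIER_RULES[tier]


if __name__ == "__main__":
    for name in ("MIT", "EPL", "AGPL", "Some New License"):
        print(name, "->", review_for(name))
```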

The essential idea here is that I want my projects to contribute back, but I want to do so voluntarily. I particularly want to avoid a situation where I need to implement “special sauce” in a component, but then may have to release that code out into the marketplace. There’s no particularly good reason I would want to take on, say, a virally licensed webserver.

In terms of choosing feedstock, it’s often hit or miss, but I always look at the licenses before I start, particularly for “scaffolding code” that will wind up being the core of an app.  I’ve been working more in Clojure lately, and it’s somewhat annoying to track down BSD or MIT licensed starting points.  The community mostly uses EPL, which is fine for dependencies like a database connector.  It’s not so great as a starting point when I want to release my app under a BSD license.

In terms of releasing my own projects: I tend to license as BSD3, and I figure that I can always fork internally if I need to include some “special sauce”.

So why am I writing? Because Mailpile is asking for opinions, and they’re between two very different options:

  • AGPL, the most viral of the viral licenses, which would essentially require anyone running a modified Mailpile as a network service to make the source available to its users, and
  • Apache 2.0, which is not nearly so restrictive.

Between these two options, my vote is emphatically for the Apache license.  However, it might make sense to consider a middle-of-the-road approach and look at using the EPL or GPL.

Note: This post is basically about feedstock; that is, dependencies for your production code.  If you’re talking about tooling, particularly when you don’t plan to modify it, then you don’t have to worry so much about “viral” licenses.  For example, most users don’t really need to worry that vim has a weird license, because it’s a tool.  If your project revolves around modifying vim extensively, then you of course will have to live with its license.

If you’re interested in more on this, tweet me up @franksiler.  Please remember: I’m a lawyer, not your lawyer.  This post is for informational purposes only.

Updated UBE Map

I’ve updated my UBE Map for late 2014.  Make sure to read my previous post and check out the ABA Guide to Bar Admissions.  My source for information is the NCBEX page on the UBE.

MyCase review

Practice management is like doing taxes: it’s not something most people take any particular enjoyment in, but you have to stay organized and on top of things. I have high, but I think reasonable, expectations for services. They are:

0. Honesty.
1. Don’t lose data.
2. Make it easy for me to save my own data.
3. Customer support in the form of reasonable response to email and other online queries.

MyCase fails on every item. I suggest not even wasting your time with a free trial. The “full backup” option does not back up documents you upload, which is a major shortcoming. I couldn’t get MyCase to rename the backup feature to a more reasonable name, such as “schema backup”. I also suggested the correct fix, which is simply to make the full backup actually, uh, do a full backup. Imagine that.

I also got the runaround on simple questions, such as whether it was possible to have clients in more than one group. The real kicker though is, after what I thought was an undue amount of patience on my end, that they wouldn’t give me a pro-rata refund for my last month of service. The reason? Their “terms and conditions”. Okay, fine, but if it were my business I would think twice about picking a fight with a customer over ~$15, especially given that their customers are lawyers. Now, as a result of this asinine policy, I am telling you exactly why you may not want to use their product. Too bad, because I could just as easily write about features and why you might consider their platform.

I may or may not review other products, such as Clio, depending on whether I stay with a manual system or develop my own. But it’s only fair that I let you know that this system not only doesn’t meet my needs, but the producers are both negligent and not customer-friendly.

So, what could MyCase do to make me happy? There’s nothing they can do to induce me to remove this blog post in its entirety. I would, however, edit this article to reflect at least a pro-rata refund on their part. I might be persuaded to make this article more fair and balanced, rather than a mere gripe, by a full refund. But good grief, what a lousy experience.

UPDATE Sept 24, 2014. MyCase called me. I don’t know why. All they did was parrot back their “terms and conditions” and not offer to change anything. I’m not sure if it’s hubris, ignorance, arrogance, or what, but I’m sure glad I don’t entrust my data to them anymore. What next? A demand letter? That will be most entertaining.

UPDATE Dec 29, 2014. Turns out they keep the data read-only for six months.  Not too bad.  The product still gets an F overall, I’m afraid.

On holding a conference

  • Respect the audience.  Treat them like adults.  This means, among other things, assuming they know how to take appropriate notes and that they don’t need a “guiding voice” in order to figure out your message.
  • Make your message comprehensible.  This means having a central thesis and giving presenters useful guidance.
  • Do not try to induce socialization with gimmicks.
  • Run on time.
  • Nobody really cares about the swag.  Put more effort into talks and less into stuff.
  • Sponsors can be important sometimes, but don’t let their interests override those of the audience, who are giving the most precious resource of all: their time.
  • Avoid videos.
  • If someone offers you candid feedback and you call them, assume that you are going to get... candid feedback.  Imagine that.
  • This post is only here because I attended a godawful conference last week, and thought it might pay to state the obvious for others to read.

PythonKC Copyright and Licensing Slides