Oddly missing core functions

MSSQL is terrible for lacking some seemingly straightforward functionality- like the ability to drop a table only if it exists and has no rows. Salt to taste:

-- TODO: accept an array of tables, check all of them for existence and zero rows, then perform drop
CREATE PROCEDURE DropTableIfEmpty(@t AS nvarchar(max))
AS
BEGIN
    DECLARE @q nvarchar(max)
    -- Only drop when the table exists and every partition reports zero rows
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name = @t)
       AND NOT EXISTS (SELECT 1 FROM sys.dm_db_partition_stats
                       WHERE Object_Name(object_id) = @t AND row_count > 0)
    BEGIN
        SET @q = N'DROP TABLE ' + QUOTENAME(@t)
        EXEC sp_executesql @q
    END
END

To use: EXEC DropTableIfEmpty 'yourTableName'

Strong self-signed SSL certificates for development on IIS

# Generate a 4096-bit key and a self-signed certificate good for 10 years
openssl req -newkey rsa:4096 -nodes -keyout key.pem -x509 -days 3650 -out certificate.pem -sha256
# Inspect the certificate
openssl x509 -text -noout -in certificate.pem
# Bundle the key and certificate into a PKCS#12 file for import into IIS
openssl pkcs12 -inkey key.pem -in certificate.pem -export -out certificate.pfx
# Verify the bundle
openssl pkcs12 -in certificate.pfx -noout -info


Splitting LDIF files

I went to the St. Louis PowerShell meetup last night, and someone had an interesting issue.  They were dealing with a 4.5 gigabyte LDIF file- the processing logic itself had already been worked out, but PowerShell wasn’t cranking through the file fast enough for the application.  Most computing work is single-threaded by default, so the simple approach here is to split the file up and process the pieces in parallel.

What is LDIF, anyway?  It’s a file format used to store LDAP entries.  Happily, it’s simple in that entries are just separated by a double newline.  The lowest-common-denominator approach for a straightforward text file like this is to use something like sed or awk.  I don’t know a whole lot about actually using either of these, but I knew that awk was really meant for splitting up “fields” contained within “records”.

First things first: grab some sample data.
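If you don’t have real data handy, it’s easy to fake some- the entries below (users under dc=example,dc=com) are entirely made up, but they follow the double-newline record structure described above:

```python
# Generate a small fake LDIF file to experiment on.
# Entry names and attributes are hypothetical, not real directory data.
records = []
for i in range(10):
    records.append(
        "dn: uid=user{0},ou=people,dc=example,dc=com\n"
        "objectClass: inetOrgPerson\n"
        "uid: user{0}\n"
        "cn: User {0}".format(i)
    )

# LDIF entries are separated by a blank line (double newline).
with open("Example.ldif", "w") as f:
    f.write("\n\n".join(records) + "\n")
```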

My first concept was to use awk to grab some part of the first field and use that as a key. So for example, I might grab the fifth character from the first line- say it’s “q”. I’d then append that line to a new file, say file-q.ldif. But you always want to iterate and start from what you can get working, so I started with an example I grabbed- probably from some Stack Overflow answer, but I didn’t keep track:

awk -v RS= '/^#/ {next} {print > ("whatever-" NR ".txt")}' Example.ldif

The concept here was also to skip comments, which is why the /^#/ {next} is in there- it matches records that begin with a pound sign and skips them- and then dump each remaining record into its own file: whatever-1.txt, whatever-2.txt, and so on.

It turns out that I’m on a Mac, and awk on my system is the BSD variant- and it wigs out about too many files being open. No matter; we’ll close the files as we go.

awk -v RS= '{print > ("whatever-" NR ".txt"); close("whatever-" NR ".txt")}' Example.ldif

Really though, GNU awk is where it’s at, so I’ll be calling gawk from here. More generally, note that sed, awk, lex, and yacc all have serious cross-platform compatibility issues.  It’s easiest to standardize on GNU variants gsed, gawk, flex, and bison for all of these commands.  I’ve historically shied away from these tools because the syntax brings back bad memories of Perl, although in fact Perl grew out of these commands and not the other way around. I don’t know that I’ll use these tools every day, but it is certainly a handy approach when the “big guns” seem to be letting me down.

Back to the command above. It has a bug: it omits the double newline between entries as it writes the output files, so the LDIF entries get glommed into one big entry. To fix this, I need to fiddle with the output record separator variable, called ORS. This version grabs some characters from the first line of the input record to produce the output filename:

gawk 'BEGIN{RS="\n\n";FS="\n";ORS="\n\n"}{print >> ("file-" substr($1,9,1) ".txt")}' Example.ldif

The above approach has some merit if I wanted to group files by some content in the data. Here though, I only want to split the large file into smaller files of roughly equal sizes. The way to do it is to rotate systematically through the output files. Thankfully, this is EASY. The variable NR is the “record number”, so I can just take the modulus against however many files I want to split into. On this data sample, this approach results in much more even output file sizes than the substring hack above, and I don’t run the risk of getting weird garbage strings in the output filenames. I’m splitting into 6 files here, but salt to taste.

gawk 'BEGIN{RS="\n\n";FS="\n";ORS="\n\n"}{print >> ("file-" NR % 6 ".txt")}' Example.ldif

There’s one final refinement- because the record separator RS and output record separator ORS are equal, we can write this as:

gawk 'BEGIN{ORS=RS="\n\n";FS="\n"}{print >> ("file-" NR % 6 ".txt")}' Example.ldif
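If gawk isn’t handy, the same round-robin split is easy to sketch in Python- the file naming mirrors the one-liner above, though the modulus here is zero-based rather than starting from NR=1:

```python
# Round-robin split of an LDIF file into n smaller files,
# mirroring the gawk one-liner above.
def split_ldif(path, n=6, prefix="file-"):
    with open(path) as f:
        # Records are separated by a double newline.
        records = [r for r in f.read().split("\n\n") if r.strip()]
    outputs = [open("{}{}.txt".format(prefix, i), "w") for i in range(n)]
    try:
        for i, rec in enumerate(records):
            # Rotate through the output files, like NR % 6 in awk.
            outputs[i % n].write(rec + "\n\n")
    finally:
        for out in outputs:
            out.close()
    return len(records)
```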


Embroidery the Hard Way

With tens of thousands of dollars of equipment on hand, one would think that the seemingly simple would be possible.  Oh dear.  Not even close.

For years I’ve been putting binding on blankets and embroidering them with, generally, just a couple words- a name or something.  I wanted to do a bigger project- a quote that ran the whole length of the blanket.  I can’t even count the number of ways this went wrong.

I have access to a Babylock Ellisimo, a Babylock Ellyse, and a Bernina Artista 170.  The Ellisimo, being the biggest and baddest of these, seemed like the logical place to start. So….

  • Babylock doesn’t label their frames.  At all.
    • No size indication.
    • No useful indication of zero.
    • No model name or number you can search.
    • Not even an easy way to ask the sewing machine what size frames it knows about.
    • To add insult to injury, you can measure the frame- but the result is NOT the frame size.
    • Finally, I found out that the sizing is on the clear plastic insert pieces, which are also useful for rough alignment of your workpiece.  However, if you have the machine trace out the pattern perimeter, it doesn’t align with these inserts.  None of the pieces are actually labeled with part numbers or sizes, but at least you can count the square centimeters on the clear inserts.
  • Poor discoverability all around on the Babylock.
    • Insert the frame, remove the frame, try not to be upset when the moving thing breaks a needle….
    • As noted above, no way to know what size frame is loaded or what frames the machine knows about.
    • If the embroidery file on your flash drive is too large for any frame the machine knows about, it simply DOESN’T DISPLAY ANY INDICATION that the file exists!
      • Please, if there’s something wrong with the file, show some sign of life!
      • If the file is corrupt or a format you don’t understand, say so.
      • If it’s too big, say so.
      • Too many stitches?  You get the idea.
  • No way to index multiple designs together across frame positions, or at least not as far as I could tell.
    • Apparently some machines have some kind of machine vision.  I don’t need anywhere near that level of precision- just let me index off a known point.
  • Stitchcode
  • SewWhat-Pro
    • Some cool features.  Not worth $65.
  • The funny thing is that some of the hard stuff was remarkably easy.
    • The “Letterworks Pro III” software from 20+ years ago worked flawlessly to take an arbitrary TrueType font and turn it into a stitch path.
    • The pattern it generated was pretty clearly a raster, but it does work.
    • Some of the embroidery fonts have rather annoying bugs- like the “LaurenScript” font, which will have big gaps if you blow it up too much.  These don’t show up in Letterworks but are apparent on the Ellisimo display and in SewWhat.

I would much rather publish a HOWTO- but I’m in the quagmire for the time being.  There are days I wonder if a needle and thimble would be the easier thing to do….but then, this project is on the order of 50,000 stitches…..per side…..

PS It seems to me that machining, printing (2D and 3D), pick and place, and PCB manufacture all involve fundamentally the same operations, but they run on totally different stacks.  I wish we could’ve all agreed on HPGL, G-code, PostScript, or SOMETHING back in the day.

Metrics for sunsetting systems

Really, these are similar metrics to what you would use to choose systems- except they’re often on the tail end of a choice made years ago.

I frequently see multiple systems run in parallel that really should be integrated (e.g., why would one have separate time and access cards?).  If you’re going to choose between systems, score them both and compare their relative risks and benefits.

This is a good time to read my previous post about sunk cost bias and risk adjustment.

I propose the following categories and considerations for evaluating systems for sunset:

  • Longevity
    • How long do you expect to need the system?
    • What would you do if you had to replace the system?
      • What’s the horizon on the subsystems? (e.g., if it only runs on SPARC, it’s probably time to look at sunsetting it.)
      • What if the vendor goes out of business?
      • What if there’s some horrific reliability or security problem and you had to take it down?
    • Is there any reason you would have to upgrade or modify the system in the foreseeable future to meet some need?  If so, what’s the cost and associated risk?
  • Data Synergy
    • Does the system tie to your strategic data goals?
    • How easy is it to import, export, back up, and manipulate data stored in the system?
    • Is the system tied to a core competence? (e.g., if your main customer CRM is very siloed, that probably matters a lot more than if your timecard system is).
  • Finance and Risk
    • Operating and maintenance costs.  Don’t forget that periodic downtime has some associated cost, and it’s really not acceptable for any system to be down for, say, 6 hours every night.
    • Patch intervals, speed of patch issuance, difficulty, likelihood of breakage.
    • How does the system score for PCI and other security frameworks?
    • Does the system integrate nicely with both your core systems and your customers’?
    • Does the system support accretive data?
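To make the comparison concrete, a simple weighted scorecard over these categories works- the category weights and example scores in this sketch are placeholders, not recommendations:

```python
# Hedged sketch: score candidate systems across the categories above.
# Weights and example scores are entirely hypothetical.
WEIGHTS = {"longevity": 0.3, "data_synergy": 0.4, "finance_and_risk": 0.3}

def weighted_score(scores):
    """Combine per-category scores (0-10) into a single number."""
    return sum(WEIGHTS[cat] * val for cat, val in scores.items())

# Example: compare the incumbent against a candidate replacement.
incumbent = {"longevity": 3, "data_synergy": 5, "finance_and_risk": 6}
candidate = {"longevity": 8, "data_synergy": 7, "finance_and_risk": 5}
```

Scoring both systems the same way at least forces the risks and benefits onto a comparable footing before anyone invokes sunk costs.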

Assessing risk (and pitfalls)

I’ve already mused a bit on project estimation, and I have an upcoming post about how to think about sunsetting systems.  It occurs to me that there are at least two more fundamental questions to address:  what does it mean to talk about the risk in a system, and what biases might we have when making that estimate?

What is risk?

Finance types define risk as the “standard deviation”- a measure of how far things might stray from plan, and how far away from your estimate they might send you.  So if you say the project’s estimated cost is $100 with no risk, you’re saying the cost should be exactly $100.  But a cost of $100 with a risk of $50 (we’ll naively assume normally distributed) implies your project cost could vary a lot from that $100 estimate.
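A quick simulation makes this concrete- naively assuming the normal distribution as above, here’s what a $100 estimate with $50 of risk actually implies:

```python
import random

random.seed(42)

# Simulate 100,000 project outcomes: mean cost $100, risk (std dev) $50,
# naively assumed to be normally distributed.
outcomes = [random.gauss(100, 50) for _ in range(100000)]

# With that much risk, a large fraction of outcomes land far from $100:
# roughly 16% exceed $150 (one standard deviation over), and roughly
# 16% come in under $50.
over_150 = sum(1 for c in outcomes if c > 150) / len(outcomes)
under_50 = sum(1 for c in outcomes if c < 50) / len(outcomes)
```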

I often see risk discussed by technical people in terms of roadblocks: “it might be really slow when we hook widget X up to database Y.”  “We might not be able to translate records directly from System Q to System W.”  These are fine things to know about- but if you can quantify them in terms of time lost, revenue lost, or raw dollars, that will get you closer to being able to put a bottom line on your estimates.

When doing estimates, teams often use a sort of placeholder.  Instead of estimating that a task will cost, say, $1000, we tend to say it’s a “Small” task, and other tasks might be “Medium”, “Large”, or “Extra-Large”.  This is fine.  However, I also think it would be useful to score each task for risk.  For example, I think most developers would agree that “fetch a web page” is a relatively low-risk task, while doing bleeding-edge machine vision is riskier.  From here, perhaps you extrapolate that the “fetch a web page” task is estimated to cost $800-1200, while a riskier task of similar size might be $500-2000.


What biases do we have?

We all have biases- there are some tropes in industry, such as “not invented here” syndrome.  Most technical folks, I think, have some bias either toward platforms they’ve already used and understand, or toward the latest “sexy” technology.  I suffer from a variant of this where I prefer open-source products.

There’s another bias inherent to humans: the sunk cost fallacy.  We tend to impart value to things we’ve paid for, even if that past transaction has nothing to do with the current decision.  Sunk cost bias is fairly intuitive if you think about it, but it’s difficult to consciously minimize when making decisions.  The essence: it doesn’t matter that you spent a million dollars last year.  The question is the cost of a new system versus the ongoing cost of the current one.


Algorithms that are 80% good aren’t scary.

Leaving aside the difference between sensitivity and specificity: if people know the machine is sometimes wrong, it’s not so bad.  Things get scary when the machine is 99% or 99.9% accurate, and you’re caught on the wrong end of a presumption.

I’ve often wondered, for example, how many people get picked up on warrants for someone else.  I think it’s important that we have systems for review of the totality of the circumstances by a human.

Vexing little bugs

I find that, particularly with geek stuff, I get hung up on tiny little details.  For example, I did a deep dive with JavaScript and CSS the other week, trying to find out why I couldn’t get an input field to select all the text inside when I clicked on it.  This ties back to Spolsky’s thoughts on craftsmanship.

But if I channel Godin for just a bit, I’d remember that the whole product is important- not just one tiny detail.  It’s fine to care about little details once the building is built, but while you’re building- keep going.

2018 Law license reciprocity update

It’s been a while since I blogged about the UBE and Kansas.  I was surprised that Kansas took an extra couple years to join, and even more surprised at the adoption in the northeast.  I had always assumed that jurisdictions such as California, Florida, Texas, and New York would maintain their own licensing regimes- but New York joined the UBE in July 2016.

As always, consult local rules.  NCBEX publishes a very useful guide about bar admissions, but I’m going to supplement it here with some reciprocity rules as I’ve found them on this date.  No guarantee this information is accurate or will be kept updated, folks!

I’ve got more than five years practice in Kansas now, so I’m looking at states in the general area I might consider joining.  Here we go:

State      Applicable Rule(s)   Active Practice Requirement
Missouri   8.10                 5 of last 10 years
Iowa       31.12, 31.13*        5 of last 7 years
Nebraska   § 3-119(B)           3 of last 5 years
Illinois   705                  3 of last 5 years

*link to rules on a specific date!  Be careful with that one.

On project estimation

This may or may not be a series, but I wanted to dash off a few thoughts.  I have a feeling this post will come across as very stream-of-consciousness but will be clarified by followup posts.

The essence of estimating projects is to evaluate two things:

  • How long is it expected to take?
  • What risks are there?

There are some brilliant writings on the first issue, but I haven’t seen much done on the second.  I’d like to propose the following:

  • Each task, in addition to whatever size rank you want to give it (e.g., Small, Medium, Large, Extra-Large), gets ranked for risk.
    • For example, let’s suppose there is a “grab webpage” task, and every member of the team agrees it is Small.  That implies low variance.
    • Let’s suppose you want to parse the webpage you’ve just grabbed, and you get the following votes as to its size: S, S, L, M, XL.  That implies a large variance, and higher risk.  Note, however, that this level of disparity might also imply that the task isn’t well scoped, that different team members are using different assumptions, or some other definitional or implementation issue.
  • These variances should be accounted for in the budget and as part of burndown.
  • One feature that I haven’t seen in any tool is the ability to do “actual” vs “budgeted” burndown: the comparison between actual time on project versus the budget.  The useful thing about taking risk into account upfront is that you may very well find that while your actual burn exceeds a straight-line estimate, the numbers actually do fall within your predicted risk range (Nate Silver can tell you all about this one).  Of course, you can’t know until you have the tools to do this.  I’m currently exploring abstractions to allow various permutations of Kanban and other project monitoring and tie them against commits, support tickets, and testing.
  • Now, one additional layer to consider is prospective maintenance cost.  Here, parsing a webpage is a “brittle” task- obviously, if someone upstream changes the format of the web page, you may very well have to start this task over from scratch, or it may cause future outages.  This whole “ongoing cost” and/or “brittleness” risk factor is something I’ve rarely seen accounted for in projects.
  • Finally, most groups don’t really estimate to account for interdependencies, especially internal ones, or testing.
    • Interdependencies are a major problem: how many times did the Boeing 787 slip?  Obviously, you can’t ship just the wings of an airplane.  But in software, you can often ship some intermediate product even if waiting on some “important” piece.  In fact, sometimes software is better without that “essential” feature.
    • I’m not necessarily a fan of writing tests before writing code, but I do think that organizations tend to focus on “features first” at the expense of some very brittle systems.  For the love of Pete, at least build a regression suite as you close dev and support tickets.
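The size-vote idea above can be made concrete by mapping each size to a rough point value (the mapping below is arbitrary- any roughly geometric scale works) and scoring risk as the spread of the votes:

```python
# Turn size votes into a risk score via their spread.
# The point mapping is arbitrary and only for illustration.
POINTS = {"S": 1, "M": 2, "L": 4, "XL": 8}

def risk_score(votes):
    """Standard deviation of the voted sizes: 0 means full agreement."""
    values = [POINTS[v] for v in votes]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance ** 0.5

# Everyone agrees the task is Small: no spread, low risk.
low = risk_score(["S", "S", "S", "S", "S"])
# The scattered votes from the example above: much higher risk.
high = risk_score(["S", "S", "L", "M", "XL"])
```

A high score here is a prompt for conversation- as noted above, it may signal genuine risk, or just that the task isn’t well scoped.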