Conducting an Effective Postmortem

Wouldn’t life would be great if nothing ever went wrong and technology never broke?

Unfortunately, the reality is that things break all the time.

Your product probably breaks a little bit all the time and sometimes breaks big time (hopefully, much more occasionally).

Perhaps your website goes down for 4 hours during your busiest season and you miss $10M in potential revenue. Or, you discover that no payments have been processing for 2 days before anyone noticed.

In these serious to catastrophic cases, you really want to get to the bottom of why the problem occurred, with a view to making sure it never happens again.

That’s where conducting an effective Postmortem becomes vital.

Avoid Blame

The ultimate purpose of the Postmortem is to make sure the problem experienced never occurs again.

In order to do this, a Postmortem involves ferreting out root causes (more on this below). Bringing blame into the Postmortem process itself risks causing defensiveness and “ass-covering”. This defensiveness tends to obscure the true root causes.

This is not to say that blame isn’t important or can be avoided. Perhaps someone needs to be fired. But, keep the “blame” part separate from the Postmortem itself in order to get to the true root causes.

The Postmortem Process

I like to conduct the Postmortem process as a group, with all the stakeholders in a room in front of a whiteboard. It’s important that everyone impacted and/or responsible for the problem is involved and has a voice.

At a high-level, my process for conducting a Postmortem is as follows:

agree and define the impact of the problem to the business – e.g. “we lost $10M in potential revenue”
flush out all the causes of the problem down to their root, as far as possible
agree a set of recommendations aimed at ensuring the problem never occurs again

Let’s use a contrived example for illustration. Imagine that I fell off my bike and broke my wrist. We start on the whiteboard with the impact – i.e. I broke my wrist.

Why-Because

My preferred method to analyze causes (step #2 above) is Why-because Analysis.

Why-because is a formalized process but don’t be put off – it can be used more casually with great success and you can add rigor as you become more familiar.

Why-because essentially involves repeatedly asking “Why?” and repeatedly answering with the “because” part. My 5-year old son is also great at this.

e.g. “Why did I fall off my bike?” “…because I hit a pothole.”
“Why did you hit a pothole?” “…because I wasn’t looking where I was going.”
“Why weren’t you looking where you were going?” “…because I was distracted”

…you get the idea.

Why-because is similar to other processes you may be familiar with like “5 Whys”. (I have found 5 whys to be insufficient because big problems typically have complex causes and the causal chains are often more than 5 levels deep.)

What you end up with at the end of the Why-because Analysis is a graph that shows you all the contributory causes that caused the impact on your business. More formally, when complete, the Why-because graph should include all the necessary and sufficient causes.

Continuing our example, here’s our Why-because analysis of why I broke my wrist:

Of course, you need to decide when you’ve gone deep enough and can stop asking Why? There is no hard-and-fast rule here – just use your judgement – but you don’t want to end up drilling down to “because the big bang happened” in every case.

One great thing about Why-because graphs like the one above is that you can test them to make sure they’re complete:

for each box on the chart, you can ask, had this not occurred, would the problem still have occurred? If the answer is no, it’s a necessary condition.
looking at all the boxes on the chart, you can ask, if all of these happened again, would the problem occur again? If the answer is no, your conditions are not sufficient and you’re not done yet.

Generally, big problems tend to have complex causes. This is because any reasonably mature organization will have checks and balances in place to avoid obvious and predictable failures.

Therefore, you will likely end up with a complex graph that includes a mixture of technical, operational and human contributory factors. It’s particularly important not to overlook or underplay the human factors since fixing the technical and operational issues alone will not avoid the problem recurring.

Recommendations

The most important part of the process is to create a list of recommendations to act on, informed by the detailed understanding of the causes from the Why-because analysis.

Don’t forget the human factors – these are often the most important to address, e.g. additional training, more staff or better process.

Again, you can test your recommendations by saying, if we do all these things, is it highly likely to prevent this problem from recurring again? If the answer is no, you’ve not got the right recommendations.

Lastly, give each recommendation an owner who is responsible for taking action and be sure to follow-up.

So, you inherited a codebase… just how screwed are you?

In a perfect world, every software developer would finish school, build a product from scratch, IPO the company and retire. However, that’s very, very rarely the case and the majority of software people will at one time or another in their careers inherit a codebase that they weren’t involved in creating.

The most common situations where this happens are when a company is acquired by another and when joining a company (because of staff growth or turnover). But, regardless of the reason, the software developers, managers and executives involved are expected to take it over and continue to achieve forward progress.

The problem is that not all codebases are created equal – they can range from absolute disasters to dream scenarios; from the ridiculous to the sublime. Plus, aside from the code itself, there are a number of other factors which have a huge impact on the success or failure of the transition.

So, the purpose of this post is to give a framework for assessing how much of a risk the transition will be. Put another way, it’s also a framework for asking the right questions and for setting expectations at the right level.

How Screwed Are You?…a formula

Every situation is unique, but this is my rough formula to calculate the “Codebase Risk Factor” (CRF):

If you’ve immediately started to glaze-over from the math, don’t worry, I’m going to break it down for you.

This formula factors in what I believe are the 5 main risk factors when you inherit a codebase:

Test Coverage (“tc“)
Team Availability (“ta“)
Team (availability) Duration (“td“)
Defect Find/Fix Ratio (“ffr“)
Age of Codebase (“ac“)

The formula then weights each of these factors to give an overall score – the higher the number, the more screwed you are.

Let’s look at each of these factors in more detail.

Test Coverage

There are many other places you can go to read more about the benefits of test coverage and Test Driven Development (TDD) in general. I am not going to make that case here.

Suffice to say that, when you inherit a codebase, Test Coverage is your friend.

For those that are new to the concept, Test Coverage is usually defined as a percentage and measures how much of the codebase comes with automated tests that validate whether it’s working or not.

When you first inherit a codebase, you understand almost nothing about it. However, despite that, in most cases you are going to be expected to a. keep it working and b. add new features. Good Test Coverage lets you add and change things in the codebase with a much lower risk of inadvertently breaking something else that you don’t understand yet.

So, how do you assess Test Coverage? Fortunately, for most platforms and languages (in fact all the platforms and languages I’ve come across), there are one or more automated tools to analyze Test Coverage. These are not perfect and you should always ask the developers who worked on the code originally what they believe the Test Coverage to be (more on that below).

Also, beware of getting a false sense of security from a raw Test Coverage % number. Just like any codebase, not all Test Coverage is created equal. I’ve seen many cases where tests are written but they don’t actually properly test the code in question. Again, this is why you should seek out the view of the developers who worked on the codebase originally.

Team Availability

To be blunt, if you inherit a codebase but don’t have any access to the developers that wrote it, I’d say you’re in Big Trouble ™.

Only the original developers know where the bodies are buried: they understand how much Tech Debt has accumulated in the codebase that you’ll need to deal with, they know which parts of the code are solid versus which kept them awake at night and they know which bits of code were written by the good developers and which by that idiot that was fired.

By “developers”, I also count devops people since they are arguably the most important in keeping what you inherit up and running.

Now, by “Team Availability”, I don’t necessarily mean that the developers have to still be working for you full-time (although that would be best) but that they are available to answer questions and provide advice on an as-needed basis.

So, Team Availability is also defined as a percentage. You can either think of it as the % of people (e.g. 2 members of the original team of 10 are available, so 20%) or as a % of the people’s time (e.g. the original developers are available 1 day a week, so 20%).

Team (availability) Duration

Having access to the original team is great but often the follow-on question is, for how long?

From experience, I’d say 6 months is enough for most cases. It’s enough time to learn what’s already there and to make significant forward progress with new features. After 6 months, the original developers no longer have familiarity with the newest code so their usefulness starts to diminish.

If you’re planning on acquiring/inheriting a codebase, I’d recommend that the absolute minimum you should try to keep the original team available for is 3 months.

Defect Find/Fix Ratio

The Defect Find/Fix Ratio is simply a measure of how rapidly bugs are being found in the code versus fixed in the code. It’s a useful measure of how troubled the codebase is at the point in time you inherit it. If bugs are being found faster than they’re being fixed, that’s an indication of a problem.

In practice, the amount of time allocated to bug fixing varies based on the team’s other commitments and, therefore, the Find/Fix Ratio will tend to oscillate. So, it’s good to look at the Find/Fix Ratio over a period of time; the past month at least. The lower the ratio, the better.

Age of Codebase

While it’s true that code itself doesn’t age or rust, older codebases are, in general, more problematic.

There are a number of reasons for this: firstly, the older the codebase, the more likely that the people who wrote the code are no longer available and/or have forgotten how it works.

Secondly, the older the codebase, the older the versions of the libraries, packages, tools, etc that it depends on, unless there has been very deliberate effort to update these (hint: this rarely happens). This makes it much harder to make changes to the codebase, especially if some of the versions it depends on have reach end-of-life or are no longer compatible. This is all Tech Debt that you will have to address before you can move forward.

An Example

Let’s imagine that ACME Corp acquires Widgets Inc and the poor VP Engineering at ACME Corp inherits Widgets Inc’s codebase.

Widget Inc’s codebase is 4 years old. The team has been pretty stable at 10 people but 7 of those people have quit in anger because of the acquisition. The remaining 3 have been persuaded to stay around for 4 months in return for some ongoing stock vesting and a retention bonus.

The VP Engineering quizzes the remaining 3 and runs the test coverage tool. He also looks at the bug tracking system to see the recent trend. He discovers that Test Coverage is about 40% and, in the past 3 months, 280 bugs have been reported, of which 80 have been fixed.

So, here are our inputs to the formula:

test coverage: tc = 40%
team availability: ta = 30%
team (availability) duration: td = 4 months
defect find/fix ratio: ffr = 280/80 = 3.5
age of codebase = 4 years

Plugging them into the formula, here’s what we get:

So, this yields a Codebase Risk Factor of 320, compared to (as the mathematically-inclined will have noticed) a perfect score of 20.

In layman’s terms, pretty screwed.

Agree or disagree? Please leave a comment.

The Truth about Tech Debt

This is a topic which almost all non-technical and a lot of technical people don’t understand. This is sad and somewhat ironic because, if you are in any way involved in the software development business, you HAVE to understand it.

I think part of the problem is that, to non-technical people, Tech Debt sounds very nebulous and like precisely the kind of excuse that software engineers come up with for not getting stuff done.

Be assured that tech debt is very real and very deadly. Even those people who do understand what it is find it all too easy to kick the can down the road. Never forget the old saw “a stitch in time saves nine”.

So What is It?

Firstly, let’s explain the analogy: it’s called “debt” in part because you pay “interest” on it. That interest consists of the extra time and risk (therefore, cost) of adding or changing product features.

Furthermore, just like a credit card, the interest compounds. If you don’t address the current debt but go ahead and build new features anyway (i.e. spend more) then you are building those new features on top of your existing debt, meaning you’re paying interest on interest.

The only way to avoid paying interest is to pay back the debt. If you let the debt grow too far, you’ll end up spending more servicing the debt than building new features. Ultimately, you’ll become bankrupt.

Just like other kinds of debt, it’s all too easy and tempting to ignore but it never goes away if you do that – it just gets worse.

Err…So What is It?

Let’s try another analogy: think of your product like a building. Each feature you build is a floor you add to the top of your building. In order for a new floor to stay up, it’s dependent on the support of all the other floors below.

Imagine your building has 8 floors. Now you’ve got demand from new tenants so you want to add a 9th floor. Easy right?

The problem is that the 8th floor isn’t in great shape. It looks ok from the outside but remember you put a lot of pressure on Mikey, the contractor who built the 8th floor, to get the job done because you had tenants waiting? Mikey’s a good guy but he definitely cut some corners to make the deadline you imposed. That wall that was supposed to be load-bearing is just a facade. The wiring is not up to code. The plumbing was rushed so it’s leaking – you haven’t seen the stains on the ceiling of the floor below yet…but you will.

And…remember when you built the 7th floor? That contractor, Dave, didn’t really know what he was doing, did he? He didn’t understand architectural diagrams properly and didn’t know where the load-bearing walls were on the lower floors so he didn’t really lay out the 7th floor properly.

Oh…and don’t forget that Brian who built the 6th floor wasn’t involved in the building the original 5 floors so he didn’t understand the specialized heating system that had been put in and just added to it using standard components. The heating has never worked quite right on the 6th, 7th and 8th floors, has it?

So, ready to build that 9th floor? You could just go ahead. It’ll probably be ok, right? I mean, the building hasn’t fallen down yet…

I think you get the point.

Of course, when it comes to buildings, this is unlikely to happen, at least in a modern, industrialized country. We have professional architects, permits, building codes and inspectors.

Guess what? There ain’t no permitting or inspection agency for software development (outside of some very specialized areas). It’s the Wild West.

Causes

The causes of Tech Debt are manifold but I would say these all boil down to three root causes:

Time pressure. Needing to meet a deadline is just business reality. However, time pressure leads inevitably to corner-cutting. Corner-cutting creates tech debt. Beware artificial ship dates and their hidden cost (more below).
Pivots. Pivots are also part of business reality, especially for a startup. The only alternative to pivoting is flogging a dead horse…and then being a dead horse. But, to software engineering efforts, pivots are moving the goal posts. Pivots lead to stretching existing product implementations to fit new objectives, to shoving square pegs in round holes. Keep stretching and eventually things break.
Team Changes. If you inherit a codebase but not the team that wrote it, you probably don’t understand where the tech debt is or how big it is. You also don’t understand the quickest and easiest way to build on what has already come before. Similarly, when you add new people to the team, they typically have a steep learning curve to learn what has come before. In either case, that learning is never perfect.

Particularly Deadly to Startups

Tech debt is particularly deadly to startups because:

there is enormous time pressure because of funding rounds, meaning it accumulates quickly (and you never have time to address it),
they tend to pivot frequently, meaning tech debt accumulates quickly (and you never have time to address it), and
startups are resource constrained so there is little to no bandwidth to address tech debt.

Beware Artificial Deadlines

Sometimes there is a real deadline to ship a product – a trade-show where the product will be launched, or running out of funding being two common examples.

In the absence of a real deadline, creating an artificial deadline definitely has its place. A deadline serves to galvanize a team. It helps stop the team endlessly tweaking and gilding the lily.

However, the hidden cost of an artificial deadline is Tech Debt. Because of the planning fallacy (beyond the scope of this post), software development projects are almost always late so corners have to be cut to make deadlines. That might be ok if you put in place a plan to address the accumulated tech debt immediately after the product launches…but, let’s be honest; that never happens. Tech debt almost always gets kicked down the road because of the pressure to fix defects and respond to user feedback and requests.

Therefore, be wary of artificial deadlines – they have their place but they also have a dangerous, hidden price.

How to deal with Tech Debt

So, now we know what it is, how should you deal with Tech Debt?

1. Understand what it is

Firstly, accept the reality that Tech Debt is a thing. It exists. It’s real and it has to be monitored and addressed. Ignore it and it will get worse and bite you in the rear.

2. Understand where it is

Next, understand how much Tech Debt you have and where it is in your product. If you are not part of the development team, the developers will be happy to tell you; trust me.

3. Prioritize it

Once you know where the debt is and how big it is, it’s time to assess how critical it is to address it. There will always be some Tech Debt – it’s inevitable. Not all Tech Debt is critical to address immediately.

There are no easy answers because the future is not easy to predict. However, here are some rough rules of thumb:

Tech Debt that exists in core features that differentiate your product and/or which users rely on should be addressed urgently, because you are likely to want to continue to develop those features.
Tech Debt that exists in any layer of your product which future development will rely on should be addressed urgently, or you’ll be continuing to build on shaky foundations.
Tech Debt that exists in rarely used features, features that you don’t compete on or features that you plan to deprecate is a lower priority to address.

4. Plan to address it

This might be the hardest part. Addressing Tech Debt will take time. Unless that time is planned and built into the priorities and the schedule, it will not happen.

There is also an associated opportunity cost. You cannot change the laws of physics. Addressing Tech Debt will mean you can’t use that same time to move forward on new feature development. You have to accept that reality.

How to Minimize Tech Debt

You can’t avoid Tech Debt completely – it’s an inevitable part of shipping a product. There is unfortunately no perfect answer here – you have to strike a balance that minimizes Tech Debt but, at the same time, avoids the equal evils of premature optimization, gilding the lily and complete rewrites. Swinging the pendulum too far in another direction will be just as much of a disaster. (Hint: Software is never “done” and, left to their own devices, most engineers would prefer to tinker indefinitely without the annoyance of real users using the product and doing stupid things that they shouldn’t.)

But, just because you can’t avoid Tech Debt, doesn’t mean you can’t take some actions to minimize it:

Spread awareness of Tech Debt in your organization to make sure you have a common understanding that it exists, that it needs to be dealt with and the dangers of kicking it down the road.
If you’re not personally involved in the details of software development effort, try to keep your finger on the pulse of the level of Tech Debt is accumulating and communicate that with other stakeholders (particularly the board and investors).
Always build time into product development schedules to address Tech Debt and factor it into commitments made to customers and investors. Expect to spend a chunk of time/money after each major feature release to address the Tech Debt accumulated during the push to get the feature over the finish line.
If you inherit a codebase from a prior team or company, make sure you have ongoing access to someone who worked on it originally, for an absolute minimum of 6 months.
Practice Test-Driven Development (TDD). The details are beyond the scope of this post but good test coverage is invaluable in minimizing Tech Debt.
Avoid “proof-of-concept” implementations of products and features, especially in a startup where you are very resource constrained. While it’s definitely critical to develop iteratively and build a Minimum Viable Product (MVP) to test if users are interested in your product before you invest too much, you should also assume that all code written will make it into production – you can’t afford to write things twice. This doesn’t mean over-engineer; it means engineer once.

In terms of striking the right balance between being aggressive about shipping products versus premature optimization, my recommendation is that you only make investments that will get you through your next funding milestone or the next 12 months, whichever is later. i.e. if it won’t pay off in 12 months or before you flame out, it’s not worth it.

Why You Should (Almost) Never Rewrite Code – A Graphical Guide

I’m by no means the first person to write about the dangers of rewriting code. The definitive work for me is “Things You Should Never Do, Part I” written by Joel Spolsky in 2000. If you haven’t read that, you should.

A recent discussion caused me to create a series of charts to graphically illustrate the dangers of rewriting. I tend to think graphically – blame too much time in Powerpoint creating VC pitch decks.

I hope these charts help anyone considering rewriting code or being hectored by earnest, bright, young engineers and architects advocating a rewrite. I’ve seen this movie before and I know how it ends.

The Status Quo

Firstly, here’s the status quo:

The more time and money you spend on an existing product, generally speaking, the more functionality you get.

Let’s then overlay the competitiveness of the product in its target market. Of course, what you want to achieve is a continual improvement in the competitiveness of the product. Competitiveness doesn’t increase as fast as functionality. Your competitors don’t stand still so, typically, the best you can hope for is to increase competitiveness gradually over time – even staying flat is a big challenge.

Of course, this chart is idealized – functionality will increase in a more lumpy way as you release major new chunks of functionality and competitiveness will move up and down against your market. However, they demonstrate the point.

Note: the Y axis on all these charts represents “functionality” which, for the purposes of this discussion, can be considered a blend of features and quality. The distinction between the two is not really important in this analysis since you always want to be advancing on one or both and they both impact overall product competitiveness.

Let’s Rewrite!

Cue your software architect. He’s just been reading about this great new application framework and language called Ban.an.as – it’s so much better than the way you’ve been doing things previously. Hell, Ban.an.as has built-in back-hashed inline quadroople integration comprehensions. Plus, all the cool kids are using it.

If you’re a non-technical manager or executive, this can become quickly overwhelming and hard to argue against. They’re the experts, right, so they must know what they’re talking about.

[By the way, if you’re a non-engineer, there’s no such thing as “back-hashed inline quadroople integration comprehensions” – I made that up. Sorry.
A little secret here: there really haven’t been any new programming paradigms invented since the 1970s. Software folks are generally in their 20s and didn’t see them the first time around – they just get “re-discovered” and given new names. Sssh – don’t tell anyone.]

Just to be clear – I’m not saying that new languages, frameworks and technologies can’t create significant improvements in developer productivity – they can. But, their introduction into an existing product has a big price, as we’ll see.

Lastly, lest I alienate or offend my fellow geeks, I have been that very software architect advocating for the rewrite. I learned this the hard way.

So, this is how the rewrite is supposed to work in theory:

Let’s break this down:

The rework is expected to take some amount of time. During this time, functionality won’t increase because developers are focusing on rebuilding the foundations.
In the graph, that’s the blue area you can see peeking through. That blue area represents the cost of the rewrite.

But, once the rewrite is done, the idea is that progress will be massively greater than it was previously because the new technology used is inherently better than the old. Developers will be more productive with the new. There’s no need to work with the old spaghetti code – instead, there’s a beautiful new architecture free of all the baggage that came before.

The critical point in time is the break-even point – this is the point at which the functionality of your product starts to exceed where you would have been had you stuck with the original implementation and continued working on it.

During the rewrite period, the competitiveness of your product will typically decline since your competitors are not standing still and you can’t develop new functionality.

However, the claim typically made by the rewrite advocates is that the rework will be relatively easy, take a relatively short period of time and hence achieve break-even quickly, after which it’s non-stop to the moon…

What tends to go wrong – Part 1

So, what tends to go wrong? The most common problem is arguably the most common problem in software development generally; the rewrite takes significantly longer that expected.

There’s a discussion of why this tends to happen below but, for now, trust me that this often happens – if you’ve had any experience with software development at all, it’s highly likely that you’ve seen this happen too.

The result is that the cost of your rewrite is significantly larger than originally claimed. (The blue area showing through on the chart is now much larger.) This means that the break-even point is also significantly pushed back in time.

The knock-on effect is that the competitiveness of your product drops for a much longer period of time.

If you are a big company, you can probably (hopefully) absorb the pushed out break-even point – maybe other products provide revenue, maybe good channel relationships continue to deliver sales of your product even though it’s falling versus the competition or maybe you’ve simply got lots of cash reserves. Joel Spolsky references several big company rewrites that failed but did not kill the company.

However, if you are a startup or smaller company (or even a less fortunate bigger company), this can be – some might argue, is very likely to be – fatal.

What tends to go wrong – Part 2

It gets worse.

Not only does the rewrite often take longer than expected but functionality is not even flat at the end of it – functionality is actually lost as a side-effect of the rewrite.

That’s because all of those small but important features, tweaks and bug-fixes that were in the original product don’t get reimplemented during the rewrite. (These are the “hairs” that Joel discusses.)

Remember that the main focus of the rewrite in the developer/architect’s mind is generally to build a better architecture and these small features don’t seem important and mess up this beautiful new architecture.

Plus, however good the developers are, they will introduce new bugs that will only get exposed and fixed through usage.

The net result is that the product after the rewrite is, in the eyes of the end-user and the market, worse than the product before the rewrite, even if it’s better in the eyes of the developer.

This further pushes out the break-even point in terms of functionality and competitiveness of your product will take a long time to recover. If you listen carefully, you can probably hear your competitors wetting themselves laughing at your folly.

What tends to go wrong – Part 3

Oh dear.

The nail in the coffin is that, not only does the rewrite take longer than expected, and functionality get lost in the process, but the benefits of the new technology/language/framework turn out to not be nearly as great as claimed. Meanwhile, if you’d stuck with the status quo, you would have got more productive due to normal learning effects – i.e. don’t forget that while your team is climbing the learning curve with the new technology, you would have been getting better with the old one anyway.

Any one of the 3 problems above is potentially fatal but all 3 together is definitely so. Competitiveness of your product will likely never recover.

Why does this happen?

So, the above sections discuss what can go wrong. It might help to understand why this happens.

In Joel’s seminal post cited above, he makes the point, “It’s harder to read code than to write it.”

Very true – it’s also more fun to write new code than learn someone else’s code.

Developers like to feel smart and, intellectually, learning someone else’s code can seem like a lose-lose scenario to a lot of developers – if the code is bad, it’s painful to learn and you’re going to have to fix it. On the other hand, if it’s better than you would have written, it’s going to make you feel stupid.

Also, bear in mind that the fundamental motivation of most (not all) engineers is to Learn New Stuff ™. They are always going to gravitate towards new things over old problems they feel they’ve already solved. Again, I’m not faulting developers for this – it doesn’t make them bad people but, if you’re a non-developer, it’s critical you understand this motivation.

Put it this way: would you rather be the guy that architected the Golden Gate Bridge or one of the guys hanging off it on a rope, scraping off rust and repainting it?

Another problem is that when you rewrite, you typically make rapid progress early on which gives false validation to the decision to rewrite. It makes you feel smart – what a beautiful cathedral I’m building.

That’s because you’re writing code in a vacuum. No one is using the code yet; certainly no actual users. But, as you start to reach launch date, all those small features, tweaks and bug-fixes – all that learning that was encapsulated in your old product – starts to become conspicuous by its absence.

The bottom-line is that, when the case to rewrite is being made, it is not comparing apples to apples – it is comparing theoretical benefits with actual benefits. The actual benefit in question being, of course, that you have an existing product that works.

Comparing the pros and cons of writing a new application in language A versus B is not the same as comparing an application already written in language A versus rewriting that application in language B. Why? because any productivity gains of one language over the other are typically massively overridden by the loss of all the domain knowledge, testing and fixes encapsulated in the existing product.

Should you EVER rewrite?

So, are there ever circumstances when you should rewrite? My answer here is an emphatic“maybe”.

Let’s consider some of the possible situations:

1. An irretrievably sick code-base
The symptom here is that it takes exponentially longer to add each new feature – per the chart above. Another symptom is that defect reports continue to come in as fast, or faster than you can fix them – you’re treading water.
However, an irretrievably sick development team is a much more likely culprit than an irretrievably sick product.
Don’t let developers convince you that they can’t maintain a codebase – what they most often mean is that they don’t want to maintain a codebase.

2. The developers who wrote the code are not available
If you buy/find/inherit a codebase, make sure you have access to as many of the developers who wrote it as possible, for as long as possible. If you didn’t…doh!
Again, don’t fall into the trap – developers will always prefer to write new code of their own than learn how the existing code works.

3. A genuine change to a new problem domain
Sometimes, some of the fundamental assumptions and technology choices may no longer be valid for the problem domain that your product is addressing. But, then I’d argue, you’re really talking about building a new product rather than rewriting an existing one.

4. Fundamentally incorrect or limiting technology choices
In some cases, the team that originally built the product may have made some poor choices in the technologies to use or the approaches to take. They may simply not map well to the problem domain of your product. However, in my experience, this is true much less often than developers claim it’s true.
Also, there are points where you start to hit fundamental scalability limits of certain technologies. However, keep in mind that if you’re hitting those scalability limits, someone else probably has too before you, and solutions to the problem exist.

So, if you are even entertaining the suggestion of a rewrite, make sure you get the developers to give you a specific cost-benefit analysis of the rewrite. Show them the charts above and get them to tell you why these problems won’t occur in this case, even though they occur in almost every case.

If and when you are convinced that this is one of those rare situations where a rewrite is the right approach, you need to make it surgical: where will you make the incision? How deep do you cut? What are the risks? What are the potential side-effects? How can I make sure no functionality is lost? How can it be done incrementally?

Divide the rewrite into a series of smaller changes, during which functionality absolutely cannot be lost. Rewrite parts of the system at a time with a well defined, understood and testable interface between old and new.

Better still, don’t rewrite.

Good luck.

Agree, disagree? Please leave a comment.

VibratingMelon

observations from the startup frontlines

Category: software economics