The metrics series, vol 2: Measuring delivery
A personal approach to inspecting DORA+ metrics, the human way
In the first post of my metrics series I spoke about my process for deciding which indicators matter to me, as a manager of development teams.
This time around I want to dig deeper into the delivery aspect of metrics and, hopefully, provide some tips on how to use metrics in a way that works well with humans, not machines. I won’t go into details about each particular metric (there’s plenty of material online and you probably know them already). What I want to focus on is how to use those metrics, and how to evaluate whether they are positive or not.
Basic development indicators and what they can tell you
DORA metrics
The basics every manager is usually looking at:
Deployment frequency. Monitoring how often the engineers push changes to production speaks volumes about how granular and well-refined the user stories are, and to what extent the testing effort is properly scattered across layers and roles. When tasks are uncertain, engineers fall down rabbit holes and struggle to “make that daily commit”. As a consequence, they tend to produce bigger pull requests, and more complex work with increased verification effort. Add poor automated testing to the mix and you have a ticket (no pun intended) to Regression Testing Hell.
You typically get better at deploying frequently by facilitating effective backlog refinements, ensuring proper coverage at the unit and integration level, and (no surprise) debating with the team about their branching strategy. Applying these experiments to different extents has brought my teams to a deploy frequency of 3-6 releases per week.
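If you want to see what that weekly count looks like from raw data, here is a minimal sketch. It assumes you can export deploy timestamps from your pipeline; the data and function name are hypothetical:

```python
from datetime import datetime
from collections import Counter

def weekly_deploy_frequency(deploy_times):
    """Group deploy timestamps by ISO (year, week) and count deploys per week."""
    return dict(Counter(t.isocalendar()[:2] for t in deploy_times))

# Hypothetical deploy log: three deploys in one week, one in the next.
deploys = [
    datetime(2024, 3, 4, 10), datetime(2024, 3, 5, 16),
    datetime(2024, 3, 6, 11), datetime(2024, 3, 12, 9),
]
freq = weekly_deploy_frequency(deploys)
# freq maps (2024, 10) -> 3 and (2024, 11) -> 1
```

Plotting that dictionary over a quarter shows the trend far better than any single number.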
Lead time. Or cycle time, depending on your area of influence or business setup. As a principle, I don’t care much about the absolute number; the actual jam is in the breakdown, as that is what signals waste and opportunities for easy wins: what is the average coding time? Are the PRs lingering for days? Does finished work ferment in some env or branch without reaching production?
I try to shoot for 8-12h of coding time. This means the tasks are broken down in a healthy way, but also that there is enough autonomy and knowledge in the team. Of course there are deviations here and there, but clusters of longer coding cycles are a red flag for someone too junior left alone, someone hoarding all the work for a service, something so tangled that it’s a nightmare to work on… joy killers. Look for the clusters and do some root cause analysis to smooth them a bit on every iteration.
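The breakdown itself is simple arithmetic once you have the four timestamps of a work item. A sketch, with hypothetical stage boundaries (your tooling may cut the stages differently):

```python
from datetime import datetime

def cycle_breakdown(first_commit, pr_opened, pr_merged, deployed):
    """Split one item's cycle time into stages (in hours) to spot where work lingers."""
    hours = lambda a, b: (b - a).total_seconds() / 3600
    return {
        "coding": hours(first_commit, pr_opened),
        "review": hours(pr_opened, pr_merged),
        "deploy_wait": hours(pr_merged, deployed),
    }

# Hypothetical item: 10h of coding, then the PR sat in review for almost two days.
stages = cycle_breakdown(
    datetime(2024, 5, 6, 9), datetime(2024, 5, 6, 19),
    datetime(2024, 5, 8, 12), datetime(2024, 5, 9, 10),
)
# stages == {"coding": 10.0, "review": 41.0, "deploy_wait": 22.0}
```

Here the coding stage lands inside the 8-12h target; the waste is clearly in review and deploy wait, which is exactly the kind of easy win the breakdown is meant to surface.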
Another classic is PR pickup time. When PRs take longer than one business day to be reviewed, something fishy is happening. Investigate your figures and find out if only a portion of the team is reviewing all the work, if the habit is just not set yet, or if someone is producing 5K-liners that no one wants to touch with a stick. This last one I see as a sign of poor teamwork. When you produce an exorbitant pull request, you place a burden on your peers, who have to choose between dedicating their day to understanding it, or just pretending they have read it and approving it outright. Don’t be that person.
In my personal experience, improving the PR pickup time and encouraging the team to work in small commits (or to open draft PRs) can easily mean a 30% cut in overall cycle time. Just a mindset change that does not require the team to work more, and one that makes upper management very happy.
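The one-business-day check is easy to automate against your PR data. A sketch, under a deliberately naive calendar rule (weekends skipped, holidays ignored; the function name is hypothetical):

```python
from datetime import datetime, timedelta

def pickup_exceeds_one_business_day(opened, first_review):
    """True when a PR waited more than one business day for its first review."""
    deadline = opened
    business_days = 0
    while business_days < 1:
        deadline += timedelta(days=1)
        if deadline.weekday() < 5:  # Monday-Friday count as business days
            business_days += 1
    return first_review > deadline

# Opened Friday afternoon, first reviewed Monday morning: still within one
# business day, so this returns False.
late = pickup_exceeds_one_business_day(
    datetime(2024, 5, 10, 15), datetime(2024, 5, 13, 10))
```

Run it over every PR in a sprint and the count of `True` results gives you the portion of work that waited too long, which is a better retro talking point than the raw average.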
At the moment, cycle time in my teams lies somewhere between 2.5 and 5 days.
Change failure rate, and Mean time to recovery. Two metrics I don’t actively monitor and would advise being very wary of. The good side of them is that they can tell you how friendly your deploy pipeline is when you really need it, or how thorough your testing is. But these can easily become cheat metrics. Aiming for zero change failure rate might mean reducing your deploy frequency or increasing your testing effort to absurd levels. Maybe you don’t want to invest so much in that, or maybe it’s more important for your particular business case to be out there than immaculate. The same goes for mean time to recovery. Not all fixes require immediacy, and I would rather have a team with the ability to make an educated decision on how a production issue impacts revenue and whether it really must be fixed ASAP.
Less popular metrics I might monitor
And *might* is the keyword: I would only look at these from time to time, maybe 4 or 6 weeks after an experiment is introduced, or roughly once per quarter. Some of them I never look at, except when something goes wrong.
Commit strategy/frequency. Especially for critical or complex features, I expect to see a number of commits before a pull request is created. You would track this one to understand how a group of people is pushing for a user story, whether smaller pieces were shared before showing the whole thing, or to evaluate to what extent the work was properly analysed and planned before starting it. Again, not something to look at every sprint, but maybe worth it if you have a gut feeling about a release, or the team wants to debate in a retro about something that took way longer than expected.
PR review time, depth, amount. If the team has a commitment that items need to be reviewed before being merged to a master branch, then I expect it to be a shared effort. A junior engineer who never looks at their peers’ code might be missing great opportunities for learning. A senior engineer who is consistently working solo and refuses to review PRs, or approves them right away in 5 minutes, is failing in their responsibility to be a reference for their more junior peers. I don’t hire seniors because they code faster, but for the chance to expose their newer peers to their thought process and techniques. Last but not least, don’t forget the perfect opposite: that senior engineer who will leave 25 comments on every pull request, bullying others or just blocking releases over bare nitpicking. Yes, that guy you are thinking about. I want to know if that’s happening.
Open branches / WIP. I have worked with engineers who tend to open new branches when something is blocked or they can’t figure out a solution for it. Others just open many things in parallel out of optimism, or because they want to be THE ONE shipping a particular task that is attractive to them. This is a measure I would look at when a stakeholder or peer reports that an engineer seems idle or spread thin like butter. Not to punish the person: there are a variety of reasons an individual would operate that way, and not all of them are a problem.
PR size. This is a good metric to look at, if you can weigh it against others. Big changes will happen sometimes and will be unavoidable, but again, when receiving a comment on the agility of a team or an individual, this is one of the indicators of complexity, low solidarity, lack of clarity, or waste.
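Because the occasional big change is legitimate, comparing against the team’s own median beats any fixed threshold. A sketch with hypothetical PR identifiers and an arbitrary outlier factor:

```python
from statistics import median

def oversized_prs(pr_line_counts, factor=3):
    """Flag PRs whose diff is far above the team's median size (naive rule)."""
    typical = median(pr_line_counts.values())
    return [pr for pr, lines in pr_line_counts.items() if lines > factor * typical]

# Hypothetical sprint: three normal-sized PRs and one 5K-liner.
sizes = {"PR-101": 80, "PR-102": 120, "PR-103": 95, "PR-104": 5200}
flagged = oversized_prs(sizes)
# flagged == ["PR-104"]
```

A flagged PR is not automatically a problem; it is a prompt to cross-check with pickup time and review depth before drawing conclusions.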
New feature/refactor/rework %. Again, a string to pull when something feels off, and none of these is intrinsically negative; it just needs to be put in the context of what the team is working on, their seniorities, and the stability/clarity of the system or the specifications.
The don’ts: Velocity and other friends
Throughput, velocity, or anything that means putting the amount of work done on a scale and obtaining a sum: please don’t. You’ll be a clown, and your teams will adjust to fool you.
Story points are a totally subjective measure of perceived effort to complete an assignment. If you want higher output, you just have to estimate higher. Bonus clown points for the teams who estimate a number of points for engineers across the stack, plus QA, plus deployment, and then come up with the total. They end up with precious fat numbers that are only a signal of management pressure or fear of complexity.
Even for teams that use point estimations with the purest of intentions, they end up being limiters (hey, the sprint feels emptyish, but we always do 20 points and we already added 21) or frustrations (we always deliver around 30 and this time we completed 15: the product owner is going to be angry or in trouble). The number of points ends up becoming THE GOAL, which is nuts.
The same goes for throughput. The amount of backlog items delivered varies depending on the team’s etiquette regarding how atomic subtasks must be. The minute you start inspecting throughput, the team will start creating “write tests”, “tag release” or “deploy to pro” subtasks to make that bag bigger. I bet your customer won’t notice that increase.
Leaderboards. I have zero interest in leaderboards displaying the lines of code committed, points delivered, or any other output indicator at the individual contributor level. Especially in teams that pair or mob program regularly, there is no way to assess individual responsibility over every new change. Personal principle: when I worked with tools that include a leaderboard, I chose not to share the tool with the team, to prevent them from looking at it.
Endless improvement. Improving your metrics is a very noble initiative, and it will probably have a positive impact on your customer satisfaction, your product’s competitiveness in the market, and your team’s reputation and mood. This is great. But the team can only improve to a point, and on top of that, your (or their) experiments add cognitive load. It’s not sustainable to keep your team forever in “get new habit” mode, and everyone has an optimal zone of effort/results balance. Commit to something that makes sense in relation to the humans in your team and the environment in your company (take a look at some industry benchmarks if you need a reference), and when you are reasonably close, just celebrate and let them settle down.
Making numbers the whole point. Nothing in these figures guarantees value was delivered. Shitty products are shipped every day. Lots of high-performing teams are also high-churn snake pits. It’s tempting to go to your manager or peers with your fantastic cycle time numbers, or to rejoice in your continuous deploy pipeline, but it won’t be worth it if you don’t release the right thing, and it won’t last if the environment is unhealthy.
Believing numbers tell you the whole story. The most wonderful dashboard in the world will still leave you blindfolded if you never approach your team. User feedback, 1-1s, water cooler conversations… aim for the bigger picture.
Last but not least: how do I measure all that stuff? I’ll write about that in post #3.