What predictability actually measures (and why velocity is the wrong question)

Why I invented a metric I'd never heard of, and why the industry eventually caught up.

I started using the term "predictability" in 2019, and I was pretty sure I was making it up.

The situation that produced it was not subtle. I had just taken over an engineering organization where several members of the remote team had been running mouse jigglers to fake activity while collecting paychecks. Not occasionally. For weeks. We turned the team over, brought in new people, and I came in as the person running Engineering for the first time. My CTO wanted to tear the spyware off everyone's laptops immediately, which was the right instinct: you can't build a culture of trust while surveilling the people you're supposed to trust. But the founder-CEO was not exactly in a trusting mood, and reasonably so. He needed to be convinced, on a regular basis, that the new team was performing.

So I started collecting sprint metrics. Standard stuff: velocity, cycle time, points added and removed. Nothing revelatory. What I was actually trying to do was build a picture I could show to someone who was skeptical, something that said "here is evidence that the team is working and working well." And as I kept collecting and analyzing, I started adding to it. One addition was percentage of plan complete: of the tasks the team committed to at the start of a sprint, how many actually got done. I added that one because the other metrics looked fine on paper and we were still late on things. I needed to answer why.

At some point I added a predictability metric and started tracking it without knowing quite what I was looking at. Over time it became clear that the health of that number was a leading indicator of the health of everything else. I've been running it across organizations ever since, and it keeps proving itself out.

So what is it?

At the sprint level, it's a ratio. The team commits to a certain number of story points at the start of a sprint, based on historical velocity, based on what they think they can get done. At the end of the sprint, you look at what they completed. Completed points divided by committed points, expressed as a percentage. That's it. I've always asked my teams to land between 85% and 115%. A little under usually means the work was close: someone was out sick, a ticket carried over but was nearly done. A little over means the team finished what they planned and pulled in more work, which is healthy behavior. I've validated that range across seven organizations now. It holds.
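As a minimal sketch of that ratio in Python: the 85% to 115% band comes from the paragraph above, while the `Sprint` class and the function names are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass

# Healthy band taken from the text; the constant names are illustrative.
HEALTHY_LOW = 85.0
HEALTHY_HIGH = 115.0


@dataclass
class Sprint:
    committed_points: int  # points the team committed to at sprint start
    completed_points: int  # points actually completed by sprint end


def predictability(sprint: Sprint) -> float:
    """Completed points divided by committed points, as a percentage."""
    if sprint.committed_points <= 0:
        raise ValueError("committed_points must be positive")
    return 100.0 * sprint.completed_points / sprint.committed_points


def classify(pct: float) -> str:
    """Label a predictability percentage against the 85-115% band."""
    if pct < HEALTHY_LOW:
        return "under"
    if pct > HEALTHY_HIGH:
        return "over"
    return "healthy"
```

A sprint with 40 points committed and 36 completed scores 90% and lands in the healthy band. The same commitment with 28 completed scores 70% (under), and with 50 completed scores 125% (over).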

Outside that range is where it gets interesting. Significantly under means the team's read on their own capacity was wrong in a way that's going to hurt you when you try to set a delivery date. Significantly over means they radically underestimated, which sounds like a good problem until you realize the rest of the company may have planned around the original date and now everyone has to scramble.

One thing I want to be precise about: predictability measures capacity, not content. It tracks how much the team completed versus how much they committed to, deliberately ignoring what the work actually was. That's intentional. Teams rarely control their own prioritization. A PM can swap work mid-sprint. A P0 can land and demand immediate attention. A deal-critical task can materialize from nowhere. All of those are valid reasons to change what's in a sprint. The question predictability answers is narrower: given whatever was asked of this team, did they accurately forecast how much they could handle? If yes, that's a healthy team, even if the playing field kept shifting under them.

The sprint-level ratio is also a self-correcting tool when a team is using it well. A healthy team comes out of a bad sprint asking "why didn't we hit it and what do we do differently next time?" They don't need a manager to tell them something is wrong. The metric gives them the signal and the retrospective is where they diagnose it. When a team is consistently off, not a bad sprint here and there but off across multiple sprints, that's when coaching needs to start. Something structural is happening that the team isn't fixing on its own.
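One way to make "consistently off" concrete is to check whether the last few sprints all fell outside the healthy band. The sketch below is hedged accordingly: the text says "multiple sprints" without naming a number, so the default `window` of three is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def needs_coaching(
    history: Sequence[float],
    low: float = 85.0,
    high: float = 115.0,
    window: int = 3,  # assumption: the text does not specify how many sprints
) -> bool:
    """Return True if the most recent `window` sprints all fall outside [low, high].

    `history` holds predictability percentages, oldest first.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        return False  # not enough data to call it a pattern
    recent = history[-window:]
    return all(pct < low or pct > high for pct in recent)
```

With a history of `[95, 70, 102, 98]` this returns `False`, because one bad sprint is followed by recovery. With `[95, 72, 68, 130]` it returns `True`, because the last three sprints are all out of range, in either direction.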

Why the business actually cares

I've spent a lot of time with engineering teams that resist the idea of delivery dates. The Agile community helped build that resistance, and story points exist partly because reasoning about time is genuinely hard. I can't accurately predict how long it will take me to make dinner on any given night, and I've been cooking for decades. Software is harder. The instinct to move away from date commitments is understandable.

But the rest of the company runs on dates. Marketing needs to know when to have launch materials ready. Support needs to understand the feature before it reaches customers. Finance builds revenue models around when things go live. When Engineering can't give a credible answer to "when will this be done," it doesn't make the rest of the company stop planning. It just means they plan around guesses instead of data. That's worse.

Predictability gives you a way to set dates that are grounded in something real. Say you have a quarter-long project. You groom it, break it down, end up with a pile of threes and fives. You have historical velocity, so you can divide total points by average velocity to get a sprint count. But if you know your team runs at 80% predictability consistently, you know you need to adjust that estimate. You bake in the buffer. You set a date that's realistic and as aggressive as possible given what you actually know about how this team operates.
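The arithmetic can be sketched as follows. The text does not spell out the exact adjustment, so this is one interpretation under a stated assumption: the velocity figure is what the team plans per sprint, and effective throughput is that velocity scaled by predictability. If your velocity is already measured from completed points, some of this buffer is baked in and the adjustment would double-count. The function name and the example numbers are illustrative.

```python
import math


def forecast_sprints(
    total_points: float,
    planned_velocity: float,
    predictability_pct: float = 100.0,
) -> int:
    """Estimate sprints needed, rounding up to whole sprints.

    Assumption: effective throughput = planned velocity x predictability.
    """
    if total_points < 0:
        raise ValueError("total_points must not be negative")
    if planned_velocity <= 0 or predictability_pct <= 0:
        raise ValueError("velocity and predictability must be positive")
    effective_velocity = planned_velocity * predictability_pct / 100.0
    return math.ceil(total_points / effective_velocity)
```

For a hypothetical 240-point project and a planned velocity of 40 points per sprint, the naive estimate is 6 sprints. At 80% predictability the effective velocity is 32, and the estimate becomes 8 sprints. That two-sprint difference is the buffer.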

And then you run the thing, and you watch it.

The part nobody talks about

At Stoplight, this is where the early warning system mattered more than the estimate. We had long-running projects, some running a full quarter. Before we had any of this in place, Engineering couldn't be trusted to deliver on time, and the rest of the organization had internalized that. After we implemented these systems, we started hitting dates more consistently. But not always. You do the research, you groom the work carefully, and you still find dragons. Something you didn't see coming adds scope. Someone gets sick at the wrong moment. A dependency slips.

What changed was how we handled it. When we could see early that we were going to need another sprint and a half, we said so. Here's what happened, here's why, here's the revised date. That transparency never eroded the trust we were building. If anything, it built more of it. The rest of the company could adjust their own work, reschedule launches, reset expectations with customers. They were no longer surprised.

The sin was never being late. The sin was not communicating. Letting people plan around a date you knew wasn't going to hold without telling them. That's what damages trust. Lateness is a fact of software development. Silence is a choice.

Why not DORA?

DORA metrics have been popular for the better part of a decade, and I understand the appeal: they came out of serious research and the underlying intent was right. But the four metrics they landed on measure your deployment pipeline, not your Engineering team. Deployment frequency tracks how often you push to production. Lead time for changes measures how long a commit takes to reach production. Change failure rate is the share of releases that required a hotfix or rollback. Mean time to recovery tracks how fast you get back up after an outage.

The first two are gameable in ways that should be obvious to anyone who has run a team. One-line changes are still changes. You can increase deployment frequency without ever shipping meaningful value to a customer. Lead time for changes goes down when you skip code review and run fewer tests, which is fine right up until the moment your change failure rate goes up and you've paid for that speed with your stability. DORA's implicit argument is that these metrics are self-correcting against each other, and in theory that's true. The 2024 report defines elite performers as teams that deploy on demand: genuinely continuous deployment, sub-day lead times, a 5% change failure rate. That's a real bar. The next tier down, high performers, deploy somewhere between daily and weekly and carry a 20% change failure rate. One in five deployments causing a production problem, on a team that might be shipping every day. That qualifies as high performance. Make of that what you will.

Mean time to recovery is mostly a measure of your monitoring. Change failure rate is a quality metric, which matters, but it's defined in terms that made more sense before continuous deployment was the norm. Hotfixes and rollbacks are things you do when you're not rolling forward.

None of those metrics tell you whether your team is operating in a healthy way. They don't tell you whether the team can accurately forecast its own capacity, or whether it's burning out trying to hit a date that was never realistic, or whether a steady drip of unplanned work is quietly consuming the capacity that was supposed to go somewhere else. Predictability doesn't answer all of those questions either, but it's the leading indicator that tells you when to start asking them.


Raleigh Schickel is a VP/Director of Engineering with 15+ years of experience leading engineering organizations. He has been VP of Engineering at Stoplight (acquired by SmartBear), Director of Engineering at Nirvana Health, and spent 8+ years at uShip. He writes about engineering team health and is building a diagnostic platform for engineering leaders.