Zero-Downtime Migration: What 92% Test Coverage Means.

Author: Philipp Eiselt
Topic: Platform Delivery, ITSM Migration
Published: February 2026
Read time: 10 min

When we reported 92% test coverage in our steering committee, the response was almost always the same: a nod, a tick against the relevant line, and a move to the next agenda point. The number sounded good. But very few people in that room understood what it actually meant, or more importantly, what it did not mean.

The migration context.

The project was a full replacement of the IT service management platform (ServiceNow out, Jira Service Management in) for an organisation running 24/7 operations across multiple sites. The existing ServiceNow instance had been live for several years and had accumulated significant customisation: custom workflows, integrations with ERP and monitoring systems, and a large backlog of historical ticket data that business continuity requirements meant we could not simply abandon.

The constraint that made this hard was that we could not take the service desk offline. IT operations does not stop for a migration. Engineers needed to be able to log incidents, escalate, and track resolutions throughout go-live weekend. Any gap in that capability was not a technical failure; it was a business continuity failure. The go-live plan had to account for a live environment from hour one.

How we built the 92%.

Test coverage, on a migration like this, is not the same thing as unit test coverage in a software development context. We were not measuring lines of code. We were measuring scenarios. Specifically: of all the things this platform is expected to do in live operation, what proportion of them have we verified will work correctly in the new environment before we switch?

We started by building a scenario inventory from three sources: the existing ServiceNow workflow documentation (incomplete, as it always is), a series of working sessions with the service desk team and key users, and a review of the last twelve months of ticket data to identify volume patterns and edge cases. That gave us a list of roughly 340 scenarios, ranging from "engineer logs a P1 incident via the portal" to "monitoring system auto-creates a ticket and routes it to the correct assignment group via an API integration."

We then categorised each scenario by two dimensions: criticality (what breaks if this fails?) and complexity (how many system components are involved?). High criticality, high complexity scenarios got the most testing effort and multiple test cycles. Low criticality, low complexity scenarios got single-pass verification. The 92% figure meant that 92% of scenarios had been verified at least to the standard appropriate for their risk category. The remaining 8% were either edge cases with manual fallback procedures, or integration scenarios that could only be fully tested in production.
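The risk-weighted verification standard described above can be sketched in code. This is an illustrative model, not tooling from the project: the scenario names, the two-level scale, and the three-cycle requirement for the highest-risk category are assumptions chosen to match the description, not documented values.

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    LOW = 1
    HIGH = 2

@dataclass
class Scenario:
    name: str
    criticality: Level   # what breaks if this fails?
    complexity: Level    # how many system components are involved?
    cycles_passed: int   # verified test cycles completed so far

    @property
    def required_cycles(self) -> int:
        # High-criticality, high-complexity scenarios get multiple test
        # cycles; everything else gets single-pass verification.
        # (3 is an illustrative number, not from the article.)
        if self.criticality is Level.HIGH and self.complexity is Level.HIGH:
            return 3
        return 1

    @property
    def verified(self) -> bool:
        return self.cycles_passed >= self.required_cycles

def coverage(inventory: list[Scenario]) -> float:
    """Share of scenarios verified to the standard for their risk category."""
    return sum(1 for s in inventory if s.verified) / len(inventory)
```

The point of the model is that "coverage" here counts a scenario only once it meets the bar appropriate to its risk category, so a high-risk scenario with a single passing cycle still counts against the number.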

What happened on go-live day.

The cutover ran across a weekend. We went live on Sunday morning with a phased approach: read-only access to the old system for 72 hours while the new system took all new ticket creation. The first eight hours were the highest risk window: we had teams standing by on every major integration point, with pre-agreed rollback triggers if specific failure thresholds were crossed.
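The pre-agreed rollback triggers can be sketched as a simple threshold check that the standby teams evaluate against live metrics. The metric names and threshold values below are hypothetical; the article does not state which signals or limits were actually used.

```python
# Hypothetical rollback thresholds, agreed before cutover.
# None of these names or values are from the actual go-live plan.
ROLLBACK_TRIGGERS = {
    "ticket_creation_failure_rate": 0.05,  # >5% of new tickets failing
    "integration_error_rate": 0.10,        # >10% of integration calls failing
    "p1_queue_untouched_minutes": 30.0,    # P1 queue unworked for >30 min
}

def crossed_triggers(observed: dict[str, float]) -> list[str]:
    """Return the names of any pre-agreed triggers whose thresholds
    the observed metrics have crossed; metrics not reported count as 0."""
    return [name for name, limit in ROLLBACK_TRIGGERS.items()
            if observed.get(name, 0.0) > limit]
```

Writing the triggers down as explicit thresholds before the cutover is the point: it turns a high-pressure judgment call during the risk window into a pre-agreed mechanical check.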

Three issues emerged in the first 48 hours. Two were in the 8% we had flagged: a monitoring integration that behaved differently under real production load than in our test environment, and a custom escalation workflow built on ServiceNow-specific logic that did not translate directly. Both had documented workarounds. The third was genuinely unexpected: a browser compatibility issue with the new portal that affected users on an older version of a specific internal browser. It was minor, but it had not appeared in any of our test scenarios because we had not tested against that browser version.

Overall: zero major service interruptions, no data loss, no P1 incidents attributable to the migration. The ITSM error rate dropped 20% within the first month as the new platform's routing logic outperformed the customised ServiceNow workflows. By any measure that mattered, the go-live was successful.

What the number actually tells you.

The 92% figure was not a guarantee. It was a structured argument that we had done the right work in the right places, and that the residual risk was understood and manageable. That is a completely different claim from "everything will work." The value of the coverage number was not the number itself; it was the process of building it. The scenario inventory forced a conversation between the technical team and the business about what actually had to work, and that conversation surfaced assumptions that would otherwise have remained invisible until go-live.

If you are running a migration and someone asks you what your test coverage is, the right answer is not a percentage. The right answer starts with: here are the scenarios we defined as critical, here is how we tested them, and here is what we have planned for the things we could not test in advance. That answer takes longer to give. It is also the only one that is honest.

Philipp Eiselt

Independent consultant in IT Portfolio Management, PMO & Governance, and Digital Transformation. Based in APAC, working globally.
