Validating business rules and building confidence through tests


It can be challenging for organizations to adapt, migrate, modernize, or otherwise change legacy systems for new uses or business requirements. So-called digital transformation or legacy modernization efforts, while motivated by concerns such as finances, new technology, or shifting priorities, can be risky endeavors from a logistical perspective, and are often dominated by the needs of operations. I want to highlight one example of how we at Ad Hoc helped the Centers for Medicare & Medicaid Services (CMS), the agency that administers HealthCare.gov, navigate such a change.

In early 2016, CMS entrusted Ad Hoc with the task of building a new interface for the Plan Compare product on HealthCare.gov, where people can browse, compare, and enroll in insurance plans.

In this part of the site, after filling out the tax credit application, you can get a list of the health and dental plans available to you. You can then sort by premium or deductible, filter plans by type (like HMO or PPO) or by whether they cover your doctor or prescription drugs, and see each plan's benefit details.

The project, dubbed PC 2.0, had a number of unique challenges, and we had to account for them in the tests we designed and executed for the software.

Business rules everywhere

Behind HealthCare.gov lurks a large set of business rules that determine exactly how a family's tax credit should be calculated, or which plans they are eligible for. Because the business rules would be, in effect, duplicated across PC 2.0 and other systems on HealthCare.gov, we needed to make certain that PC 2.0's implementation of the rules would produce consistent results.

Normally, organizations try to prevent business logic from being implemented in more than one place. Duplication could lead to rules getting out of sync with each other when they change, or to discrepancies in how the different systems interpret and implement them.

In this case, however, it made sense to take the risk of duplicating business logic. CMS was interested in building PC 2.0 as a separate system because Plan Compare had been a main source of slowness and bottlenecking for users trying to sign up. It was also difficult to make and deploy changes to Plan Compare, and CMS wanted a new, more innovative UI and UX for it.

PC 2.0 would integrate with the rest of HealthCare.gov in a fashion somewhat like microservices, which are a way of building modern web applications as a combination of several discrete pieces.

Calculations in sync

How, then, was CMS to ensure that PC 2.0 converged on the same answers as the rest of HealthCare.gov?

Since PC 2.0 would be a new, stand-alone service integrated into the HealthCare.gov user experience, the calculations that result from these business rules always had to be in sync. The app had to calculate each value correctly: the tax credit, the plan's premium, the amount of the deductible. That's a normal part of any app of this kind. But it also had to produce the exact same values as those calculated on other parts of the site built on different applications.

For example, if a user were shown one number for their monthly premium while on PC 2.0, and a different number for their premium on the final enrollment page, which was part of a different application inside HealthCare.gov, that would obviously be an intolerable situation.

Replacing legacy functionality

As the team charged with implementing PC 2.0, we had another challenge. A new unproven system that’s replacing a functioning (but problematic) legacy system lacks legitimacy with stakeholders. We had to prove that the approach we were taking was viable, and develop confidence with our business owners at CMS that this was worth the risk.

The approach we took, both to ensure consistency among systems implementing the same business rules and to foster trust and confidence in our team and the new system we were building, was to measure progress through tests: specifically, through a corpus of system-neutral cases that modeled households that might use HealthCare.gov and the results they'd expect to get, regardless of the underlying system.

Tests are a staple part of software engineering efforts of any size, from unit tests that validate components, to integration tests that exercise internal APIs, to end-to-end tests that mimic actual user requests through the whole system.

Our test cases were like unit tests, in that they consisted of a set of inputs with a set of expected outputs. But they were also like end-to-end tests, in that they were intended to be put through the main APIs of both PC 2.0 and Plan Compare, treating them like black boxes, in order to calculate the values and to check them against the legacy system.

In addition, the tests themselves weren't coded in a specific language in a folder of one system's source code tree, like most unit tests. Instead, they were described in a platform-independent format, JSON, and shared between the PC 2.0 and legacy Plan Compare teams. This way, both teams could adapt the cases to their own APIs, but take advantage of a common set of cases.

The test cases represented the ideal output of the business rules of HealthCare.gov. For example, the tax credit, known as APTC, is a function of a household's size, location in the U.S., and some details about each member of the household. All else being equal, a given household should always get the same APTC, which is a specific dollar amount.

Behind this calculation lie the business rules, as well as other information, such as details about the health plans on the marketplace (this is because APTC is also a function of the "second lowest cost Silver plan", or SLCSP, for a rating area, which is determined from the plan data). Any system that conforms to the business rules of HealthCare.gov, and has implemented them correctly, should be able to plug the household into its APTC API and get the expected dollar amount.
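To make the shape of such a calculation concrete, here is a deliberately simplified sketch of an APTC-style computation: the credit covers the gap between the benchmark (SLCSP) premium and the household's expected contribution, floored at zero. The function name and parameters are hypothetical, and the real rules also involve poverty-level tables and per-member details that are omitted here.

```python
def monthly_aptc(slcsp_premium: float, annual_income: float,
                 applicable_pct: float) -> float:
    """Simplified monthly Advance Premium Tax Credit.

    The credit is the benchmark (SLCSP) premium minus the household's
    expected monthly contribution, never less than zero. In the real
    rules, applicable_pct would be looked up from the household's income
    as a share of the federal poverty level; here it is passed in directly.
    """
    expected_monthly_contribution = annual_income * applicable_pct / 12
    return max(0.0, slcsp_premium - expected_monthly_contribution)
```

Given the same inputs, any correct implementation of the rules must produce the same dollar amount, which is exactly what the shared test cases checked.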

Building the PC 2.0 Testing Tool

To facilitate the creation of the test cases, we built a simple tool that allowed the subject matter experts (our government partners at CMS) to input various test scenarios and then encode them as JSON, according to a schema that both the PC 2.0 and Plan Compare teams could digest in their own testing systems.

Each test case would have a unique ID, so that it could be tracked between the two systems. Test cases could also be grouped by type: in other words, by what kind of output the test case expected to see, be it SLCSP, plan eligibility, or the amount of a monthly premium.
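As an illustration, a shared case might have looked something like the following. This is a hypothetical shape, not the actual schema; the field names and values are invented for the example.

```json
{
  "id": "case-0042",
  "type": "aptc",
  "inputs": {
    "state": "NC",
    "household_income": 42000,
    "members": [
      { "age": 34, "uses_tobacco": false },
      { "age": 33, "uses_tobacco": false },
      { "age": 4, "uses_tobacco": false }
    ]
  },
  "expected": { "monthly_aptc": 312.0 }
}
```

Because the format says nothing about either implementation, each team could translate `inputs` into a call against its own API and compare the response to `expected`.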

Once the corpus of hundreds of test cases was built, we could then run each test through both the PC 2.0 API and the legacy Plan Compare API, and compare results. We made it so that the tests could be easily kicked off after making a change to either system.

We produced a web-based report of the results of running the tests for all project stakeholders to see. A top-level summary would present the percentage of tests that "passed", that is, how many cases had the same result for the same input on both PC 2.0 and legacy Plan Compare. You could also see details about each individual test case for the run. When PC 2.0 and legacy Plan Compare differed, we considered that particular case "failed"; it invited further exploration by PC 2.0 developers, and future changes to the system would try to make it pass.

Each time the tests were run, the report generator would append a summary and details to the runs that had gone before. Over time, we could see the progression of how well we were doing relative to the expected baseline of the legacy Plan Compare. Starting out, only 6% of test cases passed. Weeks later, as we dug deeper into failing cases and came to understand the business rules better, we improved to 52%.
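The bookkeeping behind that trend line could be as simple as the following sketch (helper names are hypothetical): compute the pass rate of a run, then append it to a history that earlier runs are never overwritten in.

```python
def pass_rate(results):
    """Percentage of cases where the new system matched the legacy baseline."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["passed"])
    return 100.0 * passed / len(results)

def append_run(history, results):
    """Append this run's summary, preserving earlier runs for the trend line."""
    history.append({"run": len(history) + 1, "pass_rate": pass_rate(results)})
    return history
```

Preserving every run, rather than only the latest, is what let stakeholders watch the number climb from 6% toward 100%, and spot regressions when it dipped.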

As time went on, we’d make marginal improvements, making changes that made just one or two test cases pass at a time. Other times, we’d make more systemic changes, and fix dozens at once. There were a few occasions where we regressed: even though we fixed a few cases, the changes would inadvertently cause other previously passing cases to suddenly fail. This was not unexpected, due to the intricate and overlapping nature of business rules. It’s not always obvious how different rules will interact, and this highlights why you need stable test cases and tracking of test statistics over time.

Eventually, after a few months, PC 2.0 was passing 100% of all test cases, matching legacy Plan Compare results exactly. To the degree that the test cases were representative of the household scenarios we expected to see on HealthCare.gov, and of the business rules exercised by them, this was as good a validation of the correctness of PC 2.0 as we could hope to get. No amount of traditional requirements gathering and assertions about the implementation of said requirements could produce the same level of confidence. The corpus of tests validated the software, by definition.

By building an independent corpus of platform-neutral test cases with subject matter experts, and a system for running and reporting on them over time, we left the decision of when and whether to switch over to PC 2.0 from legacy Plan Compare up to other considerations, such as operational readiness, rather than whether the system would produce correct results. Given the prevalence of legacy migration projects on the government's horizon, test cases are one weapon in the arsenal of approaches for mitigating risk.