Engineering chaos: a guide to building a game day

One of the critical aspects of supporting large-scale infrastructure is the ability to respond to potentially harmful issues that can arise throughout a project’s lifecycle.

This includes responding to errors in your own application but also to those within the entire ecosystem.

To help prepare teams for these situations and validate the maturity of our customers’ incident response process, Ad Hoc runs a simulated disaster recovery drill in technical environments. This process, called a game day, tests the skills and processes that teams have established to deal with potential incidents and determines where they can make improvements.

The importance of a game day

A game day scenario goes beyond the scope of a tabletop scenario engagement. Tabletop scenarios focus more on discussing general role management and step-by-step processes that address specific incidents. Game days, on the other hand, test the resiliency of the entire incident response process. Because of this more in-depth exercise, game day drills provide important benefits.

Determine the capabilities and maturity of the incident response process

While a team may have an incident response process built out and written down, a game day scenario allows us to validate it against a real-world scenario. As practitioners of digital services, we want to ensure our response process evolves to meet the changing nature of the program. This may require:

  • Testing an infrastructure response
  • Identifying a security point of failure
  • Performing a usability test
  • Invoking a contingency event in response to situations

Once a process is established, validating the scenarios and running the required tests help determine whether the procedures would work in a real-world event.

If the results of the game day exercise aren’t successful, it gives teams time to make changes within their incident response plan. They can remove standardized responses that aren’t applicable to them and instead develop a strategy that incorporates existing capabilities and strengths, improves their current level of maturity, and increases their preparedness.

Identify weak points within the team

A game day may also identify individuals who need additional training in a certain area and others who can be labeled as subject matter experts (SMEs) for specific incidents that came up during the game day engagement. Observing how team members approach the incident response process, how they respond to chaos events, and how they handle communication between product stakeholders and team members helps ensure they have or develop the skills necessary to respond appropriately to real-world, time-sensitive incidents.

How to create a game day scenario

Find the right difficulty level for your team

The scenarios we create for game days are realistic situations that would occur outside the normal state of notifications and alerting. A key consideration for selecting these scenarios is understanding your team’s current level of incident response maturity. You don’t want to simply throw a malicious security incident their way. You want to ensure the scenarios give the team an opportunity to understand and properly remediate the situation. The team needs to be able to work through all the steps of remediation to thoroughly understand the process – not be faced with a disaster scenario that they must recover from.

Put the pieces on the board

The next step is to define the members of your chaos team – the people who will create chaos in the scenario – and those teammates who will be responding.

When selecting members of the chaos team, it’s important to ensure that they know your system, are familiar with the developed processes, and can determine where a potential risk may be. I recommend anyone who has this type of familiarity with the system on your project – whether they’re DevOps, UX researchers, or security engineers – act as an agent of chaos.

Decide how you want to make chaos

Scenarios can fall into two categories depending on how you choose to execute them and the level of difficulty your team is prepared for. You can choose to follow the simple kill methodology: a single thing has one cause and one effect. A good example of this would be something like, “I disabled your identity and access management (IAM), and now you can’t log in.”

The alternative is to invoke the domino methodology, where a sacrifice is placed upon the altar of Rube Goldberg and you summon the chaos lords of old. Disabling your IAM services can easily evolve into a “shut down IAM services, now no one can log in, and it kills all service accounts” situation. The difference is striking, and the trickle-down effects are entirely apparent.
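
One way to picture the difference between the two methodologies is as a walk over a dependency map. The sketch below is only an illustration, not a chaos tool: the component names and the DEPENDENTS map are hypothetical, loosely modeled on the IAM example above. A simple kill scopes the scenario to one cause and one chosen effect, while a domino scenario follows every downstream dependent.

```python
from collections import deque

# Hypothetical map: component -> components that break when it fails.
DEPENDENTS = {
    "iam": ["user-login", "service-accounts"],
    "service-accounts": ["deploy-pipeline", "batch-jobs"],
    "user-login": [],
    "deploy-pipeline": [],
    "batch-jobs": [],
}


def simple_kill(target, effect, dependents=DEPENDENTS):
    """One cause, one effect: the target plus a single direct dependent."""
    if effect not in dependents.get(target, []):
        raise ValueError(f"{effect!r} does not depend directly on {target!r}")
    return [target, effect]


def domino(target, dependents=DEPENDENTS):
    """Breadth-first walk listing everything that fails after the target."""
    affected, queue, seen = [], deque([target]), {target}
    while queue:
        current = queue.popleft()
        affected.append(current)
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return affected


print(simple_kill("iam", "user-login"))
print(domino("iam"))
```

Listing the domino chain before the event gives the chaos team a rough view of how far a scenario could spread, which helps when matching it to the team's difficulty level.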

Preparing the setup

This is one of the most overlooked parts of the game day. Having a well-prepared setup is a critical component to a successful engagement. Here’s a list of recommended teams to alert about the event:

  • Help desk and support teams: You have to let people know that you’re performing a game day and what environment you’ll be performing it in. Reach out to your organization’s help desk to give them a rundown of what to expect, including an overview of the scenarios and what alerting mechanisms you hope to see. If possible, schedule a brief meeting to discuss the planned events and answer any questions.

  • Project members and leadership: Share the planned scenario and expected results with relevant individuals and leadership to review and approve. This is especially important for anyone who has work scheduled within the same environment where the game day will occur. If project teams working in that environment aren’t made aware, the upcoming event could negatively affect them.

  • Your ecosystem: These people can include points of contact for the contract, data center employees, and cloud service providers. They all have a vested interest in the results of the game day; they’re also the primary target for the game day’s success. As technologists, it’s our responsibility to establish the maturity and success of our incident response process, and including people from the whole ecosystem to help define what the process is for the game day is well and good. Having them participate in the game day is even better. This gives team members a heightened sense of realism, because an interested party is looking into their incident. It also allows team members to connect with those they might not often interact with.
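
The notification list above can be tracked as a simple readiness check before the event. This is a minimal sketch under assumptions: the Audience class, its briefed and acknowledged fields, and the sample entries are hypothetical and do not describe any Ad Hoc tooling.

```python
from dataclasses import dataclass


@dataclass
class Audience:
    name: str
    briefed: bool = False       # told about the scenarios and the environment
    acknowledged: bool = False  # confirmed or approved the plan


def outstanding(audiences):
    """Names of audiences still missing a briefing or an acknowledgement."""
    return [a.name for a in audiences if not (a.briefed and a.acknowledged)]


# Hypothetical status a few days before the game day.
audiences = [
    Audience("help desk and support teams", briefed=True, acknowledged=True),
    Audience("project members and leadership", briefed=True),
    Audience("ecosystem contacts"),
]

blocked_on = outstanding(audiences)
print("ready" if not blocked_on else f"not ready, waiting on: {blocked_on}")
```

Treating an unacknowledged audience as a blocker keeps the game day from surprising anyone who has work scheduled in the same environment.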

Also consider the following items as you prepare:

  • Team preparation: Before your game day, define what the team’s obligations are and how they can ensure a successful game day. This involves giving the team the tools they need: make sure the team knows where associated documentation lives and how to reach specific points of contact for pre-defined situations (e.g., an alternate number to dial for help desk support, so they don’t call the main help desk and clog its lines with unnecessary tickets).

  • Understand the scope: Try to estimate the total time that it would take your team, based on previous experience, to resolve an issue of similar complexity to what you are preparing. Having a reference timetable for resolving an issue is key to ensuring that you don’t overshoot your minimum viable product (MVP) for scheduling a workable live session. In other words, don’t erase your entire infrastructure environment for a game day, because it could take an experienced team more than 2-3 hours to fix on a good day.

  • Remember Douglas Adams: Don’t panic! I cannot stress this enough. People who are panicked don’t make rational decisions. A panicked response can often cause a more chaotic situation than a logical one. This is something that you need to emphasize with your team – even though chaos is erupting and dragons are breaking out of your Azure infrastructure – in the end, it’s simply a game day. The primary reasons that we undertake game day scenarios are to instill within our teams an understanding of how to respond to incidents, build and exercise reactive muscle memory, and build their confidence in the process. Having people develop a form of PTSD from being part of game days is not an effective way to build a successful incident response program.
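
The scoping advice above can be turned into a quick check when choosing scenarios. The sketch below is illustrative only: the scenario names, the time estimates, the three-hour session, and the 1.5 safety factor are all assumptions, not recommended values.

```python
from datetime import timedelta


def fits_session(estimated_fix, session_length, safety_factor=1.5):
    """True if the padded fix estimate still fits inside the session."""
    return estimated_fix * safety_factor <= session_length


# Hypothetical session length and estimates drawn from past experience.
session = timedelta(hours=3)
scenarios = {
    "disable one IAM role": timedelta(minutes=45),
    "erase the whole environment": timedelta(hours=3),
}

for name, estimate in scenarios.items():
    verdict = "fits" if fits_session(estimate, session) else "too big"
    print(f"{name}: {verdict}")
```

Padding the estimate leaves room for the surprises a live session tends to produce; a scenario that only just fits on paper is usually a poor choice.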

Executing the event

When you’ve completed all your preparation and the scheduled time has come, it’s your time to become chaos incarnate. It falls upon your chaos engineers to enact the scenarios as the chaos masters for the game day and work with the stakeholders to make it a successful scenario.

You already know what your success metrics are; you’ve discussed with your stakeholders what constitutes a successful game day, and it falls to you to deliver an MVP. So what can you do to help establish an MVG (most viable game) session?

Work in tandem with your targets

Offer supplemental support to your team when they’ve hit a brick wall or they’re not certain where to go. If you want to take part in the scenario, you can offer hints toward the next steps; otherwise, have them reach out to one of your stakeholders. If you chose a stakeholder to work with before the scenario, give them a heads-up about what could happen and what result you expect from it.

Expect the unexpected

Remember the most beneficial of maxims from Mike Tyson: “Everybody has a plan until they’re punched in the mouth.” You can try to prepare for possible scenarios, but don’t try to plan for every one because you’d be stuck in this stage indefinitely. Your team will surprise you. This is good.

The blameless postmortem

“The cost of failure is education.”

– Devin Carraway, quoted in Google’s Site Reliability Engineering book

One of the founding tenets of a successful game day is the blameless postmortem. After the event, the team meets with a third-party mediator who goes over the results and presents the postmortem in a “human-centered” way. Focusing on what happened and identifying the contributing causes – instead of assigning responsibility or blame to a person or team – helps the team improve both its incident response process and its overall security posture. A blameless culture of reviewing incident responses allows actual incidents to be brought to attention sooner, and team members become more open about concerns without fear of finger-pointing or reprisals. As a result, overall security is improved.

Always be ready

No system’s incident response plan is ever fully complete. It is a living, breathing aspect of an application’s life cycle – the white blood cells that battle against “infections” or “illnesses” that occur within a program. By maintaining a healthy incident response process, instilling an understanding of that process, and testing it regularly, your team will gradually improve and be ready when an incident does occur. Having a game day doesn’t prepare you for every type of incident that might occur, but it does prepare you to have a team that is ready to respond to it.