Crisis management

Outages, viruses, billing errors, and more. Be prepared today for things to go wrong with your product tomorrow. Sunny days are always the best days to buy an umbrella.


What to do when things go wrong

This is an excerpt from my upcoming Developer Marketing and Developer Relations book. Be sure to subscribe to this newsletter to be notified when pre-sales are available.

Outages, viruses, billing errors, and more. When things go wrong with developer-focused products, they affect not only your customers but also your customers’ customers, and often many more people downstream. In the early days of AWS, a single AWS outage could take down half the Internet. Even today, a simple DNS error can have catastrophic effects on global commerce. A botched rollout of a patch or update can bring down airlines, hotels, and e-commerce sites around the world.

Developers have outsized importance in the world, and their mistakes carry outsized consequences.

You need to be prepared today for things to go wrong with your product tomorrow. Sunny days are always the best days to buy an umbrella.

Every crisis is also an opportunity to build trust and credibility, and trust and credibility are the foundation of a strong developer brand.

Incident management 

The first thing to iron out is your incident management process. Say a Site Reliability Engineer detects an incident in production. What should she do next? Who should she alert? Most engineering teams have defined pager-duty responsibilities. But if an outage has consequences beyond engineering, shouldn’t others be notified too?

Here’s a checklist I created for an incident management process at a past company:

  1. Page engineers on duty
  2. Start a Slack channel (“incident-dateofincident-description”)
  3. Invite SREs, engineers, marketing, legal, comms, and the exec team
  4. Describe the incident in as much detail as necessary
  5. Attach a SEV number to the incident (e.g., “SEV1”, “SEV2”, etc.). The SEV number indicates the seriousness of the incident. SEV1 is a critical issue affecting a significant number of users in a production environment. SEV2 is a major issue affecting a subset of users in a production environment. SEV3 is a moderate incident causing errors or minor problems for a small number of users.
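The SEV scale and the channel-naming convention from the checklist are easy to encode, so every responder applies them the same way. What follows is a minimal Python sketch, not a prescribed tool: the `Severity` enum and the `incident_channel_name` and `incident_summary` helpers are hypothetical. The slug rules assume Slack's channel-name constraints (lowercase letters, numbers, and hyphens, at most 80 characters).

```python
import re
from datetime import date
from enum import Enum


class Severity(Enum):
    """Hypothetical encoding of the SEV scale from the checklist."""
    SEV1 = "Critical issue affecting a significant number of users in production"
    SEV2 = "Major issue affecting a subset of users in production"
    SEV3 = "Moderate incident causing errors or minor problems for a small number of users"


# Assumption: Slack limits channel names to 80 characters.
MAX_CHANNEL_LENGTH = 80


def incident_channel_name(day: date, description: str) -> str:
    """Build a name following the "incident-dateofincident-description" convention."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"incident-{day.isoformat()}-{slug}"
    # Truncate to the assumed limit and drop a trailing hyphen left by the cut.
    return name[:MAX_CHANNEL_LENGTH].rstrip("-")


def incident_summary(severity: Severity, description: str) -> str:
    """First message for the new channel: SEV label, description, and what the SEV means."""
    return f"[{severity.name}] {description} ({severity.value})"


if __name__ == "__main__":
    channel = incident_channel_name(date(2024, 3, 5), "Checkout API returning 500s")
    print(channel)  # incident-2024-03-05-checkout-api-returning-500s
    print(incident_summary(Severity.SEV1, "Checkout API returning 500s"))
```

Keeping the SEV definitions in one place means the description people see in the channel always matches the definition everyone agreed on.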

Work cross-organizationally

This kicks off a cross-organizational process. Engineering is, of course, in the driver's seat, but marketing has its own process to follow:

  • Someone on the marketing team must have on-call responsibilities for SEV1 issues.
  • The marketing team member who is on-call should have the account credentials for all social media accounts and the ability to post to the blog. Make sure you account for vacation coverage. In a larger organization with a dedicated social media team, the marketing manager should have a way to reach the social media manager.
  • Outline a process for each SEV number. For example, for a SEV1, post a simple holding message to social media; Twitter is usually sufficient (“We are experiencing technical difficulties. We are investigating and will report back soon with our findings.”). Work with your engineering leadership to quickly determine what, if anything, to say publicly.
  • Decide whether all automatic and scheduled social media posts should be paused depending on the SEV number.
  • In coordination with your engineering leadership, notify the Slack channel of actions taken (“I’ve posted to Twitter. I’ve paused all other social media posts across all channels.”)
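The per-SEV marketing process can likewise live in a small lookup table that the on-call marketer consults. In this Python sketch, only the SEV1 holding statement and the “paused all other posts” action come from the process above. The `MarketingPlaybook` type, the `PLAYBOOKS` table, the `slack_update` helper, and the SEV2/SEV3 choices are hypothetical placeholders to replace with whatever your team agrees on with engineering leadership.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarketingPlaybook:
    """Hypothetical per-severity marketing actions."""
    public_statement: Optional[str]  # holding message for social media, or None to stay quiet
    pause_scheduled_posts: bool      # pause automatic and scheduled social posts?


HOLDING_STATEMENT = (
    "We are experiencing technical difficulties. "
    "We are investigating and will report back soon with our findings."
)

# Keys are SEV labels. The SEV2 and SEV3 entries are illustrative assumptions.
PLAYBOOKS = {
    "SEV1": MarketingPlaybook(public_statement=HOLDING_STATEMENT, pause_scheduled_posts=True),
    "SEV2": MarketingPlaybook(public_statement=None, pause_scheduled_posts=True),
    "SEV3": MarketingPlaybook(public_statement=None, pause_scheduled_posts=False),
}


def slack_update(sev: str) -> str:
    """Summarize the actions taken, for posting back to the incident channel."""
    playbook = PLAYBOOKS[sev]
    actions = []
    if playbook.public_statement:
        actions.append("I've posted a holding statement to Twitter.")
    if playbook.pause_scheduled_posts:
        actions.append("I've paused all other social media posts across all channels.")
    # An empty action list means this SEV level calls for no public response.
    return " ".join(actions) or "No public action taken; monitoring the incident channel."


if __name__ == "__main__":
    for level in PLAYBOOKS:
        print(level, "->", slack_update(level))
```

Writing the playbook down ahead of time means the on-call marketer makes no judgment calls under pressure beyond confirming the SEV number with engineering.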

From there, monitor the engineering working group for findings and resolutions. If necessary, work with engineering to post updates to social media.

This handles the initial crisis. Now you need to deal with the consequences.

Communicating directly and succinctly

Outline a plan with your Customer Success team for when and how to share the resolution and post-mortems with customers. Depending on the SEV number, you may want to publish a post-mortem blog post and/or notify customers with billing remediation instructions. Enterprise customers may have reliability guarantees in their contracts, and Customer Success may need help explaining how the incident affects those commitments.

In all communication remember these important tips:

  • Remain calm and matter-of-fact. At all times, adopt a tone that reassures customers that you are aware of the problem and working toward a resolution. This also applies to any meetings that marketing attends. No one benefits from marketing panicking; engineers are working toward a resolution. Offer the gift of your serenity. If you are concerned about the speed of the resolution or how seriously the problem is being taken, raise it privately with your exec team.
  • Be truthful. Don’t attempt to cover up the issue or, perhaps even worse, obfuscate the issue with florid marketing or business language. As with all developer-focused communication, be direct and honest.
  • Support post-mortems. Mistakes are opportunities to learn and build trust. Publishing a detailed post-mortem of the problem gives other engineers (and your customers) confidence that you have learned from your mistake and taken steps to prevent it from happening again. Developers may be exacting and demanding, but they are also generally understanding of bugs and errors. Every chicken gets a turn in the fryer.
  • Attend internal post-mortems. After the incident has subsided, attend as many engineering meetings about the issue as you can. They are great opportunities to learn from your engineering team and to understand how your product is built and maintained.

Summary

Crises are opportunities. Opportunities to find the holes in your organization's communication processes. Opportunities to understand how much your product matters to customers. Opportunities to build your brand through trust and great communication.

Crises will happen no matter how well you prepare for them.

Use the time now to build your crisis management plan. Every developer-focused product should have one, and today is a good day to build yours.

Subscribe to notifications about my upcoming Developer Marketing and Developer Relations book.