What Happened?
On Friday, October 21st, Dyn, a major DNS provider, started having trouble due to a DDoS attack. Many companies, including PagerDuty, Reddit, and Twitter, suffered significant downtime. Sumo Logic had a short blip of failures but stayed up, allowing our customers to keep using our service seamlessly for monitoring and troubleshooting within their organizations.
How did Sumo Logic bear the outage?
Several months ago, we suffered a DNS outage and had a postmortem that focused on being more resilient to such incidents. We decided to create a primary-secondary setup for DNS. After reading quite a bit about how this should work in theory, we implemented a solution with two providers: Neustar and Dyn. This setup saved us during today’s outage. I hope you can learn from our setup and make your service more resilient as well.
How is a primary-secondary DNS setup supposed to work?
- You maintain the DNS zone on the primary only. Any update to that zone is automatically replicated to the secondary via two methods: a push notification from the primary and a periodic pull from the secondary. The two providers stay in sync, and you do not have to worry about maintaining the zone in two places.
- Your registrar is configured with nameservers from both providers.
- Order does NOT matter.
- DNS Resolvers do not know which nameservers are primary and which are secondary. They just choose between all the configured nameservers.
- Most DNS Resolvers choose which name server to use based on latency of the prior responses.
- The rest of the DNS Resolvers choose at random.
- If you have 4 nameservers with 1 from one provider and 3 from another, the more simplistic DNS Resolvers will split traffic 1/4 to 3/4, whereas the ones that track latency will still hit the faster provider more often.
- When there is a problem contacting a nameserver, DNS Resolvers will pick another nameserver from the list until one works.
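To make the traffic split and failover behavior concrete, here is a minimal simulation sketch of the simpler kind of resolver, the one that picks at random and retries on failure. The hostnames and the `resolve` and `traffic_share` helpers are hypothetical, and latency-tracking resolvers are not modeled.

```python
import random

# Hypothetical nameserver hostnames: one from provider A, three from provider B.
PROVIDER_OF = {
    "ns1.provider-a.example": "A",
    "ns1.provider-b.example": "B",
    "ns2.provider-b.example": "B",
    "ns3.provider-b.example": "B",
}


def resolve(nameservers, down, rng):
    """Try nameservers in random order; return the first one that is up, or None."""
    candidates = list(nameservers)
    rng.shuffle(candidates)
    for ns in candidates:
        if ns not in down:
            return ns
    return None


def traffic_share(provider_of, down=frozenset(), queries=10_000, seed=0):
    """Fraction of answered queries that each provider ends up serving."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(queries):
        ns = resolve(provider_of, down, rng)
        if ns is not None:
            provider = provider_of[ns]
            counts[provider] = counts.get(provider, 0) + 1
    answered = sum(counts.values())
    return {p: c / answered for p, c in counts.items()}
```

Calling `traffic_share(PROVIDER_OF)` should give roughly a 1/4 to 3/4 split between the two providers. Passing all of provider B's nameservers as `down` sends every answered query to provider A, which is the failover behavior described above.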
How to set up a primary-secondary DNS?
- Sign up with two different companies that provide high-speed DNS services and offer a primary/secondary setup.
- My recommendations are NS1, Dyn, Neustar (UltraDNS), and Akamai.
- Currently, Amazon’s Route 53 does not provide zone transfer ability and therefore cannot support a primary/secondary setup. (You would have to change records in both providers and keep them in sync.)
- Slower providers will not take on as much traffic as fast ones, so you have to be aware of how fast the providers are for your customers.
- Configure one to be primary. This is the provider you use when you make changes to your DNS.
- Follow the primary provider’s and secondary provider’s instructions to set up the secondary provider.
- This usually involves whitelisting the secondary’s IPs at the primary, adding notifications at the primary, and telling the secondary which of the primary’s IPs to use for the zone transfer.
- Ensure that the secondary is syncing your zones with the primary. (Check on their console and try doing a dig @nameserver domain for the secondary’s nameservers.)
- Configure your registrar with both the primary’s and secondary’s name servers.
- We found out that the order does not matter at all.
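One way to automate the sync check from the steps above is to compare the SOA serial that each nameserver reports. The sketch below assumes the `dig` command is installed and that its `+short` SOA answer has the usual seven fields; the helper names are hypothetical, and the zone and nameserver names in the usage note are placeholders.

```python
import subprocess


def parse_soa_serial(dig_short_output):
    """Extract the serial from `dig +short <zone> SOA` output.

    Expected fields: mname rname serial refresh retry expire minimum.
    """
    fields = dig_short_output.split()
    if len(fields) < 7:
        raise ValueError(f"unexpected SOA answer: {dig_short_output!r}")
    return int(fields[2])


def query_soa_serial(zone, nameserver, timeout=5):
    """Ask one specific nameserver for the zone's SOA serial using dig."""
    result = subprocess.run(
        ["dig", "+short", f"@{nameserver}", zone, "SOA"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # dig exits non-zero when no server could be reached
    )
    return parse_soa_serial(result.stdout)


def in_sync(serials):
    """True when every nameserver reported the same serial."""
    return len(set(serials.values())) == 1
```

A usage sketch: build `serials = {ns: query_soa_serial("example.com", ns) for ns in nameservers}` over the primary's and secondary's nameservers, then check `in_sync(serials)`. A secondary that lags for longer than its refresh interval is worth investigating before you rely on it.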
Our nameserver setup at the registrar:
What happened during the outage?
We got paged at 8:53 AM for a DNS problem hitting service.sumologic.com. The pages came from our internal as well as external monitors. The on-call engineers ran a “dig” against all four of our nameservers and discovered that Dyn was down hard.
We knew that we had a primary/secondary DNS setup, but neither provider had experienced any outages since we set it up. We also knew that it would take DNS Resolvers some time to switch to the Neustar nameservers instead of the Dyn ones. Because our alarms had gone off, we posted a status page telling our customers that we were experiencing an incident with our DNS and asking them to let us know if they saw a problem.
Less than an hour later, our alarms stopped going off (although Dyn was still down). No Sumo Logic customers reached out to Support to let us know that they had issues.
Here is a graph of the traffic decrease for one of the Sumo Logic domains during the Dyn outage:
Here is a graph of Neustar (UltraDNS) pulling in more traffic during the outage:
This setup worked for Sumo Logic. We do not have control over DNS providers, but we can prevent their problems from affecting our customers. You can easily do the same.