Doing DNS Monitoring Right: The AT&T DNS Outage
The AT&T domain name server (DNS) outage of Aug. 15, 2012 exemplifies why a “non-cache” method of website monitoring matters for mission-critical websites. First, a bit of review. The most common, basic form of website monitoring uses a synthetic browser (not an actual browser) that connects to the target server with an HTTP request. Through that request, the monitor checks a number of server-focused items: whether the target server is available, how long it takes to load the website’s HTML file from the server, and whether specific keywords appear within the HTML file.
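To make that concrete, here is a minimal sketch of such a synthetic HTTP check in Python. The URL, keyword, and timeout are purely illustrative assumptions; real monitoring services run checks like this on a schedule from many locations.

# A minimal sketch of a basic synthetic HTTP check (hypothetical URL and keyword).
import time
import urllib.request

URL = "https://www.example.com/"   # hypothetical monitored page
KEYWORD = "Welcome"                # hypothetical keyword expected in the HTML

def synthetic_http_check(url: str, keyword: str, timeout: float = 10.0) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            html = resp.read().decode("utf-8", errors="replace")
            elapsed = time.monotonic() - start
            status = resp.status
    except Exception as exc:
        # Server unreachable, DNS failure, timeout, etc.
        print(f"DOWN: {url} -- {exc}")
        return
    keyword_found = keyword in html
    print(f"UP: {url} status={status} load_time={elapsed:.2f}s keyword_found={keyword_found}")

if __name__ == "__main__":
    synthetic_http_check(URL, KEYWORD)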
To cache or not to cache – that is the question
However, what is not generally well known about basic synthetic HTTP monitoring is that website monitoring companies have a choice: use a “cache” or a “non-cache” methodology. That choice directly determines a monitoring service’s ability to detect issues on secondary DNS servers, such as the AT&T DNS outage that occurred on Aug. 15, 2012. A cache-based method is far simpler for the monitoring business to implement and costs less to set up and administer. In fact, most of the low-cost, “basic” uptime monitoring services use a cache method.
I’ll take the non-cache, thanks
However, the dirty little secret is that the cache method of monitoring is not as accurate (nor, in the long run, as cost-effective) as a non-cache solution. Why? The simple reason is that cache-based methods won’t even detect a secondary DNS issue.
The more complex reason takes longer to explain, but it gets at the heart of what good monitoring is all about – avoiding downtime.
Specifically, non-cache monitoring is more cost-effective because when an issue like the AT&T DNS outage inevitably occurs – as with any website error condition – it is the total Time-to-Repair (TTR) that determines the loss due to downtime. In other words, the longer it takes to detect, diagnose, and repair an error, the worse its impact (1). Conversely, the more a monitoring solution shortens TTR, the more the loss is reduced (or avoided entirely).
How to Effectively Monitor for the Next AT&T DNS Outage
In the case of the AT&T DNS outage, several key factors determine Time-to-Repair:
– Error detection method: Use a monitoring solution that uses a non-cache method, resolving DNS queries all the way from the root name servers with each monitoring instance (see the DNS look-up sketch after this list). A cache-based service caches DNS responses and therefore will not detect a secondary DNS issue at all, or may take days or weeks to detect it.
– Frequency of monitoring: Use a faster non-cache monitoring frequency, such as every minute rather than once per hour. The sooner the non-cache monitoring solution detects a failing DNS service and alerts the affected website’s administrator, the sooner a switch can be made to a DNS failover provider.
– Time-to-Live (TTL) setting: The TTL set by DNS administrators controls how long secondary and caching DNS servers keep a domain’s records before re-querying the primary authoritative name server, so the smaller the TTL, the sooner a change (such as a failover) takes effect. The TTL is typically set to 86,400 seconds (1 day) or more; in disaster recovery planning it can be set as low as 300 seconds. However, the lower the setting, the higher the load on the authoritative domain name server. (The DNS look-up sketch after this list also prints the TTL the authoritative server hands out.)
– Diagnostics: Use a monitoring solution that provides diagnostic information, such as an automatic traceroute captured at the time the DNS problem is detected (most basic monitoring services provide no diagnostic information at all); a traceroute sketch follows this list.
– Repair: Keep the monitoring solution running during the error condition to further pinpoint the issue, and send the monitored results to your DNS provider. You can also run free manual DNS traceroutes here (select Trace Style “DNS”) to verify the issue as needed.
– Prevent: Use a monitoring solution that lets you view the details of a DNS look-up (such as actual-browser monitoring) so you can see “soft errors” – slow-down trends and intermittent issues – and take action before a “soft error” becomes a “hard error” such as customer-facing downtime.
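For illustration, here is a minimal sketch of a non-cache style DNS check using the dnspython library (dnspython 2.x; this library choice and the domain name are assumptions for the example, not how any particular monitoring vendor implements it). It finds the zone’s authoritative name servers, queries each one directly for every check – bypassing any caching resolver in between – and prints the TTL the authoritative server hands out.

# A minimal sketch of a non-cached (authoritative) DNS check with dnspython.
import dns.exception
import dns.message
import dns.query
import dns.resolver

DOMAIN = "example.com"  # hypothetical monitored domain

def authoritative_lookup(domain: str) -> None:
    # Step 1: find the zone's authoritative name servers (NS records).
    for ns in dns.resolver.resolve(domain, "NS"):
        ns_host = str(ns.target)
        # Step 2: resolve the name server's own IP address.
        ns_ip = str(dns.resolver.resolve(ns_host, "A")[0])
        # Step 3: query the authoritative server directly, bypassing any
        # recursive / caching resolver in between.
        query = dns.message.make_query(domain, "A")
        try:
            response = dns.query.udp(query, ns_ip, timeout=5)
        except dns.exception.Timeout:
            print(f"TIMEOUT from {ns_host} ({ns_ip}) -- possible DNS outage")
            continue
        if not response.answer:
            print(f"NO ANSWER from {ns_host} ({ns_ip})")
            continue
        for rrset in response.answer:
            # rrset.ttl is the TTL the authoritative server hands out; it
            # controls how long secondary / caching servers keep the record.
            addresses = ", ".join(r.address for r in rrset)
            print(f"{ns_host}: {rrset.name} TTL={rrset.ttl} -> {addresses}")

if __name__ == "__main__":
    authoritative_lookup(DOMAIN)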
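And a short sketch of the automatic diagnostic idea mentioned above: capturing a traceroute the moment a DNS check fails, so the output can be attached to the alert and sent to your DNS provider. It assumes the system traceroute command is installed (on Windows the equivalent is tracert), and the name server host is hypothetical.

# A minimal sketch of capturing a traceroute as a diagnostic snapshot.
import subprocess

def traceroute_snapshot(host: str) -> str:
    # Run the system traceroute and return its output for the alert/report.
    result = subprocess.run(
        ["traceroute", "-n", host],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout

if __name__ == "__main__":
    print(traceroute_snapshot("ns1.example.com"))  # hypothetical failing name server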
(1) Organizations that participated in a September 2011 TRAC Research study identified the amount of time spent troubleshooting performance issues as their top challenge, with “on average, over a full work-week of man-hours (46.2 hours) spent in war room situations each month.”