From mlittman@cs.rutgers.edu Wed Dec 17 01:42:13 2003
From: mlittman@cs.rutgers.edu (Michael L. Littman)
Date: Tue, 16 Dec 2003 20:42:13 -0500 (EST)
Subject: [KP-seed] [KP] example root cause analysis
Message-ID: <200312170142.hBH1gDO29972@porthos.rutgers.edu>

Hi,

Here's an example of things getting pretty complicated. Is this sort
of seven-part failure common? Interestingly, 3 of the 7 failures seem
to involve "misconfiguration".

-Michael

--------------------------------------------------
Date: Tue, 16 Dec 2003 19:37:46 -0500
From: Adrienne Geralds
Organization: Telecommunications Division, RUCS
Subject: University DNS Issue

TD Network Operations apologizes for the problems and confusion
regarding the DNS service maintenance experienced since 15 December
2003. TD normally does not perform intrusive maintenance during the
beginning or end of a semester and had expected this work to be
transparent.

Over the past 36 hours, several problems have been reported. Because
a number of distinct but related problems have caused confusion,
please allow us to clarify all the events surrounding this
maintenance:

1. PROBLEM: Failure to announce the proper downtime

   TD inadvertently announced the scheduled outage for our DNS
   hardware upgrade for 12/16 when, in fact, the outage was performed
   on 12/15.

   OUTCOME: University community was not properly notified.

   CURRENT STATUS: No future maintenance will occur without
   verification that the announcement has been posted.

2. PROBLEM: DNS Server misconfiguration

   Due to a system misconfiguration, users residing on Rutgers
   private address space (172.16-32) between 5pm and 8pm on 12/15 did
   not have reverse IP address lookups.

   OUTCOME: Services (e.g., central email, ssh) relying on
   verification of reverse lookups would have rejected access from
   users in this address space.

   CURRENT STATUS: Problem identified and corrected by 8pm on 12/15.

3. PROBLEM: Hardware failure of ru-ufl.rutgers.edu

   Rutgers University maintains a redundant DNS server which is
   hosted by The University of Florida. This hardware failed early
   Monday morning.

   OUTCOME: 25% of our external DNS resolution capacity is
   unavailable. For properly configured servers, this should have no
   effect. This is a contributing factor to issue #6 below.

   CURRENT STATUS: We are currently working with engineers from The
   University of Florida to repair or replace the hardware and expect
   resolution within 3-5 business days.

4. PROBLEM: TD switch misconfiguration

   A duplex mismatch was identified on a switch serving the
   connection between the Internet and dns3.rutgers.edu.

   OUTCOME: This misconfiguration resulted in intermittent packet
   loss. Hosts directed to this server may have experienced DNS query
   timeouts. DNS servers for outside organizations interested in
   caching Rutgers DNS information may have misinterpreted the
   degraded network performance as Rutgers being unreachable. This
   resulted in a further 25% reduction of our external DNS resolution
   capacity. This was a contributing factor to issue #6 below.

   CURRENT STATUS: The switch was identified as problematic on 12/16
   at 4:15pm and immediately reconfigured.

5. PROBLEM: Expected DNS1 and DNS2 outage

   OUTCOME: As network cables were connected to replacement hardware,
   DNS1 and DNS2 may have experienced a momentary outage. Due to a
   lack of redundancy (see issues #3 & #4), this planned maintenance
   was no longer transparent. For a short period of time (23
   seconds), Rutgers external DNS service was completely unavailable.
   This was a significant contributing factor to issue #6.

   CURRENT STATUS: Servers are functioning normally.

6. PROBLEM: Users from external Internet Service Providers (ISP)
   could not access Rutgers services. (These users could access other
   Internet services.)

   OUTCOME: Many large ISPs cache DNS information for extended
   periods of time (e.g., 24 hours), so customers of those ISPs
   experienced failures for all name lookups in Rutgers.edu.

   CURRENT STATUS: TD has contacted a number of external service
   providers regarding the problem and advised them to clear their
   caches to correct the Rutgers information. Problems have been
   reported from Akamai, Verizon and Comcast. At this time, we have
   reports from users of Akamai and Verizon that this problem is
   resolved. Further, several Comcast customers have opened trouble
   tickets with their provider, and we will continue to monitor their
   progress.

7. PROBLEM: Users with external ISPs who have misconfigured their
   computers to use a Rutgers DNS server could not access any
   services

   OUTCOME: These users lost the ability to resolve all hostnames.

   CURRENT STATUS: The RUCS Helpdesk has been assisting users in
   correcting their misconfigured DNS settings.

Again, we apologize for this inconvenience. If you have any questions
or comments, please contact the TD-Network Operations Center at
732-445-7541 or at td-noc@td.rutgers.edu.

************************************************************************
Adrienne Geralds
University Computing Services
Associate Director of Telecommunications
Rutgers University
phone: (732) 445-7501
fax: (732) 445-5539

_______________________________________________
Know-plane mailing list
Know-plane@mailman.isi.edu
http://mailman.isi.edu/mailman/listinfo/know-plane