All too often you hear of a system admin being woken out of bed because the whole network is down amid a flurry of user complaints, only to discover that the hard drive on the domain controller is full. There is a huge push in the industry to tighten the security footprint of our datacenters and services because a hack could cost millions in damages and lost productivity. However, more often than not it’s not a security event that causes the most damage in terms of productivity and dollars; it’s events like the one at the start of this article. Why? There are certainly a number of reasons, and depending on the organization they could look radically different, but here is what my peers and I have seen across our careers.
- Poor Physical Maintenance of Hardware: Poor physical care can lead to fan inlets and heatsinks becoming clogged with debris. When this happens the device’s cooling will not work as designed and the equipment will run hotter, which decreases both the lifetime and the performance of the device. The CPU will generally throttle itself to lower its temperature, while physical components such as capacitors or resistors could fail completely. At a bare minimum, do a datacenter walkthrough once per month to identify these types of issues.
- Lack of Monitoring on Devices: After 20+ years in IT, I still sit in a meeting at least once a month where a system has failed and nobody knew until users alerted the helpdesk. This is bad for a multitude of reasons, but the primary one is the loss of production time to the business. At a bare minimum you should monitor CPU usage, RAM usage, storage usage, network usage, and application services. If your company does not have a solution, there are a number of open source options such as Nagios.
- Poor Monitoring of the Environment: I have heard everything from “the core switch is flooded with water” to “the servers are melting down.” These are generally environmental events such as failed cooling, poor planning for new hardware, or other space-related issues. Environmental events can be some of the most destructive and, as in the case of water, can permanently harm the equipment. A good environmental monitor such as APC EcoStruxure could save your equipment or provide early warning in these types of events.
- UPS Batteries: UPS batteries, like car batteries, lose charge over time and are constantly monitored and recharged by an internal charge controller. Over time this cycle degrades the lifespan of the battery and can prevent it from functioning for as long as intended, or at all. Some battery manufacturers require distilled water to be added at a specific interval to maintain the battery. There is no failure worse than needing the UPS and having it fail to take the load. Batteries and UPSs should be inspected on a quarterly basis to ensure both the functionality and the safety of the batteries.
- Transfer Switches: One of the most critical pieces of hardware in the datacenter during a power outage or interruption is the transfer switch. There are generally multiple automatic transfer switches (ATS) in the datacenter. Usually, at a bare minimum, there is an ATS to switch the datacenter load to UPS batteries. There is usually another ATS to switch the load from battery to longer-running generators that run on fuel. Transfer switches and backup power infrastructure should be tested at a bare minimum once per year, but quarterly testing is generally recommended.
- Poor Maintenance of Physical Space: Everything from carelessly discarded cardboard to cables hanging like vines in the jungle has been spotted in datacenters worldwide. Keeping the physical space clean and tidy serves a few major purposes. It is unsafe to be carrying a piece of equipment and slip on a piece of trash left lying on what should be a clean floor. It sucks to be woken up at 1am because the janitor pulled the fiber trunk out with a mop handle because it wasn’t enclosed. Your company should develop and adhere to strict policies around datacenter cleanliness and invest in a camera system to monitor the space for such dangers.
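To make the device monitoring point concrete, here is a minimal sketch of a storage check, the kind of thing that would have caught the full domain controller drive from the opening. It is not a replacement for a real monitoring tool such as Nagios. The thresholds (80% warning, 90% critical) and the function names `classify_usage` and `check_volume` are my own assumptions for illustration; only `shutil.disk_usage` comes from the Python standard library.

```python
import shutil

# Thresholds are illustrative assumptions; tune them for your environment.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def classify_usage(used_bytes, total_bytes, warn_pct=WARN_PCT, crit_pct=CRIT_PCT):
    """Return (status, percent_used) for a volume (hypothetical helper)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100.0 * used_bytes / total_bytes
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


def check_volume(path):
    """Check the volume that contains `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # "/" is an example path; point this at the volume you care about.
    status, pct = check_volume("/")
    print(f"{status}: {pct:.1f}% used")
```

Run on a schedule and wired to an email or chat alert, even a check this small means the helpdesk is no longer your monitoring system.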
This is by no means an all-inclusive list of everything that should be taken into consideration. Every situation and space offers its own unique brand of challenges, but implementing and adhering to maintenance policies and inspections will help ensure the success of both your company and your career. Not only do these policies reduce safety risk and increase the operational performance of the equipment, they may also save you hundreds to hundreds of thousands of dollars on your company’s IT insurance. This type of documentation and care could also affect the valuation of your company in the case of loan extensions, buyouts, mergers, etc. Start with an inventory of your space, then develop a set of needs to be documented, keeping key infrastructure components in mind. From there you can adapt a checklist for your space that is then distributed across your team. It is important that all people with access to the datacenter space read and sign a document stating they will adhere to those standards. This both makes them aware of your expectations and serves as a legal safety net should they choose to do something that creates an unsafe situation.
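If it helps to picture what a checklist with inspection intervals might look like once it leaves the whiteboard, here is a rough sketch. The intervals follow the ones suggested in this article (monthly walkthrough, quarterly UPS and transfer switch checks), but the `Task` class, the `overdue_tasks` helper, the day counts, and the dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Task:
    """One recurring maintenance item (hypothetical structure)."""
    name: str
    interval_days: int
    last_done: date

    def due_date(self) -> date:
        return self.last_done + timedelta(days=self.interval_days)

    def is_overdue(self, today: date) -> bool:
        return today > self.due_date()


def overdue_tasks(tasks, today):
    """Return overdue tasks, earliest due date (most overdue) first."""
    late = [t for t in tasks if t.is_overdue(today)]
    return sorted(late, key=lambda t: t.due_date())


# Example data only: 30 days stands in for "monthly", 90 for "quarterly".
CHECKLIST = [
    Task("Datacenter walkthrough", 30, date(2024, 1, 2)),
    Task("UPS and battery inspection", 90, date(2024, 1, 15)),
    Task("Transfer switch test", 90, date(2023, 10, 1)),
]

if __name__ == "__main__":
    for task in overdue_tasks(CHECKLIST, date(2024, 2, 15)):
        print(f"OVERDUE: {task.name} (was due {task.due_date()})")
```

A shared spreadsheet does the same job. What matters is that every item has an owner, an interval, and a last-completed date that someone actually looks at.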
