Friday, September 01, 2006

Is it getting hot in here?

Heat is probably the number one problem facing IT departments these days. Modern chips run at 3 gigahertz or above, which means that electrical signals are being switched 3 billion times per second. That's a lot of signals, and although each one is tiny, together they generate a lot of heat.

In the old days, PCs had a heatsink (essentially a small radiator) on the CPU and a fan at the back to extract the hot air. These days, there are high-speed fans on the CPU, the graphics card, the system case and even some high-end RAM modules. For a home PC that adds up to a lot of noise, so imagine it in a room full of high-end servers.

In a standard 42U cabinet (which is around the size of a single bedroom wardrobe) you can fit 250 blade servers, each with 2-4 high-end processors, several hard drives and tens of gigabytes of RAM. Open up a blade, and it looks like it was designed by an aerodynamics engineer, with arrays of fans drawing air through the entire server and radiator fins lined up parallel to the airflow over all the major components. Fire up one of these beasts and it's like a plane taking off; now imagine the noise of a full rack, or an entire datacentre.


A blade enclosure with 10 nodes, each supporting 2 hard drives and 2 CPUs.

I spent some time this week in a server room where we needed to turn the air conditioning off for a couple of hours. We monitored the temperature and shut down more and more servers as the heat kept rising. In the end, we had over 80% of our servers off, and were only just keeping the heat under control.

Many server rooms have no failsafe systems, and no temperature monitoring other than a thermometer in the room itself. If the air conditioning fails at night or over the weekend, your servers are going to be toast when you come in on Monday morning. If you are lucky, they will survive the weekend and keep running for a couple of weeks before the heat damage makes them start failing. If you are unlucky, you've just wiped out your entire network.

If you haven't done so already, install environmental sensors that alert you when the temperature rises (and that also monitor humidity, smoke and so on). Link them up to a pager or SMS gateway so you get alerted. Also have a system for rapidly shutting servers down when the alert thresholds are reached, and keep your phone nearby so you can respond quickly if you need to. Better still, link the environmental monitor to both the shutdown system and the pager. A lot of modern servers have built-in thermal shutdown features; just make sure you enable them, as they are often not enabled by default. Do all that, and you might just sleep that little bit better.
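Here is a minimal Python sketch of linking a temperature monitor to both alerting and shutdown. The thresholds (27 C to alert, 32 C to start shutting down) are hypothetical values, not recommendations, and `read_temp`, `send_alert` and `shutdown` are hypothetical placeholders for whatever sensor, pager/SMS gateway and shutdown tool you actually use. Servers are listed least important first, and each extra degree above the shutdown threshold takes out another batch, the same way you would shut down more and more servers as the heat keeps rising.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Thresholds:
    # Hypothetical example values in degrees Celsius; tune for your own room.
    warn_c: float = 27.0      # page someone
    shutdown_c: float = 32.0  # start shutting servers down


def decide(temp_c: float, t: Thresholds) -> str:
    """Classify a single temperature reading."""
    if temp_c >= t.shutdown_c:
        return "shutdown"
    if temp_c >= t.warn_c:
        return "alert"
    return "ok"


def servers_to_stop(temp_c: float, servers_by_priority: List[str],
                    t: Thresholds, step_c: float = 1.0,
                    per_step: int = 2) -> List[str]:
    """Staged shutdown: the hotter it gets, the more servers go.

    servers_by_priority must be ordered least important first.
    Every step_c degrees above the shutdown threshold adds per_step servers.
    """
    if temp_c < t.shutdown_c:
        return []
    steps = int((temp_c - t.shutdown_c) // step_c) + 1
    return servers_by_priority[: steps * per_step]


def check_once(read_temp: Callable[[], float],
               send_alert: Callable[[str], None],
               shutdown: Callable[[str], None],
               servers_by_priority: List[str],
               t: Thresholds) -> str:
    """Take one reading, alert if needed, and shut servers down if needed.

    Run this from a scheduler every minute or so. It may ask an
    already-stopped server to shut down again, so shutdown() should
    tolerate that.
    """
    temp = read_temp()
    action = decide(temp, t)
    if action != "ok":
        send_alert(f"Server room at {temp:.1f} C: {action}")
    if action == "shutdown":
        for server in servers_to_stop(temp, servers_by_priority, t):
            shutdown(server)
    return action
```

Because the sensor, pager and shutdown mechanism are passed in as plain functions, the same logic works whether alerts go out by SMS or email and whether servers are stopped by a remote power switch or a command-line tool.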

ServerWatch has an interesting article about environmental monitoring, and Minicom make a remote power switch with environmental controls.

For fast manual shutdowns, you can download PsTools from the excellent Sysinternals site, and best of all it's completely free. Use PsShutdown in a batch file to shut down a whole bunch of servers at once from a remote location.
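If you would rather script it than write a batch file, here is a minimal Python sketch of the same idea: loop over a list of servers and call psshutdown once per machine. It assumes psshutdown.exe from PsTools is on the PATH and that you have admin rights on each target. The server names are hypothetical, and the `-s` (shut down), `-f` (force applications to close), `-t` (countdown in seconds) and `-m` (message) switches should be checked against `psshutdown /?` on your copy. By default it only prints the commands it would run.

```python
import subprocess
from typing import List


def build_command(server: str, countdown_s: int = 0,
                  message: str = "Emergency thermal shutdown") -> List[str]:
    """Build a psshutdown command line for one remote server.

    Switches assumed from the PsShutdown help text:
    -s shut down, -f force running applications to close,
    -t countdown in seconds, -m message shown to logged-on users.
    """
    return ["psshutdown", "\\\\" + server, "-s", "-f",
            "-t", str(countdown_s), "-m", message]


def shutdown_all(servers: List[str], dry_run: bool = True) -> List[List[str]]:
    """Shut down every server in the list, one psshutdown call each."""
    commands = [build_command(s) for s in servers]
    for cmd in commands:
        if dry_run:
            # Preview the command with Windows-style quoting.
            print(subprocess.list2cmdline(cmd))
        else:
            # check=False: keep going if one server is already off or unreachable.
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Hypothetical server names, least important first.
    shutdown_all(["TESTSRV01", "FILESRV02", "WEBSRV03"])
```

Once the printed commands look right, call `shutdown_all(servers, dry_run=False)` to actually send the shutdowns.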
