It is amazing how often unexpected things happen within the technology infrastructure.
Sometimes they result in an event that can take the whole shebang—data center, critical function, or application—offline for an extended and unexpected period of time.
As the staff scrambles to restore order, you ponder what has occurred and why.
Looking for causes (and people to blame)
The cause is soon identified as a single point of failure.
You are surprised and angry that there was a service interruption. But you are also irked. You know that if you had known about that weak spot, the situation would have been entirely preventable.
You, as the technology executive, CIO, or IT manager, are the one person whom everyone depends on.
You are never thanked when it works.
And you are always called when it doesn’t.
So, drawing on some experiences of my own, here are some hints that could save you from yourself:
Routers, switches, and ATMs
When the power blinks or is interrupted, the router, communications switch, or ATM will go offline. Many of these devices will not restore themselves after they trip offline. Consequently, the branch location, office, or ATM will sit there, idle, until the device is rebooted or reset.
To ride out a blink or short power outage and avoid that outcome, a small UPS will suffice. (That’s short for uninterruptible power supply, and you should get one with a duration rating of four hours.) Just be sure that when you install the UPS, you plug the device into the outlet labeled “battery and surge” protection.
Note: Every router, switch, and ATM should have a UPS battery backup, along with a schedule for replacing the battery.
Virtualized servers and services
Using a server for multiple applications is a cost-effective approach. Too often, however, the concept of virtualization is misunderstood.
Having multiple servers and virtualized applications does not automatically translate to easy restoration if the hardware fails.
More importantly, systems that are configured to fail over automatically may not be able to handle the increased volume.
Just to make sure, use the Wombat’s Rule of 300!
Here is how it works. Add up the total computing demand of all of the applications you are running. Now multiply that by 300%. Every single server should have that capacity, so that if any server fails, any one of the remaining servers can still manage the load and give you time to find a replacement without creating a cascading failure.
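To make the arithmetic concrete, here is a minimal sketch of the rule as a capacity check. The application names, server names, and the unit of "computing demand" (think CPU cores or any normalized unit you already track) are hypothetical and only there for illustration.

```python
def rule_of_300_capacity(app_demands):
    """Return the capacity each server should have: 300% of total demand."""
    total_demand = sum(app_demands.values())
    return total_demand * 3  # multiplying by 300% is the same as tripling


def servers_meeting_rule(server_capacities, app_demands):
    """Map each server name to True if its capacity satisfies the rule."""
    required = rule_of_300_capacity(app_demands)
    return {name: capacity >= required
            for name, capacity in server_capacities.items()}


# Hypothetical numbers: demand and capacity are in the same made-up unit.
app_demands = {"email": 4, "crm": 6, "billing": 2}      # total demand: 12
server_capacities = {"host-a": 48, "host-b": 32}

print(rule_of_300_capacity(app_demands))                 # 36
print(servers_meeting_rule(server_capacities, app_demands))
```

In this made-up example, host-a passes and host-b does not, so host-b is the server to upgrade or relieve before it is asked to carry everything on its own.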
This situation should be constantly monitored as applications grow. When the remaining servers are unable to handle the load, a virtual failover event can turn into a catastrophic, cascading incident that takes out all of your servers.
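If you want to show your staff how a cascade unfolds, a toy simulation helps. The sketch below is a simplification under stated assumptions: the load of a failed server is split evenly across the survivors, and any server pushed past its capacity fails in turn. Real failover software distributes load in its own way, and the server names and numbers are hypothetical.

```python
def simulate_failover(capacities, loads, failed):
    """Return the sorted names of servers still running after `failed` dies.

    Assumption: orphaned load is shared evenly among surviving servers.
    """
    alive = {name: loads[name] for name in capacities if name != failed}
    orphaned = loads[failed]
    while alive and orphaned > 0:
        share = orphaned / len(alive)
        for name in alive:
            alive[name] += share
        # Any server now over capacity fails and orphans its whole load.
        overloaded = [name for name in alive if alive[name] > capacities[name]]
        orphaned = sum(alive.pop(name) for name in overloaded)
    return sorted(alive)


loads = {"a": 70, "b": 70, "c": 70}

# Tightly sized servers: losing one overloads the rest, and everything falls.
print(simulate_failover({"a": 100, "b": 100, "c": 100}, loads, "a"))  # []

# Generously sized servers: the survivors absorb the extra load.
print(simulate_failover({"a": 300, "b": 300, "c": 300}, loads, "a"))  # ['b', 'c']
```

The first run is the scenario to worry about: each server looked comfortable at 70 percent, yet one failure was enough to bring down all three.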
Finally, almost every major event can be avoided by implementing a simple solution. It is just a matter of talking about it with your staff.
Start by being philosophical:
How can our system go down?
Let me count the ways!
In other words, know what you don’t know!