Pay attention to resilient ICT systems

Imagine if all the mobile networks failed for a day. Or, God forbid, the entire mobile money system went down for two days.

It would be catastrophic, to say the least. Our dependence on information and communication technologies (ICTs) is such that there must be no room for error. We expect these systems to work 24 hours a day, 365 days a year.

Last week, I fell victim to a widespread system failure.

A power failure in Atlanta forced one of the world’s largest airlines, Delta, to shut down its operations, leaving thousands of travellers frustrated.

The airline’s disaster recovery systems also failed, yet these are the systems that run virtually everything, from crew scheduling, passenger check-in and flight dispatch to airport departures, information displays and ticket sales.


Within a few hours, the airline had cancelled more than 600 flights.

In July, a similar incident hit Southwest Airlines, leading to the cancellation of more than 2,000 flights and costing the airline between $5 million and $10 million.

Continuity plan

ICTs have advanced to the extent that such failures should not happen at all.

Every large organisation with extensive ICT deployment should have a business continuity plan backed by a precise business continuity strategy.

Once these tools are in place, the systems should be regularly tested, maintained and reviewed. Most strategies provide for system redundancy, backups and multiple disaster recovery systems.

But these critical steps are often ignored.

It is important, however, that from time to time we reflect on what could happen if any of the services that now depend on ICTs were to fail.

We often think that the risk of a system failure ended with Y2K.

The full impact of ICT disruptions is rarely established, given that affected customers may have lost significant man-hours or business.

A crisis like the Delta case gives the public a rare glimpse of how vulnerable we are in the critical systems we depend on for services.

Storage devices

There are many lessons to be learnt from failures of systems that hold critical data, whether at the personal level, where mobile handsets and computers hold a significant amount of data, or at the organisational level, where ICTs are deployed for transactions.

In the past, such data was kept on external storage devices, but that too became problematic because keeping the information up to date was itself a challenge.

Then came cloud services, which met both the need for external storage and the need for instant updates, so that virtually no data is lost.

Delta Air Lines has not explained why its backup systems failed, but chances are that the systems it maintained were too old.

Delta CEO Ed Bastian simply said: “I’m sorry that it happened and I don’t have the final analysis of what caused the outage. We did have a redundant backup power source in place. Unfortunately some of our core systems and key systems did not kick-over to the back-up power source when we lost power and, as a consequence of that, it caused our entire system effectively to crash and we had to reboot and start the operation up from scratch.”

His response, in a real sense, explained nothing.

It is rather like Kenya Power saying sorry when a power surge destroys your electronics, without paying for any of the damage or refunding the cost of repair.

Delta and its partners still owe the public an explanation of exactly what happened, given that the cost of cloud services has dropped so much in the past few years that one can maintain multiple storage locations to mitigate abrupt and costly failures.

Although this is now water under the bridge, it must serve as a lesson to all other providers of critical services to conduct dry runs of their systems to avoid inconveniencing customers.

Resilient systems

Advances in technology have made the entire world dependent on ICTs.

Providers must therefore ensure that their systems are resilient, given that some of them do not just improve efficiency but are critical to mankind’s day-to-day activities.

Elizabeth Edwards, an American author and the wife of former North Carolina Senator John Edwards, once said: “Resilience is accepting your new reality, even if it’s less good than the one you had before.”

Let’s be resilient anyway and accept the new normal of technology.

The writer is an associate professor at the University of Nairobi’s School of Business.