The recent Telstra outage served as a cautionary tale for many IT professionals. Human error still poses the biggest threat to business continuity and security. We wanted to know what our CeBIT 365 readers thought about the incident and were delighted by the high level of responses we received.
This article collates some of the most insightful submissions and reflections on questions such as:
- Could the situation have been managed better?
- Was there anything that could have prevented the outage altogether?
- What contingency plans do businesses need to have in place?
Please note that all comments are personal opinions only and do not represent the views of the commentators’ employers.
SCADA over cellular networks
The use of cellular networks for SCADA communications has been gaining momentum for some years. It is cheap, easy to install and has the arguable advantage of being managed by a third party.
Personally, I’ve never been happy with trusting critical information and control to a commercial cellular provider. Data performance has always been sacrificed in favour of the more lucrative voice traffic, as evidenced by the deterioration in data performance during disasters, sporting events and other large public gatherings.
The recent Telstra network failure is an indication of how vulnerable the network really is. IP communications is a way of life now, but should utilities that provide essential services be relying on what appears to be a vulnerable communications network?
Jim Baker, Principal SCADA Engineer, Water Corporation of WA
How could a single human error take down the entire network?
The most interesting thing to me is the fact that Telstra stated that the outage was caused by a single engineer doing the wrong thing that took down the entire network. This worries me on two fronts:
- If it was such a dire thing to do, what controls were in place (if any) to prevent it from occurring, and what can we in IT learn from this event ourselves? Surely the architect needed to produce a design that removed single points of failure across these critical services.
- What other events are waiting to happen through human error? This could be a dress rehearsal for things to come: systems like this become more complex every year, and we mere mortals find them increasingly hard to comprehend.
Others commented that it was a bit like Skynet from Terminator waiting to start up!
Keith Williams, Chief Architect
Mitigate risk through use of redundancy at the provider level
Regarding mitigating the risks posed by a service provider outage, I push for the use of redundancy at the provider level, where practical and the service is critical.
WAN and Internet services are two examples.
In previous placements, I sometimes had an uphill battle convincing management of the service uptime risk posed by network and systems administrators within a single provider on whom we may be solely reliant.
For Internet access and inter-office WAN links, I want separate links and associated hardware from separate providers (none of which should be reliant on the other), utilising diverse paths where possible. This avoids risks such as a severed cable, failed hardware and, as the recent Telstra incident highlighted, a net/sys administrative error.
Avoiding human error is easier said than done
It’s easy to be wise after the event and whilst criticism of Telstra is justified, in this instance, I would make the following points:
- Every professional in the data centre business knows that avoiding human error is easier said than done. Statistically, human error is the largest cause of data centre outages by a considerable margin.
- To be effective, change management boards should be staffed by experienced operational staff, including specialists in the various changes being undertaken, rather than executives who simply rubber-stamp the change. It is important that the change envisaged and documented is subject to informed scrutiny.
- Time constraints and the apparent need to get things done quickly are often the biggest issue. To borrow a carpentry maxim: ‘measure twice, cut once’.
Gordon Paddy, Data Centre Advisory Specialist, NEXTDC
System protection logic and software has not kept pace
The increasing centralisation of service provision into larger nodes, together with virtualisation, has been underway for over 20 years. It seems that the built-in system protection logic and software has not kept pace, or that lessons learned previously have been forgotten.
My understanding from Telstra’s post-incident statements is that a signalling overload arose due to a manual shutdown of a signalling node. In the early days of Common Channel Signalling (CCS7 or SS7) deployment, at least one much smaller event occurred in Australia. However, deployment was then limited, so the impact was very low thanks to the redundancy of the older technology.
In the USA, however, there was at least one spectacular failure of the AT&T network 26 years ago due to signalling overload and faulty restoration algorithms.
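The overload mechanism described above can be illustrated with a toy model: taking one node out of service redistributes its traffic to the survivors, and any node pushed past capacity fails and sheds its own load in turn, potentially cascading to total collapse. This is a deliberately simplified sketch (uniform capacity, even redistribution), not a model of Telstra’s or AT&T’s actual signalling networks:

```python
# Toy model of a cascading signalling overload. Capacities, loads and the
# even-redistribution rule are illustrative assumptions only.

def simulate_cascade(loads, capacity, initial_failure):
    """loads: dict of node name -> current load (mutated in place).
    Returns the set of nodes that end up failed."""
    failed = {initial_failure}
    redistribute = loads[initial_failure]  # load shed by the failed node
    while redistribute > 0:
        survivors = [n for n in loads if n not in failed]
        if not survivors:
            break  # every node has failed: total network collapse
        share = redistribute / len(survivors)
        redistribute = 0
        for n in survivors:
            loads[n] += share
        # Any survivor pushed past capacity fails and sheds its load too.
        for n in survivors:
            if loads[n] > capacity:
                failed.add(n)
                redistribute += loads[n]
    return failed


# A lightly loaded network absorbs the shutdown of node A...
light = simulate_cascade({"A": 60, "B": 60, "C": 60, "D": 10}, 100, "A")
# ...but the same shutdown on a heavily loaded network cascades to all nodes.
heavy = simulate_cascade({"A": 90, "B": 85, "C": 85, "D": 30}, 100, "A")
```

The point the model makes is the one in the comment: the more traffic is concentrated onto fewer, larger nodes, the less headroom the survivors have when one node is removed, and the more likely a single shutdown is to propagate.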
Centralisation is driven by the efficiency of service provision and, in particular, its cost. The consequence is simple: when an unforeseen event occurs and a big node fails, the impact is big and often complex to fix, due to virtualisation and multiple layers of support teams.
In the end it’s a business call as to whether the money is spent to prevent an event or fix things up afterwards and compensate those affected. There has to be a balance.
Mike Higgins, Voice & Data Telecomm’s Manager, SPARQ
Anything to add to the debate? Share your thoughts in the comment section below.