Nearly two years after the infamous July 8th Rogers outage in 2022, an independent review conducted for the Canadian Radio-television and Telecommunications Commission (CRTC) blames human error. The review says the outage was worsened by network management and system “deficiencies.”
The CRTC commissioned Xona Partners in September 2023 to review the outage, a summary of which is now available on the CRTC website. Xona also looked at whether Rogers took sufficient steps to prevent another outage. The CRTC plans to release the full report at an unspecified date once it has redacted sensitive information.
Xona’s report summary details how Rogers was working on a seven-phase process to upgrade its IP core network — the outage occurred during the sixth phase of the upgrade process. According to the summary, an error in configuring distribution routers, which involved Rogers staff removing the Access Control List policy filter, caused the core network to be flooded with IP routing information that ultimately crashed the core network.
Staff removed the policy filter in an effort to “clean up” files in the distribution routers, and the “change management process, which includes audits of change parameters, failed to flag the erroneous configuration change.”
Also of note, the review summary says that Rogers’ core network hadn’t been configured with an overload limit, which meant there was no protection mechanism for the core network beyond the policy filter.
Rogers algorithm assessed network upgrade risk as ‘low’
Moreover, the review summary notes that Rogers assessed the risk of the network upgrade as ‘high,’ but the company’s risk assessment algorithm downgraded the risk level following the successful completion of the first five phases. The ‘low’ risk rating meant staff weren’t required to conduct additional scrutiny for the upgrades, like conducting lab testing for the change or seeking higher levels of approval.
“The July 2022 outage was not the result of a design flaw in the Rogers core network architecture. However, with both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services,” said the review summary. Additionally, it pointed out that it’s common for service providers to share an IP core network between wireless and wireline services to “balance cost with performance.”
Elsewhere, the review summary looked at factors that impacted network restoration. These include that Rogers’ management network relied on the IP core network, resulting in remote employees being unable to access the management network. “Rogers did not provision its network operation centre and other critical remote infrastructure sites with redundant connectivity from alternative service providers for network management.”
In a similar vein, the review pointed out that Rogers staff relied on the company’s network for their mobile and internet services, so when both networks failed, staff weren’t able to communicate effectively.
Finally, the review said Rogers staff “did not initially have access to the error logs from the failed routers and could not pinpoint the root cause for about 14 hours from the start of the outage.”
All of these factors prolonged the duration of the outage.
Rogers took “satisfactory” measures to improve network reliability
The review summary highlighted various measures Rogers took following the July 2022 outage to address deficiencies and improve network resiliency. The company added safeguards to routers in the core network to prevent another overload and also implemented a separate physical and logical management network. Rogers also deployed backup connectivity from third-party service providers to its network operation centre and other critical remote sites.
Other changes Rogers announced include separating its IP core network for wireless and wireline. However, this change remains a “work in progress.”
Moreover, Rogers switched to a new risk assessment algorithm, implemented organizational changes to improve collaboration between network operations and engineering teams, added an enhanced process for introducing new equipment and technology, and more.
A letter from the CRTC to Rogers filed alongside the review summary noted that “the measures taken by Rogers have addressed the cause of the outage” and that Rogers confirmed the implementation of all additional recommendations made by Xona. It also asks Rogers to report to the commission on whether measures continue to address reliability issues and on progress the company made to separate wireless and wireline networks. In an email to MobileSyrup, Rogers said it previously shared that it partnered with Cisco to split its networks. Rogers touted the effort as a “Canadian first” and that it would be the “first operator in Canada to do this.”
In response to the review, Rogers provided MobileSyrup with the following statement:
“We said we would fix this – we completed a full review of our networks, strengthened our network resiliency, implemented all the report recommendations, and today our networks are recognized as the most reliable by global benchmarking leaders.”
Additionally, Rogers highlighted other work it was doing, including investing $20 billion in its networks over five years and introducing additional change controls and AI-based predictive simulation capabilities to strengthen testing and monitoring.
Those interested can find the full review summary here.
Update 06/07/2024 at 9:31am ET: Added additional details and a statement from Rogers.
MobileSyrup may earn a commission from purchases made via our links, which helps fund the journalism we provide free on our website. These links do not influence our editorial content. Support us here.