
Why the Rogers outage was so bad, and how to prevent the next one

There are no easy solutions to prevent another 17+ hour outage

Rogers "most reliable" network claim

Canadians didn’t know how good they had it until it was gone.

Millions woke up on the morning of July 8th to find they had no internet. Their wireless service didn’t work. Debit transactions at stores failed. E-transfers didn’t go through. Canadians couldn’t reach 9-1-1. Government services reported disruptions because phone lines were down.

Roughly 36 hours later, Rogers CEO and president Tony Staffieri publicly revealed the cause of the outage: a maintenance update in the company’s core network. The update caused “some of [Rogers’] routers to malfunction” on July 8th.

Over 48 hours later, with some customers still reporting issues despite Rogers claiming it had restored the "vast majority" of service, calls for an investigation were ringing loud. On July 11th, Innovation, Science and Industry Minister François-Philippe Champagne met with the heads of Canada's major telecom companies and gave them 60 days to "improve the resiliency and reliability" of networks, and to reach agreements on emergency roaming, mutual assistance during outages, and a communications protocol to provide better information to the public and authorities during telecommunications emergencies.

Champagne also promised a CRTC investigation into the outage, and on July 12th, the commission ordered Rogers to answer questions about what happened within ten days.

On the surface, it seems simple. There was a problem, and the government told telecom companies to work together to ensure it didn’t happen again.

But it’s never that simple. To come up with a solution, you need to understand the problem — this one runs deep, and well beyond Rogers.

All-in on all-IP

To start, we need to understand how Rogers' network operates — you'll need to bear with me through this, as it's a bit of a slog (I promise it's worth it). MobileSyrup has come to understand that Rogers runs an all-IP (internet protocol) network, which effectively means the type of traffic doesn't matter — it all goes through the same network.

A source familiar with networks, who asked not to be named, explained all-IP with an FM radio analogy. Unlike typical radio, where users need to tune to different stations, an all-IP network effectively carries every station on a single tuning. In the case of Rogers, all traffic (telephony, wired internet, etc.) goes through the same core network.

To be clear, there isn't anything necessarily wrong with all-IP. Telecom networks have moved in this direction over the last several years, and it has enabled some innovations. However, it also introduces vulnerabilities — for example, the risk of a whole-network outage like the one we saw on July 8th.


“Look, an all-IP network I don’t think is necessarily a bad thing if it’s implemented in a resilient way,” explained Ian Rae in an interview with MobileSyrup. Rae, the founder and CEO of CloudOps, has worked in the tech industry for about 25 years. Back in 2000, Rae was part of a startup that was virtualizing network access for internet companies.

“I am very much at the intersection of telecommunications and networking, and what we now call cloud computing,” Rae said.

Rogers isn't one of Rae's customers, so he isn't "intimately familiar" with the company's network — that distance is also one of the reasons he was able to speak with MobileSyrup. Still, Rae offered some high-level insight into Canadian telecom architecture.

“The thing that’s interesting about [the outage] to me is that [Rogers] already shared that this is in their core network,” Rae said. “So what’s a core network? This is where a lot of the internal handling of traffic and security policies, how services get integrated together, all this magic happens on the core network.”

According to Rae, components running at the edge of the network, like cell towers, get connected back to the core network through backhaul. Traffic runs through this system and ends up at the ultimate destination.

Tracing the traffic

Part of that journey involves what our source called the "basic level of the internet," made up of big, expensive gateway nodes, or routers, that handle all the traffic and transfer it out from Rogers' network into the wider internet. An important note here: the core difference between a router and a gateway is that gateways regulate traffic between dissimilar networks, while routers move traffic between similar networks. In other words, a router could be considered a gateway, but a gateway can't always be considered a router.

This is where we get into the meat of what went wrong. As detailed by Cloudflare in a blog post on July 8th, the issue stemmed from Rogers' routers that handle Border Gateway Protocol (BGP). BGP, according to Cloudflare, allows one network (for example, Rogers) to tell other networks that it exists and which addresses it can route. The internet is a network of networks, so, put simply, BGP is how Rogers informs the rest of the networks on the internet of its presence.
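To make BGP a little more concrete, here's a deliberately tiny sketch in Python (not real BGP, and not Rogers' systems) of the core idea: a network announces the address blocks it can route, its peers build a table from those announcements, and if the announcements stop, the rest of the internet simply no longer knows how to reach that network. The network name and prefixes below are placeholders.

```python
# Toy illustration of the BGP idea described above -- not real BGP and not
# Rogers' actual configuration. Network names and prefixes are placeholders
# (the prefixes come from the RFC 5737 documentation ranges).

routing_table: dict[str, str] = {}  # prefix -> network that announced it

def announce(network: str, prefixes: list[str]) -> None:
    """A peer learns that `network` can route traffic for these prefixes."""
    for prefix in prefixes:
        routing_table[prefix] = network

def withdraw(network: str) -> None:
    """A peer forgets every route it learned from `network`."""
    for prefix in [p for p, n in routing_table.items() if n == network]:
        del routing_table[prefix]

def route_for(prefix: str) -> str:
    """Where would traffic destined for this prefix go?"""
    return routing_table.get(prefix, "unreachable (no route)")

# The ISP announces its address blocks to a peer...
announce("ExampleISP", ["192.0.2.0/24", "198.51.100.0/24"])
print(route_for("192.0.2.0/24"))    # -> ExampleISP

# ...and when those announcements disappear, so does the ISP.
withdraw("ExampleISP")
print(route_for("192.0.2.0/24"))    # -> unreachable (no route)
```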

We’ll get into BGP more in a moment, but first, it’s worth noting that MobileSyrup understands Bell and Telus operate all-IP networks as well. In other words, both could be vulnerable to similar issues.

To highlight the scope of how traffic runs on Rogers' network, it's worth looking at what happened to Rae when Rogers went down. Rae had been on vacation in Rhode Island and was just starting the drive back to Montreal when the outage hit and he lost service.

“One of the reasons for that is that the ability to roam actually does still tie back to the availability of those core networking services back up in Canada,” Rae said.

That's perhaps one of the best examples of how this system works for people who aren't in the know, and it helps clarify why the Rogers outage was so significant. It wasn't that phones couldn't connect to towers. Rogers' network failure was much more specific, taking down a core piece of the network responsible for directing traffic from Rogers' network to the rest of the internet.

It’s also key to understanding the issues with 9-1-1 and Rogers customers being unable to call emergency services. As MobileSyrup understands, Canadian telecommunications companies already have network sharing agreements to enable 9-1-1 access in the event of a network outage. In other words, if a Rogers phone can’t connect to Rogers’ towers, it can fall back to other carriers’ towers through local roaming to access the emergency network. If you have cell signal, you can dial 9-1-1.

Given that Rogers' towers were still operating, phones stayed connected to them rather than roaming onto another carrier, so it appears the emergency fallback didn't kick in. Further, Iristel president, founder and CEO Samer Bishay said in a statement that Rogers customers could have regained access to 9-1-1 services by removing the SIM card from their device. Typically this isn't necessary, but because of how Rogers' network failed, Bishay said removing the SIM would enable the typical fallback routing for emergency calls. Unfortunately, this wasn't communicated to Canadians during the outage, with some emergency services directing people to find landlines or borrow other, working phones.

Assembling the puzzle pieces

Albert Heinle, co-founder and CTO of Waterloo, Ontario-based CoGuard, shared a deep dive into Rogers’ BGP issues on the CoGuard website. Heinle assembles a few pieces — first, noting what Rogers revealed about an update causing router malfunctions, then pulling in Cloudflare’s information about BGP — and explains that there was likely a scheduled maintenance update on Friday morning, which caused Rogers’ BGP routers to malfunction. That malfunction stopped those routers from communicating to the rest of the internet that Rogers’ network existed. Rae also notes that Rogers may use internal BGP (IBGP) for communication within its own network, which could also potentially be a point of failure.

Both Heinle and Rae referenced Facebook’s October 2021 outage, which was also BGP-related. A small misconfiguration removed the ability of Facebook’s systems to communicate with each other.

The anonymous source described the issue to MobileSyrup as similar to being connected remotely to a computer. If you turn on that computer's firewall, it cuts off the remote connection, and now you can't remotely reconnect to turn the firewall back off. Instead, you have to physically go to that computer and connect to it directly to disable the firewall. Of course, it's never that simple — there's still the process of figuring out what went wrong, where it went wrong, and how to fix it. Oh, and then actually fixing it!

However, it's worth acknowledging that there may still be pieces of the puzzle that haven't come to light. Rogers is due to answer the CRTC's questions about the outage on July 22nd, and new information will likely emerge then. That said, it seems enough of the pieces are in place for people to start teasing out ways to prevent this from happening again.

And that brings us to the crux of all this: solutions.

Work together, or else!

It’s important to understand that no solution should be off the table. Everything is worth considering at this point, and every solution has pros and cons. People can argue about what should be done, but first, we should examine what can be done.

So far, the solution that appears to have garnered the biggest headlines is Minister Champagne’s demand that Canada’s telecommunications companies work together and develop agreements for mutual assistance, emergency roaming, and better communication about outages.

The last point is critically important, especially given that Rogers' existing tools for communicating outages almost completely failed to do so effectively. The '@RogersHelps' Twitter account shared its first update over four hours into the outage on July 8th. Prior to that, customers were directed to visit either a community forum page that was supposed to offer information about ongoing outages — but didn't — or a Rogers support page where customers could access a chatbot for outage information. During the early hours of the outage, that chatbot appeared to have difficulty working correctly.

The other two demands are more difficult. Emergency roaming agreements didn’t work during the July 8th outage, so revamping that system could help. However, it’s currently unclear how best to do that, considering that the way Rogers’ network failed prevented traffic from routing to fallback measures.

As for mutual assistance, while it would be good to allow phones to effectively "hop" between available networks, our source explained that this would essentially open a backdoor into the network that competitors could use. And, as is so often pointed out with government attempts to gain access to encryption, if a backdoor exists, it becomes a target for exploitation. That could come from anywhere — governments, hackers, competitors. It seems an impossible balance — how do you open the core of your network to prevent outages without putting the whole network at risk?

Moreover, Rae said that although he liked the idea behind Champagne's mutual assistance demand, he worried that such an agreement could further hamper efforts to increase competition and bring in new players.

Update the way you update

Heinle's analysis includes a close examination of Rogers' own proposed solutions. On July 9th, Rogers outlined a three-part action plan in response to the outage, which included analyzing the root cause and implementing redundancy and any other necessary changes.

Redundancy is best thought of as adding infrastructure to create fallbacks. In the case of Rogers' outage, that could mean increasing the number of routers. MobileSyrup's source suggested adding specialized routers to handle emergency traffic, if such a system doesn't already exist. However, Heinle notes redundancy isn't the issue. The update structure is.

Rogers’ outage started with a faulty update, which means increasing the number of routers won’t solve the problem – if they all receive a faulty update, they all break. So, Rogers should focus on updating the way it handles updates to mitigate the potential for outages of this magnitude.
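A hypothetical sketch makes the problem plain (the router names and the stand-in "config" are invented for illustration): if every redundant router receives the same broken update at the same time, they all fail together, and the redundancy buys nothing.

```python
# Hypothetical sketch of Heinle's point: redundancy doesn't protect against
# a correlated failure, i.e. the same faulty update hitting every device.
# Router names and the "config" structure are invented for illustration.

def apply_update(router: str, config: dict) -> bool:
    """Pretend to apply a config; the router stays healthy only if it's valid."""
    healthy = config.get("valid", False)
    print(f"{router}: {'ok' if healthy else 'DOWN'}")
    return healthy

faulty_update = {"version": "maintenance-update", "valid": False}
routers = [f"core-router-{i}" for i in range(4)]  # the "redundant" fleet

# Push the same faulty update everywhere at once...
results = [apply_update(r, faulty_update) for r in routers]
print("network up:", any(results))  # False -- every fallback broke too
```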

“These maintenance activities are generally pretty typical and routine,” said Rae. “You’re going to have a change management plan, you’re going to have an approval process, you’re going to have a backout plan. It doesn’t sound like, from what [Rogers] is saying, that it was a major change architecturally… those tend to be much riskier activities.”

Both Rae and Heinle posed the question of what Rogers' risk-management process was for the update. Heinle suspects a rollback wasn't possible, given that Rogers said it disconnected the impacted equipment. Both also questioned the "blast radius" of the outage — why didn't Rogers stage the update to catch any potential issues on a smaller scale before it impacted the entire network? And if Rogers did stage the update, how did the issue slip through? We may not know these answers until we hear them from Rogers in the coming days.
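For contrast, here's a minimal, hypothetical sketch of what a staged (or "canary") rollout with a health check and automatic rollback can look like in principle. The function names and the health check are invented for illustration; this isn't Rogers' actual tooling or process.

```python
# Minimal, hypothetical sketch of a staged ("canary") rollout with rollback.
# Everything here is invented for illustration, not Rogers' actual process.

def apply_config(router: str, config: dict) -> None:
    print(f"{router}: applied {config['version']}")

def health_check(router: str, config: dict) -> bool:
    # In reality this would verify the router still announces routes,
    # passes traffic, reaches its peers, and so on.
    return config.get("valid", False)

def staged_rollout(routers: list[str], new: dict, known_good: dict) -> bool:
    canary, rest = routers[0], routers[1:]

    # Step 1: update a single router and watch it before touching the fleet.
    apply_config(canary, new)
    if not health_check(canary, new):
        # Step 2: the canary failed, so roll it back and stop the rollout.
        apply_config(canary, known_good)
        print("canary failed its health check; rollout aborted, fleet untouched")
        return False

    # Step 3: only a healthy canary lets the change reach everything else.
    for router in rest:
        apply_config(router, new)
    return True

routers = [f"core-router-{i}" for i in range(4)]
staged_rollout(routers,
               new={"version": "maintenance-update", "valid": False},
               known_good={"version": "previous-config", "valid": True})
```

The point of a structure like this is to shrink the "blast radius" Rae and Heinle describe: a bad change should be able to break one router, not every router at once.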

A long road ahead

Ultimately, Rogers will need to review its internal update policies and develop solutions to fix possible failure points. Ideas shared with MobileSyrup include reviewing why updates need to be applied, and how those updates spread through the company’s network. Can Rogers contain updates to specific areas of the network for testing before a broader release? The approval process for updates should also be considered.

Rogers may examine whether it should implement check systems to warn of potential issues and prevent wide rollouts of broken updates. Maybe the company could implement (or improve an existing) system for managing update rollbacks when something goes wrong. Maybe the key is more frequent, smaller updates instead of singular, major ones.

Even better? A combination of everything. No solution should be off the table, including potentially expensive options — for example, the company’s consideration of splitting the wireless and wireline networks. That would be a huge expense given how the network currently works.

Moreover, while Rogers carries significant blame, no critical service in Canada — or anywhere — should be wholly dependent on a single telecom company.

“The fact that they went down is something that I’m shocked that everybody’s so shocked about it,” said Rae. “How is it that we have banks and other services that are mission-critical, and they depend entirely on the ability of a single telco provider to provide services? That is [an] unacceptable risk from my perspective.”

Rae acknowledges that this line of thought only goes so far. It works for major services like Interac — which announced it would add a supplier to increase redundancy following the Rogers outage. For regular customers and small businesses, it may not make sense to have multiple internet services. Expense aside, many companies — including Rogers — incentivize customers to bundle services and get internet, wireless, TV, and more from one company.

In all of this, it’s easy to forget that Rogers’ employees were also affected. Like everyone else, employees couldn’t access services, couldn’t make payments, and couldn’t call 9-1-1. Unfortunately, many will likely be on the receiving end of vitriol from customers frustrated with how the company handled the outage.

So what can Rogers do to prevent a future outage? A lot. What should it do? That’s up for debate. What will it do? We don’t know yet. Rogers made it clear on the call with Champagne that it wants to work with Bell and Telus on this because what happened to Rogers could happen to them.

What will that mean for Canadians? We’ll have to wait and see.

With files from Douglas Soltys.
