Resolved

We are happy to report that our Canadian routing is now fully restored.

Please keep reading for our RFO.

Due to the recent DDoS attacks against Bandwidth and Voip.MS, Skyetel is currently experiencing a dramatic increase in new customers, port-in requests and origination traffic. This sharp increase in traffic, along with our engineering team’s desire to increase our own DDoS mitigation preparedness, required us to deploy additional network devices across our infrastructure faster than we normally would. (Ordinarily, Skyetel spends about a month testing new networking devices before deploying them to production.) Unfortunately, one of our peers fell into an edge case that caused our SIP packets to be malformed after the call had been negotiated.

While almost all of Skyetel’s peers utilize Sonus SBCs, the impacted traffic was being delivered to us by a peer that utilizes SBCs made by Metaswitch. This Metaswitch SBC had additional SIP requirements that our preexisting PSTN gateways understood, but that the new network devices did not. Specifically, Skyetel supports the use of rport for our customers who are behind NAT. The peer that originates the impacted traffic does not. This meant that any customer using a traditional PBX (which is almost the entirety of our network) would have audio issues and call failures when we handed the SIP negotiation to our peer’s audio gateways. Because these failures occurred inside our peer’s network, rather than ours, Skyetel’s gateways thought the calls were completing normally. As a result, our systems reported normal call flows while our customers experienced degraded service.

Initially, we believed the rport issue originated from the network device (the load balancer) that we deployed. After removing the network device, customers still reported audio issues and we were forced to reevaluate. As a fix, this morning we updated the configurations used by the load balancers to allow rport on the Skyetel side of the network, but not on the peer’s side. This effectively hid the rport requests from our peer and allowed their Metaswitch SBCs to process the calls normally.
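To illustrate what hiding rport from a peer can look like at the SIP header level, here is a minimal sketch. It is not our load balancer configuration; the function names `strip_rport` and `rewrite_for_peer` and the sample addresses are hypothetical, and it assumes the only change needed is removing the `rport` parameter (RFC 3581) from Via headers before a message is forwarded to the peer.

```python
import re

# Matches a ";rport" Via parameter, with or without a value (RFC 3581),
# when it is followed by another parameter, a comma, or the end of the value.
_RPORT = re.compile(r";\s*rport(?:\s*=\s*\d+)?(?=\s*(?:;|,|$))", re.IGNORECASE)


def strip_rport(via_value: str) -> str:
    """Return a Via header value with any rport parameter removed."""
    return _RPORT.sub("", via_value)


def rewrite_for_peer(sip_message: str) -> str:
    """Hypothetical peer-facing rewrite: drop rport from Via headers only.

    The request line, other headers, and the body are passed through untouched.
    """
    out = []
    in_headers = True
    for line in sip_message.split("\r\n"):
        if in_headers and line == "":
            in_headers = False  # blank line ends the header section
        if in_headers:
            name, sep, value = line.partition(":")
            # "v" is the compact form of the Via header name.
            if sep and name.strip().lower() in ("via", "v"):
                line = f"{name}:{strip_rport(value)}"
        out.append(line)
    return "\r\n".join(out)


if __name__ == "__main__":
    # Sample message with made-up addresses.
    message = (
        "INVITE sip:15145550100@peer.example SIP/2.0\r\n"
        "Via: SIP/2.0/UDP 10.0.0.5:5060;rport;branch=z9hG4bK776\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )
    print(rewrite_for_peer(message))
```

The customer-facing side would leave the Via header alone, so customers behind NAT keep the benefit of rport while the peer never sees the parameter.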

To address why this took so long to remediate, it is important to share that our engineering and technical support teams have been under extraordinary pressure due to the recent DDoS attacks. In normal circumstances, Skyetel Support would have identified these edge cases based on customer feedback, and our engineering team would have worked to address the routing issues immediately. However, Skyetel Support is currently experiencing a 700% increase in support requests from new customers seeking help getting their PBXs configured on our network. This effectively hid that critical feedback behind ordinary support requests. The delay in identifying the errors was compounded by our peer’s policy of not updating carrier-to-carrier peering during normal business hours. As a result, remediation took longer than expected.

After we deployed the update to our load balancers that hid the rport requests, and updated the phone numbers to use this new route, calls completed normally.

We are very sorry for the disruption of service caused by this chain of events, and we are embarrassed by our mistake. Thank you to our awesome Canadian customers for their patience and understanding.

Your Friends at Skyetel

Recovering

Initial feedback is reporting that calls to impacted numbers are working correctly. We continue to closely monitor this situation.

Updated

All impacted numbers have been updated with the new route, and we will now continue to test.

Updated

We are deploying a route change on all impacted phone numbers now. This will take between 10 and 30 minutes.

Updated

The change has now been deployed in all 4 regions. We will do additional testing to verify network stability. Once that is complete, we will update the routes on all impacted phone numbers to the new route. ETA is 30 minutes.

Updated

We have deployed the fix across 3 regions, and are conducting additional testing. So far so good.

Updated

We have deployed this change across two regions, and will now perform additional testing.

Identified

Initial testing confirms that our fix appears to resolve these issues. We are going to deploy this change to two of our regions and perform additional testing.

Updated

We have deployed a change to our network, and are working with a few customers to test. If this change works, we expect resolution to be within 2 hours.

Updated

We are still working to resolve this issue. We will be deploying a network change shortly in an attempt to remediate.

Investigating

Although we have removed our load balancer, we are still getting reports of audio issues. We are working with the offending upstream carrier to identify and resolve these issues.

Recovering

Our errant load balancer has been removed, and the impacted Canadian traffic should now flow normally. We will remain in this status through the morning to verify normal call flows.

Updated

We have received word that our errant load balancer is in the process of being removed now.

Updated

We have been in touch with the upstream carrier and have asked them to remove the errant load balancer from their routing. Unfortunately, the team that handles carrier-to-carrier route changes only makes changes after hours, so the ETA for resolution will be after the close of business tonight.

Please note that this is only impacting a small subset of our Canadian customers (about 10% of our Canadian DIDs). Customers who came to us during the voip.ms DDoS outages are most likely to be impacted.

Identified

Skyetel is aware of issues impacting our Canadian customers. Those customers may be experiencing no audio, call drops, or call failures. We have pinpointed the issue to a Skyetel load balancer, and are working to remediate now.

This is only impacting customers in Canada. We will update this post when we learn more.

Began at:

Affected components
  • PSTN Network