San Rafael Central Office Reason For Outage

On Sunday, at approximately 8:30 PM PST, our on-call Network Engineering staff began to receive network event alerts indicating that some of our core routers were losing power in a Central Office where we colocate. This site acts as both a serving point for our Sonic Fiber customers and as a core routing node at the junction between our ‘North Bay Ring’ and our ‘South Bay Ring’. Upon initial investigation, an incident call was started for the Network Engineering team to assess possible failures and mitigations, our Outage Response processes were initiated with our Customer Care department, and our on-call Central Office Technician was dispatched. While the Care team sent notices directly to affected Sonic Fiber customers, more network alerts began to flow in as we lost power to one device after another at this site.

Initial speculation among the team was that the historic flooding occurring in Marin County had damaged the power systems that serve both the ILEC and customers like ourselves within the building; information from the Marin Emergency Services website showing a fire truck responding to the exact street address of our equipment reinforced that picture. The speculation proved true: our technician arrived on site to find ILEC repair technicians arriving as well. A rough estimate of 3 hours was given to finish pumping the water out of the basement, to be followed by an unknown length of time to assess the damage to the electrical equipment and begin repairs. We were prevented from accessing our equipment to check fuses due to safety concerns from the repair team. Our technician reported that the building smelled strongly of ‘burnt electronics’. By around midnight, power was restored and our devices automatically powered on and began working again. By around 2 AM, most services were restored.

Unfortunately, our OLTs (Optical Line Terminals) at this site powered back on in an internally degraded state that limited the bandwidth available between each OLT and the rest of the network, creating bandwidth bottlenecks for our Fiber customers. The only fix was a scheduled night-time maintenance to restore full-speed connectivity, which was performed at 11:59 PM last night to avoid impact during peak usage hours.

In Sonic’s history of colocating in ILEC Central Offices, failures like this have not happened; this was a first for us. Central Offices are engineered and built with resiliency to withstand many failure modes. The power that our routers and network equipment draw is fed from two redundant battery banks that provide DC voltage to our equipment. Sites have generators that are regularly tested in case of utility power failure. Facilities are strictly maintained to high standards and federally regulated to ensure failures like this are as unlikely as possible. In addition, we have built our network with redundant paths, redundant locations, and recovery plans to handle truly disastrous failures. Despite this, 911 service across Marin County was severely degraded for nearly all carriers, not just Sonic. There were reports of various LTE services with other carriers being impacted as well, and of course, Sonic customers lost internet and phone access during peak usage times.

As mentioned earlier, this particular Central Office is both a serving point for Sonic Fiber customers and a core site that carries traffic from points north of San Rafael to points south, ultimately to data centers in Silicon Valley where we peer with other networks that make Sonic part of the Internet. Approximately 2,700 Fiber customers lost service when the OLTs (and the other devices within the building) lost power. Events like this are impactful, especially on a Sunday night when many people are trying to squeeze the last bit of personal and family time out of the weekend by enjoying our service. Because this site is a critical point where our North Bay Ring meets our South Bay Ring, the failure could have created widespread service reliability issues for all of our North Bay (points north of San Rafael) customers if we hadn’t engineered our network to have multiple paths, multiple sites, and, critically, enough bandwidth to handle failure scenarios as internet traffic automatically re-routes around the ring to avoid the failed node.
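For readers curious how that re-routing works, here is a minimal sketch using a simplified graph model in Python with the networkx library. The node names and topology are illustrative only and are not Sonic’s actual site layout; real convergence happens in routing protocols, not a script like this.

    import networkx as nx  # pip install networkx

    # Illustrative ring only; these sites are hypothetical, not Sonic's real topology.
    ring = nx.Graph()
    ring.add_edges_from([
        ("Santa Rosa", "Petaluma"),
        ("Petaluma", "San Rafael"),
        ("San Rafael", "San Francisco"),
        ("San Francisco", "Santa Clara"),
        ("Santa Clara", "Oakland"),
        ("Oakland", "Santa Rosa"),
    ])

    # Normal path from the North Bay toward Silicon Valley passes through San Rafael.
    print(nx.shortest_path(ring, "Santa Rosa", "Santa Clara"))

    # Simulate the failed node: traffic re-routes the long way around the ring,
    # provided the surviving path has enough capacity to carry it.
    ring.remove_node("San Rafael")
    print(nx.shortest_path(ring, "Santa Rosa", "Santa Clara"))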

While there was nothing that Sonic could have done to prevent the flooding that took out power, we strive to be open and transparent, to communicate with our customers and the broader public, and to continue building reliable networks so we can provide the best service in the business. It’s our goal that you never need to think about your connection. When we’re doing our job right, you shouldn’t have to think of us at all – it should just work.

Systems Maintenance

Update: Maintenance complete.

Tonight, beginning at 10 PM PST, we will be applying updates to our VPN cluster, which will require us to reboot each node. The maintenance should only last around 15–30 minutes. Thanks for choosing Sonic!

Sonic Holiday Schedule

Sonic Support and Sales will be closing early on 12/24 at 5pm, and will be closed on 12/25. We return to normal operating hours on 12/26 at 6:00am PT.

Have a happy holiday!

Mail cluster upgrade.

Update: It looks like there may have been a brief period when some users experienced an interruption, but maintenance is complete. Let us know if you have any issues with service.

System operations will be taking down a portion of our IMAP/POP3 cluster to upgrade the hardware. We do not expect this to be service impacting, and the work should only take a few minutes.

– SOC

DNS Problems – Resolved

We are seeing an increase in DNSSEC validation failures on our recursive DNS servers. The cause has been identified as a recently applied security patch, which enforces a stricter validation policy for domains with DNSSEC enabled. We are currently looking for ways to mitigate the problem.
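If you want to check whether a particular domain is affected, here is a rough sketch using the dnspython library (not a tool mentioned above; an assumption on our part). The resolver address and domain are placeholders, and a SERVFAIL from a validating resolver is a hint rather than proof that DNSSEC validation is the culprit.

    import dns.flags
    import dns.resolver  # pip install dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["192.0.2.53"]  # placeholder; substitute your recursive resolver
    resolver.use_edns(0, dns.flags.DO, 1232)  # request DNSSEC records

    try:
        answer = resolver.resolve("example.com", "A")
        # The AD flag indicates the resolver validated the answer with DNSSEC.
        print("validated:", bool(answer.response.flags & dns.flags.AD))
    except dns.resolver.NoNameservers:
        # A validating resolver returns SERVFAIL for answers it considers bogus,
        # which dnspython surfaces as NoNameservers once every server has failed.
        print("lookup failed; possibly a DNSSEC validation failure")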

Update: The cause of these problems has been positively identified as a behavior change in a new version of ISC’s BIND, released two days ago in response to a collection of potential security exploits covered by a group of CVEs. As always, we strive to deploy security fixes in our network as quickly as possible, and we rolled this new version out to all of our recursive DNS cluster backend servers over the course of the day starting Thursday morning. The problem specifically is the removal of SIG(0) combined with a change in behavior for what are seen as “invalid” DNSSEC keys, which are now treated as validation failures instead of being skipped. We’re currently stuck between a rock and a hard place: a known potential cache poisoning vulnerability on one side, or a version that breaks an unknown number of domains still relying on SIG(0) on the other. More updates are forthcoming; we hope to have chosen a path forward to mitigate customer impact soon.

Update: We are in the process of rolling back the affected version across our name server clusters. It is our assessment that the additional complexity we believe is required for one of these potential cache poisoning attacks to succeed in our network justifies rolling back to the previous version, rather than the other choice, which was to entirely disable DNSSEC until the issues with the new version could be resolved. For additional clarity, this was originally brought to our attention by students and staff at usfca.edu, who found they were unable to resolve usfca.edu domains this morning; we are not sure how many other domains are affected, or whether this issue can rightly be blamed on the upstream DNS server administrators. The rollback should be completed shortly, and we’re sorry for any confusion or trouble this may have caused you today.  -Kelsey, Kevan and William

Update 2025-10-31: As it turns out, this problem ended up being a combination of several issues and was actually related to zones that contained a deprecated DNSSEC key type (RSASHA1), even if they also had a valid key. This was additionally confused by RHEL’s security policy framework, which triggered the new undesired behavior in BIND. We are investigating several workaround solutions for this, but we also expect that BIND will release an update that corrects this behavior relatively soon.
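As a rough illustration of the kind of check involved, the sketch below uses dnspython (again, an assumed tool, not the tooling we use internally) to list the DNSKEY algorithms a zone publishes and flag the deprecated RSASHA1 family described above. The zone name is taken from the earlier update; substitute any domain you care about.

    import dns.dnssec
    import dns.resolver  # pip install dnspython

    # RSASHA1-family algorithms, deprecated and rejected under stricter crypto policies.
    DEPRECATED = {
        dns.dnssec.Algorithm.RSASHA1,
        dns.dnssec.Algorithm.RSASHA1NSEC3SHA1,
    }

    def dnskey_algorithms(zone: str) -> set:
        """Return the set of DNSKEY algorithms published for a zone."""
        answer = dns.resolver.resolve(zone, "DNSKEY")
        return {dns.dnssec.Algorithm(rr.algorithm) for rr in answer}

    algos = dnskey_algorithms("usfca.edu")
    if algos & DEPRECATED:
        print("zone still publishes RSASHA1-family keys:", sorted(a.name for a in algos))
    else:
        print("no deprecated key algorithms found:", sorted(a.name for a in algos))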

-SOC

webmail upgrade coming soon

Sonic will soon be rolling out an upgrade to https://webmail.sonic.net

You can preview these changes by going to https://webmail-beta.sonic.net

For more information, please use this link: https://forums.sonic.net/viewtopic.php?t=18376

Update: We’re aware of a post-upgrade issue with contacts in webmail.  Please follow https://forums.sonic.net/viewtopic.php?t=18376&start=40 for details.  A shorter update will be posted here as the situation evolves.

Update: A fix has been made on https://webmail.sonic.net, and anyone still experiencing problems should contact Sonic via a private message on the forums.sonic.net site.  Please see the post titled “We’ve applied a fix to the contacts issue” on forums.sonic.net for more details.

VPN Encryption TLS Update

On Friday, October 24th at 11am PST we will be upgrading our VPN cluster to raise the minimum supported TLS version from 1.1 to 1.2.  Our analysis shows that we do not have any active users connecting with TLS 1.1; however, we still wanted to provide this notice for your information. Thank you for your understanding, and thanks for choosing Sonic!
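If you’d like to confirm what your own client negotiates, here is a small sketch using Python’s standard ssl module. The hostname is a placeholder rather than an actual Sonic endpoint, and the exact behavior of your VPN client may differ.

    import socket
    import ssl

    # Placeholder endpoint; substitute the hostname your VPN client actually connects to.
    HOST, PORT = "vpn.example.net", 443

    context = ssl.create_default_context()
    # Mirror the post-maintenance policy on the client side: accept nothing older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2

    with socket.create_connection((HOST, PORT)) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            # After the change, the server will only complete handshakes at TLS 1.2 or newer.
            print("negotiated protocol:", tls.version())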

– System Operations team

Intermittent IMAP login issue.

Routine maintenance this evening caused unexpected load on some of our IMAP/POP mail servers from 11:15pm to 12:12am. During this time, some users may have experienced intermittent problems logging in. We believe the situation is resolved; our operations team will review the incident to reduce the impact of similar maintenance in the future.

-SOC