If there’s one thing Microsoft, Spotify, Google Voice, and the Bank of England have in common, what would it be? All of them have recently been rocked by a major outage caused by an expired TLS certificate that significantly disrupted their operations and essential public services. The ripple effect of these outages reached far beyond the companies themselves, impacting countless users who rely on their services daily.
While the reasons these certificates expired unnoticed vary from case to case, we can all agree that the outages should have been avoided. Many organizations simply lack the framework, processes, and tools to properly manage certificates and prevent these costly disruptions. As our reliance on digital services increases, a single outage can trigger a cascade of failures. The stakes are too high. So, it’s time for all organizations to shift their focus from damage control to proactive solutions that eliminate single points of failure and stop outages before they start.
One area where prevention is key—and where many organizations, including those as large as Microsoft and Google, continue to struggle—is certificate expiry-related outages. Despite being a seemingly small piece of the cybersecurity stack, digital certificates are a surprisingly common cause of application outages.
According to our new research report with industry analyst ESG, operational interruptions caused by expired digital certificates are the number one concern facing organizations today when it comes to managing and securing non-human identities.
So, for the scope of this blog, let’s focus on certificate-related outages and explore solutions and best practices that can help prevent them.
But first, let’s look at two recent outages caused by expired digital certificates.
- On the night of September 22, 2024, Alaska Airlines suffered an IT outage that caused significant disruption to its operations, resulting in delayed flights. The airline had to issue a ground stop in Seattle for all Alaska Airlines flights to clear the aircraft congestion on the ground. The outage lasted for about two hours.
In the meantime, multiple users took to social media to complain of delays and problems with the airline’s app and website. Alaska apologized for the inconvenience and asked flyers either to check their flight status before leaving for the airport or to reschedule their flights if their plans allowed.
On Monday, the airline confirmed the outage in a statement to Reuters, adding that “the issue had been resolved, but it expects some residual impact to operations.”
Citing the cause of the outage, the airline said, “This was not a cyber attack or any kind of unauthorized activity. It was a certificate issue that impacted multiple systems.”
- On July 21, 2024, the Bank of England, the UK’s central bank, suffered a global outage of its automated high-value payment system, CHAPS. This led to the bank’s retail settlement systems halting transactions for more than an hour and a half.
The CHAPS system is a same-day payment system used to transfer large sums of money, typically to make high-value purchases, such as a car, or to pay a deposit for a house, while banks use it between themselves for low-value but critical payments. It is said that on average the CHAPS system handles about 4,000 housing transactions daily, amounting to over 360 billion pounds ($467 billion USD).
The July 21st outage was preceded by another massive outage of the CHAPS system three days before (July 18) that lasted more than four hours, leaving many homebuyers and movers stranded with loaded-up removal vans parked in driveways due to transactions not going through.
According to the latest report from Stack, the July 21st outage was caused by an expired SSL/TLS certificate. Reportedly, it was the bank’s second outage caused by a certificate issue. The first, on January 26, 2024, was a 39-minute outage of RTGS that stopped CHAPS and CREST settlement. The bank blamed that outage on a certificate authority issue.
Sometimes, All It Takes is One Expired Certificate!
Certificate-related outages have become increasingly common in recent times. Even large organizations with well-staffed IT teams continue to fall victim to these outages every now and then. In the last three years, we have seen giants like Starlink (SpaceX), Microsoft, Spotify, and Google Voice taken down by certificate-related outages.
The root cause of these outages can often be traced back to an expired digital certificate. Many organizations manage their digital certificates manually using spreadsheets, legacy home-grown tools, and multiple CA-provided solutions. While these approaches are adequate for a handful of certificates, they are not practical for the massive certificate inventories that organizations maintain today.
With non-human identities growing 20 times faster than human identities, the number of digital certificates is skyrocketing. Manually monitoring hundreds or thousands of certificates for expiry and ensuring that they are renewed on time has become hugely challenging for PKI and IT teams. The process is not only laborious but also time-consuming and error-prone.
Different teams often use separate processes for certificate management, independently sourcing certificates from various Certificate Authorities (CAs). Without a centralized tool to streamline the request and renewal process, it’s common for renewals to fall through the cracks and for certificates to expire unnoticed.
In industries like manufacturing, overlooking a single certificate expiry can shut off critical systems and trigger a massive outage, disrupting supply chain operations and causing substantial financial losses. In healthcare, patient services and care can be critically impacted. And in financial services, transactions between customers and institutions can be halted.
With spreadsheets and CA-provided CLM tools, visibility and management are also fragmented. This leaves PKI teams with little control over their certificates. When an outage does occur, locating and renewing the expired certificate becomes a scramble, like finding a needle in a haystack.
Certificate renewals are a massive headache with manual processes. They involve multiple time-consuming steps: enrolling for a new certificate, domain validation, provisioning the certificate to the correct endpoint, installation, and finally, endpoint binding. Additionally, renewals may require approval by upper management or a PKI administrator. Once approved, the Certificate Signing Request (CSR) must be sent to the Certificate Authority (CA). Getting a new certificate issued by the CA can take days (due to certificate and domain validation processes), making the whole process long and tedious.
In a fast-paced environment like DevOps, where developers need certificates issued quickly, manual renewal processes can be a roadblock, nudging them to take risky shortcuts like procuring certificates from unapproved CAs or using self-signed certificates.
Manual processes can also complicate certificate provisioning. Human intervention increases the likelihood of certificate misconfigurations, which can lead to unexpected outages as well as security weaknesses.
Digital certificates are designed to expire for security reasons, helping to prevent misuse and outdated identity information, while also prompting regular updates to keep an infrastructure secure. In line with this strategy, last year, Google proposed reducing the maximum validity period for public TLS certificates from 398 days (13 months) to 90 days (3 months) as part of its ‘Moving Forward, Together’ roadmap. Google is expected to introduce this change either in a future policy update of its Chrome Root Program or a Certificate Authority/Browser (CA/B) Forum Ballot Proposal.
Although shorter-lived certificates are great for security, this change would mean organizations renewing their public TLS certificates not once but at least four times a year. Imagine the workload that goes into identifying expiring certificates and carrying out the renewal and provisioning process for tens of thousands of certificates every 60 to 90 days! It is simply beyond the capacity of manual processes. Continuing to operate manually will only lead to more outages and security issues.
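To put rough numbers on that workload, here is a back-of-the-envelope sketch in Python. The fleet size of 10,000 certificates and the optional 30-day renewal buffer are illustrative assumptions, not figures from any report.

```python
DAYS_PER_YEAR = 365


def renewals_per_year(validity_days: int, renew_before_days: int = 0) -> float:
    """Average renewals per certificate per year.

    renew_before_days is a hypothetical safety buffer: each certificate is
    replaced that many days before it would actually expire.
    """
    cycle_days = validity_days - renew_before_days
    if cycle_days <= 0:
        raise ValueError("renewal buffer must be shorter than the validity period")
    return DAYS_PER_YEAR / cycle_days


FLEET_SIZE = 10_000  # assumed inventory size, for illustration only

# Compare today's 398-day maximum with a 90-day maximum, with and without
# an assumed 30-day early-renewal buffer.
for validity, buffer in ((398, 0), (90, 0), (90, 30)):
    per_cert = renewals_per_year(validity, buffer)
    print(
        f"{validity}-day validity, renewed {buffer} days early: "
        f"{per_cert:.1f} renewals per certificate, "
        f"about {round(per_cert * FLEET_SIZE):,} per year for the fleet"
    )
```

Under these assumptions, a 90-day validity period works out to roughly four renewals per certificate per year, and about six if every certificate is renewed 30 days early.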
What Can You Do to Prevent Certificate-Related Outages?
Steering clear of certificate expiry-related outages isn’t necessarily difficult. Following certificate management best practices and investing in a mature automated certificate lifecycle management solution are a great place to start.
- Gain Complete Visibility of the Certificate Landscape
Complete visibility of all certificates in your infrastructure is essential to stay on top of expiring certificates and to eliminate outages. Build and maintain a central inventory of certificates, along with the necessary information, such as their expiration date, certificate location, issuing CA (Certificate Authority), owner, and other metadata.
Having a holistic view of all certificates in your environment prevents blind spots and helps monitor certificates effectively for timely renewals. Even in the event of a sudden certificate expiration, ready access to certificate information helps quickly identify the expired certificate’s location and replace it to mitigate downtime.
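As a minimal sketch of what such an inventory might look like in code, the Python example below models one record per certificate and lists those expiring within a given window. The `CertificateRecord` fields, the `expiring_within` helper, and the sample hostnames, CA names, and dates are all hypothetical; a real inventory would be populated by discovery scans or CA integrations rather than typed in by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertificateRecord:
    common_name: str
    location: str        # host, load balancer, or application where it is installed
    issuing_ca: str
    owner: str           # team or person responsible for renewal
    not_after: datetime  # expiration timestamp (UTC)


def expiring_within(inventory, days, now=None):
    """Return records expiring within `days` (already-expired ones included), soonest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=days)
    due = [record for record in inventory if record.not_after <= cutoff]
    return sorted(due, key=lambda record: record.not_after)


# Made-up sample data, evaluated against a fixed reference date.
NOW = datetime(2024, 9, 1, tzinfo=timezone.utc)
INVENTORY = [
    CertificateRecord("api.example.com", "lb-01", "Example Public CA", "platform-team",
                      datetime(2024, 9, 22, tzinfo=timezone.utc)),
    CertificateRecord("intranet.example.com", "web-07", "Example Internal CA", "it-ops",
                      datetime(2025, 3, 1, tzinfo=timezone.utc)),
    CertificateRecord("payments.example.com", "app-12", "Example Public CA", "payments-team",
                      datetime(2024, 8, 30, tzinfo=timezone.utc)),
]

for record in expiring_within(INVENTORY, days=30, now=NOW):
    print(f"{record.common_name} on {record.location} (owner: {record.owner}) "
          f"expires {record.not_after:%Y-%m-%d}")
```

Keeping the location and owner next to the expiry date is what makes the list actionable: when a certificate does lapse, the same record says where it is installed and who has to replace it.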
- Automate Certificate Lifecycle Operations
Automation is an effective way of simplifying certificate lifecycle management (CLM). An automated CLM solution allows for auto-renewal of certificates based on pre-set policies and re-provisioning them to the right endpoint or application. Removing the need for human intervention accelerates the renewal process while ensuring that all certificates are correctly provisioned and installed.
An advanced automation solution can also help ensure that certificates are auto-renewed with newer and safer crypto standards for stronger security, which will be especially crucial for post-quantum cryptography transitions.
CLM automation also enables certificate self-service management. Self-service allows administrators to securely access, manage, and provision certificates from one centralized platform, streamlining processes and preventing ad hoc and fragmented certificate issuance.
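One simple way to picture a pre-set renewal policy is a rule that triggers renewal once a fixed share of a certificate’s lifetime has elapsed. The Python sketch below assumes a two-thirds threshold; that value and the function names `renewal_due_at` and `needs_renewal` are illustrative choices, not the behavior of any particular CLM product.

```python
from datetime import datetime, timedelta, timezone


def renewal_due_at(not_before: datetime, not_after: datetime,
                   renew_at_fraction: float = 2 / 3) -> datetime:
    """Point in the validity period at which renewal should begin."""
    if not 0 < renew_at_fraction < 1:
        raise ValueError("renew_at_fraction must be between 0 and 1")
    lifetime = not_after - not_before
    return not_before + lifetime * renew_at_fraction


def needs_renewal(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """True once the assumed policy threshold has been reached."""
    return now >= renewal_due_at(not_before, not_after)


# A hypothetical 90-day certificate issued on January 1, 2024.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

print("renewal starts:", renewal_due_at(issued, expires).date())
print("due on Feb 15?", needs_renewal(issued, expires, datetime(2024, 2, 15, tzinfo=timezone.utc)))
print("due on Mar 15?", needs_renewal(issued, expires, datetime(2024, 3, 15, tzinfo=timezone.utc)))
```

Expressing the threshold as a fraction rather than a fixed number of days means the same rule still leaves a proportionate safety margin if validity periods shrink.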
- Enforce Policy Control
Enforcing strict PKI policies can greatly help streamline certificate issuance and management across all business units. Policy-driven management ensures the use of best practices in terms of compliant and approved CAs, recommended crypto-standards, validity periods, and trust levels, in turn, ensuring compliance with internal, industry, and regulatory standards and mandates.
Implement role-based access control (RBAC) to regulate permissions and provide the right level of access to certificates and keys to the right roles. Create audit trails to log every certificate and key-related activity for granular control and easy auditing. Generate periodic reports to detect anomalies, eliminate non-compliant certificates, and simplify auditing.
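To make policy-driven issuance concrete, the following Python sketch checks a certificate request against a small policy covering approved CAs, minimum key size, and maximum validity. The `IssuancePolicy` and `CertificateRequest` types, the CA names, and the thresholds are assumptions made for illustration; an actual policy engine would cover far more attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssuancePolicy:
    approved_cas: frozenset
    min_rsa_key_bits: int
    max_validity_days: int


@dataclass(frozen=True)
class CertificateRequest:
    common_name: str
    issuing_ca: str
    key_algorithm: str
    key_bits: int
    validity_days: int


def policy_violations(request: CertificateRequest, policy: IssuancePolicy) -> list:
    """Return human-readable violations; an empty list means the request is compliant."""
    violations = []
    if request.issuing_ca not in policy.approved_cas:
        violations.append(f"CA '{request.issuing_ca}' is not on the approved list")
    if request.key_algorithm == "RSA" and request.key_bits < policy.min_rsa_key_bits:
        violations.append(
            f"RSA key of {request.key_bits} bits is below the "
            f"{policy.min_rsa_key_bits}-bit minimum"
        )
    if request.validity_days > policy.max_validity_days:
        violations.append(
            f"validity of {request.validity_days} days exceeds the "
            f"{policy.max_validity_days}-day maximum"
        )
    return violations


# Hypothetical policy and a request that breaks all three rules.
POLICY = IssuancePolicy(
    approved_cas=frozenset({"Example Public CA", "Example Internal CA"}),
    min_rsa_key_bits=2048,
    max_validity_days=398,
)
REQUEST = CertificateRequest("dev.example.com", "Unknown CA", "RSA", 1024, 730)

for problem in policy_violations(REQUEST, POLICY):
    print("REJECTED:", problem)
```

Returning every violation at once, rather than stopping at the first, also gives the audit trail a complete record of why a request was refused.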
Eliminate the Risk of Outages with AppViewX AVX ONE CLM
AppViewX AVX ONE is the most advanced SaaS certificate lifecycle management (CLM) and PKI platform for enterprise PKI, IAM, security, DevOps, cloud, platform, and application teams. With visibility, automation, and policy control of certificates and keys, AVX ONE CLM streamlines certificate lifecycle management end-to-end and enables crypto-agility, minimizing the risk of outages and security breaches.
AVX ONE CLM has purpose-built features designed to mitigate outages, including:
- Complete Visibility: One central, comprehensive inventory of all public and private trust certificates. Useful certificate-related insights and dashboards featuring crypto health scores and Google 90-Day Readiness to mitigate security weaknesses.
- Closed-loop Automation: Automation that extends well beyond just “auto-renewal” and ensures accurate provisioning, installation and end-point binding. Powered by automation workflows, auto-enrollment protocols and REST APIs, organizations can automate CLM for various use cases such as hybrid/multi-cloud, DevOps, and IoT.
- Continuous Control: Zero touch policy enforcement to eliminate rogue and non-compliant certificates. Granular RBAC, approval workflows, intelligent notifications, audit trails, reporting, and a robust policy engine for security-approved and compliant certificate issuance and management at all times.
Talk to an AppViewX expert today for a demo on how to quickly begin automating certificate lifecycle management to prevent outages and prepare for the upcoming 90-day TLS validity change.
*** This is a Security Bloggers Network syndicated blog from Blogs Archive - AppViewX authored by Krupa Patil. Read the original post at: https://www.appviewx.com/blogs/dont-let-an-expired-certificate-cause-critical-downtime-prevent-outages-with-a-smart-clm/