The cost of downtime in the enterprise environment rapidly adds up. In one survey, 40% of respondents suggested that just one hour of downtime cost their organization over USD 1m in losses. Ensuring that database services are consistently available is clearly worth the trouble.
It saves your organization large sums of money, not to mention smoothing over relationships with stakeholders of all shapes and sizes.
So how do you ensure continuous availability? The concept behind persistent availability is called high availability. In this article we outline what high availability is and how you can achieve it for your MySQL clusters.
We also point to a darker side of high availability, where sysadmins incorrectly rely on high availability to perform maintenance tasks – and explain why doing so undermines the aims of high availability, putting your enterprise operations at risk.
1. Introduction to high availability
2. What level of availability is high availability?
3. How does high availability work in practice?
4. Configuring high availability in MySQL
5. Good reasons to rely on high availability…
6. … and the wrong reason to rely on high availability
7. Do not rely on high availability for maintenance
8. Wrapping it up
Introduction to high availability
Let’s talk about availability first. There’s little point in running a service such as a database if it is not available to users most of the time. So, when we talk about availability, we refer to the degree to which a service is accessible.
For any functioning service it would reasonably be expected that it is available when needed – but some downtime would also be expected, a day or two a year or perhaps a couple of hours per month.
A service that is generally available might be just fine for many use cases, but where the service is critical in nature, or where a very large number of users depends on it, mere “availability” simply won’t cut it.
And that’s where high availability comes in. In the most basic terms, high availability ensures availability at a level higher than normally expected, and more specifically, at an agreed level even when allowing for maintenance, patching and general errors and glitches.
What level of availability is high availability?
There’s no agreed definition as to what qualifies as high availability, only that it exceeds what would generally be accepted as “available” in order to meet a specified (higher) availability requirement. In fact, your organization will likely define the availability it requires based on operational needs – weighing up the costs of high availability against the losses associated with downtime.
The level of availability you need can be expressed as a percentage. For example 99.99% or “four nines” availability entails a maximum of 52.60 minutes of downtime in a year, while “six nines” or 99.9999% availability limits downtime to 31.56 seconds in a year.
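To make the arithmetic behind the "nines" concrete, here is a minimal Python sketch that converts an availability percentage into a yearly downtime budget. It assumes an average year of 365.25 days, which is the basis that reproduces the figures quoted above; the function name `max_downtime_seconds` is purely illustrative.

```python
# Assumption: an average year of 365.25 days (31,557,600 seconds).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def max_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the downtime budget for two to six nines.
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        seconds = max_downtime_seconds(target)
        print(f"{target}% -> {seconds / 60:.2f} minutes ({seconds:.2f} seconds) per year")
```

Running the same calculation against your own availability target is a quick way to see how little room for maintenance and failures each additional nine leaves.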
Essentially, the choice is yours – but, again, there is a trade-off. Maintaining high availability will be expensive – requiring additional physical resources and software licenses and draining your staff resources too. However, you may well find that it is a price worth paying to avoid the knock-on costs of disruption, or the risks of losing revenue due to unhappy customers.
How does high availability work in practice?
The exact nature of your high availability infrastructure will depend on your workload. Broadly speaking, however, high availability is achieved through fault tolerance: even if a service or device fails, the workload is not interrupted. Typically, that means there is no single point of failure – all services and devices are fully redundant at both network and application level.
Depending on the service, this could typically involve a number of nodes – for example, your MySQL cluster will contain several nodes across which the data is stored. Multiple nodes are then combined with a load balancing tool so that, if one node fails, requests are simply directed to another node. Users will still access an available service, even if performance is slightly degraded.
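The failover behavior described here can be sketched in a few lines of Python. This is a toy illustration rather than how any particular load balancer is implemented: the class name `RoundRobinBalancer`, the node names and the health check callback are all hypothetical, and a real deployment would use a dedicated tool with proper health probes.

```python
from itertools import cycle
from typing import Callable, Iterable


class RoundRobinBalancer:
    """Toy load balancer: rotates through nodes and skips unhealthy ones."""

    def __init__(self, nodes: Iterable[str], is_healthy: Callable[[str], bool]):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._is_healthy = is_healthy  # hypothetical health check callback
        self._ring = cycle(self._nodes)

    def pick(self) -> str:
        """Return the next healthy node, or raise if every node is down."""
        # Try each node at most once per request.
        for _ in range(len(self._nodes)):
            node = next(self._ring)
            if self._is_healthy(node):
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    down = {"db2"}  # pretend this node has failed
    balancer = RoundRobinBalancer(["db1", "db2", "db3"], lambda n: n not in down)
    # Requests keep being served by the remaining nodes.
    print([balancer.pick() for _ in range(4)])
```

Note that the surviving nodes absorb the failed node's share of requests, which is why performance may degrade slightly even though the service stays available.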
Configuring high availability in MySQL
Your route to a high availability MySQL database will, of course, depend on your implementation of MySQL. Broadly speaking, you’ll need to create some type of MySQL cluster with multiple nodes – in other words, your data must reside on multiple MySQL servers.
Next, you’ll need a service that replicates data across these nodes, ensuring that every node carries an exact copy of the data contained in your database. Finally, you need a load balancer that directs database requests evenly across the nodes – ensuring, yes, a balanced load – but also ensuring that requests are met even if one node is offline.
For example, MySQL provides its own solution aimed at high availability: MySQL InnoDB Cluster. It is built on MySQL Group Replication, which is a popular way to ensure high availability in a MySQL database environment.
Another alternative is Galera, which has been offering MySQL high availability for many years. If you’re using the MariaDB fork of MySQL you could, for example, configure your MariaDB environment for high availability by running multiple nodes using Galera Cluster – while relying on HAProxy for load balancing. Alternatively, you could look into MariaDB’s own MaxScale product.
Good reasons to rely on high availability…
Enterprise-scale workloads increasingly make use of high availability principles simply because, in the long run, it delivers the best results. Here are just a few of the many good reasons why you should consider setting up high availability in your operations:
- Critical applications. Some applications simply cannot afford any downtime: think about military applications or energy networks. High availability is a must under these scenarios, and you have little choice but to ensure extremely high levels of availability – though you may still risk-assess and decide exactly how much of an availability guarantee you require.
- Knock-on effects. Where a system is at the core of a workload, even brief downtime can lead to much more widespread problems as connected and synchronized systems fall over in a cascade. Investing in high availability in a few core areas – such as a database – can be well worth the cost, given that much larger knock-on problems may be very difficult to recover from.
- Revenue loss. High availability, even if to a modest number of nines, can guard against revenue loss. For a major online retailer, just a few hours of lost sales combined with the associated reputational damage can lead to a very significant impact on the bottom line.
- Client expectations and SLAs. Your operations may be bound by service level agreements guaranteeing your clients a certain level of uptime. If that’s the case, you need to ensure that the services that support your clients’ workloads have the required level of uptime – and you will do so through high availability. Failing to do so can lead to penalties under your contracts, or even to their termination.
Those are just a few of the valid reasons for high availability – and, again, in today’s tech-first world there are many workloads that simply can’t operate without a high availability platform in place.
… and the wrong reason to rely on high availability
The increasing prevalence of high availability has, unfortunately, also led to its abuse. Because high availability makes systems so incredibly robust, tech teams can be tempted to take shortcuts when performing sysadmin tasks such as patching because the team assumes that high availability infrastructure will simply carry the burden of taking a machine offline.
In practice, it can quickly become more complicated. Take a MySQL cluster, for example. Yes, if you restart a machine to patch it, your MySQL cluster will remain running thanks to high availability. However, remember that the node you take down to patch and reboot misses every change made while it is offline. When it rejoins, it must ingest that backlog of data to resynchronize with the rest of the cluster, and this process can take a very long time to complete.
Needless to say, every database host must see the same data. The danger arises during resync: if another node goes down while you have already taken one node offline to patch it, you can end up losing a valid quorum. In other words, the number of servers that hold the “truth” about the data falls below an acceptable level. Recovering from such a state can be hard and complex, and can even lead to data loss.
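The quorum risk is easy to see with some simple arithmetic. The sketch below assumes a plain majority rule, where more than half of all cluster members must be online; real clustering products apply their own quorum rules (some support weighted votes, for instance), so treat the function names and the rule itself as illustrative assumptions.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed rule)."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, online_nodes: int) -> bool:
    """True if the online nodes still form a majority of the cluster."""
    return online_nodes >= quorum_size(total_nodes)


def failures_tolerated(total_nodes: int, online_nodes: int) -> int:
    """How many more nodes can fail before quorum is lost."""
    return max(online_nodes - quorum_size(total_nodes), 0)


if __name__ == "__main__":
    total = 3
    # Healthy cluster, then one node offline for patching, then one more failure.
    for online in (3, 2, 1):
        print(
            f"{online}/{total} online: quorum={has_quorum(total, online)}, "
            f"further failures tolerated={failures_tolerated(total, online)}"
        )
```

In a three-node cluster, taking one node down for patching leaves no margin at all: a single unplanned failure during the maintenance window is enough to lose quorum.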
Do not rely on high availability for maintenance
High availability is there to keep your systems up and running even when something fails. This inherent protection against failure is not a free pass to depend on the robustness of high availability in order to perform system maintenance in an irresponsible way, hoping that no-one would notice it.
Instead, technical teams should rely on other solutions – for example, setting up full redundancy for a system that is being patched rather than simply hoping that high availability infrastructure will absorb the pressure. Or, where possible, they can rely on live patching instead, and by doing so remove the need to restart a service to install a patch.
Nonetheless, dependence on high availability for maintenance jobs is showing worrying signs of becoming prevalent. Look around a bit and you’ll even find official vendor guidance that instructs users to depend on high availability to execute patching tasks, leaving users to simply hope that nothing goes wrong with other nodes while one node is taken offline for patching.
Wrapping it up
High availability is critical for a lot of applications – and highly beneficial for many others. Configured correctly, a MySQL database can offer virtually perfect availability, but that does not mean that tech teams can take availability for granted.
Abusing high availability architecture to take maintenance shortcuts just isn’t an option – the risks are greater than they may seem at first glance.
Instead, sysadmins should look to tried and tested alternatives – including redundancy and live patching – to perform maintenance operations without undermining the capabilities of high availability solutions.