The Real Failure Rate of EBS — PlanetScale


By Nick Van Wiggeren

PlanetScale has deployed millions of Amazon Elastic Block Store (EBS) volumes across the world. We create and destroy tens of thousands of them every day as we stand up databases for customers, take backups, and test our systems end-to-end. Through this experience, we have a unique vantage point into the failure rate and failure mechanisms of EBS, and have spent a lot of time working on how to mitigate them.

In complex systems, failure isn’t a binary outcome. Cloud-native systems are built without single points of failure, but partial failure can still result in degraded performance, loss of user-facing availability, and undefined behavior. Often, an insignificant failure in one part of the stack emerges as a full failure in others.

For example, if a single instance inside a multi-node distributed caching system runs out of networking resources, the downstream application will report errors as cache misses to avoid failing the request. This overwhelms the database when the application floods it with queries to fetch data as though it were missing. In this case, a partial failure at one level results in a full failure of the database tier, causing downtime. A sketch of that cascade follows below.
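As a rough illustration of the cascade described above (a sketch with made-up names, not code from this post), here is a read path that treats any cache error as a miss and falls through to the database:

```go
package main

import (
	"errors"
	"fmt"
)

// Cache and DB stand in for a distributed cache client and a database client.
type Cache interface{ Get(key string) (string, error) }
type DB interface{ Query(key string) (string, error) }

// read treats any cache error, whether a genuine miss or a degraded cache
// node, as a miss so the request itself never fails. When a cache node is
// partially failed, every request takes the fallback path and the database
// absorbs the full read load.
func read(c Cache, db DB, key string) (string, error) {
	if val, err := c.Get(key); err == nil {
		return val, nil
	}
	return db.Query(key)
}

// Minimal fakes so the sketch runs on its own.
type degradedCache struct{}

func (degradedCache) Get(string) (string, error) { return "", errors.New("i/o timeout") }

type directDB struct{}

func (directDB) Query(key string) (string, error) { return "value-for-" + key, nil }

func main() {
	val, _ := read(degradedCache{}, directDB{}, "user:42")
	fmt.Println(val) // every request now hits the database
}
```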

While full failure and data loss are very uncommon with EBS, “slow” is frequently as bad as “failed”, and that happens much, much more frequently.

Here’s what “slow” looks like, from the AWS Console:

This volume had been operating steadily for at least 10 hours. AWS reported it at 67% idle, with write latency measuring in single-digit ms/operation. Well within expectations. Suddenly, at around 16:00, write latency spikes to 200-500 ms/operation, idle time drops to zero, and the volume is effectively blocked from reading and writing data.

To the application running on top of this database, this is a failure. To the user, it is a 500 response on a webpage after a 10-second delay. To you, it is an incident. At PlanetScale, we treat this as a full failure because our customers do.

The EBS documentation is helpful in understanding what guarantees AWS’ gp3 is able to make:

When attached to an EBS-optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year.

This means a volume is expected to deliver less than 90% of its provisioned performance 1% of the time. That’s about 14 minutes of every day, or roughly 88 hours out of the year, of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD. It is the cost of separating storage and compute, and of the sheer complexity of the software and networking components that sit between the client and the backing disks for the volume.
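A quick back-of-envelope check of those figures (Go used here purely as a calculator):

```go
package main

import "fmt"

func main() {
	// gp3 volumes may deliver under 90% of provisioned IOPS up to 1% of the time.
	const degradedFraction = 0.01

	minutesPerDay := 24.0 * 60.0
	hoursPerYear := 365.0 * 24.0

	fmt.Printf("per day:  %.1f minutes\n", degradedFraction*minutesPerDay) // ~14.4 minutes
	fmt.Printf("per year: %.1f hours\n", degradedFraction*hoursPerYear)    // ~87.6 hours
}
```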

In our experience, the documentation is right: sometimes volumes pass in and out of their provisioned performance in small time windows:

However, even these short windows are enough to impact real-time workloads:

Production systems are not built to handle this level of sudden variance. When there are no guarantees, even overprovisioning doesn’t fix the problem. If this were a once-in-a-million chance, it would be different, but as we’ll discuss below, that is far from the case.

At PlanetScale, we see failures like this on a daily basis – the rate of failure is common enough that we’ve built systems that monitor EBS volumes directly to reduce the impact.

This is not a secret; it’s in the documentation. AWS doesn’t describe how failure is distributed for gp3 volumes, but in our experience degradation tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a networking or compute component.

Let’s assume the following: each degradation event is random, meaning the level of reduced performance lands somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!
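Here is that arithmetic spelled out, under the same assumptions (1% of time degraded, 10-minute events, degraded performance uniformly distributed between 1% and 89% of provisioned, downtime whenever it falls below the 50% threshold). The uniform assumption gives roughly 24 downtime-causing events, so “at least 21” is the conservative count:

```go
package main

import "fmt"

func main() {
	const (
		degradedFraction = 0.01 // up to 1% of the time below provisioned performance
		eventMinutes     = 10.0 // assumed length of a single degradation event
		minPerf          = 0.01 // assumed floor: 1% of provisioned performance
		maxPerf          = 0.89 // assumed ceiling: 89% of provisioned performance
		threshold        = 0.50 // the application tolerates down to 50% of expected throughput
	)

	minutesPerMonth := 30.0 * 24.0 * 60.0
	eventsPerMonth := degradedFraction * minutesPerMonth / eventMinutes

	// Share of events that fall below the 50% threshold if degraded
	// performance is uniform on [minPerf, maxPerf].
	downtimeShare := (threshold - minPerf) / (maxPerf - minPerf)

	fmt.Printf("degradation events per month: %.1f\n", eventsPerMonth)               // ~43
	fmt.Printf("downtime-causing events:      %.1f\n", eventsPerMonth*downtimeShare) // ~24
}
```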

In a large database composed of many shards, this failure compounds. Assume a 256-shard database where each shard has one primary and two replicas: a total of 768 gp3 EBS volumes provisioned. If we take the 50% threshold from above, there is a 99.65% chance that at least one node is experiencing a production-impacting event at any given time.

Even if you use io2, which AWS sells at 4x to 10x the price, you’d still expect to be in a failure condition roughly one third of the time in any given year, on just that one database!
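The compounding works out to 1 − (1 − p)^n for n volumes, where p is the probability that a single volume is in a production-impacting state at a given moment. The exact value of p depends on how severe a degradation has to be to count, so the sketch below simply assumes half of degradation events are severe enough; with that assumption it lands near the figures above (high-90s percent for gp3, roughly a third for io2):

```go
package main

import (
	"fmt"
	"math"
)

// pAnyImpacted is the probability that at least one of n volumes is in a
// production-impacting state at a given instant, assuming independent volumes
// with per-volume probability p.
func pAnyImpacted(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	const volumes = 256 * 3 // 256 shards x (1 primary + 2 replicas) = 768

	// Assumed inputs: gp3 may run degraded up to 1% of the time, io2 is
	// targeted at 99.9% (i.e. ~0.1% degraded). Only about half of those
	// degradations are assumed severe enough to breach the 50% threshold.
	severeShare := 0.5
	gp3 := 0.01 * severeShare
	io2 := 0.001 * severeShare

	fmt.Printf("gp3, %d volumes: %.1f%%\n", volumes, 100*pAnyImpacted(gp3, volumes)) // ~98%
	fmt.Printf("io2, %d volumes: %.1f%%\n", volumes, 100*pAnyImpacted(io2, volumes)) // ~32%
}
```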

To make matters worse, we also regularly see these as correlated failures inside a single zone, even with io2 volumes:

With enough volumes, the rate of experiencing EBS failure is 100%: our automated mitigations are constantly recycling underperforming EBS volumes to reduce customer impact, and we expect to see multiple events on a daily basis.

That’s the real failure rate of EBS: it’s constant, variable, and all by design. Because there are no performance guarantees when volumes are not operating to their specifications, it is extremely difficult to plan around for workloads that require consistent performance. You can pay for additional nines, but with enough drives over a long enough timeframe, failure is guaranteed.

At PlanetScale, our mitigations have clamped down on the expected peak duration of an impact window. We monitor metrics such as read/write latency and idle % closely, and we’ve even added basic tests like making sure we can write to a file. This lets us react quickly to performance issues and ensures that an EBS volume isn’t “stuck”.
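A minimal sketch of that kind of write probe (hypothetical paths and thresholds, not PlanetScale’s actual checks): write and sync a small file on the EBS-backed filesystem, and treat a slow or failed write as a signal that the volume may be degraded.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// probeWrite returns an error if creating, writing, and syncing a small file
// under dir fails or takes longer than timeout, a crude signal that the
// backing volume is degraded. The path and timeout here are assumptions.
func probeWrite(dir string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		path := filepath.Join(dir, ".ebs-health-probe")
		f, err := os.Create(path)
		if err != nil {
			done <- err
			return
		}
		defer os.Remove(path)
		defer f.Close()
		if _, err := f.Write([]byte("probe")); err != nil {
			done <- err
			return
		}
		done <- f.Sync() // force the write through the page cache to the volume
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("write probe on %s exceeded %s", dir, timeout)
	}
}

func main() {
	// Example: probe the (assumed) data directory with a 500ms budget.
	if err := probeWrite("/var/lib/mysql", 500*time.Millisecond); err != nil {
		fmt.Println("volume may be degraded:", err)
	}
}
```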

When we detect that an EBS volume is in a degraded state using these heuristics, we can perform a zero-downtime reparent in seconds to another node in the cluster and automatically bring up a replacement volume. This doesn’t reduce the impact to zero, since it’s impossible to detect this failure before it happens, but it does ensure that the majority of cases don’t require a human to remediate and are over before users notice.
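And a sketch of the decision loop implied here, with placeholder hooks for the health check and the reparent (assumed logic, not PlanetScale’s implementation): act only after several consecutive failed checks, then hand off the primary role and queue the suspect volume for replacement.

```go
package main

import (
	"log"
	"time"
)

const (
	checkInterval = 5 * time.Second
	failThreshold = 3 // consecutive failed checks before acting
)

// checkVolume and triggerReparent are hypothetical hooks: a real system would
// inspect read/write latency and idle %, run the file-write probe above, and
// call the cluster's actual reparent mechanism.
func checkVolume() error { return nil }

func triggerReparent() error {
	log.Println("reparenting primary to a healthy replica")
	return nil
}

func scheduleVolumeReplacement() {
	log.Println("provisioning a replacement volume")
}

func main() {
	failures := 0
	for range time.Tick(checkInterval) {
		if err := checkVolume(); err != nil {
			failures++
			log.Printf("health check failed (%d/%d): %v", failures, failThreshold, err)
			if failures >= failThreshold {
				if triggerReparent() == nil {
					scheduleVolumeReplacement()
					failures = 0
				}
			}
		} else {
			failures = 0
		}
	}
}
```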

This is why we built PlanetScale Metal. With a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, a failure is contained to the node it occurs on, and the rest of the shards and nodes in a database are able to continue operating without problem.


