I’ve been involved in developing and operating SaaS applications for a while now. One question comes up again and again: what uptime do you guarantee? Some customers simply demand 99.99% uptime, but I’m not sure they realize what they are asking for.

First, some facts:

99.99% uptime means a maximum of 52.56 minutes of downtime per year (about 1.01 minutes per week)
Uptime is not the same as availability
There does not seem to be a good definition for measuring uptime
It is rarely clear what happens when the service fails to meet the promised uptime
Scheduled downtime is taken out of the equation

Uptime: year, month, week?

The uptime is typically stated per year. However, depending on the spread of those 52.56 minutes of ‘acceptable’ downtime, and depending on the time of day, the impact can be wildly different. If the system goes down for 50+ minutes in one go at a time when most users would be active, this can be a disaster.

If the system goes down for 50+ minutes on January 1st between midnight and 1am, while nobody is using the system, the downtime does not matter.
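To make the choice of period concrete, here is a small sketch in Python that converts an uptime percentage into a downtime budget per period. The function name and the 30-day month are my own assumptions for illustration; an SLA may define its periods differently.

```python
# Downtime budget implied by an uptime target, per measurement period.
# Period lengths are illustrative assumptions (non-leap year, 30-day month).
MINUTES_PER_PERIOD = {
    "year": 365 * 24 * 60,
    "month": 30 * 24 * 60,
    "week": 7 * 24 * 60,
    "day": 24 * 60,
}


def downtime_budget_minutes(uptime_percent: float, period: str) -> float:
    """Return the maximum downtime (in minutes) allowed in one period."""
    return MINUTES_PER_PERIOD[period] * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for period in MINUTES_PER_PERIOD:
        budget = downtime_budget_minutes(99.99, period)
        print(f"{period:>5}: {budget:.2f} minutes")
    # Prints:
    #  year: 52.56 minutes
    # month: 4.32 minutes
    #  week: 1.01 minutes
    #   day: 0.14 minutes
```

Measured per year, a single 50-minute outage still fits within the budget. Measured per week, the same outage exceeds the budget almost fifty times over.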

Uptime versus availability

Even if a provider holds true to its 99.99% promise, uptime does not mean a lot when the system cannot be reached because of network problems. Providers typically make no promises about availability, because they cannot guarantee anything about the network between their data center and the user. At best they can give an availability guarantee from the entrance of the data center onwards.

I know from personal experience that you can have the best data center in the world, with fully redundant network connections, a contingency data center in standby, and cutting edge monitoring tools in place, and still be confronted with not being available because one of the main backbones in the country is defective. Not a problem in the provider’s network, not a problem at the customer’s site, but the user experience is one of a system that is not available.

Measuring uptime

How does one measure uptime? Most monitoring tools use some form of polling to check for signs of life. Suppose this polling happens once per minute. It is perfectly possible for a web server to stop and restart within that minute, so the service can be down without the monitoring tool ever noticing.

Increasing the frequency is not always an option. But suppose it is. How often do you need to measure to be sure? Once per second? Per millisecond? Per microsecond?
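The blind spot is easy to show with a small simulation. This is only a sketch: the outage times are made up, and I assume a monitor that counts each failed poll as one full polling interval of downtime, which is not necessarily how any particular tool works.

```python
def actual_downtime_seconds(outages):
    """Total real downtime for a list of (start, end) outages in seconds."""
    return sum(end - start for start, end in outages)


def measured_downtime_seconds(outages, poll_interval, window):
    """Downtime as reported by a poller that checks every poll_interval seconds.

    Assumption: each failed poll is counted as one whole interval of downtime.
    Outages are half-open intervals [start, end) within [0, window).
    """
    def is_down(t):
        return any(start <= t < end for start, end in outages)

    failed_polls = sum(1 for t in range(0, window, poll_interval) if is_down(t))
    return failed_polls * poll_interval


if __name__ == "__main__":
    # Hypothetical hour with two short outages that fall between minute polls.
    outages = [(70, 110), (1805, 1835)]
    window = 3600

    print(actual_downtime_seconds(outages))                # 70
    print(measured_downtime_seconds(outages, 60, window))  # 0
    print(measured_downtime_seconds(outages, 1, window))   # 70
```

With one poll per minute the monitor reports a perfect hour, even though the service was down for 70 seconds. That is already more than the weekly budget for 99.99%.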

Not meeting the SLA

An SLA often states uptime requirements, but what happens if there is more downtime than promised? I haven’t seen many SLAs that offer realistic compensation for missed uptime targets. Some providers simply apologize for the inconvenience. Others may credit you for the lost time, that is, you will be billed a day later, or for a somewhat smaller amount.

Scheduled downtime

And last but not least: the infamous scheduled downtime. Most providers plan maintenance windows, and most do not count that scheduled downtime against the uptime measurement.

My online banking system is down for a night per month! They schedule this downtime on Saturday night, because most users don’t pay their bills at that time. Guess what? Saturday night happens to be the most convenient time for me to pay my bills. I don’t care about 99.99% availability if I cannot pay my bills when I want to!
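To see how much the exclusion flatters the number, here is a rough calculation. The figures are assumptions, not my bank’s actual numbers: I take “a night” to be 8 hours, twelve times a year, plus half an hour of unplanned downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def availability_percent(unscheduled_hours, scheduled_hours, count_scheduled):
    """Yearly availability, with or without scheduled maintenance counted."""
    if count_scheduled:
        # The user's view: every hour the system is down counts.
        down = unscheduled_hours + scheduled_hours
        total = HOURS_PER_YEAR
    else:
        # The typical SLA view: maintenance windows are taken out of the
        # measured period altogether.
        down = unscheduled_hours
        total = HOURS_PER_YEAR - scheduled_hours
    return 100 * (1 - down / total)


if __name__ == "__main__":
    scheduled = 12 * 8   # assumed: one 8-hour night per month
    unscheduled = 0.5    # assumed: 30 minutes of unplanned downtime

    print(f"{availability_percent(unscheduled, scheduled, False):.2f}%")  # 99.99%
    print(f"{availability_percent(unscheduled, scheduled, True):.2f}%")   # 98.90%
```

Under these assumptions the provider can report 99.99%, while from where I sit the system was available about 98.9% of the time.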

Conclusion

It’s good for providers to aim for high uptime, and preferably for high availability. It would help if the definition of uptime were somewhat more objective, and it would definitely help if scheduled maintenance were counted as downtime. Push providers towards zero-downtime deployments, or very short maintenance windows. After all, we are living in a 24/7 economy…
