Monday, March 07, 2005

Network Reliability

There's been a flurry of news about VoIP network outages the first week of March: Light Reading and Om Malik's blog (and a commentary on Om's blog) seem to have the most information, unless you want to dive into the Vonage Forum or Broadband Reports VoIP forum.

If you read product datasheets for VoIP equipment (softswitches, media gateways, application servers, take your pick), you will be flooded with terms like "carrier grade", "99.999% availability", "fully redundant", "live software upgrades", and "no single points of failure". I've even seen a vendor claim of "99.99994% availability", which means that over a five-year deployed life, a system will be unavailable for about a minute and a half.
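As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The function name `downtime_seconds` and the 365.25-day year are my own assumptions for illustration, not anything taken from a vendor's methodology.

```python
# Convert an availability percentage into expected downtime.
# Assumption: a year is 365.25 days; names here are illustrative only.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds(availability_percent, years):
    """Expected seconds of unavailability over the given number of years."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * years * SECONDS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99994):
        secs = downtime_seconds(pct, years=5)
        print(f"{pct}% over 5 years: {secs:.0f} s ({secs / 60:.1f} min)")
```

Under those assumptions, 99.99994% works out to roughly 95 seconds over five years, which matches the "minute and a half" above, while the more common "five nines" allows about 26 minutes over the same period.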

So how do you reconcile all this extremely reliable equipment with networks that go down for hours at a time?

One possibility is simple: Vendors lie. Or, to put it less judgementally, vendor claims of equipment reliability are theoretical calculations that are not borne out in real-world deployment.

Given the historical perception that VoIP is less reliable than traditional phone service, there is clearly an element of marketing hype in vendors' reliability claims. But any vendor that tries to sell into a telco knows that their claims are going to be held up to some level of scrutiny, and the methodology they use to forecast reliability and availability had better be "generally accepted in the industry". For the most part, that means forecasts in accordance with methodologies and models in Telcordia Reliability and Quality Generic Requirements. While 90 seconds of downtime over a five-year lifespan (which realistically means that of 20 boxes deployed for five years, 19 never go down at all and one goes down for half an hour) may tax one's credulity, I don't think that there's a strong reason to believe that VoIP network equipment is intrinsically less reliable than traditional phone network equipment on a box-by-box level.
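The "19 of 20 boxes never go down" reading comes from pooling the downtime budget across a fleet. A rough sketch, assuming each box is independently allotted the same 99.99994% availability and a 365.25-day year (the variable names are illustrative):

```python
# Pool the per-box downtime budget across a fleet of identical boxes.
# Assumptions: 99.99994% availability per box, 5-year life, 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

availability = 0.9999994
years = 5
boxes = 20

per_box_seconds = (1.0 - availability) * years * SECONDS_PER_YEAR
fleet_total_minutes = boxes * per_box_seconds / 60

print(f"Per box: {per_box_seconds:.0f} s")
print(f"Fleet of {boxes}: {fleet_total_minutes:.1f} min in total")
```

If one box absorbs the whole fleet's budget, that is a single outage of about half an hour, with the other 19 boxes never failing at all.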

I think the actual reason is much more subtle and deep-seated, and is pervasive across equipment vendors and carriers. It's not one of equipment, or of engineering, but of culture, and of a mismatch between corporate culture and customer expectations.

"Traditional" telcos, and "traditional" vendors, have a culture of reliability above all. If this is your mindset, you rigorously test new software before deploying it - and you deploy it at 1:00 AM on a Saturday, not midday on a weekday. You have processes and procedures in place to deploy with well-defined checkpoints, safe stop points, and the ability to back out. Vendors with this mindset do their own rigorous regression testing before releasing new software to their customers.

This has its own downside - testing adds time and cost to software releases. Processes and procedures slow things down and require a different mindset. It's hard to "get an idea on Tuesday and deploy the service on Wednesday" if you have to integration test and regression test the software with everything in your network - let alone if you have interoperability with other networks to worry about.

Neither approach, reliability first or speed and new features first, is wrong. The problem comes about when the company's culture clashes with the customer's expectations (which, of course, mostly come about from the company's positioning of the product). If Skype stops working for a couple of hours, there's some grumbling but not a lot of repercussions - after all, it's free, the users are self-identified early adopters who tend to be tolerant of glitches, and they buy into the value trade-off that Skype presents them. If SkypeOut has problems, the grumbling escalates, because SkypeOut is a paid service, and the customers' expectations are higher. And a mass-marketed "phone service" that you can buy at Circuit City generates expectations of reliability and availability consistent with that of the "phone service" that people have been accustomed to for the last 75 years. If the culture of the company providing that service is more focused on providing low-cost service with new features than on providing reliable service, problems are inevitable.

Companies - both vendors and carriers - have to decide what they want to be, they have to communicate that message and those expectations to their customers, and they have to live them. If your marketing message is that you're a Phone Company, you've just bought yourself 130 years of history that people are expecting you to live up to. Those expectations can help you win a lot of customers. But if you're really trying to be a fast-moving, low-cost provider of disruptive technology for voice communications, and your corporate culture is based on that model, perhaps "Phone Company" isn't the message you want your customers to walk away with.