Thursday, July 19, 2007

Scalable fault-tolerant upgradable systems Part 1

let's talk about servers which are:
  • Scalable
  • fault-tolerant
  • Dynamically Upgradable
Q: Are these really the same thing?

A: Well not really, but they are very similar.

A system that is fault-tolerant can easily be made scalable and easily made so that we can do in-service upgrade.

Here's how:


In-service Upgrade.

Assume we have N nodes running version one of a program - we want to upgrade to version two with no loss of service.

Foreach node K in nodes:
  1. migrate the traffic on node K to some other node
  2. stop node K
  3. upgrade the software
  4. migrate the traffic back to node K

Algorithm 2

Scale up system

To add a new node to the system:

  • Find some busy node
  • migrate half the traffic on the node to the new node
Algorithm 3

Make fault-tolerant system

Assume we have N nodes connected in a ring. If node N crashes we will recover on node N+1 (or the first node if N was the last node)

In operation we have to make sure that node N+1 has enough information about node N to take over from node N if node N crashes.

In practise we would send an asynchronous stream of messages from N to N+1 containing enough information to recover if things go wrong.


These algorithms are very similar. They can all be built with a small number of primitives. The primitives must:
  • detect failure
  • move state from one node to another
Q: What are the state changes that must be moved?

A: Enough state to resume the operation on a new node.

Q: Could dynamic code upgrade be viewed as a special case of failure?

A: Yes - here's how

Algorithm 4

Dynamic code upgrade

Apply Algorithm 3 to make a fault-tolerant system.

For any node running old code:
  1. crash the node
  2. restart the node
  3. put new code on the node
  4. make the node available
As the new node becomes available Algorithm two (make system scalable) applies

What do you have to think about?

When designing a system for fail-over, scalability, dynamic code upgrade we have to think about the following:
  1. What information do I need to recover from a failure?
  2. How can we replicate the information we need to recover from a failure?
  3. How can we mask failures/code_upgrades/scaling operations from the clients

This is part of the essential analysis that we have to perform if we want to make a highly reliable system that is scalable and which can be upgrade with zero loss of service.

In part 2 - I'll talk about how to mask failures from the clients. Do we use IP-fail-over techniques, or some other technique?

When it comes to programming, you'll want to implement this all in Erlang - Erlang has the correct set of primitives for failure detection (links) and for stable storage (replicated amnesia tables) which make programming this a not too daunting task.


Matt Williamson said...

Hi Joe, I first heard of erlang when I was setting up ejabberd for my company around 1.5 years ago. I'd heard of jabber recently, at the time, researched it and implemented ejabberd. I was very impressed with it and the name erlang stuck with me.

Remembering the name 'erlang', I decided to read about it a couple months ago and got very excited at what it can do! Right after, I started learning it from docs at and I just get more and more excited about the language as I learn it.

Out of some sort of coincidence, I saw on Amazon, that you were publishing a book and that it was on pre-order. Strangely I remembered it on Monday and purchased it the next day. I'm very happy with the purchase and I think that it's quite possible that erlang will shake the foundations of the programming world and I hope that it does.

jasonwatkinspdx said...

What about upgrades that require changes to the protocol between nodes?