Thursday, July 19, 2007

Scalable fault-tolerant upgradable systems Part 1

Let's talk about servers which are:
  • Scalable
  • Fault-tolerant
  • Dynamically upgradable
Q: Are these really the same thing?

A: Well, not really, but they are very similar.

A system that is fault-tolerant can easily be made scalable, and can easily be made to support in-service upgrades.

Here's how:


Algorithm 1

In-service upgrade

Assume we have N nodes running version one of a program - we want to upgrade to version two with no loss of service.

For each node K in the set of nodes:
  1. migrate the traffic on node K to some other node
  2. stop node K
  3. upgrade the software
  4. migrate the traffic back to node K
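
The loop above can be sketched as a toy model in Python (not a real cluster: `Node`, `migrate`, and `rolling_upgrade` are names invented for this example):

```python
class Node:
    def __init__(self, name, version, sessions):
        self.name = name
        self.version = version
        self.sessions = list(sessions)   # live traffic handled by this node
        self.running = True

def migrate(sessions, src, dst):
    """Move the given sessions from node src to node dst."""
    for s in sessions:
        src.sessions.remove(s)
        dst.sessions.append(s)

def rolling_upgrade(nodes, new_version):
    """Upgrade every node to new_version with no loss of service."""
    for i, node in enumerate(nodes):
        spare = nodes[(i + 1) % len(nodes)]   # "some other node"
        moved = list(node.sessions)           # remember what we displaced
        migrate(moved, node, spare)           # 1. move the traffic away
        node.running = False                  # 2. stop the node
        node.version = new_version            # 3. upgrade the software
        node.running = True                   #    restart on the new version
        migrate(moved, spare, node)           # 4. move the traffic back
```

At every point in the loop some node is serving each session, which is the whole point of the exercise.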

Algorithm 2

Scale up system

To add a new node to the system:

  • Find some busy node
  • Migrate half the traffic on that node to the new node
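
A minimal sketch of this scale-up step, with nodes modelled as plain dictionaries (all names are invented for the example):

```python
def scale_up(nodes, new_node):
    """Add new_node to the cluster by splitting the busiest node's load."""
    busy = max(nodes, key=lambda n: len(n["sessions"]))  # find some busy node
    half = len(busy["sessions"]) // 2
    new_node["sessions"] = busy["sessions"][half:]       # migrate half the traffic
    busy["sessions"] = busy["sessions"][:half]
    nodes.append(new_node)
```

Repeating this whenever some node gets too busy grows the system one node at a time.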
Algorithm 3

Make a fault-tolerant system

Assume we have N nodes connected in a ring. If node K crashes we will recover on node K+1 (or on the first node if K was the last node).

In operation we have to make sure that node K+1 has enough information about node K to take over from node K if node K crashes.

In practice we would send an asynchronous stream of messages from K to K+1 containing enough information to recover if things go wrong.
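
Here is a toy model of the ring idea. In the sketch, replication happens as a direct call; a real system would send the update as an asynchronous message, as described above. All names are invented:

```python
class RingNode:
    def __init__(self, name):
        self.name = name
        self.state = {}     # this node's own state
        self.replica = {}   # a copy of the predecessor's state

def build_ring(names):
    """Create one RingNode per name; successor of the last is the first."""
    return [RingNode(n) for n in names]

def update(nodes, i, key, value):
    """Apply a state change on node i and stream it to its successor."""
    nodes[i].state[key] = value
    successor = nodes[(i + 1) % len(nodes)]
    successor.replica[key] = value   # stands in for the async message

def take_over(nodes, i):
    """Node i has crashed: its successor resumes from the replica."""
    successor = nodes[(i + 1) % len(nodes)]
    successor.state.update(successor.replica)
    successor.replica = {}
```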


These algorithms are very similar. They can all be built with a small number of primitives. The primitives must:
  • detect failure
  • move state from one node to another
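
A minimal sketch of the two primitives, assuming heartbeat-based failure detection with an invented timeout (Erlang's links would do the detection natively):

```python
TIMEOUT = 3.0   # invented: seconds of silence before we declare failure

def detect_failures(last_heartbeat, now):
    """Primitive 1: the set of nodes considered failed at time `now`."""
    return {node for node, t in last_heartbeat.items() if now - t > TIMEOUT}

def move_state(state, src, dst):
    """Primitive 2: move src's state to dst so dst can resume its work."""
    dst_state = state.setdefault(dst, {})
    dst_state.update(state.pop(src, {}))
```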
Q: What are the state changes that must be moved?

A: Enough state to resume the operation on a new node.

Q: Could dynamic code upgrade be viewed as a special case of failure?

A: Yes - here's how

Algorithm 4

Dynamic code upgrade

Apply Algorithm 3 to make a fault-tolerant system.

For each node running old code:
  1. crash the node
  2. restart the node
  3. put new code on the node
  4. make the node available
As the upgraded node becomes available, Algorithm 2 (scale up the system) applies.
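
A toy sketch of this upgrade-by-crashing loop (while a node is down, the fail-over mechanism of Algorithm 3 covers for it; the names here are invented):

```python
class Node:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.available = True

def upgrade_by_crashing(nodes, new_version):
    """Treat code upgrade as a deliberate crash followed by recovery."""
    for node in nodes:
        if node.version != new_version:   # node is running old code
            node.available = False        # 1-2. crash and restart the node
            node.version = new_version    # 3. put new code on the node
            node.available = True         # 4. make the node available
```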

What do you have to think about?

When designing a system for fail-over, scalability, and dynamic code upgrade we have to think about the following:
  1. What information do we need to recover from a failure?
  2. How can we replicate the information we need to recover from a failure?
  3. How can we mask failures, code upgrades, and scaling operations from the clients?

This is part of the essential analysis that we have to perform if we want to make a highly reliable system that is scalable and which can be upgraded with zero loss of service.

In part 2 I'll talk about how to mask failures from the clients. Do we use IP fail-over techniques, or some other technique?

When it comes to programming, you'll want to implement all of this in Erlang - Erlang has the correct set of primitives for failure detection (links) and for stable storage (replicated Mnesia tables), which makes programming this a not too daunting task.