Thursday, July 19, 2007

Scalable fault-tolerant upgradable systems Part 1

Let's talk about servers which are:
  • Scalable
  • Fault-tolerant
  • Dynamically upgradable
Q: Are these really the same thing?

A: Well not really, but they are very similar.

A system that is fault-tolerant can easily be made scalable, and can just as easily be made to support in-service upgrade.

Here's how:


Algorithm 1

In-service upgrade

Assume we have N nodes running version one of a program - we want to upgrade to version two with no loss of service.

For each node K in nodes:
  1. migrate the traffic on node K to some other node
  2. stop node K
  3. upgrade the software and restart node K
  4. migrate the traffic back to node K
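
The loop above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Node` class, the `migrate` helper and the choice of the next node in the list as "some other node" are all hypothetical, and "traffic" is just a set of session ids held in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory stand-in for a real server node.
    name: str
    version: int = 1
    up: bool = True
    sessions: set = field(default_factory=set)


def migrate(src, dst):
    """Move every session from src to dst."""
    dst.sessions |= src.sessions
    src.sessions.clear()


def rolling_upgrade(nodes, new_version):
    """Upgrade one node at a time so every session always has a live host."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to upgrade without loss of service")
    for i, node in enumerate(nodes):
        standby = nodes[(i + 1) % len(nodes)]  # assumption: "some other node" is the next one
        moved = set(node.sessions)             # remember what has to come back
        migrate(node, standby)                 # 1. migrate the traffic away from node K
        node.up = False                        # 2. stop node K
        node.version = new_version             # 3. upgrade the software
        node.up = True                         #    (bring the node back up on the new code)
        standby.sessions -= moved              # 4. migrate the traffic back to node K
        node.sessions |= moved
```

Note that at no point in the loop is a session held only by a stopped node, which is the whole point of doing the upgrade one node at a time.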

Algorithm 2

Scale up system

To add a new node to the system:

  • Find some busy node
  • migrate half the traffic on the node to the new node
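
As a rough sketch of the two steps above, assuming the load on each node is simply a list of sessions kept in a dictionary (the `add_node` function and the `load` mapping are hypothetical names, not part of any real system):

```python
def add_node(load, new_node):
    """Add new_node to the system by taking half the traffic of the busiest node.

    load maps a node name to the list of sessions it serves.
    Returns the name of the node that gave up half its traffic.
    """
    if new_node in load:
        raise ValueError("node is already part of the system")
    # Find some busy node: here, simply the one with the most sessions.
    busiest = max(load, key=lambda name: len(load[name]))
    sessions = load[busiest]
    half = len(sessions) // 2
    # Migrate half the traffic on that node to the new node.
    load[new_node] = sessions[half:]
    load[busiest] = sessions[:half]
    return busiest
```

For example, with `load = {"a": [1, 2, 3, 4], "b": [5]}`, calling `add_node(load, "c")` leaves "a" with two sessions and gives the other two to "c".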

Algorithm 3

Make fault-tolerant system

Assume we have N nodes connected in a ring. If node K crashes we will recover on node K+1 (or on the first node if K was the last node).

In operation we have to make sure that node K+1 has enough information about node K to take over if node K crashes.

In practice we would send an asynchronous stream of messages from K to K+1 containing enough information to recover if things go wrong.
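
Here is a minimal sketch of the ring idea, assuming each node keeps its own state plus a replica of its predecessor's state. All the names (`RingNode`, `update`, `drain`, `fail_over`) are hypothetical, and the "asynchronous stream" is modelled as a queue that the successor applies later, so the sender never waits.

```python
from collections import deque


class RingNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = {}       # this node's own live state
        self.backup = {}      # replica of the predecessor's state
        self.inbox = deque()  # replication messages received but not yet applied


def successor(ring, i):
    """The next node in the ring, wrapping around to the first node."""
    return ring[(i + 1) % len(ring)]


def update(ring, i, key, value):
    """Change state on node i and queue the change for its successor."""
    ring[i].state[key] = value
    successor(ring, i).inbox.append((key, value))  # asynchronous: no reply awaited


def drain(node):
    """Apply every queued replication message to the backup copy."""
    while node.inbox:
        key, value = node.inbox.popleft()
        node.backup[key] = value


def fail_over(ring, i):
    """Node i has crashed: its successor takes over using the replica."""
    ring[i].alive = False
    heir = successor(ring, i)
    drain(heir)
    heir.state.update(heir.backup)
    heir.backup.clear()
    return heir
```

In this toy version the queue lives on the receiving node, so nothing is lost. In a real system any message still in flight when the node crashes is gone, which is why deciding what counts as "enough information to recover" matters.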


These algorithms are very similar. They can all be built with a small number of primitives. The primitives must:
  • detect failure
  • move state from one node to another
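
To make the first primitive concrete, here is a small sketch of one common way to detect failure: a heartbeat with a timeout. This is an assumption made for illustration, not the only way to do it. The `HeartbeatDetector` class is hypothetical, and timestamps are passed in explicitly so the example behaves the same every time it runs.

```python
class HeartbeatDetector:
    """Suspect a node of failure if it has been silent for longer than timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the last heartbeat

    def heartbeat(self, node, now):
        """Record that node was alive at time now."""
        self.last_seen[node] = now

    def suspects(self, now):
        """Return the names of nodes whose last heartbeat is too old."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A node that shows up in `suspects` is the trigger for the second primitive: moving its state to another node.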
Q: What are the state changes that must be moved?

A: Enough state to resume the operation on a new node.

Q: Could dynamic code upgrade be viewed as a special case of failure?

A: Yes - here's how:

Algorithm 4

Dynamic code upgrade

Apply Algorithm 3 to make a fault-tolerant system.

For any node running old code:
  1. crash the node
  2. restart the node
  3. put new code on the node
  4. make the node available
As the new node becomes available, Algorithm 2 (scale up system) applies.
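
Put together, treating an upgrade as a deliberate crash might look like the sketch below. It is a toy model under stated assumptions: nodes are plain dictionaries with hypothetical `"version"` and `"sessions"` keys, recovery happens on the next node in the ring, and "make the node available" means handing it half the sessions of the busiest node.

```python
def upgrade_by_crashing(nodes, new_version):
    """Upgrade every node running old code by crashing it and letting
    fail-over and scale-up do the rest.

    nodes is a list of dicts: {"version": int, "sessions": list}.
    """
    if len(nodes) < 2:
        raise ValueError("fail-over needs at least two nodes")
    for i, node in enumerate(nodes):
        if node["version"] >= new_version:
            continue  # already running the new code
        heir = nodes[(i + 1) % len(nodes)]
        # 1. crash the node: its ring successor recovers the sessions
        heir["sessions"].extend(node["sessions"])
        node["sessions"] = []
        # 2-3. restart the node and put the new code on it
        node["version"] = new_version
        # 4. make the node available: it takes half the traffic of a busy node
        busiest = max(nodes, key=lambda n: len(n["sessions"]))
        half = len(busiest["sessions"]) // 2
        node["sessions"] = busiest["sessions"][half:]
        busiest["sessions"] = busiest["sessions"][:half]
```

The sessions end up distributed differently from where they started, but none are lost and every node ends up on the new version.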

What do you have to think about?

When designing a system for fail-over, scalability, and dynamic code upgrade, we have to think about the following:
  1. What information do we need to recover from a failure?
  2. How can we replicate the information we need to recover from a failure?
  3. How can we mask failures, code upgrades, and scaling operations from the clients?

This is part of the essential analysis that we have to perform if we want to make a highly reliable system that is scalable and which can be upgraded with zero loss of service.

In part 2 I'll talk about how to mask failures from the clients. Do we use IP-fail-over techniques, or some other technique?

When it comes to programming, you'll want to implement this all in Erlang - Erlang has the correct set of primitives for failure detection (links) and for stable storage (replicated Mnesia tables), which make programming this a not-too-daunting task.

Saturday, March 03, 2007

Hasta La Vista, baby

But what did he say when he came back?

I think he just grunted.

I've been busy. With this. But now I'm back (perhaps not every week since there are three more chapters to write).

When I started this blog the idea was "blog at least once a week." Nobody will read a blog unless there's content, once a week. Just say something, have a point of view.

That's easy - once a week - easy.

Said he.

A while back I thought "I'll write another Erlang book." I feel stressed here, put upon. At the last (every) Erlang conference they say "when are you going to write another book?" - I say "soon." I approached a few publishers (the BIG ones, I won't embarrass them by saying who).

They said "no".

I said "why".

They said, "there is no market"

Chicken and egg.

Then I discovered Print-on-demand and the Lua book, I mailed Roberto Ierusalimschy, we exchanged a few emails. I thought "I'll start a publishing company." I'll be my own publisher.

I had great dreams: "We'll be like Faber and Faber", I'll be T.S.Eliot, we'll publish great books, beautiful books, wonderful books, books that make you think, with code, code that makes you cry, code that makes you think...

Cool, cool, we'll publish Erlang books, Haskell books, Clean books, Prolog books, this is an art form, let's create some art, let's publish art.

He called it the "art of computer programming", remember ...

I'd found my publisher - me - great - no grovelling letters to publishers.

I wrote to the Erlang list "I'm going to write a book..." I've got a publisher.

A few days passed ...

"Hi my name is X.Y ... I'm a friend of Dave and Andy, ..., would you like to talk to them ..."

"Talk sure, no harm in talking."

I mailed Dave - I said "convince me that you can sell more books than I can sell." - come on if I give up my dreams of being T.S.Eliot then something has to give.

And so I started work. Dave said "the first thing you have to do is find a voice" - like, ummm, I've got a voice, hello.

He said "write a few chapters" we'll fix the voice.

He said "Joe you sound like you're standing on a mountain preaching, I want you to imagine you're sitting at a terminal with your friend beside you, explaining how things work."

I struggled - he re-wrote bits of my text - like "this" he wrote.

When I started I thought "who is my audience?" I had in my mind the Erlang crowd. As I wrote we re-focused - it became "the Java programmer who has heard about Erlang and wonders what it is."


Sorry, Guys (Erlang Guys) this one is not for you, you know this stuff


Yesterday it happened. It clicked. I'd been feeling tense, nervous, hyper. It happens like this - ask Helen.

I wrote the OTP introduction chapter. The book's in beta so this chapter was not yet written. I went into pure flow. I wrote for three hours in your time, ten seconds in mine. Then the tension dropped, I felt expunged, cleansed, relaxed.

Then I read what I'd written.

In the gen_server there's a function called terminate with two arguments (terminate/2).

So I called the section "Hasta La Vista, baby."

I would never ever have written that six months ago.

I found a new voice I didn't know I had.

Thanks Dave, for helping me find a new voice.

Enjoy the book.