Friday, September 01, 2006

Why I don't like shared memory

In my last blog concurrency is easy I wrote about a simple model for programming concurrent systems. When you write a blog you have to think about the target audience and what level you want to pitch the blog at. Should it be technically advanced, or should it popularise the ideas you want to talk about?

I chose to talk about concurrency in a decidedly non-technical manner, I thought I'd use the analogy of people talking to each other. In my blog I argued that processes should behave pretty much like people. People have private memories and exchange data by message passing.

Now reactions to this were beyond my expectations. To start with a lot of people read what I'd written - this came as a surprise. Hardly without publicising the article it reached the #3 stop on - indeed last Wednesday three of the top five programming articles on were about Erlang. Secondly a discussion thread started on where people discussed what I'd written.

Here I'm going to answer the first point that was raised in the discussion thread:

"dogger" said this:

I'm not sure I quite got why not having shared memory was so great. His example of throwing simple question/answer messages around isn't really how lots of programs work. I think sending messages has its place, as does shared memory. Choose which ever one is most appropriate.

Good comment! the things that we take for granted are the things we feel the least need to explain. Now I take the view that sharing memory is wrong. Having thought this way for the last twenty odd years I take this as axiomatic, and never really explain why I think shared memory is a bad idea - so in what follows I'll give are a number of reasons why I don't like sharing memory.

Dogger's second comment "His example ... isn't really how lots of programs work" Is of course, correct. Lots of programs do not work in the way I have suggest, the point is that these program could have been written in a different programming style which totally avoids shared memory and locks and which does make use of fine-grain concurrency and pure message passing. I also believe that such programs are much easier to write and understand since all of the problems with shared memory that I list below are sidestepped. Note that I do not suggest that I have a solution to any of the problems below. But I would say that such problems can be avoided altogether by using a different programming style.

Now I'll turn to why I don't like shared memory:

Problem 1 - Programs that crash in a critical region

The following approach is commonly used when two or more programs wish to share memory. Any program that wishes to manipulate shared memory, must do the following:
  1. Acquire a lock
  2. Manipulate the shared memory
  3. Free the lock
The code that the program executes after acquiring the lock and before freeing it is called the critical region.

During the time that the program is inside a critical region it should not crash, and it should not spend too much time inside the critical region.

But what happens if the program does crash inside the critical region?

In this case things get messy. In an ideal world we would like to have some kind of transaction semantics, meaning that the net effect of running the program would be either that all the memory changes made inside the critical region succeeded or none of them succeed and the state of memory is the same as it was before the program entered the critical region.

Suppose that A tries to change 10 memory locations, I'll call them M1,M2,...,M10, and that the program was supposed to do this:

  1. Acquire a lock
  2. Modify M1
  3. Modify M2
  4. ...
  5. Modify M10
  6. Release the lock
But in fact it did this:
  1. Acquire a lock
  2. Modify M1
  3. Modify M2
  4. Crash
We would like the effect of running A to be that either is succeeded and all the memory changes M1 to M10 took place, or that none of them succeeded. So having crashed at step 4 of the above we would like to undo the effect of the first two modifications of memory.

Achieving this is pretty complicated, it assumes that there is some monitor program that can detect that a thread has crashed and which reset the memory to its original state in the event of a crash.

Problem 2 - Programs that spend too long time in the critical region

A number of things can happen when a program is inside a critical region, it can manipulate memory, which is why it must be in the critical region in the first place, and it can perform computations. The problem is that these computations take place while the program is inside the critical region - so if these computations take a long time, then all other programs waiting to access the shared memory will be queued until the current program has left the critical region.

Writing code that executes inside the critical region can be very difficult, since we have avoid very time consuming computations and move them outside the critical region. We also have to remove things like remote procedure calls from the critical region, in case they should suddenly take too long. All of this is a very unnatural way of programming and extremely difficult to get right.

Problem 3 - Locking too much

Unfortunately we often lock far more memory than we need to, a program will typical lock all the shared memory, just to manipulate one tiny fragment of it. In languages which permit direct memory modification through pointers then the minimum size of memory that can be protected is determined by the granularity of the page tables. A typical page size might be in the range 8-64KB - with a page size of 8K you might only want to protect a single byte, but you are forced to protect a minimum of 8KBytes.

Your program might only need to protect 1 Byte, and the other programs in the system might wish to modify some other parts of memory in the same page, yet these have to wait until your program as left its critical region before they can manipulate their portion of the memory.

Now all of this probably doesn't matter on a single CPU where different threads all run on the same CPU, the CPU is always busy and at least its doing something. But on a multi-core CPU it does matter. On a multi-core CPU many of the processes will be waiting to acquire locks despite the fact that logically all these CPUs could be running in parallel.

We could, of course, divide the shared memory into different partitions, and let the program lock onto those parts of memory they were interested in, but now the programming gets even more difficult.

Problem 4 - Distributed shared memory

Now things get really messy. On a single motherboard there really can be a single memory that a number of different CPUs can access, but on a cluster or in a networked distributed system this is just infeasible. What really happens is that each node in the system has its own memory and that reads and writes and locks are applied to the local memory. In any system one of these memories must assume some kind of master role, the other memories in the system assume secondary roles and behave as caches. Then some kind of cache consistency protocol is run between the different memories to ensure that all the processes accessing this memory have a consistent view of the world.

Now all of this is very difficult to achieve - so at this point most programmers just give up and use a fault tolerant distributed database, which is generally pretty slow since it has to do a lot of messy things in the background.

Problem 5 - Sharing limits scalability

Threads which share data cannot run independently and in parallel. Now this doesn't matter on a single-core CPU, but it does matter no a multi-core CPU. At the point in their execution where the threads share data, their execution becomes serial rather than parallel. The critical regions in the threads introduce serial bottlenecks which limits scalability.

If we want really high performance we have to make sure that our applications share nothing, that way we can just replicate our solution over many independent CPUs.

Problem 6 - Sharing can introduce deadlocks

Sometime we try to increase concurrency by some form of fine-grain sharing. The idea is that instead of locking all our memory, we divide the memory into smaller regions and only lock those parts of memory we are interested in. Imagine now two threads P and Q which want access to memory regions A and B. Suppose P locks memory region A and then waits for memory region B, and Q does the opposite, ie it locks first B then waits for A. This results in a deadlock, P and Q now suspend indefinitely

Problem 7 - Sharing makes systems error prone and debugging difficult

Suppose two threads A and B share data. A mistake in A can overwrite data used by B. Even through the code in B was correct it might crash because the data structures it manipulates have been corrupted by A. Now all systems should ideally obey the my program should not be able to crash your program rule - this is clearly not the case when programs can share data.

Debugging becomes horrific. Thread B has crashed, so it seems reasonable to assume that the code for thread B is incorrect - this assumption is wrong, since the code in A is to blame. This separation of cause and effect makes debugging very difficult.

Finally a more general comment

Sharing doesn't exist in the real world

I'm an ex physicist. In classical physics and ignoring quantum effects two objects in the real world cannot exist in the same place at the same time.

If we have two objects they must be at different places. Now the only way one object can interact with another is by sending it a message (say by a light ray). If this light ray encodes some information about a state change, then as far as the receiving object is concerned the state change only becomes known after the message has been received.

In simple relativity theory the concept of simultaneity just does not exist.

The point is that in reality objects do not share state, I believe its not a good idea to model what cannot exist in reality in software.

The idea that you actually need sharing and locks to implement parallel software is false. Everything that can be achieved with sharing and locks can also be achieved with pure message passing and no locks. The is the Erlang way.

In a future posting I'll tell you how to make a transaction memory which provides a lock-free method of achieve fine-grain state consistency in sets of parallel processes.


Anonymous said...

There is an alternative to locking large blocks of memory at a time, in suitably low-level languages - manual locking. Here you acquire a semaphore or mutex from the operating system or other reliable API that can guarantee atomic locks/unlocks. To the API itself, the lock you acquire has no particular semantics.

The locking is accomplished at a highly granular level by explicitly checking the lock each time the data of interest is accessed. For instance, in C++, you can write field accessor methods for your classes that wrap this behavior, and granularly lock/unlock each individual field of a record, with fully customizable behavior and semantics.

This isn't actually a benefit, though. Manual locking is roughly twenty billion times harder than automatic locking to get right. It's incredibly difficult to do manual locking without introducing a host of race conditions and deadlocks/livelocks; all it takes is forgetting your lock in one place, or implementing slightly different semantics in different bits of the code.

Either way, though, the end result is the same: shared memory is evil.

Hal Snyder said...

I accept the First Law of Concurrency, that the way to keep your sanity in concurrent programming is to have No Shared Data. But working against this on a distributed platform there are requirements for such things as 1) no single point of failure; 2) allocation of work to available resources; 3) aggregation of related real-time events.

Unknown said...

Great article. I generally agree with your conclusions about how awkward and error prone shared memory model is - yet your physics digression I suspect to be not adequate. Shared memory, when mapped to physisc, does not mean that we introduce two objects existing both at the same time and in the same place, because sharing memory region does not create seperate, yet coupled, views of its contents. Whenever we start with some allocated and initialized memory then we have single object per memory region. Reasoning by induction we deduce that sharing memory between n threads does not change that property. It rather marks object as being manipulated by n manipulators at the same time, which clearly is possible in "physics world" - let's call it interference of multiple waves in single point of 4d space

Anonymous said...

But even with message passing you can have dead-locks. Message queues can fill up and leave you in a blocked state.If you now have some interdependencies, you have a dead-lock. Been there recently.

But I still think that MP is better than shared memory, even if you have nice mechanisms like software transactional memory

黄涧石 said...

This is a great article. You really convinced me how wrong shared memory model could be. Anyway, message passing with independent memory space fits the rule of good engineering and design: minimal mechanism and maximum functionality.

One of the important thing is: it scales!

Unknown said...

A much better article than the last one; I think a reasonable level to pitch things is to imagine a Java/C++ programmer who works on "enterprise" systems. Myself, a member of the choir already, isn't about to be unduly swayed. Someone who needs a metaphor to understand message passing isn't going to benefit either.

You might also point towards the recent paper that surveyed recent language trends, comparing (off the top of my head): Erlang, E, Haskell+STM, and noting that they all take a multi-layered approach: pure functional core, cheap concurrency, message passing concurrency, heavyweight (slow) transactional shared state.

Keep up the good work, especially on Erlang -- so that us Haskeller's can steal the good ideas ;-)

Unknown said...

I have a question to do with memory usage and the cost of memory copies.

Often processes work on huge amounts of data (e.g. image and signal processing) and it isn't feasable to copy the data from process to process.

How do we organize work done by several processes on these large chunks of data given limited memory resources?

Unknown said...

Surely you need shared memory to implement a messaging system?

As far as I can tell you would have to allocate to each process a message queue which would have to be shared among other processes to allow message posting.

Is there some other way of doing it and if not, don't some of the mentioned evils of shared memory apply to messaging itself?

Unknown said...

It's funny -- in principle I agree with you, although, since my work tends to be more in things like embedded image processing, copying massive amounts of data doesn't seem like such a hot idea. Still, for most purposes, I'm on your side.

The thing that bugs me, and I wish you would leave out, is the attempt to justify it with weak "the world is like this" metaphors.

Things in the real-world (effectively) share state all the bloody time . The example that popped into my head is a traffic light. One light, and the drivers *reference* it.

I would actually even argue your "people talking to each other" metaphor, because if I'm broadcasting, it's more like sharing state. I say it once, and everybody hears it or not. You even get to see some nice race conditions. Message passing would be like writing notes to everyone. Which might not be ideal. Not the best way to watch a movie for example....

Joe Armstrong said...

No no no - Ben is wrong
If I drive at a red traffic light
at 7*10^4 Km/sec the red light will
appear green due to the doppler shift.

So some people will see red other green, depending upon the speed.

So there is no shared state,
eveybody has a different idea and they cannot agree.

Anonymous said...

Most of the above problems can be avoided by the use of consensus operators, such as compare-and-swap.

You may be interested in investigating lock free data structures.

That said, lock free structures do require careful design.

Byron said...

@Ben & Joe: Seems like the salient point regarding the traffic light is that 1), it is read-only, as far as the drivers are concerned (drivers can't change its state), and 2) it can be read by all drivers in its lane concurrently without need for message passing or non-shared-state constructs.

A better example might be the entire system of traffic lights, rather than any single one. Most city traffic light systems are synchronized so as to achieve the most efficient rate of flow of traffic (or the slowest, safest, depending on the city). However, pedestrians can come up to any intersection and push the button that requests that light to change so they can cross the street. That light starts the process of changing, and sends out signals to its nearest neighbors alerting them it is about to change outside of their normal synchronization, which in turn causes the other lights to start changing and sending out like signals to their nearest neighbors. Each light contains its own state, and changes to that state are initiated via message passing among independent nodes.

(Disclaimer: I have no idea how traffic light systems really work, and know from experience that they don't always change when pedestrians hit the button. But as a simplification, I think the analogy works better to describe a non-shared-state, concurrent, message passing system)

Anonymous said...

Enjoyed the article; thanks.

@things-like-image-processing: this is obviously a case where message passing probably wouldn't be the best solution.

One of the issues discussed was the disconnect between page size and space-I-want-to-lock size; with image processing your processes will generally be working on different areas, so I don't see it as being as much of an issue.

This only applies to tightly-coupled processes, though; if it's a more loosely-coupled parallel system then you don't have much of a choice anyway! :)

Anonymous said...

I have a few questions about this article.

Problem 3: "A typical page size might be in the range 8-64KB - with a page size of 8K you might only want to protect a single byte, but you are forced to protect a minimum of 8KBytes."

Leaving aside that on 32-bit architectures this is usually 4k, what do you mean that someone is forced to protect at least a page? I can have two small data structures, each with a lock, accessed by each of two processes at once. Are you perhaps referring to cache line invalidation?

Problem 2: "We also have to remove things like remote procedure calls from the critical region, in case they should suddenly take too long. All of this is a very unnatural way of programming and extremely difficult to get right."

It occurs to me that any program that wants exclusive access to an object is going to block out others. This fits in with 6) Deadlocks -- if two of your threads want to access the same object(s) or database row(s), you will run into this problem regardless of the model used.

To give an example, let's say the program is running a fantasy game wherein someone is casting spells. The player issues two "cast spell" commands. One of your threads starts to handle the first event, and one starts to handle the second event. Both see that the player has enough spell points to cast this one spell, so both send a transaction: spell_points-=5. The player goes from 5 to 0 to -5. Woops. On the other hand, if one thread takes exclusive ownership, you're right back to the locking issues above. How do you solve this?

Problem 4: I don't quite understand your explanation. When dealing with nodes over a network you use message-passing. In a high-performance environment the idea is to minimize the number of objects/records bouncing back and forth over the network. This means each node has a subset of the entire data that is local to it. This is always something you must be aware of if you care about performance; you can ignore it and load every object from a database over the network every time but then your performance goes down the drain.

Problem 5: "If we want really high performance we have to make sure that our applications share nothing, that way we can just replicate our solution over many independent CPUs."

This strikes me as being exactly backward, because of a misunderstanding about what sharing is. If your threads don't share memory, but do share a common data set, then you have a lot of network overhead as you send and receive/parse messages even when all threads are on the same multi-CPU system.

If your data is partitioned such that you really don't need to share ANYTHING, then you won't run into any locking problems and it doesn't matter what model you use.

Problem 1: "Programs that crash in a critical region" Inconsistent data views are an issue regardless of where your program crashes. This is always the case unless you make the entire operation a transaction. The general way to do this is to work on a copy of the data and then swap pointers; or use a journalled approach.

In addition, you can bet Mnesia uses locks and runs into all the issues any other program does. The only way it wouldn't is if only a single thread/process is used, in which case your database is a huge bottleneck.

Anonymous said...

I do not share memory with other trafficants when waiting for a traffic light - the light is not forcing us all to stop state,
it is a message (photons)!

I can ignore it and drive anyway. And ambulances are even allowed to do just that!

Anonymous said...

I don't understand Problem 5 - it seems like a restatement of what Problems 2 and 3 describe more specifically. In fact, I don't even like its wording - it seems to say that if you don't use shared memory, then tasks never have to wait for each other. That's certainly not true. Message-passing systems certainly have tasks blocked waiting for results from other tasks.

Otherwise, I agree with your points.

I think it's important to point out that you don't always even need to switch programming languages to get some benefits of a better model - in C/C++ projects, I simply choose fork() and/or O_NONBLOCK over pthread_create(). Even when you're stuck without fork() (Java, Windows), you can structure your code in a message-passing way, though unfortunately without the compiler complaining if you screw it up. I'd rather use Erlang, but when adding to an existing system...

Anonymous said...

Joe (not Armstrong) said...
"Surely you need shared memory to implement a messaging system?"

No; in fact shared memory systems are implemented in terms of message passing at the hardware level. That's what a cache coherence protocol does.

Several people mentioned the cost of message copying. Since Erlang is based on a pure-functional core language, message contents are immutable. Copying immutable data is indistinguishable from passing a pointer to it (leaving aside memory management issues), and so the implementation can use whichever is more efficient in a particular case.

I don't know what the references to having to lock whole memory pages were about in Problem 3; that appears to be a misunderstanding.

Byron said...

By the way, good article. Looking forward to many more, though it's probably not necessary to use simplified analogies to explain Erlang concurrency. Most programmers seriously studying Erlang right now are most likely early adopters and real hackers capable of understanding the intracacies of the language on its own terms, or by analogy to other languages.

Anonymous said...

re Joe's comment, "Surely you need shared memory to implement a messaging system?"

Perhaps, but not necessarily. You could use sockets, for example, which would allow you to run things in parallel on a cluster as well as a SMP. Second, assuming you did use shared memory to implement the messaging system, you'd only need to get it right in one place. Get the messaging system right (or, better yet, use someone else's system that already works) rather than having to get shared memory right throughout your application.

re genneth's comment, "You might also point towards the recent paper that surveyed recent language trends." Can you post a link?

Unknown said...

Threads, processes, machines are all part of the execution of a model. They aren't the model.

Similarly, pipes, sockets, shared memory, in memory calls are also part of the execution of message passing. They aren't the model too.

For example, a bank teller waits on a queue of customers. They only do something when the queue is not empty. To me that means that each object instance is at some level a conceptual process, at least in design terms.

The threads, processes, shared memory are just how these conceptual processes and messages get mapped onto physical processes and messaging.

One area to look at is Eiffel and SCOOP. This is the concurency mechanism in Eiffel. There are plenty of references on the web if you google.

If we take the queue. The teller will only process something off the queue if the queue is not empty. That's a precondition, and its part of Eiffel's design by contract. All SCOOP does is turn this into something called wait on necessity. The teller will wait until the preconditions are satisfied. It knows to do this because you specify the instance of the queue as separate. That is, running in a separate process, thread, machine...

Then before runtime, you map the conceptual processes to runtime processes.

All very neat because you program at the conceptual level.

Also, if your code is well written, it will run just as well in a single thread of execution as it will in a distributed environment.

Unknown said...

Is Mnesia a shared memory system?

If you need to find data, how do you do this (efficiently) in a non-shared memory system? Especially if it is more data than can fit on one machine (and even more if it needs to be persistent)?

Unknown said...

Just a note about the (deeply flawed) traffic light analogy that some people seem to have missed:

The supposition that we (or a thread) might all be looking at the same traffic light overlooks the fact that only one person at a time can be looking at it (particularly in a single-processor machine). Suggesting that shared state often resolves itself to certain constants (like the color of the light) overlooks the fact that one person might look up and see a green light, close their eyes (go to sleep), and come to a dead stop in the middle of the intersection, having discovered that another person (thread) saw a green light too -- but coming from another direction! The driver (thread) passed through the intersection (critical area) without re-evaluating the condition of the light, and so paid the penalty by slamming into the other car (thread), which also thought it had exclusive rights to go forward.

Ultimately, 'people' analogies are utter crap for this kind of speculation -- you can't compare shared-state machines (humans in the real world, all operating at the same time) with single-state machines, like processes & threads in computers, which operate not in concert but sequentially (at least in theory -- if you have a multi-core or whatnot, the above perils still apply, as pipelining and aliveness assures only a few threads are right all of the time).

Unknown said...

re genneth's comment, "You might also point towards the recent paper that surveyed recent language trends." Can you post a link?

I believe this is the reference:

Yaar said...

You may find it interesting that in the field of static analysis and code verification - shared memory is often reduced to a message passing protocol. It seems that what you really dislike is the usage of blocking. Yes, blocking is the root of many software evils, but is very common in message passing systems too. Maybe you should write a column about that?

More comments following:
Problems 1,2 - In message passing architecture, a process may crash or linger while other processes are blocked waiting for messages from it.

Problem 3 - In many programming languages and DBs, very granular locking is easy and efficient.

Problem 4 - Those very confusing and expensive distributed clusters with messy things in the background usually use message passing in their underlying protocols.

Problem 5 - Messages architecture limit scalability even harder. In shared memory, N processes require O(N) operations and O(1) time to propagate N pieces of information. With message passing, its more complex, with either O(n^2) operations or time or O(n) time. Even best algorithms (e.g. spanning trees) reduce this to O(n) messages and O(logn) time, and introduce fragility, as a single process may kill or corrupt the protocol.

Problem 6 - Deadlocks result from the protocol the prcesses use, not the memory they share. In message passing, deadlocks are common too. The real world "day-after-first-date-who-should-phone-who?" protocol is a (guaranteed) deadlocking message passing protocol.

Problem 7 - A single faulty process can broadcast corrupt data to all other processes, easily invalidating their local data, as illustrated by computer viruses.

Your final comment: Sharing does exist in the real world. I, for instance, would love to sleep on my double bed's diagonal. But my girlfriend wouldn't let me. And I use a lock to safeguard my bicycles, so nobody can 'share' it with me...

Finally, building computers involve quantum physics far more than classic or even relativity theory mechanics. And with quantum computing coming up, maybe we should flex our minds about whether computing languages really should reflect how reality works.

Anonymous said...

I atually think having shared memory is a great thing. But i also agree with the comment above. Especially with problem #7 A single faulty process can broadcast corrupt data to all other processes, easily invalidating their local data, as illustrated by computer viruses.