Thursday, July 10, 2008

UBF and VM opcode design

UBF is a data encoding that allows structured terms (rather like XML) to be sent over the network. It also includes a protocol checking scheme to automatically determine if sequences of typed messages follow a particular protocol.

This blog entry was stimulated by this posting on the erlang mailing list.

One of the basic ideas of UBF was to send programs, not data structures. The programs were for a byte-coded stack machine. So instead of sending data structures between machines we send tiny programs which, when evaluated, create data structures.

Each byte is an opcode for a VM. The net effect of executing a UBF program is to leave a value on the stack.

The trick in UBF was not to start allocating the opcodes in the VM from zero - but to allocate them with loving care.

A common mistake in making byte coded VMs is to allocate the byte codes from zero. If you think about it the byte code for a PLUS operation can only be 43 (why? - easy - this is the ASCII code for "+").
In fact the byte code for PLUS should be 43 in all byte coded VMs - there should be laws that make it a criminal offense for the opcode to be anything other than 43 - thus it is written. There will, of course, be a problem with the opcode for TIMES - if you are familiar with your ASCII codes then you should understand why.
I have no idea where I learned this trick - it seems to be in the folklore of VM design - choose the opcodes so that the binary code is readable (if you can). Unfortunately I didn't know this when I designed the first Erlang VM, but now I know better.
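To make the trick concrete, here is a minimal sketch of a stack machine whose opcodes are taken from ASCII. It is a toy and not the real UBF instruction set: the function name run_arith, the single-digit literals and the MINUS opcode are all assumptions made for illustration.

```python
import operator

# Opcodes are chosen from ASCII, so the opcode for PLUS is ord("+") == 43.
# (MINUS at ord("-") == 45 is an extra opcode assumed for this sketch.)
BINARY_OPS = {
    ord("+"): operator.add,
    ord("-"): operator.sub,
}


def run_arith(program: bytes) -> int:
    """Execute a postfix program one byte at a time and return the top of stack."""
    stack = []
    for op in program:
        if ord("0") <= op <= ord("9"):
            # A digit byte is an opcode meaning "push this single-digit value".
            stack.append(op - ord("0"))
        elif op in BINARY_OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[op](left, right))
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack[-1]


# The byte code is also its own listing: push 2, push 3, add, push 4, subtract.
result = run_arith(b"23+4-")
```

Had the opcodes been allocated from zero, the same program would be a run of unprintable bytes and you would need a disassembler to read it.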

So this way the byte code for start-of-tuple is 123, end-of-tuple is 125 and element-separator is 44 - unsurprisingly "{", "}" and ",". Thus "{...,...,... }" is a program and NOT a bit of syntax.

With this choice of encoding programs become human readable strings which require zero parsing - you just execute the byte codes.
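As a rough sketch of what "just execute the byte codes" might look like, the toy interpreter below handles only the three opcodes mentioned above plus decimal digits. It is not the real UBF VM: the name run_term, the MARK sentinel and the way digits build up integers are assumptions made for this example.

```python
MARK = object()  # hypothetical sentinel pushed by the start-of-tuple opcode


def run_term(program: bytes):
    """Execute the program byte by byte; the result is the value left on the stack."""
    stack = []
    number = None  # integer currently being built from digit opcodes

    def flush():
        # Push the pending integer (if any) onto the stack.
        nonlocal number
        if number is not None:
            stack.append(number)
            number = None

    for op in program:
        if 48 <= op <= 57:        # "0".."9": extend the pending integer
            number = (number or 0) * 10 + (op - 48)
        elif op == 123:           # "{": start-of-tuple
            flush()
            stack.append(MARK)
        elif op == 44:            # ",": element-separator
            flush()
        elif op == 125:           # "}": end-of-tuple, collect back to the mark
            flush()
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()           # discard the mark
            stack.append(tuple(reversed(items)))
        else:
            raise ValueError(f"unknown opcode {op}")
    flush()
    if len(stack) != 1:
        raise ValueError("program must leave exactly one value on the stack")
    return stack[0]


term = run_term(b"{1,22,{3,4}}")
```

There is no tokenizer and no parse tree here; each byte does its small piece of work on the stack, and the nested tuple is simply what remains when the program ends.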

Contrast XML, where the data structures are human readable but require parsing - this is why constructing a term from UBF is far faster than constructing it from XML, and why the encoding is far smaller while still being human readable.

Why didn't UBF spread?

If you have something that is almost ok - then lots of people can have great fun arguing over it and polishing it at the edges.

Things which are deeply flawed, and industry standards like XML, can lead to endless discussions - great fun - lots of hot air. Project management can happily preside over "the illusion of work" - wages get paid - everybody is happy. Projects get delayed - project management becomes very happy.

The optimal point is where projects get delayed as much as possible, budget overruns are as large as possible and the project manager is almost, but not quite, sacked. This idea is explored in "Putt's Law and the Successful Technocrat" - recommended to me by Gilad Bracha - and a great read.

Some things (like Scheme, Pascal, ...) are pretty nearly perfect - thus there is little to do. In fact Pascal was perfect (anybody got a UCSD Pascal emulator and image? - now that was really nice).

Fixing stuff that's broke

Programmers like to have something to do - so our lot in life is to fix flawed things. Most of my time is spent fixing things that should work but are, in fact, broken.

ASN.1 (which got me started on this blog entry) is elegant - but how it has been used is not.

I am currently examining LDAP - LDAP schemas have to be seen to be believed (and yes LDAP schemas are written in ASN.1)

In LDAP schema speak a boolean is a 1.3.6.1.4.1.1466.115.121.1.7 (this is an OID, for those in the know) and 1.3.6.1.4.1.1466.115.121.1.40 is a string ...

I'm glad the LDAP schema designers didn't turn their hand to programming language design. If they had, then

     boolean x,y,z;

might have been

   type 1.3.6.1.4.1.1466.115.121.1.7 x,y,z;
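For anyone who has to read such schemas, a lookup table helps. The sketch below maps the two OIDs quoted above to the syntax names I believe RFC 4517 gives them ("Boolean" and "Octet String"). The table name and the helper function are made up for this example.

```python
# Hypothetical lookup table for the two OIDs quoted above.
# The names are the LDAP syntax names as I understand them from RFC 4517.
LDAP_SYNTAX_NAMES = {
    "1.3.6.1.4.1.1466.115.121.1.7": "Boolean",
    "1.3.6.1.4.1.1466.115.121.1.40": "Octet String",
}


def syntax_name(oid: str) -> str:
    """Return a readable name for an OID, or the OID itself if it is unknown."""
    return LDAP_SYNTAX_NAMES.get(oid, oid)
```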


The only thing that is good about LDAP schemas is that they are not XML schemas.
...