Thursday, July 10, 2008

UBF and VM opcode design

UBF is a data encoding that allows structured terms (rather like XML) to be sent over the network. It also includes a protocol checking scheme to automatically determine if sequences of typed messages follow a particular protocol.

This blog entry was stimulated by this posting on the erlang mailing list.

One of the basic ideas of UBF was to send programs, not data structures. The programs were for a byte-coded stack machine. So instead of sending data structures between machines we send tiny programs which, when evaluated, create data structures.

Each byte is an opcode for a VM. The net-effect of executing a UBF program is to leave a value on the stack.

The trick in UBF was not to start allocating the opcodes in the VM from zero - but to allocate them with loving care.

A common mistake in making byte coded VMs is to allocate the byte codes from zero. If you think about it the byte code for a PLUS operation can only be 43 (why? - easy - this is the ASCII code for "+").
In fact the byte code for PLUS should be 43 in all byte coded VMs - there should be laws that make it a criminal offense for the opcode to be anything other than 43 - thus it is written. There will, of course, be a problem with the opcode for TIMES - if you are familiar with your ASCII codes then you should understand why.
I have no idea where I learned this trick - it seems to be in the folklore of VM design - choose the opcodes so that the binary code is readable (if you can). Unfortunately I didn't know this when I designed the first Erlang VM, but now I know better.
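To make the idea concrete, here is a minimal sketch of a stack machine whose PLUS opcode is byte 43. It is a toy for illustration only, not the UBF VM: the function name run_arith, the use of digits as "extend the current integer" opcodes and the use of space as a no-op separator are all assumptions of mine.

```python
def run_arith(program: bytes) -> int:
    """Toy stack machine (hypothetical, not UBF): every byte is an opcode."""
    stack = []
    pending = None                    # integer literal being accumulated
    for op in program + b" ":         # trailing space flushes a final integer
        if 48 <= op <= 57:            # "0".."9": extend the pending integer
            pending = (pending or 0) * 10 + (op - 48)
            continue
        if pending is not None:       # any other opcode ends the integer
            stack.append(pending)
            pending = None
        if op == 43:                  # "+" is PLUS: pop two values, push the sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op != 32:                # space is a no-op; anything else is an error
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()                # the program leaves its value on the stack


print(run_arith(b"10 20+ 12+"))
```

The program b"10 20+ 12+" can be read directly as text, yet the machine never builds a syntax tree - it just dispatches on each byte.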

So this way the byte code for start-of-tuple is 123, end-of-tuple is 125 and element-separator is 44 - unsurprisingly "{", "}" and ",". Thus "{...,...,... }" is a program and NOT a bit of syntax.

With this choice of encoding programs become human readable strings which require zero parsing - you just execute the byte codes.
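A minimal sketch of how "{...,...,...}" can be executed rather than parsed, assuming only the three opcodes named above (123, 44, 125) plus digits for integers. This is not the real UBF instruction set: the name run_tuple, the MARK sentinel and the treatment of "," and space as no-ops are assumptions made for the illustration.

```python
MARK = object()  # sentinel pushed by "{" so that "}" knows where the tuple began


def run_tuple(program: bytes):
    """Toy VM (hypothetical, not UBF itself): builds nested tuples of integers."""
    stack = []
    pending = None                    # integer literal being accumulated
    for op in program + b" ":         # trailing space flushes a final integer
        if 48 <= op <= 57:            # "0".."9": extend the pending integer
            pending = (pending or 0) * 10 + (op - 48)
            continue
        if pending is not None:       # any other opcode ends the integer
            stack.append(pending)
            pending = None
        if op == 123:                 # "{": start-of-tuple, push a marker
            stack.append(MARK)
        elif op in (44, 32):          # ",": element separator; space: no-op
            pass
        elif op == 125:               # "}": end-of-tuple, pop back to the marker
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()               # discard the marker
            stack.append(tuple(reversed(items)))
        else:
            raise ValueError(f"unknown opcode {op}")
    if len(stack) != 1:
        raise ValueError("program must leave exactly one value on the stack")
    return stack[0]


print(run_tuple(b"{1,{22,3},4}"))
```

Nesting needs no grammar here - the stack keeps track of which tuple is currently open, so inner tuples are finished before the outer one is closed.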

Contrast XML, where the data structures are human readable but require parsing - this is why constructing a term from UBF is far faster than constructing one from XML, and why the encoding is far smaller while still being human readable.

Why didn't UBF spread?

If you have something that is almost ok - then lots of people can have great fun arguing over it and polishing it at the edges.

Things which are deeply flawed, and industry standards like XML, can lead to endless discussions - great fun - lots of hot air. Project management can happily preside over "the illusion of work" - wages get paid - everybody is happy. Projects get delayed - project management becomes very happy.

The optimal point is where projects get delayed as much as possible, budget overruns are as large as possible and the project manager is almost, but not quite, sacked. This idea is explored in Putt's Law and the Successful Technocrat - recommended to me by Gilad Bracha - and a great read.

Some things (like Scheme, Pascal, ...) are pretty nearly perfect - thus there is little to do. In fact Pascal was perfect (anybody got a UCSD Pascal emulator and image? - now that was really nice).

Fixing stuff that's broke

Programmers like to have something to do - so our lot in life is to fix flawed things. Most of my time is spent fixing things that should work but are, in fact, broken.

ASN.1 (which got me started on this blog entry) is elegant - but how it has been used is not.

I am currently examining LDAP - LDAP schemas have to be seen to be believed (and yes LDAP schemas are written in ASN.1)

In LDAP schema speak a boolean is a 1.3.6.1.4.1.1466.115.121.1.7 (this is an OID, for those in the know) and 1.3.6.1.4.1.1466.115.121.1.40 is a string ...

I'm glad the LDAP schema designers didn't turn their hand to programming language design. If they had, then

     boolean x,y,z;

might have been

     type 1.3.6.1.4.1.1466.115.121.1.7 x,y,z;


The only thing that is good about LDAP schemas is that they are not XML schemas.
...

7 comments:

Unknown said...

Actually UBF has spread a little - see Inferno's man pages:
http://www.vitanuova.com/inferno/man/6/ubfa.html

Anonymous said...

I see UBF (and these Google protocol buffers, not to mention JSON, Serialized PHP, and their ilk) as attempts at solving an entirely different problem than XML.

XML is a language for writing other languages. UBF seems to be a way of serializing language objects to some bytecode that could (potentially) be executed by (the parsing) bytecode interpreter at the other end.

We can disagree on the true definition of the word "parse", but I do believe that if you serialize a Java object to a UBF form, you will (on the other end) need to "de-serialize" it from UBF to the language form at the other end (translating bytecode to executable code in this case). That is the same general class of problem as parsing an XML document, and then creating some kind of (programming) language-model from the serialized form.

So how is it that UBF avoids "parsing" (effectively translating from the wire format to some other language?)

Secondly, your statement:

"Things which deeply flawed and industry standards things like XML can lead to endless discussions - great fun - lots of hot air. Project management can happily preside over "the illusion of work" - wages get paid - everybody is happy. Projects get delayed - project management becomes very happy."

is, to say the least, a little inflammatory, without providing any hint that XML is actually useful for something. Given that XML and XML-based languages are in widespread use, despite their flaws, your attack appears unjustified, and does not seem even related to UBF.

I wouldn't recommend an XML language to be used as bytecode. But I also wouldn't understand why I should send a syndicated feed of my photos as UBF.

Anonymous said...

It seems that UBF could be very interesting for grid computing, as opposed to that bloated XML.

Please add an ML9 feed instead of Atom here :-)

Unknown said...

Joe, I am pleased to see you writing about UBF again. Keep up the provocation. I think we need more of it, even, especially if some people find it inflammatory. Perhaps you could link to your earlier work on UBF so that readers of your blog might go find out what you are talking about. Also thanks for the book reference from Gilad Bracha. Reading that might help me understand why this industry seems hell bent on burning money and time while seeming to want the opposite.

Anonymous said...

I am glad someone else likes Pascal. I spent 15+ years writing HA Financial Market systems in Pascal on VMS. Robust and can be beautiful. Java was "ok" but I have now found Erlang! One of two projects I am looking into is a stock exchange/crossing engine, btw.

grantmichaels said...

I'm new to Erlang, but have thus far thoroughly enjoyed the guiding principles, and it's really a breath of fresh air that you take an unmitigated stance on matters of programming philosophy and aren't playing an 'all-pleasing' role ...

grantmichaels

Daira Hopwood said...

UBF(A) is a great design. You say, though, that it doesn't require parsing because it can be treated as bytecode for a VM. However, a bytecoded VM does parse its code.

A UBF(A) parser can be extremely simple, and that's a valuable property (particularly for security reasons), but there's no need to overstate it.

Rosco wrote "Keep up the provocation." Seconded.