01 September 2013

Lessons in working with file formats in flux

As you may know, my tool Arbed for Xojo (formerly known as Real Studio) is capable of reading and modifying Xojo's project files. This ability is used to perform often needed operations that the Xojo IDE doesn't offer (yet), such as comparing projects and single classes, scripted code replacement, code obfuscation, preparing code for localization and more.

I'm going to tell you a little about the challenges I have to deal with.

There are 3 different project formats supported by Xojo:
  1. Binary, aka RBP, aka RbBF, using extensions .rbp and .xojo_binary_project (sheesh!)
  2. XML, using extensions .xml and .xojo_xml_project
  3. Textual, aka VCP, using extensions .rbvcp and .xojo_project

Dealing with RBP and XML


The first two formats are structurally identical: The RBP format is nothing more than a more compact version of the XML format, using a binary representation of the XML by using 4-letter-codes for the tags and 4 byte length fields to indicate the size of the elements that follow. (Xojo engineers informed me that actually RBP was designed first, and XML is a translation of the RBP format.)

Arbed originally only supported these first two formats, and it was pretty easy to handle. The data was well structured and repetitive in a way that gives little room for (problematic) surprises. If you've ever looked at a xml (or html) file, you probably understand what I mean.

When Arbed reads a project in the RBP and XML formats, it modifies only the parts it well understands while leaving all the other data untouched. For instance, when it changes some source code inside a function, it only modifies the affected <SourceLine> elements. That way, it can safely modify the project even if Arbed doesn't understand everything that's inside the project. And believe me, even though XML is apparently self-explanatory, there are a lot of unexplained things in there that just leaves one guessing.

The blame for this can be put with Xojo for not documenting this. I guess they do not even have an internal documentation. As so often in rushed software engineering, the code is the only documentation (for that, it doesn't even deserve the term "engineering"). When I am about to write complex code, I usually start with documenting and specifying it first, so that I (and others that might join the project) can later reder to that. It really helps, and should be obvious. However, most people do not follow this simple rule. Many even never learn. It even starts with documenting your single subroutines. See my little article on coding guidelines.

Converting between RBP and XML formats.


It gets a little more complicated when using Arbed's Convert operations. They's available in the main window (titled Arbed Drop Pad). For instance, to convert a RBP to a XML file, it has to know how the 4-letter-codes in the RBP file translate to the XML tag names.

It happens every once in a while that Xojo adds new features to the project file format that lead to using newly names element tags. For instance, when the Web Edition was added, a new XML tag named WebApp got added. Now, while Arbed doesn't need to know what this value means, it needs to know how to translate it between the XML tag name and the RBP tag code. Therefore, when you're using an Arbed version that doesn't know this translation yet, it'll tell you about it if you ask it to convert a newer project file with yet-unknown tags in it. I will then have to update Arbed with the new code (which is easy to do, I just have to save the same project in both XML and RBP format in the IDE and see look for the code in question in both files).

Overall, working with RBP and XML projects in Arbed is therefore fairly safe and fast.

Arbed only rarely needs updates, e.g. when a new tag code gets added, and then only to make the Convert functions work (which are not even really necessary because you can just use the IDE to save a project in a different format).

Dealing with the VCP format


The VCP is a different and much more scary beast.

In theory, it's just plain RbScript code, which is fairly well specified (even though Xojo has failed to present a syntax/grammar spec for their language for 15 years now, it's fairly well understood by me by now, and my background in compiler design helps there, too).

I even have a fully working RbScript parser that I hope to use soon to implement some great features such as method name obfuscation (i.e. rename all custom method names in your source in order to foil reverse engineering attempts), automatic code reformatting and dead code identification and removal.

But the reality is harsh. The VCP format has a lot of inconsistencies and hard-to-understand behaviors that make parsing it challenging.

Some examples:

Properties of Controls vary in representation


Recent Real Studio versions tended to write some integer values of Control properties as floating point numbers. If you're using version control you may have noticed that a control's Width and Height were sometimes shown as 1.6e2 when you did input 160. Recent Xojo version seem to have finally fixed this.

Odd spacing in method declarations


This is how a normal Method declaration looks like when using qualified type identifiers:
Sub Foo(x as a.b)
And this is how an External Method looks like:
Declare Sub extmethod Lib ""  Foo(x as a . b)
Note the inserted blanks around the period. They're syntactically allowed but I didn't type them - the VCP format inserted them automagically.

Troubles with Attributes


Let's enter 3 attributes with, admittedly, some unusual values:
  • test: a
  • t2: \x22 (which gets automatically quoted into: "\x22")
  • t3: b=," (which gets automatically quoted into: "b=,""")

Now let's see how they look like in the VCP format.

This is how a normal Method looks like:
Attributes( test = a, t2 = "\x22", t3 = "b=,""" )  Sub Foo()
Alright. That's fairly readable. (Though what's with those spaces inside of the parentheses? Also note the double space before "Sub".)

Here's a similar Delegate declaration:
Attributes( test = a, t2 = "\x22", t3 = "b=,""" ) Delegate Sub Foo()
Looks pretty identical, doesn't it? And now we can also understand that double space: It makes room for plugging a "Delegate" word in there :)

But wait. Method declarations are surrounded by #tag markers, providing extra information that the RBP and XML formats usually store in extra elements or attribute fields. Here's the one for a normal Method:
#tag Method, Flags = &h0
Okay. Now the one for a Declare:
#tag DelegateDeclaration, Flags = &h0, Attributes = \"test \x3D a\x2C t2 \x3D "\x22"\x2C t3 \x3D "b\x3D\x2C""""
Uh, what? Not only it includes the Attributes that are already present - and more much readable - in the declaration source code line, but it's only appearing in this special "delegate" method declaration but not in a normal method - even though they have the same syntax and should thus be created the same way.

But it gets even weirder: The above was created in the latest Xojo IDE. When doing the same with Real Studio 2012r2.1, the above line does not include the Attributes in the #tag line. So, someone must have accidentally added this nonsense just recently, and only to Declares, not to normal functions. Or maybe it wasn't an accident. In any case, it's quite a mess.

Encoding horrors


As a final example, let's focus on the odd encoding of attributes in the #tag line. It's obvious that it tries to escape some codes so that they may also be used inside the attribute values. It looks like a homemade algorithm, but appears to work pretty well. As you can see above, I've used characters as values that it also uses for escaping to see if I can break it. But I couldn't. That is, almost. When I set the value of an attribute to something containing line feeds (returns), then the IDE crashes hard or the compiler spills out inexplicable error messages. Oh, and later I found that if you put \x22 into an attribute's value, it'll be converted to a comma next time you open the project. Doesn't happen in RBP format, of course.

Sure, the test with the return in the attribute value was something that's unlikely to happen, but it shows that the whole thing is put together with little understanding how escaping random data should be done. Heck, since the Xojo IDE is programmed in Xojo, they could have have just base64-encoded the string, or used quoted-printable encoding. That's what they're meant for, and both are functions that Xojo provides anyway! But no, a new technique had to be invented: Something unreadable and prone to crashing.

What this all means


Why do I even bother with this painful stuff?

Well, here's the thing: In 2012, I enhanced Arbed to read and write the VCP format.

It was one of the worst decisions I had made in past years, because it took me several months of intensive work to get even close to being usable. And since then I spent many more days working out all the kinks. These kinks are what you see above: Hardly anything that looks similar also behaves similarly. Behaviors change in different RS and Xojo versions, sometimes without making sense. I had to identify and special-case them all, for each single IDE release, so that Arbed generates the exact same output that the IDE does.

I am still not sure that it was worth the effort, because I do not use the VCP format for myself, and until then, all the features I added to Arbed were needed by myself. But when I started selling Arbed, I believed I had to add this feature because many expected it to just work.

Now, every time Xojo changes the output format of the VCP files only slightly, it might mean that my own code could miss it. For instance, some files use undocument flag values, such as &h1000. No idea what it's for. Formerly, I had just generated these flags from the declaration source line. But now, sometimes, the Flag in the #tag line contains more information and I have to preserve it. Special handling galore!

Arbed's VCP output needs to match that of the IDE exactly


The issue is that, contrary to how Arbed can edit just specific parts of a method's source code in the RBP and XML formats, it can't do this with VCP files. Arbed's project modification code was written to work on the RBP/XML format directly, in order to keep anything unmodified that the user doesn't directly alter. For this functionality to work with VCP files, Arbed has to convert it internally into RBP format, so that my existing code can operate on it. Then, when the user saves his changes, Arbed recreate the VCP file from the updated RBP code (Arbed only rewrites the files that were affected by the changes). But that means that I have to recreate the entire VCP file from RBP data, instead of just modifying the source lines or whatever else was changed by the user as I did with RBP/XML files.

All this requires that Arbed understands every detail of a VCP file in order to properly convert it to RBP and back, internally.

Potential for damage


And that's what bit me a few times already in the past. I would miss small changes in the format, such as that color constants were originally written as hex values (&h...), but recently Xojo changed this to using &c... codes instead. My code was not prepared for this and would then write any color back as a 0 (black).

Basically, Arbed could, unknowingly, damage your VCP files if you edited and saved them in Arbed. You'd notice eventually, but Arbed should notice this before it does such damage.

So, how do you make sure that your code that reads and writes an external file format doesn't damage it just because the file format introduces changes that you are not aware of, yet?

The solution is: When reading the input, recreate the would-be output from it right away and then verify that both match. If there's a mismatch, your code is prone to damaging the file when writing it back.

Simple as that.

Arbed 1.7 is made safer by performing a self-test on any VCP file it reads.


Therefore, from Version 1.7b9 on (which I'll release shortly), Arbed will verify its own VCP read/write code to make sure that it entirely understands the project it reads and (optinally) writes.

So that, when you save a project, anything you didn't specifically change, will remain the same in the rewritten VCP file. So that when you use version control (git, mercurial, subversion) with your VCP files, it won't show lots of unrelated changes after Arbed wrote the file back (even if it's still valid to Xojo). Something Arbed may do even better than the Xojo IDE itself (ever seen those TabStops disappear and reappear randomly?)

Arbed accomplishes this self-check by converting the read project into its internal RBP format, then back the VCP representation, then comparing its VCP output with the original file. If it finds a mismatch, it warns the user that the data was not fully understood. It also creates a file on the user's Desktop containing the specifics of the rendering differences, so that the user can tell what's gone wrong (allowing the more experienced user to decide if the changes are benign), and if the user is sending the file to me, I can quickly fix Arbed and release a new version that deals with it.

Of course, if Arbed reports a problem with its VCP conversion, one can always just use the Xojo IDE to make the conversion: Open your project in Xojo, and use the Save As... menu command to save the project in the binary or XML format. Then use that file with Arbed, and if you've made changes in Arbed and saved them back to the project file, open that again in Xojo and use again Save As... to save it in VCP format. It's tedious but that's the way it's been working even before Arbed added support for VCP projects.

No comments:

Post a Comment