XML Markup Design

Jeni Tennison has an excellent article posted about bad XML design. XML is both the greatest thing since sliced bread and your worst nightmare. I’ve been specializing in XML data specifications and markup languages for the last several years, and before that I worked with EDI and other file formats for exchanging business-to-business data. Poor XML design is all too common, and little thought or effort is applied to it. Jeni lists in her blog the signs of good XML markup. This post will focus primarily on the big bad design I see too often.

The Data Dump

Too many projects use XML as a dump from existing objects to a serializable form and then trade this with other applications as is. In most cases, any sort of model is an afterthought, and the vast majority of the XML that is out there doesn’t use a schema for verification, whether that be a DTD, RELAX NG, W3C XML Schema, or Schematron. One of the nightmare scenarios is dealing with XML that is a serialized dump of the data stored in a class or several classes. As Jeni says, take markup like:

<entry name="value1">Some Value</entry>
<entry name="value2">Another value</entry>

It’s easy to serialize, but it conveys no real meaning about the data being described. While this is alright for quick and dirty methods, it does little to help with the maintainability of a system, and due to time constraints the quick and dirty method often gets delivered into production. I’ve been talking about refactoring lately, and I believe the same concepts of creating maintainable code apply just as much to the XML formats that you use and create. While the XML format above may make perfect sense when you are creating it, imagine coming in a year later, new to the system, and trying to figure out what it actually does. What is an entry? What type of option? What does that information actually mean? Is there any documentation for the above? If there is, is the documentation still current?

The problem with generic, general-purpose names is that they don’t convey any meaning about what is being represented. Dumping the data, while quick and dirty, doesn’t make it any more maintainable. Plus, as Jeni says:

… it’s easy to marshal data into it and unmarshal data out of it, and it’s not as if the configuration is going to be shared with other applications. But that mentality tightly couples your current implementation with the configuration file: bad news if your application’s data structures change down the road.

As we know, loose coupling is a good thing: it allows flexibility in design and implementation later on. Somehow, when people use XML, they lose track of this principle.
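One way to keep that loose coupling is to put an explicit mapping between the XML format and the internal data structure, instead of serializing the object directly. The following is only a minimal sketch in Python using the standard library’s xml.etree.ElementTree; the mailSettings document, its element names, and the MailSettings class are all hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical configuration document with descriptive element names.
CONFIG_XML = """
<mailSettings>
  <smtpHost>mail.example.org</smtpHost>
  <smtpPort>587</smtpPort>
</mailSettings>
"""


@dataclass
class MailSettings:
    # Internal field names can be renamed freely; only from_xml and
    # to_xml know the element names used in the document.
    host: str
    port: int


def from_xml(text: str) -> MailSettings:
    """Map the XML format onto the internal object."""
    root = ET.fromstring(text)
    return MailSettings(
        host=root.findtext("smtpHost"),
        port=int(root.findtext("smtpPort")),
    )


def to_xml(settings: MailSettings) -> str:
    """Map the internal object back onto the XML format."""
    root = ET.Element("mailSettings")
    ET.SubElement(root, "smtpHost").text = settings.host
    ET.SubElement(root, "smtpPort").text = str(settings.port)
    return ET.tostring(root, encoding="unicode")


settings = from_xml(CONFIG_XML)
round_tripped = from_xml(to_xml(settings))
```

If the class later changes, for example if host and port are merged into one address field, only the two mapping functions need to change; the documents already in circulation stay valid.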

Overuse of Attributes

Another problem with XML data dumps is the overuse of attributes. Again, attributes are very easy to marshal and unmarshal, but they aren’t very extensible. The problem with attributes is that there is no nesting of the data and no hierarchy or order to it. Plus, you can’t group like components together. Elliotte Rusty Harold describes the faults with this pattern best in his book “Effective XML”. I’m going to pick on Mylyn for a moment. I use it every day, but it runs into this design issue.

<BugzillaReport Active="false" Complete="true" CreationDate="2008-05-01 12:30:00.0 GMT-05:00"
DueDate="" EndDate="2008-05-05 22:20:00.0 GMT-05:00" Estimated="1"
Floating="false" Handle="https://bugs.eclipse.org/bugs-229810"
IssueURL="https://bugs.eclipse.org/bugs/show_bug.cgi?id=229810" Kind="Bug"
Label="[xslt] 0.5M7 New and Noteworthy"
LastModified="2008-05-05 22:20:56 -0400" Notes="" NotifiedIncoming="true"
Owner="somebody@somwhere.com" Priority="P3" Reminded="false" ReminderDate=""
RepositoryUrl="https://bugs.eclipse.org/bugs" Stale="false"
bugzilla.product="Web Tools" bugzilla.severity="enhancement"

The above is just a dump of a lot of attributes with no organization or order. Are they necessarily all attributes of the BugzillaReport, or could they be better expressed, and made more extensible, by being broken out into elements? Why is DueDate included if it has no value? What is the difference between Handle, IssueURL, and RepositoryUrl? What is Handle’s function?
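One concrete limit of attributes can be sketched in a few lines of Python with the standard library’s xml.etree.ElementTree: an attribute name may appear only once per element, so a value that repeats cannot be expressed as attributes without inventing a naming or delimiter convention, while child elements repeat naturally. The Task element and the Tag names below are hypothetical and are not taken from Mylyn.

```python
import xml.etree.ElementTree as ET

# Hypothetical task with two tags, written both ways.
AS_ATTRIBUTES = '<Task Tag="xslt" Tag="docs"/>'
AS_ELEMENTS = "<Task><Tag>xslt</Tag><Tag>docs</Tag></Task>"


def parses(text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Repeating an attribute name violates XML well-formedness,
# so the parser rejects the attribute form.
attribute_form_ok = parses(AS_ATTRIBUTES)

# Repeated child elements are fine and keep their document order.
tags = [tag.text for tag in ET.fromstring(AS_ELEMENTS).findall("Tag")]
```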

A possible restructuring could be (note I’m not doing the full restructuring):

<Report id="https://bugs.eclipse.org/bugs-229810">
  <Description>[xslt] 0.5M7 New and Noteworthy</Description>
  <CreationDate>2008-05-01 12:30:00.0 GMT-05:00</CreationDate>
  <EndDate>2008-05-05 22:20:00.0 GMT-05:00</EndDate>
  <!-- remaining fields omitted -->
</Report>

While the above is a bit more verbose, it is also a lot more extensible and reusable. It allows the underlying back-end implementation to evolve independently of how the data is formatted. It may take a bit more work, but it generalizes the overall concepts and organizes the data better. While Mylyn’s current TaskList structure works, it ties the implementation and data very closely together, and it doesn’t clearly communicate the intent of the data specification.
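To make the extensibility point a little more concrete, here is a minimal sketch in Python using the standard library’s xml.etree.ElementTree. A consumer written against the element-based format reads the elements it knows by name, so a grouping added later does not break it. The Tracker element, its kind attribute, and the Product and Severity children are hypothetical names invented for this illustration; they are not part of Mylyn’s format.

```python
import xml.etree.ElementTree as ET

# Element-based report with a hypothetical <Tracker> grouping added.
REPORT_XML = """
<Report id="https://bugs.eclipse.org/bugs-229810">
  <Description>[xslt] 0.5M7 New and Noteworthy</Description>
  <CreationDate>2008-05-01 12:30:00.0 GMT-05:00</CreationDate>
  <Tracker kind="bugzilla">
    <Product>Web Tools</Product>
    <Severity>enhancement</Severity>
  </Tracker>
</Report>
"""


def read_summary(text):
    """Read only the fields this consumer knows; ignore everything else."""
    root = ET.fromstring(text)
    return {
        "id": root.get("id"),
        "description": root.findtext("Description"),
        "created": root.findtext("CreationDate"),
    }


def read_tracker(text):
    """Read the grouped, tracker-specific fields as one unit."""
    tracker = ET.fromstring(text).find("Tracker")
    return {child.tag: child.text for child in tracker}


summary = read_summary(REPORT_XML)
tracker = read_tracker(REPORT_XML)
```

Here read_summary never needs to know that Tracker exists, and the tracker-specific values travel together as one group instead of having the grouping encoded into attribute names.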

How you design the XML that you use does have an impact on your systems.


4 Responses to XML Markup Design

  1. gerd says:

    You can make the attribute style look good just by using good indentation, as you did in your proposal. I have good reasons to prefer attributes where the information has only one value. As a human reader, I KNOW that there is only one possible value without knowing the schema or DTD. In the last example, I have to combine my interpretation of the element name with my cultural background to make this assumption. The same holds for a program that reads this information into its object model. Using this XML style, I’m able to write code where I only need one line to map the attribute to an object field in my object model, in the class that is responsible for that object. If you map object to XML element and field to XML attribute (with a few simple additional footnotes), you get code that is extremely human readable, maintainable, and hidden in the class responsible for that element/object. It also turns out that this style results in very fast XML reading. Slow XML reading is also a result of your proposed style, which needs heavy work in the XML framework. I found that others invented the same style and framework as I did, especially in the area of embedded devices. Best regards, gerd

  2. David Carver says:

    Some good points, Gerd. In fact, I don’t think there is a set rule one can come up with that fits everybody’s preferences. The OO frameworks out there for XML tend to favor their authors’ particular preferences when it comes to design. There are some very lightweight frameworks out there that read the XML just as quickly from elements as they do from attributes. A good article discussing the trade-offs in the ongoing elements vs. attributes perma-thread can be found here: http://www.ibm.com/developerworks/xml/library/x-eleatt.html

  3. Mik Kersten says:

    Interestingly, the first pass at the tasklist.xml had almost everything as elements. We then flattened it into attributes for the reasons Gerd outlines, and due to memory usage concerns. While on disk the extra redundancy of a bunch of elements should be negligible, since we zip everything, with a Task List like mine (over 10K elements, 7.5MB uncompressed, 900K compressed) overuse of elements can use significant memory when reading and writing. If there’s a way to get the benefit of elements without excess memory consumption, it would be great to hear about it, because I think that we’ve now crossed the line into abusing attributes. For example, “bugzilla.product” and “bugzilla.severity” would make sense to group together as one element instead of having the categorization encoded into their names. It would be great if you could file a bug report against Mylyn with a link to your post so that we could discuss how to improve the design.
