Jeni Tennison has posted an excellent article about bad XML design. XML is both the greatest thing since sliced bread and your worst nightmare. I’ve been specializing in XML data specifications and markup languages for the last several years, and before that I worked with EDI and other file formats for exchanging business-to-business data. Poor XML design is all too common, and little thought or effort is applied to it. In her blog, Jeni lists the signs of good XML markup. This post will focus primarily on the big bad design I see too often.
The Data Dump
Too many projects use XML as a dump from existing objects to a serializable form and then trade it with other applications as is. In most cases, any sort of model is an afterthought, and the vast majority of the XML out there isn’t validated against a schema, whether that be a DTD, RELAX NG, W3C XML Schema, or Schematron. One of the nightmare scenarios is dealing with XML that is a serialized dump of the data stored in a class or several classes. As Jeni says, consider markup like:
<entry name="value1">Some Value</entry>
<entry name="value2">Another value</entry>
It’s easy to serialize, but it conveys no real meaning about the data being described. While this is alright for quick and dirty work, it does little to help with the maintainability of a system, and due to time constraints the quick and dirty method often gets delivered into production. I’ve been talking about refactoring lately, and I believe the same concepts of creating maintainable code apply just as much to the XML formats that you use and create. While the XML format above may make perfect sense when you are creating it, imagine coming in a year later, new to the system, and trying to figure out what it actually does. What is entry? What type of option? What does that information actually mean? Is there any documentation for it? If there is, is the documentation still current?
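To see the difference from the consuming side, here is a minimal Python sketch using the standard library's ElementTree. The config root element and the title and subtitle names are assumptions made up for illustration; the original markup does not say what value1 and value2 actually represent, which is exactly the problem.

```python
import xml.etree.ElementTree as ET

# The generic form from the example above, wrapped in a hypothetical root.
generic = ET.fromstring(
    "<config>"
    '<entry name="value1">Some Value</entry>'
    '<entry name="value2">Another value</entry>'
    "</config>"
)

# Hypothetical descriptive names; the original does not say what the values mean.
descriptive = ET.fromstring(
    "<config>"
    "<title>Some Value</title>"
    "<subtitle>Another value</subtitle>"
    "</config>"
)

# The element names are all a reader (or a schema) sees of the structure.
generic_tags = [child.tag for child in generic]
descriptive_tags = [child.tag for child in descriptive]

print(generic_tags)      # ['entry', 'entry']
print(descriptive_tags)  # ['title', 'subtitle']
```

With the generic form, every child looks the same, so the meaning lives in string keys that only the producing application understands.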
The problem with generic, general-purpose names is that they don’t convey any meaning about what is being represented. Dumping the data may be quick and dirty, but it doesn’t make the format any more maintainable. Plus, as Jeni says:
… it’s easy to marshal data into it and unmarshal data out of it, and it’s not as if the configuration is going to be shared with other applications. But that mentality tightly couples your current implementation with the configuration file: bad news if your application’s data structures change down the road.
As we know, loose coupling is a good thing: it allows flexibility in design and implementation later on. Somehow, when people use XML they lose track of this principle.
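One way to keep that loose coupling is to map between the internal object and the XML format explicitly instead of dumping fields as they are. The following is only a sketch under assumptions: ServerConfig, its fields, and the server, host, and port element names are all hypothetical and do not come from any real application.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ServerConfig:
    # Hypothetical internal class; its field names are free to change.
    hostname: str
    tcp_port: int


def to_xml(config: ServerConfig) -> str:
    # The element names belong to the XML format, not to the class.
    root = ET.Element("server")
    ET.SubElement(root, "host").text = config.hostname
    ET.SubElement(root, "port").text = str(config.tcp_port)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> ServerConfig:
    # Reading goes through the same explicit mapping in reverse.
    root = ET.fromstring(text)
    return ServerConfig(
        hostname=root.findtext("host"),
        tcp_port=int(root.findtext("port")),
    )


xml_text = to_xml(ServerConfig(hostname="example.org", tcp_port=8080))
round_tripped = from_xml(xml_text)
```

If hostname is later renamed inside the application, only the two mapping functions change; the files already written in the XML format stay valid.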
Overuse of Attributes
Another problem with XML data dumps is the overuse of attributes. Again, attributes are very easy to marshal and unmarshal, but they aren’t very extensible. The problem with attributes is that the data can’t be nested: there is no hierarchy or order to it, and you can’t group like components together. Elliotte Rusty Harold describes the faults with this pattern best in his book “Effective XML”. I’m going to pick on Mylyn for a moment. I use it every day, but it runs into this design issue.
<BugzillaReport Active="false" Complete="true" CreationDate="2008-05-01 12:30:00.0 GMT-05:00"
DueDate="" EndDate="2008-05-05 22:20:00.0 GMT-05:00" Estimated="1"
Label="[xslt] 0.5M7 New and Noteworthy"
LastModified="2008-05-05 22:20:56 -0400" Notes="" NotifiedIncoming="true"
Owner="firstname.lastname@example.org" Priority="P3" Reminded="false" ReminderDate=""
bugzilla.product="Web Tools" bugzilla.severity="enhancement"
The above is just a dump of a lot of attributes with no organization or order. Are they necessarily all attributes of the BugzillaReport, or could they be better expressed, and made more extensible, by being broken out into elements? Why is DueDate included if it has no value? What is the difference between Handle, IssueURL, and RepositoryURL? What is Handle’s function?
A possible restructuring could be (note I’m not doing the full restructuring):
<Description>[xslt] 0.5M7 New and Noteworthy</Description>
<CreationDate>2008-05-01 12:30:00.0 GMT-05:00</CreationDate>
<EndDate>2008-05-05 22:20:00.0 GMT-05:00</EndDate>
While the above is a bit more verbose, it is also a lot more extensible and reusable. It allows the underlying back-end implementation to evolve independently of how the data is formatted. It may take a bit more work, but it generalizes the overall concepts and organizes the data better. While Mylyn’s current TaskList structure works, it ties the implementation and data very closely together and doesn’t clearly communicate the intent of the data specification.
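To make the extensibility point concrete, here is a small Python sketch using the standard library's ElementTree. The nested Owner and Email elements and the second owner address are hypothetical, invented for illustration; they are not part of Mylyn's actual format.

```python
import xml.etree.ElementTree as ET

# An attribute name may appear only once per element, so a second owner
# cannot be added without inventing a new attribute name.
try:
    ET.fromstring(
        '<BugzillaReport Owner="firstname.lastname@example.org" '
        'Owner="second.owner@example.org"/>'
    )
    duplicate_attribute_allowed = True
except ET.ParseError:
    duplicate_attribute_allowed = False

# Hypothetical element form: Owner can repeat and carry child elements.
report = ET.fromstring(
    "<BugzillaReport>"
    "<Owner><Email>firstname.lastname@example.org</Email></Owner>"
    "<Owner><Email>second.owner@example.org</Email></Owner>"
    "</BugzillaReport>"
)
owner_emails = [owner.findtext("Email") for owner in report.findall("Owner")]
```

The attribute version is rejected by the parser as not well-formed, while the element version can grow more owners, or more detail per owner, without breaking existing readers.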
How you design the XML that you use does have an impact on your systems.