"SML and Ockham’s Razor: Too Close a Shave?"

As presented at XTech 2000 by Evan Lenz

This paper immediately followed SML-DEV's panel presentation.

The medieval philosopher William of Ockham is today probably best known for his famous "Ockham’s Razor," the principle that the simplest explanation tends to be the best one. We don't actually have a specific quote of him saying this in any one place; rather, the saying has been gleaned from the whole of his writings. Another way it has been stated is: "Entities are not to be multiplied beyond necessity." And my favorite: "What can be explained by the assumption of fewer things is vainly explained by the assumption of more things."

So why such philosophical talk at an XML conference? Because, at the hands of the SML-DEV group, it is Ockham's Razor that promises (or threatens) to cut out what is deemed unnecessary in XML. Short of indulging in clever debates about fairies dancing on pinheads, I will argue that Ockham's Razor prohibits rather than demands what the SML-DEV group has in mind.

SML-DEV's primary complaint is that, when it comes to simple e-commerce applications, XML retains too many of SGML's document-centric features--that, as a simplification of SGML, it did not go far enough. While nothing has yet been formally specified, current initiatives call for the removal of attributes, entities, mixed content, CDATA sections, DTDs, processing instructions, notations, and comments.
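To make the contrast concrete, here is a sketch using Python's standard xml.etree.ElementTree (the element names are hypothetical): the same record written with several of the features slated for removal, and again in an elements-and-text-only style of the kind SML-DEV envisions.

```python
import xml.etree.ElementTree as ET

# Full XML: an attribute, a comment, mixed content, and a CDATA section.
full = """<order id="42"><!-- rush order -->
<note>Ship <em>fast</em><![CDATA[ & early ]]></note>
</order>"""

# The same information in an elements-and-text-only style,
# in the spirit of the proposed subsets (names are hypothetical).
lean = """<order>
<id>42</id>
<note>Ship fast &amp; early</note>
</order>"""

# Any conforming XML processor reads both documents.
print(ET.fromstring(full).get("id"))       # 42
print(ET.fromstring(lean).findtext("id"))  # 42
```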

Arising out of the discussion is the proposal that two simplified subsets of XML be standardized, representing varying degrees of simplification. There seem to be two fundamental goals behind these proposals: 1) that XML be streamlined for simple applications such as business-to-business e-commerce and data interchange, and 2) that XML be easier to learn for the beginner. I will argue that these goals can be met by means other than the creation of formal subsets. Moreover, I will maintain that creating a subset, or subsets, of XML will, in the end, only complicate things, rather than simplify them.

(Note that when I use the term "SML," I will generally be referring to the simpler of SML-DEV's two proposed subsets. Nonetheless, my thesis encompasses both.)

One of the most important distinctions that needs to be made is that between the SGML-XML relationship and the XML-SML relationship. Because they both involve a syntax simplification resulting in a subset, the two relationships are certainly analogous, but they have decidedly different real-world implications.

This simple graph depicts the self-evident nature of subsets. It’s really not saying anything more than "when you take syntax features away, you lose them." In a perfect world, we could all get along just fine with SGML. With our super-human brains, we’d glance at the spec and know immediately what features we could get by with in our applications. Thus, we’d really be best off sticking with SGML and all the flexibility it provides. But in the real world, a simplified subset was needed for generalized markup to take off on the Web. Today, we are beginning to see the results of that simplification in XML’s widespread acceptance. We are also seeing new applications that extend beyond the document-publishing world into the realm of data interchange and e-commerce. And, all the while, XML retains much of its originally intended document-publishing power. Thus, this graph does not give us enough information. It doesn’t show us anything about the real world.

This less-than-rigorous graph reflects my take on the real-world implications involved in the SGML subset saga. The x-axis is the same as in the previous graph and represents the progression of syntax simplification. The y-axis represents a unique construct to serve my purposes, which I intend to defend--real-world adoption for data and documents. The point is to depict, in one graph, the value added by the move to XML and the value that would be lost by a move to SML. (To be fair, this graph is fully loaded against SML, and ultimately misses the real point, as we’ll soon see.) The benefit of the move to XML is evidenced by its adoption in a wide variety of applications, whether for document publishing or data interchange. The downward turn to SML is intended to reflect the breaking point for document-publishing applications. The real trigger is the removal of support for attributes, without which it would be very difficult to continue most document-publishing applications.

Fortunately, my thesis does not depend on this graph: even if it paints a somewhat accurate picture by my admittedly loaded criterion, it is irrelevant to the goals of SML-DEV, which are aimed at specific application domains to the exclusion of the document-publishing world. Nevertheless, it does illustrate the fact that what SML-DEV is trying to do isn’t just more of the same. It’s different this time around.

XML has blurred the practical distinction between documents and data. As a result, it has afforded us a level of syntax interoperability that we have never known before. I am not suggesting that there is something particularly special about processing instructions or notations or any of the other less-than-beautiful quirks of the XML syntax. The magic is not so much in the phenomenon of the XML 1.0 spec, as the phenomenon of turn-of-the-century XML--in other words, XML as rooted in history. There’s no telling what sorts of applications the future will bring. Given a lowest common denominator of XML syntax, a new world is opened up to powerful applications in semantics analysis, information integration, and artificial intelligence. Do we really want to jeopardize that future by making certain features of XML optional?

One of the most important design goals in XML 1.0 was that optional features be kept to a minimum, ideally zero. With a few minor exceptions regarding things like the handling of external parsed entities, that design goal was met. The strict definition of the concept of an XML processor and how it must behave enabled the potential for any XML processor in the world to be able to read any XML document in the world (given character encoding compatibility). This is in contrast to SGML, where the spec was too large to reasonably expect everyone to have something like a generic, fully-functional SGML parser.

XML defines a limited spectrum of syntax features, some or all of which we can choose to use in defining a document type. Nothing in the XML 1.0 spec forbids us from choosing not to use certain features, such as CDATA sections, in our applications. It does, however, forbid us from writing a parser that doesn’t know what CDATA sections are and calling it an XML processor. Yes, such a parser is able to parse a particular kind of XML document, but, for that matter, it is also able to parse a particular kind of SGML document. Interoperability problems arise when the integrity of the XML processor is not protected. The standardization of an XML subset and the accompanying introduction of generic SML processors would jeopardize that integrity.
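The point can be made concrete with a small sketch in Python's standard library: a conforming processor reports the same character data to the application whether that data arrives in a CDATA section or as escaped text.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same character data: a CDATA section and
# escaped text. A conforming XML processor must accept both and
# deliver identical content to the application.
with_cdata = ET.fromstring("<code><![CDATA[if (a < b) a++;]]></code>")
escaped    = ET.fromstring("<code>if (a &lt; b) a++;</code>")

print(with_cdata.text)  # if (a < b) a++;
assert with_cdata.text == escaped.text
```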

Here we should make a clear distinction between general-purpose XML processors as defined in the XML 1.0 spec, and streamlined, bundled processors, hidden inside specific applications. In these custom-built processors, such as the one that doesn’t support CDATA sections, the document type used by the application dictates what syntax needs to be used, and, consequently, how many different types of events your processor has to handle. In other words, the processor need only be as smart as the document type. Optimization occurs at the implementation level. But, again, we are no longer talking about an XML processor.
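A minimal sketch of such a bundled processor, using Python's expat bindings (the document type and element names are hypothetical): it registers handlers only for the events its known document type can produce, and so it simply never notices the features it ignores. It is not an XML processor in the sense of the XML 1.0 spec.

```python
import xml.parsers.expat

# An application-specific reader for a known document type that uses
# only elements and character data. Optimization happens here, at the
# implementation level -- not in the syntax definition itself.
def read_items(xml_text):
    items = []
    stack = []
    def start(name, attrs):
        stack.append(name)
    def end(name):
        stack.pop()
    def chars(data):
        if stack and stack[-1] == "item" and data.strip():
            items.append(data.strip())
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return items

print(read_items("<order><item>widget</item><item>gear</item></order>"))
# ['widget', 'gear']
```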

SML-DEV’s proposal amounts to premature optimization, because it attempts to optimize at the generic syntax level instead of at the particular implementation level. The fact that two subsets have already been proposed is symptomatic of this approach. Some implementations might require, for example, attributes but no mixed content, in which case SML is too simple, but "Common XML" is not simple enough. Does that mean we need to define another simplified subset? The door is opened to any number of arbitrary simplifications. For that matter, someone in the publishing world might not want to let go of a particular SGML feature. Why not create an XML superset? The point is not that XML is any less arbitrary than any other SGML subset. Rather, the value of XML lies in the fact that it is succeeding in getting people to flock to generalized markup, and that it retains its identity as one language across many application domains.

One potential advantage of a simplified XML subset, as SML-DEV points out, is that it would be much easier to learn for beginners. If entities were removed, for example, the beginner wouldn’t have to worry about the difference between unparsed and parsed entities, internal and external entities, general and parameter entities, and whatever combinations of those are legal or even make sense--let alone how entities are expanded or what constitutes replacement text. A syntax that is reduced to just elements, attributes, and text nodes, for example, would be much easier to absorb for the beginner.
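For instance, here is the one entity variety most beginners ever meet--an internal general entity declared in the internal DTD subset--expanded by Python's standard parser; a full XML processor must also cope with all the other combinations listed above.

```python
import xml.etree.ElementTree as ET

# An internal general entity, declared in the internal DTD subset and
# expanded by the processor. This is only one of the entity varieties
# (parsed/unparsed, internal/external, general/parameter) that a full
# XML processor must understand.
doc = """<!DOCTYPE greeting [
  <!ENTITY who "world">
]>
<greeting>Hello, &who;!</greeting>"""

print(ET.fromstring(doc).text)  # Hello, world!
```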

A large majority of the people who are and who will be using XML do not have the need to wade through the XML spec. The SML-DEV group would be the first to point out that many companies are already using simplified subsets of XML for data interchange, using only the features they need. They generally have no need to consult the spec, because they are not employing the more difficult aspects of the XML syntax. In any case, a practical understanding of XML does not necessarily require a comprehensive understanding of XML 1.0.

More important than the question of who needs to read the XML spec is the nature of XML as a meta-language and what that means for the way in which it is implemented. There is no such thing as a document that is "merely XML." It will always be XML-and-something. Since XML does not provide us with any semantics, we have to fill in those blanks ourselves. Accordingly, any given implementation of XML will always be rooted in a real-world vocabulary.

Newcomers to XML will always find themselves either 1) learning a specific pre-existing implementation of XML or 2) wanting to create their own markup language. The beginner who wants to learn a particular XML-based language need only learn the various features of XML syntax used in that language. If the schema for that language does not allow for the use of internal DTD subsets, for example, she will not need to know what those are. On the other hand, if her company is using a language in which internal subsets play an important role, she will have no choice but to learn how they work. The existence of a simplified meta-language out there will make no difference to her.

The other newcomer--the one who wants to design a markup language--will, of course, need to learn the fundamentals of the syntax that he is to employ. And, again, he does not necessarily have to learn all of XML. This is where I see SML-DEV’s efforts as being particularly helpful in providing guidelines for "all the XML you really need to know" for e-commerce applications and the like. They have identified a number of potential pitfalls in using various aspects of XML--whether ambiguities at the data modeling level, incongruities at the processing level, given current tools, or needless complexities at the application level--and pointed the would-be schema-designer in a direction likely to result in smoother sailing. These guidelines can and should be used with regard to specific XML implementations, but they hardly need to be promoted as a formalized specification.

The recurring theme here is that you do not necessarily need to have a comprehensive understanding of XML syntax rules in order to implement XML applications, whether pre-existing or new. The need for a syntax subset specification is negligible. Whereas SML-DEV points to de facto simplified subsets as evidence of the need for a formal spec, I would say that their existence is precisely evidence that we do not need one. What wasn’t happening with SGML is happening with XML. The simplification was sufficient to the extent that people are finally getting it. Why should we now confuse them with artificial choices between this meta-language and that meta-language?

The propagation of SML, or Common XML, or whatever other names might be used, as essentially alternatives to XML is what portends confusion--confusion not only for the beginner but also for the marketplace in general. The beginner would now have to wade through a slough of subsets and supersets before even beginning to learn or design specific schemas. And in the marketplace, general-purpose XML processors would now be supplemented by scaled-down versions--attractive because they are smaller and faster, but liable to fail on documents that are not SML-compliant.

There are a lot of subordinate issues here. We could argue all day long about the nature of attributes or the appropriateness of processing instructions in XML, but those questions detract from what I see as the real issue: whether XML should be split into subsets--whether we should give up the common syntax in favor of optimization for certain application domains, even at such an early point in time. I, for one, want to wait and see. On that note, let’s conclude with yet another version of Ockham’s Razor that seems particularly appropriate: "Plurality should not be posited without necessity."