Recently I purchased the ISO standard specification for SGML, the markup language from which HTML and XML arose. It’s a complicated language, and most of the claims I’ve seen online about it are little more than hearsay (likely because the specification isn’t open and available for free). It’s been a whirlwind saturating myself with the context and reality of what SGML is and what it tried to accomplish, and I wanted to share some thoughts. There are plenty of them, but in this post I want to lay out some interesting things I didn’t realize until now, and some ways they’ve changed how I think about HTML and XML.
In time I hope to write more specifically about some of these points, but I realized that if I don’t toss them out in a more raw form then I may never get anything out.
In no particular order, and without proper reflection, here are things I’ve learned or thought about since working through the SGML specification.
- SGML is saturated with features meant to make it easier for humans to write more content and less markup. Much, or even most, of its complexity arises from forcing the software to meet the needs of the human rather than requiring a human to meet the needs of the software.
- SGML documents encode structure on top of plain text. The Document Type Definitions (DTDs), which formally describe document structure, make it possible for the SGML parser to infer structure when the text itself looks freeform.
- SGML documents represent content in storage. Even though they are meant to be editable by hand with a plain text editor, they are not meant for direct rendering or consumption. They are meant for running through an SGML parser and transformation pipeline, using stylesheets to indicate how to render the content.
- SGML is essentially a generic macro system for text entry to make it easier for people to write and maintain structured knowledge. Some of the macros are incredibly powerful, making it possible to embed Markdown sections, CSV content, tables, and other sub-formats that look like what they are, all with a single universal parser.
- SGML’s syntax is abstract. Even though the specification provides the “reference concrete syntax” most of us are familiar with (because HTML and XML both adopted it), the parsing rules talk about abstract token names like `STAGO` (start tag opener) instead of characters like `<`. This means that SGML documents can look very different with their syntax, but still operate with the same universal parser.
- Most of the aspects HTML adopted from SGML, which people critique for being loose or for allowing all forms of invalid markup, are intentional features meant to make it easier to physically type content. These include:
  - optional start tags, like the implicit `HTML`, `HEAD`, and `BODY` elements.
  - optional end tags, like when a new `P` element implicitly closes an open `P` element.
  - optional semicolons following character references when the following characters make identifying the reference unambiguous (for example, `1, 2, 3&hellip up until 1000` is allowed but `&nothing` isn’t).
  - named character references, which exist not only because character sets like `ISO-8859-1` can’t represent higher Unicode characters, but also because it was and can still be easier to type things like `&approx;` with the keyboard than to figure out how to insert U+2248, the ≈ character, through the operating system.
- HTML was always intended to be a profile or application of SGML, but much of it was built by example from SGML documents that people had seen. By the time HTML 4.0 attempted to provide a formal DTD describing HTML as an SGML document, it was too late: too much of the web was already written with non-conforming HTML. Alas, HTML was inspired by SGML, but its parsing rules were never fully compatible.
- The DTD is everything in SGML and is the bridge that connects seemingly free-form human entry with highly structured data. When XML was developed, people wanted to simplify SGML and make it possible to parse a document without knowledge of the DTD. Removing the DTD is also what makes optional tags impossible in XML: without it, a parser has no way to infer where an omitted tag belongs.
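To make the point about the DTD and omitted tags a little more concrete, here is a minimal Python sketch of how a parser might use a content model to fill in missing start and end tags. Everything in it is an assumption for illustration: the `CONTENT` table is a made-up, heavily simplified stand-in for a DTD (real SGML content models also express ordering, repetition, inclusions, and exclusions), and `implied_path` and `normalize` are hypothetical helpers, not part of any SGML tool.

```python
from collections import deque

ROOT = "HTML"

# Hypothetical, simplified "DTD": which children each element may contain.
# "#TEXT" stands for character data. Ordering rules are deliberately ignored.
CONTENT = {
    "HTML": ["HEAD", "BODY"],
    "HEAD": ["TITLE"],
    "TITLE": ["#TEXT"],
    "BODY": ["P"],
    "P": ["#TEXT"],
}


def implied_path(parent, child):
    """Return the elements to open under `parent` so that `child` becomes
    a legal child (possibly an empty list), or None if that is impossible."""
    queue = deque([(parent, [])])
    seen = {parent}
    while queue:
        element, path = queue.popleft()
        for candidate in CONTENT.get(element, []):
            if candidate == child:
                return path
            if candidate in CONTENT and candidate not in seen:
                seen.add(candidate)
                queue.append((candidate, path + [candidate]))
    return None


def normalize(tokens):
    """Turn a minimized token stream into a fully tagged one."""
    stack, out = [], []

    def open_element(name):
        stack.append(name)
        out.append(f"<{name}>")

    def close_element():
        out.append(f"</{stack.pop()}>")

    def make_room_for(child):
        if child == ROOT:
            if stack:
                raise ValueError(f"unexpected <{ROOT}> inside the document")
            return
        if implied_path(ROOT, child) is None:
            raise ValueError(f"{child} is not allowed anywhere in this model")
        while True:
            if not stack:
                open_element(ROOT)  # implied start tag for the root
            path = implied_path(stack[-1], child)
            if path is not None:
                for name in path:  # implied start tags
                    open_element(name)
                return
            close_element()  # implied end tag

    for token in tokens:
        if token.startswith("</"):
            name = token[2:-1]
            if name not in stack:
                raise ValueError(f"stray end tag {token}")
            while stack[-1] != name:
                close_element()
            close_element()
        elif token.startswith("<"):
            name = token[1:-1]
            make_room_for(name)
            open_element(name)
        else:
            make_room_for("#TEXT")
            out.append(token)

    while stack:
        close_element()
    return out


tokens = ["<TITLE>", "Musings", "</TITLE>", "<P>", "First", "<P>", "Second"]
print("".join(normalize(tokens)))
# <HTML><HEAD><TITLE>Musings</TITLE></HEAD><BODY><P>First</P><P>Second</P></BODY></HTML>
```

The author only typed a title and two paragraph openers; the table of allowed children supplied the rest. Take the table away and the second `<P>` could just as well be a paragraph nested inside the first, which is the ambiguity XML avoids by requiring every tag to be written out.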
Certain things about SGML remain difficult:
- It wasn’t designed with UTF-8 in mind even though standardizing on that would simplify quite a bit of the parser requirements. It was designed at a time when different computers used incompatible character sets, and accounted for that, but today these systems only add confusion.
- DSSSL, the stylesheet language for transforming SGML into presentation formats, never took off and has its own complexity.
- Almost everything about SGML is lost on the web behind 404s of long-expired URLs.
- There are almost no tools available to aid in writing SGML – no syntax highlighters, no language servers, a single outdated spec-compliant parser.
- It lacks namespaces the way XML has them, and those could be very handy for mixing semantics. XML succeeded in the extensibility aspect in a way that SGML and HTML never did.
But apart from these, SGML feels like a technology I wish had succeeded. I think in many ways it could have resolved a lot of the problems that led to the proliferation of incompatible markup syntaxes on the web, and could have done so with one unifying and universal parser. I’d like to see what it would be like to have a modern updated version of SGML and edit posts in WordPress in plain text while maintaining the rich structure we often demand.
Finally, reflecting on the Block model and Block editor in WordPress, I think we lucked out in continuing some of the values SGML aimed for. We have a kind of universal format for including high-level rich content types, and the posts are meant to be processed and transformed into different rendered or presentation formats by reinterpreting the raw attributes and block types. We have a format that by default and without any further server support renders usable documents in any browser. Block posts aren’t very convenient for editing by hand, but the Block editor itself is embeddable in any imaginable context. I hope that it continues to mature and one day we can look back and see it as the universal content editing model used to ensure free and interoperable sharing of knowledge around the web.
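As a small illustration of that dual nature, here is a rough Python sketch that reads block types and attributes out of a post and also shows what is left when the block markers are ignored. It assumes the serialized form in which blocks are delimited by HTML comments such as `<!-- wp:paragraph {"align":"center"} -->`. The regular expression is a simplified approximation, not WordPress’s own parser, and `BLOCK_DELIMITER`, `list_blocks`, and `plain_html` are names made up for this example.

```python
import json
import re

# Simplified approximation of a block delimiter: an HTML comment holding an
# optional closing slash, "wp:" plus the block name, optional JSON attributes,
# and an optional trailing slash for blocks with no inner content.
BLOCK_DELIMITER = re.compile(
    r"<!--\s+(?P<closer>/)?wp:(?P<name>[a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)\s+"
    r"(?:(?P<attrs>\{.*?\})\s+)?(?P<void>/)?-->",
    re.DOTALL,
)


def list_blocks(post_content):
    """Return (block name, attributes) for every opening or void delimiter."""
    blocks = []
    for match in BLOCK_DELIMITER.finditer(post_content):
        if match["closer"]:
            continue  # closing delimiters carry no new information
        attrs = json.loads(match["attrs"]) if match["attrs"] else {}
        blocks.append((match["name"], attrs))
    return blocks


def plain_html(post_content):
    """Drop the delimiters, leaving the markup a browser would display."""
    return BLOCK_DELIMITER.sub("", post_content)


post = (
    '<!-- wp:paragraph {"align":"center"} -->\n'
    '<p class="has-text-align-center">Hello</p>\n'
    "<!-- /wp:paragraph -->\n"
    '<!-- wp:spacer {"height":"50px"} /-->'
)

print(list_blocks(post))
# [('paragraph', {'align': 'center'}), ('spacer', {'height': '50px'})]
print(plain_html(post).strip())
# <p class="has-text-align-center">Hello</p>
```

Because the structure lives in comments, a browser that knows nothing about blocks still shows the paragraph, while a processor that does understand them can reinterpret each block type and its attributes for another output format.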
I agree with many of your points on SGML; I too wish it could have survived. Markup minimization and custom syntax using SHORTREFs are pretty nice features! On the downside, SGML has a steeper learning curve than related technologies such as HTML and XML.
Regarding the lack of namespaces: the CONCUR feature, made available in the SGML declaration, actually does something very similar. However, it looks kinda ugly (IMO), and as far as I can tell it was never actually implemented by any software. Check out the paper ‘Do we really want to see markup?’ presented at Balisage 2019 by James David Mason to get more insight on how that particular feature came to be.
Thanks for the great article reference — I took a quick look just now and have added that to my reading list. CONCUR scared me when I first saw it, but it seems like just one of the many features for which I could find no actual supporting software.
One thing I am constantly challenged by is whether related technologies are in fact easier to pick up. Markdown itself as a kind of evolution of HTML formatting tags has an incredibly low barrier to entry, but brings even more constraints than HTML does. I think these easier markup languages have pushed us to think of markup purely as formatted paragraphs to the exclusion of other well-structured data.
One thing I love about SGML is the ease with which I can define minimized structure. I like to use SGML for indicating currency values, or flights with their codes and times, or projects with their ids and whatnot. HTML custom elements provide some similar functionality, but in a more verbose and less convenient way.
So maybe it’s just the same old story of expressiveness vs. verbosity vs. ease. 🤷‍♂️