Great to see that somebody else creates a true open source XSLT 3 and XPATH 3 implementation!
I worked on projects which refused to use anything more modern than XSLT & XPATH 1.0 because of lack of support in the non-Java/.NET world (1.0 = tech from 1999). Kudos to Saxon though, it was and is great but I wished there were more implementations of XSLT 2.0 & XPATH 2.0 and beyond in the open source world... both are so much more fun and easier to use in 2.0+ versions.
For that reason I've never touched XSLT 3.0 (because I stuck to Saxon B 9.1 from 2009). I have no doubt it's a great spec but there should be other ways than only Saxon HE to run it in an open source way.
It's like we have an amazing modern spec but only one browser engine to run it ;)
smartmic
Well, it's not as if this is the first free alternative. Here is a wonderful, incredibly powerful tool, not written in Java, but in Free Pascal, which is probably too often underestimated: Xidel[1]. Just have a look at the features and check its Github page[2]. I've often been amazed at its capabilities and, apart from web scraping, I mainly use it for XQuery executions - so far the latest version 0.9.9 has also implemented XPath/XQuery 3.1 perfectly for my requirements. Another insider tip is that XPath/XQuery 3.1 can also be used to transform JSON wonderfully - JSONiq is therefore obsolete.
We also use BaseX to write restful backends with RestXQ - https://docs.basex.org/12/RESTXQ - the documentation itself is written in XQuery as well and uses a BaseX database as a source.
therealmarv
Interesting, did not know about that one! Thanks. One (small) but: XSLT is not covered by it, which unfortunately is my main usage of XPATH.
I will do some experiments with using newer XPATH on JSON... that could be interesting.
Finnucane
I've worked on archive projects with complex TEI xml files (which is why when people say xml is bad and it should be all json or whatever, I just LOL), and fortunately, my employer will pay for me to have an editor (Oxygen) that includes the enterprise version of Saxon and other goodies. An open-source xml processing engine that wasn't decades out of date would be a big deal in the digital humanities world.
sramsay
I don't think people realize just how important XML is in this space (complex documentary editing, textual criticism, scholarly full-text archives in the humanities). JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.
Nothing could compel me to like XSLT. I admire certain elements of its design, but in practice, it just seems needlessly verbose. But I really love XPath, though.
miki123211
XML is great for documents.
If your data is essentially a long piece of text, with annotations associated with certain parts of that text, this is where XML shines.
When you try to use XML to represent something like an ecommerce order, financial transaction, instant message and so on, this is where you start to see problems. Trying to shove some extremely convoluted representation of text ranges and their attributes into JSON is just as bad.
A good "rule of thumb" would be "does this document still make sense if all the tags are stripped, and only the text nodes remain?" If yes, choose XML, if not, choose JSON.
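A rough way to try this rule of thumb is to strip the tags and read what is left. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `text_only` and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def text_only(xml_source: str) -> str:
    """Return the document with all tags stripped and whitespace collapsed."""
    root = ET.fromstring(xml_source)
    # itertext() yields every text node in document order.
    return " ".join("".join(root.itertext()).split())

# Hypothetical samples: a marked-up sentence and a data record.
prose = '<p>The <term>quick</term> fox <note type="gloss">jumps</note> over it.</p>'
order = "<order><id>17</id><qty>3</qty><sku>AB-9</sku></order>"

print(text_only(prose))  # still a readable sentence
print(text_only(order))  # an opaque run of values
```

The first result still reads as text, which suggests markup; the second loses its meaning once the element names are gone, which suggests a data format.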
amy_petrik
XML is honestly the greatest and I'm not sure why it didn't take off. People sometimes ask me, "what impacted the humanity the most - electricity? antibiotics? combustion engines?" -- no, no, and no, it was XML. Everything can be expressed in XML, and basically everything can read and write XML. It's like the whole world could read and write the same files. Imagine what if those files included programs, that's what XSLT is, a program that's a file of the XML format that performs transformations between XML format and XML format. Wow - now everything can read and write your programming language! About 90% of it is usually around a capacity to use XML to document your XML to XML transforming XML code, and then the other 9% is boilerplate, 1% does the lifting. Brilliant. Imagine a more verbose java, for those of us who find java terse, it almost feels like assembly to me. XML is like the tower of babel that unites all of humanity and JSON is the devil that shattered that dream.
What actually prevents JSON from being used in these spaces? It seems to me that any XML structure can be represented in JSON. Personally, I've yet to come across an XML document I didn't wish was JSON, but perhaps in spaces I haven't worked with, it exists.
geocar
> It seems to me that any XML structure can be represented in JSON
Well it can't: JSON has no processing instructions, no references, no comments, JSON "numbers" are problematic, and JSON arrays can't have attributes, so you're stuck with some kind of additional protocol that maps the two.
For something that is basically text (like an HTML document) or a list of dictionaries (like RSS) it may not seem obvious what the value of these things are (or even what they mean, if you have little exposure to XML), so I'll try and explain some of that.
1. Processing instructions are like <?xml?> and <?xml-stylesheet?> -- these let your application embed linear processing instructions that you know are for the implementation, and so you know what your implementation needs to do with the information: If it doesn't need to do anything, you can ignore them easily, because they are (parsewise) distinct.
2. References (called entities) are created with <!ENTITY x ...> and then you use them as &x; maybe you are familiar with &lt; representing <, but this is not mere string replacement: you can work with the pre-parsed entity object (for example, if it's an image), or treat it as a reference (which can make circular objects possible to represent in XML), neither of which is possible in JSON. Entities can be behind an external URI as well.
3. Comments are for humans. Lots of people put special {"comment":"xxx"} objects in their JSON, so you need to understand that protocol and filter it. They are obvious (like the processing instructions) in XML.
4. JSON numbers fold into floats of different sizes in different implementations, so you have to avoid them in interchange protocols. This is annoying and bug-prone.
5. Attributes are the things on xml tags <foo bar="42">...</foo> - Some people map this in JSON as {"bar":"42","children":[...],"tag":"foo"} and others like ["foo",{"bar":"42"},...] but you have to make a decision -- the former may be difficult to parse in a streaming way, but the latter creates additional nesting levels.
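To make point 5 concrete, here is a minimal sketch of both mappings using Python's standard library. The function names `to_object_form` and `to_array_form` are hypothetical, and the sample element is invented; neither mapping is a standard.

```python
import json
import xml.etree.ElementTree as ET

def to_object_form(el):
    """Mapping 1: attributes sit next to the "tag" and "children" keys."""
    children = [el.text] if el.text else []
    for child in el:
        children.append(to_object_form(child))
        if child.tail:
            children.append(child.tail)
    # Note: an attribute literally named "tag" or "children" would be overwritten.
    return {**el.attrib, "children": children, "tag": el.tag}

def to_array_form(el):
    """Mapping 2: [tag, attributes, child, child, ...]."""
    out = [el.tag, dict(el.attrib)]
    if el.text:
        out.append(el.text)
    for child in el:
        out.append(to_array_form(child))
        if child.tail:
            out.append(child.tail)
    return out

root = ET.fromstring('<foo bar="42">hi<b>there</b></foo>')
print(json.dumps(to_object_form(root)))
print(json.dumps(to_array_form(root)))
```

Both outputs describe the same element, but a consumer has to know which convention was chosen, and the first one can collide with attribute names.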
None of this is insurmountable: You can obviously encapsulate almost anything in almost anything else, but think about all the extra work you're doing, and how much risk there is in that code working forever!
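Point 4 can be reproduced in a few lines. Python's `json` module keeps integers exact by default, so the sketch passes `parse_int=float` to imitate a parser that stores every number as an IEEE 754 double; the payload and its `id` field are made up.

```python
import json

payload = '{"id": 9007199254740993}'  # 2**53 + 1, a made-up identifier

exact = json.loads(payload)["id"]
# parse_int=float imitates a parser that stores every number as a double.
as_double = json.loads(payload, parse_int=float)["id"]

print(exact)
print(int(as_double))
print(exact == as_double)
```

The two parsers disagree on the same document, which is why identifiers and money amounts often end up as strings in JSON interchange.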
For me: I process financial/business data mostly in XML, so it is very important I am confident my implementation is correct, because shit happens as the result of that document getting to me. Having the vendor provide a spec any XML software can understand helps us have a machine-readable contract, but I am getting a number of new vendors who want to use JSON, and I will tell you their APIs never work: They will give me openapi and swagger "templates" that just don't validate, and type-coding always requires extra parsing of the strings the JSON parsing comes back with. If there's a pager interface: I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's a ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
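The date problem alone tends to produce code like the following. This is only a sketch of the kind of guessing involved: the function name `parse_timestamp`, the seconds-versus-milliseconds threshold, and the "naive means UTC" rule are all assumptions, not anything a vendor promised.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_timestamp(value):
    """Normalize the timestamp shapes listed above to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        # Assumption: values this large are milliseconds, not seconds.
        seconds = value / 1000 if value > 1e11 else value
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        parsed = parsedate_to_datetime(value)  # HTTP / RFC 2822 style date
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)

samples = [
    1700000000,                       # unix time in seconds
    1700000000000,                    # the same instant, 1000x off
    "2023-11-14T22:13:20Z",           # ISO 8601 inspired string
    "Tue, 14 Nov 2023 22:13:20 GMT",  # HTTP date
]
for sample in samples:
    print(parse_timestamp(sample).isoformat())
```

All four samples describe the same instant, yet each needs its own branch, and every branch encodes a guess that a typed schema would have made explicit.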
So I am always finding JSON that I wish were XML, because (in my use-cases) XML is just plain better than JSON, but if you do a lot in languages with poor XML support (like JavaScript, Python, etc) all of these things will seem hard enough you might think json+xyz is a good alternative (especially if you like JSON), so I understand the need for stuff like "xee" to make XML more accessible so that people stop doing so much with JSON. I don't know rust well enough to know if xee does that, but I understand fully the need.
Have you ever written Markdown? Markdown is typically mostly human-readable text, interspersed with occasional formatting instructions. That's what XML is good for, except that it's more verbose but also considerably more flexible, more precise, and more powerful. Sure, you can losslessly translate any structural format into almost any other structural format, but that doesn't mean that working with the latter format will be as convenient or as efficient as working with the former.
XML can really shine in the markup role. It got such a bad rap because people used it as a pure data format, something it isn't very suited for.
Finnucane
in addition to all the things listed above, json has no practical advantage. json offers no compelling feature that would make anyone switch. what would be gained?
tomjen3
>JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.
```js
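import * as fastXmlParser from 'fast-xml-parser';
// Keep attributes in the parsed result instead of discarding them.
const xmlParser = new fastXmlParser.XMLParser({ ignoreAttributes: false });
const parsed = xmlParser.parse('<foo bar="42">text</foo>');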
```
Validate input as required with jschema.
faassen
My hope is that we can get a little collective together that is willing to invest in this tooling, either with time or money. I didn't have much hope, but after seeing the positive response today I have more than before.
cat_multiverse
Oxygen was such a clunky application back when I used it for DH. But very powerful and the best tool in the game. Would love to see a modern tool that doesn't get in the way for all those poorly paid, overworked DH research assistants caffeinated in the dead of night banging out the tedious, often very manual, TEI-XML encoding work...
infogulch
There are many humongous XML sources. E.g. the Wikipedia archive is 42GB of uncompressed text. Holding a fully parsed representation of it in memory would take even more, perhaps even >100GB which immediately puts this size of document out of reach.
How hard is it to implement XML/XSLT/XPATH streaming?
faassen
Anything could be supported with sufficient effort, but streaming hasn't been my priority so far and I haven't explored it in detail. I want to get XSLT 3.0 working properly first.
There's a potential alternative to streaming, though - succinct storage of XML in memory:
The parsed in memory overhead goes down to 20% of the original XML text in my small experiments.
There's a lot of questions on how this functions in the real world, but this library also has very interesting properties like "jump to the descendant with this tag without going through intermediaries".
bambax
> I want to get XSLT 3.0 working properly first
May I ask why? I used to do a lot of XSLT in 2007-2012 and stuck with XSLT 2.0. I don't know what's in 3.0 as I've never actually tried it but I never felt there was some feature missing from 2.0 that prevented me from doing something.
As for streaming, an intermediary step would be the ability to cut up a big XML file in smaller ones. A big XML document is almost always the concatenation of smaller files (that's certainly the case for Wikipedia for example). If one can output smaller files, transform each of them, and then reconstruct the initial big file without ever loading it in full in memory, that should cover a huge proportion of "streaming" needs.
faassen
XSLT has been a goal of this project from the start, as my customer uses it. XSLT 3.0 simply because that's the latest specification. What tooling do you use for XSLT 2.0?
bambax
Saxon's free version, which IIRC only implemented 2.0.
infogulch
0.2x of the original size would certainly make big documents more accessible. I've heard of succinct storage, but not in the context of xml before, thanks for sharing!
faassen
I myself actually had no idea succinct data structures existed until last December, but then I found a paper that used them in the context of XML. Just to be clear: it's 120% of the original size; as it stands this library still uses more memory than the original document, just not a lot of overhead. Normal tree libraries, even if the tree is immutable, take a parent pointer, and a first child pointer and next and previous sibling pointers per node. Even though some nodes can be stored more compactly it does add up.
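As a back-of-the-envelope illustration of how the pointers add up, the sketch below compares four pointers per node with a balanced-parentheses encoding, which is one common succinct way to store a tree's shape in about two bits per node. The pointer size and node count are assumptions, and this is not a description of how Xoz itself stores things; text and tag names are left out of both figures.

```python
POINTER_BYTES = 8      # assumption: 64-bit pointers
POINTERS_PER_NODE = 4  # parent, first child, next sibling, previous sibling

def pointer_tree_bytes(node_count: int) -> int:
    """Structure cost of a classic pointer-based tree."""
    return node_count * POINTERS_PER_NODE * POINTER_BYTES

def balanced_parens_bytes(node_count: int) -> int:
    """One open and one close bit per node, rounded up to whole bytes."""
    return (2 * node_count + 7) // 8

nodes = 1_000_000  # made-up document size
print(pointer_tree_bytes(nodes))     # 32000000
print(balanced_parens_bytes(nodes))  # 250000
```

A real succinct structure needs some extra index data on top of the raw bits to navigate quickly, so the second number is a lower bound rather than a measurement.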
I suspect with the right FM-Index Xoz might be able to store huge documents in a smaller size than the original, but that's an experiment for the future.
lambda
Would you be able to parse it in a streaming fashion and just store the structure of the document in memory, with just offsets for all of the string locations, and then re-read those from disk as needed?
With modern SSDs and disk cache, that's likely enough to be plenty performant without having to store the whole document in memory at once.
jerf
"How hard is it to implement XML/XSLT/XPATH streaming?"
It's actually quite annoying in the general case. It is completely possible to write an XPath expression that says to match a super early tag based on an arbitrarily distant later tag.
In another post in this thread I mention how I think it's better to think of it as a multicursor, and this is part of why. XPath doesn't limit itself to just "descending", you can freely cursor up and down in the document as you write your expression.
So it is easy to write expressions where you literally can't match the first tag, or be sure you shouldn't return it, until the whole document has been parsed.
riedel
I think from a grammar side, XPath had made some decisions that make it really hard to generally implement it efficiently. About 10 years ago I was looking into binary XML systems and compiling stuff down for embedded systems realizing that it is really hard to e.g. create efficient transducers (in/out pushdown automata) for XSLT due to complexity of XPath.
ssdspoimdsjvv
Streaming is defined in the XSLT 3 spec: https://www.w3.org/TR/xslt-30/#streamability. When you want to use streaming, you are confined to a subset of XPath that is "guaranteed streamable", e.g. you can't just freely navigate the tree anymore. There are some special instructions in XSLT such as <xsl:merge> and <xsl:accumulator> that make it easier to collect your results.
Saxon's paid edition supports it. I've done it a few times, but you have to write your XSLT in a completely different way to make it work.
mintplant
Would it be possible to transform a large XML document into something on-disk that could be queried like a database by the XPath evaluator?
jerf
Given the nature of this processing, I think even an NVMe-based disk storage would be awfully slow. (People often forget, or never realize, that the "gigabytes per second" that NVMe yields is for sequential access. Random access is quite a bit slower; still stomps spinning rust, but by much less. And this is going to be a random access sort of job, so we're in the "several multiples slower than RAM" regime of access.) This sort of thing really wants RAM, and even then, RAM with an eye towards cache coherency and other such performance considerations.
chii
You'd basically be building an index into each node.
There are some fast databases that store prefix trees, which might be suitable for such a task actually (something like infinitydb). But building this database will take a while (it will require parsing the entire document). But I suppose if reading/querying is going to happen many times, it's worth it?
econ
It seems to me one could replace each text node with the offset. Perhaps limit it to longer instances?
phonon
Like MarkLogic?
wongarsu
100GB doesn't sound that out of reach. It's expensive in a laptop, but in a desktop that's about $300 of RAM and is supported by many consumer mainboards. Hetzner will rent me a dedicated server with that amount of RAM for $61/month.
If the payloads in question are in that range, the time spent to support streaming doesn't feel justified compared to just using a machine with more memory. Maybe reducing the size of the parsed representation would be worth it though, since that benefits nearly every use case
infogulch
I just pulled the 100GB number out of nowhere, I have no idea how much overhead parsed xml consumes, it could be less or it could be more than 2.5x (it probably depends on the specific document in question).
In any case I don't have $1500 to blow on a new computer with 100GB of ram in the unsubstantiated hope that it happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for every person that wants to mess with big xml files.
philipkglass
In the case of Wikipedia dumps there is an easy work-around. The XML dump starts with a small header and "<siteinfo>" section. Then it's just millions of "page" documents for the Wiki pages.
You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can stream those single-page documents to an XML parser without worrying about document size. This of course doesn't apply in the general case where you are processing arbitrary huge XML documents.
I have processed Wikipedia many times with less than 8 GB of RAM.
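A minimal sketch of that state machine in Python might look like this. The generator name `iter_pages` and the tiny sample are hypothetical, and it assumes at most one page boundary per line, which is an assumption about the dump layout rather than something the format guarantees.

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(lines):
    """Yield each <page>...</page> block as a string, one at a time."""
    buffer, inside = [], False
    for line in lines:
        if not inside and "<page>" in line:
            inside = True
            line = line[line.index("<page>"):]
        if inside:
            end = line.find("</page>")
            if end != -1:
                buffer.append(line[:end + len("</page>")])
                yield "".join(buffer)
                buffer, inside = [], False
            else:
                buffer.append(line)

# Made-up stand-in for a dump; a real one would be an open file object.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <siteinfo><sitename>Example</sitename></siteinfo>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B</title>\n  </page>\n"
    "</mediawiki>\n"
)
titles = [ET.fromstring(page).findtext("title") for page in iter_pages(sample)]
print(titles)
```

Only one page is held in memory at a time, so the memory use depends on the largest page rather than on the size of the whole dump.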
nicoburns
Shouldn't parsed XML be smaller than the raw uncompressed text? (as you could deduplicate strings). I'd expect that to be a significant saving for something like wikipedia in XML
In the English Wikipedia the wikitext accounts for about 80% of the bytes of the decompressed XML dump.
hyhjtgh
Xml and textual formats in general are ill suited to such large documents. Step 1 should really be to convert and/or split the file into smaller parts.
intrasight
Or shred into a proper database designed for that
heelix
I used to work for a NoSQL company that was more or less an XQUERY engine. One of the things we would complain about is we did use wikipedia as a test data set, so the standing joke was for those of us dealing with really big volumes we'd complain about 'only testing Wikipedia' sized things. Good times.
dleeftink
StackExchange also, not necessarily streamable but records are newline delimited which makes it easier to sample (at least the last time I worked with the Data Dump).
01HNNWZ0MV43FF
Is that all in one big document?
magicalhippo
We regularly parse ~1GB XML documents at work, and got laughed at by someone I know who worked with bulk invoices when I called it a large XML file.
Not sure how common 100GB files are but I can certainly imagine that being the norm in certain niches.
vessenes
This, thirty years later, is the best pitch for XML I’ve read. Essentially, it’s a slow moving, standards-based approach to data interoperability.
I hated it the minute I learned about it, because it missed something I knew I cared about, but didn’t have a word for in the 90s - developer ergonomics. XML sucks shit for someone who wants to think tersely and code by hand. Seriously, I hate it with a fiery passion.
Happily to my mind the economics of easier-for-creators -> make web browsers and rendering engines either just DEAL with weird HTML, or else force people to use terse data specs like JSON won out. And we have a better and more interesting internet because of it.
However, I’m old enough now to appreciate there is a place for very long-standing standards in the data and data transformation space, and if the XML folks want to pick up that banner, I’m for it. I guess another way to say it is that XML has always seemed to be a data standard which is intended to be what computers prefer, not people. I’m old enough to welcome both, finally.
tannhaeuser
> XML has always seemed to be a data standard which is intended to be what computers prefer, not people.
On one hand, you aren't wrong: XML has in fact been used for machine-to-machine communication mostly. OTOH, XML was just introduced as a subset of SGML doing away with the need of vocabulary-specific markup declarations for mere parsing in favor of always requiring explicit start- and end-element tags. Whereas HTML is chock full of SGMLisms such as tag inference (for example inferring paragraph ends on block elements), empty ("self-closing") elements and enumerated ("boolean") attributes driven by per-element declarations.
One can argue to death whether the web should work as a mere document delivery network with rigid markup a la XML, or that browsers should also directly support SGML authoring idioms such as the above shortform mechanisms. SGML also has text macros/shared fragments (entities) and even allows defining own parsing tokens for markdown, math, CSV, or custom syntaxes. HTML leans towards SGML in that its documentation portrays HTML as an authoring language, but browsers are lacking even in basic SGML features such as entities.
IgorPartola
That’s a flame war that’s been raging for decades for sure.
I do wonder what web application markup would look like today if designed from scratch. It is kind of amazing that HTML and CSS can be used for creating beautiful documents viewable on pretty much any device with a screen AND also for creating dynamic applications with pixel-perfect rendering, special effects, integrations with the device’s hardware, and even external peripherals.
If there was ever scope creep in a project this would be it. And given the recent discussion on here of curses based interfaces it reminded me just how primitive other GUI application layout tools can be while still achieving amazing results. Even something like GTK does not need the intense level of layout engine support and yet is somehow considered richer in some ways and probably more performant for a lot of stuff that’s done with it.
So I am curious what web application development would look like today if it wasn’t for HTML being “good enough”.
caspper69
Had we had better process isolation in the mid-90s, I assume web application development would mostly be Java apps, with a mini-vm for each one (sort of a qubes like environment).
We just couldn't keep apps' hands out of the cookie jar back then.
immibis
Java tried to, and mostly successfully did, run trusted and untrusted code in the same VM. Your applet code ran in the same VM as all the code for managing applets. However, holes were frequent enough that they abandoned the whole idea. (Instead of sandboxing the VM as a whole? Why?)
lmz
The whole applet thing was already slow enough then without scrutinizing every syscall it makes.
Devasta
If a browser was designed from scratch today it wouldn't have a markup language, documents would be PDF and everything else would be Javascript to canvas.
Suggesting something like HTML would have you laughed out of the room.
fc417fc802
If it were designed from scratch by BigTech anyway. Rather than JS I'd guess it would be WASM with APIs for canvas and accessibility. JS would go via WASM just like any other language you might prefer. If you asked about HTML you'd get pointed at the relevant render-to-canvas library in the language of your choice.
This is only the case because the BigTech view is one of an application platform.
weinzierl
"This, thirty years later, is the best pitch for XML I’ve read."
I wish someone would write "XML - The Good Parts".
Others might argue that this is JSON but I'd disagree:
- No comments is a non-starter
- No proper integers
- No date format
- Schema validation is a primitive toy compared to what we had for XML
- Lack of allowed trailing commas
YAML ain't better. I hated whitespace handling in XML, it's a miracle how YAML could make it even worse.
XML is from an era long past and I certainly don't want to go back there, but it had its good parts and I feel we have not really learned a lot from its mistakes.
In the end maybe it is just that developer ergonomics is largely a matter of taste and no language will ever please everyone.
jlarocco
It's funny to hear people in the comments here talk about XML in the past tense.
I know it's passé in the web dev world, but in my work we still work with XML all the time. We even have work in our queue to add support for new data sources built on XML (specifically QIF https://qifstandards.org/).
It's fine with me... I've come to like XML. It's nice to have a standard, easy way to do schemas, validators, processors, queries, etc. It can be overdone and it's not for every use case, but it's pretty good at what it does.
intrasight
I've come to think that XML will be with us for decades and probably follow us when we leave the small blue planet.
In my military work, I've heard the senior project managers refer to a modern battleship as a floating XML document.
bigstrat2003
> I know it's passé in the web dev world...
That is because the web dev world is unfortunately obsessed with the current thing. They chase trends like their lives depend on it.
tzcnt
Developer ergonomics is drastically underappreciated, even in modern times. Since we're talking about textual data formats, I'll go out on a limb here and say that I hate YAML. Double checking exactly how many spaces are present on each line is tedious. It manages to make a simple task like copy-pasting something from a different file (at a different indentation level) into an error-prone process. I'll take angle brackets any day.
chuckadams
You haven’t felt hate until you’ve counted spaces in your Helm templates in order to know what value to put after `nindent`. The punchline is that k8s doesn’t even speak yaml, the protocol is all json and it’s the tooling that inflicts yaml on us.
I can live with yaml as a config format, but once logic starts creeping in, give me anything else.
01HNNWZ0MV43FF
JSON5 is a real sweet spot for me. Closing brackets, but I don't have to type every tag twice. Comments and trailing commas.
consteval
I find for deeply hierarchical data that XML is much easier to read.
emporas
Emacs can pretty-print JSON, which makes it very easy to read. I don't see how XML displayed for human consumption could be better than that.
froh
interesting. I find the signal/noise ratio of XML really bad.
what I really dread in XML though is that XML only has idref/id standardized, and no path references. so without tool support you can't navigate to a reference target.
which turns XML into the "binary" format for GUI tools.
> Developer ergonomics is drastically underappreciated, even in modern times.
When was the last time you had an editor that wouldn't just auto close the current tag with "</" ? I mean it's a godsend for knowing where you are in a large structure. You aren't scrolling to the top to find which tag you are in.
formerly_proven
Working with large YAML documents is incredibly annoying and shows the benefit of closing tags.
4ndrewl
It all went downhill after we stopped using .ini files
Locutus_
Well... TOML isn't that much more than .ini files slightly brought up in feature support.
Again not great for bigger documents.
iamthepieman
>XML has always seemed to be a data standard which is intended to be what computers prefer, not people
Interesting take, but I'm always a little hesitant to accept any anthropomorphizing of computer systems.
Isn't it always about what we can reason and extrapolate about what the computer is doing? Obviously computers have no preference so it seems like you're really saying
"XML is a poor abstraction for what it's trying to accomplish" or something like that.
Before jQuery, Chrome, and web 2.0, I was building XSLT-driven web pages that transformed XML in an early NoSQL doc store into HTML and it worked quite beautifully and allowed us to skip a lot of schema work that we definitely weren't ready or knowledgeable enough to do.
EDIT: It was the perfect abstraction and tool for that job. However the application was very niche and I've never found a person or team who did anything similar (and never had the opportunity to do anything similar myself again)
vitaflo
I did this for many years at a couple different companies. As you said it worked very well especially at the time (early 2000’s). It was a great way to separate application logic from presentation logic especially for anything web based. Seems like a trivial idea now but at the time I loved it.
In fact the RSS reader I built still uses XSLT to transform the output to HTML as it’s just the easiest way to do so (and can now be done directly in the browser).
madkangas
Re xslt based web applications - a team at my employer did the same circa 2004. It worked beautifully except for one issue: inefficiency. The qps that the app could serve was laughable because each page request went through the xslt engine more than once. No amount of tuning could fix this design flaw, and the project was killed.
Names withheld to protect the guilty. :)
intrasight
Most every request goes through xslt in our team's corporate app. The other app teams are jealous of our performance.
wiremine
> developer ergonomics
That was a huge reason JSON took over.
Another reason was the overall XML ecosystem grew unwieldy and difficult to navigate: XPath, XSLT, SOAP, WSDL, XPointer, XLink, XForms... They all made sense in their own way, but it was difficult to master them all. That complexity, plus the poor ergonomics, is what paved the way for JSON to become preferred.
tonyedgecombe
I quite liked it when it first came out, I'd been dealing with a ton of bespoke formats up until then. Pretty much every one was ambiguous and painful to deal with. It was a step forward being able to push people towards a standard for document transfer.
I suspect it was SOAP and WSDL that killed it for a lot of people though. That was a typical example of a technical solution looking for a problem and complete overkill for most people.
The whole namespace thing was probably a step too far as well.
velcrovan
You should try using a LISP like Racket for XML. Because XML can be expressed directly as S-expressions, XML and LISP go together like peanut butter and jelly.
In my experience, at least with Clojure, it's much more convenient to serialize XML into a map-like structure. With your example, the data structure would look like so.
Some people use namespaced keywords (e.g. :xml/tag) to help disambiguate keys in the map. This kind of data structure tends to be more convenient than dealing with plain sexps or so-called "Hiccup syntax". i.e.
The above syntax is convenient to write, but it's tedious to manipulate. For instance, one needs to dispatch on types to determine whether an element at some index is an attribute map or a child. By using the former data structure, one simply looks up the :attrs or :content key. Additionally, the map structure is easier to depth-first search; it's a one-liner with the tree-seq function.
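For readers outside Clojure, the map-based shape and the depth-first walk can be sketched in Python. The names `to_map` and `tree_seq` are hypothetical stand-ins, the sample is a simplified, namespace-free container document, and unlike Clojure's tree-seq this walk only yields element maps, not text leaves.

```python
import xml.etree.ElementTree as ET

def to_map(el):
    """Element -> {"tag", "attrs", "content"}, mirroring :tag/:attrs/:content."""
    content = [el.text] if el.text and el.text.strip() else []
    for child in el:
        content.append(to_map(child))
        if child.tail and child.tail.strip():
            content.append(child.tail)
    return {"tag": el.tag, "attrs": dict(el.attrib), "content": content}

def tree_seq(node):
    """Depth-first walk over element maps, loosely like Clojure's tree-seq."""
    yield node
    for child in node["content"]:
        if isinstance(child, dict):
            yield from tree_seq(child)

doc = to_map(ET.fromstring(
    "<container><rootfiles>"
    '<rootfile full-path="content.opf"/>'
    "</rootfiles></container>"
))
paths = [n["attrs"]["full-path"] for n in tree_seq(doc) if n["tag"] == "rootfile"]
print(paths)
```

Attributes and children are found by key lookup, so no code has to inspect the type of the element at a given index.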
I've written a rudimentary EPUB parser in Clojure and found it easier to work with zippers than any other data structure to e.g. look for <rootfile> elements with a <container> ancestor.
Zippers are available in most programming languages, thankfully, so this advantage is not really unique to Clojure (or another Lisp). However, I will agree that something like sexps (or Hiccup) is more convenient than e.g. JSX, since you are dealing with the native syntax of the language rather than introducing a compilation step and non-standard syntax.
velcrovan
I have not looked into the use of zippers for this purpose, but I will do so!
This looks like it loses the distinction between attributes and nested tags?
As in, I don't see a difference between `(attr "val")` which expresses an attribute key/value pair and `(thing "world")` which expresses a tag/content relationship. Even if I thought the rule might be "if the first element of the list is a list itself then it should be interpreted as a set of attribute key value pairs" then I would still be ambiguous with:
(foo (bar "baz") "content")
which could serialize to either:
<foo bar="baz">content</foo>
or:
<foo><bar>baz</bar>content</foo>
In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.
shawn_w
There's no ambiguity. The first element is a symbol that's the name of a tag. If the second element is a list of two element symbol + string lists, it's the attributes. If it's one of the other recognized types, it's part of the contents of the tag.
Most Scheme tools for working with XML use a different layout where a list starting with the symbol @ indicates attributes. See https://en.wikipedia.org/wiki/SXML for it.
zoogeny
I see, so my example should be:
(foo (bar "baz") "content")
vs
(foo ((bar "baz")) "content")
Where the first one would be the nested tags and the second one would be a single `bar="baz"` attribute.
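To check my own understanding, here is a small Python sketch of that rule using nested lists in place of s-expressions. Python has no symbols, so tag and attribute names are plain strings, and `is_attr_list` and `to_xml` are names I made up, not part of any SXML tool:

```python
from xml.sax.saxutils import escape, quoteattr


def is_attr_list(item):
    """True if item is a list made only of [name, value] string pairs."""
    return isinstance(item, list) and all(
        isinstance(p, list) and len(p) == 2 and all(isinstance(s, str) for s in p)
        for p in item
    )


def to_xml(node):
    """Serialize a nested-list element: [tag, optional-attr-list, *children]."""
    if isinstance(node, str):
        return escape(node)
    tag, rest = node[0], node[1:]
    attrs = []
    # Only the second position can hold the attribute list.
    if rest and is_attr_list(rest[0]):
        attrs, rest = rest[0], rest[1:]
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs)
    body = "".join(to_xml(child) for child in rest)
    return f"<{tag}{attr_text}>{body}</{tag}>"


nested = ["foo", ["bar", "baz"], "content"]       # (foo (bar "baz") "content")
with_attr = ["foo", [["bar", "baz"]], "content"]  # (foo ((bar "baz")) "content")
print(to_xml(nested))
print(to_xml(with_attr))
```

The two inputs differ only by one level of nesting, which is exactly the part that feels too implicit to me.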
I would prefer the differentiation to be more explicit than the position and/or structure of the list, so the @ symbol modifier for the attribute list in other tools makes sense.
The sibling comment with a map with a :attrs key feels even better. I don't work in languages with pattern matching or that kind of thing very often, but if I was wanting to know if a particular element had 1 or more attributes then being able to check a dictionary key just feels like a nicer kind of anchor point to match against.
immibis
> In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.
Just remember that it's a markup language, and then it's not head-scratching at all: the text is the text being marked up, and the attribute values are the attribute of the markup - things like colour and font.
When it was co-opted to store structured data, those people didn't obey this rule (which would make everything attributes).
Namespaces had a very cool use in XHTML: you could just embed an SVG or MathML directly in your HTML and the browser would render it. This feature was copied into HTML5.
zoogeny
When you say "those people", you mean people like me who (used to) have to navigate how to model structured data using XML. I think the attribute vs. child distinction makes sense in a very flat hierarchy where you are marking up text but quickly devolves into ambiguity for many other use cases.
I mean, if I'm modeling a <Person> node in some structured format, making a decision about "what is the attribute of the person node" vs "what is a property of the specific Person" isn't an easy call to make in all cases. And then there are cases where an attribute itself ought to have some kind of hierarchy. Even the text example works here: I have a set of font properties and it would make sense to maybe have:
Rather than a series of `fontFamily`, `fontSize`, etc. attributes. This is true when those attributes are complex objects that ended up having nesting at several levels. You end up in the circumstance where you are forced to make things that ought to be attributes into children because you want to model the nested structure of the attributes themselves. Then you end up with some kind of wrapper structure where you might have a section for meta-data and a section for the real content.
I just don't think the distinction works well for an extensible markup language where the nesting of elements is more or less the entire point.
It is much easier to write out though, which is why you often see `<Element content=" ... " />` patterns all over the place.
immibis
When using XML for structured data the intended way, everything that is a string value (as opposed to a node hierarchy) would be an attribute. There's no text, so there would be no text.
froh
a lisp... like dsssl ? ;-)
bambax
I used to do a lot of XSLT coding, by hand, in text editors that weren't proper IDEs, and frankly it wasn't very hard to do.
There's something very zen-like with this language; you put a document in a kind of sieve and out comes a "better" document. It cannot fail; it can be wrong, full of errors, of course (although if you're validating the result against a schema it cannot be very wrong); but it will almost never explode in your face.
And then XSLT work kind of disappeared; I miss it a lot.
bigstrat2003
I'm gonna be honest, I find terseness to be highly overrated by programmers. I value it in moderation, but for a lot of people they say things like "this language is verbose" like that is a problem unto itself. If verbosity is gaining you something (generally clarity), then I think that's a reasonable cost to pay. Terseness is not, in my opinion, a goal unto itself (though many programmers certainly treat it as such). It's something you should seek only to the extent that it makes a language easier to use.
wongarsu
And not only does the XML format have bad developer ergonomics, most XML parsers are equally terrible to use. There are many things I like about XML: name spaces, schemas, XPath, to some degree even XSLT. But the typical XML developer experience is terrible on every layer
kibwen
> XML sucks shit for someone who wants to think tersely and code by hand. Seriously, I hate it with a fiery passion.
At the risk of glibly missing the main point of your comment, take a look at KDL. Unlike JSON/TOML/YAML, it features XML-style node semantics. Unlike XML, it's intended to be human-readable and writeable by hand. It has specifications for both a query language and a schema language as well as implementations in a bunch of languages. https://kdl.dev/
baq
XML is a big improvement over YAML.
There, I said it.
bigstrat2003
YAML is great. For simple configuration files. For anything more complex it gets gnarly quick, but honestly? If I need a config file for a script I'm writing I will reach for YAML every time. It really is amazing for that use case.
alabastervlog
I find yaml tolerable for cases where ini would have been just as good. Anything else, and… no, it’s bad.
pezezin
CSV encoded in EBCDIC is an improvement over YAML. God what an awful format...
IshKebab
The main thing I hate about XML (apart from the tedious syntax and terrible APIs - who thought SAX was a sane idea?) is that the data model is wrong for 99% of use cases.
XML gives you an object soup where text objects can be anywhere and data can be randomly stored in tags or attributes.
It just doesn't at all match the object model used by basically all programming languages.
I think that's a big reason JSON is so successful. It's literally the object model used by JavaScript. There's no weird impedance mismatch between the data represented on disk and in your program.
Then someone had to go and screw things up with YAML...
JSON5 is the way.
nickm12
This is fantastic to see! I've used XML off and on since it was the red hot tech of the early 2000s. I wouldn't choose it today for a green field project, but it's still around in so many places, so we definitely need a high-performance, high-quality library written in Rust for this.
This could become a great foundation for a typed, (mostly) etree-compatible Python library built on top of this. I've used lxml for years and it's still my go-to, but there are lots of places where it could be modernized.
I can’t say this with certainty, but I have some reason to suspect I might be partially to blame for this fun fact!
A couple years ago, I stumbled on a discussion considering deprecation/removal of XSLT support in Chrome. At some point in the discussion, they mentioned observing a notable uptick in usage—enough of an uptick (from a baseline of approximately zero) that they backed out.
The timing was closely correlated with work I’d done to adapt a library, which originally used XSLT via native Node extensions, to browser XSLT APIs. The project isn’t especially “popular” in the colloquial sense of the term, but it does have a substantial niche user base. I’m not sure how much uptake the browser adaptation of this library has had since, but some quick napkin math suggested it was at least plausible that the uptick in usage they saw might have been the onslaught of automated testing I used to validate the change while I was working on it.
vanderZwan
And this, kids, is one more reason for us to use testing while developing
Telemakhos
This is true only of XSLT 1.0. The current standard is 3.0.
falcor84
Oh, a shame. Is there any way to track browser version adoption on caniuse, or any other site?
Also, is it up to browser implementations, or does WHATWG expect browsers to stay at version XSLT 1?
tannhaeuser
There's nothing to track here really. For better or worse, browsers are stuck with 1999's XSLT 1.0, and it's a miracle it's still part of native browser stacks given PDF rendering has been implemented using JS for well over a decade now.
XSLT 2 and 3 is a W3C standard written by the sole commercial provider of an XSLT 2 or 3 processor, which is problematic not only because it reduces W3C to a moniker for pushing sales, but also because it undermines W3C's own policy of at least two interworking implementations for a spec to get "recommendation" status.
XSLT is of course a competent language for manipulating XML. It would be a good fit if your processing requires lots of XML literals/fragments to be copied into your target document since XSLT is an XML language itself. Though OTOH it uses XPath embedded in strings excessively, thereby distrusting XML syntax for a core part of its language itself, and coding XPath in XML attributes can be awkward due to restrictive contextual encoding rules for special characters and such.
XSLT can be a maintenance burden if used casually/rarely, since revisiting XSLT requires substantial relearning and time investment due to its somewhat idiosyncratic nature. IDE support for discovery, refactoring, and test automation etc. is lacking.
velcrovan
My favorite and only use of XSLT that still works pretty well is to allow people to browse my RSS feed as if it were a web page.
> XSLT 2 and 3 is a W3C standard written by the sole commercial provider of an XSLT 2 or 3 processor,
I was following the W3C XSLT mailing list for quite some time back when they were doing 3.x, and this does not strike me as accurate.
Telemakhos
I think it's up to browser implementations, but JSON and JavaScript stole much of XML's thunder in the browser anyway, plus HTML5's relaxed tags won out over XHTML 4's strictness (strictness was a benefit if you were actually working with the data). There are still plenty of web-available uses of XML, like RSS and RDF and podcasts/OPML, but people are more likely to call xmlhttp.responseXML and parse a new DOM than wrap their head around XSL templates.
The big place I've successfully used XSLT was in TEI, which nobody outside digital humanities uses. Even then, the XSLT processing is usually minimal, and Javascript is going to do a lot of work that XSL could have done.
ajxs
Being interested in archaic technologies, I built a website using XML/XSLT not that long ago. The site was an archive of a band I was in, which made it fundamentally data oriented: We recorded multiple albums, with different tracks, and a different lineup of musicians each time. There are lots of different databases I could have built a static site generator around, but what if the browser could render the page straight from the data? That's what's cool about XML/XSLT. On paper, I think it's actually a pretty nice idea: The browser starts by loading the actual data, and then renders it into HTML according to a specific stylesheet. Obviously the history of browser tech forked in a different direction, but the idea remains good. What if there was native browser support for styling JSON into HTML?
athanagor2
The fact it could be compiled in WASM is a good thing, given the Chrome team was considering removing libxml and XSLT support a few years back. The reasons cited were mostly about security (and share of users).
It's another proof that working on fundamental tools is a good thing.
Very cool! I recently wrote an XSLT 2 transpiler for js (https://github.com/egh/xjslt) - it's nice to see some options out there! Writing the xpath engine is probably the hard part (I relied on fontoxpath). I'm going to be looking into what you have done for inspiration!
airstrike
What problems are {elegantly, neatly, best} solved by using XPath and XSLT today that would make them reasonable choices over alternatives?
jerf
XPath is a very nice language for querying over XML. Most places pitch it as a "declarative" syntax, but as I am quite skeptical of "declarative" as a concept, you can also look at the vast majority of the XPath standard as a way to imperatively drive a multicursor over an XML document, diving in and out of nodes and extracting bits of text and such, without having to write the equivalent code in your language to do so, which will inevitably be quite a bit more verbose. When you need it, it's really useful.
In my very opinionated opinion, XPath is about 99% of the value of XSLT, and XSLT itself is a misfire. Embedding an XML language in XML, rather than being an amazing value proposition, is actually a huge and really annoying mistake, in much the same way and for much the same reason as anyone who has spent much time around shell scripting has found trying to embed shell strings in shell strings (and, if the situation is particularly dire, another third or fourth level of such nesting) is quite unpleasant. Imagine trying to deal with bash, except you have to first quote all the command lines as bash strings like you're using bash -c, all the time. I think "XPath + your favorite language" has all the power of XSLT and, generally, better ergonomics and comprehensibility. Once you've got the selection of nodes in hand, a general-purpose programming language is a better way to deal with their contents than what XSLT provides. Hence why it has always languished.
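A minimal sketch of what I mean by "XPath + your favorite language", using Python's standard-library ElementTree. Note that ElementTree only supports a limited subset of XPath, and the sample data and variable names here are invented:

```python
import xml.etree.ElementTree as ET

# Invented sample data.
XML = """
<albums>
  <album title="First" year="2001">
    <track n="1">Intro</track>
    <track n="2">Song A</track>
  </album>
  <album title="Second" year="2004">
    <track n="1">Song B</track>
  </album>
</albums>
"""

root = ET.fromstring(XML)

# The path expression does the navigation and selection ...
tracks = root.findall(".//album[@year='2004']/track")

# ... and ordinary Python deals with the contents of the selected nodes.
titles = [t.text.upper() for t in tracks]
counts = {a.get("title"): len(a.findall("track")) for a in root.findall("album")}
print(titles)
print(counts)
```

The path string is the only embedded language; everything after the selection is regular host-language code.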
int_19h
XQuery is the best of both worlds - you get almost all the benefits of XSLT like e.g. the ability to define your own functions, but with non-XML-based syntax that is a superset of XPath.
Basically the only thing it's missing in XQuery vs XSLT is template rules and their application; but IMO simple ones are just as easy to write explicitly, and complex rulesets are hard to reason about and maintain anyway.
jrpelkonen
It’s been a while since I’ve had to deal with XML, but I remember finding it fairly convenient to restructure XML documents with XSLT. Modifying the data in those documents, much less so. I think there’s a sweet spot.
akshayshah
To someone who hasn’t worked much with XML, this seems like a reasonable take!
For cases where a host system wants to execute user-defined data transformations safely, XSLT seems like it might be useful. When they mature, maybe WASM and WASI will fill the same niche with better developer ergonomics?
therealmarv
Interesting take about XSLT. But I agree... XSLT could be something much simpler (and not XML in itself) and combined with XPATH. It feels like a lot of boilerplate code to write XSLT.
password4321
XPATH+XSLT is SQL for XML, declarative selection and transformation.
Using an XML library to iterate through an entire XML document without XPATH is like looping through entire database tables without a JOIN filter or a WHERE clause.
XSLT is the SELECT, transforming XML output with a new level of crazy for recursion.
mickeyp
XPath is a superb query language for XML (or anything that you can structure as a DOM) --- it is also, with some obscure exceptions, the only query language with serious adoption, so it's an easy choice and readily available in XML tools. The only caveat is there are various spec versions and most never added support for newer versions.
Let's look at JSON by comparison. Hmm, let's see: JSONPath, JMESPath, jq, jsonql, ...
never_inline
JQ is the most feature-rich of the bunch. It's the de facto standard and I usually just default to it because it offers so much: assignment, various builtins such as base64 encoding.
The disadvantage is that it's not easily embeddable in your own programs - so programs use JSONPath / Go templates often.
bbkane
I also don't think there's a specification written for the jq query language, unlike https://jmespath.org/ , which as you mentioned also has more client libraries.
The jaq author is working on a formal specification of jq: https://github.com/01mf02/jq-lang-spec. I think it has also helped that there are now several implementations of jq, like gojq, jaq and jqjq, which has helped in finding edge cases and weird behaviours.
gojq is also my favorite for two "day to day" reasons:
- the error messages are night and day better than C-jq
- the magick of $(gojq --yaml-input), although I deeply abhor that it is 10 characters longer than "-y"
It's worth mentioning https://github.com/01mf02/jaq(MIT) because it actually strives to be an implementation of the specification versus just "execute better" as gojq does
BoingBoomTschak
And it's yet another terrible DSL that you must learn when it could have been a language everybody already knows, like Python. The query part isn't even that well done, compared to XPath/JSONPath.
> And it's yet another terrible DSL that you must learn when it could have been a language everybody already knows, like Python.
Oh, yeah, I 100% want to type this 15 times a day
# I'll grant you the imports, in the spirit of fairness
aws ec2 describe-instances | python -c '
for r in json.load(sys.stdin)["Reservations"]:
    print("\n".join(i["PrivateIpAddress"] for i in r["Instances"]))
'
I mean, seriously, who can read that terrible DSL with all of its line noise
> The query part isn't even that well done, compared to XPath/JSONPath.
XPath I'll grant you, because it's actually very strong but putting JSONPath near jq in a "could be improved" debate tells me you're just not serious. JSONPath is a straight up farce
trallnag
Recently discovered Jsonata thanks to AWS adding it to Step Functions. Feel free to add it to your enumeration
therealmarv
E.g. massive, complex XML documents that you need to transform into other structured XML. Or if you need to parse complex XML. Some people hate XSLT and XPATH with a passion and would rather write much more complex lxml code. It has a steep learning curve, but once you understand the fundamentals you can transform XML more easily, and especially more predictably and reliably, than ever.
Another example: if you have very large XML that doesn't even fit into memory, you can still stream process it with XSLT.
It makes you the master of XML transformations and fetching information out of complex XML ;)
never_inline
I have used it when scraping some data from web pages using the scrapy framework. It's a reliable way to extract something from web pages compared to regex.
mdaniel
don't overlook the ability to mix and match them, because each "axis" is good at its own things
I manage a team who build and maintain trading data reports for brokers. We have everything generated in a fairly standard format and customize it to each broker's exact needs with XSLT. Hundreds of reports; couldn't manage without it.
jeffbee
What alternatives exist for extracting structured data from the web? I have several ETL pipelines that use htmltidy to turn tag soup into something approximately valid and xmlstarlet to transform it into tabular data.
riedel
Love to see stuff outside the Java space since I really like doing stuff in XSLT. Question: Does this work on a textual XML representation or can you plug in different XML readers? I have had really great fun in the past using http://www.ananas.org/xi/ transforming arbitrarily formatted files using XSLT. Also, it is really important today that the XML reader has error correction capabilities, since lots of tools don't write well-formed XML, which in my experience is often a showstopper for employing transforms.
jchw
I wonder if this could perhaps some day be used in Wine, for the MSXML implementations. Maybe not, since those implementations need to be bug-compatible where applications depend on said bugs; but the current implementation(s) are also not fantastic. I believe it is still using libxml2.
(Aside: A long time ago, I had written an alternate XPath 1.1 implementation for Wine during GSoC, but rather shamefully, I never actually got it merged. Life became very hectic for me during that time period and I never really looped back to it. Still feel pretty bad about it all these years later.)
samsk
Nice !
I have a scraper using XPath/XSLT extensively, and 90% of the XPath selectors have worked for years without a change.
With CSS selectors I've had more problems...
ebruchez
CSS selectors have spent the last few decades reinventing XPath. XPath introduced right from the beginning the notion of axes, which allow you to navigate down, up, preceding, following, etc. as makes sense. XPath also always had predicates, even in version 1.0. CSS just recently started supporting :has() and :is(), in particular. Eventually, CSS selectors will match XPath's query abilities, although with worse syntax.
samsk
The problem with CSS selectors (at least in scrapers) is also that they change relatively often compared to the (HTML) document structure; that's why XPath lasts longer.
But you are right, compared to 20-year-old XPath, CSS selectors are really worse.
masklinn
On the other hand:
- XPath literally didn't exist when CSS selectors were introduced
- XPath's flexibility makes it a lot more challenging to implement efficiently, even more so when there are thousands of rules which need to be dynamically reevaluated at each document update
- XPath is lacking conveniences dedicated to HTML semantics, and handrolling them in xpath 1.0 was absolutely heinous (go try and implement a class predicate in xpath 1.0 without extensions)
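For anyone who hasn't had the pleasure, the usual XPath 1.0 workaround for a class test looks like the string built below. This is a sketch in Python: `class_predicate` and `has_class` are made-up helper names, and `has_class` is only a plain-Python approximation of what the expression computes, not an XPath engine:

```python
def class_predicate(name):
    """Build the usual XPath 1.0 test for "@class contains the token `name`"."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def has_class(class_attr, name):
    """Plain-Python mirror of what that XPath expression computes."""
    normalized = " ".join(class_attr.split())  # roughly normalize-space()
    return f" {name} " in f" {normalized} "


xpath = f"//div[{class_predicate('note')}]"
print(xpath)
print(has_class("warning  note", "note"), has_class("footnote", "note"))
```

The CSS equivalent of that whole expression is `div.note`.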
mdaniel
> - XPath literally didn't exist when CSS selectors were introduced
> W3C Recommendation 17 Dec 1996, revised 11 Jan 1999
There are various drafts and statuses, so it's always open to hair-splitting but based only on the publication date CSS does appear to win
bambax
> CSS selectors have spent the last few decades reinventing XPath
YES! This is so true! And ridiculous! It's a mystery why we didn't simply reuse XPath for selectors... it's all in there!!
masklinn
> It's a mystery why we didn't simply reuse XPath for selectors... it's all in there!!
It's not really a mystery:
> CSS was first proposed by Håkon Wium Lie on 10 October 1994. [...] discussions on public mailing lists and inside World Wide Web Consortium resulted in the first W3C CSS Recommendation (CSS1) being released in 1996
> XPath 1.0 was published in 1999
CSS2 was released before XPath 1.0.
ebruchez
Fair enough. By the way, the original CSS from 1996 featured only:
- the "descendant" combinator (whitespace)
- the "class" selector (".foo")
The 1998 CSS2 introduced "child", "following sibling", and attribute selectors. This state of things then remained unchanged forever (I see that Selectors Level 3 became a recommendation only in 2018?).
On the other hand, in 1999, XPath already specified all those basic ways to navigate the DOM, and CSS still doesn't have them all as of 2025.
mattrighetti
I will definitely try this out!
I have a service that extracts <meta> tags in webpages and to do that I'm currently using (and need) three different dependencies: html5ever, markup5ever_rcdom, markup5ever. I don't like those to be honest, the documentation is quite bad and it was difficult to understand how I should have used the libraries to achieve such a simple task.
XPath on the other hand makes this extremely easy in comparison, I wonder how this will perform compared to my current solution.
faassen
Thanks!
Unfortunately at this point there's no HTML parser frontend for Xee (and its underlying library Xot) yet (HTML 5 parser serialization is supported at least in code). It shouldn't be too hard to add at least HTML 5 support using something like html5ever.
mdaniel
I always hate it when license files have "yes, but" language in them because if the license file differs in some non-obvious way, now I have to pay lawyers to interpret it
Doesn’t look like “yes, but” language to me. Looks like the code is plain old MIT and the author is doing their due diligence with respect to vendored content in the repository subject to different licensing. Seems like they are being paid by a company to work on this, so it’s not surprising that they actually pay attention to copyright.
The fact that many project maintainers forget about vendored content and haphazardly slap the MIT license (or whatever) verbatim into a LICENSE file doesn’t actually give you a get-out-of-paying-lawyers-free card! If anything, Xee’s COPYRIGHT file gives me more confidence in my legal footing than an unadulterated LICENSE file would. It indicates the maintainer at least has a basic understanding of how copyright applies to their project.
chromatin
NCBI still emits XML from their most prominent databases (e.g., PubMed). I'm looking forward to adopting this library into some of my production code that interfaces with PubMed!
tracnar
Nice! I tried using XQuery (superset of XPath 3) for a while through the BaseX implementation. It's pretty nice, but you have to face XML problems like namespaces, document order, attributes vs nodes, you don't know if you can have 0, 1 or more nodes, etc. Something I wish was more readily available would be to run XPath against JSON, yaml, etc. It's a nicer language than say jq, but its ties to XML sometimes make it hard to transfer.
Another pain point with XML is the lack of inline schema, so the languages around like XPath have to work with arbitrary structures unlike say JSON where you at least have basic primitives like map/dict, numbers, bool, etc
trympet
I recently had the pleasure of using XSLT after never having seen it before. I used it to transform a huge 130K line XML manifest with MAPI property metadata into C# source code. It was so simple, readable, and intuitive to use.
squiggleblaz
I learnt XSLT in university back in the early/mid part of the first decade of this century. I didn't much enjoy it. I've never used it, but all my career I've had to deal with terrible ad hoc templating languages. I recently had total freedom to choose what terrible ad hoc templating language to use, and I chose XSLT. I actually totally liked it: and it seemed to have everything I've needed. In previous jobs, there was always tickets that amounted to "make a fork of the terrible ad hoc templating language and hack it until it does this", but I reckon XSLT could do everything and then some.
threecheese
This is great, I’ve been looking for performant and safe XML processing to replace IBM stuff (websphere/datapower) that we really only keep around for hw accelerated payload processing. At our scale, lxml and others + BYO gateway tech has a similar run cost even considering IBM licensing. I hate running their crap, which requires k8s at a version that’s some hair-thin slice above the minimum supported EKS version, it’s almost like they want us to live in 24/7 fear of being OOS.
xvilka
I miss XHTML and XSL times. Time, when Web would have been more prepared for the AI consumption, less dynamic nonsense, and more focus on the actual content. Time shows all these Flash and Java gimmicks died off.
o_pax
This is really good news, I am looking forward to trying it out! Is XQuery also planned as an additional frontend? By the way, there is also χrust, a rust project working towards pretty similar goals (XPath 3.1, XQuery 3.1 and XSLT 3.0). At first glance, the architecture also seems quite similar, it is not as far along, though. Have you had any contact with them?
nashashmi
Just want to say that Microsoft has some sort of implementation of an xml application using Microsoft word or Ms word. But I have struggled to find examples I can use, but for a long time I have been trying to convert an office repository of corporate resumes to xml.
1shooner
I miss the declarative purity of XSLT as an HTML templating layer. I'd love to know if there is a similar system for more popular/current web stack.
torginus
I yearn for the day when people will stop considering the main advertising bullet point feature that their software was written in Rust. Rust 1.0 was released a decade ago, plenty of time for its alleged technical advantages to become apparent.
It's like a handbag whose main claim to being a premium product isn't workmanship or materials, but that it has Gucci on its side.
chuckadams
> It's like a handbag whose main claim to being a premium product isn't workmanship or materials, but that it has Gucci on its side.
Knockoffs aside, the latter is intended to serve as a proxy for the former. I too will be happy when Rust is the boring everyday choice, but in 2025 we still see new buffer overflows every day. And if I'm picking a library, I still want to know if it's in the same language as the app it's going into.
forty
An xpath/xslt engine is something you might want to include in other software, the programming language used might be an important information for this purpose.
resonious
Personally I consider the programming language used for a piece of software to be similar to the materials used for a handbag.
This sounds fantastic! Thank you for your work. Now I gotta go learn Rust :-).
blacklion
Is XSLT still used in new projects? I have the impression that it was not popular even when XML was.
For example, Apache HTTPD never had an official module to serve XML via XSLT transformation.
And XSL:FO looks even more obscure.
int_19h
XSL:FO is dead for all practical purposes.
XSLT was not popular for its original intended application - which is to say, serving XML data from web servers and translating it to HTML (or XSL:FO, or ...) on the client as needed. However, it was used plenty for XML processing outside of that particular niche.
New projects these days rarely have to process complicated XML to begin with. But when you do, I'd say XSLT (or perhaps better yet, XQuery) is a very useful tool to have in your toolbox.
froh
syntext serna was such an engineering marvel. a wysiwyg XML editor that used xslt to fo to specify your rendering. was built in the context of docbook and dita but did work for any xsd with a xslt to fo. amazing technology. ahead of its time. and then came json :-(
mdaniel
> XSL:FO is dead for all practical purposes.
As opposed to what for cooking "PDF via XML" files? Because I can assure you that feeding rando.odt into $(libreoffice -pdf $TMPDIR/ohgawd) is 100% not the same as $(fop -fo $TMPDIR/my.fo -pdf $TMPDIR/out.pdf)
int_19h
As opposed to code in e.g. Java that uses the built-in XML APIs and some third party PDF output lib.
kondro
There are a lot of APIs out there that are still XML-based, especially from enterprise suppliers.
Equifax and Experian’s APIs immediately come to mind as documents that generate complex results that people often want to turn into some type of visual representation with XSLT.
blacklion
I see a lot of XML APIs and formats around me, it is true. But these are machine-to-machine formats or complex configuration file formats which don't need visualization. They need schema support and tooling, but not visualization or transformation. They are more serialization formats for complex object trees, and all processing is done on these object trees, not on the XML itself.
But of course, I see only a part of the picture.
mvc
Nice work. Xpath is a beast. Obvious why paligo would be interested too. Must be a lot of commercial documentation out there where the best representation they can get looks a bit XMLish.
yxhuvud
I hope this will be packaged into shared libraries at some point so that languages that aren't Rust will get access to it.
faitswulff
The author mentions Python bindings in the post.
shadowtree
Throwback shoutout to Steve Muench and his genius method of grouping elements in XSLT 1.0.
eXcellent, it's good to see new work on XSLT. Reviled by some, it's actually great tech and useful in all sorts of places.
notfed
Does it preserve whitespace? Something that I always found asinine about XSLT is that it wipes out whitespace when transforming. Imagine you have thousands of corporate XML files in source control, and you want to transform them all, performing some simple mutation. XSLT claims to be fit for this job, but in practice your diff is going to be full of unintentional whitespace mangling.
ebruchez
XSLT will perform the transformations that you instruct it to do. It does not wipe out whitespace just on its own. Do you mean that you'd like facilities to nicely reindent the output?
notfed
> It does not wipe out whitespace just on its own.
Sounds nice but doesn't match my lived experience with both Chrome's built-in XSLT processor and `xsltproc`. (I was using XSLT 1.0, for legacy reasons, so maybe this is an XSLT 1.0 issue?)
> Do you mean that you'd like facilities to nicely reindent the output?
No, I do mean preserve whitespace (i.e., formatting), such as between elements and between attributes.
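To make the distinction concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` rather than an XSLT processor (so it only illustrates the general data-model issue, not the behaviour of `xsltproc` or Chrome). Whitespace between elements is stored as text nodes and survives a round trip; spacing between attributes is not part of the tree and gets normalised on output. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical config file with irregular formatting.
source = (
    '<config>\n'
    '    <server  host="a"   port="80">x</server>\n'
    '\n'
    '    <client/>\n'
    '</config>'
)

root = ET.fromstring(source)

# A "simple mutation": change one attribute value.
root.find("server").set("port", "8080")

result = ET.tostring(root, encoding="unicode")
print(result)

# Whitespace between elements was kept (it lives in .text / .tail) ...
inter_element_preserved = "</server>\n\n    <client" in result
# ... but the extra spaces between attributes are gone.
attribute_spacing_preserved = '<server  host="a"   port' in result
```

So a diff after this kind of round trip stays quiet about indentation between elements but can still show churn inside tags.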
froh
usually whitespace (in) significance is specified in the XML schema. so if you provide a schema and instruct the xslt engine to comply with the schema, do your issues persist?
smitty1e
> XML is now niche technology, but it's a bigger niche than you might think, and it's not going to go away any time soon.
When you consider that .docx, .pptx, and .xlsx files are zipped XML archives, "niche" seems a misnomer.
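A quick sketch of what "zipped XML archive" means in practice, using only the Python standard library. To stay self-contained it builds a tiny stand-in archive in memory instead of opening a real Word file; the `word/document.xml` member and the WordprocessingML namespace are what a typical .docx uses, but the content here is invented and `docx_text` is a hypothetical helper.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_text(fileobj):
    """Concatenate the text of all <w:t> runs in word/document.xml."""
    with zipfile.ZipFile(fileobj) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))

# Stand-in for a real .docx: a zip with a single minimal document part.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        "<w:r><w:t>Hello, </w:t></w:r><w:r><w:t>XML</w:t></w:r>"
        "</w:p></w:body></w:document>",
    )
buf.seek(0)

text = docx_text(buf)
print(text)
```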
mdaniel
especially .xlsx which is some "hold my beer" for someone trying to encode a dataframe in .xml :-(
smitty1e
Openpyxl is a great library.
immibis
XSLT is great for nerd cred, when someone selects "view source" on your page and there's not an HTML tag in sight. I did this once.
Maybe it's good for compression, but probably not by a factor much bigger than gzip/brotli/zstd.
mickeyp
It's interesting to see the slow rehabilitation of XML and its tooling now that there's a new generation of developers who have not grown up in the shadow of XML's prime in the late 90s / early 2000s, and who have not heard (or did not buy into) the anti-XML crowd's ranting --- even though some of their criticisms were legitimate.
I've always liked XML, and especially XPath, and even though there were a large number of missteps in the heyday of XML, I feel it has always been unfairly maligned. Look at all the people who reinvent XML tooling but for JSON, but not nearly as well. Luckily, people who value XML can still use it, provided the fit is right. But it's nice to see the tides turning.
Most fashions really are cyclical.
linguae
It’s the “slope of enlightenment” phase of the Gartner hype cycle, where people are able to make sober assessments of technologies without undue influence from hype or its backlash. We’re long past the days where XML is used for everything, even when it’s inappropriate, and we’re also past the “trough of disillusionment” phase where people sought alternatives to XML.
I think XML is good for expressing document formats and for configuration settings. I prefer JSON for data serialization, though.
My complaints about XML remain pretty much unchanged since 10 years ago.
- Not including self-closing tags, there should only be one close tag: </>
- Elements are for data. Attributes are evil
- XPath indexing should be 0-based
- Documents without a schema should not make your tools panic or complain
- An xml document shouldn't have to waste its time telling you it's an xml document in xml
I maintain that one of the reasons JSON got so popular so quickly is because it does all of the above. The problem with JSON is that you lose the benefits of having a schema to validate against.
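For anyone who hasn't hit the indexing complaint: a minimal illustration with Python's `xml.etree.ElementTree`, which supports a small XPath subset. Positional predicates count from 1, while the host language's own sequence indexing counts from 0. The sample document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<list><item>a</item><item>b</item><item>c</item></list>")

# XPath positions are 1-based: item[1] is the first <item>.
first_by_xpath = root.find("item[1]").text
last_by_xpath = root.find("item[last()]").text

# Python indexing on the same element is 0-based.
first_by_index = root[0].text

print(first_by_xpath, first_by_index, last_by_xpath)
```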
bambax
> Elements are for data. Attributes are evil
This is like, your opinion, man... ;-) You can devise your schema any way you want. Attributes are great, and they exist in HTML in the form of datasets, which, as usual, are a poorly-specified and ill-designed rethinking of XML attributes
> Documents without a schema should not make your tools panic or complain
They don't. You absolutely don't need a schema. If you declare a schema, it should exist. If not, no problem?
da_chicken
No, the problem with attributes is that people consistently misuse them. So many things about XML break down when you make everything a self closing tag with 50 attributes. So many programmers just seem to say, "oh, it's shorter text so it must be inherently better" or "oh it's one-to-one so I should strictly avoid anything resembling a hierarchy."
Though I don't necessarily agree with the "data format" framing. This idea that markup languages are not data formats seems confused.
> They don't. You absolutely don't need a schema. If you declare a schema, it should exist. If not, no problem?
I agree that they should not.
However, I have used many tools that puke when presented with XML fragments or XML with no schema.
Mountain_Skies
Sometimes attribute use goes too far, such as when they contain comma-separated lists of items.
bambax
Sure, but that's not the fault of the format itself, is it? You can write extremely long enumerations in any natural language -- that's the author's fault.
ebruchez
There have been proposals a long time ago, including by Tim Bray, for an XML 2.0 that would remove some warts. But there was no appetite in the industry to move forward.
Mountain_Skies
Microsoft seems to be especially obsessed with making as much as possible into attributes. Makes me wonder if there is some hidden historical reason for that like an especially powerful evangelist inside the company that loved attributes during the early days of adopting XML.
int_19h
Attributes are way shorter to write.
That said, these days most Microsoft XML dialects are actually XAML-based, and in XAML attributes are basically syntactic sugar - you can write:
<Foo Bar="123">
or
<Foo>
<Foo.Bar>123</Foo.Bar>
</Foo>
(the dot in the syntax makes it possible for the XAML parser to distinguish nested elements that represent properties from nested elements that represent child objects)
immibis
So how do I specify the font of a word without attributes?
int_19h
Curiously, one of the driving forces behind renewed interest in XML is that language models seem to handle large XML documents better than JSON. I suspect this has something to do with it being more redundant - e.g. closing tags including the element name - making it easier for the model to keep track of structure.
ctrlp
XML/XPath are very useful but I've definitely lived through their abuses. Still, abusus non tollit usum, and I've had many positive experiences with XPath especially. XmlStarlet has been especially useful, also xmllint. I welcome more tooling like this. The major downside to XML is the verbosity and cognitive load. Tooling that manages that is a godsend.
j-pb
XML is still a huge mistake for most stuff. It's fine for _documents_ but not as a data storage solution. Bloat, ambiguities, virtually impossible to canonicalise.
XPath is cute, but if you don't mind bloat, text-only and lack of ergonomics, anyways then Conjunctive Regular Path Queries and RDF are miles ahead of XML as a data storage solution. (Not serialised as XML please xD)
Mountain_Skies
I made extensive use of XPath and XSL(T) back in their heyday and in general was fine with them but the architect astronauts who love showing off how clever they are with artificial complexity had a tendency to make use of XML tech to complicate things unnecessarily. Think that might be where many people's dislike of it came from, especially those whose first exposure wasn't learning through simple structures when XML was new but were thrown into the type of morass that develops when a tech is climbing the maturity curve.
Devasta
I manage a team of business analysts and accountants who use XSLT for generating reports for banks, XSLT is usually their first experience programming outside some linkedin learning courses. Not once has one of them ever complained about namespaces, or verbosity or anything like it, this is something I only see on HN or the programming subreddits.
The vast, vast majority of devs' only experience of XML is what they hear second hand; I'm sure a lot more would like it if they tried it.
kgwxd
XML, and other X[x] standards, are just horrible to read. On top of that, XML was made 10x worse by wrapping things in SOAP and the like over the wire, back in the day.
XSD, XPath, XSLT are all domains where I'd argue that reading/reasoning about are way more important.
When troubleshooting an issue, I don't mind scanning XML for a few data points so I can confirm what values are being communicated, but when I need to figure out how/why a specific value came to be, I don't want the logic spread throughout a giant text file wrapped in attribute value strings, and other non-debuggable "code". I'd rather it just be in a proper programming language.
faassen
The specifications are certainly not easy to read, and I wouldn't recommend them to learn about XML. But from the perspective of someone implementing them they are quite useful!
As someone who has used many programming languages and who went through the process of implementing this one I have many opinions about XPath and XSLT as programming languages. I myself am more interested in implementing them for others who value using them than using them myself. I do recognize there is a sizeable community of people who do use these tools and are passionate about them - and that's interesting to see and more power to them!
JTyQZSnP3cQGa8B
It's only a sample of one but I'm really unhappy with the issues and limitations that JSON and YAML have, and I welcome XML if it has good tools.
bluGill
That depends on what I'm doing. Most of what I'm doing is simple and so xml is just way too complex for the task. However when I need something complex xml can handle things that the others cannot - at the expense of being really complex to work with.
geodel
Maybe similar reason as people deploy 100 requests a week micro service on multiple kubernetes clusters across 3 AZs to make sure it is highly available.
[1] https://www.videlibri.de/xidel.html
[2] https://github.com/benibela/xidel
[1] https://basex.org/basex/xquery/
Nothing could compel me to like XSLT. I admire certain elements of its design, but in practice, it just seems needlessly verbose. But I really love XPath, though.
If your data is essentially a long piece of text, with annotations associated with certain parts of that text, this is where XML shines.
When you try to use XML to represent something like an ecommerce order, financial transaction, instant message and so on, this is where you start to see problems. Trying to shove some extremely convoluted representation of text ranges and their attributes into JSON is just as bad.
A good "rule of thumb" would be "does this document still make sense if all the tags are stripped, and only the text nodes remain?" If yes, choose XML, if not, choose JSON.
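The rule of thumb is easy to try out. A rough sketch with Python's standard library: strip the tags by joining all text nodes and see whether what remains still reads as a document. Both sample inputs are invented for illustration, and `strip_tags` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def strip_tags(xml_text):
    """Return only the text nodes of a document, in document order."""
    return "".join(ET.fromstring(xml_text).itertext())

# Marked-up prose: the text survives without its tags.
prose = strip_tags("<p>XML <em>really</em> shines for <b>markup</b>.</p>")

# Record-like data: without the tags the values run together.
record = strip_tags("<order><id>17</id><sku>AB-1</sku><qty>2</qty></order>")

print(repr(prose))
print(repr(record))
```

The first result is still a sentence; the second is an unreadable run of values, which by this heuristic points towards JSON.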
Well it can't: JSON has no processing instructions, no references, no comments, JSON "numbers" are problematic, and JSON arrays can't have attributes, so you're stuck with some kind of additional protocol that maps the two.
For something that is basically text (like an HTML document) or a list of dictionaries (like RSS) it may not seem obvious what the value of these things are (or even what they mean, if you have little exposure to XML), so I'll try and explain some of that.
1. Processing instructions are like <?xml?> and <?xml-stylesheet?> -- these let your application embed linear processing instructions that you know are for the implementation, and so you know what your implementation needs to do with the information: If it doesn't need to do anything, you can ignore them easily, because they are (parsewise) distinct.
2. References (called entities) are created with <!ENTITY x ...> and then you use them as &x; maybe you are familiar with &lt; representing < but this is not mere string replacement: you can work with the pre-parsed entity object (for example, if it's an image), or treat it as a reference (which can make circular objects possible to represent in XML) neither of which is possible in JSON. Entities can be behind an external URI as well.
3. Comments are for humans. Lots of people put special {"comment":"xxx"} objects in their JSON, so you need to understand that protocol and filter it. They are obvious (like the processing instructions) in XML.
4. JSON numbers fold into floats of different sizes in different implementations, so you have to avoid them in interchange protocols. This is annoying and bug-prone.
5. Attributes are the things on xml tags <foo bar="42">...</foo> - Some people map this in JSON as {"bar":"42","children":[...],"tag":"foo"} and others like ["foo",{"bar":"42"},...] but you have to make a decision -- the former may be difficult to parse in a streaming way, but the latter creates additional nesting levels.
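Point 4 is easy to reproduce. A small sketch in Python: the `parse_int=float` hook is used here only to imitate a consumer that stores every JSON number as an IEEE 754 double (as JavaScript does), and `parse_float=Decimal` shows one way to keep values exact. The payload is made up.

```python
import json
from decimal import Decimal

# Hypothetical payload: an ID just above 2**53 and a decimal amount.
payload = '{"id": 9007199254740993, "amount": 0.1}'

# Imitate an implementation that folds every number into a double.
as_doubles = json.loads(payload, parse_int=float)
id_after_double = int(as_doubles["id"])

# Keep integers as arbitrary-precision ints and decimals as Decimal.
exact = json.loads(payload, parse_float=Decimal)

print(id_after_double)   # no longer the ID that was sent
print(exact["id"], exact["amount"])
```

The ID comes back changed in the first case, which is why many interchange formats end up sending such values as strings.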
None of this is insurmountable: You can obviously encapsulate almost anything in almost anything else, but think about all the extra work you're doing, and how much risk there is in that code working forever!
For me: I process financial/business data mostly in XML, so it is very important I am confident my implementation is correct, because shit happens as the result of that document getting to me. Having the vendor provide a spec any XML software can understand helps us have a machine-readable contract, but I am getting a number of new vendors who want to use JSON, and I will tell you their APIs never work: They will give me openapi and swagger "templates" that just don't validate, and type-coding always requires extra parsing of the strings the JSON parsing comes back with. If there's a pager interface: I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's a ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
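The date situation described above tends to end in code like the following. This is only a sketch of the kind of defensive parsing involved, not anyone's actual integration: `parse_timestamp` is a hypothetical helper, and the rule "numbers of 1e11 or more are milliseconds" is an assumed heuristic that a real vendor contract should replace.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_timestamp(value):
    """Normalise a vendor timestamp to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        # Assumed heuristic: very large numbers are milliseconds, not seconds.
        seconds = value / 1000 if abs(value) >= 1e11 else value
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    text = value.strip()
    try:
        # ISO 8601 style; map a trailing "Z" for older Python versions.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        # Fall back to the HTTP / RFC 2822 date format.
        parsed = parsedate_to_datetime(text)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

samples = [
    1709294400,                       # unix seconds
    1709294400000,                    # "1000x off": milliseconds
    "2024-03-01T12:00:00Z",           # ISO 8601 style string
    "Fri, 01 Mar 2024 12:00:00 GMT",  # HTTP date
]
parsed = [parse_timestamp(s) for s in samples]
```

All four samples are meant to denote the same instant; with `xs:dateTime` in a schema none of this guessing would be needed.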
So I am always finding JSON that I wish were XML, because (in my use-cases) XML is just plain better than JSON, but if you do a lot in languages with poor XML support (like JavaScript, Python, etc) all of these things will seem hard enough you might think json+xyz is a good alternative (especially if you like JSON), so I understand the need for stuff like "xee" to make XML more accessible so that people stop doing so much with JSON. I don't know rust well enough to know if xee does that, but I understand fully the need.
XML can really shine in the markup role. It got such a bad rap because people used it as a pure data format, something it isn't very suited for.
```js
import * as fastXmlParser from 'fast-xml-parser';

// Keep attributes in the parsed output (they are ignored by default).
const xmlParser = new fastXmlParser.XMLParser({ ignoreAttributes: false });
```
Validate input as required with jschema.
The obvious solution is streaming, but streaming appears to not be supported, though is listed under Challenging Future Ideas: https://github.com/Paligo/xee/blob/main/ideas.md
How hard is it to implement XML/XSLT/XPATH streaming?
There's a potential alternative to streaming, though - succinct storage of XML in memory:
https://blog.startifact.com/posts/succinct/
I've built a succinct XML library named Xoz (not integrated into Xee yet):
https://github.com/Paligo/xoz
The parsed in memory overhead goes down to 20% of the original XML text in my small experiments.
There's a lot of questions on how this functions in the real world, but this library also has very interesting properties like "jump to the descendant with this tag without going through intermediaries".
May I ask why? I used to do a lot of XSLT in 2007-2012 and stuck with XSLT 2.0. I don't know what's in 3.0 as I've never actually tried it but I never felt there was some feature missing from 2.0 that prevented me from doing something.
As for streaming, an intermediary step would be the ability to cut up a big XML file in smaller ones. A big XML document is almost always the concatenation of smaller files (that's certainly the case for Wikipedia for example). If one can output smaller files, transform each of them, and then reconstruct the initial big file without ever loading it in full in memory, that should cover a huge proportion of "streaming" needs.
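For what it's worth, the "process one record at a time" idea can be sketched with the incremental parser in Python's standard library. This is not XSLT streaming and has nothing to do with Xee; `iter_records` is a hypothetical helper, and it assumes the records are not nested inside each other.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag):
    """Yield each completed <tag> element, then discard it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Drop everything parsed so far so memory use stays bounded.
            root.clear()

# Tiny stand-in for a large dump.
dump = io.BytesIO(
    b"<dump><page><title>A</title></page><page><title>B</title></page></dump>"
)
titles = [page.findtext("title") for page in iter_records(dump, "page")]
print(titles)
```

Each yielded element could be transformed and written out on its own, which is roughly the split, transform, and reassemble workflow described above.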
I suspect with the right FM-Index Xoz might be able to store huge documents in a smaller size than the original, but that's an experiment for the future.
With modern SSDs and disk cache, that's likely enough to be plenty performant without having to store the whole document in memory at once.
It's actually quite annoying in the general case. It is completely possible to write an XPath expression that says to match a super early tag on an arbitrarily-distant further tag.
In another post in this thread I mention how I think it's better to think of it as a multicursor, and this is part of why. XPath doesn't limit itself to just "descending", you can freely cursor up and down in the document as you write your expression.
So it is easy to write expressions where you literally can't match the first tag, or be sure you shouldn't return it, until the whole document has been parsed.
Saxon's paid edition supports it. I've done it a few times, but you have to write your XSLT in a completely different way to make it work.
There's some fast databases that store prefix trees, which might be suitable for such a task actually (something like infinitydb). But building this database will basically take a while (it will require parsing the entire document). But i suppose if reading/querying is going to happen many times, its worth it?
If the payloads in question are in that range, the time spent to support streaming doesn't feel justified compared to just using a machine with more memory. Maybe reducing the size of the parsed representation would be worth it though, since that benefits nearly every use case
In any case I don't have $1500 to blow on a new computer with 100GB of ram in the unsubstantiated hope that it happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for every person that wants to mess with big xml files.
You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can stream those single-page documents to an XML parser without worrying about document size. This of course doesn't apply in the general case where you are processing arbitrary huge XML documents.
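A minimal sketch of that state machine in Python, under the same assumptions stated above: the tags are literally `<page>` and `</page>` (no attributes), pages are not nested, and a literal `<page>` never appears unescaped inside page content. `iter_pages` and the sample input are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(stream, chunk_size=65536, open_tag="<page>", close_tag="</page>"):
    """Yield each <page>...</page> span from a text stream as a string."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            start = buf.find(open_tag)
            if start == -1:
                # Outside a page: keep only a tail that might be a partial tag.
                buf = buf[-(len(open_tag) - 1):]
                break
            end = buf.find(close_tag, start)
            if end == -1:
                # Inside a page: wait for more input.
                buf = buf[start:]
                break
            end += len(close_tag)
            yield buf[start:end]
            buf = buf[end:]

sample = io.StringIO(
    "<mediawiki><page><title>A</title></page>\n"
    "<page><title>B</title></page></mediawiki>"
)
# A deliberately tiny chunk size to exercise tags split across reads.
titles = [ET.fromstring(p).findtext("title") for p in iter_pages(sample, chunk_size=7)]
print(titles)
```

Only one page is held in memory at a time, so the size of the whole dump stops mattering.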
I have processed Wikipedia many times with less than 8 GB of RAM.
In the English Wikipedia the wikitext accounts for about 80% of the bytes of the decompressed XML dump.
Not sure how common 100GB files are but I can certainly imagine that being the norm in certain niches.
I hated it the minute I learned about it, because it missed something I knew I cared about, but didn’t have a word for in the 90s - developer ergonomics. XML sucks shit for someone who wants to think tersely and code by hand. Seriously, I hate it with a fiery passion.
Happily to my mind the economics of easier-for-creators -> make web browsers and rendering engines either just DEAL with weird HTML, or else force people to use terse data specs like JSON won out. And we have a better and more interesting internet because of it.
However, I’m old enough now to appreciate there is a place for very long-standing standards in the data and data transformation space, and if the XML folks want to pick up that banner, I’m for it. I guess another way to say it is that XML has always seemed to be a data standard which is intended to be what computers prefer, not people. I’m old enough to welcome both, finally.
On one hand, you aren't wrong: XML has in fact been used for machine-to-machine communication mostly. OTOH, XML was just introduced as a subset of SGML doing away with the need of vocabulary-specific markup declarations for mere parsing in favor of always requiring explicit start- and end-element tags. Whereas HTML is chock full of SGMLisms such as tag inference (for example inferring paragraph ends on block elements), empty ("self-closing") elements and enumerated ("boolean") attributes driven by per-element declarations.
One can argue to death whether the web should work as a mere document delivery network with rigid markup a la XML, or that browsers should also directly support SGML authoring idioms such as the above shortform mechanisms. SGML also has text macros/shared fragments (entities) and even allows defining own parsing tokens for markdown, math, CSV, or custom syntaxes. HTML leans towards SGML in that its documentation portrays HTML as an authoring language, but browsers are lacking even in basic SGML features such as entities.
I do wonder what web application markup would look like today if designed from scratch. It is kind of amazing that HTML and CSS can be used for creating beautiful documents viewable on pretty much any device with a screen AND also for creating dynamic applications with pixel-perfect rendering, special effects, integrations with the device’s hardware, and even external peripherals.
If there was ever scope creep in a project this would be it. And given the recent discussion on here of curses based interfaces it reminded me just how primitive other GUI application layout tools can be while still achieving amazing results. Even something like GTK does not need the intense level of layout engine support and yet is somehow considered richer in some ways and probably more performant for a lot of stuff that’s done with it.
So I am curious what web application development would look like today if it wasn’t for HTML being “good enough”.
We just couldn't keep apps' hands out of the cookie jar back then.
Suggesting something like HTML would have you laughed out of the room.
This is only the case because the BigTech view is one of an application platform.
I wish someone would write "XML - The Good Parts".
Others might argue that this is JSON but I'd disagree:
- No comments is a non-starter
- No proper integers
- No date format
- Schema validation is a primitive toy compared to what we had for XML
- Lack of allowed trailing commas
YAML ain't better. I hated whitespace handling in XML, it's a miracle how YAML could make it even worse.
XML is from era long past and I certainly don't want to go back there, but it had its good parts and I feel we have not really learned a lot from its mistakes.
In the end maybe it is just that developer ergonomics is largely a matter of taste and no language will ever please everyone.
I know it's passé in the web dev world, but in my work we still work with XML all the time. We even have work in our queue to add support for new data sources built on XML (specifically QIF https://qifstandards.org/).
It's fine with me... I've come to like XML. It's nice to have a standard, easy way to do schemas, validators, processors, queries, etc. It can be overdone and it's not for every use case, but it's pretty good at what it does.
In my military work, I've heard the senior project managers refer to a modern battleship as a floating XML document.
That is because the web dev world is unfortunately obsessed with the current thing. They chase trends like their lives depend on it.
what I really dread in XML though is that XML only has idref/id standardized, and no path references. so without tool support you can't navigate to a reference target.
which turns XML into the "binary" format for GUI tools.
When was the last time you had an editor that wouldn't just auto close the current tag with "</" ? I mean it's a god-send for knowing where you are in a large structure. You aren't scrolling to the top to find which tag you are in.
Again not great for bigger documents.
Interesting take, but I'm always a little hesitant to accept any anthropomorphizing of computer systems.
Isn't it always about what we can reason and extrapolate about what the computer is doing? Obviously computers have no preference so it seems like you're really saying
"XML is a poor abstraction for what it's trying to accomplish" or something like that.
Before jQuery, Chrome, and web 2.0, I was building XSLT-driven web pages that transformed XML in an early nosql doc store into HTML and it worked quite beautifully and allowed us to skip a lot of schema work that we definitely weren't ready or knowledgeable enough to do.
EDIT: It was the perfect abstraction and tool for that job. However the application was very niche and I've never found a person or team who did anything similar (and never had the opportunity to do anything similar myself again)
In fact the RSS reader I built still uses XSLT to transform the output to HTML as it’s just the easiest way to do so (and can now be done directly in the browser).
Names withheld to protect the guilty. :)
That was a huge reason JSON took over.
Another reason was the overall XML ecosystem grew unwieldy and difficult to navigate: XPath, XSLT, SOAP, WSDL, XPointer, XLink, XForms... They all made sense in their own way, but it was difficult to master them all. That complexity, plus the poor ergonomics, is what paved the way for JSON to become preferred.
I suspect it was SOAP and WSDL that killed it for a lot of people though. That was a typical example of a technical solution looking for a problem and complete overkill for most people.
The whole namespace thing was probably a step too far as well.
I've written a rudimentary EPUB parser in Clojure and found it easier to work with zippers than any other data structure to e.g. look for <rootfile> elements with a <container> ancestor.
Zippers are available in most programming languages, thankfully, so this advantage is not really unique to Clojure (or another Lisp). However, I will agree that something like sexps (or Hiccup) is more convenient than e.g. JSX, since you are dealing with the native syntax of the language rather than introducing a compilation step and non-standard syntax.
Racket has helper libraries like TxExpr (https://docs.racket-lang.org/txexpr/index.html) that make it pretty easy to manipulate S-expressions of this kind.
As in, I don't see a difference between `(attr "val")` which expresses an attribute key/value pair and `(thing "world")` which expresses a tag/content relationship. Even if I thought the rule might be "if the first element of the list is a list itself then it should be interpreted as a set of attribute key value pairs" then I would still be ambiguous with:
which could serialize to either: or: In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.
See a grammar for the representation at https://docs.racket-lang.org/xml/index.html#%28def._%28%28li...
Most Scheme tools for working with XML use a different layout where a list starting with the symbol @ indicates attributes. See https://en.wikipedia.org/wiki/SXML for it.
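To show what that layout buys you, here is a rough Python rendering of the idea, with nested lists standing in for S-expressions. It follows the SXML convention of an `@` marker for the attribute list but is not a complete or official SXML mapping (no namespaces, comments, or processing instructions); `to_sxml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def to_sxml(elem):
    """Convert an element to an SXML-like nested list."""
    node = [elem.tag]
    if elem.attrib:
        # Attributes go in a sub-list explicitly marked with "@".
        node.append(["@"] + [[k, v] for k, v in elem.attrib.items()])
    if elem.text and elem.text.strip():
        node.append(elem.text)
    for child in elem:
        node.append(to_sxml(child))
        if child.tail and child.tail.strip():
            node.append(child.tail)
    return node

simple = to_sxml(ET.fromstring('<thing attr="val">world</thing>'))
mixed = to_sxml(ET.fromstring('<p>Hello <b class="x">big</b> world</p>'))
print(simple)
print(mixed)
```

Because the attribute list always starts with `@`, a child element can never be mistaken for an attribute pair, whatever its position in the list.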
I would prefer the differentiation to be more explicit than the position and/or structure of the list, so the @ symbol modifier for the attribute list in other tools makes sense.
The sibling comment with a map with a :attrs key feels even better. I don't work in languages with pattern matching or that kind of thing very often, but if I was wanting to know if a particular element had 1 or more attributes then being able to check a dictionary key just feels like a nicer kind of anchor point to match against.
Just remember that it's a markup language, and then it's not head-scratching at all: the text is the text being marked up, and the attribute values are the attribute of the markup - things like colour and font.
When it was co-opted to store structured data, those people didn't obey this rule (which would make everything attributes).
Namespaces had a very cool use in XHTML: you could just embed an SVG or MathML directly in your HTML and the browser would render it. This feature was copied into HTML5.
I mean, if I'm modeling a <Person> node in some structured format, making a decision about "what is the attribute of the person node" vs "what is a property of the specific Person" isn't an easy call to make in all cases. And then there are cases where an attribute itself ought to have some kind of hierarchy. Even the text example works here: I have a set of font properties and it would make sense to maybe have:
Rather than a series of `fontFamily`, `fontSize`, etc. attributes. This is true when those attributes are complex objects that ended up having nesting at several levels. You end up in the circumstance where you are forced to make things that ought to be attributes into children because you want to model the nested structure of the attributes themselves. Then you end up with some kind of wrapper structure where you might have a section for meta-data and a section for the real content.
I just don't think the distinction works well for an extensible markup language where the nesting of elements is more or less the entire point.
It is much easier to write out though, which is why you often see `<Element content=" ... " />` patterns all over the place.
There's something very zen-like with this language; you put a document in a kind of sieve and out comes a "better" document. It cannot fail; it can be wrong, full of errors, of course (although if you're validating the result against a schema it cannot be very wrong); but it will almost never explode in your face.
And then XSLT work kind of disappeared; I miss it a lot.
At the risk of glibly missing the main point of your comment, take a look at KDL. Unlike JSON/TOML/YAML, it features XML-style node semantics. Unlike XML, it's intended to be human-readable and writeable by hand. It has specifications for both a query language and a schema language as well as implementations in a bunch of languages. https://kdl.dev/
There, I said it.
XML gives you an object soup where text objects can be anywhere and data can be randomly stored in tags or attributes.
It just doesn't at all match the object model used by basically all programming languages.
I think that's a big reason JSON is so successful. It's literally the object model used by JavaScript. There's no weird impedance mismatch between the data represented on disk and in your program.
Then someone had to go and screw things up with YAML...
JSON5 is the way.
This could become a great foundation for a typed, (mostly) etree-compatible, python library built on top of this. I've used lxml for years and it's still my goto, but there are lots of places where it could be modernized.
A couple years ago, I stumbled on a discussion considering deprecation/removal of XSLT support in Chrome. At some point in the discussion, they mentioned observing a notable uptick in usage—enough of an uptick (from a baseline of approximately zero) that they backed out.
The timing was closely correlated with work I’d done to adapt a library, which originally used XSLT via native Node extensions, to browser XSLT APIs. The project isn’t especially “popular” in the colloquial sense of the term, but it does have a substantial niche user base. I’m not sure how much uptake the browser adaptation of this library has had since, but some quick napkin math suggested it was at least plausible that the uptick in usage they saw might have been the onslaught of automated testing I used to validate the change while I was working on it.
Also, is it up to browser implementations, or does WHATWG expect browsers to stay at version XSLT 1?
XSLT 2 and 3 are W3C standards written by the sole commercial provider of an XSLT 2 or 3 processor, which is problematic not only because it reduces W3C to a moniker for pushing sales, but also because it undermines W3C's own policy of requiring at least two interworking implementations for a spec to get "recommendation" status.
XSLT is of course a competent language for manipulating XML. It would be a good fit if your processing requires lots of XML literals/fragments to be copied into your target document since XSLT is an XML language itself. Though OTOH it uses XPath embedded in strings excessively, thereby distrusting XML syntax for a core part of its language itself, and coding XPath in XML attributes can be awkward due to restrictive contextual encoding rules for special characters and such.
XSLT can be a maintenance burden if used casually/rarely, since revisiting XSLT requires substantial relearning and time investment due to its somewhat idiosyncratic nature. IDE support for discovery, refactoring, and test automation etc. is lacking.
https://joeldueck.com/feed.atom
I was following the W3C XSLT mailing list for quite some time back when they were doing 3.x, and this does not strike me as accurate.
The big place I've successfully used XSLT was in TEI, which nobody outside digital humanities uses. Even then, the XSLT processing is usually minimal, and Javascript is going to do a lot of work that XSL could have done.
It's another proof that working on fundamental tools is a good thing.
In my very opinionated opinion, XPath is about 99% of the value of XSLT, and XSLT itself is a misfire. Embedding an XML language in XML, rather than being an amazing value proposition, is actually a huge and really annoying mistake, in much the same way and for much the same reason that anyone who has spent much time around shell scripting has found embedding shell strings in shell strings (and, if the situation is particularly dire, a third or fourth level of such nesting) quite unpleasant. Imagine trying to deal with bash, except you have to first quote all the command lines as bash strings like you're using bash -c, all the time. I think "XPath + your favorite language" has all the power of XSLT and, generally, better ergonomics and comprehensibility. Once you've got the selection of nodes in hand, a general-purpose programming language is a better way to deal with their contents than what XSLT provides. Hence why it has always languished.
Basically the only thing it's missing in XQuery vs XSLT is template rules and their application; but IMO simple ones are just as easy to write explicitly, and complex rulesets are hard to reason about and maintain anyway.
For cases where a host system wants to execute user-defined data transformations safely, XSLT seems like it might be useful. When they mature, maybe WASM and WASI will fill the same niche with better developer ergonomics?
Using an XML library to iterate through an entire XML document without XPATH is like looping through entire database tables without a JOIN filter or a WHERE clause.
XSLT is the SELECT, transforming XML output with a new level of crazy for recursion.
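To make the database analogy concrete, here is a minimal sketch using only Python's standard library. The sample document and both function names are invented for illustration, and note that `xml.etree.ElementTree` supports only a limited subset of XPath, so this shows the idea rather than the full language.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
DOC = """
<library>
  <shelf genre="fiction">
    <book lang="en"><title>Dune</title></book>
    <book lang="de"><title>Momo</title></book>
  </shelf>
  <shelf genre="reference">
    <book lang="en"><title>SICP</title></book>
  </shelf>
</library>
"""

root = ET.fromstring(DOC)


def titles_by_hand(root, lang):
    """Walk the tree and filter manually (the 'full table scan')."""
    found = []
    for shelf in root:
        if shelf.tag != "shelf":
            continue
        for book in shelf:
            if book.tag == "book" and book.get("lang") == lang:
                title = book.find("title")
                if title is not None:
                    found.append(title.text)
    return found


def titles_by_path(root, lang):
    """Let a path expression do the selection (the 'WHERE clause')."""
    return [t.text for t in root.findall(f".//book[@lang='{lang}']/title")]


print(titles_by_hand(root, "en"))
print(titles_by_path(root, "en"))
```

Both functions select the same titles; the path version just states what is wanted instead of how to walk there.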
Let's look at JSON by comparison. Hmm, let's see: JSONPath, JMESPath, jq, jsonql, ...
The disadvantage is that it's not easily embeddable in your own programs - so programs use JSONPath / Go templates often.
I too am probably going to embed jmespath in my app. I need it to allow users to fill CLI flags from config files, and it'll replace my crappy homegrown version ( https://github.com/bbkane/warg/blob/740663eeeb5e87c9225fb627... )
- the error messages are night and day better than C-jq
- the magick of $(gojq --yaml-input), although I deeply abhor that it is 10 characters longer than "-y"
It's worth mentioning https://github.com/01mf02/jaq (MIT) because it actually strives to be an implementation of the specification versus just "execute better" as gojq does
I said goodbye to it a few weeks ago, personally (https://world-playground-deceit.net/blog/2025/03/a-common-li... https://world-playground-deceit.net/blog/2025/03/speeding-up...)
Oh, yeah, I 100% want to type this 15 times a day
because that is undoubtedly better than
I mean, seriously, who can read that terrible DSL with all of its line noise
> The query part isn't even that well done, compared to XPath/JSONPath.
XPath I'll grant you, because it's actually very strong, but putting JSONPath near jq in a "could be improved" debate tells me you're just not serious. JSONPath is a straight-up farce.
Another example: if you have very large XML files that don't even fit into memory, you can still stream-process them with XSLT.
It makes you the master of XML transformations and fetching information out of complex XML ;)
(Aside: A long time ago, I had written an alternate XPath 1.1 implementation for Wine during GSoC, but rather shamefully, I never actually got it merged. Life became very hectic for me during that time period and I never really looped back to it. Still feel pretty bad about it all these years later.)
- XPath literally didn't exist when CSS selectors were introduced
- XPath's flexibility makes it a lot more challenging to implement efficiently, even more so when there are thousands of rules which need to be dynamically reevaluated at each document update
- XPath is lacking conveniences dedicated to HTML semantics, and handrolling them in xpath 1.0 was absolutely heinous (go try and implement a class predicate in xpath 1.0 without extensions)
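For anyone who hasn't tried the class-predicate exercise from the last bullet: the usual XPath 1.0 workaround is the expression stored in `XPATH_1_0` below. The standard-library `xml.etree.ElementTree` cannot evaluate it (it has no `contains()` or `concat()`), so this sketch only mirrors the predicate's logic in plain Python. The sample markup and helper names are made up for illustration, and `str.split()` treats a few more characters as whitespace than XPath's `normalize-space()` does, so take it as an approximation.

```python
import xml.etree.ElementTree as ET

# The usual XPath 1.0 stand-in for the CSS selector ".note".
XPATH_1_0 = "//*[contains(concat(' ', normalize-space(@class), ' '), ' note ')]"


def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(s.split())


def has_class(element, name):
    """Mirror of the predicate above, written out step by step."""
    padded = " " + normalize_space(element.get("class", "")) + " "
    return f" {name} " in padded


# Hypothetical sample markup, invented for illustration.
DOC = """
<div>
  <p class="note">a</p>
  <p class="  big   note ">b</p>
  <p class="footnote">c</p>
  <p>d</p>
</div>
"""

root = ET.fromstring(DOC)
matches = [el.text for el in root.iter() if has_class(el, "note")]
print(matches)
```

The space padding is what keeps `footnote` from matching `note`, which is the whole reason the XPath version is so noisy compared to `.note`.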
[citation required]
https://www.w3.org/TR/1999/REC-xpath-19991116/
https://www.w3.org/TR/REC-CSS1-961217
> W3C Recommendation 17 Dec 1996, revised 11 Jan 1999
There are various drafts and statuses, so it's always open to hair-splitting but based only on the publication date CSS does appear to win
YES! This is so true! And ridiculous! It's a mystery why we didn't simply reuse XPath for selectors... it's all in there!!
It's not really a mystery:
> CSS was first proposed by Håkon Wium Lie on 10 October 1994. [...] discussions on public mailing lists and inside World Wide Web Consortium resulted in the first W3C CSS Recommendation (CSS1) being released in 1996
> XPath 1.0 was published in 1999
CSS2 was released before XPath 1.0.
- the "descendant" combinator (whitespace)
- the "class" selector (".foo")
The 1998 CSS2 introduced "child", "following sibling", and attribute selectors. This state of things then remained unchanged forever (I see that Selectors Level 3 became a recommendation only in 2018?).
On the other hand, in 1999, XPath already specified all those basic ways to navigate the DOM, and CSS still doesn't have them all as of 2025.
I have a service that extracts <meta> tags in webpages and to do that I'm currently using (and need) three different dependencies: html5ever, markup5ever_rcdom, markup5ever. I don't like those to be honest, the documentation is quite bad and it was difficult to understand how I should have used the libraries to achieve such a simple task.
XPath on the other hand makes this extremely easy in comparison, I wonder how this will perform compared to my current solution.
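As a rough illustration of how short the path-based version can be, here is a sketch in Python's standard library. It assumes the page is already well-formed XML, which real-world HTML usually is not, and the sample page and the `extract_meta` helper are hypothetical. It says nothing about how Xee or html5ever would perform.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be well-formed XML.
PAGE = """
<html>
  <head>
    <title>Example</title>
    <meta name="description" content="A sample page"/>
    <meta property="og:title" content="Example"/>
    <meta charset="utf-8"/>
  </head>
  <body><p>Hello</p></body>
</html>
"""


def extract_meta(markup):
    """Collect name/property -> content pairs from <meta> elements."""
    root = ET.fromstring(markup)
    result = {}
    # ".//meta[@content]" selects every <meta> carrying a content attribute.
    for meta in root.findall(".//meta[@content]"):
        key = meta.get("name") or meta.get("property")
        if key:
            result[key] = meta.get("content")
    return result


print(extract_meta(PAGE))
```

The selection itself is a single path expression; everything else is just packing the result into a dict.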
Unfortunately at this point there's no HTML parser frontend for Xee (and its underlying library Xot) yet (HTML 5 parser serialization is supported at least in code). It shouldn't be too hard to add at least HTML 5 support using something like html5ever.
https://github.com/Paligo/xee/blob/xee-v0.1.5/COPYRIGHT
And that goes double for when there is a separate LICENSE file in the repo https://github.com/Paligo/xee/blob/xee-v0.1.5/LICENSE-MIT
The fact that many project maintainers forget about vendored content and haphazardly slap the MIT license (or whatever) verbatim into a LICENSE file doesn’t actually give you a get-out-of-paying-lawyers-free card! If anything, Xee’s COPYRIGHT file gives me more confidence in my legal footing than an unadulterated LICENSE file would. It indicates the maintainer at least has a basic understanding of how copyright applies to their project.
Another pain point with XML is the lack of inline schema, so the languages around like XPath have to work with arbitrary structures unlike say JSON where you at least have basic primitives like map/dict, numbers, bool, etc
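A small sketch of what that difference looks like in practice, using only Python's standard library. The two sample documents are invented and carry the same information; without a schema, the XML side comes back as strings.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration.
JSON_DOC = '{"name": "widget", "count": 3, "active": true}'
XML_DOC = "<item><name>widget</name><count>3</count><active>true</active></item>"

from_json = json.loads(JSON_DOC)
from_xml = {child.tag: child.text for child in ET.fromstring(XML_DOC)}

# JSON carries basic types in the syntax itself.
print({k: type(v).__name__ for k, v in from_json.items()})
# Schema-less XML hands back text; any typing is up to the consumer.
print({k: type(v).__name__ for k, v in from_xml.items()})
```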
It's like a handbag whose main claim to being a premium product isn't workmanship or materials, but that it has Gucci on its side.
Knockoffs aside, the latter is intended to serve as a proxy for the former. I too will be happy when Rust is the boring everyday choice, but in 2025 we still see new buffer overflows every day. And if I'm picking a library, I still want to know if it's in the same language as the app it's going into.
For example, Apache HTTPD has never had an official module to serve XML via XSLT transformation.
And XSL:FO looks even more obscure.
XSLT was not popular for its original intended application - which is to say, serving XML data from web servers and translating it to HTML (or XSL:FO, or ...) on the client as needed. However, it was used plenty for XML processing outside of that particular niche.
New projects these days rarely have to process complicated XML to begin with. But when you do, I'd say XSLT (or perhaps better yet, XQuery) is a very useful tool to have in your toolbox.
As opposed to what for cooking "PDF via XML" files? Because I can assure you that feeding rando.odt into $(libreoffice -pdf $TMPDIR/ohgawd) is 100% not the same as $(fop -fo $TMPDIR/my.fo -pdf $TMPDIR/out.pdf)
Equifax and Experian’s APIs immediately come to mind as sources of documents with complex results that people often want to turn into some type of visual representation with XSLT.
But of course, I see only a part of the picture.
So good it has its own Wikipedia page!
https://en.wikipedia.org/wiki/XSLT/Muenchian_grouping
I mean, talk about hacker cred.
Sounds nice but doesn't match my lived experience with both Chrome's built-in XSLT processor and `xsltproc`. (I was using XSLT 1.0, for legacy reasons, so maybe this is an XSLT 1.0 issue?)
> Do you mean that you'd like facilities to nicely reindent the output?
No, I do mean preserve whitespace (i.e., formatting), such as between elements and between attributes.
When you consider that .docx, .pptx, and .xlsx files are zipped XML archives, "niche" seems a misnomer.
Maybe it's good for compression, but probably not by a factor much bigger than gzip/brotli/zstd.
I've always liked XML, and especially XPath, and even though there were a large number of missteps in the heyday of XML, I feel it has always been unfairly maligned. Look at all the people who reinvent XML tooling but for JSON, but not nearly as well. Luckily, people who value XML can still use it, provided the fit is right. But it's nice to see the tides turning.
Most fashions really are cyclical.
I think XML is good for expressing document formats and for configuration settings. I prefer JSON for data serialization, though.
If you like React's JSX; enjoy its strictures and clean, readable "HTML"; then good news, you're writing XML (but without namespacing).
- Not including self-closing tags, there should only be one close tag: </>
- Elements are for data. Attributes are evil
- XPath indexing should be 0-based
- Documents without a schema should not make your tools panic or complain
- An XML document shouldn't have to waste its time telling you it's an XML document in XML
I maintain that one of the reasons JSON got so popular so quickly is because it does all of the above. The problem with JSON is that you lose the benefits of having a schema to validate against.
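On the indexing bullet, a tiny sketch of the mismatch being complained about, using the positional predicates that Python's `xml.etree.ElementTree` supports. The sample list is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration.
root = ET.fromstring("<list><item>a</item><item>b</item><item>c</item></list>")

# XPath positions start at 1 ...
first_by_path = root.find("item[1]").text
last_by_path = root.find("item[last()]").text

# ... while the host language's lists start at 0.
items = root.findall("item")
first_by_index = items[0].text
last_by_index = items[-1].text

print(first_by_path, first_by_index)
print(last_by_path, last_by_index)
```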
This is like, your opinion, man... ;-) You can devise your schema any way you want. Attributes are great, and they exist in HTML in the form of datasets, which, as usual, are a poorly-specified and ill-designed rethinking of XML attributes
> Documents without a schema should not make your tools panic or complain
They don't. You absolutely don't need a schema. If you declare a schema, it should exist. If not, no problem?
Like I think this guy is mostly correct in identifying bad XML: https://www.devever.net/~hl/xml
Though I don't necessarily agree with the "data format" framing. This idea that markup languages are not data formats seems confused.
> They don't. You absolutely don't need a schema. If you declare a schema, it should exist. If not, no problem?
I agree that they should not.
However, I have used many tools that puke when presented with XML fragments or XML with no schema.
That said, these days most Microsoft XML dialects are actually XAML-based, and in XAML attributes are basically syntactic sugar - you can write:
or (the dot in the syntax makes it possible for the XAML parser to distinguish nested elements that represent properties from nested elements that represent child objects)
XPath is cute, but if you don't mind bloat, text-only and lack of ergonomics anyway, then Conjunctive Regular Path Queries and RDF are miles ahead of XML as a data storage solution. (Not serialised as XML please xD)
The vast, vast majority of devs' only experience of XML is what they hear second hand. I'm sure a lot more would like it if they tried it.
XSD, XPath, and XSLT are all domains where I'd argue that reading and reasoning about the code are way more important.
When troubleshooting an issue, I don't mind scanning XML for a few data points so I can confirm what values are being communicated, but when I need to figure out how/why a specific value came to be, I don't want the logic spread throughout a giant text file wrapped in attribute value strings, and other non-debuggable "code". I'd rather it just be in a proper programming language.
As someone who has used many programming languages and who went through the process of implementing this one I have many opinions about XPath and XSLT as programming languages. I myself am more interested in implementing them for others who value using them than using them myself. I do recognize there is a sizeable community of people who do use these tools and are passionate about them - and that's interesting to see and more power to them!
There's an XML conference?!
https://www.xmlprague.cz/ https://www.balisage.net/ https://declarative.amsterdam/ https://markupuk.org/ https://xmlsummerschool.org/