
The concept behind fragmented XML text nodes

posted Apr 20, 2010, 5:28 AM by Szabolcs Szádeczky-Kardoss   [ updated Apr 22, 2010, 1:50 AM by István Soós ]
A few weeks ago a friend asked me about a problem with XMLStreamReader. We quickly concluded that it was not an error at all but a consequence of how XML processing tools work, although it can seem strange the first time you encounter it. The point is that XML text nodes are not necessarily delivered in one piece: while you read the XML, you might receive only fragments.

For example, take the text "Q&A", which is escaped in XML as "Q&amp;A". Instead of reading it as one whole string, you might first read the string "Q", then "&", and finally "A", as in the following code:

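import javax.xml.stream.*;
public class TestTextNode {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new java.io.StringReader(xml));
        while (reader.hasNext())  // print each text event on its own line
            if (reader.next() == XMLStreamConstants.CHARACTERS) System.out.println(reader.getText());
    }
}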

On Sun's Java 6 JVM you receive just "Q" in the first round. Subsequent reads return the remaining characters, but to the unprepared this looks strange. So why does this happen?

XML allows very large files. If you look at XML dumps, for example, files larger than a few GB are not unusual. There is no limit on how big a text node can be, so it is the tool's responsibility to process it in reasonable chunks. If you load the file into a DOM, you get a large tree in memory; if you have much more memory than the size of the XML itself, there is a good chance it will fit. For large XML files or for some kinds of processing, however, you just stream through the data and do not build a DOM tree.
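
To make the DOM side of this trade-off concrete, here is a minimal sketch that loads the small "Q&amp;A" document into a DOM tree and reads the element's text. The class and method names (DomTextDemo, rootText) are made up for this illustration. It relies on getTextContent(), which concatenates the text beneath an element, so the caller sees the whole string at the price of holding the entire tree in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: builds a full DOM tree and returns the root element's text.
class DomTextDemo {
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The whole document is parsed into memory before we can read anything.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // getTextContent() concatenates all text below the element.
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // prints Q&A
        System.out.println(rootText("<?xml version=\"1.0\" ?><test>Q&amp;A</test>"));
    }
}
```

This is convenient for small inputs, but for a multi-GB file the tree may simply not fit, which is where streaming comes in.");
if (!domResult.equals("Q&A")) throw new AssertionError("unexpected DOM text: " + domResult);
</test>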

As in the example above, while you stream through the XML, you receive text nodes. These are usually delimited by:
  • a closing tag or another opening tag
  • the buffer size of the streamer (when the buffer is full, the stream reader hands over the text read so far)
  • special escape characters (as above, the escaped &amp; resulted in a new fragment)
While the first one is trivial, the second and third are lesser-known internals of XML parsers, but from the memory consumption perspective there seems to be a good reason behind them.
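
If you want to stay with the streaming API and still see whole texts, one option is to glue the fragments together yourself. The following is only a sketch under my own assumptions: the class and method names (TextAccumulator, readWholeTexts) are invented for this illustration, and it treats every start or end tag as the boundary of a text node. It appends each CHARACTERS or CDATA event to a buffer and flushes the buffer at the next tag, so it does not matter where the parser decided to split the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects complete text nodes regardless of fragmentation.
class TextAccumulator {
    static List<String> readWholeTexts(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    // A fragment of the current text node: keep appending.
                    current.append(reader.getText());
                } else if (event == XMLStreamConstants.START_ELEMENT
                        || event == XMLStreamConstants.END_ELEMENT) {
                    // A tag ends the text node: flush whatever was collected.
                    if (current.length() > 0) {
                        texts.add(current.toString());
                        current.setLength(0);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        // prints [Q&A, x < y]
        System.out.println(readWholeTexts("<doc><a>Q&amp;A</a><b>x &lt; y</b></doc>"));
    }
}
```

Note that this buffers one complete text node at a time, so for a truly huge text node you give up the memory advantage that the fragmentation was meant to provide.");
if (!single.equals(java.util.Arrays.asList("Q&A"))) throw new AssertionError("unexpected: " + single);
java.util.List<String> multi = TextAccumulator.readWholeTexts("<doc><a>Q&amp;A</a><b>x &lt; y</b></doc>");
if (!multi.equals(java.util.Arrays.asList("Q&A", "x < y"))) throw new AssertionError("unexpected: " + multi);
</test>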

Now the question remains: can you parse the XML and receive all consecutive text nodes merged into one? It depends on the parser, but in Java you can. Just put the following line after the factory initialization:
      factory.setProperty(XMLInputFactory.IS_COALESCING, true);
So changing the behavior takes no magic. With recent hardware and software it might be better to have coalescing on by default and let users turn it off, although the current default is definitely the fail-safe choice.
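
To see the effect of the property in a complete program, here is a small sketch that reads the same document with coalescing switched off and on and collects the text events it receives. The class and method names (CoalescingDemo, textEvents) are made up for this illustration. How many fragments you get without coalescing depends on the parser implementation, so the sketch does not promise a particular number there; with coalescing enabled, the text arrives as a single event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: lists the text events seen with and without coalescing.
class CoalescingDemo {
    static List<String> textEvents(String xml, boolean coalescing) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The property must be set before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    events.add(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" ?><test>Q&amp;A</test>";
        // Fragment count is implementation-dependent here.
        System.out.println("not coalescing: " + textEvents(xml, false));
        // With coalescing the whole text is one event.
        System.out.println("coalescing:     " + textEvents(xml, true));
    }
}
```

If the element contains nothing but text, XMLStreamReader.getElementText() is another route: according to its Javadoc it returns coalesced content regardless of the IS_COALESCING setting.";
java.util.List<String> merged = CoalescingDemo.textEvents(demoXml, true);
if (!merged.equals(java.util.Arrays.asList("Q&A"))) throw new AssertionError("unexpected: " + merged);
java.util.List<String> pieces = CoalescingDemo.textEvents(demoXml, false);
if (!String.join("", pieces).equals("Q&A")) throw new AssertionError("unexpected: " + pieces);
</test>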

published: 2009-08-29, a:István, y:2009, l:java, l:xml