A few weeks ago a friend asked me
about a problem with XMLStreamReader. We quickly concluded that it
is not an error at all but the nature of XML processing tools, although
it can seem strange the first time you encounter it. The point is
that XML text nodes are not necessarily delivered in one piece:
while you read the XML, you might receive only fragments.
For example, if you have the text "Q&A", which in XML is
escaped as "Q&amp;A", you might end up reading first the
string "Q", then the "&" and finally the "A", instead of reading it
as a whole string. Like the following code:
import java.io.StringReader;
On Sun's Java 6 JVM you will receive just
"Q" in the first round. On the subsequent reads you will receive the
other characters, but if you are not prepared for it, it is just strange. So,
why does this happen?
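Before turning to the answer, here is a minimal sketch that reproduces the situation. It is not the original listing: the class and method names are made up for illustration, and it assumes the StAX implementation bundled with the JDK. It collects the text of every CHARACTERS event separately, so you can see how many pieces your parser delivers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; the name and structure are assumptions,
// not taken from the original post.
class FragmentedTextDemo {

    // Returns the text of each CHARACTERS event as its own list entry,
    // so any fragmentation done by the parser becomes visible.
    static List<String> readTextEvents(String xml) {
        List<String> fragments = new ArrayList<>();
        try {
            // Factory left at its defaults, so coalescing is not requested.
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        fragments.add(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo signature simple.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> fragments = readTextEvents("<root>Q&amp;A</root>");
        // The number of events depends on the parser; the joined text does not.
        System.out.println(fragments.size() + " text event(s): " + fragments);
        System.out.println("joined: " + String.join("", fragments));
    }
}
```

How many events you see may differ between parsers and versions, but joining the fragments should always give back "Q&A".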
XML allows you to have very large files. If
you look, for example, at the wikipedia.org XML dumps, it is not unusual
to have XML files larger than a few GB. There is no limit on how big a
text node can be, so it is the responsibility of the tool to process it
in reasonable chunks. If you load the document into a DOM, you will
get a large tree in memory; if you have much more memory than the
size of the XML itself, there is a good chance that it will fit. However,
for large XML files or for some kinds of processing, you just stream
through the data and do not build a DOM tree.
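The streaming approach can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the task (summing the length of all text content) is chosen just because it needs no tree. The reader walks the events one by one and never holds more than the current chunk.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of the original post.
class StreamingTextLength {

    // Sums the length of all character data without building a DOM tree.
    // Each text chunk is only measured, never stored.
    static long totalTextLength(Reader input) {
        long total = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(input);
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        // Fragmented text nodes simply contribute several chunks.
                        total += reader.getTextLength();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo signature simple.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<doc><p>hello</p><p>xy</p></doc>";
        System.out.println(totalTextLength(new StringReader(xml)));
    }
}
```

For a real multi-gigabyte file you would pass a buffered file reader instead of a StringReader; the loop itself stays the same.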
As in the example above, while you stream
through the XML, you will receive text nodes. These are usually
constrained by the:
While the first one is trivial, the second
and third are lesser-known internals of XML parsers, but from the
memory consumption perspective there seems to be a good reason behind them.
Now the question remains: can you
parse the XML and receive all consecutive text nodes merged into one? It
depends on the parser, but in Java you can: just put the following code
after the factory initialization:
factory.setProperty(XMLInputFactory.IS_COALESCING, true);
So changing the behavior takes no magic.
With recent hardware and software it might be better to
have coalescing turned on by default, with an option to turn it off, although
the current default is definitely the fail-safe choice.
published: 2009-08-29, a:István, y:2009, l:java, l:xml