XML Parsing
Because XML is so widely adopted, it is critical that XML documents be parsed efficiently. This matters most in web applications that must handle large volumes of data. Inefficient parsing increases memory usage and processing time, which directly reduces scalability.
Many XML parsers are available, and choosing the right one for your situation can be challenging. This article describes the three most popular XML parsing techniques for Java and helps you choose the right method based on your application's requirements.
An Extensible Markup Language (XML) parser takes a raw serialized string as input and performs a series of operations on it. First, the XML data is checked for syntax errors and well-formedness: every start tag must have a matching end tag, and elements must not overlap. Many parsers can also validate the document against a Document Type Definition (DTD) or an XML Schema to verify that its structure and content are as specified. Finally, the parser gives the application access to the document's content through its programming APIs.
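The well-formedness check described above can be sketched with the JDK's built-in JAXP parser. This is a minimal illustration, not a definitive implementation: the class name WellFormedCheck, the method isWellFormed, and the sample strings are hypothetical, and the XML is read from a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the string parses as well-formed XML (hypothetical helper). */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. mismatched or overlapping tags.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not well-formedness errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Matching start and end tags.
        System.out.println(isWellFormed("<catalog><title>Java</title></catalog>"));
        // Overlapping elements: </catalog> closes before </title>.
        System.out.println(isWellFormed("<catalog><title>Java</catalog></title>"));
    }
}
```

The first call should report a well-formed document and the second should not, because its elements overlap.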
The three popular XML parsing techniques for Java are the Document Object Model (DOM), a mature standard from the W3C; the Simple API for XML (SAX), the first widely adopted API for XML in Java, which has become a de facto standard; and the Streaming API for XML (StAX), a newer parsing model that is efficient and promising. Each technique has its advantages and disadvantages.
Parsing with DOM
The Document Object Model (DOM) is a tree-based parsing technique that builds the entire parse tree in memory. This gives the application dynamic access to the whole XML document.
The DOM is a tree-like structure. The document node is the root of the DOM tree, and it has at least one child node: the root element, which is the catalog element in the sample code. Another node is the DocumentType node, which holds the Document Type Definition (DTD) declaration. The catalog element has child nodes, and these child nodes are themselves elements.
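To make the tree shape concrete, the following sketch lists the direct children of the document node. The class name DomTopLevel, the method describeTopLevel, and the small catalog document with its inline DTD are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTopLevel {
    /** Lists the kind and name of each direct child of the document node. */
    static List<String> describeTopLevel(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            String kind;
            if (n.getNodeType() == Node.DOCUMENT_TYPE_NODE) {
                kind = "doctype";
            } else if (n.getNodeType() == Node.ELEMENT_NODE) {
                kind = "element";
            } else {
                kind = "other";
            }
            out.add(kind + ":" + n.getNodeName());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document with an internal DTD subset.
        String xml = "<!DOCTYPE catalog [<!ELEMENT catalog (journal*)>"
                + "<!ELEMENT journal (#PCDATA)>]>"
                + "<catalog><journal>J</journal></catalog>";
        System.out.println(describeTopLevel(xml));
    }
}
```

For this input, the document node should have two children: the DocumentType node and the catalog root element.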
The DOM program takes the XML filename and creates the DOM tree. It calls getElementsByTagName() to find all the DOM element nodes that are title elements, and then prints the text associated with each of them. It does this by iterating through the list of title elements and examining the first child of each one, obtained with the getFirstChild() method. That first child is the node holding the text located between the element's start and end tags.
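A program along the lines just described might look like the sketch below. It is an approximation under stated assumptions: the class name DomTitles and the sample catalog are hypothetical, and the XML is parsed from a string rather than a file so that the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTitles {
    /** Builds the DOM tree and returns the text of every title element. */
    static List<String> titles(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Find every element named "title", at any depth.
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // The first child is the text node between <title> and </title>.
            Node text = nodes.item(i).getFirstChild();
            if (text != null) {
                result.add(text.getNodeValue());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        String xml = "<catalog>"
                + "<journal><title>XML Basics</title></journal>"
                + "<journal><title>Java Parsers</title></journal>"
                + "</catalog>";
        for (String t : titles(xml)) {
            System.out.println(t);
        }
    }
}
```

To read from a file instead, DocumentBuilder.parse also accepts a java.io.File.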
The DOM is a direct and straightforward model. Because the entire tree is held in memory, the XML document can be accessed randomly at any time. The DOM APIs can also modify the tree, for example by appending a child, or by updating or removing a node. Although the DOM offers rich support for navigating the in-memory tree, it has parsing costs that must be considered. The entire document must be parsed in one pass; it cannot be parsed partially or in stages. If the XML document is large, building the entire tree in memory is expensive, and the DOM tree can consume a lot of memory. Interoperability is the DOM's biggest strength, but it comes at a price: the DOM is not well suited to object binding.
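The in-memory editing mentioned above can be illustrated with a short sketch that appends one node and removes another. The class name DomEdit, the method editTitles, and the sample input are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomEdit {
    /** Appends a new title, removes the root's first child, returns remaining titles. */
    static List<String> editTitles(String xml, String newTitle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Append a new <title> element as the last child of the root.
        Element added = doc.createElement("title");
        added.setTextContent(newTitle);
        root.appendChild(added);

        // Remove the root's first child (the first <title> in the sample input).
        root.removeChild(root.getFirstChild());

        List<String> titles = new ArrayList<>();
        NodeList nodes = root.getElementsByTagName("title");
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input with no whitespace between elements.
        String xml = "<catalog><title>A</title><title>B</title></catalog>";
        System.out.println(editTitles(xml, "C"));
    }
}
```

Both edits work on the tree already held in memory, which is the convenience that the memory cost pays for.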
Many applications are well suited to DOM parsing. DOM is appropriate when the application needs immediate, random access to the XML document. For example, an Extensible Stylesheet Language (XSL) processor has to navigate through an entire file repeatedly while it processes templates. Because the DOM can be updated dynamically, it is also convenient for applications that modify data frequently, such as XML editors.
Parsing with SAX
The SAX processing model is an event-driven model for processing XML documents, based on a stream of events. Although it is not a W3C standard, it is a very well-known API that many parsers implement in a compliant way. Unlike a DOM parser, which builds an entire tree to represent the data, a SAX parser streams a series of events while it reads the document. These events are forwarded to event handlers, which also provide access to the document's data. There are three basic types of event handlers: DTDHandler, for accessing the contents of XML DTDs; ErrorHandler, for low-level access to errors raised during parsing; and ContentHandler, for accessing the content of the document.
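As a rough illustration of the push model, the sketch below registers a handler that collects the text of title elements as events arrive. The class names SaxTitles and TitleHandler and the sample catalog are assumptions; DefaultHandler is the standard SAX convenience class that implements ContentHandler, DTDHandler, and ErrorHandler with empty defaults.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitles {
    /** Collects the text of each title element from the pushed events. */
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> titles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        String xml = "<catalog>"
                + "<journal><title>XML Basics</title></journal>"
                + "<journal><title>Java Parsers</title></journal>"
                + "</catalog>";
        System.out.println(titles(xml));
    }
}
```

No tree is built here: the only state kept is the text of the title currently being read.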
Compared with DOM, the SAX parser offers performance benefits: it provides efficient, low-level access to the contents of an XML document. The major advantage of the SAX model is its low memory consumption. Because the entire document does not need to be loaded into memory at once, a SAX parser can parse a document larger than the system's memory. In addition, unlike DOM, you do not need to create an object for every node. Finally, the SAX "push" model can be used in a broadcast context, in which multiple content handlers are registered and receive events in parallel rather than one by one in a pipeline.
One disadvantage of SAX is that you must implement event handlers for all the incoming events you care about, and you must maintain the state across those events in application code. Because SAX delivers events rather than a navigable element tree like DOM's, you also have to keep track of the parser's position in the document hierarchy. The application logic therefore grows harder as the document becomes larger and more complex. And although the whole document need not be loaded into memory, a SAX parser still has to parse the whole document, just as DOM does.
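The bookkeeping described above can be sketched as a handler that keeps its own stack of open elements. The class names SaxPath and PathHandler and the sample input are hypothetical; the point is only that the application, not the parser, remembers where it is in the hierarchy.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPath {
    /** Records the path from the root to every element that is opened. */
    static class PathHandler extends DefaultHandler {
        private final Deque<String> open = new ArrayDeque<>();
        final List<String> paths = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            open.addLast(qName); // entering an element
            paths.add("/" + String.join("/", open));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.removeLast(); // leaving the innermost open element
        }
    }

    static List<String> paths(String xml) throws Exception {
        PathHandler handler = new PathHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        String xml = "<catalog><journal><title>X</title></journal></catalog>";
        for (String p : paths(xml)) {
            System.out.println(p);
        }
    }
}
```

Every handler that needs context has to carry state like this, which is why SAX logic grows with document complexity.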
One of the biggest problems with SAX is that it lacks built-in support for document navigation, such as that provided by XPath. In addition, its one-pass parsing limits random access. These limitations also complicate the handling of namespaces. Together, these shortcomings make SAX a poor choice for manipulating or modifying an XML document.
Applications that can read a document's content in a single pass benefit greatly from SAX parsing. Many business-to-business (B2B) portals and applications use XML to encapsulate data in a format that is simple to send and retrieve. This is the scenario in which SAX can win hands down over DOM, purely because of its efficiency, which results in high throughput. SAX 2.0 also has a built-in filtering mechanism that makes it easy to output a subset of a document. SAX parsing is also very useful for validating documents against DTDs and XML schemas.
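The filtering mechanism in SAX 2.0 is exposed through the XMLFilter interface and its helper class XMLFilterImpl. The sketch below drops one kind of element before the events reach the next handler. The class names SaxFilterDemo and SkipFilter, the method elementsAfterFilter, and the sample book document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFilterDemo {
    /** Drops every element with the given name, together with its subtree. */
    static class SkipFilter extends XMLFilterImpl {
        private final String skipped;
        private int depth = 0; // > 0 while inside a skipped subtree

        SkipFilter(XMLReader parent, String skipped) {
            super(parent);
            this.skipped = skipped;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || skipped.equals(qName)) {
                depth++;
                return; // swallow the event
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (depth == 0) {
                super.characters(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (depth > 0) {
                depth--;
                return;
            }
            super.endElement(uri, localName, qName);
        }
    }

    /** Returns the element names that survive the filter. */
    static List<String> elementsAfterFilter(String xml, String skipped)
            throws Exception {
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SkipFilter filter = new SkipFilter(reader, skipped);
        List<String> seen = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        String xml = "<book><title>T</title><price>9</price></book>";
        System.out.println(elementsAfterFilter(xml, "price"));
    }
}
```

Because a filter is itself an XMLReader, several filters can be chained, each one passing a smaller event stream to the next.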
Parsing with StAX
StAX is a newer parsing technique that is similar to SAX and improves on it. Like SAX, StAX uses an event-driven model. The difference is that SAX uses a push model for event processing, while StAX uses a pull model: instead of relying on callbacks, a StAX parser returns events as the application requests them.
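The pull model can be sketched with the cursor-style XMLStreamReader from javax.xml.stream. This is a minimal example under stated assumptions: the class name StaxTitles and the sample catalog are hypothetical. Note that the application drives the loop and asks for each event, where a SAX application would wait for callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxTitles {
    /** Pulls events one at a time and returns the text of each title element. */
    static List<String> titles(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // the application requests the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Reads the element's text and advances to its end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        String xml = "<catalog>"
                + "<journal><title>XML Basics</title></journal>"
                + "<journal><title>Java Parsers</title></journal>"
                + "</catalog>";
        System.out.println(titles(xml));
    }
}
```

Because the application controls the loop, it can stop reading as soon as it has what it needs.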