XML Processing

XML document processing is described by a large and still growing set of specifications. Many applications depend on these specifications to work with XML, the Extensible Markup Language. They list the requirements for the XML processing model as well as for the XML language itself. They operate mostly at the conceptual level and describe language-based interactions.

XML documents are treated as sets of information items (info sets). The specifications describe processes that construct new info sets, inspect them, modify them, or extract information from pre-existing ones. The processing model has to be described in terms of the info set; the concrete object models that applications work with are not themselves the info set. Applications use DOM object models, SAX event streams, or other representations of the info set.

Requirements of the XML processing model

The language should address interoperability concerns. It should be simple and easy to use for describing an XML processing model. It should be able to specify the input, the output, and all the required parameters of the document. For the sake of interoperability, it should define the mandatory input processing options and the associated error reporting options. It should be capable of specifying the documents and the set of components separately. It should be easy to implement, yet sophisticated enough to allow operations to be optimized.

The XML processing model should be extensible, so that applications can define new functions and use them in the pipeline. The model should provide for error handling and fallback scenarios. It should be able to select different components at run time and should allow conditional processing. Information exchange between components should take place in a standardized way. The language should be able to use XML tools to manipulate the data, so the data should essentially be XML.

Processing XML with Java

An XML document is a tree of objects, and there are standard APIs for representing it that follow the World Wide Web Consortium's Document Object Model (DOM) specifications. In SAX, the same document is represented as a series of events. The standard API for Java XML parsers is JAXP, and JAXP 1.1 was expanded to include an API for XSLT engines as well. This API is called TrAX, the Transformation API for XML. It is very powerful once you understand its usage and its top-level interfaces.

Uses of TrAX

TrAX covers XML transformation, extending the original work of JAXP with a vendor-neutral, standard Java API for specifying and carrying out XML transformations. TrAX is more than an API for XSLT engines; its main use is as a general-purpose interface for transforming XML documents. TrAX does not compete with DOM, JDOM, or SAX. It is an API for representing XML transformations and for bridging these various models, and it works with SAX events and with XSLT templates. TrAX also relies heavily on SAX2, DOM, and their parsers. It provides essentially the same functionality as the underlying XSLT engines, but the parsers can be swapped by changing properties.

In naive code, the XSLT stylesheet is reprocessed for every transformation. A common scenario is that the same transformation is applied repeatedly to different sources, possibly in different threads. A better approach is to process the stylesheet only once and keep the processed copy for the subsequent transformations, which saves a lot of time. The TrAX Templates interface makes this possible.

A Templates instance is the run-time representation of the processed transformation instructions, and a Transformer created from it carries out the actual transformation. To improve throughput and performance, Templates instances can be saved and reused, and they are thread safe. The plural name comes from the fact that an XSLT stylesheet is a collection of one or more template elements. Each transformation rule is defined by a template element within the stylesheet, so Templates was chosen as the simplest name for representing that collection.
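The compile-once, reuse-many pattern described above can be sketched with the standard javax.xml.transform classes. This is a minimal illustration and not a definitive implementation: the class name TemplatesReuseSketch, the sample stylesheet, and the sample input documents are assumptions made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TemplatesReuseSketch {
    // Hypothetical stylesheet, used only for illustration.
    public static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/greeting\">Hello, <xsl:value-of select=\"@name\"/>!</xsl:template>"
      + "</xsl:stylesheet>";

    // Process the stylesheet once. The resulting Templates object can be
    // cached and shared between threads.
    public static Templates compile(String xslt) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        return factory.newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Each transformation gets its own Transformer, which is cheap to create
    // from the Templates object but should not be shared between threads.
    public static String apply(Templates templates, String xml) throws TransformerException {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Templates templates = compile(XSLT);                               // parsed once
        System.out.println(apply(templates, "<greeting name=\"Ada\"/>"));   // Hello, Ada!
        System.out.println(apply(templates, "<greeting name=\"Grace\"/>")); // Hello, Grace!
    }
}
```

The stylesheet is parsed a single time in compile, while apply can be called as often as needed, including from several threads that share the same Templates object.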

XML Processing in Python

SAX (the Simple API for XML) and DOM (the Document Object Model) are the two basic and popular ways of working with XML. SAX reads the XML a little at a time and calls a method whenever it finds an element. This is similar to event-driven HTML parsing, which also reports elements as it encounters them in the document. DOM reads the entire document first, then builds references throughout it using Python classes and links them into a tree structure. The drawback is that for a huge XML document, DOM spends a lot of time scanning the whole document and creating references, and it needs a lot of memory to hold the resulting tree. Python ships with its own standard modules for parsing XML documents.

Parsing XML using DOM level 2

DOM represents all the data in an XML document as a tree structure. This tree can be easily manipulated from Java, because DOM is designed to be simple for programs to use. You can take advantage of this to modify data and to extract data from the tree when needed. However, DOM parses the whole document, not just parts of it as SAX does. If you do not need the entire document, parsing all of it wastes time, effort, and memory. When you have a large XML document and need only a small portion of it, SAX makes more sense. Parsing XML data with DOM involves two major tasks: converting the XML data into a DOM tree, and then looking through the tree for the data that is useful to you. In Java, you can specify which parser to use; if no parser is specified, the Apache Xerces parser is used.
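The two DOM tasks, building the tree and then looking through it, might look roughly like this with the standard JAXP classes. The class name DomParseSketch, the helper names, and the sample library document are assumptions introduced for illustration; the parser used is whatever default JAXP selects.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomParseSketch {
    // Task 1: convert the XML text into a DOM tree (the whole document is parsed).
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Task 2: look through the tree for the data of interest,
    // here the text of the first <title> element (a hypothetical element name).
    public static String firstTitle(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        Node text = titles.item(0).getFirstChild();
        return text == null ? "" : text.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book id=\"b1\"><title>XML Basics</title></book></library>";
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());   // library
        System.out.println(firstTitle(doc));     // XML Basics

        // The in-memory tree can also be modified.
        root.setAttribute("checked", "true");
        System.out.println(root.getAttribute("checked")); // true
    }
}
```

Even though only one title is read, the whole document is held in memory, which is the cost the paragraph above warns about for large inputs.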

Parsing in SAX

Like DOM parsing, SAX parsing involves two major tasks: creating a content handler, and invoking the parser with that handler. A few steps have to be followed along the way. Tell the system which parser to use, create an instance of the parser, and create a content handler that will respond to the parser's events. The handler should cover the start and end of the document and the start and end of each element, and it should be clear about how characters and ignorable white space are treated. Finally, invoke the parser with the designated content handler. If this last step is skipped, no SAX processing takes place at all.
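As a rough sketch of these steps, the handler below simply records every callback it receives. The names SaxEventsSketch and RecordingHandler and the sample input are assumptions for illustration; the parser is the JAXP default, and DefaultHandler is used as a convenient base class for the content handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {
    // Content handler that records each event the parser reports.
    public static class RecordingHandler extends DefaultHandler {
        public final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks; whitespace-only chunks are skipped here.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Typically reported only when a DTD tells the parser which whitespace is ignorable.
        }
    }

    public static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser(); // create the parser
        RecordingHandler handler = new RecordingHandler();                // create the handler
        parser.parse(new InputSource(new StringReader(xml)), handler);    // invoke with the handler
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<a><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```

Without the final parser.parse call, none of the handler methods would ever run.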

The startElement method is called when the parser finds a start tag in the document. If the element name is missing from the tag, no start element is reported and the document is not recognized, so this is the first place to check when parsing errors occur. The endElement method is called for each end tag; in a handler that prints an indented outline, it subtracts two from the indentation and then prints a message. The characters method prints the first word of the tag body and does not change the indentation.
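The indentation bookkeeping described above could be sketched as follows. The class names, the message wording, and the sample input are assumptions for this example, and adding two to the indentation in startElement is assumed as the natural counterpart of the subtraction in endElement.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxOutlineSketch {
    // Handler that builds an indented outline of the document.
    public static class OutlineHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private int indentation = 0;

        // Append one line at the current indentation level.
        private void line(String message) {
            for (int i = 0; i < indentation; i++) {
                out.append(' ');
            }
            out.append(message).append('\n');
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            line("Start tag: " + qName);
            indentation += 2;            // children are indented two more spaces
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            indentation -= 2;            // subtract two, then print the message
            line("End tag: " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String body = new String(ch, start, length).trim();
            if (!body.isEmpty()) {
                line(body.split("\\s+")[0]);   // first word only; indentation unchanged
            }
        }

        public String result() {
            return out.toString();
        }
    }

    public static String outline(String xml) throws Exception {
        OutlineHandler handler = new OutlineHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.result();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<a><b>hello world</b></a>"));
    }
}
```

For the sample input, each nested start tag appears two spaces further in than its parent, and each end tag lines up with its matching start tag.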
