... in Java by Richard G Baldwin

Java API for XML Processing (JAXP), Getting Started

Baldwin begins by reviewing the salient aspects of XML. Then he shows you how to (a) use JAXP, DOM, and an input XML file to create a Document object that represents the XML file, (b) recursively traverse the DOM tree, getting information about each node in the tree along the way, and (c) use the information about the nodes to create a new XML file that represents the Document object.

Published: October 21, 2003
By Richard G. Baldwin

Java Programming Notes # 2200

Preface
General Background Information on XML
Preview
Discussion and Sample Code
Run the Program
Summary
What's Next?
Complete Program Listings

Preface

What is JAXP?

As the name implies, the Java API for XML Processing (JAXP) is an API provided by Sun designed to help you write programs for processing XML documents. JAXP is very important for many reasons, not the least of which is the fact that it is a critical part of the Java Web Services Developer Pack (Java WSDP). According to Sun,

"The Java WSDP is an all-in-one download containing key technologies to simplify building of Web services using the Java 2 Platform."

This is the first lesson in a series designed to initially help you understand how to use JAXP, and to eventually help you understand how to use the Java WSDP.

What is XML?

If you have been around Information Technology (IT) for the past several years, it is doubtful that you have escaped hearing about the eXtensible Markup Language (XML). However, if you are like many of the professional programmers who enroll in my Java courses, you may not yet know much about XML.

I will not attempt to teach XML in this series of tutorial lessons. Rather, I will assume that you already understand XML. I will teach you how to use JAXP to write programs for creating and processing XML documents.

Regarding XML, let me simply refer you to numerous tutorial lessons on XML that I have previously published at Gamelan.com and www.DickBaldwin.com. However, as a convenience to you, I will review many of the salient aspects of XML later in this document under General Background Information on XML.

Viewing tip

You may find it useful to open another copy of this lesson in a separate browser window. That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive collection of online Java tutorials. You will find those lessons published at Gamelan.com. However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there. You will find a consolidated index at www.DickBaldwin.com.

General Background Information on XML

Purpose of this section

As mentioned earlier, the purpose of this section is to review the salient aspects of XML for those who are unfamiliar with the topic. Those of you who already know about XML can skip ahead to the Preview section. Those of you who are just getting your feet wet in XML (and may have found the XML water to be a little deep) should continue reading this section.

Jargon

Computer people are the world's worst at inventing new jargon. XML people seem to be the worst of the worst in this regard.

Go to an XML convention and everything that you hear will be X-this, X-that, X-everything. Sometimes I get dizzy just trying to keep the various X's separated from one another. In this explanation of XML, I will try either to avoid jargon or to explain each term the first time I use it.

So, just what is XML?

There are many definitions and descriptions for XML. I like the one given in Figure 1.

XML gives us a way to create and maintain structured documents in plain text that can be rendered in a variety of different ways.

A primary objective of XML is to separate content from presentation.

Figure 1

What do I mean by a "structured document?"

I will answer this question by providing an example. A book is typically a structured document.

In its simplest form, a book may be composed of chapters. The chapters may be composed of sections. The sections may contain illustrations and tables. The tables are composed of rows and columns.

Thus, it should be possible to draw a hierarchical diagram that illustrates the structure of a book, and most people who are familiar with books will probably recognize it as such.

What do I mean by "plain text?"

Characters such as the letters of the alphabet and punctuation marks are represented in the computer by numeric values, similar to a simple substitution code that a child might devise.

ASCII is an encoding scheme

For example, in one popular encoding scheme (ASCII), the upper-case version of the character "A" is represented by the numeric value 65, a "B" is represented by the value 66, a "C" is represented by 67, etc.

The actual correspondence between the characters and the specific numeric values representing the characters has been described by several different encoding schemes over the years.
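
If you would like to confirm these values for yourself, a few lines of Java will do it. The following is only a minimal sketch. The class name AsciiDemo and the method name codes are my own inventions for illustration, and I am assuming a Java version that provides java.nio.charset.StandardCharsets (Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class AsciiDemo {

    // Returns the ASCII code of each character in the text, separated by spaces.
    static String codes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(bytes[i]); // the byte is appended as a decimal number
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codes("ABC")); // prints 65 66 67
    }
}
```

The same method works for any text made up of ASCII characters, including punctuation marks.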

Where the name ASCII comes from

One of the most common and enduring schemes for encoding characters was devised many years ago by a committee of the American Standards Association (the organization now known as ANSI).

The scheme is named the American Standard Code for Information Interchange, and is commonly known by its initials as the ASCII code.

Figure 2 contains a quotation from one author regarding the ASCII code (or plain text). (Note that the quotation expands the acronym somewhat differently from the official name given above.)

"This stands for American Standards Committee on Information Interchange. What it means in practice is plain text, that is to say text which is readable directly without using any special software. The advantage of ASCII is that it is a lowest common denominator which can be displayed on any platform. The disadvantage is that it is rather limited and somewhat boring. The text cannot display bold, italics or underlined fonts, and there is no scope for graphics or hypertext. However, it is simple, ... and is almost idiot-proof as a means of information exchange. To see a short example of ASCII click HERE, or to see a journal article in ASCII click HERE."

Figure 2

XML is not confined to the ASCII code

XML is not confined to the use of the ASCII encoding scheme. Several different encoding schemes can be used.

However, all of them have been selected to make it possible to read a raw XML document without the requirement for any special software (other than perhaps a text editor or the DOS type command).

What is a raw XML document?

A raw XML document is the string of sequential characters that makes up the document, before any specific rendering has been applied to the document.

What do I mean by "rendering?"

In modern computer jargon, rendering typically means to present something for human consumption.

Rendering a drawing

For example, in a computer, drawings and images are nothing more or less than sets of numbers and possibly formulas. Those numbers and formulas, taken at face value, usually mean very little to most human observers.

Recognition by a human observer

When we speak of rendering a drawing or an image, we usually mean that we are going to present it in a way that makes it look like a drawing or an image to a human observer. In other words, we convert the numbers and formulas that comprise the drawing to a set of colored dots (pixels) that a human observer will recognize as a drawing.

Rendering a document

When we speak of rendering a document, we usually mean that we are going to present it in a way that a human will recognize it as a book, a newspaper, or some other document style that can be read by the human observer.

Passing information through typography

Rendering, in this case, often means to present some of the material in boldface, some of the material in Italics, some of the material underlined, some of the material in color, etc. (For example, when you view this document using an HTML browser, it is rendered to show boldface, Italics, color, etc.)

To separate presentation from content

Raw XML doesn't exhibit any of these presentation properties, such as boldface, Italics, or color. Remember, a main objective of XML is to separate presentation from content. XML provides only the content. The presentation of that content must come from somewhere else.

Consider a newspaper

These days, there are at least two different ways to render a newspaper. One way is to print the information (daily news), mostly in black and white, on large sheets of low-grade paper commonly known as newsprint. This is the rendering format that ends up on my driveway each morning.

My online newspaper

Another way to render a newspaper is to present the information on a computer screen, usually in full color, with the information content trying to fight its way through dozens of animated advertisements.

USA Today

For example, here is the sort of rendering format that USA Today provides for the online version of its newspaper. Most of you are probably already familiar with the newsprint rendering of that well known newspaper.

The news doesn't change

The base information for the newspaper doesn't (or shouldn't) change for the newsprint and online renderings. After all, news is news and the content of the news shouldn't depend on how it is presented. What does change is the manner in which that information is presented.

A newspaper is a structured document consisting of pages, columns, etc., which could be maintained using XML.

The great promise of XML

When the information content of a newspaper is created and maintained in XML, that same information content can be rendered on newsprint paper, on your computer screen, on your cell phone, or potentially in other formats without having to rewrite the information content.

Not necessarily boring

If you visit the above link to the journal article rendered solely in ASCII, you will probably agree that from a presentation viewpoint it is pretty boring (no offense intended to the author of the article).

However, XML documents created and maintained in plain text need not necessarily be boring.

When you combine a rendering engine with XML ...

It is possible to apply a rendering engine (such as XSL) to the XML content and to render that content in rich and exciting ways. (XSL is an acronym for the Extensible Stylesheet Language, and is an advanced topic that I will be covering in future lessons in this series.)

Separating content from presentation

XML is responsible for maintaining the content, independent of presentation.

A rendering engine, such as XSL, is responsible for rendering that content in ways that are appropriate for the application.

Achieving Structure

So, just how does XML use plain text to create and maintain structure?

Consider the following simple structure that represents a book. (This book certainly wasn't written by me, because it is much too brief.)

The book described by the structure in Figure 3 has two chapters with some text in each chapter.

Begin Book

  Begin Chapter 1
    Text for Chapter 1
  End Chapter 1

  Begin Chapter 2
    Text for Chapter 2
  End Chapter 2

End Book

Figure 3

A simple example

A real book obviously has a lot more structure than this. For example, a typical book will probably have a Foreword or a Preface. A typical book will usually have a Table of Contents.

Breaking the structure down further produces paragraphs within the text, words within the paragraphs, etc. Also, a book will frequently have an Alphabetical Index.

However, I am trying to keep this example as simple as possible, so I left those things out.

A primary objective

In the earlier description, I told you "A primary objective of XML is to separate content from presentation."

This separation, and the fact that the XML document is maintained in plain text, makes it possible to share the same physical document among different computers in a way that they all understand. (This is often not true, for example, for documents that are maintained in the proprietary formats of word processing software.)

Many different computers and operating systems

Sharing of a document among different computers is no small accomplishment. Over the years, dozens of different types of computers have been built, operating under different operating systems, and running thousands of different programs.

As a result, the modern computer world is often like being on an island where everyone speaks a different language.

A common language for structured documents

XML attempts to rectify this situation by providing a common language for structured documents.

What Does XML Contribute?

I am going to ease into the technical details later. At this point, suffice it to say that XML provides a definition of a simple scheme by which the structure and the content of a document can be established.

Even I can understand an XML document

The resulting physical document is so simple that any computer (and most humans) can read it with only a modest amount of preparation.

You will sometimes see XML referred to as a "meta" language.

What does meta mean?

In computer jargon, the term meta is often used to identify something that provides information about other information.

Stock and bond price information

For example, consider the listings of stock prices, bond prices, and mutual fund prices that commonly appear in most daily newspapers.

The various tables on the page provide information about the bid and ask prices for the various stock, bond, and mutual fund instruments.

What you need is meta information

But, how do you read those charts? How do you extract information from the charts? You need some information about the information contained in the charts. You need some meta information.

Stock and bond meta information

Usually somewhere on the page, you will find an explanation as to how to interpret the information presented throughout the remainder of the page.

You could probably think of the information contained in the explanation as meta information. It provides information about other information.

What about the alphabetical index in a book?

Is the alphabetical index of a book a form of meta information? Probably so.

For example, the alphabetical index can tell you if the book contains information about XML or other topics of interest to you. If so, it will tell you where in the book you can find that information.

The index can also tell you where to find information about elements and attributes that I will discuss later. So, yes, in my opinion, the alphabetical index in a book provides meta information.

So, why might people refer to XML as a meta language?

If you write a book and maintain its content in XML, XML doesn't tell you how to structure the document that represents your book.

XML provides a set of rules for structuring

Rather, XML provides you with a set of rules that you can use to establish your own structure and content when you create the document that represents your book.

XML is not the language that you use to establish the structure and content of your book. Rather, XML tells you how to create your own language for creating structure and maintaining content.

It is up to you to decide how you will use those rules to define your own language for establishing the structure and content of your book.

Invent your own language

You might say that XML is a language that provides information about a new language that you are free to invent.

Does everyone use a different language?

As it turns out, different groups of people having common interests have gotten together and have used XML to invent a common language by which the persons in the group can create, maintain, and exchange document structure in their areas of interest.

The Chemical Markup Language

For example, a group of chemists has gotten together and has used the XML rules to invent a common language by which they create and exchange structured documents on chemistry.

MathML

Similarly, a group of mathematicians has gotten together and has invented a common language by which they create and exchange structured documents on mathematics.

XML is easily transported

If you follow the rules for creating an XML document, the document that you create can easily be transported among various computers and rendered in a variety of different ways.

Two different renderings

For example, you might want to have two different renderings of your book. One rendering might be in conventional printed format and the other rendering might be in an online format.

No requirement to modify the XML source document

The use of XML makes it practical to render your book in two or more different ways without any requirement to modify the original document that you produce.

Because XML lets you extend it by inventing a language of your own, it is named the eXtensible Markup Language, or XML.

Applying XML

Now let's look at a couple of sample XML documents, either of which might reasonably represent the simple book presented earlier.

The first sample XML document is shown in Listing 1.

<?xml version="1.0"?>
<book>
<chap>
Text for Chapter 1
</chap>

<chap>
Text for Chapter 2
</chap>
</book>

Listing 1

This example shows typical XML syntax.

Compare with earlier book description

If you compare this example with the informal book example given earlier in Figure 3, you should see a one-to-one correspondence between the "elements" in this XML document and the informal description of the book presented earlier.
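
That correspondence can be demonstrated with a short program. The following is a minimal sketch, under my own assumptions, that uses the JAXP DocumentBuilder to parse the text of Listing 1 and then walks the resulting tree to print a Begin/End outline in the style of Figure 3. The class name OutlineDemo and the method names outline and walk are hypothetical, and the XML is supplied as a string rather than being read from a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
class OutlineDemo {

    // The XML document from Listing 1.
    static final String LISTING_1 =
        "<?xml version=\"1.0\"?>\n"
        + "<book>\n"
        + "<chap>\nText for Chapter 1\n</chap>\n"
        + "\n"
        + "<chap>\nText for Chapter 2\n</chap>\n"
        + "</book>\n";

    // Parses the XML text and returns a Begin/End outline like Figure 3.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    // Recursively visits a node and its children, indenting two spaces per level.
    private static void walk(Node node, String indent, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("Begin ").append(node.getNodeName()).append('\n');
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, indent + "  ", out);
            }
            out.append(indent).append("End ").append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            // Skip the whitespace-only text that separates the elements.
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append(text).append('\n');
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(LISTING_1));
    }
}
```

The output has one Begin line and one End line for the book element and for each chap element, with the chapter text indented between them.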

An improved example

Listing 2 shows a modest improvement over the XML code in Listing 1, by including an "attribute" named number in each of the chapter elements. This attribute contains the chapter number and is part of the information that defines the structure of the book.

<?xml version="1.0"?>
<book>
<chap number="1">
Text for Chapter 1
</chap>

<chap number="2">
Text for Chapter 2
</chap>
</book>

Listing 2

The book represented by the XML code in Listing 2 has two chapters with some text in each chapter. This XML code contains an attribute that describes the chapter number in each chapter element.

Now consider a new jargon word: tag.

What is a tag?

The common jargon for XML items (such as those shown in Figure 4) enclosed in angle brackets is tag. (You may be familiar with this jargon based on HTML experience.)

<book>

Figure 4

Start tags and end tags

The tag shown in Figure 4 is often referred to as a start tag or a beginning tag.

The tag shown in Figure 5 is often referred to as an end tag.

</book>

Figure 5

The end tag contains a slash

What is the difference between a start tag and an end tag? In this case, the start tag and the end tag differ only in that the end tag contains a slash character.

Sometimes there are other differences

However, the start tag can also contain optional attributes as discussed below. (There is also another form where the start tag and end tag are combined into something often called an empty element.)

What is an element?

It is time to learn the meaning of the jargon element, content, and attribute.

Using widely accepted XML jargon, I will call the sequence of characters in Figure 6 an element.

An element begins with a start tag and ends with an end tag and includes everything in between.

<chap number="1">Text for Chapter 1</chap>

Figure 6

Color coded for clarity

I used artificial color coding in Figure 6 to make it easier to refer to the different parts of the element.

(Note however, that because an XML document is maintained in plain text, the characters in an XML document do not have color properties.)

What is the content?

The characters in between the tags (Text for Chapter 1, rendered in green in Figure 6) constitute the content. (For more information on content, use your browser to search for the word content in The XML FAQ.)

What is an attribute?

The characters number="1" (rendered in blue in Figure 6) constitute an attribute.

To recap so you will remember it

An element consists of a start tag and an end tag with the content being sandwiched in between the two tags. The content is part of the element.

May include optional attributes

The start tag may contain optional attributes. In this example, a single attribute provides the number value for the chapter. The start tag can contain any number of attributes, including none.
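
To make this concrete, here is a minimal sketch of how a Java program might read the number attribute from each chapter element in Listing 2 using the DOM interfaces that JAXP provides. The class name AttributeDemo and the method name chapterNumbers are my own inventions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
class AttributeDemo {

    // The XML document from Listing 2.
    static final String LISTING_2 =
        "<?xml version=\"1.0\"?>\n"
        + "<book>\n"
        + "<chap number=\"1\">\nText for Chapter 1\n</chap>\n"
        + "<chap number=\"2\">\nText for Chapter 2\n</chap>\n"
        + "</book>\n";

    // Returns the value of the number attribute of every chap element.
    static List<String> chapterNumbers(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList chapters = doc.getElementsByTagName("chap");
            List<String> numbers = new ArrayList<String>();
            for (int i = 0; i < chapters.getLength(); i++) {
                Element chapter = (Element) chapters.item(i);
                // getAttribute returns an empty string when the attribute is absent.
                numbers.add(chapter.getAttribute("number"));
            }
            return numbers;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(chapterNumbers(LISTING_2)); // prints [1, 2]
    }
}
```

Because attributes are optional, a program of this sort should be prepared for the empty string that getAttribute returns when a start tag has no number attribute.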

Tell me more about attributes

The term attribute is a commonly used term in computer science and usually has about the same meaning, regardless of whether the discussion revolves around XML, Java programming, or database management.

Attributes belong to things, or things have attributes

A chapter in a book is a thing. A chapter has a number. In this example, the chapter number is an attribute of the chapter element.

An apple has a color, red or green. An apple also has a taste, sweet or sour.

A dog has a size, small, medium, or large.

In the above statements, number, color, taste, and size are attributes. Those attributes have values like red, green, sweet, sour, small, medium, and large.

As you can see, attributes are a very common part of the world in which we live and work.

People have attributes

A person also has attributes, and each attribute has a value.

Figure 7 contains a list of some of the attributes (along with their values) that might be used to describe a person.

name="Joe"
height="84"
weight="176"
complexion="pale"
sex="male"
training="Java programmer"
degree="Masters"

Figure 7

Obviously, there are many more attributes that could be used to describe a person.

The importance of an attribute depends on the context

The decision as to which of many possible attributes are important depends on the context in which the person is being considered.

Attributes for basketball players

For example, if the person is being considered in the context of being a candidate for an all-male basketball team, the height, weight, and sex attributes of a person will probably be important considerations.

Attributes for programmers

On the other hand, if the person is being considered in the context of being a candidate for employment as a programmer, the height, weight, and sex attributes should not be important at all, but the training and degree attributes might be very important.

Why does XML use attributes?

Earlier in this lesson, I suggested that the most common modern use of the word rendering means to present something for human consumption. Usually, but not always, that refers to visual consumption. (My grandmother used to render fat to make soap, but that is not modern usage of the term.)

Multiple renderings for the same document

I gave an example of a newspaper that can either be rendered on newsprint paper, or can be rendered on a computer screen.

What is a rendering engine?

If the newspaper (structured document) is created and maintained as an XML document, then some sort of computer program (often referred to as a rendering engine) will probably be used to render it into the desired presentation format.

What about rendering our book?

Our book could also be rendered in a variety of different ways.

Regardless of how the book is rendered, it will probably be useful to separate and number the chapters.

The value of the number attribute for each chapter element could be used by the rendering engine to present the chapter number for a specific rendering.

Chapter numbers may be rendered differently

In some renderings, the number might appear on an otherwise blank page that begins a new chapter. This is common in printed books, but is not common in online presentations.

In a different rendering, the chapter number might appear in the upper right or left-hand corner of each page.
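
To illustrate the idea, here is a toy "rendering engine" written in Java. It is only a sketch under my own assumptions, and is not how a production rendering engine works. The class name RenderDemo, the method name render, and the two output formats are hypothetical. The point is simply that the same XML content, including the number attribute, can drive two different presentations.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names and output formats are illustrative only.
class RenderDemo {

    static final String BOOK =
        "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Renders each chapter either as a line of plain text or as simple HTML.
    // A real renderer would also escape special characters in the HTML output.
    static String render(String xml, boolean asHtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList chapters = doc.getElementsByTagName("chap");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < chapters.getLength(); i++) {
                Element chapter = (Element) chapters.item(i);
                String number = chapter.getAttribute("number");
                String text = chapter.getTextContent().trim();
                if (asHtml) {
                    out.append("<h2>Chapter ").append(number).append("</h2><p>")
                       .append(text).append("</p>\n");
                } else {
                    out.append("Chapter ").append(number).append(": ")
                       .append(text).append('\n');
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(BOOK, false)); // plain-text rendering
        System.out.print(render(BOOK, true));  // HTML rendering
    }
}
```

Note that the XML text in BOOK is never modified. Only the code that presents it changes from one rendering to the other.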

Separation of content from presentation

To reiterate, one of the most important characteristics of XML (as opposed to HTML) is that XML separates content from presentation.

The XML document contains information about structure and content. It does not contain presentation information (as does HTML).

Presentation of XML requires a rendering engine

The presentation of an XML document requires the use of a rendering engine of some sort to render the XML document in a particular presentation style.

IE 5.0 (and later) contains a rendering engine

As an example of rendering, IE 5.0 (and later versions) contains a rendering engine for XML. When provided with an XML document and no rendering instructions, IE will render the XML document in a default format similar to that shown in Figure 8.

Figure 8 IE Rendering of XML File

This default rendering of an XML document is designed to emphasize the tree structure of an XML document. With the IE default rendering, the nodes in the tree can be collapsed and expanded by clicking the - and + symbols on the left, much as you can collapse and expand the nodes in Windows Explorer (File Manager).

When provided with an XML document and appropriate rendering instructions (such as an XSLT document), IE can transform XML data into HTML data and render it in the browser window in different formats.

What is an XSLT document?

I will have a lot to say about the Extensible Stylesheet Language (XSL), and stylesheet transformations (XSLT) in future lessons.
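
In the meantime, the following minimal sketch may give you a feel for the idea. It uses the standard javax.xml.transform classes to apply a small stylesheet to the book document and produce HTML. The class name XsltDemo, the method name toHtml, and the stylesheet itself are my own inventions for illustration, and the exact whitespace of the output may vary from one XSLT processor to another.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the names and the stylesheet are illustrative only.
class XsltDemo {

    static final String BOOK =
        "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // A small stylesheet that turns each chap element into a heading and a paragraph.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/book\">"
        + "<html><body><xsl:apply-templates select=\"chap\"/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match=\"chap\">"
        + "<h2>Chapter <xsl:value-of select=\"@number\"/></h2>"
        + "<p><xsl:value-of select=\"normalize-space(.)\"/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML text and returns the resulting HTML.
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(BOOK));
    }
}
```

Here the presentation rules live entirely in the stylesheet. A different stylesheet applied to the same XML document would produce a different rendering.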

Attributes may be useful in rendering

Now getting back to attributes, they provide information about XML elements that may be useful to the rendering engine.

If the attribute values for an element are not important in a particular presentation context, the rendering engine for that context will simply ignore them. If they are important in a particular context, the rendering engine will use them.

(The default IE rendering engine makes no use of attributes, but does display them along with the other information in the XML document.)

Elements, content, etc.

So far in this lesson, I have introduced tags, elements, content and attributes. I have discussed tags and attributes in detail. Now let's continue the discussion with particular emphasis on elements and content.

What is meant by content?

You already know about start tags and end tags. You also know that an element consists of a start tag (with optional attributes), an end tag, and the content in between as shown in Figure 9.

<chapter number="1">Content for Chapter 1</chapter>

Figure 9

In Figure 9, the optional attribute (number="1") is colored blue and the content (Content for Chapter 1) is colored green.

(Recall however, that because an XML document is maintained in plain text, the characters in an XML document do not have color properties. I used color in this lesson simply to aid in the explanation.)

Elements can be nested

Elements can be nested inside other elements in the construction of the XML document as shown in Figure 10.

<book>
  <chapter number="1">Content for Chapter 1
  </chapter>
  <chapter number="2">Content for Chapter 2
  </chapter>
</book>

Figure 10

Color coding and indentation

In Figure 10, the tags belonging to the book element are shown in blue while the tags belonging to the chapter elements are shown in green.

I also provided artificial indentation to make it easier to see that two chapter elements are nested inside a single book element.

Indentation is common

Such indentation is common in the presentation of raw XML data for human consumption. For example, the default rendering of an XML document by IE is an indented tree structure as shown in Figure 8.

Identify the elements

The book element consists of its start tag, its end tag, and everything in between (including nested elements), as shown in Figure 11.

<book>
  ...
</book>

Figure 11

Each chapter element consists of its start tag, its end tag, and everything in between, as shown in Figure 12.

  <chapter number="1">
    ...
  </chapter>

Figure 12

Content of the book element

In this case, the two chapter elements form the content of the book element.

So, what is an element?

The element is the fundamental unit of information in an XML document. Most XML processing programs (such as rendering engines) depend on this fundamental unit of information in order to do their job.

An XML document is an element

With the exception of the XML declaration on the first line, the entire XML document is an element. As shown in Listing 2, the document consists of the book element, which is often referred to as the root element.

To be of much use, an XML document must have other elements nested inside the root element. For example, a nested element can define some type of information, such as chapter in our book example. Other possibilities would be table elements and appendix elements.
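
In a Java program that uses the DOM, the root element is available from the getDocumentElement method of the Document object. The following minimal sketch reports the name of the root element and the names of the elements nested directly inside it. The class name RootDemo and the method name describe are hypothetical names of my own choosing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
class RootDemo {

    static final String LISTING_2 =
        "<?xml version=\"1.0\"?>\n"
        + "<book>\n"
        + "<chap number=\"1\">\nText for Chapter 1\n</chap>\n"
        + "<chap number=\"2\">\nText for Chapter 2\n</chap>\n"
        + "</book>\n";

    // Names the root element and the elements nested directly inside it.
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            List<String> nested = new ArrayList<String>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                // Children also include whitespace text nodes, which are skipped here.
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    nested.add(n.getNodeName());
                }
            }
            return root.getNodeName() + " contains " + nested;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(LISTING_2)); // prints book contains [chap, chap]
    }
}
```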

Meta information

Through the use of attributes, the element often defines information about the information provided by the XML document (sometimes referred to as meta information).

In our book example, the number attribute provides the chapter number for each of the chapter elements. In effect, the chapter number is information about the information contained in the chapter.

The content

Sandwiched in between the start tag and the end tag of an element, we find the information (content) that the XML document is designed to convey.

So, what are elements good for?

By using a well-defined structure (based on XML elements) to create and maintain your document, you make it much easier to write computer programs that can be used to render, and otherwise process your document.

Writing programs to process XML documents

At some point, you might want to visit one of my earlier articles entitled "What is SAX, Part 1."

(You will find a link to that article at www.dickbaldwin.com.)

That article describes how to write computer programs (using the Java programming language) that decompose an XML document into its elements for some useful purpose.

In that article, I explain that SAX supports an event-based approach to XML document processing. (If you have a background in event-driven programming in a language such as Java or Visual Basic, you will like the SAX approach.)

Parsing events

An event-based approach reports parsing events (such as the start and end of elements) to the program using callbacks. The program implements and registers event handlers (callback methods) for the different events.

Code in the event handlers is designed to achieve the objective of the program.
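
As a small taste of the approach, here is a minimal sketch of an event handler that simply records the start and end of each element. It relies on the SAXParserFactory and DefaultHandler classes that are part of the standard Java XML APIs. The class name SaxDemo and the method name events are my own inventions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative only.
class SaxDemo {

    static final String BOOK =
        "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Parses the XML text and returns a log of the element events reported.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<String>();

        // The parser calls these methods back as it encounters each tag.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [start book, start chap, end chap, start chap, end chap, end book]
        System.out.println(events(BOOK));
    }
}
```

A useful program would do more than log the events, but the structure would be the same: the parser drives the program by calling the handler methods.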

Not critical to understanding XML

I will have a great deal more to say about processing XML documents using SAX in future lessons. I realize that a discussion of event-driven programming for the processing of XML documents might not be classified as "information for Getting Started with JAXP." It is not even critical for an understanding of XML.

However, it is a good way to illustrate the benefits provided by XML elements. Don't worry too much about SAX at this point. Just keep studying, and at some point in the future, it will fall into place.

What have we learned so far?

So far in this lesson, I have introduced you to tags, elements, content, and attributes. I have discussed tags, attributes, and elements in detail. Now, I will discuss content in detail.

What is content?

Of the four terms mentioned above, content is the easy one. Sandwiched in between the start tag and the end tag of an element, we find the information (content) that the XML document is designed to convey.

This is where we put the information for which the document was created.

An XML newspaper

For example, if the XML document is being used for creation and maintenance of material for a newspaper, the content is the news.

A Java programming textbook

If the XML document is being used for creation and maintenance of a Java programming textbook, the content contains the information about Java programming that we want to present to the student.

Tags, attributes, and elements define structure

The content is the raw information. The tags, attributes, and elements define the structure into which we insert that information.

Why do we need structure?

One of the primary objectives of XML is to separate content from presentation.

If we insert the raw material as content into a structure defined by the tags, elements, and attributes, then that raw material can be presented (rendered) in a variety of ways. It can also be searched in a variety of ways that can produce results that are more meaningful than simple keyword searches.

Same content, different renderings

For example, an XML document can be used to represent a newspaper.

Then that document can be presented as an ordinary hard copy newspaper by printing the content on newsprint in a format defined by the structure. Typically, we would use a rendering engine designed for that purpose.

The same XML document can be used to present the same information in a completely different rendering on a computer screen. Again, we would probably use a rendering engine designed for that purpose.

Rendering engine formats the content

In both cases, the rendering engine would examine the structure defined by the tags, elements, and attributes and would then format and present the news (content) in a format appropriate for the presentation media being used.

What does the future hold for XML?

Obviously, I believe that XML has a very bright future. Otherwise, I wouldn't be making the kind of substantial investment in time and energy that I am making in order to understand XML.

I base this belief on the fact that many large companies, including Microsoft and IBM, have adopted XML as an important part of their future.

XML will grease the skids of electronic commerce

For example, here are some of the things that Simon Phipps, IBM's chief XML and Java evangelist, had to say in his keynote speech at the Software Development East conference a few years ago.

"Because it allows companies to share information with customers or business partners without first negotiating technical details, Extensible Markup Language (XML) will grease the skids of electronic business and become the assumed data format at the end of 2001."

XML provides vendor independence

Phipps went on to say:

"Other successful Internet technologies let people run their systems without having to take into account another company's own computer systems, notably TCP/IP for networking, Java for programming, and Web browsers for content delivery. XML fills the data formatting piece of the puzzle."

"These technologies do not create dependencies. It means you can build solutions that are completely agnostic about the platforms and software that you use."

XML can reduce system costs

In the speech, entitled "Escaping Entropy Death," Phipps noted that users are reaching the point where the cost of simply owning some systems is exceeding the value they provide.

"The key benefit to IT managers that adopt XML and other non-proprietary standards is that they will greatly reduce the cost of maintaining a computer's systems and will allow them to extend existing systems."

"In the next decade, you can't just ask when can you have [a new application]. You also have to ask how much will it cost to own."

No more vendor-imposed standards

According to Phipps:

"The solution, interestingly enough, is not constant innovation. You have to redeem the best of the parts you have and combine them with the best of the future."

Phipps contended that the IT industry has moved on from the era of "vendor-imposed standards."

This is an interesting observation by a representative from IBM. I grew up on computers during an era when IBM was the vendor who imposed the standards.

Some would say that the role of imposing standards has now been assumed by Microsoft (much to the dismay of IBM management).

What about Microsoft and XML?

Microsoft is making a huge investment in XML. As mentioned earlier, Microsoft's IE browser currently supports XML documents, XSL stylesheets, and XSL transforms.

(You can find links to several articles that I have previously written discussing the rendering of XML documents using XSLT at www.DickBaldwin.com.)

In addition, many aspects of Microsoft's latest MS.NET product depend extensively on XML.

The XSL Debugger from Microsoft

XSL is complex (much more complex than XML). Designing an XSL stylesheet, to be used by a rendering engine to properly render an XML document, can be a daunting task.

To help us in that regard, Microsoft has developed an XSL debugger, and has made it freely available for downloading. As of the date of this writing, the debugger can be downloaded from http://www.vbxml.com/xsldebugger/. I will discuss the use of this debugger in future lessons that discuss the creation of XML processing programs using XSLT and JAXP.

Check out XML in MS Word

If you happen to have a copy of Microsoft Word around, use it to create a simple HTML file. Load that file into your HTML browser and view the source. When you do, you will find XML appearing at various locations in the control information created by Word in that HTML document.

What have we learned so far?

So far in this lesson, I have discussed tags, elements, content, and attributes in detail. I have also presented a short sales pitch designed to convince you of the importance of XML.

Now we are ready to move on to a new set of topics: valid documents, well-formed documents, and the DTD.

What is a DTD?

The quotation in Figure 13 was extracted from The XML FAQ.

"A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like

<!ELEMENT item (#pcdata)>

<!ELEMENT list (item)+>

This defines items containing text, and lists containing items.

It's a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used."

Figure 13

DTDs can be very complicated

I included the above quotation to emphasize one important point – DTDs are, or can be, very complicated.

The reality is that the creation of a DTD of any significance is a very complex task.

Don't panic

However, despite their complexity, many of you will never need to worry about having to create DTDs for the following two reasons:

XML does not require the use of a DTD.
Even when it is necessary to use a DTD, someone else may have already created it for you.

Many "standard" DTDs have already been developed and are available for your use without any requirement for you to develop them.

The three amigos

An XML document has two very close friends, one of which is optional.

I'm going to refer to them as three files just so I will have something to call them (but they don't have to be separate physical files).

One file contains the content

One file contains the content of the document (words, pictures, etc.). This is the part that the author wants to expose to the client. This file contains the XML code that I have been discussing up to this point.

This is the file that is composed of elements, having start tags, end tags, attributes, and content. For convenience, the file name often has an extension of xml, although that is not a requirement.

A second file contains the DTD

A second file contains the DTD, which meets the above definition that was extracted from the FAQ. This file is optional.

(Note that a modern alternative to the DTD is often called a schema. A schema, when it is available, serves the same purpose as a DTD, but is often more powerful. I will have more to say about schema in future lessons.)

A third file contains a stylesheet

A third file contains a stylesheet, which establishes how the content (that optionally conforms to the DTD) is to be rendered on the output device for a particular application.

This file defines how the author wants the material to be presented to the client.

Rendering the XML document

Different stylesheets are often used with the same XML data to cause that data to be rendered in different ways. For example, a tag with an attribute of "red" might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet. (It might even be presented as some shade of green according to still another stylesheet, but that wouldn't be a very good design.)

DTD is optional, stylesheet is not

With XML, the DTD is optional but the stylesheet (or some processing mechanism that substitutes for a stylesheet) is required. At least that is true if the XML document is ever to be rendered for the benefit of a client.

Something must provide rendering specifications

Remember, XML separates content from presentation.

There is no presentation information in the XML document itself.

Therefore, rendering specifications must be provided to make it possible to render the content of the XML document in the manner intended by the author.

A stylesheet is typical, but not required

Typically, the rendering specifications are contained in a stylesheet. The stylesheet is used by a rendering engine to render the XML document according to the specifications in the stylesheet.

However, it is possible that the specifications could be hard-coded into a program written specifically for the purpose of rendering the XML document. In that case, a stylesheet might not be required.

Rendering XML with XSL and MS IE

As mentioned earlier, I have published several articles that deal with using IE to render XML using stylesheets written in XSL. You will find links to those articles at www.DickBaldwin.com.

Now back to the DTD.

A DTD can be very complex

Again, according to The XML FAQ,

"... the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally.

To make this work, a DTDless file in effect 'defines' its own markup, informally, by the existence and location of elements where you create them.

But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules."

Figure 14

What does this really mean?

It means that it is possible to create and process an XML document without the requirement for a DTD. A little later, I will discuss this possibility in connection with the term well-formed.

In the meantime...

You don't always have the luxury of avoiding the DTD. In some situations, you may be required to create an XML document that meets specifications that someone else has defined.

Hopefully, a DTD will be available

Ideally, in those cases, the person who defined the specifications has also created a DTD and can provide it to you for your use.

A valid document

Here is a new term -- a valid XML document.

In the normal sense of the word, if something is not valid, that usually means that it is not any good. However, that is not the case for XML.

An invalid XML document can be a good XML document

An invalid XML document can be a perfectly good and useful XML document. A very large percentage of useful XML documents are not valid XML documents.

So, what is a valid XML document?

Drum roll please!!! Without further delay, a valid XML document is one that conforms to an existing DTD in every respect.

For example...

Unless the DTD allows an element with the name "color", an XML document containing an element with that name is not valid according to that DTD (but it might be valid according to some other DTD).

Validity is not a requirement of XML

Many very useful XML documents are not valid, simply because they were not constructed according to an existing DTD.

To make a long story short, validation against a DTD can often be very useful, but may not be required.

A well-formed document

Here is another new term -- a well-formed document.

The concept of being well-formed was introduced as a requirement of XML, to deal with the situation where a DTD is not available (an invalid document).

Again, according to The XML FAQ,

"For example, HTML's <IMG> element is defined as 'EMPTY': it doesn't have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of 'well-formed' has been introduced.

This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous."

Figure 15

What is an HTML <IMG> tag?

Although you may not know anything about the HTML <IMG> tag, you do know about start tags and end tags from previous discussion in this article.

Although HTML is related to XML (a distant cousin that combines content and presentation in the same document), HTML documents are not required to be well-formed.

The quotation in Figure 15 refers to the use of a start tag (<IMG>) in HTML that doesn't require an end tag. If used in that manner in an XML document, the document would not be well-formed.

All XML documents must be well-formed

XML documents need not be valid. However:

All XML documents must be well-formed.

What does it mean to be well-formed?

For a rigorous definition of a well-formed document, see http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed.

From a somewhat less rigorous viewpoint, XML documents must adhere to the following rules to be well-formed.

Every start-tag must have a matching end-tag. All elements that can contain character data must have both start and end tags. (Empty elements have a different requirement, which I will discuss later.)
Tags can't overlap. In other words, all elements must be properly nested. If one element contains another element, the entire second element must be defined inside the start and end tags of the first element.
XML documents can have only one root element.
Element names must obey XML naming conventions.
XML is case sensitive.
XML will keep white space in your text.
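
Several of these rules can be checked simply by handing a document to a parser. The following is a minimal sketch, assuming the DOM parser that ships with the JDK. The class name WellFormedSketch and the method name isWellFormed are hypothetical. A document that breaks a well-formedness rule causes the parse method to throw a SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration.
class WellFormedSketch {

  // Returns true if the parser accepts the text as well-formed XML.
  static boolean isWellFormed(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler still throws on fatal errors, but it keeps
      // the parser from printing its own messages to the console.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false; // a well-formedness rule was broken
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException("Unable to run the parser", e);
    }
  }

  public static void main(String[] args) {
    // Properly nested
    System.out.println(isWellFormed("<a><b></b></a>"));
    // Overlapping tags
    System.out.println(isWellFormed("<a><b></a></b>"));
    // Two root elements
    System.out.println(isWellFormed("<a></a><b></b>"));
    // Case mismatch between start tag and end tag
    System.out.println(isWellFormed("<a></A>"));
  }
}
```

Only the first of the four documents in the main method satisfies the rules, so only the first call returns true.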

What is character data?

Although not rigorously true, for purposes of this discussion, let's just say that the content that we discussed in an earlier section comprises character data.

Other requirements

All attribute values must be in quotes (apostrophes or double quotes). You already know about attributes. I discussed them earlier in this lesson.

You can surround the value with apostrophes (single quotes) if the attribute value contains a double quote. Conversely, an attribute value that is surrounded by double quotes can contain apostrophes.

Dealing with empty elements

We must also deal with empty elements. Empty elements are those that don't contain any character data. You can deal with empty elements by writing them in either of the two ways shown in Figure 16.

<book></book>

<book/>

Figure 16

You will recognize the format of the first line as simply writing a start tag followed immediately by an end tag with nothing in between. The format of the second line in Figure 16 has a slash at the end of the word book.

The second format is preferable

This is the first time in this lesson that I have mentioned the second format, which is actually preferable.

One reason the second format is preferable is that, because of word wrap and other causes, the first format in Figure 16 could end up being converted to that shown in Figure 17.

<book>
</book>

Figure 17

Really not empty

Once this happens, although the element may look empty to you, it really isn't empty. Rather it contains whatever characters are used by that platform to represent a newline character sequence.

Typically a newline is either a carriage return character, a line feed character, or a combination of the two. While these characters are not visible, their presence will cause an element to be not empty.

If an element is supposed to be empty, but it is not really empty, this can cause problems when the XML file is processed.
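
The difference is easy to see by counting the child nodes that a DOM parser produces for each form. This is a minimal sketch, assuming the default (non-validating) parser that ships with the JDK. The class name EmptyElementSketch and the method name childCount are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class EmptyElementSketch {

  // Returns the number of child nodes of the root element.
  static int childCount(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getChildNodes().getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(childCount("<book></book>"));   // prints 0
    System.out.println(childCount("<book/>"));         // prints 0
    // The newline becomes a text node, so the element is not empty.
    System.out.println(childCount("<book>\n</book>")); // prints 1
  }
}
```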

The preferred approach

So, to reiterate, the preferred approach for representing an empty element is as shown by the second line in Figure 16.

Empty element can contain attributes

Note that an empty element can contain one or more attributes inside the start tag, as shown by the example in Figure 18.

<book author="baldwin" price="$9.95" />

Figure 18

Again, note the slash character at the end.

Another rule: No markup characters are allowed

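For a document to be well-formed, it must not have the markup characters < or & in the text data. (Strictly speaking, the XML specification forbids the > character in text data only as part of the sequence ]]>, but it is common practice to treat it the same way as the other two.)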

What is a markup character?

Since the < character represents the beginning of a new tag, if it were included in the text data, it would cause the processor to become confused. Similarly, because the > character represents the end of a tag, inclusion of that character in the text data can also cause problems. The solution to this problem (entities, as described below) also makes it necessary to exclude the & character from the text data.

The solution

If you need for your text to include the < character, the > character, or the & character, you can represent them using &lt; &gt; and &amp; instead. (Note that I purposely omitted the use of a comma in this list of entities to avoid having a comma become confused with the required syntax for an entity, which always begins with an ampersand and always ends with a semicolon.)

Entities

According to the prevailing jargon, these are called entities. You insert them into your text in place of the prohibited characters.

Entities always start with an ampersand character and end with a semicolon. It is that combination of characters that the processor uses to distinguish them from ordinary text.

Other common entities

Although it may not be necessary for well-formedness, it is also common practice to use an entity to represent the quotation mark character (") by the entity &quot;. It is also possible to use an entity to represent many other characters, including characters that don't appear on a standard English-language keyboard.
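
When a parser reads the document, it replaces each entity with the character that it represents. The following minimal sketch shows this using the DOM parser that ships with the JDK. The class name EntitySketch and the method name textOf are hypothetical. The numeric form &#233; is one way to write a character (an accented e) that may not appear on your keyboard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class EntitySketch {

  // Returns the text content of the root element after parsing.
  static String textOf(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  public static void main(String[] args) {
    // prints: a < b & c > d
    System.out.println(
        textOf("<note>a &lt; b &amp; c &gt; d</note>"));
    // prints: "hi"
    System.out.println(textOf("<note>&quot;hi&quot;</note>"));
    // A numeric character reference for an accented e
    System.out.println(textOf("<note>caf&#233;</note>"));
  }
}
```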

Recap of validity and well-formed requirements

Valid XML files are those which have (or refer to) a DTD and which conform to the DTD in all respects.

XML files must be well-formed, but there is no requirement for them to be valid. Therefore, a DTD is not required, in which case validity is impossible to establish.

If XML documents do have or refer to a DTD, they must conform to it, which makes them valid.

Why use a DTD if it is not required?

There are several reasons to use a DTD, in spite of the fact that XML doesn't require one.

Enforcing format specifications

Suppose, for example, that you have been charged with publishing a weekly newsletter, and you intend to produce the newsletter as an XML file.

Suppose also that you occasionally have a guest editor who produces the newsletter on your behalf.

Establish format specifications

You will probably establish a set of format specifications for your newsletter and you will need to publish those specifications for the benefit of the guest editors.

No guarantee of compliance

However, simply publishing a document containing format specifications does not ensure that the guest editors will comply with the specifications.

Use a DTD to enforce format specifications

You can enforce the format specifications by also establishing a DTD that matches the specifications.

Then, if either you or one of your guest editors produces an XML document that doesn't meet the specifications, the XML processor that you use to render your newsletter into its final form will notify you that the document is not valid.
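
The following is a minimal sketch of that kind of enforcement, assuming the validating DOM parser that ships with the JDK. The newsletter DTD, the element names (newsletter, title, and article), and the class name DtdValidationSketch are all hypothetical. For simplicity, the DTD is placed inside the document itself rather than in a separate file. Validity problems are reported to the error method of the registered ErrorHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class name, used only for this illustration.
class DtdValidationSketch {

  // A hypothetical newsletter DTD: one title, then one or more
  // articles, each containing only text.
  static final String DOCTYPE =
      "<!DOCTYPE newsletter [\n"
      + "<!ELEMENT newsletter (title, article+)>\n"
      + "<!ELEMENT title (#PCDATA)>\n"
      + "<!ELEMENT article (#PCDATA)>\n"
      + "]>\n";

  // Returns true if the body conforms to the DTD above.
  static boolean isValid(String body) {
    final boolean[] valid = {true};
    try {
      DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();
      factory.setValidating(true); // request a validating parser
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new ErrorHandler() {
        @Override
        public void warning(SAXParseException e) { }

        @Override
        public void error(SAXParseException e) {
          valid[0] = false; // a validity problem was reported
        }

        @Override
        public void fatalError(SAXParseException e)
            throws SAXException {
          throw e; // the document is not even well-formed
        }
      });
      builder.parse(
          new InputSource(new StringReader(DOCTYPE + body)));
    } catch (SAXException e) {
      return false;
    } catch (Exception e) {
      throw new IllegalStateException("Unable to run the parser", e);
    }
    return valid[0];
  }

  public static void main(String[] args) {
    System.out.println(isValid("<newsletter><title>News</title>"
        + "<article>Story</article></newsletter>"));
    // The required title element is missing.
    System.out.println(isValid(
        "<newsletter><article>Story</article></newsletter>"));
  }
}
```

The first document conforms to the hypothetical DTD and the second does not, because the guest editor left out the required title.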

Improved parser diagnostic data

Another reason that I have found a DTD to be useful is the following.

I am occasionally called upon to write a Java program that will parse and process an XML document in some fashion.

My experience is that the parsers that I have used are much more effective in identifying XML structural problems when the XML document has a DTD than when it doesn't.

By this I mean that often the diagnostic information provided by the parser is more helpful when the XML document has a DTD.

This tends to make it easier to repair the document because a validating parser does a better job of isolating the problem.

More than you wanted to know

And that is probably more than you ever wanted to know about XML. Now it's time to terminate this review of XML and get to the meat of this series of tutorial lessons - using Java JAXP to process XML documents.

Preview

Having taken a very long detour to help the XML newcomers catch up with everyone else, I will now get back on track and begin discussing JAXP.

XML by itself isn't very useful

In reality, an XML document is simply a text document constructed according to a certain set of rules and containing information that the author of the document may want to expose to a client. (The client could be a human, or could be another computer.)

Taken by itself, the XML document isn't worth much, particularly in those cases where the client is a human. To be very useful, the XML document must be combined with a program that is designed to do something useful with that document. In other words, in order for an XML document to be useful to you, you need access to a program that can process that document to your satisfaction.

DOM and SAX

Regardless of the intended result, many XML processing programs begin by applying a software construct called a parser to the XML document. The parser performs several different functions. One important function is quality control. A non-validating parser will test the XML document to confirm that it is well-formed. A validating parser will confirm well-formedness, and will also test the XML document to confirm that it conforms to the specified DTD or schema.

Two of the most common types of parsers are:

A parser based on the Document Object Model, otherwise known as DOM.
A parser based on the Simple API for XML, otherwise known as SAX.

I will have a great deal more to say about DOM and SAX in future lessons. For purposes of this lesson, I need to provide a brief introduction to DOM because I will use a DOM-based parser in the sample program to be discussed later.

Brief introduction to DOM

An XML document can be viewed as a tree structure where the elements constitute the nodes in the tree. Some of the nodes have child nodes and some do not.

(Usually those nodes that have no children are referred to as leaf nodes. This notation is based on the concept of a physical tree where the root subdivides into trunk, limbs, branches, twigs, and finally leaves. However, the leaves don't subdivide. Leaves on a physical tree don't have children.)

An example of a tree structure

Referring back to the XML document in Listing 1, the element named book could be viewed as the root of a tree structure. It has two children, which are the elements named chap. Each of the elements named chap has a child, which is the text shown in Listing 1. The text forms the leaves of this tree.

A tree structure in memory

A DOM parser can be used to create a tree structure in memory, which represents an XML document. In Java, that tree structure is encapsulated in an object of the interface type Document. Document declares numerous methods. Document is also a subinterface of Node, and inherits many method declarations from Node.
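
To make the tree structure concrete, the following minimal sketch parses a small book document and prints one line for each node, indented according to its depth in the tree. The document text is a simplified stand-in and is not a copy of Listing 1. The class name DomTreeSketch and the method name outline are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class DomTreeSketch {

  // Parses the XML and returns an indented outline of the tree.
  static String outline(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      StringBuilder out = new StringBuilder();
      // A Document is also a Node, so it can be the starting point.
      outline(doc, 0, out);
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  // Appends one line for this node, then visits its children.
  static void outline(Node node, int depth, StringBuilder out) {
    for (int i = 0; i < depth; i++) {
      out.append("  ");
    }
    out.append(node.getNodeName());
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(" \"").append(node.getNodeValue()).append('"');
    }
    out.append('\n');
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      outline(child, depth + 1, out);
    }
  }

  public static void main(String[] args) {
    System.out.print(outline(
        "<book><chap>One</chap><chap>Two</chap></book>"));
  }
}
```

The outline begins with a node named #document, followed by the book element, the two chap elements, and a #text leaf under each chap element.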

Many operations are possible

Given an object of type Document, there are many methods that can be invoked on the object to perform a variety of operations. For example, it is possible to move nodes from one location in the tree to another location in the tree, thus rearranging the structure of the XML document represented by the Document object. It is also possible to delete nodes, and to insert new nodes. As you will see in the sample program in this lesson, it is also possible to recursively traverse the tree, extracting information about the nodes along the way.
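
The following minimal sketch illustrates moving, inserting, and deleting nodes using methods declared in the Node and Document interfaces. The class name DomEditSketch, the method name rearrange, and the small book document are hypothetical. None of these operations are used by the sample program in this lesson, which leaves the tree unmodified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class DomEditSketch {

  // Rearranges the children of the root element and returns the
  // text of the remaining children, separated by commas.
  static String rearrange(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      Element root = doc.getDocumentElement();

      // Move: appending a node that is already in the tree removes
      // it from its old position first.
      root.appendChild(root.getFirstChild());

      // Insert: create a new element and put it at the front.
      Element added = doc.createElement("chap");
      added.appendChild(doc.createTextNode("Three"));
      root.insertBefore(added, root.getFirstChild());

      // Delete: remove the last child.
      root.removeChild(root.getLastChild());

      StringBuilder order = new StringBuilder();
      for (Node n = root.getFirstChild(); n != null;
           n = n.getNextSibling()) {
        if (order.length() > 0) {
          order.append(',');
        }
        order.append(n.getTextContent());
      }
      return order.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  public static void main(String[] args) {
    // prints: Three,Two
    System.out.println(rearrange(
        "<book><chap>One</chap><chap>Two</chap></book>"));
  }
}
```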

I will show you ...

In this lesson, I will show you how to:

Use JAXP, DOM, and an input XML file to create a Document object that represents the XML file.
Recursively traverse the DOM tree, getting information about each node in the tree along the way.
Use the information about the nodes to create a new XML file that represents the Document object.

The Document object represents the original XML file and the DOM tree is not modified in this example. The final XML file represents the unmodified Document object, which represents the original XML file. Therefore, the final XML file will be functionally equivalent to the original XML file.

Nothing fancy intended

This sample program is not intended to do anything fancy. Rather, it is intended simply to help you take the first small step into the fascinating world of Java, JAXP, and XML.

Discussion and Sample Code

In total, this sample program consists of a class named Dom02 (in the file Dom02.java), a class named Dom02Writer (in the file Dom02Writer.java), and an XML file named Dom02.xml. I will discuss these files in fragments. Complete listings of the three files are shown beginning with Listing 28 near the end of the lesson.

The XML file named Dom02.xml

I will begin my discussion with the XML file named Dom02.xml. A listing of this file begins in Listing 3.

An XML file always starts with a prolog, which is the part of the XML document that precedes the XML data. The minimal prolog, shown in Listing 3, contains a declaration that identifies the document as an XML document.

(Note that the declaration may also contain additional information that is not included in this simple XML document.)

<?xml version="1.0"?>

Listing 3

The root element

The root element of this XML document is named bookOfPoems. An abbreviated form of the root element (with all of its content removed) is shown in Listing 4.

<bookOfPoems>
...
</bookOfPoems>

Listing 4

Children of the root element

As shown in Listing 5, the root element contains two child elements named poem. (For clarity, I eliminated the content of each of the poem elements in Listing 5.)

<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
...
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
...
</poem>
</bookOfPoems>

Listing 5

Processing instructions and comments

Listing 5 also shows a processing instruction (colored red for identification), and a comment (colored blue for identification).

Comments are (or may be) ignored by XML processors. Processing instructions are intended to provide instructions to XML processors. Depending on the overall design, some XML processors may pay attention to some processing instructions and ignore others. For example, a given XML document may be processed by two or more processors for different purposes. The document may contain different processing instructions for the different XML processors.
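
In a DOM tree, processing instructions and comments show up as nodes of their own, alongside the element nodes. The following minimal sketch lists the kinds of nodes found among the children of the root element. The class name NodeKindSketch and the method name childKinds are hypothetical, and the comment text used in the main method is an assumption made only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class NodeKindSketch {

  // Describes each child of the root element by its node type.
  static String childKinds(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      StringBuilder kinds = new StringBuilder();
      for (Node n = doc.getDocumentElement().getFirstChild();
           n != null; n = n.getNextSibling()) {
        if (kinds.length() > 0) {
          kinds.append(' ');
        }
        switch (n.getNodeType()) {
          case Node.ELEMENT_NODE:
            kinds.append("element:").append(n.getNodeName());
            break;
          case Node.PROCESSING_INSTRUCTION_NODE:
            // The node name of a processing instruction is its target.
            kinds.append("pi:").append(n.getNodeName());
            break;
          case Node.COMMENT_NODE:
            kinds.append("comment:").append(n.getNodeValue());
            break;
          default:
            kinds.append("other");
            break;
        }
      }
      return kinds.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  public static void main(String[] args) {
    // prints: element:poem pi:processor comment:note element:poem
    System.out.println(childKinds(
        "<bookOfPoems><poem/><?processor ProcInstr=\"Dummy\"?>"
        + "<!--note--><poem/></bookOfPoems>"));
  }
}
```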

Attributes of the poem element

Listing 5 also shows that each of the poem elements has two attributes (colored green for identification):

PoemNumber
DumAtr
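
Once a document has been parsed into a DOM tree, attribute values such as these can be read from the element node. The following is a minimal sketch using methods declared in the Element interface. The class name AttributeSketch and the method names attribute and attributeCount are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class AttributeSketch {

  // Parses the XML and returns its root element.
  static Element root(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse the XML", e);
    }
  }

  // Returns the value of the named attribute of the root element,
  // or an empty string if the element has no such attribute.
  static String attribute(String xml, String name) {
    return root(xml).getAttribute(name);
  }

  // Returns the number of attributes on the root element.
  static int attributeCount(String xml) {
    return root(xml).getAttributes().getLength();
  }

  public static void main(String[] args) {
    String poem = "<poem PoemNumber=\"1\" DumAtr=\"dum val\"/>";
    System.out.println(attribute(poem, "PoemNumber")); // prints 1
    System.out.println(attribute(poem, "DumAtr"));     // prints dum val
    System.out.println(attributeCount(poem));          // prints 2
  }
}
```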

Content of the first poem element

Listing 6 shows the content of the first poem element (colored blue for identification).

<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>

Listing 6

As you can see from Listing 6, the content of the first poem element consists of a sequence of four elements named line. The content of each of the line elements is the text that constitutes one line in the poem. When this XML document is converted to a DOM tree, each of the text lines will constitute one leaf node in the tree.

Content of the second poem element

Listing 7 shows the content of the second poem element. There is nothing new here, except for the indication that I could never make a living as a poet.

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>

Listing 7

The entire XML document

Listing 8 shows the entire XML document with the same color coding as above, so that you can identify all the parts, and view them in context:

<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

Listing 8

It is important to note that although I have presented this XML document with different colors to identify the different parts, there is no color in an actual XML document. Recall from the earlier discussion that one of the most important aspects of XML documents is that they exist in plain text, which doesn't include attributes such as boldface, italics, underline, or color. This makes XML documents easily transportable among different kinds of computers and different operating systems.

The class named Dom02

The controlling class for this program is named Dom02. I will discuss this class in fragments. As mentioned earlier, a complete listing of the class is provided in Listing 28 near the end of the lesson.

This class, when executed in conjunction with the class named Dom02Writer:

Creates a Document object using JAXP, DOM, and the input XML file named Dom02.xml.
Traverses the DOM tree, getting information about each element (each node in the tree).
Uses the information describing the nodes to create an output XML file that represents the Document object (and is functionally equivalent to the input XML file).

Why not identical?

By now you may be wondering why I used the weasel words "functionally equivalent" instead of saying that the output XML file is identical to the input XML file. This has to do with the topic of whitespace, which is a fairly complex topic in XML. (I will have much more to say about whitespace in future lessons.)

For now, suffice it to say that much of the whitespace in Listing 8 (newlines, indentation, etc.) was put there for cosmetic reasons. For reasons that I won't attempt to explain in this simple example, some of that cosmetic whitespace is not reflected in the output XML file.

Input and output file names

The names of the input and output XML files are provided to this program by command-line arguments when the program is executed. The name of the input file is the first argument, and the name of the output file is the second argument.

DocumentBuilder and Document objects

The program creates a DOM parser object, of type DocumentBuilder, based on JAXP. This object, along with its parse method, is used to create a Document object (DOM tree) that represents the input XML file.

Traverse the tree

The Document object's reference is passed to the writeXmlFile method of an anonymous object of the Dom02Writer class, which traverses the tree and produces the output XML file representing that tree. As you will see, this is by far the most complex part of the entire operation. (In the next lesson, I will show you how to accomplish the same thing with less complexity.)

Miscellaneous comments about the program

The program was tested using Sun's SDK 1.4.2 under WinXP along with the file named Dom02.xml described above.

No effort was made to provide meaningful information about errors and exceptions. The topic of providing such meaningful information, particularly regarding parsing errors, is fairly complex, and will be addressed in a future lesson.

Import directives

Because the primary purpose of this lesson is to get you started using JAXP, I will highlight the first three import directives, and the classes that they represent, in Listing 9.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;

import java.io.File;

Listing 9

Steps for creating a Document object

As you will see when we get into the code, creating a Document object involves three steps:

Create a DocumentBuilderFactory object
Use the DocumentBuilderFactory object to create a DocumentBuilder object
Use the DocumentBuilder object to create a Document object

Both the DocumentBuilderFactory class and the DocumentBuilder class belong to the javax.xml.parsers package. As of this writing, this package is part of J2SE 1.4.2.

The DocumentBuilderFactory Class

According to Sun, the DocumentBuilderFactory class

"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents."

The DocumentBuilderFactory class extends Object, and defines about fifteen methods, one of which is a static method named newInstance. As is often the case with factory objects, the newInstance method is used to create an object of the class.

The class also defines the newDocumentBuilder instance method, which is used to create objects of the DocumentBuilder class, discussed in the next section.

(Note that although the quotation from Sun in the next section uses the terminology DocumentBuilderFactory.newDocumentBuilder method, the newDocumentBuilder method is an instance method and is not a static or class method.)

The DocumentBuilder Class

According to Sun, the DocumentBuilder class

"Defines the API to obtain DOM Document instances from an XML document. Using this class, an application programmer can obtain a Document from XML.

An instance of this class can be obtained from the DocumentBuilderFactory.newDocumentBuilder method. Once an instance of this class is obtained, XML can be parsed from a variety of input sources. These input sources are InputStreams, Files, URLs, and SAX InputSources."

This class also extends Object, and defines about ten methods, which include several overloaded versions of the parse instance method. When the parse method is invoked and passed an input source containing XML, the method returns a Document object (DOM tree) that represents the XML.

The code in this program will pass the file named Dom02.xml to the parse method, thus producing a DOM tree that represents the XML contained in that file.
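Because the parse method is overloaded, a file is not the only possible input. The following sketch, which assumes a reasonably modern JDK, parses the same XML text once through an InputStream and once through a SAX InputSource. The class name ParseSourcesSketch and its method names are hypothetical and are used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseSourcesSketch {
  static DocumentBuilder newBuilder() throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder();
  }

  // Uses the parse(InputStream) overload.
  static String rootFromStream(String xml) {
    try {
      Document d = newBuilder().parse(new ByteArrayInputStream(
          xml.getBytes(StandardCharsets.UTF_8)));
      return d.getDocumentElement().getNodeName();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Uses the parse(InputSource) overload.
  static String rootFromInputSource(String xml) {
    try {
      Document d = newBuilder().parse(
          new InputSource(new StringReader(xml)));
      return d.getDocumentElement().getNodeName();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<bookOfPoems><poem/></bookOfPoems>";
    System.out.println(rootFromStream(xml));
    System.out.println(rootFromInputSource(xml));
  }
}
```

Both calls should report the same root element name, because the resulting DOM tree depends on the XML content and not on the kind of input source.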

The Document interface

Document is an interface in the org.w3c.dom package, which extends the Node interface belonging to the same package. Thus, when we invoke the parse method described above, the method returns a reference to an object instantiated from a class that implements the Document interface. The reference is returned as type Document, not as the name of the class from which the object was actually instantiated.

(Because Document extends Node, that object could also be treated as type Node when appropriate.)

Don't know and don't care

As is often the case in situations like this, we don't know, and usually don't care about the actual name of the class from which the Document object was instantiated, so long as the class correctly implements the methods declared in Document and Node.
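The following sketch illustrates both points: the reference returned by parse can be assigned to a variable of type Node without a cast, and the concrete class behind it is an implementation detail. The class name DocumentAsNodeSketch and its helpers are hypothetical, and the class name printed by main will vary from one JAXP implementation to another.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentAsNodeSketch {
  // Hypothetical helper: parse XML held in a String.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Treat the Document as a plain Node and ask for its type.
  static boolean isDocumentNode(String xml) {
    Node node = parse(xml); // no cast needed: Document extends Node
    return node.getNodeType() == Node.DOCUMENT_NODE;
  }

  // The DOM defines the node name of a Document as "#document".
  static String documentNodeName(String xml) {
    Node node = parse(xml);
    return node.getNodeName();
  }

  public static void main(String[] args) {
    String xml = "<bookOfPoems/>";
    System.out.println(isDocumentNode(xml));
    System.out.println(documentNodeName(xml));
    // Implementation-specific; we normally don't care what this is.
    System.out.println(parse(xml).getClass().getName());
  }
}
```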

What does Sun have to say?

Sun has this to say about a Document object:

"The Document interface represents the entire HTML or XML document. Conceptually, it is the root of the document tree, and provides the primary access to the document's data."

Sun describes a Node as follows:

"The Node interface is the primary datatype for the entire Document Object Model. It represents a single node in the document tree. While all objects implementing the Node interface expose methods for dealing with children, not all objects implementing the Node interface may have children. For example, Text nodes may not have children, and adding children to such nodes results in a DOMException being raised."

Methods of Document and Node

The Document and Node interfaces declare a large number of methods, which make it possible to manipulate and perform operations on the DOM tree structure encapsulated in the Document object. We will see several of those methods being used in the class named Dom02Writer, as it traverses the tree to create an output XML file that represents the tree.

The File class

The fourth import directive in Listing 9 imports the File class. I will assume that you already know all you need to know about this class. If not, see my tutorial lessons on file I/O at www.DickBaldwin.com.

Enough talk, let's see some code

Listing 10 shows the beginning of the class named Dom02, and the main method for that class.

public class Dom02{

  public static void main(String argv[]) {
    if (argv.length != 2) {
      System.err.println(
             "usage: java Dom02 fileIn fileOut");
      System.exit(0);
    }//end if

Listing 10

The code in Listing 10 simply checks to confirm that the user entered the correct number of command-line arguments, and terminates with an error message if not true.

Recall that argv[0] should contain the name of the input XML file and argv[1] should contain the name of the output XML file.

A DocumentBuilderFactory object

The code in Listing 11 creates and configures an object of type DocumentBuilderFactory, which is capable of producing objects of type DocumentBuilder. Objects of type DocumentBuilder are, in turn, capable of producing objects of type Document.

    try{
      DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();

      //Configure the factory object
      factory.setValidating(false);
      factory.setNamespaceAware(false);

Listing 11

Configuration

An object of the DocumentBuilderFactory class provides several methods (such as setValidating), which can be used to control the behavior of DocumentBuilder objects produced by the factory object. For example, if you want the parser that will be produced by this and the following code to be a validating parser, you must invoke the setValidating method at this point, passing true as a parameter.

(Note that the validating and namespaceAware properties are false by default, so inclusion of the corresponding statements in Listing 11 didn't accomplish anything, other than to illustrate the location and use of these methods.)
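If you would like to confirm the defaults for yourself, the following sketch asks a DocumentBuilder to report its own settings, first with an unconfigured factory and then after namespace awareness has been switched on. The class name FactoryDefaultsSketch and the method builderSettings are hypothetical names used only for this illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class FactoryDefaultsSketch {
  // Returns {validating, namespaceAware} as reported by a builder
  // produced by a factory. The factory is left at its defaults
  // unless namespaceAware is true.
  static boolean[] builderSettings(boolean namespaceAware) {
    try {
      DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();
      if (namespaceAware) {
        factory.setNamespaceAware(true);
      }
      DocumentBuilder builder = factory.newDocumentBuilder();
      return new boolean[] {
          builder.isValidating(), builder.isNamespaceAware()};
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] defaults = builderSettings(false);
    System.out.println("validating=" + defaults[0]
        + " namespaceAware=" + defaults[1]);
    boolean[] aware = builderSettings(true);
    System.out.println("validating=" + aware[0]
        + " namespaceAware=" + aware[1]);
  }
}
```

Note that the factory must be configured before newDocumentBuilder is invoked, because the settings are applied to the builders that the factory produces afterward.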

Get a DocumentBuilder (parser) object

As described earlier, the code in Listing 12 invokes the newDocumentBuilder method on the factory object produced in Listing 11, to produce a DocumentBuilder object. That object's reference is saved in the local variable named builder.

      DocumentBuilder builder =
                    factory.newDocumentBuilder();

Listing 12

The object produced by the code in Listing 12 is the kind of object that is commonly referred to in the XML literature as an XML parser.

(Thus, it would have been equally appropriate to save the object's reference in a variable named parser.)

Create a Document object

The code in Listing 13 invokes the parse method on the DocumentBuilder (parser) object to parse the XML file whose name and path were provided by the user as the first command-line argument (argv[0]).

Since this is a non-validating parser, the parse method will confirm that the XML is well-formed. (The parser will not attempt to validate the XML.) If the XML is not well-formed, the parse method will throw an exception. If the XML is well-formed, the parse method will create an object that represents the XML in a DOM tree, and return that object's reference as the interface type Document.

      Document document = builder.parse(
                              new File(argv[0]));

Listing 13

The code in Listing 13 saves the Document object's reference in the local variable named document.
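The following sketch shows the well-formedness behavior in isolation. It is not how Dom02 reports errors; it simply treats a SAXException thrown by parse as a sign that the input is not well-formed. The class name WellFormedSketch and the method isWellFormed are hypothetical, and depending on the parser implementation, a diagnostic line may also appear on the standard error device.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedSketch {
  // Returns true if a non-validating parser accepts the XML text.
  static boolean isWellFormed(String xml) {
    try {
      DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      // Thrown by parse when the XML is not well-formed.
      return false;
    } catch (Exception e) {
      // Configuration or I/O problems are not the point here.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<a><b/></a>"));  // matched tags
    System.out.println(isWellFormed("<a><b></a>"));   // b never closed
  }
}
```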

Process the DOM tree

At this point, the DOM tree represents the XML in the input file. The methods of the Document and Node interfaces could be used to perform a variety of operations on that tree, such as moving nodes, deleting nodes, inserting new nodes, modifying text nodes, etc. Having performed such operations, the program could then create a new XML file that represents the modified DOM tree.

In this simple program, however, we won't modify the DOM tree. Rather, we will simply create a new XML file that represents the unmodified DOM tree. Thus, the output XML file should be functionally equivalent to the input XML file.

Create the output file

This program will invoke a method named writeXmlFile on an anonymous object of the Dom02Writer class to create the output file, whose name and path were provided by the user as the second command-line argument. The writeXmlFile method is invoked by the code in Listing 14, passing the Document object's reference as a parameter to the method.

The writeXmlFile method will recursively traverse the DOM tree represented by the Document object. Along the way, it will extract information about each of the nodes and use this information to construct the elements in the output XML file.

      new Dom02Writer(argv[1]).
                          writeXmlFile(document);
    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()
} // class Dom02

Listing 14

The catch block

Listing 14 also contains the catch block that receives control if any of the code in the try block beginning in Listing 11 throws an error or an exception.

As mentioned earlier, the code in this catch block makes no attempt to provide meaningful information in the event of an error or an exception. The code to provide meaningful information in the event of parsing errors can be rather complex, and is a topic that will be covered in a future lesson.

End of the Dom02 class

The code in Listing 14 also signals the end of the Dom02 class, and the main method belonging to that class.

The Dom02Writer class

This class provides a utility method named writeXmlFile, which receives a Document object's reference as a parameter and writes an output XML file that matches the information encapsulated in the Document object.

The output file is created by recursively traversing the DOM tree encapsulated in the Document object, identifying each of the nodes in that tree, and converting each node to text in an XML format.

No effort is made to insert spaces and line breaks to make the output cosmetically pleasing. Also, nothing is done to eliminate cosmetic whitespace that may exist in the Document object.

The name of the output XML file is established as a parameter to the constructor for the class.

Testing

This class was briefly tested using SDK 1.4.2 and WinXP. Note however that this class has not been thoroughly tested. If you use the class for a critical application, be sure to test it thoroughly before using it.

The class definition

The beginning of the Dom02Writer class, including an instance variable and the constructor, is shown in Listing 15. (See the complete listing near the end of the lesson for the required import directives.)

public class Dom02Writer {
  private PrintWriter out;

  public Dom02Writer(String xmlFile) {
    try {
      out = new PrintWriter(
                  new FileOutputStream(xmlFile));
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end constructor

Listing 15

The constructor

The constructor is very straightforward, having nothing to do with JAXP or XML. The purpose of the constructor is to receive the output file name as an incoming parameter and to establish an output stream of type PrintWriter that is used to write information to the output file.

If this code is unfamiliar to you, you can learn about Java stream I/O at www.DickBaldwin.com.

The writeXmlFile method

Listing 16 shows the entire method named writeXmlFile, which converts an incoming Document object to an output XML file.

  public void writeXmlFile(Document document){
    try {
      writeNode(document);
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end writeXmlFile()

Listing 16

This method is also straightforward. All that it does is pass the Document object's reference to a recursive method named writeNode.

What does this mean?

Recall that I told you earlier that, according to Sun,

"The Document interface represents the entire ... XML document. Conceptually, it is the root of the document tree, ..."

Recall also that when discussing a Document object, I told you

"Because Document extends Node, that object could also be treated as type Node when appropriate."

We're now going to put all of that to the test. In effect, the Document object is a Node, which represents the root node of the DOM tree, and we can pass its reference to the method named writeNode, which requires an incoming parameter of type Node.

Recursion

Here is where things get a little more complicated, particularly if you don't have a strong background in recursive algorithms. The writeNode method implements a recursive algorithm.

(Typically at a time like this, I would tell you that if you don't understand recursion, you could visit my web site where you will find tutorial lessons that explain recursion. However, I have just realized that despite the fact that I have published several hundred lessons on OOP and Java, I have never published a lesson that concentrates on the implementation of recursion in Java. Therefore, the best that I can do at this point is to tell you to fire up your Google search engine and search for the keywords Java and recursion. You will probably find many sites that deal with recursion in Java.)
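In the meantime, here is a minimal sketch of recursion applied to a DOM tree. It has the same shape as the writeNode method that follows, but it only counts nodes instead of writing them. The class name RecursionSketch and its methods are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RecursionSketch {
  // Count this node plus all of its descendants.
  static int countNodes(Node node) {
    if (node == null) {
      return 0; // nothing to count
    }
    int count = 1; // this node
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      // Recursive call: each child is handled exactly like
      // its parent was.
      count += countNodes(children.item(i));
    }
    return count;
  }

  // Hypothetical helper: parse a String and count from the top.
  static int countIn(String xml) {
    try {
      Document document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return countNodes(document);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Document, a, b, the text "hi", and c: five nodes.
    System.out.println(countIn("<a><b>hi</b><c/></a>"));
  }
}
```

The recursion ends naturally because a node with no children never enters the loop, so no further calls are made from that node.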

The writeNode method

The writeNode method, which begins in Listing 17, is invoked recursively to convert Node data to XML format and to write the XML format data to the output file.

The method begins by executing code designed to avoid the infamous NullPointerException that occurs when the incoming reference fails to refer to an actual object of type Node. In this event, the method simply returns after writing a message to the standard error device.

  public void writeNode(Node node) {
    if (node == null) {
      System.err.println(
                  "Nothing to do, node is null");
      return;
    }//end if

Listing 17

Process the node based on its type

The code in Listing 18 invokes the getNodeType method to determine the type of the node whose reference was received as an incoming parameter. According to Sun, this method returns a short value representing the type of the node. (Why did I treat it as type int? Just an oversight I suppose.)

int type = node.getNodeType();

Listing 18

The Sun documentation shows that the Node interface defines final static variables that represent the following types (variables defined in an interface are implicitly public, static, and final):

ATTRIBUTE_NODE
CDATA_SECTION_NODE
COMMENT_NODE
DOCUMENT_FRAGMENT_NODE
DOCUMENT_NODE
DOCUMENT_TYPE_NODE
ELEMENT_NODE
ENTITY_NODE
ENTITY_REFERENCE_NODE
NOTATION_NODE
PROCESSING_INSTRUCTION_NODE
TEXT_NODE

These values will be used in a switch statement to identify the type of incoming node, and to take appropriate action regarding the information written in the output XML file. (Note however, that this simple test case was not designed to test all possibilities in the above list.)
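As a small illustration of how these constants behave in a switch statement, the following sketch maps a node type value to the name of its constant. It switches directly on a short, which is the type that getNodeType returns. The class name NodeTypeSketch and the method typeName are hypothetical names used only for this illustration.

```java
import org.w3c.dom.Node;

class NodeTypeSketch {
  // Translate a node type value into the name of its constant.
  static String typeName(short type) {
    switch (type) {
      case Node.ATTRIBUTE_NODE: return "ATTRIBUTE_NODE";
      case Node.CDATA_SECTION_NODE: return "CDATA_SECTION_NODE";
      case Node.COMMENT_NODE: return "COMMENT_NODE";
      case Node.DOCUMENT_FRAGMENT_NODE:
        return "DOCUMENT_FRAGMENT_NODE";
      case Node.DOCUMENT_NODE: return "DOCUMENT_NODE";
      case Node.DOCUMENT_TYPE_NODE: return "DOCUMENT_TYPE_NODE";
      case Node.ELEMENT_NODE: return "ELEMENT_NODE";
      case Node.ENTITY_NODE: return "ENTITY_NODE";
      case Node.ENTITY_REFERENCE_NODE:
        return "ENTITY_REFERENCE_NODE";
      case Node.NOTATION_NODE: return "NOTATION_NODE";
      case Node.PROCESSING_INSTRUCTION_NODE:
        return "PROCESSING_INSTRUCTION_NODE";
      case Node.TEXT_NODE: return "TEXT_NODE";
      default: return "UNKNOWN";
    }
  }

  public static void main(String[] args) {
    System.out.println(typeName(Node.ELEMENT_NODE));
    System.out.println(typeName(Node.TEXT_NODE));
  }
}
```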

Process the Document node

I will discuss each case in the switch statement separately. Listing 19 shows the code that is executed when the incoming Node object is type DOCUMENT_NODE.

    switch (type) {
      case Node.DOCUMENT_NODE: {
        out.print("<?xml version=\"1.0\"?>");

        //Get and write the root element of the
        // Document.  Note that this is a
        // recursive call.
        writeNode(
          ((Document)node).getDocumentElement());
        out.flush();
        break;
      }//end case Node.DOCUMENT_NODE

Listing 19

The code in Listing 19 begins by writing the XML declaration, which indicates that the file contains XML data. (Strictly speaking, the declaration is optional in XML 1.0, but when it is present, it must appear at the very beginning of the file.)

The getDocumentElement method

Then the code in Listing 19 downcasts the Node object's reference to type Document and invokes the getDocumentElement method on that reference. Here is what Sun has to say about this method:

"This is a convenience attribute that allows direct access to the child node that is the root element of the document. For HTML documents, this is the element with the tagName "HTML"."

For the XML file being processed in this example, this will be the element named bookOfPoems.

(Although I'm not certain, I suspect that the documentation author intended to say convenience method instead of convenience attribute.)

An object of interface type Element

The getDocumentElement method returns a reference to an object of the interface type Element, which is a subinterface of the Node interface.
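The following sketch illustrates what "direct access to the child node that is the root element" means in practice. The input contains a comment ahead of the root element, so the Document has two children, yet getDocumentElement goes straight to the element. The class name RootElementSketch and its methods are hypothetical, and the sketch assumes a default parser configuration, which keeps comments in the tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RootElementSketch {
  // Hypothetical helper: parse XML held in a String.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Number of direct children of the Document itself.
  static int documentChildCount(String xml) {
    return parse(xml).getChildNodes().getLength();
  }

  // Name of the root element, regardless of what precedes it.
  static String rootName(String xml) {
    Element root = parse(xml).getDocumentElement();
    return root.getTagName();
  }

  // The root element's parent is the Document.
  static boolean rootParentIsDocument(String xml) {
    Document document = parse(xml);
    return document.getDocumentElement().getParentNode() == document;
  }

  public static void main(String[] args) {
    String xml = "<!-- note --><bookOfPoems/>";
    System.out.println(documentChildCount(xml));
    System.out.println(rootName(xml));
    System.out.println(rootParentIsDocument(xml));
  }
}
```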

Here is what Sun has to say about objects of type Element:

"The Element interface represents an element in an HTML or XML document. Elements may have attributes associated with them; since the Element interface inherits from Node, the generic Node interface attribute attributes may be used to retrieve the set of all attributes for an element. There are methods on the Element interface to retrieve either an Attr object by name or an attribute value by name. In XML, where an attribute value may contain entity references, an Attr object should be retrieved to examine the possibly fairly complex sub-tree representing the attribute value."

The interface declares about fifteen methods, which make it possible to perform various operations on an Element object.

A recursive call to the writeNode method

The code in Listing 19 gets the Element object (which is also a Node object) corresponding to the root element of the XML document and passes that object's reference, recursively, to the writeNode method.

When the writeNode method ultimately returns, the code in Listing 19 flushes the output buffer to ensure that all data that has been written to the output buffer is actually written to the output file.

Important: The statement that reads out.flush and all of the remaining code in this method will not be executed until the recursive call to writeNode() returns.

In effect, the code for the DOCUMENT_NODE case in the writeNode method (Listing 19) simply gets the object in the DOM tree corresponding to the root element in the XML document and passes it recursively to the writeNode method. This causes the information corresponding to the root element to be written to the output file.

Node type ELEMENT_NODE

Listing 20 shows the beginning of the code in the switch case where the node type is ELEMENT_NODE.

      case Node.ELEMENT_NODE: {
        out.print('<');//begin the start tag
        out.print(node.getNodeName());

Listing 20

The code in Listing 20 is simple enough.

Begin the case clause for type ELEMENT_NODE.
Write a left angle bracket ("<") into the output file to begin the tag for the element.
Get and write the name of the node immediately following the left angle bracket.

The Attr interface

An element can have none, one, or more attributes. The Attr interface extends the Node interface. Here is part of what Sun has to say about the Attr interface:

"The Attr interface represents an attribute in an Element object. Typically the allowable values for the attribute are defined in a document type definition.

Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. Thus, the Node attributes parentNode, previousSibling, and nextSibling have a null value for Attr objects. The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with; this should make it more efficient to implement such features as default attributes associated with all elements of a given type. Furthermore, Attr nodes may not be immediate children of a DocumentFragment. However, they can be associated with Element nodes contained within a DocumentFragment. In short, users and implementors of the DOM need to be aware that Attr nodes have some things in common with other objects inheriting the Node interface, but they also are quite distinct."
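The following sketch checks two of the statements in that quotation: an Attr object has no parent node, yet it still knows which element owns it (through the getOwnerElement method of the Attr interface). The class name AttrSketch and its methods are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttrSketch {
  // Hypothetical helper: parse a String and return its root element.
  static Element root(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // True if the named attribute of the root has a null parent.
  static boolean attrHasNoParent(String xml, String attrName) {
    Attr attr = root(xml).getAttributeNode(attrName);
    return attr.getParentNode() == null;
  }

  // Name of the element that owns the named attribute.
  static String ownerName(String xml, String attrName) {
    Attr attr = root(xml).getAttributeNode(attrName);
    return attr.getOwnerElement().getNodeName();
  }

  public static void main(String[] args) {
    String xml = "<poem id=\"7\"/>";
    System.out.println(attrHasNoParent(xml, "id"));
    System.out.println(ownerName(xml, "id"));
  }
}
```

This is why the writeNode method cannot reach attributes by iterating over child nodes and must ask the element for its attributes separately.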

Process the attributes, if any

Continuing with the case for node type ELEMENT_NODE, the code in Listing 21 gets the attributes, if any, belonging to the element and writes them into the output file in the correct XML format.

        //Get attributes into an array
        Attr attrs[] = getAttrArray(
                          node.getAttributes());

        //Process attributes in the array.
        for (int i = 0; i < attrs.length; i++){
          Attr attr = attrs[i];
          out.print(' ');//write a space
          out.print(attr.getNodeName());
          out.print("=\"");//write ="
          //Convert <,>,&, and quotation char to
          // entities and write the text
          // containing the entities.
          out.print(
                  strToXML(attr.getNodeValue()));
          out.print('"');//write closing quote
        }//end for loop
        out.print('>');//write end of start tag

Listing 21

Get attributes into an array object

The code begins by invoking the getAttrArray method (which is defined later in this class) to get the attributes and to store them in an array object of type Attr. I will explain the getAttrArray method later. For now, suffice it to say that the getAttrArray method returns a reference to an array object of type Attr where each element in the array represents one of the attributes associated with the node being processed.

Process the attributes, if any

With one exception, the code to process the array, getting the name and value of each attribute and writing them into the output XML file is straightforward. All it really amounts to is invoking the getNodeName and getNodeValue methods to get the name and the value of the attribute, and then creating the correct sequence of text, spaces, and punctuation characters. For the case where the node is of type Attr, these two methods simply return strings.

The strToXML method

The exception mentioned above has to do with the call to the method named strToXML. This method is used to replace extraneous angle brackets, ampersands, and quotation marks in the text with the corresponding XML entities. I will explain the inner workings of this method later in this lesson.

Nested elements

At this point, we must deal with the possibility that this node may have children, and must process them if they exist. This is accomplished by the code in Listing 22, where we are still dealing with the switch case of node type ELEMENT_NODE.

        NodeList children = node.getChildNodes();

        if (children != null) {//nested elements
          int len = children.getLength();
          //Iterate on NodeList of child nodes.
          for (int i = 0; i < len; i++) {
            //Write each of the nested elements
            // recursively.
            writeNode(children.item(i));
          }//end for loop
        }//end if
        break;
      }//end case Node.ELEMENT_NODE

Listing 22

Listing 22 begins by invoking the getChildNodes method on the current node to get an object of type NodeList containing a collection of the children of this node.

The items in the NodeList object are accessible via an integral index, starting from 0, via a method named item. The item method takes an integral index as a parameter, and returns a reference to an object of type Node.

A NodeList object also provides a method named getLength, which returns the number of nodes in the list.

Getting the nodes in the list

The getChildNodes method returns an empty list if there are no children. (If there are children, getLength returns a value greater than zero.)

Assuming that you are comfortable with recursion, the code in Listing 22 is straightforward:

Invoke getLength to get the number of nodes.
Use a for loop to iterate on each of the nodes.
Make a recursive call inside the for loop to the writeNode method to process each child node.
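The following sketch isolates the NodeList behavior described above. It reports the number of children of a root element, including the case where there are none and the list is empty rather than null. The class name ChildListSketch and its methods are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildListSketch {
  // Hypothetical helper: parse a String and return its root element.
  static Element root(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Number of direct children of the root element.
  static int childCount(String xml) {
    NodeList children = root(xml).getChildNodes();
    return children.getLength();
  }

  // Names of the direct children, in document order.
  static String childNames(String xml) {
    NodeList children = root(xml).getChildNodes();
    StringBuffer names = new StringBuffer();
    for (int i = 0; i < children.getLength(); i++) {
      if (i > 0) names.append(' ');
      names.append(children.item(i).getNodeName());
    }
    return names.toString();
  }

  public static void main(String[] args) {
    System.out.println(childCount("<a/>"));            // no children
    System.out.println(childCount("<a>text<b/></a>")); // text and b
    System.out.println(childNames("<a>text<b/></a>"));
  }
}
```

Note that a text node reports the fixed name #text, which is one reason the writeNode method switches on the node type rather than on the node name.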

That ends the processing for the switch case ELEMENT_NODE.

Entity reference nodes

The code in Listing 23 is the code for the switch case ENTITY_REFERENCE_NODE.

      case Node.ENTITY_REFERENCE_NODE:{
        out.print('&');
        out.print(node.getNodeName());
        out.print(';');
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

Listing 23

The code in Listing 23 sandwiches the name of a node of type ENTITY_REFERENCE_NODE between an ampersand and a semicolon and writes the combination into the output file. This produces an entity reference in the output XML file.

Briefly, an entity reference is a reference to something that has been defined elsewhere. Since this lesson is not intended to teach you about entities, I will drop it at that. The sample XML file used to test this program didn't contain any entity references, so this code has not been tested.

Text nodes

The code in Listing 24 handles the following switch cases:

CDATA_SECTION_NODE
TEXT_NODE

      case Node.CDATA_SECTION_NODE:
      case Node.TEXT_NODE: {
        //Eliminate <,>,& and quotation marks and
        // write to output file.
        out.print(strToXML(node.getNodeValue()));
        break;
      }//end case Node.TEXT_NODE

Listing 24

Without getting into the technical XML details as to why, a block of text can be represented by either of two node types:

CDATA_SECTION_NODE
TEXT_NODE

(The sample XML file that I used to test this program contained only the second type.)

The processing for this type of node, shown in Listing 24, is very simple:

Get the value of the node, which contains the actual text.
Invoke the strToXML method to replace angle brackets, ampersands, and quotation marks with entities.
Write the modified text to the output file.

Note, however, that by replacing angle brackets, ampersands, and quotation marks with entities, the code in Listing 24 essentially converts CDATA into PCDATA. In some cases, that may not be desirable, so this may not be the best approach for dealing with CDATA.
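One possible alternative, shown here only as a sketch, is to write a CDATA section node back out as a CDATA section and to reserve the entity replacement for ordinary text nodes. The class name CdataSketch and its methods are hypothetical, this is not the approach used by Dom02Writer, and the sketch assumes a default parser configuration that does not merge CDATA sections into text. It also ignores the corner case in which the text itself contains the characters ]]>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataSketch {
  // Write a text-like node, keeping CDATA sections as CDATA.
  static String textToXml(Node node) {
    String value = node.getNodeValue();
    if (node.getNodeType() == Node.CDATA_SECTION_NODE) {
      // Assumes value does not contain "]]>".
      return "<![CDATA[" + value + "]]>";
    }
    // Ordinary text: replace markup characters with entities.
    // The ampersand must be handled first.
    return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
  }

  // Hypothetical helper: convert the first child of the root.
  static String firstChildAsXml(String xml) {
    try {
      Node first = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getFirstChild();
      return textToXml(first);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(
        firstChildAsXml("<a><![CDATA[x < y]]></a>"));
    System.out.println(firstChildAsXml("<a>x > y</a>"));
  }
}
```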

Processing instruction nodes

The code in Listing 25 is the code for the switch case PROCESSING_INSTRUCTION_NODE.

      case Node.PROCESSING_INSTRUCTION_NODE:{
        out.print("<?");
        out.print(node.getNodeName());
        String data = node.getNodeValue();
        if (data != null && data.length() > 0){
          out.print(' ');//write space
          out.print(data);
        }//end if
        out.print("?>");
        break;
      }//end Node.PROCESSING_INSTRUCTION_NODE
    }//end switch

Listing 25

Based on what you have learned up to this point, the processing of this node type in Listing 25 should be straightforward. In this case, the getNodeName method returns a string corresponding to the target of the processing instruction. The getNodeValue method returns a string consisting of the "entire content excluding the target."

The target string is written into the output XML file preceded by "<?".

If the string returned by getNodeValue is not null and has a length greater than zero, that string is then written into the output file preceded by a space.

Finally the characters "?>" are written into the output file completing the processing instruction.
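The three steps just described can be tried in isolation with the following sketch, which parses a small document, takes the processing instruction that is the first child of the root element, and rebuilds its text using the same logic as Listing 25. The class name PiSketch, the method rewrite, and the targets render and ping are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PiSketch {
  // Rebuild the text of the processing instruction that is the
  // first child of the root element.
  static String rewrite(String xml) {
    Node pi;
    try {
      pi = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getFirstChild();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    StringBuffer out = new StringBuffer("<?");
    out.append(pi.getNodeName());      // the target
    String data = pi.getNodeValue();   // content excluding target
    if (data != null && data.length() > 0) {
      out.append(' ').append(data);
    }
    return out.append("?>").toString();
  }

  public static void main(String[] args) {
    System.out.println(rewrite("<r><?render mode=\"fast\"?></r>"));
    System.out.println(rewrite("<r><?ping?></r>"));
  }
}
```

The second call shows why the test on the data string matters: a processing instruction with a target and no data is written without a trailing space.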

Close the element

There is one more thing that needs to be done before exiting the writeNode method being used to process a node. As shown in Listing 26, if the node being processed is an element, the end tag for the element needs to be created and written to the output file. That is accomplished in a straightforward manner in Listing 26.

    //Now write the end tag for element nodes
    if (type == Node.ELEMENT_NODE) {
      out.print("</");
      out.print(node.getNodeName());
      out.print('>');

    }//end if

  }//end writeNode(Node)

Listing 26

Listing 26 also signals the end of the writeNode method.

Utility methods

That brings us to some utility methods that are invoked by the code discussed above.

The strToXML method

The purpose of the strToXML method, shown in Listing 27, is to modify and return a String object replacing angle brackets, ampersands, and quotation marks with XML entities.

  private String strToXML(String s) {
    StringBuffer str = new StringBuffer();

    int len = (s != null) ? s.length() : 0;

    for (int i = 0; i < len; i++) {
      char ch = s.charAt(i);
      switch (ch) {
        case '<': {
          str.append("&lt;");
          break;
        }//end case '<'
        case '>': {
          str.append("&gt;");
          break;
        }//end case '>'
        case '&': {
          str.append("&amp;");
          break;
        }//end case '&'
        case '"': {
          str.append("&quot;");
          break;
        }//end case '"'
        default: {
          str.append(ch);
        }//end default
      }//end switch
    }//end for loop

    return str.toString();

  }//end strToXML()

Listing 27

The method receives a String object's reference as an incoming parameter. It replaces the <,>,&, and quotation mark characters in that string with XML entities, and returns the modified string.

The code in Listing 27 is completely straightforward, and shouldn't require further explanation.
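For readers who would like to see the replacement in action, here is a compact, stand-alone sketch of the same logic along with a sample call. It assumes that the four replacement strings are the predefined XML entities &lt;, &gt;, &amp;, and &quot;. The class name EscapeSketch is a hypothetical name used only for this illustration.

```java
class EscapeSketch {
  // Replace <, >, &, and the quotation mark with XML entities.
  static String strToXML(String s) {
    StringBuffer str = new StringBuffer();
    int len = (s != null) ? s.length() : 0;
    for (int i = 0; i < len; i++) {
      char ch = s.charAt(i);
      switch (ch) {
        case '<': str.append("&lt;"); break;
        case '>': str.append("&gt;"); break;
        case '&': str.append("&amp;"); break;
        case '"': str.append("&quot;"); break;
        default: str.append(ch);
      }
    }
    return str.toString();
  }

  public static void main(String[] args) {
    System.out.println(strToXML("a<b & \"c\">"));
    // prints: a&lt;b &amp; &quot;c&quot;&gt;
    System.out.println(strToXML(null).length());
    // prints: 0 (a null reference yields an empty string)
  }
}
```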

The getAttrArray method

In the earlier discussion of attributes, I promised to provide a further discussion of the getAttrArray method shown in Listing 28. Briefly, this method converts a NamedNodeMap into an array object of type Attr.

  private Attr[] getAttrArray(
                            NamedNodeMap attrs){
    int len = (attrs != null) ?
                          attrs.getLength() : 0;
    Attr array[] = new Attr[len];
    for (int i = 0; i < len; i++) {
      array[i] = (Attr)attrs.item(i);
    }//end for loop

    return array;
  }//end getAttrArray()

} // end class Dom02Writer

Listing 28

The getAttributes method

Backtracking a bit, the code in Listing 21 invokes the getAttributes method on the node, and passes the returned value as a parameter to the getAttrArray method shown in Listing 28.

The getAttributes method returns a reference to an object of type NamedNodeMap containing the attributes of the node (if it is an Element) and null otherwise.

Thus, the getAttrArray method shown in Listing 28 receives an incoming parameter of type NamedNodeMap, which may be null.

The NamedNodeMap interface

Here is part of what Sun has to say about a NamedNodeMap object:

"Objects implementing the NamedNodeMap interface are used to represent collections of nodes that can be accessed by name. ... Objects contained in an object implementing NamedNodeMap may also be accessed by an ordinal index, but this is simply to allow convenient enumeration of the contents of a NamedNodeMap, and does not imply that the DOM specifies an order to these Nodes."

A NamedNodeMap object provides several methods, which can be used to

Get the number of items in the collection.
Access the items in the collection.
Remove items from the collection.
Add items to the collection.

The method named item

The code in Listing 28 takes advantage of the fact that "Objects contained in an object implementing NamedNodeMap may also be accessed by an ordinal index, ..." This is accomplished by invoking the method named item on the NamedNodeMap object, passing an ordinal index as a parameter.

Given this information, the process for converting a NamedNodeMap object into an array object of type Attr, as implemented by the getAttrArray method in Listing 28, is relatively straightforward:

Get required length for the array.
Instantiate the new array object of the proper length.
Use a for loop and the item method to extract each item from the NamedNodeMap object and use it to populate the array object.
Return the array object.
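The following sketch exercises the two ways of reaching the items in a NamedNodeMap: by name, using the getNamedItem method, and by ordinal index, using the item method. Because the DOM does not specify an order for the indexed items, the sketch sorts the names before displaying them. The class name AttrMapSketch and its methods are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttrMapSketch {
  // Hypothetical helper: attributes of the root element.
  static NamedNodeMap rootAttrs(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getAttributes();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Access by name; returns null if there is no such attribute.
  static String valueOf(String xml, String name) {
    Node attr = rootAttrs(xml).getNamedItem(name);
    return (attr != null) ? attr.getNodeValue() : null;
  }

  // Access by ordinal index, then sort because the DOM does not
  // promise any particular order.
  static String sortedNames(String xml) {
    NamedNodeMap attrs = rootAttrs(xml);
    String[] names = new String[attrs.getLength()];
    for (int i = 0; i < names.length; i++) {
      names[i] = attrs.item(i).getNodeName();
    }
    Arrays.sort(names);
    return Arrays.toString(names);
  }

  public static void main(String[] args) {
    String xml = "<poem lang=\"en\" id=\"7\"/>";
    System.out.println(valueOf(xml, "lang"));
    System.out.println(sortedNames(xml));
  }
}
```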

End of class Dom02Writer

The code in Listing 28 also signals the end of the class definition for the class named Dom02Writer.

Run the Program

I encourage you to copy the code from the complete program listings near the end of this lesson into your text editor, compile it, and execute it. Experiment with it, making changes and observing the results of your changes.

Summary

In this first lesson on Java JAXP, I began by providing a brief description of JAXP and XML. Then I reviewed the salient aspects of XML for those who need to catch up on XML technology.

Following that, I provided a brief discussion of the Document Object Model (DOM) and the Simple API for XML (SAX). I discussed how a DOM object represents an XML document as a tree structure in memory. I explained that once you have the tree structure in memory, there are many operations that you can perform to create, manipulate, and/or modify the structure. Then you can convert that modified tree structure into a new XML document.

Using two sample Java class files, I showed you how to:

Use JAXP, DOM, and an input XML file to create a Document object that represents the XML file.
Recursively traverse the DOM tree, getting information about each node in the tree along the way.
Use the information about the nodes to create a new XML file that represents the Document object.

What's Next?

What I did not do in this lesson (but will do in a future lesson) is show you how to modify the tree structure for purposes of creating a modified XML file.

The things that you learned about traversing the tree structure and getting information about each node in the tree will serve you well in the future. However, if all you need to do is to write an output XML file that represents the DOM, there is an easier way to do that using Extensible Stylesheet Language Transformations (XSLT). That will be the primary topic of the next lesson.
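As a preview, here is a minimal sketch of that easier approach, assuming the identity transformer that the standard javax.xml.transform API returns when newTransformer is called with no stylesheet. The next lesson may well approach it differently. The class name IdentityTransformDemo and its methods are hypothetical, and the sketch writes to a string rather than to a file so that it is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not one of the lesson's listings.
class IdentityTransformDemo {

  // Creates a Document object from XML text.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException(
          "Cannot parse: " + xml, e);
    }
  }

  // Converts the DOM tree to XML text without any hand-written
  // traversal. A Transformer created with no stylesheet copies
  // its source to its result unchanged.
  static String serialize(Document document) {
    try {
      Transformer transformer =
          TransformerFactory.newInstance().newTransformer();
      StringWriter writer = new StringWriter();
      transformer.transform(
          new DOMSource(document), new StreamResult(writer));
      return writer.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException(
          "Cannot serialize document", e);
    }
  }

  // Parses XML text and immediately writes it back out.
  static String roundTrip(String xml) {
    return serialize(parse(xml));
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(
        "<poem PoemNumber=\"1\">"
        + "<line>Roses are red,</line></poem>"));
  }
}
```

To produce an output file instead of a string, the StreamResult can be constructed from a java.io.File object rather than from a StringWriter.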

In this lesson, I didn't show you how to write code that produces meaningful output in the event of a parser error or exception. I will also cover that topic in the next lesson.
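In the meantime, one possible starting point is to catch the exception types that the parser can throw separately, instead of catching Exception and printing a stack trace. The following is a minimal sketch under that assumption and is not necessarily the approach taken in the next lesson. The class name ParseErrorDemo and the method describeProblem are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; it is not one of the lesson's listings.
class ParseErrorDemo {

  // Returns null if the XML text parses cleanly. Otherwise
  // returns a one-line description of what went wrong.
  static String describeProblem(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory
          .newInstance().newDocumentBuilder();
      builder.parse(new InputSource(new StringReader(xml)));
      return null;
    } catch (SAXParseException e) {
      // Must come before SAXException, its superclass.
      // It knows where in the input the problem was found.
      return "Parse error at line " + e.getLineNumber()
          + ", column " + e.getColumnNumber()
          + ": " + e.getMessage();
    } catch (SAXException e) {
      return "SAX error: " + e.getMessage();
    } catch (ParserConfigurationException e) {
      return "Parser configuration error: " + e.getMessage();
    } catch (IOException e) {
      return "I/O error: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    // The line element is never closed, so this is not
    // well-formed XML.
    System.out.println(describeProblem(
        "<poem><line>Roses are red,</poem>"));
  }
}
```

The wording of the message after the line and column numbers comes from the parser implementation, so it can differ from one JDK to another.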

Complete Program Listings

Complete listings of the two Java classes and the XML document discussed in this lesson are shown in Listings 28, 29, and 30 below.

/*File Dom02.java
Copyright 2003 R.G.Baldwin

This program and the class named Dom02Writer used
by this program show you how to:

1. Create a Document object using JAXP, DOM, and
an input XML file.
2. Traverse the DOM tree getting information
about each element.
3. Use the information describing the elements to
create an XML file that represents the
Document object.

The input XML file name is provided by the user
as the first command-line argument. The output
XML file name is provided by the user as the
second command-line argument.

The program requires access to the following
class file:
Dom02Writer.class

The program instantiates a DOM parser object
based on JAXP. The parser is non-validating
and is not namespace aware.

The program uses the parse() method of the parser
object to parse an XML file specified on the
command line. The parse method returns an object
of type Document that represents the parsed XML
file.

The program passes the Document object to a
method named writeXmlFile() on an object of a
class named Dom02Writer. The purpose of this
method and this class is to write an XML file
that represents the information contained in the
Document object.

Tested using JDK 1.4.2 and WinXP with an XML
file that reads as follows:

<?xml version="1.0"?>
<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

When viewed with an editor that restores most of
the cosmetic XML structure, the output file looks
like the following (note that the comment from
the input XML file is missing in the output
XML file):

<?xml version="1.0"?><bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

Note. No effort was made to provide meaningful
information about errors and exceptions. This is
a complex topic that will be covered in a
subsequent sample program.
************************************************/

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import java.io.File;
import org.w3c.dom.Document;

public class Dom02 {

public static void main(String argv[]) {
if (argv.length != 2) {
System.err.println(
"usage: java Dom02 fileIn fileOut");
System.exit(1);//nonzero status signals misuse
}//end if

try{
//Get a factory object for DocumentBuilder
// objects
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();

//Configure the factory object
factory.setValidating(false);
factory.setNamespaceAware(false);

//Get a DocumentBuilder (parser) object
DocumentBuilder builder =
factory.newDocumentBuilder();

//Parse the XML input file to create a
// Document object that represents the
// input XML file.
Document document = builder.parse(
new File(argv[0]));

//Use an anonymous object of the
// Dom02Writer class to traverse the
// Document object, extracting information
// about each of the nodes, and using that
// information to write an output XML
// file that represents the Document
// object.
new Dom02Writer(argv[1]).
writeXmlFile(document);
}catch(Exception e){
e.printStackTrace(System.err);
}//end catch

}// end main()
} // class Dom02

Listing 28

/*File Dom02Writer.java
Copyright 2003 R.G.Baldwin

This class provides a utility method named
writeXmlFile() that receives a DOM Document
object as a parameter and writes an output XML
file that matches the information contained in
the Document object.

The output file is created by recursively
traversing the Document object, identifying each
of the nodes in that object, and converting each
node to text in an XML format.

No effort is made to insert spaces and line
breaks to make the output cosmetically pleasing.
Also, nothing is done to eliminate cosmetic
whitespace that may exist in the Document object.

The name of the XML file is established as a
parameter to the constructor for the class.

A cosmetically pleasing view of the output file
can be obtained by opening the output file in
IE 5.0 or later.

Briefly tested using JDK 1.4.2 and WinXP. Note
however that this class has not been thoroughly
tested. If you use it for a critical application,
test it thoroughly before using it.
************************************************/

import java.io.PrintWriter;
import java.io.FileOutputStream;

import org.w3c.dom.*;

public class Dom02Writer {
private PrintWriter out;

//-------------------------------------------//

public Dom02Writer(String xmlFile) {
try {
out = new PrintWriter(
new FileOutputStream(xmlFile));
}catch (Exception e) {
e.printStackTrace(System.err);
}//end catch
}//end constructor
//-------------------------------------------//

//This method converts an incoming Document
// object to an output XML file
public void writeXmlFile(Document document){
try {
//Write the contents of the Document object
// into an output file in XML file format.
writeNode(document);
}catch (Exception e) {
e.printStackTrace(System.err);
}//end catch
}//end writeXmlFile()
//-------------------------------------------//

//This method is used recursively to convert
// node data to XML format and to write the XML
// format data to the output file.
public void writeNode(Node node) {
if (node == null) {
System.err.println(
"Nothing to do, node is null");
return;
}//end if

//Process the node based on its type.
int type = node.getNodeType();

switch (type) {
//Process the Document node
case Node.DOCUMENT_NODE: {
//Write a required line for an XML
// document.
out.print("<?xml version=\"1.0\"?>");

//Get and write the root element of the
// Document. Note that this is a
// recursive call.
writeNode(
((Document)node).getDocumentElement());
out.flush();
break;
}//end case Node.DOCUMENT_NODE

//Write an element with attributes
case Node.ELEMENT_NODE: {
out.print('<');//begin the start tag
out.print(node.getNodeName());

//Get and write the attributes belonging
// to the element. First get the
// attributes in the form of an array.
Attr attrs[] = getAttrArray(
node.getAttributes());

//Now process all of the attributes in
// the array.
for (int i = 0; i < attrs.length; i++){
Attr attr = attrs[i];
out.print(' ');//write a space
out.print(attr.getNodeName());
out.print("=\"");//write ="
//Convert <,>,&, and quotation char to
// entities and write the text
// containing the entities.
out.print(
strToXML(attr.getNodeValue()));
out.print('"');//write closing quote
}//end for loop
out.print('>');//write end of start tag

//Deal with the possibility that there
// may be other elements nested in this
// element.
NodeList children = node.getChildNodes();
if (children != null) {//nested elements
int len = children.getLength();
//Iterate on NodeList of child nodes.
for (int i = 0; i < len; i++) {
//Write each of the nested elements
// recursively.
writeNode(children.item(i));
}//end for loop
}//end if
break;
}//end case Node.ELEMENT_NODE

//Handle entity reference nodes
case Node.ENTITY_REFERENCE_NODE:{
out.print('&');
out.print(node.getNodeName());
out.print(';');
break;
}//end case Node.ENTITY_REFERENCE_NODE

//Handle text
case Node.CDATA_SECTION_NODE:
case Node.TEXT_NODE: {
//Convert <,>,& and quotation marks to
// entities and write to output file.
out.print(strToXML(node.getNodeValue()));
break;
}//end case Node.TEXT_NODE

//Handle processing instruction
case Node.PROCESSING_INSTRUCTION_NODE:{
out.print("<?");
out.print(node.getNodeName());
String data = node.getNodeValue();
if (data != null && data.length() > 0){
out.print(' ');//write space
out.print(data);
}//end if
out.print("?>");
break;
}//end Node.PROCESSING_INSTRUCTION_NODE
}//end switch

//Now write the end tag for element nodes
if (type == Node.ELEMENT_NODE) {
out.print("</");
out.print(node.getNodeName());
out.print('>');

}//end if

}//end writeNode(Node)
//-------------------------------------------//

//The following methods are utility methods

//This method inserts entities in place
// of <,>,&, and quotation mark
private String strToXML(String s) {
StringBuffer str = new StringBuffer();

int len = (s != null) ? s.length() : 0;

for (int i = 0; i < len; i++) {
char ch = s.charAt(i);
switch (ch) {
case '<': {
str.append("&lt;");
break;
}//end case '<'
case '>': {
str.append("&gt;");
break;
}//end case '>'
case '&': {
str.append("&amp;");
break;
}//end case '&'
case '"': {
str.append("&quot;");
break;
}//end case '"'
default: {
str.append(ch);
}//end default
}//end switch
}//end for loop

return str.toString();

}//end strToXML()
//-------------------------------------------//

//This method converts a NamedNodeMap into an
// array of type Attr
private Attr[] getAttrArray(
NamedNodeMap attrs){
int len = (attrs != null) ?
attrs.getLength() : 0;
Attr array[] = new Attr[len];
for (int i = 0; i < len; i++) {
array[i] = (Attr)attrs.item(i);
}//end for loop

return array;
}//end getAttrArray()

//-------------------------------------------//

} // end class Dom02Writer

Listing 29

<?xml version="1.0"?>
<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

Listing 30

About the author

Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.

Baldwin@DickBaldwin.com
