1. Overview

In this article, we will explore various methods to turn XML files into CSV format using Java.

XML (Extensible Markup Language) and CSV (Comma-Separated Values) are both popular choices for data exchange. While XML is a powerful option that allows for a structured, layered approach to complicated data sets, CSV is more straightforward and designed primarily for tabular data. 

Sometimes, there might be situations where we need to convert an XML to a CSV to make data import or analysis easier.

2. Introduction to XML Data Layout

Imagine we run a bunch of bookstores, and we’ve stored our inventory data in an XML format similar to the example below:

<?xml version="1.0"?>
<Bookstores>
    <Bookstore id="S001">
        <Books>
            <Book id="B001" category="Fiction">
                <Title>Death and the Penguin</Title>
                <Author id="A001">Andrey Kurkov</Author>
                <Price>10.99</Price>
            </Book>
            <Book id="B002" category="Poetry">
                <Title>Kobzar</Title>
                <Author id="A002">Taras Shevchenko</Author>
                <Price>8.50</Price>
            </Book>
        </Books>
    </Bookstore>
    <Bookstore id="S002">
        <Books>
            <Book id="B003" category="Novel">
                <Title>Voroshilovgrad</Title>
                <Author id="A003">Serhiy Zhadan</Author>
                <Price>12.99</Price>
            </Book>
        </Books>
    </Bookstore>
</Bookstores>

This XML organizes attributes ‘id’ and ‘category’ and text elements ‘Title,’ ‘Author,’ and ‘Price’ neatly in a hierarchy. Ensuring a well-structured XML simplifies the conversion process, making it more straightforward and error-free.

The goal is to convert this data into a CSV format for easier handling in tabular form. To illustrate, let’s take a look at how the bookstores from our XML data would be represented in the CSV format:

Moving forward, we’ll discuss the methods to achieve this conversion.

3. Converting using XSLT

3.1. Introduction to XSLT

XSLT (Extensible Stylesheet Language Transformations) is a tool that changes XML files into various other formats like HTML, plain text, or even CSV.

It operates by following rules set in a special stylesheet, usually an XSL file. This becomes especially useful when we aim to convert XML to CSV for easier use.

3.2. XSLT Conversion Process

To get started, we’ll need to create an XSLT stylesheet that uses XPath to navigate the XML tree structure and specifies how to convert the XML elements into CSV rows and columns.

Below is an example of such an XSLT file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
    <xsl:template match="/">
        <xsl:text>bookstore_id,book_id,category,title,author_id,author_name,price</xsl:text>
        <xsl:text>&#xA</xsl:text>
        <xsl:for-each select="//Bookstore">
            <xsl:variable name="bookstore_id" select="@id"/>
            <xsl:for-each select="./Books/Book">
                <xsl:variable name="book_id" select="@id"/>
                <xsl:variable name="category" select="@category"/>
                <xsl:variable name="title" select="Title"/>
                <xsl:variable name="author_id" select="Author/@id"/>
                <xsl:variable name="author_name" select="Author"/>
                <xsl:variable name="price" select="Price"/>
                <xsl:value-of select="concat($bookstore_id, ',', $book_id, ',', $category, ',', $title, ',', $author_id, ',', $author_name, ',', $price)"/>
                <xsl:text>&#xA</xsl:text>
            </xsl:for-each>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

This stylesheet first matches the root element and then examines each ‘Bookstore’ node, gathering its attributes and child elements. Like the book’s id, category, and so on, into variables. These variables are then used to build out each row in the CSV file. CSV will have columns for bookstore ID, book ID, category, title, author ID, author name, and price.

The xsl:template sets transformation rules. It targets the XML root with <xsl:template match=”/”> and then defines the CSV header.

The instruction <xsl:for-each select=”//Bookstore”> processes each ‘Bookstore’ node and captures its attributes. Another inner instruction, <xsl:for-each select=”./Books/Book”>, processes each ‘Book‘ within the current ‘Bookstore‘.

The concat() function combines these values into a CSV row.

The adds a line feed (LF) character, corresponding to the ASCII value of 0xA in hexadecimal notation.

Here’s how we can use the Java-based XSLT processor:

void convertXml2CsvXslt(String xslPath, String xmlPath, String csvPath) throws IOException, TransformerException {
    StreamSource styleSource = new StreamSource(new File(xslPath));
    Transformer transformer = TransformerFactory.newInstance()
      .newTransformer(styleSource);
    Source source = new StreamSource(new File(xmlPath));
    Result outputTarget = new StreamResult(new File(csvPath));
    transformer.transform(source, outputTarget);
}

We use TransformerFactory to compile our XSLT stylesheet. Then, we create a Transformer object, which takes care of applying this stylesheet to our XML data, turning it into a CSV file. Once the code runs successfully, a new file will appear in the specified directory.

Using XSLT for XML to CSV conversion is highly convenient and flexible, offering a standardized and powerful approach for most use cases, but it requires loading the whole XML file into memory. This can be a drawback for large files. While it’s perfect for medium-sized data sets, if we have a larger dataset, you might want to consider using StAX, which we’ll get into next.

4. Using StAX

4.1. Introduction to StAX

StAX (Streaming API for XML) is designed to read and write XML files in a more memory-efficient way. It allows us to process XML documents on the fly, making it ideal for handling large files.

Converting using StAX involves three main steps.

  • Initialize the StAX Parser
  • Reading XML Elements
  • Writing to CSV

4.2. StAX Conversion Process

Here’s a full example, encapsulated in a method named convertXml2CsvStax():

void convertXml2CsvStax(String xmlFilePath, String csvFilePath) throws IOException, TransformerException {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();

    try (InputStream in = Files.newInputStream(Paths.get(xmlFilePath)); BufferedWriter writer = new BufferedWriter(new FileWriter(csvFilePath))) {
        writer.write("bookstore_id,book_id,category,title,author_id,author_name,price\n");

        XMLStreamReader reader = inputFactory.createXMLStreamReader(in);

        String currentElement;
        StringBuilder csvRow = new StringBuilder();
        StringBuilder bookstoreInfo = new StringBuilder();

        while (reader.hasNext()) {
            int eventType = reader.next();

            switch (eventType) {
                case XMLStreamConstants.START_ELEMENT:
                    currentElement = reader.getLocalName();
                    if ("Bookstore".equals(currentElement)) {
                        bookstoreInfo.setLength(0);
                        bookstoreInfo.append(reader.getAttributeValue(null, "id"))
                          .append(",");
                    }
                    if ("Book".equals(currentElement)) {
                        csvRow.append(bookstoreInfo)
                          .append(reader.getAttributeValue(null, "id"))
                          .append(",")
                          .append(reader.getAttributeValue(null, "category"))
                          .append(",");
                    }
                    if ("Author".equals(currentElement)) {
                        csvRow.append(reader.getAttributeValue(null, "id"))
                          .append(",");
                    }
                    break;

                case XMLStreamConstants.CHARACTERS:
                    if (!reader.isWhiteSpace()) {
                        csvRow.append(reader.getText()
                          .trim())
                          .append(",");
                    }
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    if ("Book".equals(reader.getLocalName())) {
                        csvRow.setLength(csvRow.length() - 1);
                        csvRow.append("\n");
                        writer.write(csvRow.toString());
                        csvRow.setLength(0);
                    }
                    break;
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

To begin, we initialize the StAX parser by creating an instance of XMLInputFactory. We then use this factory object to generate an XMLStreamReader:

XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream in = new FileInputStream(xmlFilePath);
XMLStreamReader reader = inputFactory.createXMLStreamReader(in);

We use the XMLStreamReader to iterate through the XML file, and based on the event type, such as START_ELEMENT, CHARACTERS, and END_ELEMENT, we build our CSV rows.

As we read the XML data, we build up CSV rows and write them to the output file using a BufferedWriter.

So, in a nutshell, StAX offers a memory-efficient solution that’s well-suited for processing large or real-time XML files. While it may require more manual effort and lacks some of the transformation features of XSLT, it excels in specific scenarios where resource utilization is a concern. With the foundational knowledge and example provided, we are now prepared to use StAX for our XML to CSV conversion needs when those specific conditions apply.

5. Additional Methods

We’ve primarily focused on XSLT and StAX as XML to CSV conversion methods. However, other options like DOM (Document Object Model) parsers, SAX (Simple API for XML) parsers, and Apache Commons CSV also exist.

Yet, there are some factors to consider. DOM parsers are great for loading the whole XML file into memory, giving you the flexibility to traverse and manipulate the XML tree freely. On the other hand, they do make you work a bit harder when you need to transform that XML data into CSV format.

When it comes to SAX parsers, they are more memory-efficient but can present challenges for complex manipulations. Their event-driven nature requires you to manage the state manually, and they offer no option for looking ahead or behind in the XML document, making certain transformations cumbersome.

Apache Commons CSV shines when writing CSV files but expects you to handle the XML parsing part yourself.

In summary, while each alternative has its own advantages, for this example, XSLT and StAX provide a more balanced solution for most XML to CSV conversion tasks.

6. Best Practices

To convert XML to CSV, several factors, such as data integrity, performance, and error handling, need to be considered. Validating the XML against its schema is crucial for confirming the data structure. In addition, proper mapping of XML elements to CSV columns is a fundamental step.

For large files, using streaming techniques like StAX can be advantageous for memory efficiency. Also, consider breaking down large files into smaller batches for easier processing.

It’s important to mention that the code examples provided may not handle special characters found in XML data, including but not limited to commas, newlines, and double quotes. For example, a comma within a field value can conflict with the comma used to delimit fields in the CSV. Similarly, a newline character could disrupt the logical structure of the file.

Addressing such issues can be complex and varies depending on specific project requirements. To work around commas, you can enclose fields in double quotes in the resulting CSV file. That said, to keep the code examples in this article easy to follow, these special cases have not been addressed. Therefore, this aspect should be taken into account for a more accurate conversion.

7. Conclusion

In this article, we explored various methods for converting XML to CSV, specifically diving into the XSLT and StAX methods. Regardless of the method chosen, having a well-suited XML structure for CSV, implementing data validation, and knowing which special characters to handle are essential for a smooth and successful conversion. The code for these examples is available on GitHub.