1. Overview

When processing text containing comma-separated values, it may be necessary to ignore commas that occur in quoted sub-strings.

In this tutorial, we’ll explore different approaches for ignoring commas inside quotes when splitting a comma-separated String.

2. Problem Statement

Suppose we need to split the following comma-separated input:

String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";

After splitting this input and printing the result, we’d expect the following output:

baeldung
tutorial
splitting
text
"ignoring this comma,"

In other words, we cannot consider all comma characters as being separators. We must ignore the commas that occur inside quoted sub-strings.
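To see why a plain split is not enough, here is a minimal sketch that splits the same input on every comma. The class and method names (NaiveSplitDemo, naiveSplit) are made up for this illustration:

```java
import java.util.Arrays;

class NaiveSplitDemo {

    // Splits on every comma, with no awareness of quotes.
    static String[] naiveSplit(String input) {
        return input.split(",");
    }

    public static void main(String[] args) {
        String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
        // Prints six lines instead of five: the quoted value is cut at its inner comma.
        Arrays.stream(naiveSplit(input)).forEach(System.out::println);
    }
}
```

The last two tokens come out as "ignoring this comma (with only the opening quote) and a lone closing quote, which is exactly the behavior we want to avoid.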

3. Implementing a Simple Parser

Let’s create a simple parsing algorithm:

List<String> tokens = new ArrayList<>();
int startPosition = 0;
boolean isInQuotes = false;
for (int currentPosition = 0; currentPosition < input.length(); currentPosition++) {
    if (input.charAt(currentPosition) == '\"') {
        isInQuotes = !isInQuotes;
    }
    else if (input.charAt(currentPosition) == ',' && !isInQuotes) {
        tokens.add(input.substring(startPosition, currentPosition));
        startPosition = currentPosition + 1;
    }
}

tokens.add(input.substring(startPosition));

Here, we start by defining a List called tokens, which stores all the comma-separated values.

Next, we iterate over the characters in the input String.

In each loop iteration, we check whether the current character is a double quote. When we find one, we toggle the isInQuotes flag: an opening quote sets it to true, so that the commas that follow are ignored, and the matching closing quote sets it back to false.

When isInQuotes is false and we find a comma character, we add a new token to the tokens list. The new token contains the characters from startPosition up to, but not including, the comma.

Then, the new startPosition becomes the position after the comma character.

Finally, after the loop, we still have the last token, which goes from startPosition to the end of the input. Therefore, we use the substring() method to get it and add it to the tokens list. If the input ends with a separator comma, startPosition equals the input length, so this last token is an empty string.

Now, let’s test the parsing code:

String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
var matcher = contains("baeldung", "tutorial", "splitting", "text", "\"ignoring this comma,\"");
assertThat(splitWithParser(input), matcher);

Here, we’ve implemented our parsing code in a static method called splitWithParser. Then, in our test, we define a simple test input containing a comma enclosed by double quotes. Next, we use the Hamcrest testing framework to create a contains matcher for the expected output. Finally, we use the assertThat testing method to check if our parser returns the expected output.

In an actual scenario, we should create more unit tests to verify the behavior of our algorithm with other possible inputs.
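As a starting point for such tests, the following sketch wraps the parsing loop in a splitWithParser method and runs it against a few additional inputs. The enclosing class name (ParserDemo) and the sample inputs are assumptions chosen for illustration; the method adds whatever remains after the last separator as the final token:

```java
import java.util.ArrayList;
import java.util.List;

class ParserDemo {

    // Splits on commas, except for commas found between double quotes.
    static List<String> splitWithParser(String input) {
        List<String> tokens = new ArrayList<>();
        int startPosition = 0;
        boolean isInQuotes = false;
        for (int currentPosition = 0; currentPosition < input.length(); currentPosition++) {
            char current = input.charAt(currentPosition);
            if (current == '"') {
                // Every double quote toggles the "inside quotes" state.
                isInQuotes = !isInQuotes;
            } else if (current == ',' && !isInQuotes) {
                tokens.add(input.substring(startPosition, currentPosition));
                startPosition = currentPosition + 1;
            }
        }
        // Whatever remains after the last separator is the final token.
        tokens.add(input.substring(startPosition));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitWithParser("a,,b").size());        // empty middle field: 3 tokens
        System.out.println(splitWithParser("a,b,").size());        // trailing comma: 3 tokens, the last one empty
        System.out.println(splitWithParser("\"x,y\",z").size());   // quoted comma: 2 tokens
        System.out.println(splitWithParser("").size());            // empty input: 1 empty token
    }
}
```

Each of these inputs would be a natural candidate for its own unit test.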

4. Applying Regular Expressions

Implementing a parser is an efficient approach. However, the resulting algorithm is relatively large and complex. Thus, as an alternative, we can use regular expressions.

Next, we will discuss two possible implementations that rely on regular expressions. Nevertheless, they should be used with caution as their processing time is high compared to the previous approach. Therefore, using regular expressions for this scenario can be prohibitive when processing large volumes of input data.

4.1. String split() Method

In this first regular expression option, we’ll use the split() method from the String class. This method splits the String around matches of the given regular expression:

String[] tokens = input.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);

At first glance, the regular expression may seem highly complex. However, its functionality is relatively simple.

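In short, the positive lookahead tells split() to treat a comma as a separator only if the rest of the input after it contains an even number of double quotes (possibly zero). A comma inside a quoted sub-string is always followed by an odd number of double quotes, so it's never matched.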
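To make the lookahead more concrete, the sketch below annotates the same pattern piece by piece and lists the positions of the commas it matches. The class name, the helper method, and the short sample input are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {

    // ,                      a literal comma
    // (?= ... )              followed by, without consuming it:
    //   (?:[^"]*"[^"]*")*    zero or more pairs of double quotes
    //   [^"]*$               and then no further quotes up to the end of the input
    static final Pattern SEPARATOR = Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Returns the indexes of the commas that the pattern treats as separators.
    static List<Integer> separatorIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher matcher = SEPARATOR.matcher(input);
        while (matcher.find()) {
            indexes.add(matcher.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Input: a,"b,c",d
        // The commas at indexes 1 and 7 match; the comma at index 4 is followed
        // by a single quote, so the lookahead fails there.
        System.out.println(separatorIndexes("a,\"b,c\",d")); // prints [1, 7]
    }
}
```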

The last parameter of the split() method is the limit. When we provide a negative limit, the pattern is applied as many times as possible, the resulting array of tokens can have any length, and trailing empty strings are kept.
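The effect of the limit is easiest to see with an input that ends in empty fields. In this small sketch, the class name, the helper method, and the sample input are illustrative assumptions:

```java
class SplitLimitDemo {

    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Thin wrapper so that both limits can be compared side by side.
    static String[] split(String input, int limit) {
        return input.split(REGEX, limit);
    }

    public static void main(String[] args) {
        String input = "a,\"b,c\",,"; // two empty fields at the end

        // A limit of 0 (what split(regex) uses) discards trailing empty strings.
        System.out.println(split(input, 0).length);  // prints 2

        // A negative limit keeps them.
        System.out.println(split(input, -1).length); // prints 4
    }
}
```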

4.2. Guava’s Splitter Class

Another alternative based on regular expressions is the use of the Splitter class from the Guava library:

Pattern pattern = Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Splitter splitter = Splitter.on(pattern);
List<String> tokens = splitter.splitToList(input);

Here, we are creating a splitter object based on the same regular expression pattern as before. After creating the splitter, we use the splitToList() method, which returns a List of tokens after splitting the input String.
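If adding Guava isn't an option, a similar effect can be sketched with the JDK alone by compiling the pattern once and reusing it. This is a swapped-in variant, not the Guava API; the class and method names are assumptions for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledSplitDemo {

    // Compiled once and reused, so the regular expression isn't recompiled on every call.
    private static final Pattern SEPARATOR = Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // JDK-only sketch: Pattern.split() with a negative limit keeps trailing empty tokens.
    static List<String> splitWithPattern(String input) {
        return Arrays.asList(SEPARATOR.split(input, -1));
    }

    public static void main(String[] args) {
        String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
        splitWithPattern(input).forEach(System.out::println);
    }
}
```

Reusing the compiled Pattern avoids the compilation cost on each call, although the lookahead itself remains expensive on long inputs.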

5. Using a CSV Library

Although the approaches presented so far work for simple inputs, it may be necessary to use a CSV parsing library such as OpenCSV.

Using a CSV library has the advantage of requiring less effort, as we don’t need to write a parser or a complex regular expression. As a result, our code ends up being less error-prone and easier to maintain.

Moreover, a CSV library may be the best approach when we are not sure about the shape of our input. For example, the input may have escaped quotes, which would not be properly handled by previous approaches.
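As an illustration of that limitation, the sketch below feeds the regular expression from the previous section an input whose quoted field contains backslash-escaped quotes. The escaping convention, the class name, and the sample input are assumptions made for this example:

```java
class EscapedQuoteDemo {

    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] splitWithRegex(String input) {
        return input.split(REGEX, -1);
    }

    public static void main(String[] args) {
        // Raw text: a,"he said \"hi, there\"",b
        String input = "a,\"he said \\\"hi, there\\\"\",b";

        // We'd want three fields, but the escaped quotes are counted like regular
        // quotes, so the comma after "hi" is treated as a separator.
        for (String token : splitWithRegex(input)) {
            System.out.println(token);
        }
    }
}
```

The split yields four tokens instead of three, because neither the regular expression nor the simple parser knows anything about escape characters.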

To use OpenCSV, we need to include it as a dependency. In a Maven project, we include the opencsv dependency:

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>5.8</version>
</dependency>

Then, we can use OpenCSV as follows:

CSVParser parser = new CSVParserBuilder()
  .withSeparator(',')
  .build();

List<String[]> lines;
try (CSVReader reader = new CSVReaderBuilder(new StringReader(input))
  .withCSVParser(parser)
  .build()) {
    lines = reader.readAll();
}

Using the CSVParserBuilder class, we start by creating a parser with a comma separator. Then, we use the CSVReaderBuilder to create a CSV reader based on our comma-based parser.

In our example, we provide a StringReader as an argument to the CSVReaderBuilder constructor. However, we can use different readers (e.g., a file reader) if required.

Finally, we call the readAll() method from our reader object to get a List of String arrays. Since OpenCSV is designed to handle multi-line inputs, each position in the lines list corresponds to a line from our input. Thus, for each line, we have a String array with the corresponding comma-separated values.

Unlike previous approaches, with OpenCSV, the double quotes are removed from the generated output.
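If we want the tokens from the parser or the regular expressions without their quotes as well, one possible post-processing step is sketched below. The helper name stripEnclosingQuotes is hypothetical, and the sketch only removes one pair of enclosing double quotes; it doesn't handle escaped quotes:

```java
class StripQuotesDemo {

    // Removes one pair of enclosing double quotes, if present; otherwise returns the token unchanged.
    static String stripEnclosingQuotes(String token) {
        if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
            return token.substring(1, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripEnclosingQuotes("\"ignoring this comma,\"")); // prints ignoring this comma,
        System.out.println(stripEnclosingQuotes("baeldung"));                 // prints baeldung
    }
}
```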

6. Conclusion

In this article, we explored multiple alternatives for ignoring commas in quotes when splitting a comma-separated String. Besides learning how to implement our own parser, we explored the use of regular expressions and the OpenCSV library.

As always, the code samples used in this tutorial are available over on GitHub.