1. Introduction

The CSV file is a versatile data format: it’s compressible, human-readable, and supported by all the major spreadsheet applications (MS Excel, Google Sheets, LibreOffice). Even more importantly, we can easily split a CSV file into several smaller files or combine several CSV files into one. This enables parallel processing and makes the automated gathering of data straightforward.

How does Kotlin fare with CSV files?

Well, for one, Kotlin’s focus on functional programming makes batch jobs easy to express. In this tutorial, we’ll look at the means that the language itself provides. We’ll also touch on the libraries that facilitate CSV streaming and batch processing.

2. Read and Write CSV With Pure Kotlin

CSV is a simple format. However, unless we can guarantee that our data source is reliable and our data are mostly numbers, things quickly become complex. So, while it’s entirely possible to write a simple parser in pure Kotlin, using one of the libraries helps us cover many edge cases.

Let’s find a sample CSV file and try to read it:

"Year", "Score", "Title"
1968,  86, "Greetings"
1970,  17, "Bloody Mama"
1970,  73, "Hi, Mom!"

This file lists movies that starred Robert De Niro, along with their ratings and the years they were filmed. Let’s assume we want to parse that file into a structure:

data class Movie(
    val year: Year,
    val score: Int,
    val title: String,
)

We can assume some things: that the first row will be a header, each subsequent row will have three fields, and the first two of them will be numbers. We only need to account for an empty line at the end of the file and the possibility that fields may contain some extra spaces:

import java.io.InputStream
import java.time.Year

fun readCsv(inputStream: InputStream): List<Movie> {
    val reader = inputStream.bufferedReader()
    reader.readLine() // Skip the header row
    return reader.lineSequence()
        .filter { it.isNotBlank() } // Ignore the empty line at the end of the file
        .map {
            // limit = 3 keeps any commas inside the last field (the title) intact
            val (year, rating, title) = it.split(',', ignoreCase = false, limit = 3)
            Movie(Year.of(year.trim().toInt()), rating.trim().toInt(), title.trim().removeSurrounding("\""))
        }.toList()
}
val movies = readCsv(/*Open a stream to CSV file*/)

The problems will start to appear if the title column becomes the first one. As some of the movie titles contain commas (“New York, New York”), we can’t use a simple split anymore. However, in simple cases, this solution might just be enough.
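If the title could appear in any column, one option is to scan the line character by character and track whether we’re inside a quoted field. The following is only a minimal sketch, not a complete parser: splitCsvLine is a hypothetical helper name, and it assumes that a record never spans several lines and that a doubled quote inside a quoted field stands for one literal quote:

```kotlin
// Hypothetical helper (not part of the original tutorial): splits one CSV line
// into fields, treating commas inside double quotes as part of the field.
fun splitCsvLine(line: String): List<String> {
    val fields = mutableListOf<String>()
    val current = StringBuilder()
    var inQuotes = false
    var i = 0
    while (i < line.length) {
        val c = line[i]
        when {
            // Assumption: "" inside a quoted field means one literal quote
            c == '"' && inQuotes && i + 1 < line.length && line[i + 1] == '"' -> {
                current.append('"')
                i++ // Skip the second quote of the pair
            }
            // Any other quote opens or closes a quoted field
            c == '"' -> inQuotes = !inQuotes
            // A comma outside quotes ends the current field
            c == ',' && !inQuotes -> {
                fields += current.toString().trim()
                current.setLength(0)
            }
            else -> current.append(c)
        }
        i++
    }
    fields += current.toString().trim()
    return fields
}
```

With this helper, the row 1970,  73, "Hi, Mom!" splits into the three fields 1970, 73, and Hi, Mom! regardless of where the title column sits. Note that the sketch also trims whitespace at the edges of quoted values, which a stricter parser would preserve.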

Writing a CSV file is actually much easier without any library:

fun OutputStream.writeCsv(movies: List<Movie>) {
    val writer = bufferedWriter()
    writer.write(""""Year", "Score", "Title"""")
    writer.newLine()
    movies.forEach {
        writer.write("${it.year}, ${it.score}, \"${it.title}\"")
        writer.newLine()
    }
    writer.flush()
}
FileOutputStream("filename.csv").use { it.writeCsv(movies) }

Of course, this writer is highly specialized and only works for our Movie class. On the other hand, any library requires some middleware logic to translate specific models to generic types supported by library writers.
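To illustrate what a more generic writer involves, here’s a rough sketch that accepts rows of arbitrary values and quotes a field only when it has to. Both escapeCsvField and writeCsvRows are hypothetical names introduced for this example, and the quoting rule (wrap in quotes, double any embedded quote) is our assumption about what the consumer of the file expects:

```kotlin
import java.io.Writer

// Hypothetical helper: quotes a value only when it contains a separator,
// a quote, or a line break, and doubles any embedded quotes.
fun escapeCsvField(value: Any?): String {
    val text = value?.toString() ?: ""
    val needsQuoting = text.any { it == ',' || it == '"' || it == '\n' || it == '\r' }
    return if (needsQuoting) "\"" + text.replace("\"", "\"\"") + "\"" else text
}

// Hypothetical generic writer: every row is just a list of values,
// so the caller decides how a model object maps to columns.
fun Writer.writeCsvRows(header: List<String>, rows: List<List<Any?>>) {
    write(header.joinToString(",") { escapeCsvField(it) })
    write("\n")
    rows.forEach { row ->
        write(row.joinToString(",") { escapeCsvField(it) })
        write("\n")
    }
    flush()
}
```

For our movies, the translation step would be something like movies.map { listOf(it.year, it.score, it.title) }, which is exactly the kind of middleware logic mentioned above.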

Let’s find a data sample that is more complex:

"Index", "Item", "Cost", "Tax", "Total"
 1, "Fruit of the Loom Girl's Socks",  7.97, 0.60,  8.57
 2, "Banana Boat Sunscreen, 8 oz",     6.68, 0.50,  7.18

This file was formatted to be extra human-readable. For our algorithm, however, it presents an extra challenge, as there are not only leading spaces but also leading tabs. It also has a string field in the second column, and that field may itself contain commas. Such a column is hard to parse with only Kotlin language tools.
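We can see the difficulty directly on the second data row. The sketch below first counts the fields that a plain split produces and then tries a regex-based alternative. The names naiveFieldCount and splitOutsideQuotes are hypothetical, and the regex assumes that quotes are always balanced within a line:

```kotlin
// Second data row of the taxable goods sample
val sampleRow = " 2, \"Banana Boat Sunscreen, 8 oz\",     6.68, 0.50,  7.18"

// A plain split counts the comma inside the quoted item name as a separator
fun naiveFieldCount(line: String): Int = line.split(',').size

// Hypothetical alternative: split only on commas that are followed by an even
// number of quotes, that is, commas that sit outside any quoted field
val commaOutsideQuotes = Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\$)")

fun splitOutsideQuotes(line: String): List<String> =
    line.split(commaOutsideQuotes).map { it.trim().removeSurrounding("\"") }
```

Here, naiveFieldCount(sampleRow) returns 6 instead of the expected 5, while splitOutsideQuotes(sampleRow) keeps the item name in one piece. The lookahead rescans the rest of the line for every comma, so this approach is better suited to short lines than to bulk processing.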

3. kotlin-csv Library

There are in fact pure Kotlin libraries for dealing with CSV, notably kotlin-csv. However, none of them has yet achieved the status of the de-facto standard. Moreover, kotlin-csv has some problems dealing with slightly relaxed CSV formats, like the one we quoted earlier. Specifically, we noticed three problems with the kotlin-csv library and the taxable goods format:

  1. When run with the default settings, it fails to process a quote (") in a field with leading spaces: the header "Index", "Item", "Cost", "Tax", "Total" has spaces after the commas.
  2. With escapeChar = '\\', it gets further, but then produces for each row a map with keys that include the leading spaces and quotes.
  3. Consider row 2 of the second example above, which has a comma within the item name. This library fails when it discovers a comma in the item name, assuming that this row has six columns instead of five.

However, if our file follows the strict CSV format, then parsing with kotlin-csv is easy:

fun readStrictCsv(inputStream: InputStream): List<TaxableGood> = csvReader().open(inputStream) {
    readAllWithHeaderAsSequence().map {
        TaxableGood(
            it["Index"]!!.trim().toInt(),
            it["Item"]!!.trim(),
            BigDecimal(it["Cost"]),
            BigDecimal(it["Tax"]),
            BigDecimal(it["Total"])
        )
    }.toList()
}

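What’s significant here is that we don’t need to load the whole file before we start: the sequence hands us the rows one by one. Sometimes, this is an important point to consider in terms of memory requirements. Note that the final toList() in this example still collects all the mapped rows, so to keep memory usage flat, we’d process or aggregate the rows inside the block instead.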
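The same row-by-row idea can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not kotlin-csv code: sumLastColumn is a hypothetical function, and it assumes a strict file whose last column is numeric and contains no quoted commas:

```kotlin
import java.io.InputStream
import java.math.BigDecimal

// Hypothetical example: adds up the last column of a strict CSV stream
// without ever holding more than one row in memory.
fun sumLastColumn(inputStream: InputStream): BigDecimal =
    inputStream.bufferedReader().useLines { lines ->
        lines.drop(1)                        // Skip the header row
            .filter { it.isNotBlank() }      // Ignore empty lines
            .map { BigDecimal(it.substringAfterLast(',').trim()) }
            .fold(BigDecimal.ZERO) { acc, value -> acc.add(value) }
    }
```

Because useLines exposes the file as a lazy sequence and fold consumes it one element at a time, no intermediate list of rows is ever built. The reader is also closed automatically when the block finishes.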

4. Apache CSV Library

Since kotlin-csv struggles with our relaxed format, let’s turn to a JVM-world classic from the Apache Commons family: Apache CSV.

Reading with the Apache CSV library is really straightforward:

fun readCsv(inputStream: InputStream): List<TaxableGood> =
    CSVFormat.Builder.create(CSVFormat.DEFAULT).apply {
        setIgnoreSurroundingSpaces(true)
    }.build().parse(inputStream.reader())
        .drop(1) // Dropping the header
        .map {
            TaxableGood(
                index = it[0].toInt(),
                item = it[1],
                cost = BigDecimal(it[2]),
                tax = BigDecimal(it[3]),
                total = BigDecimal(it[4])
            )
        }

Apart from the DEFAULT format, there are other variations. RFC4180 is similar to DEFAULT, but it doesn’t ignore empty lines. EXCEL allows some columns to lack a header. TDF defines the tab-delimited (.tsv) format. The formats also support the ability to add a comment to the whole file (which will be printed before the main body), define a comment marker to insert (or safely ignore) a comment line within the document itself, and several other settings.

In our reader, we tell the parser to ignore the spaces surrounding the values, and then we address the row fields like array elements.

Writing is even easier:

fun Writer.writeCsv(goods: List<TaxableGood>) {
    CSVFormat.DEFAULT.print(this).apply {
        printRecord("Index", "Item", "Cost", "Tax", "Total")
        goods.forEach { (index, item, cost, tax, total) -> printRecord(index, item, cost, tax, total) }
    }
}

However, as CSV doesn’t actually have the concept of a “pretty” format, we can’t serialize our data to look exactly like input data, with aligning tabs.
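If we do need aligned output, we have to produce it ourselves. The following is a small sketch in plain Kotlin rather than a feature of Apache CSV: alignColumns is a hypothetical helper, and it assumes that every cell has already been converted to a string and quoted where necessary:

```kotlin
// Hypothetical helper: pads every cell to the width of the widest cell
// in its column, so that the columns line up when viewed as plain text.
fun alignColumns(rows: List<List<String>>): List<String> {
    if (rows.isEmpty()) return emptyList()
    val columnCount = rows.maxOf { it.size }
    // Width of the widest cell in each column
    val widths = (0 until columnCount).map { col ->
        rows.maxOf { it.getOrElse(col) { "" }.length }
    }
    return rows.map { row ->
        row.mapIndexed { col, cell ->
            // The last cell of a row needs no separator and no padding
            if (col == row.lastIndex) cell else (cell + ",").padEnd(widths[col] + 2)
        }.joinToString("")
    }
}
```

Each returned string is one line of the file. The padding spaces become part of the data, so whoever reads the file back needs to ignore surrounding spaces, as we did with setIgnoreSurroundingSpaces(true) in our reader.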

5. FasterXML Jackson CSV Library

The Jackson library might not be the lightest option. It requires some setup, and the mapper instance is heavy, so we should create it once and reuse it. But it’s known to be quite stable and integrates well with JVM types.

If we try to parse the same taxable goods file, we’ll need to define the mapper and the file schema first:

val csvMapper = CsvMapper().apply {
    enable(CsvParser.Feature.TRIM_SPACES)
    enable(CsvParser.Feature.SKIP_EMPTY_LINES)
}

val schema = CsvSchema.builder()
    .addNumberColumn("Index")
    .addColumn("Item")
    .addColumn("Cost")
    .addColumn("Tax")
    .addColumn("Total")
    .build()

We configure CsvMapper to accept our deviations from the strict format: the empty line at the end of the file and the tabs and spaces surrounding the data.

After that, we need to annotate our data class with the JsonProperty annotation because the names of our columns don’t exactly match the field names of the data class:

data class TaxableGood(
    @field:JsonProperty("Index") val index: Int,
    @field:JsonProperty("Item") val item: String?,
    @field:JsonProperty("Cost") val cost: BigDecimal?,
    @field:JsonProperty("Tax") val tax: BigDecimal?,
    @field:JsonProperty("Total") val total: BigDecimal?
) {
    constructor() : this(0, "", BigDecimal.ZERO, BigDecimal.ZERO, BigDecimal.ZERO)
}

We also need a zero-arg constructor for the Jackson parser. We can create it manually, like in the example, or else use the no-arg compiler plugin.

Then we can open the stream and read the data:

fun readCsv(inputStream: InputStream): List<TaxableGood> =
    csvMapper.readerFor(TaxableGood::class.java)
        .with(schema.withSkipFirstDataRow(true))
        .readValues<TaxableGood>(inputStream)
        .readAll()

Note how we skip the header with withSkipFirstDataRow(true). The writing is also pretty straightforward:

fun OutputStream.writeCsv(goods: List<TaxableGood>) {
    csvMapper.writer().with(schema.withHeader()).writeValues(this).writeAll(goods)
}

Thanks to the schema, the writer knows the number and the order of the columns.

The advantage of the Jackson library is that we can parse CSV rows straight into the data class objects, which are easy to deal with.

6. Conclusion

In this tutorial, we read and wrote CSV files using several approaches. We can use pure Kotlin with only the standard library to write a parser for a format that won’t change much during the life of our software. Such a parser will depend heavily on the data source and will require significant effort to make it versatile and generic.

Therefore, Apache CSV or Jackson CSV are usually better solutions, providing better support for slight irregularities in the data. Apache CSV is a part of the Apache Commons family and, as such, is extremely stable. Jackson provides a nice bridge between tokenized rows and data classes. In terms of configuration, Jackson is slightly more demanding, requiring the creation of a parser and a schema beforehand, whereas Apache CSV is good to go in just one call.

The kotlin-csv library might be good for most CSV files, but it requires them to follow the standard strictly.

As always, all the code samples used in this tutorial can be found on GitHub.