1. Overview

In this short tutorial, we’re going to learn how to remove all non-alphanumeric characters from a string in Kotlin.

2. Removing Non-Alphanumeric Characters

To remove all non-alphanumeric characters from a string, we can use a regular expression together with the replace() extension function. More specifically, the following regex matches every character that isn’t an ASCII letter or digit:

val nonAlphaNum = "[^a-zA-Z0-9]".toRegex()

The above regular expression matches any character that isn’t a lowercase letter, an uppercase letter, or a digit. The leading ^ inside the brackets negates the character class. Therefore, using this regex, we can get rid of all non-alphanumeric characters in a string:

val text = "This notebook costs 2000€ (including tax)"
val nonAlphaNum = "[^a-zA-Z0-9]".toRegex()
val justAlphaNum = text.replace(nonAlphaNum, "")
assertEquals("Thisnotebookcosts2000includingtax", justAlphaNum)

As shown above, we’re removing the euro sign, parentheses, and space characters.
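To see exactly which characters this regex treats as non-alphanumeric, a small sketch like the following can list every match instead of removing it. The removedCharacters() helper is a hypothetical name introduced only for illustration:

```kotlin
// Hypothetical helper: returns each character the regex would remove.
fun removedCharacters(input: String): List<String> {
    val nonAlphaNum = "[^a-zA-Z0-9]".toRegex()
    // findAll() lazily yields every match; each match is a single character here
    return nonAlphaNum.findAll(input).map { it.value }.toList()
}

fun main() {
    // Prints the euro sign, the space, and both parentheses, one per line
    removedCharacters("2000€ (tax)").forEach { println("'$it'") }
}
```

Listing the matches first can be a handy check before applying replace() to real data.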

3. Supporting Unicode Letters

Unfortunately, this simple regex only recognizes ASCII letters and digits, so it won’t recognize letters and numbers from other languages and scripts. For instance, here, the German umlaut is removed as if it were a non-alphanumeric character:

assertEquals("hnlich", "ähnlich".replace(nonAlphaNum, ""))

As shown above, we’re accidentally removing the “ä” letter in “ähnlich”. The same is true for other Unicode letters and numbers:

assertEquals("", "آب".replace(nonAlphaNum, "")) // water in Persian
assertEquals("", "۴۲".replace(nonAlphaNum, "")) // 42 in Eastern Arabic digits
assertEquals("ao", "año".replace(nonAlphaNum, "")) // year in Spanish

Here, we’re removing a few Persian letters, a number written with Eastern Arabic digits, and a Spanish letter with a tilde, even though they’re all legitimate letters or numbers. To fix this, we have two solutions.

First, we can tune the regex with \p{} property tokens to include Unicode letters and numbers:

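val nonAlphaNum = "[^\\p{L}\\p{M}\\p{N}]".toRegex()

Here, “\\p{L}” matches any Unicode letter, and “\\p{M}” matches the combining marks and accents that can follow a letter. Note that we don’t put a * quantifier after “\\p{M}”: inside a character class, * is just a literal asterisk, so adding it would keep asterisks in the output. Also, “\\p{N}” matches any Unicode number character, which includes the decimal digits of every script. Since “\\p{L}” and “\\p{N}” already cover a-z, A-Z, and 0-9, we don’t need to list those ranges separately. Now, with this regex, we should be able to recognize Unicode letters and digits, as well: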

assertEquals("Thisnotebookcosts2000includingtax", text.replace(nonAlphaNum, ""))
assertEquals("ähnlich", "ähnlich".replace(nonAlphaNum, ""))
assertEquals("آب", "آب".replace(nonAlphaNum, ""))
assertEquals("۴۲", "۴۲".replace(nonAlphaNum, ""))
assertEquals("año", "año".replace(nonAlphaNum, ""))

As shown above, the Unicode-aware regex no longer removes legitimate alphanumeric characters.
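The following is a minimal sketch of how the Unicode property classes behave on a few edge cases: a decomposed accent, a literal asterisk, and a non-decimal number character. The function names are hypothetical, and the “\\p{Nd}” variant is an assumption about what a stricter "decimal digits only" rule might look like:

```kotlin
// Keeps letters, combining marks, and every Unicode number character.
val nonAlphaNumUnicode = "[^\\p{L}\\p{M}\\p{N}]".toRegex()

// Stricter variant (assumption): keeps only decimal digits via \p{Nd}.
val nonAlphaDecimal = "[^\\p{L}\\p{M}\\p{Nd}]".toRegex()

// Hypothetical helper names, introduced only for this example.
fun stripNonAlphanumeric(input: String): String =
    input.replace(nonAlphaNumUnicode, "")

fun stripToLettersAndDecimalDigits(input: String): String =
    input.replace(nonAlphaDecimal, "")

fun main() {
    // "e" followed by U+0301 (combining acute accent) is a decomposed "é";
    // \p{M} keeps the mark attached to its letter.
    println(stripNonAlphanumeric("cafe\u0301!") == "cafe\u0301") // true

    // The asterisk is punctuation, so it is removed.
    println(stripNonAlphanumeric("a*b")) // ab

    // "½" is a number character but not a decimal digit.
    println(stripNonAlphanumeric("½ cup"))           // ½cup
    println(stripToLettersAndDecimalDigits("½ cup")) // cup
}
```

Whether characters such as “½” should count as alphanumeric depends on the use case, so the choice between “\\p{N}” and “\\p{Nd}” is worth making deliberately.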

As a second solution, we can take advantage of the isLetterOrDigit() extension function to filter out the non-alphanumeric characters:

fun String.onlyAlphanumericChars() =
  this.asSequence().filter { it.isLetterOrDigit() }.joinToString("")

Here, we’re using the asSequence() extension function so that the filtering is evaluated lazily, without building an intermediate list of characters along the way. Besides that, the logic is simple: we keep only the alphanumeric characters and join them into a new string.
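As a quick usage sketch, we can run this extension function on the earlier inputs. The second function below, onlyAlphanumericCharsViaFilter(), is a hypothetical alternative that relies on String.filter(), which already returns a String, so no join step is needed:

```kotlin
fun String.onlyAlphanumericChars(): String =
    this.asSequence().filter { it.isLetterOrDigit() }.joinToString("")

// Hypothetical alternative: String.filter() returns a String directly.
fun String.onlyAlphanumericCharsViaFilter(): String =
    this.filter { it.isLetterOrDigit() }

fun main() {
    val text = "This notebook costs 2000€ (including tax)"
    println(text.onlyAlphanumericChars())          // Thisnotebookcosts2000includingtax
    println(text.onlyAlphanumericCharsViaFilter()) // Thisnotebookcosts2000includingtax
    println("ähnlich!".onlyAlphanumericChars())    // ähnlich

    // Unlike a regex with \p{M}, isLetterOrDigit() is false for combining
    // marks, so a decomposed accent (U+0301) is dropped.
    println("cafe\u0301".onlyAlphanumericChars() == "cafe") // true
}
```

Both functions should give the same result. The difference from the regex approach only shows up with decomposed text, where accents are stored as separate combining characters.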

4. Conclusion

In this tutorial, we learned a few ways to remove all non-alphanumeric characters from a string in Kotlin.

As usual, all the examples are available over on GitHub.

