1. Overview
In this tutorial, we study the basic concepts behind regular expressions and their applications.
We start by describing the syntax of regular expressions and, along the way, learn how to build expressions of increasing complexity.
Then, we’ll see some guided exercises in the usage of regular expressions for the solution of practical tasks. We’ll thus learn how to reason over a problem in order to formalize it in a manner that allows us to solve it through regular expressions.
At the end of this tutorial, we’ll know how to build regular expressions for the solution of practical problems related to the parsing of strings and pattern matching.
2. Introductory Notes to Regular Expressions
Regular expressions (short form: RegEx, plural RegExes) are formulas that identify one or more sequences of characters that we’re searching for in a string. The idea behind them is that if we’re faced with a text whose content is at least partially unknown, we want to be able to extract the parts of it that satisfy some arbitrary conditions.
The usage of RegExes is common in all sorts of tasks for natural language processing. This includes pattern matching itself, but also tokenization, stemming and lemmatization, parsing of words and sentences, string replacement, and document and information retrieval.
RegExes are so common that they’re implemented in most programming languages, including Kotlin, Java, Scala, Groovy, and AWK, and we can find the language-specific implementation in the relevant tutorials on our website.
3. Syntax of Regular Expressions
3.1. Single Characters
One quick note on notation before we dive in. Throughout this article, we’ll indicate with r a RegEx operating on a string s. The application of r to s, the so-called pattern matching, we indicate with r(s), in order to remain language-agnostic. We also sporadically use the variable r′ to indicate alternative RegExes that we want to compare with a main RegEx r.
The simplest RegEx is the one that searches for all instances of one given character in a string. Say that we’re looking for all occurrences of the character “A” in the string s = “aaAbbBccC”. The RegEx that we use to accomplish this task corresponds to the character being sought, such that r = “A” → r(s) = “A”.
Notice that if we’re searching for a character that occurs twice or more, such as with the RegEx “a”, the same search would normally return multiple values: r = “a” → r(s) = [“a”, “a”].
3.2. Groups of Characters and Square Brackets
Let’s now imagine that we want to find all groups of characters containing the letter “a” twice in a row, as in “aa”. In this case, the regular expression “aa” would return the combination of the two characters, but not the two isolated instances of the same letter: r = “aa” ∧ s = “aaAbbBccC” → r(s) = [“aa”] ≠ [“a”, “a”].
We can also, however, search for all occurrences of the letter “a” regardless of its case. This means that we accept as a match both “a” and “A”, whenever either of these occurs in the string. In this case, we can create a RegEx that contains square brackets, within which we indicate the two possible alternative letters:
r = “[aA]” ∧ s = “aaAbbBccC” → r(s) = [“a”, “a”, “A”]
Finally, we can specify that we’re interested only in groups of letters starting with lower-case “a”, whenever this is followed by another lower-case “a” or upper-case “A”. This corresponds to specifying the following RegEx:
r = “a[aA]” ∧ s = “aabaA” → r(s) = [“aa”, “aA”]
Notice how the first lower-case “a” is outside of the square brackets, and how we inserted the two alternative cases for “a” and “A” inside the square brackets. Notice also how the same RegEx wouldn’t find a string of the form “Aa”, where the first character is upper-case:
r = “a[aA]” ∧ s = “Aa” → r(s) = ∅
To find the group “Aa” in the string “Aa”, we need to invert the order of the terms in the RegEx:
r = “[aA]a” ∧ s = “Aa” → r(s) = “Aa”
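The examples so far can be reproduced in any RegEx engine. As a minimal sketch, assuming Python's standard `re` module (the article itself is language-agnostic, so this choice and the variable names are ours), the matches for literal characters and square brackets look like this:

```python
import re

s = "aaAbbBccC"

# A literal character matches every occurrence of itself.
upper_a = re.findall(r"A", s)                # ['A']
lower_a = re.findall(r"a", s)                # ['a', 'a']

# Two literals in a row match only the adjacent pair, as one result.
pair = re.findall(r"aa", s)                  # ['aa']

# Square brackets list the alternatives accepted at a single position.
any_case = re.findall(r"[aA]", s)            # ['a', 'a', 'A']
a_then_any = re.findall(r"a[aA]", "aabaA")   # ['aa', 'aA']
no_match = re.findall(r"a[aA]", "Aa")        # []
swapped = re.findall(r"[aA]a", "Aa")         # ['Aa']
```

Here `re.findall` returns a list of all non-overlapping matches, which plays the role of the result of applying the RegEx to the string.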
3.3. Range of Characters
We can also use the square brackets to indicate a range of characters, rather than an explicit list. If we, for example, wanted to find any instances of upper-case letters in the string “aaAbbBccC”, we could define this RegEx:
r = “[ABC]” ∧ s = “aaAbbBccC” → r(s) = [“A”, “B”, “C”]
The same expression can, however, be rewritten by using the operator “-”. The hyphen, when preceded and followed by another character inside square brackets, indicates all characters contained between those two extremes:
r = “[A-C]” ∧ s = “aaAbbBccC” → r(s) = [“A”, “B”, “C”]
We could similarly define a RegEx “[A-Z]” that matches all upper-case letters of the English alphabet. The RegEx “[a-z]” similarly matches all lower-case letters, which implies that the union of “[A-Z]” and “[a-z]” matches all letters of the alphabet. This union can be written as a single RegEx:
r = “[a-zA-Z]” ∧ s = “aaAbbBccC” → r(s) = [“a”, “a”, “A”, “b”, “b”, “B”, “c”, “c”, “C”]
The expression that matches all alphanumeric characters can be created by adding the range “[0-9]” to the previous expression, as:
r = “[a-zA-Z0-9]” ∧ s = “abcABC123” → r(s) = [“a”, “b”, “c”, “A”, “B”, “C”, “1”, “2”, “3”]
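A short sketch of the range syntax, again assuming Python's `re` module as the engine, shows that the explicit list and the range are interchangeable and that several ranges can share the same brackets:

```python
import re

s = "aaAbbBccC"

# An explicit list and the equivalent range give the same matches.
explicit = re.findall(r"[ABC]", s)
ranged = re.findall(r"[A-C]", s)

# Several ranges can be concatenated inside one pair of brackets.
letters = re.findall(r"[a-zA-Z]", s)
alphanumeric = re.findall(r"[a-zA-Z0-9]", "abcABC123")
```

Every character of both sample strings falls into one of the ranges, so `letters` and `alphanumeric` contain each character of their input, in order.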
3.4. Some Special Characters: Words and Digits
We can also use a shortcut that indicates a range containing all characters, which include all alphanumeric characters with the addition of the punctuation marks, the symbols for currencies, and all others with an exception that we’ll see shortly. This shortcut is simply the single dot “.”, written without square brackets:
r = “.” ∧ s = “aAbB12-?” → r(s) = [“a”, “A”, “b”, “B”, “1”, “2”, “-”, “?”]
Notice that this is a slight simplification, and in practice, the dot doesn’t always match all characters. In particular, in most RegEx engines the dot doesn’t match the newline character “\n” unless we enable a dedicated option. If we want to refer to the literal dot “.” instead, we can precede the dot in the RegEx with a backslash, as in “\.”.
Other special characters include “\w”, which matches all alphanumeric characters plus the underscore:
s = “a_B1?” ∧ (r = “\w” ∨ r = “[a-zA-Z0-9_]”) → r(s) = [“a”, “_”, “B”, “1”]
Its complement, “\W”, matches all characters that aren’t alphanumeric or underscores:
r = “\W” ∧ s = “a_B1?” → r(s) = “?”
We can similarly use the RegEx “\d” to match all digits:
r = “\d” ∧ s = “a_B1?” → r(s) = “1”
Its complement, “\D”, matches instead all non-digit characters:
r = “\D” ∧ s = “a_B1?” → r(s) = [“a”, “_”, “B”, “?”]
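The shortcuts above can be checked with a small sketch that assumes Python's `re` module. One engine-specific detail worth labeling: for Python text strings, “\w” and “\d” also match non-ASCII letters and digits unless the `re.ASCII` flag is passed, so the equivalence with “[a-zA-Z0-9_]” holds exactly only in ASCII mode.

```python
import re

s = "a_B1?"

word_chars = re.findall(r"\w", s)     # ['a', '_', 'B', '1']
non_word = re.findall(r"\W", s)       # ['?']
digits = re.findall(r"\d", s)         # ['1']
non_digits = re.findall(r"\D", s)     # ['a', '_', 'B', '?']

# With re.ASCII, \w is exactly the class [a-zA-Z0-9_].
ascii_word_chars = re.findall(r"\w", s, flags=re.ASCII)
explicit_class = re.findall(r"[a-zA-Z0-9_]", s)

# The dot matches any character except, by default, the newline.
any_char = re.findall(r".", "aAbB12-?")
dot_and_newline = re.findall(r".", "a\nb")   # ['a', 'b']

# A backslash turns the dot into a literal dot.
literal_dot = re.findall(r"\.", "a.b")       # ['.']
```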
3.5. Empty Strings and Blank Spaces
In relation to words, we can also use the RegEx “\b”, which matches the empty string at a word boundary, that is, immediately before or after a sequence of the alphanumeric characters or underscores that “\w” would match. This RegEx is normally used in contexts where we’re searching for a full word, but we’re not interested in cases where the same string occurs as part of a longer word.
Say we’re searching for instances of the word “ship” in a text, but we want to avoid compound words such as “flagship”. We could then use a RegEx containing the word “ship” wrapped between two “\b” patterns:
r = “\bship\b” ∧ s = “flagship” → r(s) = ∅; s = “ship” → r(s) = “ship”
Another important pattern when working with words is the RegEx for blank spaces, indicated with “\s”. This pattern matches not only the space “ ”, but also tab “\t”, newline “\n”, and carriage return “\r”. As was the case before, its complement “\S” matches all characters that aren’t whitespace characters.
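As an illustration, and assuming Python's `re` module, the following sketch contrasts a search with and without word boundaries and lists what “\s” and “\S” pick up. The sample sentence is our own.

```python
import re

sentence = "the flagship is a ship"

# Without boundaries, "ship" is also found inside "flagship".
unanchored = re.findall(r"ship", sentence)       # ['ship', 'ship']

# With \b on both sides, only the standalone word matches.
whole_word = re.findall(r"\bship\b", sentence)   # ['ship']
inside_only = re.findall(r"\bship\b", "flagship")  # []

# \s matches space, tab, newline, and carriage return; \S is its complement.
spaces = re.findall(r"\s", "a b\tc\nd\re")       # [' ', '\t', '\n', '\r']
non_spaces = re.findall(r"\S", "a b\tc")         # ['a', 'b', 'c']
```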
3.6. Quantifiers
We can now move to quantifiers, which indicate multiple repetitions of the same character or group of characters. All quantifiers must be written immediately to the right of the character or group to which they refer.
The simplest quantifier is “?”, which means “zero or one” repetitions of the character or group that it follows. If we, for example, don’t know whether a certain “Smith” is a man or a woman, we could create a RegEx that matches both “Mr” and “Mrs”:
r = “Mrs?” ∧ s = “Mr Smith” → r(s) = “Mr”; s = “Mrs Smith” → r(s) = “Mrs”
We can also use a quantifier that matches zero or more occurrences of a character, not just zero or one. This quantifier is the asterisk “*”, which is very useful when parsing short text messages, given the tendency of humans to repeat some characters when they’re excited:
r = “ni*ce” ∧ s = “nice” → r(s) = “nice”; s = “niiice” → r(s) = “niiice”
Notice, however, that this pattern would also match strings where the character isn’t present:
r = “ni*ce” ∧ s = “nce” → r(s) = “nce”
To address this problem, we can use the quantifier “+”, which matches one or more repetitions of the same character:
r = “ni+ce” ∧ s = “nice” → r(s) = “nice”; s = “nce” → r(s) = ∅
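The three quantifiers can be compared side by side. This is a minimal sketch that assumes Python's `re` module; the list comprehensions simply apply the same RegEx to several strings.

```python
import re

# "?" makes the preceding character optional (zero or one).
titles = [re.findall(r"Mrs?", t) for t in ("Mr Smith", "Mrs Smith")]
# titles == [['Mr'], ['Mrs']]

samples = ("nice", "niiice", "nce")

# "*" accepts zero or more repetitions, so "nce" matches too.
star = [re.findall(r"ni*ce", t) for t in samples]
# star == [['nice'], ['niiice'], ['nce']]

# "+" requires at least one repetition, so "nce" is rejected.
plus = [re.findall(r"ni+ce", t) for t in samples]
# plus == [['nice'], ['niiice'], []]
```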
3.7. Specific Intervals or Repetitions
We can also define intervals corresponding to the specific number of repetitions that we want. We might, for instance, be interested in finding all sequences of exactly three digits followed by a hyphen. In many countries, this indicates the urban prefix for national calls, and we might want to know which city we’re calling.
This is done by adding the expression “{3}” immediately after the pattern for digits, “\d”, that we saw above, and by following it with the hyphen:
r = “\d{3}-” ∧ s = “555-12345” → r(s) = “555-”; s = “55-12345” → r(s) = ∅
We could also, however, be unsure as to how many digits the urban prefix has. In this case, we could indicate either a half-open interval [3, ∞) or a closed interval [3, m], if we know the upper bound m of this interval. In RegEx terms, we can express intervals as:
- “{n,m}”, which indicates the closed interval [n, m]
- “{n,}”, which indicates the half-open interval [n, ∞)
The RegEx that catches all groups of 3 to 8 digits following a hyphen (hyphen included) is, for example, this:
r = “-\d{3,8}” ∧ s = “555-12345” → r(s) = “-12345”; s = “55-123456789” → r(s) = “-12345678”
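The interval quantifiers behave as follows in a sketch that assumes Python's `re` module. The last line uses the open-ended form “{3,}”, which is not among the worked examples above and is added only for comparison.

```python
import re

# Exactly three digits followed by a hyphen.
exact = re.findall(r"\d{3}-", "555-12345")           # ['555-']
too_short = re.findall(r"\d{3}-", "55-12345")        # []

# Between three and eight digits after a hyphen.
bounded = re.findall(r"-\d{3,8}", "555-12345")       # ['-12345']
capped = re.findall(r"-\d{3,8}", "55-123456789")     # ['-12345678']

# Three or more digits after a hyphen, with no upper bound.
open_ended = re.findall(r"-\d{3,}", "55-123456789")  # ['-123456789']
```

Because quantifiers take as many repetitions as they can, `capped` stops at eight digits while `open_ended` consumes all nine.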
3.8. Groups
From what we’ve seen so far, quantifiers have only been applied to individual characters. It’s also possible to apply them to groups or sequences of characters though.
In RegEx syntax, a group is defined as a sequence of characters or an expression that’s contained within round brackets. The expression “(abc)”, for example, matches the whole string “abc” but not its substrings. It also matches twice in the string “abcabc”, though.
If we wanted to match all double repetitions of the pattern “abc” in the string “abcabc”, for example, but not the single occurrences, we could either write a RegEx “abcabc” or, more elegantly:
r = “(abc){2}” ∧ s = “abcabc” → r(s) = “abcabc”
Notice that, if we wrote the RegEx without parentheses, the quantifier “{2}” would apply only to the letter “c”. This means that “abc{2}” matches “abcc” but not “abcabc”.
The other quantifiers, such as “*”, “+”, “?”, and “{n,m}”, can all equally be applied to groups contained within brackets. If we use them in this manner, they operate over the whole group and not over individual characters.
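The effect of the parentheses on a quantifier can be verified with a short sketch, assuming Python's `re` module. The helper `full_matches` is our own name: it collects the whole matched text, since in Python `re.findall` returns the content of the group rather than the full match when the RegEx contains parentheses.

```python
import re

def full_matches(pattern, text):
    """Return the complete text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# The group alone matches each occurrence of "abc".
single = full_matches(r"(abc)", "abcabc")         # ['abc', 'abc']

# The quantifier applies to the whole group.
doubled = full_matches(r"(abc){2}", "abcabc")     # ['abcabc']

# Without parentheses, {2} applies only to the letter "c".
no_parens = full_matches(r"abc{2}", "abcabc")     # []
no_parens_cc = full_matches(r"abc{2}", "abcc")    # ['abcc']
```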
3.9. Boolean Operator OR
Groups can also contain alternative expressions, which is particularly useful in case many variations of the same string exist. We know, for example, that the color “gray” is spelled in English in two possible manners, according to the origin of the speaker. If we’re looking for that color in a text, a RegEx of the form “gray” wouldn’t match “grey”, and vice versa.
We can, however, use groups and the Boolean operator or, expressed in RegEx with “|”, to indicate alternatives:
r = “gr(a|e)y” ∧ s = “gray” → r(s) = “gray”; s = “grey” → r(s) = “grey”
Longer chains of or clauses are also possible, by attaching various elements in succession:
r = “0(1|2|3)” ∧ s = “010203” → r(s) = [“01”, “02”, “03”]
If the group is the only component of a RegEx, its parentheses can be omitted. This means that the RegExes “1|2|3” and “(1|2|3)” correspond perfectly if there aren’t any other elements in the same expression.
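The alternation operator can be tried out as follows, assuming Python's `re` module. The sketch writes the groups as “(?:...)”, Python's non-capturing form, which is an implementation detail not covered in this article: it keeps `re.findall` returning the whole match instead of only the content of the parentheses.

```python
import re

# Either spelling of the color is accepted.
colors = re.findall(r"gr(?:a|e)y", "gray or grey")   # ['gray', 'grey']

# A longer chain of alternatives after a fixed prefix.
pairs = re.findall(r"0(?:1|2|3)", "010203")          # ['01', '02', '03']

# When the alternation is the whole RegEx, the parentheses are optional.
bare = re.findall(r"1|2|3", "010203")                # ['1', '2', '3']
grouped = re.findall(r"(1|2|3)", "010203")           # ['1', '2', '3']
```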
3.10. Beginning and End of Strings
We might also be interested in finding patterns that span a whole line, from its beginning to its end. In this case, we can use the caret “^” and dollar “$” symbols to indicate, respectively, the start and the end of a line. Both are anchors: they match a position and not a character.
Let’s imagine that we have to parse a CSV file for machine learning tasks. One typical step in preprocessing is to exclude all null observations, which correspond to any repetitions of the string “0,” contained between two “\n”. Assuming that the engine treats “^” and “$” as line anchors, which in many implementations requires a multiline option, the RegEx that matches all these patterns is this:
r = “^(0,)*$” ∧ s = “\n0,0,0,\n” → r(s) = “0,0,0,”
Notice that if we didn’t include the caret and the dollar symbols, the RegEx would also match null values within a row that contains other values, as in:
r = “(0,)*” ∧ s = “\n0,1,0,\n” → r(s) = [“0,”, “0,”]
Because “*” also accepts zero repetitions, both expressions additionally produce empty matches, which we omit from the results here.
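The behavior of the anchors depends on the engine. In the following sketch we assume Python's `re` module, where “^” and “$” act as line anchors only when the `re.MULTILINE` flag is set. The helper `non_empty_matches` is our own: it keeps the full text of each match and discards the empty matches that the “*” quantifier allows.

```python
import re

def non_empty_matches(pattern, text, flags=0):
    """Return the full text of every non-empty match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text, flags) if m.group(0)]

null_row = "\n0,0,0,\n"
mixed_row = "\n0,1,0,\n"

# Anchored: only a line made up entirely of "0," matches.
anchored_null = non_empty_matches(r"^(0,)*$", null_row, re.MULTILINE)
anchored_mixed = non_empty_matches(r"^(0,)*$", mixed_row, re.MULTILINE)

# Unanchored: the null values inside a mixed row are matched as well.
unanchored_mixed = non_empty_matches(r"(0,)*", mixed_row)

# Replacing "*" with "+" avoids empty matches altogether.
strict_null = [m.group(0) for m in re.finditer(r"^(0,)+$", null_row, re.MULTILINE)]
```

With these inputs, `anchored_null` and `strict_null` both contain only “0,0,0,”, `anchored_mixed` is empty, and `unanchored_mixed` contains the two isolated “0,” values.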
3.11. Greediness
The last aspect of RegEx syntax that we’ll study here is greediness. Before we get into it, the most important note to make is this: the symbol that we use here, “?”, is the same one that we saw under quantifiers above, but it serves a different purpose and follows different rules, so be careful.
Greediness refers to the preference of quantifiers such as “?”, “+”, and “*” to match as many instances as possible of the pattern that they’re repeating. This means that, if given the choice, they always take as many repetitions as possible:
r = “\d+” ∧ s = “555” → r(s) = “555”
We might, however, prefer to find as few repetitions of a pattern as possible, rather than as many as possible. In this case, the solution is to follow the quantifier with an additional symbol “?”, which tells the quantifier not to be greedy. The non-greedy versions of the quantifiers “?”, “+”, and “*” are therefore “??”, “+?”, and “*?”:
r = “\d+?” ∧ s = “555” → r(s) = [“5”, “5”, “5”]
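The difference between the greedy and the non-greedy forms can be observed directly. This sketch assumes Python's `re` module; the second pair of examples, with angle brackets, is an additional illustration of ours and not part of the article's examples.

```python
import re

# Greedy: one match that takes all the digits.
greedy_digits = re.findall(r"\d+", "555")     # ['555']

# Non-greedy: each match stops as soon as one digit is found.
lazy_digits = re.findall(r"\d+?", "555")      # ['5', '5', '5']

# The same contrast on a pattern with a closing delimiter.
greedy_tags = re.findall(r"<.+>", "<a><b>")   # ['<a><b>']
lazy_tags = re.findall(r"<.+?>", "<a><b>")    # ['<a>', '<b>']
```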
4. Examples of Usage of RegExes
4.1. First Example: Mr., Mrs., and Ms.
Now that we’ve completed the study of the syntax of regular expressions, we can see how to apply them in practical contexts. In this section, we’re thus going to work through two examples of practical usage of regular expressions for the solution of concrete tasks.
The first example is an extension of the one we briefly touched on before. We want to find the two RegExes that identify the titles for men and distinguish them from those for women. This corresponds, respectively, to a pattern that matches the strings “Mr.” but not “Mrs.” and “Ms.”, and a pattern that matches “Mrs.” or “Ms.” but not “Mr.”
We know that, in all cases, the first character is an uppercase “M” while the last one is a period “.”. This means that all RegExes start with the letter “M” and finish with the pattern that matches exclusively the dot. Because we remember that, in RegExes, the pattern “.” indicates any character except the newline and not just the character “.”, we must be careful to precede it with a backslash, as in “\.”. In this manner, the dot is intended literally and not as a shortcut for “any character”.
The first RegEx has to match “Mr.” but not “Mrs.” or “Ms.”. This means that the RegEx “Mr\.” does the job perfectly.
The second RegEx has to match both “Mrs.” and “Ms.”, but not “Mr.”. This RegEx is slightly more complex than the previous one. If we used “Mrs\.”, we wouldn’t match “Ms.”, and if we used “Ms\.”, we wouldn’t match “Mrs.”.
If we notice, however, that the character “r”, which was mandatory for the first RegEx, is optional here, we can use the quantifier “?” after it to indicate that it can be skipped: “Mr?s\.”
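Both RegExes can be checked against the three titles. The sketch below assumes Python's `re` module and uses `re.fullmatch`, which succeeds only if the whole string matches the RegEx; the list of titles and the variable names are ours.

```python
import re

titles = ["Mr.", "Mrs.", "Ms."]

# First RegEx: titles for men only.
male = [t for t in titles if re.fullmatch(r"Mr\.", t)]
# male == ['Mr.']

# Second RegEx: the "r" is optional, the "s" is mandatory.
female = [t for t in titles if re.fullmatch(r"Mr?s\.", t)]
# female == ['Mrs.', 'Ms.']
```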
4.2. What If We Also Introduce “Miss”?
We can make the job harder if, among the possible titles for women, we also include “Miss” together with “Mrs.” and “Ms.”. Notice how “Miss” is written without the period, and therefore the “\.” can no longer sit unconditionally at the end of the RegEx. We can build the new RegEx like this.
The letter “M” has to stay in the first position, as before. Then we’re searching for alternative options:
- The first alternatives correspond to the strings “rs.” or “s.”.
- The second alternative corresponds to the string “iss”
This means that there’s a group of characters on which we apply a Boolean operation or inside of our RegEx. Because the group containing the or operation looks like “(r?s\.|iss)”, the whole RegEx that solves the problem therefore is:
“M(r?s\.|iss)”
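As a quick check of this RegEx, and assuming Python's `re` module, we can test it both on isolated titles and inside a sentence. The sample sentence and names are our own.

```python
import re

female_title = re.compile(r"M(r?s\.|iss)")

titles = ["Mr.", "Mrs.", "Ms.", "Miss"]
accepted = [t for t in titles if female_title.fullmatch(t)]
# accepted == ['Mrs.', 'Ms.', 'Miss']

# finditer with group(0) returns the full title, not just the group.
sentence = "Miss Jones met Mr. Smith and Mrs. Brown"
found = [m.group(0) for m in female_title.finditer(sentence)]
# found == ['Miss', 'Mrs.']
```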
4.3. Second Example: Which Ones Are Valid Email Addresses?
The second example we study relates to the extraction of email addresses from unstructured text. This is one of the most typical tasks in the parsing or scraping of HTML websites, but also in the validating of the content of forms.
Let’s imagine that the document is a long sequence of characters that may or may not contain email addresses.
We want to find a regular expression that matches all instances of email addresses in that document.
We know that an email address is, first of all, characterized by the presence of the sign “@” somewhere in the middle of the address. Therefore, we can imagine that the RegEx should have that particular symbol somewhere at its center.
We also know that the second part of the email address must point to a website, which is normally indicated with its domain name, which comprises a top-level and a second-level domain separated by a dot. We can, therefore, add some more information to the structure of the RegEx, which now looks a bit like this:
“(some characters)@(second-level domain)\.(top-level domain)”
Notice that the backslash “\” precedes the dot because we intend it in its literal form. Now the last thing that we need to do is to impose some additional conditions on the characters other than “@” and “.”. Specifically, we want to limit the possible characters to the alphanumeric characters, with the addition of some symbols commonly present in email addresses such as “_” and “-”.
4.4. The RegEx We Need
To sum up the considerations in the previous section, an email address in a document is a string that possesses:
- an “@” character between some alphanumeric characters or some symbols such as underscore or dash
- a “.” character at least two characters after the “@”, surrounded by alphanumeric characters or some symbols
- a top-level domain comprised of at least two characters
- the rest of the characters belonging to the alphanumeric set or to a small set of allowed symbols
- lastly, a separation from the surrounding text, which we express with the word boundary “\b”
The RegEx that satisfies these conditions is, therefore:
“\b[A-Za-z0-9._+%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b”
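To see a RegEx of this kind at work, the sketch below assumes Python's `re` module and uses a form of the expression in which the hyphen sits at the end of each character class, so that it's read literally and not as a range operator, and in which the dot before the top-level domain is escaped. The sample text and addresses are invented for illustration, and the pattern is a simplification that doesn't cover every address allowed by the email standards.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._+%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

text = (
    "Write to jane.doe@example.com or support-team@mail.example.org, "
    "not to user@localhost."
)

# The pattern has no groups, so findall returns the full addresses.
found = EMAIL.findall(text)

# "user@localhost" is skipped: nothing follows its final dot,
# so the top-level domain of at least two letters is missing.
```

After running the sketch, `found` contains the first two addresses only, without the trailing comma of the second one.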
5. Conclusions
In this article, we studied the syntactic rules for regular expressions.
We also saw some practical examples of their usages for the solution of tasks of pattern matching.