1. Introduction

In this tutorial, we’ll show how to parse a stream of characters into tokens using the Java StreamTokenizer class.

2. StreamTokenizer

The StreamTokenizer class reads the stream character by character. Each character can have zero or more of the following attributes: whitespace, alphabetic, numeric, string quote, or comment character.

Now, we need to understand the default configuration. We have the following types of characters:

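  • Word characters: ranges like ‘a’ to ‘z’ and ‘A’ to ‘Z’, plus the values from 160 to 255
  • Numeric characters: 0,1,…,9, plus ‘.’ and ‘-’ so that decimals and negative numbers are parsed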
  • Whitespace characters: ASCII values from 0 to 32
  • Comment character: /
  • String quote characters: ' and "

Note that line endings are treated as whitespace, not as separate tokens, and that C/C++-style comments are not recognized by default.
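To make the last point concrete, here is a minimal sketch of how C/C++-style comment recognition can be switched on with the slashStarComments(boolean) and slashSlashComments(boolean) methods. The CommentStyleSketch class and its words helper are hypothetical names used only for this illustration:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CommentStyleSketch {

    // Hypothetical helper: returns only the word tokens, joined by single spaces.
    static String words(String input, boolean recognizeCStyleComments) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        // Both flags are off by default
        tokenizer.slashStarComments(recognizeCStyleComments);
        tokenizer.slashSlashComments(recognizeCStyleComments);

        StringBuilder result = new StringBuilder();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    if (result.length() > 0) {
                        result.append(' ');
                    }
                    result.append(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "a /* b */ c\nd // e\nf";
        // Default: a single '/' starts a comment that runs to the end of the line
        System.out.println(words(input, false)); // a d f
        // With both flags on, the block comment ends at "*/", so 'c' survives
        System.out.println(words(input, true));  // a c d f
    }
}
```

With the defaults, the single ‘/’ comment character swallows everything up to the end of the line, which is why ‘c’ is lost in the first call.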

The class also defines a set of important constants:

  • TT_EOF – A constant indicating the end of the stream
  • TT_EOL – A constant indicating the end of the line (returned only when end-of-line is made significant)
  • TT_NUMBER – A constant indicating a number token
  • TT_WORD – A constant indicating a word token

3. Default Configuration

Here, we’re going to create an example in order to understand the StreamTokenizer mechanism. We’ll start by creating an instance of this class and then call the nextToken() method until it returns the TT_EOF value:

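import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

private static final int QUOTE_CHARACTER = '\'';
private static final int DOUBLE_QUOTE_CHARACTER = '"';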

public static List<Object> streamTokenizerWithDefaultConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<>();

    int currentToken = streamTokenizer.nextToken();
    while (currentToken != StreamTokenizer.TT_EOF) {

        if (streamTokenizer.ttype == StreamTokenizer.TT_NUMBER) {
            // Numbers are always parsed as double and stored in nval
            tokens.add(streamTokenizer.nval);
        } else if (streamTokenizer.ttype == StreamTokenizer.TT_WORD
            || streamTokenizer.ttype == QUOTE_CHARACTER
            || streamTokenizer.ttype == DOUBLE_QUOTE_CHARACTER) {
            // Words and quoted strings (without their quotes) are stored in sval
            tokens.add(streamTokenizer.sval);
        } else {
            // Any other token is a single character whose code is the token type
            tokens.add((char) currentToken);
        }

        currentToken = streamTokenizer.nextToken();
    }

    return tokens;
}

The test file simply contains:

3 quick brown foxes jump over the "lazy" dog!
#test1
//test2

Now, if we printed out the contents of the list, labeling each token by its type, we’d see:

Number: 3.0
Word: quick
Word: brown
Word: foxes
Word: jump
Word: over
Word: the
Word: lazy
Word: dog
Ordinary char: !
Ordinary char: #
Word: test1

In order to better understand the example, we need to explain the StreamTokenizer.ttype, StreamTokenizer.nval and StreamTokenizer.sval fields.

The ttype field contains the type of the token just read. It could be TT_EOF, TT_EOL, TT_NUMBER, or TT_WORD. However, for a quoted string token, its value is the ASCII value of the quote character. Moreover, if the token is an ordinary character like ‘!’, with no attributes, then ttype holds the ASCII value of that character.

Next, the sval field holds the text of the token, but only if it’s a word token (TT_WORD) or a quoted string token. For a quoted string – say “lazy” – this field contains the body of the string, without the surrounding quotes.

Last, the nval field holds the value of the token as a double, but only if it’s a number token (TT_NUMBER).
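As a rough illustration of how the three fields work together, the sketch below turns each token into a labeled line similar to the output shown earlier. The TokenLabelSketch class and its label and labels methods are hypothetical helpers, not part of the original example, and we assume quoted strings should be labeled as words to match that output:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenLabelSketch {

    // Hypothetical helper: describes the token the tokenizer has just read.
    static String label(StreamTokenizer tokenizer) {
        switch (tokenizer.ttype) {
            case StreamTokenizer.TT_NUMBER:
                // nval is only meaningful for number tokens
                return "Number: " + tokenizer.nval;
            case StreamTokenizer.TT_WORD:
            case '\'':
            case '"':
                // sval is only meaningful for words and quoted strings
                return "Word: " + tokenizer.sval;
            default:
                // For ordinary characters, ttype is the character itself
                return "Ordinary char: " + (char) tokenizer.ttype;
        }
    }

    static List<String> labels(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        List<String> lines = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                lines.add(label(tokenizer));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints one labeled line per token
        labels("3 quick \"lazy\" dog!").forEach(System.out::println);
    }
}
```

Switching on ttype keeps the decision about which field to read in a single place, so nval and sval are never consulted for a token type that doesn’t populate them.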

4. Custom Configuration

Here, we’ll change the default configuration and create another example.

First, we’re going to set some extra word characters, the range from ‘!’ to ‘-’, using the wordChars(int low, int hi) method. Then, we’ll make the comment character (‘/’) an ordinary one and make ‘#’ the new comment character.

Finally, we’ll treat the end of the line as a token of its own with the help of the eolIsSignificant(boolean flag) method.

We only need to call these methods on the streamTokenizer object:

public static List<Object> streamTokenizerWithCustomConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<Object>();

    streamTokenizer.wordChars('!', '-');
    streamTokenizer.ordinaryChar('/');
    streamTokenizer.commentChar('#');
    streamTokenizer.eolIsSignificant(true);

    // same as before

    return tokens;
}

And here we have a new output:

// same output as earlier
Word: "lazy"
Word: dog!
Ordinary char: 

Ordinary char: 

Ordinary char: /
Ordinary char: /
Word: test2

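Note that the double quotes became part of the token, because ‘"’ now falls inside the word-character range. The end of the line is no longer skipped as whitespace either: it’s returned as a separate TT_EOL token, whose value is the newline character, so our loop adds it as a single-character token.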
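If we want to tell line endings apart from genuinely ordinary characters, we can check for TT_EOL explicitly. The following is a minimal sketch under that assumption; the EolSketch class and its kinds method are hypothetical names:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EolSketch {

    // Hypothetical helper: lists the kind of every token, separated by spaces.
    static String kinds(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.eolIsSignificant(true);

        StringBuilder result = new StringBuilder();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (result.length() > 0) {
                    result.append(' ');
                }
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_EOL:
                        result.append("EOL");
                        break;
                    case StreamTokenizer.TT_WORD:
                        result.append("WORD");
                        break;
                    default:
                        result.append("OTHER");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(kinds("a b\nc\n")); // WORD WORD EOL WORD EOL
    }
}
```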

Also, the characters following the ‘#’ character are now skipped and the ‘/’ is an ordinary character.

We could also change the quote character with the quoteChar(int ch) method or even the whitespace characters by calling the whitespaceChars(int low, int hi) method. Thus, further customizations can be made by calling StreamTokenizer’s methods in different combinations.
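As one possible combination, the sketch below treats the comma as whitespace and the backtick as an additional quote character, so that a comma-separated line splits into fields. The SeparatorSketch class, the fields method, and the choice of the backtick are assumptions made purely for illustration:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SeparatorSketch {

    // Hypothetical helper: splits a line into word and backtick-quoted fields.
    static List<String> fields(String line) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        // Treat ',' as whitespace so it separates tokens without becoming one
        tokenizer.whitespaceChars(',', ',');
        // Let the backtick delimit quoted strings, in addition to ' and "
        tokenizer.quoteChar('`');

        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD || tokenizer.ttype == '`') {
                    result.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("alpha,beta,`gamma delta`")); // [alpha, beta, gamma delta]
    }
}
```

Because the quoted field is read as a single token, the space inside it doesn’t split the field.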

5. Conclusion

In this tutorial, we’ve seen how to parse a stream of characters into tokens using the StreamTokenizer class. We’ve learned about the default mechanism and created an example with the default configuration.

Finally, we’ve changed the default parameters and we’ve noticed how flexible the StreamTokenizer class is.

As usual, the code can be found over on GitHub.