1. Introduction
In this tutorial, we’ll show how to count words in LaTeX documents.
We’ll present four approaches to this problem. The first is the detex utility, available on most Linux installations. Then, there are two Perl scripts, latexcount.pl and texcount.pl, both available on the web. Finally, we’ll use the shell script wordcount.sh, also available on the web.
2. A LaTeX Running Example
We’ll assume we have a LaTeX file example_Latex_document.tex as follows:
\documentclass{article}
\title{Example \LaTeX\ document}
\author{Gonzo T. Clown}
\date{\today}
\begin{document}
\maketitle
\thispagestyle{empty}
\section{The First Section}
This is an example of a \LaTeX\ source file. We can
write ordinary English as well as in-line mathematics,
such as $s=ut+ 1/2 at^2$.
\section{The Second Section}
In addition, we can also use arrays of equations.
\begin{eqnarray}
v &=& u+at\\
e &=& mc^2\\
P_1V_1 &=& P_2V_2
\end{eqnarray}
\section{The Third Section}
We can also present material in a tabular format.
\begin{tabular}{|c|c|}\hline
Type & Characteristics\\\hline
Mammals & Warm-blooded\\
Birds & Can fly\\
Reptiles & Cold-blooded\\\hline
\end{tabular}
\end{document}
When we run this through pdflatex or a similar command, we get the file example_Latex_document.pdf.
3. Word Count Using detex
We can use detex to strip out all LaTeX commands from a document. Here is how we apply detex to the previous example:
$ detex example_Latex_document.tex
This is the output we get:
Example document
Gonzo T. Clown
The First Section
This is an example of a source file. We can
write ordinary English as well as in-line mathematics,
such as .
The Second Section
In addition, we can also use arrays of equations.
The Third Section
We can also present material in a tabular format.
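We see that detex strips out the LaTeX commands. It keeps the text of the title, the author, and the section headings, but drops the mathematics and the entire contents of the tabular environment.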
So, to obtain the word count, we feed the output to wc -w:
$ detex example_Latex_document.tex | wc -w
53
The sole period preceded by a blank can be filtered out:
$ detex example_Latex_document.tex | sed 's/ \.//g' | wc -w
52
The period has to be escaped to avoid its normal meaning in sed (match any character).
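A more general variant of this filter, given here only as a sketch, counts just those tokens that contain at least one letter or digit, so any free-standing punctuation is ignored. The function name count_real_words is our own invention, and the sketch assumes the text arrives on standard input:

```shell
# Hypothetical helper (not part of detex): count whitespace-separated
# tokens that contain at least one letter or digit.
count_real_words() {
    # Put every token on its own line, then count the lines
    # that contain an alphanumeric character.
    tr -s '[:space:]' '\n' | grep -c '[[:alnum:]]'
}

# Intended use (assumes detex is installed):
#   detex example_Latex_document.tex | count_real_words
```

Applied to the detex output shown above, this should again give 52, since the lone period is the only token without a letter or digit.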
4. Using the Perl Script latexcount.pl
We can run latexcount.pl like this:
$ perl latexcount.pl example_Latex_document.tex
79 words in the main text
in the footnotes
79 total
The result is 79 words, whereas the detex approach counted only 52. One reason is that detex discards the mathematics and the contents of the table along with the LaTeX commands, as its output above shows.
5. Using the Perl Script texcount.pl
The script texcount.pl gives us more detailed information:
$ perl texcount.pl example_Latex_document.tex
File: example_Latex_document.tex
Encoding: ascii
Words in text: 39
Words in headers: 12
Words outside text (captions, etc.): 0
Number of headers: 4
Number of floats/tables/figures: 0
Number of math inlines: 1
Number of math displayed: 1
Subcounts:
text+headers+captions (#headers/#floats/#inlines/#displayed)
0+3+0 (1/0/0/0) _top_
21+3+0 (1/0/1/0) Section: The First Section
9+3+0 (1/0/0/1) Section: The Second Section
9+3+0 (1/0/0/0) Section: The Third Section
We can get a one-line summary with the -brief option:
$ perl texcount.pl -brief example_Latex_document.tex
39+12+0 (4/0/1/1) File: example_Latex_document.tex
We see that the result (39 words in text + 12 words in headers = 51) is close to the 52 we got with detex.
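If we want the shell to do this addition for us, a small awk filter is enough. This is only a sketch: the function name sum_brief is hypothetical, and it assumes the summary line has exactly the text+headers+captions form shown above:

```shell
# Hypothetical helper: add the "text" and "headers" counts from a
# texcount.pl -brief line of the form "39+12+0 (4/0/1/1) File: ...".
sum_brief() {
    # Split on "+" and spaces, so $1 is the text count and $2 the header count.
    awk -F'[+ ]' '{ print $1 + $2 }'
}

# Intended use (assumes texcount.pl is in the current directory):
#   perl texcount.pl -brief example_Latex_document.tex | sum_brief
```

For the summary line above, the function prints 51.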
6. Using the wordcount.sh Script
To run wordcount.sh, we must place its accompanying file wordcount.tex in the same directory as our LaTeX file. This approach works on Unix/Linux systems.
After making the script executable, we can run it as follows:
$ ./wordcount.sh example_Latex_document.tex
The count will be on the last line of the output:
example_Latex_document.tex contains 437 characters and 73 words.
To filter out the unnecessary parts of the output, we can feed it to tail:
$ ./wordcount.sh example_Latex_document.tex | tail -n1
example_Latex_document.tex contains 437 characters and 73 words.
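When we need only the number, for example to store it in a variable, we can extract it from that last line with sed. The sketch below assumes the final line always ends in "and N words." as in the output above; the function name wordcount_words is our own:

```shell
# Hypothetical helper: print only the word count from the last line of
# wordcount.sh output, e.g. "... contains 437 characters and 73 words.".
wordcount_words() {
    # Keep the last line, then print the digits between " and " and " words.".
    tail -n1 | sed -n 's/.* and \([0-9][0-9]*\) words\.$/\1/p'
}

# Intended use (assumes wordcount.sh is executable in the current directory):
#   ./wordcount.sh example_Latex_document.tex | wordcount_words
```

If the last line doesn't match the assumed pattern, the function prints nothing, which makes a changed output format easy to notice.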
The result (73 words) is close to that of latexcount.pl (79 words).
7. A Comparison of the Approaches
We found that detex and texcount.pl gave similar results, which was also the case with latexcount.pl and wordcount.sh:
tool            plain text   inline math   displayed math   tabular material
detex           69           47            47               27
latexcount.pl   71           62            68               53
texcount.pl     69           47            47               27
wordcount.sh    69           50            48               37
We can see that the word counts reported by the tools vary widely. This is because the tools strip out different classes of LaTeX commands and define a word in different ways.
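To put a number on the variation for our running example, we can take the four counts reported in Sections 3 to 6 (52, 79, 51, and 73) and compute the lowest count, the highest count, and their difference. The following is a minimal sketch; the function name count_spread is hypothetical:

```shell
# Hypothetical helper: read one count per line and print
# the minimum, the maximum, and their difference.
count_spread() {
    # After a numeric sort, the first line is the minimum and the last the maximum.
    sort -n | awk 'NR == 1 { min = $1 } { max = $1 } END { print min, max, max - min }'
}

# Counts from detex (filtered), latexcount.pl, texcount.pl, and wordcount.sh.
printf '%s\n' 52 79 51 73 | count_spread
# prints: 51 79 28
```

So, for the same short document, the highest count exceeds the lowest by 28 words.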
8. Conclusion
In this article, we described four different methods to count the words in a LaTeX document. All returned different estimates of the count.