Searching the Contents of Multiple PDF Files

1. Introduction

Searching for a given string across a set of PDF files is a common task. It may be for personal use, such as finding the train ticket for an upcoming journey, or for business use, where we have to extract some data from PDF files. While we could open each PDF file in a GUI viewer and search it manually, this quickly becomes cumbersome when the set of files is large.

With command-line tools, we can easily automate searching a large number of files. However, we must note that PDF is a binary format, and plain text search commands such as grep and sed will not work as expected on PDF files.
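To see why a plain grep falls short, it helps to remember that PDF files usually store page text inside compressed streams, so the literal characters rarely appear in the file’s bytes. The sketch below imitates that effect with gzip instead of a real PDF. It’s only an analogy chosen for illustration, and the temporary file names are arbitrary:

```shell
# Analogy only: gzip stands in for the compressed streams inside a PDF.
tmp=$(mktemp -d)

# Write the same ticket line 100 times so there is something to compress.
i=0
while [ "$i" -lt 100 ]; do
    echo 'From: KSR BENGALURU(SBC)'
    i=$((i + 1))
done > "$tmp/plain.txt"

gzip -c "$tmp/plain.txt" > "$tmp/packed.bin"

# grep -c prints the number of matching lines. "|| true" keeps a
# zero-match result (exit status 1) from aborting a script run with set -e.
plain_hits=$(grep -c 'BENGALURU' "$tmp/plain.txt" || true)
packed_hits=$(grep -a -c 'BENGALURU' "$tmp/packed.bin" || true)

echo "plain text: $plain_hits matching lines"
echo "compressed: $packed_hits matching lines"

rm -rf "$tmp"
```

The plain file reports 100 matching lines, while the compressed copy of the very same text reports none.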

In this tutorial, we’ll look at some specialized commands that can be used to search for strings in PDF files.

2. Using pdftotext

The pdftotext command is a utility that converts a PDF file into plain text. It ships with the Poppler utilities (the poppler-utils package on many Linux distributions), which are often installed by default. We can use this command to convert all our PDF files to plain text and then run grep on the resulting output. This is a multi-step process, so we’ll look at each step one by one and then combine all the steps into a single command.

2.1. Converting a PDF File into Plain Text

We can convert a single PDF file into plain text as follows:

$ pdftotext filename.pdf -

The hyphen at the end instructs the command to send its output to stdout. Without it, pdftotext saves the output in a text file named after the PDF (filename.txt in this case). We need the output on stdout so we can pipe it into other commands for further processing.
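When only part of a document matters, pdftotext can also restrict the page range and keep the page layout. Here’s a minimal sketch, assuming Poppler’s pdftotext; the helper name first_page_text is made up for illustration:

```shell
# Hypothetical helper: print only the first page of a PDF to stdout.
#   -f 1 -l 1  convert pages 1 through 1
#   -layout    keep the original physical layout of the text
#   -          write to stdout instead of a .txt file
first_page_text() {
    pdftotext -f 1 -l 1 -layout "$1" -
}

# Usage (requires pdftotext and an existing file):
#   first_page_text train-ticket.pdf | head -n 5
```

Limiting the range keeps the conversion quick on long documents when we already know the text sits near the front.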

2.2. Searching in a PDF File

We can pipe the plain text output from the above command into the grep command to search for a string or pattern in the file:

$ pdftotext train-ticket.pdf - | grep --with-filename --label=train-ticket.pdf -i "bengaluru"
train-ticket.pdf:From: KSR BENGALURU(SBC)
train-ticket.pdf:Boarding At: KSR BENGALURU(SBC)

We’ve added the --with-filename and --label flags to print the filename for each match. The -i flag performs a case-insensitive search with the provided pattern. We can omit it to perform a case-sensitive search.
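The effect of the label is easiest to see on piped input. This short sketch assumes GNU grep and uses a sample line in place of real pdftotext output:

```shell
# A sample line stands in for the output of "pdftotext train-ticket.pdf -".
sample='From: KSR BENGALURU(SBC)'

# Without --label, GNU grep names the source "(standard input)"
# (the exact wording can vary with the locale).
unlabeled=$(printf '%s\n' "$sample" | grep --with-filename -i 'bengaluru')

# With --label, the given name is printed as the file name instead.
labeled=$(printf '%s\n' "$sample" |
    grep --with-filename --label=train-ticket.pdf -i 'bengaluru')

echo "$unlabeled"
echo "$labeled"
```

Since grep only ever sees stdin here, the label is the only way for it to report which PDF a match came from.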

2.3. Finding All PDF Files in the Folder

Now that we’ve got the command to search a single file, we need to apply it to all the PDF files, which we can do with the find command. To start with, we’ll simply run find to print the paths of all the PDF files in the current folder and its subfolders:

$ find . -name '*.pdf'
./sbc-2022-01-02.pdf
./train-ticket.pdf
./downloads/HR_23-01-2022.pdf
./downloads/subfolder/20-01-2022 HMB English.pdf
./30-01-2022 HMB English.pdf

The dot in the command indicates the current folder, and we can replace it with any other path to search in. The -name '*.pdf' test keeps only the files whose names end with the .pdf extension. The quotes prevent the shell from expanding the wildcard before find sees it.
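One thing to watch for is that -name matches case-sensitively, so a file called SCAN.PDF would be missed. The following sketch builds a throwaway folder (all file names are examples) and compares -name with -iname, which GNU and BSD find both provide:

```shell
# Build a throwaway folder with mixed-case extensions.
demo=$(mktemp -d)
mkdir -p "$demo/downloads"
touch "$demo/train-ticket.pdf" "$demo/SCAN.PDF" "$demo/notes.txt" \
      "$demo/downloads/HR_23-01-2022.pdf"

# -name is case-sensitive, so SCAN.PDF is skipped.
by_name=$(find "$demo" -name '*.pdf' | wc -l | tr -d ' ')

# -iname ignores case; -type f restricts the result to regular files.
any_case=$(find "$demo" -type f -iname '*.pdf' | wc -l | tr -d ' ')

echo "-name: $by_name files, -iname: $any_case files"
rm -rf "$demo"
```

Here, -name finds 2 files and -iname finds 3, the extra one being SCAN.PDF.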

2.4. Combining the Steps

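Now we can use the -exec action of the find command to run the search on each file and print the results per file. We pass each path to sh as a positional parameter ("$1") instead of embedding {} in the script text, so that file names containing quotes or other special characters are handled safely:

$ find . -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" -i bengaluru' sh {} \;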
./sbc-2022-01-02.pdf:Resv. Upto: KSR BENGALURU(SBC)
./sbc-2022-01-02.pdf:To: KSR BENGALURU(SBC)
./train-ticket.pdf:From: KSR BENGALURU(SBC)
./train-ticket.pdf:Boarding At: KSR BENGALURU(SBC)
./downloads/subfolder/20-01-2022 HMB English.pdf:Bengaluru Urban

The output shows that find searches recursively: it covers the PDF files in the current folder as well as those in its subfolders. We can add the -maxdepth option to restrict the search to the current folder only, or to a given number of subfolder levels:

$ find . -maxdepth 1 -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" -i bengaluru' sh {} \;
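For repeated use, the one-liner can be wrapped in shell functions. The sketch below is one possible arrangement, not a standard tool: search_pdfs and pdfs_mentioning are hypothetical names. The first hands files to sh in batches with "{} +", and the second uses -exec as a test so that only the paths of matching files are printed:

```shell
# Hypothetical helper: print every matching line, prefixed with its file.
# "{} +" passes many paths to a single sh, which loops over them.
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for f in "$@"; do
            pdftotext "$f" - 2>/dev/null |
                grep --with-filename --label="$f" -i -- "$pattern"
        done
        exit 0   # a file without matches is not an error
    ' sh "$pattern" {} +
}

# Hypothetical helper: print only the paths of PDFs that contain a match.
# find reaches -print only when the pipeline (grep -q) exits with 0.
pdfs_mentioning() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' \
        -exec sh -c 'pdftotext "$1" - 2>/dev/null | grep -qi -- "$2"' \
            sh {} "$pattern" \; \
        -print
}

# Usage (requires pdftotext):
#   search_pdfs bengaluru .
#   pdfs_mentioning bengaluru ./downloads
```

Batching starts far fewer shells on large folders, while the second form is handy when we only need to know which files to open.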

3. Using pdfgrep

The pdfgrep command can be used to search for patterns in PDF files in a single step. However, it may not be available on our Linux distribution by default, and we’ll need to install the pdfgrep package to be able to use it. Once we’ve got everything set up, using it is very easy:

$ pdfgrep -HiR bengaluru .
./sbc-2022-01-02.pdf:   From: MYSURU JN(MYS)                                                               Date Of Journey: 02-Jan-2022                                                         To: KSR BENGALURU(SBC)
./sbc-2022-01-02.pdf:   Resv. Upto: KSR BENGALURU(SBC)                                                     Scheduled Arrival: 02-Jan-2022 21:05 *                                               Adult: 2 Child: 0
./train-ticket.pdf:   From: KSR BENGALURU(SBC)                                                           Date Of Journey: 01-Jan-2022                                                         To: MYSURU JN(MYS)
./train-ticket.pdf:   Boarding At: KSR BENGALURU(SBC)                                                    Date Of Boarding: 01-Jan-2022                                                        Scheduled Departure: 01-Jan-2022 10:50 *

We’ve used the -H option to print file names, the -i option for a case-insensitive search, and the -R option to search recursively in all subfolders of the specified folder (the current folder in this case). The matching lines keep the spacing of the page layout, so the output can look a bit messy with wide runs of spaces, as we see above.
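If the wide gaps get in the way, we can squeeze them with tr. The following is a small sketch in which pdfgrep_compact is a hypothetical helper name; it also adds pdfgrep’s -n option, which prints the page number of each match:

```shell
# Hypothetical helper: run pdfgrep and collapse runs of spaces.
#   -H  print the file name    -n  print the page number
#   -i  ignore case            -R  recurse into subfolders
# Note: tr -s ' ' also collapses repeated spaces inside file names.
pdfgrep_compact() {
    pattern=$1
    dir=${2:-.}
    pdfgrep -HniR "$pattern" "$dir" | tr -s ' '
}

# Usage (requires pdfgrep):
#   pdfgrep_compact bengaluru .
```

Each output line then takes the form file:page:text, with at most one space between words.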

4. Using ripgrep-all

We can use the rga command from the ripgrep-all utility to find patterns in PDF files as well as in other file formats. While installing the package is slightly cumbersome, using the command is very simple:

$ rga --type pdf bengaluru
sbc-2022-01-02.pdf
Page 1: Resv. Upto: KSR BENGALURU(SBC)
Page 1: To: KSR BENGALURU(SBC)

train-ticket.pdf
Page 1: From: KSR BENGALURU(SBC)
Page 1: Boarding At: KSR BENGALURU(SBC)

The command prints all the matches grouped by file name, along with the page number on which each match occurs.

5. Conclusion

In this article, we looked at different methods to search for a string or a pattern in a collection of PDF files. Our first method was a multi-step process. It involved running find to iterate over all the PDF files, converting each one with pdftotext, and then running grep on the result to find occurrences of the pattern. While this process is more involved, it uses commands that are already available on most systems by default.

As alternatives to the above method, we can also use pdfgrep and ripgrep-all. These are simpler, single-step alternatives, but they may not be installed on our system by default.
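Since availability varies between systems, a small wrapper can pick whichever tool is present. This is only a sketch under that assumption: first_available and pdf_search are hypothetical names, and the preference order (rga, then pdfgrep, then pdftotext) is an arbitrary choice:

```shell
# Print the first command from the argument list that is installed.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: search the PDFs under a folder with the first
# tool found, using the same commands shown earlier in the article.
pdf_search() {
    pattern=$1
    dir=${2:-.}
    case $(first_available rga pdfgrep pdftotext) in
        rga)
            rga --type pdf -i "$pattern" "$dir" ;;
        pdfgrep)
            pdfgrep -HiR "$pattern" "$dir" ;;
        pdftotext)
            find "$dir" -type f -name '*.pdf' -exec sh -c '
                pdftotext "$1" - | grep --with-filename --label="$1" -i -- "$2"
            ' sh {} "$pattern" \; ;;
        *)
            echo "no PDF search tool found" >&2
            return 127 ;;
    esac
}

# Usage:
#   pdf_search bengaluru .
```

The output format differs between the three tools, so a wrapper like this suits interactive use better than scripts that parse the results.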
