How to Extract Embedded Images From a PDF File in Linux

1. Introduction

PDF files serve as containers for various types of content, including text, images, and sometimes even interactive elements. Extracting text from them is usually straightforward, unless we’re dealing with image-based PDF files, which require OCR tools.

Extracting embedded images might seem more intricate than text extraction. However, Linux offers several tools that we can use to extract embedded images.

In this tutorial, we’ll look at the different command-line and graphical tools we can use to extract embedded images from PDF files.

2. Using pdfimages

pdfimages is a command-line utility that’s part of the Poppler software package, widely utilized in Linux environments for working with PDF files.

It’s specifically designed to locate and extract the images embedded in a PDF document. Moreover, it supports several image output formats and lets us limit the extraction to a range of pages.

pdfimages is preinstalled on many Linux distributions. If it isn’t available, we can install the poppler-utils package through the distribution’s package manager. For example, on Debian-based systems:

$ sudo apt-get install poppler-utils

After installation, let’s look at the basic syntax of pdfimages:

$ pdfimages <input.pdf> <output-prefix>

For example, let’s extract images from example.pdf:

$ pdfimages example.pdf baeldung_image

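The command above extracts all images from example.pdf and saves them with file names that start with baeldung_image, followed by a numerical suffix, for example, baeldung_image-000.ppm. By default, pdfimages writes PPM files for color and grayscale images and PBM files for monochrome ones.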
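To apply the same command to many files, we could wrap it in a small shell function. The following is only a sketch: extract_all_images and its two directory arguments are hypothetical names chosen for illustration, and it assumes pdfimages is on the PATH:

```shell
#!/usr/bin/env bash
# Sketch: run pdfimages on every PDF in a directory.
# extract_all_images, src_dir, and out_dir are hypothetical names.
extract_all_images() {
    local src_dir="$1"
    local out_dir="$2"
    local pdf name

    mkdir -p "$out_dir" || return 1

    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs
        [ -e "$pdf" ] || continue

        # Use the PDF name (without extension) as the output prefix
        name="$(basename "$pdf" .pdf)"
        pdfimages "$pdf" "$out_dir/$name"
    done
}
```

For instance, calling extract_all_images ./docs ./images would use each PDF’s base name as the output prefix inside ./images.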

We can also customize the extraction process by using options such as -f and -l to specify the range of pages:

$ pdfimages -f 1 -l 5 example.pdf baeldung_image

This command extracts images from pages one to five only. Furthermore, we can specify the extracted image format using different options.

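For example, we can use the -j option to save images that are stored in the PDF as JPEG (DCT) data directly as .jpg files, while any other images are still written in the default PPM or PBM format: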

$ pdfimages -j example.pdf baeldung_image
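Besides -j, recent Poppler releases of pdfimages also accept -png, -tiff, and -all, where -all keeps JPEG and similar images in their native format. As a rough sketch, we could map a friendly format name to the matching flag. The helper names pdfimages_format_flag and extract_as, as well as the format names they accept, are assumptions made for this example:

```shell
#!/usr/bin/env bash
# Sketch: translate a format name into a pdfimages flag.
# Both function names and the accepted format names are hypothetical.
pdfimages_format_flag() {
    case "$1" in
        png)    echo "-png" ;;
        tiff)   echo "-tiff" ;;
        jpeg)   echo "-j" ;;
        native) echo "-all" ;;
        *)      echo "unsupported format: $1" >&2; return 1 ;;
    esac
}

# Usage: extract_as <format> <input.pdf> <output-prefix>
extract_as() {
    local flag
    flag="$(pdfimages_format_flag "$1")" || return 1
    pdfimages "$flag" "$2" "$3"
}
```

With these helpers, extract_as png example.pdf baeldung_image would run pdfimages -png example.pdf baeldung_image.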

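Finally, we can combine these options in a single command:

$ pdfimages -png -f 1 -l 5 example.pdf baeldung_image

Here, we’re setting the image output format to PNG and limiting the extraction to pages one to five. Notably, pdfimages has no option for setting the output resolution: it writes each image at the pixel dimensions stored in the PDF.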
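To inspect the images before extracting them, pdfimages also offers the -list option, which prints a table with one row per image instead of writing any files. The sketch below parses that table. It assumes the layout printed by current Poppler versions, namely two header lines followed by rows whose first, fourth, and fifth columns are the page, width, and height. The function names are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: summarize the table printed by "pdfimages -list".
# Assumes two header lines, then one row per image with
# page in column 1, width in column 4, and height in column 5.

# Print the number of images found in the PDF
count_pdf_images() {
    pdfimages -list "$1" | awk 'NR > 2 { n++ } END { print n + 0 }'
}

# Print one "page N: WIDTHxHEIGHT" line per image
list_image_sizes() {
    pdfimages -list "$1" | awk 'NR > 2 { print "page " $1 ": " $4 "x" $5 }'
}
```

Running count_pdf_images example.pdf before a large extraction gives us a quick idea of how many files to expect.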

3. Using pdftohtml

pdftohtml is also a command-line utility that’s part of the Poppler utility suite in Linux systems. It parses the content within PDF files and attempts to represent it in HTML format, preserving the structure and layout as closely as possible.

Although pdftohtml primarily converts PDF files to HTML, it also extracts all images embedded in the PDF file.

Let’s use pdftohtml to extract all images from example.pdf:

$ pdftohtml example.pdf baeldung

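This command generates the HTML output, including an index HTML page with links to the converted pages, and saves the images it finds as separate image files next to it. Depending on the options we use, the pages are written either to a single HTML document or to one HTML file per page.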

We can also specify the page range to convert using the -f and -l options:

$ pdftohtml -f 1 -l 3 example.pdf baeldung

The command above will only convert pages one to three, subsequently extracting all images within those pages.
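Since pdftohtml writes HTML and image files side by side, it can help to send the output to a dedicated directory and then list only the images. The following sketch assumes the extracted images end in .png or .jpg, which may vary between Poppler versions. The function name extract_via_html and the page prefix are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: convert a PDF with pdftohtml into its own directory,
# then print the paths of the image files it produced.
# extract_via_html and the "page" prefix are hypothetical names;
# the .png/.jpg extensions are an assumption.
extract_via_html() {
    local pdf="$1"
    local out_dir="$2"

    mkdir -p "$out_dir" || return 1

    # Discard the progress messages so only image paths are printed
    pdftohtml "$pdf" "$out_dir/page" > /dev/null || return 1

    find "$out_dir" -type f \( -name '*.png' -o -name '*.jpg' \) | sort
}
```

For example, extract_via_html example.pdf ./html_out would leave the HTML files in ./html_out and print one image path per line.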

4. Using PDFmod

PDFmod is a graphical tool for modifying PDF documents. It can rearrange, delete, insert, and rotate pages, as well as extract the images they contain.

Since it offers a GUI for performing these tasks, it’s more suitable for users who prefer a more visual approach.

4.1. Installation

PDFmod isn’t installed by default. However, we can install it through the distribution’s package manager.

For example, on Debian, we can use APT:

$ sudo apt install pdfmod

Alternatively, on Arch Linux, we can use Pacman:

$ sudo pacman -S pdfmod

Finally, on Fedora, we can use DNF:

$ sudo dnf install pdfmod

After installation, let’s use PDFmod to extract images from a PDF file.

4.2. Usage

We can search for and open PDFmod from the applications menu:

[Figure: searching for PDFmod in the applications menu]

The main PDFmod window is simple, with a few dropdowns and icons that control different functionalities of the application.

Our first step is importing a sample PDF file by clicking on the File option and then selecting Open from the dropdown. Finally, we can select the PDF file from the system:

[Figure: importing a PDF file in PDFmod]

From here, we can select one or more pages and then right-click to bring up a menu that contains the Export Image option. Clicking this option extracts all images on the selected pages and exports them to a new directory within the current working directory.

5. Conclusion

In this article, we’ve looked at the different command-line and GUI tools we can use to extract images embedded in PDF files. pdfimages is one of the best tools for this task, since it was specifically designed for image extraction and offers us the flexibility to specify the image output format, the page range, and more.

We also used pdftohtml to export all images within a PDF file while converting the pages from PDF to HTML. Finally, PDFmod is a good fit for users who prefer a GUI.
