1. Introduction

Some seemingly binary files are actually compressed and encoded streams of text commands. Portable Document Format (PDF) files are examples of this.

In this tutorial, we consider the PDF format and explore ways to view and edit its original source code. First, we take a look at the portable document format and its older relative, PostScript. Next, we show samples of the two ways a PDF can be stored: as pure text and with binary streams. After that, we discuss the general idea and pitfalls behind PDF creation and editing. Finally, we delve into several tools that enable us to handle and repair PDF files.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. PostScript (PS)

Before going into PDF, we start with its much older relative. PostScript (PS) is a format, page description language (PDL), and printer control language (PCL). In other words, it can specify the design elements of pages and how they appear with human-readable text:

%!PS
/Courier
10 selectfont
100 666 moveto
(Baeldung) show
showpage

This sample PS file begins with a header similar to the shebang in Linux. The next two lines choose the Courier font and select it in size 10 with selectfont. Actually, font handling is one of the most complex activities of PostScript and, by extension, PDF.

After that, moveto specifies a location, where show writes Baeldung. To display the page, we finish with showpage.

Currently, PostScript 3 (PS3) is the latest iteration from 1997, as described in the PostScript Language Reference Third Edition (ISBN 0-201-37922-8).

Since PostScript is the basis of many other formats such as Encapsulated PostScript (EPS) and PDF files, we can convert them to and from PS. In fact, the main difference between PDF and PS is the former’s lack of a general-purpose programming language backbone. One can very roughly compare PDF’s static structure to that of the HyperText Markup Language (HTML), unlike PS, which can compute graphics dynamically.

3. Portable Document Format (PDF)

The Portable Document Format (PDF) is a universal and portable way to view and transfer structured data:

  • text
  • links
  • buttons
  • forms and form fields
  • audio and video

Of course, the main aim is to have a standard that defines how all of these are to be embedded in a single file so that software on any operating system (OS) and hardware can handle them. In a way, PDF is also a PDL and PCL.

There have been many PDF specification iterations over the years. The first one was created by Adobe in 1993, but the format has been an ISO standard since 2008.

4. Sample Pure-Text PDF Structure

Essentially, PDF files consist of pages, described by objects. This collection of objects is the format’s backbone and plays a key role in the presentation, so its structure is well-defined.

Let’s create a sample pure-text PDF:

%PDF-1.1
%¥±ë

1 0 obj
  << /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
  << /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>
endobj

3 0 obj
  <<
    /Type /Page
    /Parent 2 0 R
    /Resources
    <<
      /Font
      <<
        /F1
        << /Type /Font /Subtype /Type1 /BaseFont /Times-Roman >>
      >>
    >>
    /Contents 4 0 R
  >>
endobj

4 0 obj
  << /Length 55 >>
stream
  BT
    /F1 18 Tf
    0 0 Td
    (Hello World) Tj
  ET
endstream
endobj

xref
0 5
0000000000 65535 f 
0000000015 00000 n 
0000000077 00000 n 
0000000179 00000 n 
0000000433 00000 n 
trailer
<< /Root 1 0 R /Size 5 >>
startxref
541
%%EOF

For example, this simple one-page PDF has four objects:

  1. Object 1 0
  2. Object 2 0
  3. Object 3 0
  4. Object 4 0

Each of these objects has a dictionary (<<>>) of key-value pairs like /Type /Page and /MediaBox [0 0 300 144]. The latter assigns an array to /MediaBox, which defines the page boundaries and thus the page size. Moreover, values can be indirect references to other objects, such as 2 0 R and 3 0 R.
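As a rough illustration of how object numbers and references fit together, the following Python sketch lists references that point to objects missing from a file. The function name dangling_references is hypothetical, and the regular expressions assume an uncompressed, hand-readable PDF like the sample above. A real parser would also need to skip strings and binary streams:

```python
import re

# "N G obj" at the start of a line defines an object.
DEFINITION = re.compile(rb"^(\d+) (\d+) obj\b", re.M)
# "N G R" anywhere refers to an object.
REFERENCE = re.compile(rb"\b(\d+) (\d+) R\b")


def dangling_references(pdf: bytes) -> set:
    """Return (number, generation) pairs that are referenced but never defined."""
    defined = {(int(n), int(g)) for n, g in DEFINITION.findall(pdf)}
    used = {(int(n), int(g)) for n, g in REFERENCE.findall(pdf)}
    return used - defined


if __name__ == "__main__":
    sample = b"""1 0 obj
  << /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
  << /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>
endobj
"""
    # Object 3 0 is referenced by /Kids but not included in this excerpt.
    print(dangling_references(sample))  # prints {(3, 0)}
```

An empty result for a complete file suggests that every reference has a target, at least as far as this simple text scan can tell.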

Further, the xref reference table is an index for objects. The first line of the table sets the first object number (0) and the number of entries that follow (5): one for each of the four objects plus the special free entry for object 0. Each following line describes a successive object (by number, starting from the first) and has the same structure:

  1. Offset in bytes from the start of the file to the beginning of the object, padded to ten digits
  2. Generation number, matching the one after the object number (most often 0)
  3. Either n if the object is in use or f otherwise

Offsets are strict, so any change that adds or removes bytes in a PDF file may corrupt it unless we update the table. On the other hand, since PDF 1.5, a compressed cross-reference stream can replace the classic xref table. Further, copying the code above to a new text file creates a valid PDF for version 1.1, as defined by the mandatory first line.
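To see why offsets matter after a manual edit, we can compare the offsets recorded in the table with the places where objects actually start. The Python sketch below is only an approximation under several assumptions: the file is uncompressed, uses LF line endings, has a single classic xref table with one subsection, and starts every object header on its own line. All function names are hypothetical:

```python
import re

OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.M)
XREF_HEADER = re.compile(rb"^xref\r?\n(\d+) (\d+)\r?\n", re.M)
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")


def actual_offsets(pdf: bytes) -> dict:
    """Map each object number to the byte offset of its 'N G obj' line."""
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(pdf)}


def table_offsets(pdf: bytes) -> dict:
    """Map each in-use object number to the offset recorded in the xref table."""
    header = XREF_HEADER.search(pdf)
    if header is None:
        raise ValueError("no classic xref table found")
    first, count = int(header.group(1)), int(header.group(2))
    entries = XREF_ENTRY.findall(pdf, header.end())[:count]
    return {
        first + index: int(offset)
        for index, (offset, _generation, kind) in enumerate(entries)
        if kind == b"n"  # free entries (f) do not point to an object
    }


def stale_entries(pdf: bytes) -> dict:
    """Return {object number: (offset in table, offset found)} for mismatches."""
    found = actual_offsets(pdf)
    return {
        number: (recorded, found.get(number))
        for number, recorded in table_offsets(pdf).items()
        if found.get(number) != recorded
    }
```

Calling stale_entries() on the bytes of a file, read in binary mode, returns an empty dictionary when every in-use entry points at its object. Any other result lists the entries that an edit has shifted.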

5. Sample Binary PDF

The optional second line of PDF files often contains several non-ASCII characters like %¥±ë. These serve as a hint to processing software that a PDF file is best read as binary data.
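The PDF specification recommends at least four bytes with values of 128 or higher in this comment. As a small sketch, assuming the marker sits on the second line as in our samples, we can test for it in Python (the function name has_binary_marker is hypothetical):

```python
import re


def has_binary_marker(pdf: bytes) -> bool:
    """Check for a second-line comment with at least four bytes of value 128 or more."""
    lines = re.split(rb"\r\n|\r|\n", pdf, maxsplit=2)
    if len(lines) < 2:
        return False
    second = lines[1]
    return second.startswith(b"%") and sum(byte >= 0x80 for byte in second) >= 4
```

Reading the file in binary mode matters here, since decoding it as text first could change or drop exactly these bytes.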

In practice, a pure-text representation is unusual. For example, we can have the sample PDF file from earlier, but in binary form:

%PDF-1.1
%âãÏÓ
1 0 obj

<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj

<< /Kids [3 0 R] /Type /Pages /Count 1 /MediaBox [0 0 300 144] >>
endobj

3 0 obj

<<
/Contents 4 0 R
/Type /Page
/Resources 
<<
/Font 
<<
/F1 
<< /Subtype /Type1 /Type /Font /BaseFont /Times-Roman >>
>>
>>
/Parent 2 0 R
/MediaBox [0 0 300 144]
>>
endobj

4 0 obj

<< /Length 52 /Filter /FlateDecode >>
stream
xœSPp
áR }7CC…40Ï CRÀL
Ôœœ|…ðü¢œM…,  k +
endstream 
endobj
xref
0 5
0000000000 65535 f 
0000000015 00000 n 
0000000066 00000 n 
0000000149 00000 n 
0000000331 00000 n 
trailer

<< /Root 1 0 R /Size 5 >>
startxref
456
%%EOF

In this case, we see that the content stream of object 4 0 is stored in a binary representation (here, compressed with Flate, i.e., zlib/deflate, as declared by /Filter /FlateDecode), which makes the file smaller.

However, we’re no longer able to directly see or edit the PDF’s previous contents:

4 0 obj
  << /Length 55 >>
stream
  BT
    /F1 18 Tf
    0 0 Td
    (Hello World) Tj
  ET
endstream
endobj

Here, we begin a text segment at BT, set the current font and size via Tf, position the text with Td, and write Hello World with Tj, before ending the text segment at ET. Without seeing the commands, it’s much harder to modify them.
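Since a /FlateDecode stream holds zlib-compressed data, a few lines of Python can recover operators like these. The sketch below is a simplification with hypothetical names: it assumes a flat stream dictionary, a direct integer /Length, and a single filter without predictors, and zlib.decompress() raises an error for anything else:

```python
import re
import zlib

# A flat stream dictionary followed by the "stream" keyword and one end-of-line.
STREAM_START = re.compile(rb"<<([^<>]*)>>\s*stream\r?\n")
# A direct integer /Length, not an indirect reference such as "/Length 5 0 R".
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)")


def inflate_streams(pdf: bytes) -> list:
    """Return the decompressed data of every /FlateDecode stream found."""
    results = []
    for match in STREAM_START.finditer(pdf):
        dictionary = match.group(1)
        length = DIRECT_LENGTH.search(dictionary)
        if b"/FlateDecode" not in dictionary or length is None:
            continue  # not Flate-compressed, or length stored elsewhere
        start = match.end()
        raw = pdf[start:start + int(length.group(1))]
        results.append(zlib.decompress(raw))
    return results


if __name__ == "__main__":
    # Build a compressed version of the content stream, then read it back.
    content = b"BT\n/F1 18 Tf\n0 0 Td\n(Hello World) Tj\nET\n"
    packed = zlib.compress(content)
    pdf = (
        b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
        + packed
        + b"\nendstream\nendobj\n"
    )
    print(inflate_streams(pdf)[0].decode("ascii"))
```

This only reads the data. Writing changed content back also means updating /Length and every later offset, which is where dedicated tools help.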

Of course, other objects like media might not have a purely textual representation. Still, how can we convert the operators of a PDF file and as much of its content as possible to editable ASCII text?

6. PDF Creation and Editing

Due to the format’s ubiquity, many tools can generate PDF files. Thus, it’s up to the creator of the original content as well as their tool of choice to save or export the file in a given way.

For example, some Adobe products offer options to save a PDF uncompressed, i.e., without compression. The same goes for many open-source tools, which use any of the libraries we discuss here. Compression reduces the size of a PDF file via specific encodings that convert a pure-text object to a binary stream, which hides the source PDF operators from direct view. In addition, compressed objects can carry their own additional encodings and compressions.

By decompressing, we end up with a file much like the pure-text sample PDF from earlier. Thus, after decompression, we can simply use an editor like vi as long as it can handle large files and preserve binary data:

$ vi /file.pdf
%PDF-1.1
%¥±ë

1 0 obj
  << /Type /Catalog /Pages 2 0 R >>
endobj
[...]

After some types of edits, we might have to repair the resulting PDF to avoid errors from strict PDF viewers due to object offsets and other references.

Of course, a lot of libraries also have stand-alone tools for this and other purposes. Most are built with a command-line interface (CLI), but some have a graphical user interface (GUI) as well.

7. PDF Toolkit (PDFtk)

To begin with, we compressed our sample pure-text PDF to a part-binary PDF via PDFtk.

We can usually install the latest version of PDFtk from the pdftk package:

$ apt-get install pdftk
[...]
$ apt-cache policy pdftk
pdftk:
  Installed: 2.02-5+b1
[...]
$ pdftk --version
pdftk port to java 3.2.2 a Handy Tool for Manipulating PDF Documents
[...]

Now, let’s continue with some of its options.

7.1. PDFtk Features

This PDF toolkit includes the pdftk utility enabling us to perform many operations as subcommands:

  • cat – merge, split, or rotate pages
  • shuffle – collate pages
  • burst – split into pages
  • rotate – rotate given pages
  • generate_fdf – generate FDF file for automatic form filling
  • fill_form – fill forms with FDF
  • background, multibackground – place watermark or watermarks under page contents
  • stamp, multistamp – place watermark or watermarks over pages
  • dump_data, dump_data_utf8 – report metadata, bookmarks, page metrics, and others (optionally in UTF-8)
  • dump_data_fields, dump_data_fields_utf8 – get form field statistics (optionally in UTF-8)
  • dump_data_annots – get link annotations
  • update_info, update_info_utf8 – set metadata and bookmarks (optionally in UTF-8)
  • attach_files – pack files in PDF
  • unpack_files – unpack files from a PDF

Besides these, pdftk offers several security options for passwords and encryption.

7.2. Compress and Uncompress With pdftk

We can also use pdftk to compress and uncompress with the respective output option:

$ pdftk in.pdf output out.pdf compress

Actually, we compressed our sample PDF and object 4 0 in it with this exact command line. Naturally, performing the reverse operation yields a text-only PDF:

$ pdftk in.pdf output out.pdf uncompress

Moreover, pdftk accepts the compress and uncompress options with many of its subcommands, which affects their output.

7.3. Repair PDF With pdftk

Repairing a PDF with pdftk is simple, but not always effective:

$ pdftk in.pdf output out.pdf

In essence, we just pass our file through the tool.

8. MuPDF Ecosystem

MuPDF has an ecosystem with an open-source software library, command-line tools, and viewers. It even has a dedicated MuPDF Explored book.

On most platforms, the viewer is in the mupdf package, while the command-line tools are in the mupdf-tools package:

$ apt-get install mupdf-tools
[...]
$ apt-cache policy mupdf-tools
mupdf-tools:
  Installed: 1.17.0+ds1-2
[...]

Now, let’s explore MuPDF further.

8.1. mutool Features

Among the MuPDF tools is the mutool utility with its subcommands:

  • draw – convert documents to images (among others) with lots of options
  • convert – convert documents into other formats simply
  • trace – debugging tool for tracing
  • show – show internal PDF objects
  • extract – extract resources like images and embedded fonts
  • clean – fix PDF by rewriting it in a potentially human-readable form
  • merge – merge pages
  • poster – subdivide pages into pieces
  • create – use a text file with commands to create a PDF
  • sign – digital signature operations
  • info – get page object details
  • pages – get page media box, artbox, and others
  • run – versatile but complex universal tool that runs JavaScript code to perform any action

Unlike pdftk, mutool requires a subcommand on each run. By default, mutool preserves the original way a PDF is structured as long as we don’t explicitly request changes that alter that.

8.2. Compress and Uncompress With mutool

To compress or decompress, we can use clean:

$ mutool clean -d -z -gggg -i -a in.pdf out.pdf

As long as we don’t opt to preserve some, the clean command of mutool with its -d flag decompresses all streams, while potentially performing other optimizations:

  • -z – deflate (compress) any uncompressed streams
  • -i – compress or leave compressed image streams
  • -a – use ASCII hex to encode any binary streams
  • -g – remove unused objects
  • -gg – same as -g, plus compact the xref table
  • -ggg – same as -gg, plus merge duplicate objects
  • -gggg – same as -ggg, plus deduplicate streams
  • -s – clean and streamline content streams

In fact, we skip only a few options:

  • -l – reorder contents and objects as they are referenced by page (quick loading)
  • -f – compress or leave compressed font streams
  • -p – password, if needed

After any edits, we can also repair our file.

8.3. Repair PDF With mutool

The repair mechanism of clean is very effective in many circumstances, even after custom edits:

$ mutool clean in.pdf out.pdf

Considering the versatility of the MuPDF ecosystem, its licensing (GNU AGPL or a commercial license) might be its main drawback.

9. QPDF Tool

QPDF is an open-source PDF CLI toolset for PDF file transformations.

On most platforms, we can install QPDF from the qpdf package:

$ apt-get install qpdf
[...]
$ apt-cache policy qpdf
qpdf:
  Installed: 10.1.0-1
[...]
$ qpdf --version
qpdf version 10.1.0
Run qpdf --copyright to see copyright and license information.

Next, let’s see what QPDF can offer.

9.1. QPDF Features

As the main tool in the package, qpdf performs many tasks:

  • --linearize – reorder contents and objects as they are referenced by page (quick loading)
  • --compress-streams=[n|y] – toggle compression of streams
  • --decode-level=parameter – decompress and decode given streams
  • --stream-data=parameter – preset combinations of --compress-streams and --decode-level
  • --qdf – rewrite the file for viewing and editing
  • --collate[=n] – collate pages, optionally by groups of n
  • --split-pages – split into pages
  • --overlay – overlay the pages of one file over another
  • --underlay – underlay the pages of one file under another
  • --rotate – rotate pages
  • encryption
  • embedding and attaching files
  • extraction of data such as media, metadata, object information, and more, even as JSON

For each of the above and more, qpdf offers many general, advanced control, and subcommand-specific options as well.

9.2. QDF Mode and Decompression With qpdf

Importantly, QPDF offers the QDF mode via its --qdf option, which produces a PDF specifically for the viewing and modification of its code (not page contents) within a text editor:

$ qpdf --qdf in.pdf out.pdf

In fact, the mode is a way of processing unique to this toolkit:

  • it’s incompatible with --linearize
  • all streams that can be uncompressed are uncompressed
  • content streams are normalized
  • encrypted files are decrypted
  • objects are restructured to be more readable, albeit less efficient
  • hinting comments are added

On the other hand, we can skip --qdf in favor of just combining other options:

$ qpdf --decode-level=all --compress-streams=n in.pdf out.pdf

Here, --compress-streams=n decompresses streams or just leaves them uncompressed, while --decode-level=all ensures this is done for all streams. We can achieve a similar effect, but only for generalized streams, via the older --stream-data=uncompress option.

Moreover, by adding the --json and --json-stream-data=file options, we can output data in a JSON format as well.

9.3. Repair PDF With fix-qdf

After any changes, we can use the included fix-qdf command to repair hand-edited QDF files, partially based on the changes --qdf introduces:

$ fix-qdf in.pdf > out.pdf

Still, fix-qdf expects the layout that the QDF mode produces, so it may refuse or fail on PDF files that aren’t in QDF form.

10. Ghostscript

Ghostscript is a family of open-source products for PDF, PostScript, and other files:

  • Ghostscript PDF and PS interpreter
  • GhostPDF, a PDF interpretation component, currently available as an old PostScript-based and new C-based version
  • GhostPDL, the umbrella term for all Ghostscript products
  • GhostPCL PCL and PXL interpreter
  • GhostXPS XPS interpreter
  • font information

First, let’s install Ghostscript via the ghostscript package:

$ apt-get install ghostscript
[...]
$ apt-cache policy ghostscript
ghostscript:
  Installed: 9.53.3~dfsg-7+deb11u2
[...]
$ ghostscript --version
9.53.3

Now, let’s focus on the main Ghostscript tool.

10.1. Ghostscript PDF Interpreter Features

The gs (gswin32 or gswin64 on Microsoft Windows) Ghostscript interpreter works at a low level, which means it doesn’t necessarily provide single subcommands for its many abilities. Instead, gs has output devices with options, which we can combine to perform tasks. Actually, this is one of its main strengths.

For example, the pdfwrite, ps2write, and eps2write PDF and PostScript output devices are very versatile as they can do most of what we already discussed about other toolsets:

  • merge files
  • split files with OutputFile, %d, -dFirstPage, and -dLastPage
  • rotate pages
  • change PDF version
  • change PDF type
  • embed fonts
  • compress and decompress fonts
  • compress and decompress streams
  • compress and decompress pages
  • convert colors
  • change resolution
  • change page position
  • many other options

Critically, due to its low-level operation, Ghostscript doesn’t preserve the original input file but instead creates a new one through the requested virtual device. While the file’s appearance might be the same, gs changes the way it’s achieved. On the other hand, commands like mutool do preserve the contents as long as we don’t explicitly request such structure modifications.

10.2. Compress and Decompress PDF With gs and Ghostscript

Just like other tools, gs can decompress many elements of a PDF:

$ gs -dNOSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.7 -dEmbedAllFonts=true -dCompressEntireFile=false -dCompressStreams=false -dCompressPages=false -dCompressFonts=false  -sOutputFile="out.pdf" -f "in.pdf"

Let’s break down this command:

  • -dNOSAFER – enable changes to the filesystem when allowed
  • -dNOPAUSE – disable pausing on each page
  • -dBATCH – exit after processing all files
  • -sDEVICE=pdfwrite – select pdfwrite as the initial output device
  • -dCompatibilityLevel – set format version of output
  • -dEmbedAllFonts=true – embed all fonts
  • -dCompressEntireFile=false – apply no additional compression
  • -dCompressStreams=false – decompress non-font and non-page streams
  • -dCompressPages=false – decompress page content streams
  • -dCompressFonts=false – decompress embedded fonts
  • -sOutputFile – select output file for the output device
  • -f – make supplying input file name(s) safer

Since Ghostscript generates a new PDF file, it’s possible that the uncompressed streams don’t contain their exact original data.

Further, the decompression mechanisms of gs are generally not as advanced or universal as those of other solutions.

10.3. Repair PDF With gs and Ghostscript

Importantly, Ghostscript has no repair facilities and, unlike most readers and other tools, is highly intolerant of syntax and specification problems. Even so, gs itself can sometimes cause issues:

  • forms not working
  • incomplete fonts
  • missing characters or glyphs
  • missing ligatures

Moreover, the official Ghostscript website hosts the MuPDF Explored book and its clean subcommand chapter.

Of course, there are other tools that can perform a repair but can’t toggle compression.

11. Poppler Tools

Poppler is an open-source fork of Xpdf, based on the Xpdf 3.0 codebase, and it’s now usually preferred to the latter.

Thus, to install Poppler, we can use the poppler-utils package on most and the xpdf package on some distributions:

$ apt-get install poppler-utils
[...]
$ apt-cache policy poppler-utils
poppler-utils:
  Installed: 20.09.0-3.1+deb11u1

Poppler is both a library with several API options and a suite of tools that leverage them. On the back end, it works with Cairo, Splash, or even the incomplete and mostly abandoned Arthur.

11.1. Poppler Features

While Poppler doesn’t provide a way to control PDF file compression, it does provide stable stand-alone utilities when it comes to many other PDF operations:

  • pdfattach – embed attachments
  • pdfdetach – extract attachments
  • pdffonts – get font information
  • pdfimages – extract all images
  • pdfinfo – get metadata like page sizes, numbers, encryption, and others
  • pdfseparate – extract pages
  • pdftocairo – convert to PostScript, vector, and bitmap via Cairo, handling aspects of the conversion
  • pdftohtml – convert to HTML
  • pdftoppm – convert to bitmap
  • pdftops – convert to PostScript
  • pdftotext – extract text
  • pdfunite – merge files

Unlike its close competitor MuPDF, Poppler has no JavaScript support. On the other hand, each of its tools provides multiple configuration options.

11.2. Repair PDF With pdftocairo and Poppler

In addition to enabling conversions, pdftocairo can be very helpful when it comes to standardizing PDF files:

$ pdftocairo -pdf in.pdf out.pdf

Similar to the output of qpdf --qdf, the -pdf option of pdftocairo reliably produces PDF files with common characteristics such as structure and object specifics.

Because of this and the PDF specification tolerance of the tools, we can employ pdftocairo -pdf as a stable and comprehensive way to fix and repair a problematic PDF.
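Since no single repair method works for every file, we might try several in turn. The following Python sketch chains the repair commands from sections 7.3, 8.3, and this one. The function names and the order of attempts are assumptions for illustration, and a zero exit status is only a rough sign of success:

```python
import shutil
import subprocess


def repair_commands(src: str, dst: str) -> dict:
    """Repair command lines as shown in the respective sections, in the order tried."""
    return {
        "pdftocairo": ["pdftocairo", "-pdf", src, dst],
        "mutool": ["mutool", "clean", src, dst],
        "pdftk": ["pdftk", src, "output", dst],
    }


def repair(src: str, dst: str, which=shutil.which, run=subprocess.run):
    """Try each installed tool until one exits successfully; return its name or None."""
    for tool, argv in repair_commands(src, dst).items():
        if which(tool) is None:
            continue  # tool not installed, try the next one
        result = run(argv, capture_output=True)
        if result.returncode == 0:
            return tool
    return None
```

For instance, repair("in.pdf", "out.pdf") returns the name of the first tool that succeeded, or None if none did. The which and run parameters exist only so that the lookup and execution can be swapped out, e.g., for a dry run.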

12. Xpdf Tools

Like its codebase relative Poppler, the Xpdf family is open source, and it contains both a reader and command-line tools.

The xpdf PDF reader is usually in the xpdf package, while the xpdf-utils package contains the CLI utilities. Yet, depending on the Linux distribution, this xpdf-utils package can often just be an alias for poppler-utils.

Consequently, getting and installing the Xpdf toolset might be best done by downloading the package from the official website with a tool like wget or curl:

$ wget https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz

After that, we can simply unpack with tar, copy, and use as necessary. Since Xpdf provides its own versions of pdftops, pdftotext, pdftohtml, pdfinfo, pdffonts, pdfdetach, pdftoppm, pdftopng, and pdfimages, this approach also avoids conflicts with Poppler.

Further, the Xpdf tools don’t include the coveted pdftocairo, and the features they offer are a subset of those of their Poppler relatives. Hence, most Linux distributions replace xpdf-utils with poppler-utils.

13. Summary

In this article, we discussed the PDF file format, how to view its contents, as well as tools that can handle and manipulate it under Linux.

In conclusion, while we can open PDF files with a text editor, pre- and postprocessing can be critical for reading all contents and properly performing any edits to leave a valid file.