I recently had to deal with some malformed PDF files. The files were chapters, each around 150 pages in length, from a big technical document. The problem was that the pages were all out of order in a seemingly random fashion. This made reading the document very difficult.
Thankfully, each page had a page number printed on it, and for the most part these were three- or four-digit numbers, so they were pretty searchable. There were often references to page numbers within the text, however, so a search for a particular page number could give multiple hits.
The page numbers were always at the bottom of each page. To deal with the issue of page references in the main text, I used pdfjam to crop each page to its bottom few inches, where the page number appears but no body text does. I saved the result to a temporary file using the following command:
pdfjam input.pdf --trim '0in 0in 0in 8in' --clip true --papersize '{8.27in,3.7in}' -o test.pdf
The above command trims 8 inches from the top of the page, leaving only the bottom 3.7 inches, and resizes the resultant document to 8.27 by 3.7 inches. (This document was originally A4 sized, hence the unusual dimensions.)
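The trim arithmetic above can be sketched in Python. This is only an illustration: the helper name `bottom_strip_args` and its parameters are my own, and it assumes an A4 source page. It gives 3.69 inches for the remaining strip, which the command above rounds to 3.7.

```python
# A4 page size in inches (210 mm x 297 mm, 25.4 mm per inch).
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4


def bottom_strip_args(trim_top_in, width_in=A4_WIDTH_IN, height_in=A4_HEIGHT_IN):
    """Return (trim, papersize) strings that keep only the bottom strip.

    Hypothetical helper: the name and signature are illustrative only.
    """
    remaining = height_in - trim_top_in
    if remaining <= 0:
        raise ValueError("trim exceeds page height")
    # The trim value is ordered 'left bottom right top', so only the last
    # entry (the top) is non-zero here.
    trim = f"0in 0in 0in {trim_top_in:g}in"
    papersize = f"{{{width_in:.2f}in,{remaining:.2f}in}}"
    return trim, papersize


trim, papersize = bottom_strip_args(8)
print(trim)       # 0in 0in 0in 8in
print(papersize)  # {8.27in,3.69in}
```

For a different page size or a taller footer, only the arguments change; the two strings slot into `--trim` and `--papersize` respectively.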
Once I had the trimmed document with only the bottoms containing the page numbers, I used pdfgrep to search for each page in the page range. This first required finding the minimum and maximum page numbers in the chapter, which was basically a guess-and-check process in macOS Preview, searching for max/min page numbers (does this chapter go to page 2670? 2680? 2690?). This process ended up being pretty quick.
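The guess-and-check step could probably be automated as well. Here is a rough sketch, assuming the text of the cropped file has already been dumped as `pdfpage:text` lines (the shape that `pdfgrep -n` prints). The function name `printed_page_range` and the sample lines are made up for illustration, and any stray three- or four-digit number left in the footer would skew the result.

```python
import re


def printed_page_range(lines, min_digits=3, max_digits=4):
    """Return (lowest, highest) printed page number seen in 'pdfpage:text' lines.

    Hypothetical helper: it assumes every standalone 3- or 4-digit number
    in the cropped text is a printed page number.
    """
    pattern = re.compile(rf"(?<!\d)\d{{{min_digits},{max_digits}}}(?!\d)")
    found = []
    for line in lines:
        # Drop the leading 'pdfpage:' prefix and keep only the matched text.
        _, _, text = line.partition(":")
        found.extend(int(m) for m in pattern.findall(text))
    if not found:
        raise ValueError("no page numbers found")
    return min(found), max(found)


# Made-up sample standing in for real output from the cropped file.
sample = ["1:2577", "2:2575", "3:Page 2576", "4:2724"]
print(printed_page_range(sample))  # (2575, 2724)
```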
Then I wrote a little Python script to search for each page number in a loop. If there was a single match for the page number, I ran pdfgrep again to get the position of that match in the PDF, i.e. where the page currently sits in the shuffled file. If there was no match, or multiple matches, the script wrote out a “missing” message for that page. This happens when, e.g., one of the pages is rotated and the page number is no longer at the bottom of the page.
#!/usr/bin/env python
import subprocess

pages = range(2575, 2725)  # printed page numbers: min through max (range end is max + 1)
curloc = []  # current PDF position of each printed page, in reading order
file = "test.pdf"

for page in pages:
    pagestr = str(page)
    # Count the matches for this printed page number in the cropped file.
    a = subprocess.run(
        ["pdfgrep", "-c", "-n", pagestr, file], capture_output=True, text=True
    )
    if int(a.stdout.rstrip()) == 1:
        # Exactly one match: the text before the first ":" is its PDF position.
        b = subprocess.run(
            ["pdfgrep", "-n", pagestr, file], capture_output=True, text=True
        )
        curloc.append(b.stdout.split(":")[0])
        print(pagestr, b.stdout.split(":")[0])
    else:
        print(pagestr, "missing")

print(",".join(curloc))
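The matching logic can also be tried out without calling pdfgrep at all, which makes it easier to test. The sketch below is a variation on the script, not a copy of it: it works on `pdfpage:text` lines held in memory, matches whole numbers rather than substrings, and counts several hits on the same PDF page as one location. The name `locate_pages` and the sample data are invented for the example.

```python
import re
from collections import defaultdict


def locate_pages(wanted, lines):
    """Map each printed page number to its PDF position, or None if not unique.

    Hypothetical helper operating on 'pdfpage:text' lines.
    """
    hits = defaultdict(set)
    for line in lines:
        pdf_page, _, text = line.partition(":")
        for number in re.findall(r"\d+", text):
            hits[int(number)].add(pdf_page)
    result = {}
    for page in wanted:
        locations = hits.get(page, set())
        # Accept a page only when it was found in exactly one place.
        result[page] = next(iter(locations)) if len(locations) == 1 else None
    return result


# Made-up sample: 2577 appears twice, 2578 not at all.
lines = ["1:2576", "2:2575", "3:2577", "4:2577"]
located = locate_pages(range(2575, 2579), lines)
order = [loc for loc in located.values() if loc is not None]
missing = [page for page, loc in located.items() if loc is None]
print(",".join(order))  # 2,1
print(missing)          # [2577, 2578]
```

Collecting the missing pages in a list, rather than only printing them, makes it easier to see which positions still need to be filled in by hand.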
The Python script prints the current position of each page, listed in order of increasing printed page number. We can take this order and pass it to pdfjam again to rebuild the original PDF with the correct order:
pdfjam input.pdf '1,2,3,111,112,113,114,115,93,94,92,95,96,97' -o out.pdf
The above command (vastly shortened from those used on the actual files, which would have ~150 comma-separated page numbers printed by the Python script inserted in the argument after input.pdf) will rebuild our file in the correct order.
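With ~150 entries it is easy to drop or repeat a page, so a quick sanity check before running the command may be worthwhile. A minimal sketch follows; the function name `build_pdfjam_command`, the `total_pages` parameter, and the default file names are assumptions for illustration. It only builds the argument list and does not run pdfjam itself.

```python
def build_pdfjam_command(order, total_pages, src="input.pdf", dst="out.pdf"):
    """Return the pdfjam argument list for a reorder, after sanity checks.

    Hypothetical helper: `order` is the list of PDF positions printed by
    the script, `total_pages` is the page count of the source file.
    """
    pages = [int(p) for p in order]
    out_of_range = sorted({p for p in pages if not 1 <= p <= total_pages})
    if out_of_range:
        raise ValueError(f"PDF pages out of range: {out_of_range}")
    duplicates = sorted({p for p in pages if pages.count(p) > 1})
    if duplicates:
        raise ValueError(f"PDF pages used more than once: {duplicates}")
    unused = sorted(set(range(1, total_pages + 1)) - set(pages))
    if unused:
        raise ValueError(f"PDF pages never used: {unused}")
    return ["pdfjam", src, ",".join(str(p) for p in pages), "-o", dst]


cmd = build_pdfjam_command(["3", "1", "2"], total_pages=3)
print(cmd)  # ['pdfjam', 'input.pdf', '3,1,2', '-o', 'out.pdf']
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)`. A "never used" error is the cue that some page was reported as missing and still needs to be placed manually.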