Extracting text-based PDF screenplays¶
Occasionally you’ll get annoying PDFs that you can’t parse directly from a language like C, PHP, or Java. If the file is a text-based export from something like Final Draft, you can use pdftotext, a free and open-source command-line tool that originated in Xpdf and is now distributed with Poppler.
On OS X, it is part of the poppler package:
brew install poppler
On Linux, it is part of the poppler-utils package:
yum install poppler-utils # CentOS
apt-get install poppler-utils # Debian
Convert from the command line (+/- password)¶
Usage from the command line using a file without password protection:
pdftotext script.pdf script.txt
And WITH password encryption:
pdftotext -upw 'password' script.pdf script.txt
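And called from a back-end web server process, e.g. in PHP under Laravel:
// escapeshellarg() quotes each value so a password or path cannot inject shell syntax
$cmd = 'pdftotext -layout -upw '.escapeshellarg($password_text).' '.escapeshellarg($pdf_file_path).' '.escapeshellarg($output_txt_path);
exec($cmd, $pdftotext_output, $exit_code);
// Read the text only if pdftotext exited cleanly
$contents = ($exit_code === 0) ? file_get_contents($output_txt_path) : false;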
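The same call can be made from other back-end languages. Below is a minimal Python sketch, assuming pdftotext is on the PATH; the function names `build_pdftotext_cmd` and `extract_text` are hypothetical helpers introduced here for illustration. Passing the arguments as a list, with no shell involved, sidesteps the quoting problems that a concatenated command string can run into.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, txt_path: str,
                        password: Optional[str] = None,
                        layout: bool = True) -> List[str]:
    """Return the pdftotext argument list for one conversion."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")        # keep the physical page layout
    if password:
        cmd += ["-upw", password]    # user password, passed as its own argument
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path: str, txt_path: str,
                 password: Optional[str] = None) -> str:
    """Run pdftotext and return the extracted text."""
    # A list of arguments (no shell) means spaces or quotes in the
    # password and paths are never interpreted as shell syntax.
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, password), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

With `check=True`, a non-zero exit code from pdftotext (for example, a wrong password) raises `subprocess.CalledProcessError` instead of silently returning an empty or stale file.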
There are also plenty of packages for this available on npm.
Extracting image-based PDF screenplays¶
More often, especially with older scripts, you’ll have a PDF that contains image scans of each page. Script vendors often do this, along with disabling printing and so on, thinking it’s a form of “copy protection”.
For this type of file, you’re going to need Optical Character Recognition (OCR).
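One rough way to tell the two kinds of PDF apart is to run pdftotext first and check how much text comes back. The sketch below is a heuristic, not a definitive test: it assumes pdftotext's default behaviour of ending each page with a form feed character, and the function names and the `min_chars` and `threshold` values are arbitrary illustrative choices.

```python
from typing import List


def split_pages(pdftotext_output: str) -> List[str]:
    """Split pdftotext output into per-page chunks on form feeds."""
    pages = pdftotext_output.split("\f")
    # The form feed after the last page leaves a trailing empty chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def looks_image_based(pdftotext_output: str, min_chars: int = 20,
                      threshold: float = 0.5) -> bool:
    """Guess whether a PDF is scanned images, judging by how little text it yields."""
    pages = split_pages(pdftotext_output)
    if not pages:
        return True  # no text at all came back
    # Count pages with fewer than min_chars non-whitespace characters.
    empty = sum(1 for page in pages if len("".join(page.split())) < min_chars)
    return empty / len(pages) >= threshold
```

If this returns `True`, the file probably needs OCR; if it returns `False`, the pdftotext output is likely usable as it is.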
First off, OCR only works effectively with high-resolution image files, so you need to convert the PDF to TIFF format.
Note: there are also plenty of packages for this available on npm.
Export to TIFF¶
Open the PDF in Preview, and use File > Export to save as a TIFF file. This can take a long time and produce a file that is dozens of GBs in size.
You can also do this programmatically with ImageMagick, which needs Ghostscript installed to read PDFs:
convert -density 300 /path/to/script.pdf -depth 8 -strip -background white -alpha off script.tiff
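To see why the TIFF gets so large, it helps to estimate the raw pixel data. The sketch below is back-of-the-envelope arithmetic only: it assumes US Letter pages, uncompressed RGB, and 8 bits per channel (matching `-density 300` and `-depth 8`), and it ignores TIFF headers and any compression. The function name is a hypothetical helper.

```python
def uncompressed_tiff_bytes(pages: int, dpi: int = 300,
                            width_in: float = 8.5, height_in: float = 11.0,
                            channels: int = 3, bits_per_channel: int = 8) -> int:
    """Estimate the raw pixel data size of a multi-page TIFF in bytes."""
    width_px = round(width_in * dpi)     # 8.5 in at 300 DPI -> 2550 px
    height_px = round(height_in * dpi)   # 11 in at 300 DPI  -> 3300 px
    bytes_per_page = width_px * height_px * channels * bits_per_channel // 8
    return pages * bytes_per_page


# One page at 300 DPI is about 25 MB; a 120-page script is about 3 GB.
print(uncompressed_tiff_bytes(1))               # 25245000
print(uncompressed_tiff_bytes(120))             # 3029400000
# Doubling the resolution quadruples the size: about 12 GB at 600 DPI.
print(uncompressed_tiff_bytes(120, dpi=600))    # 12117600000
```

Size grows with the square of the resolution, so picking the lowest density that still gives acceptable OCR results keeps the file manageable.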
Tesseract (https://github.com/tesseract-ocr/tesseract) is an open-source OCR engine that can be installed on OS X and/or Linux.
On OS X:
brew install tesseract
On Linux:
apt-get install tesseract-ocr # Debian
Perform OCR on your TIFF file¶
OCR isn’t perfect, so the output will need manual correction. But it’s better than typing the whole thing out by hand.
tesseract script.tiff script # writes script.txt; Tesseract adds the .txt extension itself
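The OCR step can also be scripted. Here is a minimal Python sketch, assuming the tesseract binary is on the PATH and the English language data is installed; `build_tesseract_cmd` and `ocr_to_text` are hypothetical helper names. Tesseract's second argument is an output base name to which it appends `.txt`, so the helper strips that extension from the target path before building the command.

```python
import subprocess
from pathlib import Path
from typing import List


def output_base_for(txt_path: str) -> str:
    """Return the output base Tesseract needs in order to write txt_path."""
    out = Path(txt_path)
    # Tesseract appends ".txt" itself, so remove it if the caller included it.
    return str(out.with_suffix("")) if out.suffix == ".txt" else str(out)


def build_tesseract_cmd(image_path: str, txt_path: str,
                        lang: str = "eng") -> List[str]:
    """Return the tesseract argument list for one OCR run."""
    return ["tesseract", image_path, output_base_for(txt_path), "-l", lang]


def ocr_to_text(image_path: str, txt_path: str, lang: str = "eng") -> str:
    """Run Tesseract on an image and return the recognised text."""
    subprocess.run(build_tesseract_cmd(image_path, txt_path, lang), check=True)
    written = Path(output_base_for(txt_path) + ".txt")
    return written.read_text(encoding="utf-8", errors="replace")
```

For example, `ocr_to_text("script.tiff", "script.txt")` runs `tesseract script.tiff script -l eng` and reads back `script.txt`.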