I like reading things via the Kindle app on my phone, because then I can read from anywhere. Unfortunately, most of what I want to read is in PDF format, so the text can’t “reflow” on my phone’s small screen like a normal ebook does. PDF text extraction programs aim to solve this problem by extracting the text (and in some cases, other elements) from a PDF and exporting it to a format that allows text reflow, for example .docx or .epub.
Which PDF text extraction program is best? I couldn’t find any credible comparisons, so I decided to do my own.
My criteria were:
- The program must run on Mac OS X or run in the cloud.
- It must be free or have a free trial available, so I can run this test without spending hundreds of dollars.
- It must be easy to use. If I have to install special packages or tweak environment variables to run the program, it doesn’t qualify.
- It must preserve images, tables, and equations amidst the text, since the documents I want to read often include important charts, tables, and equations. (It’s fine if equations and tables are simply handled as images.)
- It must be able to handle multi-column pages.
- It must work with English, but I don’t care about other languages because I can’t read them anyway.
- I don’t care that much about final file size or how long the conversion takes, so long as the program doesn’t crash on 1 out of every 10 attempts and doesn’t create crazy 200 MB files or something like that.
To run my test, I assembled a gauntlet of 16 PDFs of the sort I often read, including several PDFs from journal websites, a paper from arXiv, and multiple scanned-and-OCRed academic book chapters.
A quick search turned up way too many Mac or cloud-based programs to test, so I decided to focus on a few that were from major companies or were particularly easy to use.