I like reading things via the Kindle app on my phone, because then I can read from anywhere. Unfortunately, most of what I want to read is in PDF format, so the text can’t “reflow” on my phone’s small screen like a normal ebook does. PDF text extraction programs aim to solve this problem by extracting the text (and in some cases, other elements) from a PDF and exporting it to a format that allows text reflow, for example .docx or .epub.
Which PDF text extraction program is best? I couldn’t find any credible comparisons, so I decided to do my own.
My criteria were:
- The program must run on Mac OS X or run in the cloud.
- It must be free or have a free trial available, so I can run this test without spending hundreds of dollars.
- It must be easy to use. If I have to install special packages or tweak environment variables to run the program, it doesn’t qualify.
- It must preserve images, tables, and equations amidst the text, since the documents I want to read often include important charts, tables, and equations. (It’s fine if equations and tables are simply handled as images.)
- It must be able to handle multi-column pages.
- It must work with English, but I don’t care about other languages because I can’t read them anyway.
- I don’t care that much about final file size or how long the conversion takes, so long as the program doesn’t crash on 1 out of every 10 attempts and doesn’t create crazy 200 MB files or something like that.
To run my test, I assembled a gauntlet of 16 PDFs of the sort I often read, including several PDFs from journal websites, a paper from arXiv, and multiple scanned-and-OCRed academic book chapters.
A quick search turned up way too many Mac or cloud-based programs to test, so I decided to focus on a few that were from major companies or were particularly easy to use. 1
Below are the results of my tests.
Send to Kindle
First, what if I just send the PDFs to my phone’s Kindle app using the Mac desktop version of Send to Kindle, with the option to ‘convert PDFs to Kindle format’ checked?
The files that arrived on my phone’s Kindle app contained reflowing rich text. Equations were almost always mangled beyond recognition. In several files, images of each entire page from the original PDF were interspersed with the reflowing text from those pages, and images were lost entirely (except as they appeared in the images of entire pages). In most files, text changed to red or bold for a bit and then went back to normal, for no apparent reason.
In a few files, images were preserved, whereas in most others, images were converted to long strings of numbers or just lost. In one file, tables were preserved as images, while in all others they were converted to text and mangled. In some files, headers and footers were (happily) removed, while in others they weren’t (and thus showed up in the middle of body text paragraphs).
In a file with a box-lines-and-words diagram, the diagram was converted to text and mangled. In another file, the large image on the first page was spliced up into dozens of image fragments which took up ~20 screens at the beginning of the file on my phone’s Kindle app. A file generated from a PDF with three columns sometimes had the paragraph ordering mixed up.
Two of my test PDFs weren’t scans, had a single column of text, and had no images, tables, or equations. Those files converted more-or-less just fine.
My impression was that about 60% of the text was accurate.
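For anyone who wants a number rather than an impression, here is a minimal Python sketch of one way to estimate it. It assumes you have a trusted plain-text copy of the document to compare against (I did not have one for this test), and the function names `normalize` and `text_accuracy` are made up for illustration. The similarity ratio it returns is only a rough proxy for "percent of text accurate."

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens so layout differences don't count as errors."""
    return re.findall(r"[a-z0-9]+", text.lower())


def text_accuracy(reference: str, converted: str) -> float:
    """Return a 0-1 similarity ratio between the word sequences of two texts."""
    # autojunk=False stops difflib from ignoring very common words in long documents.
    matcher = difflib.SequenceMatcher(
        None, normalize(reference), normalize(converted), autojunk=False
    )
    return matcher.ratio()
```

Comparing word sequences rather than raw characters means that changed line breaks or spacing are not penalized, but dropped or scrambled paragraphs are.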
Adobe Acrobat XI
Method: Click File -> Save as Other -> Microsoft Word -> Microsoft Word (using default settings).
This produced a document laid out exactly like the source PDF, with images and text formatting preserved, and with text reproduced as text, including in tables (usually). Once, a paper’s abstract was missing, but the rest was fine. Equations were usually mangled beyond recognition. Tables turned on their side were captured as images rather than text. One image that was just boxes, lines, and text was inconsistently converted to a mix of images and text, and some of the lines were missing. In a 3-column paper, several images were mostly missing and, on some pages, an entire column of text was missing.
My impression was that about 70% of the text was accurate.
And what happened when I sent these Word documents to my Kindle app?
Equations remained mangled, of course. In multiple files, text switched back and forth between left-aligned, right-aligned, and justified. In many files, diagrams and tables were chopped up into little bits of images, text, numbers, and lines.
In some files, some images and tables were captured as images, while other tables (with identical formatting) were converted to text and mangled.
The simplest files seemed ok, but the formatting was worse than when using Send to Kindle directly.
RootRise PDF to Text
Method: Drag and drop PDFs into main window, click Convert All.
This produced a single-column rich text document with lots of extra line breaks in the middle of paragraphs, no images, and no tables. I didn’t bother sending these files to my Kindle app to see what they looked like on my phone.
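Extra line breaks of this kind can sometimes be repaired after the fact. The following is a minimal Python sketch, not something I ran as part of this test. It assumes the output has first been saved as plain text, that a blank line marks a real paragraph break, and that a single line break is spurious. The function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join lines that were broken mid-paragraph; keep blank-line paragraph breaks."""
    # Assumption: one or more blank lines mark a genuine paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Rejoin words hyphenated across a line break, e.g. "extrac-\ntion".
        # Caveat: this also fuses real compounds such as "multi-\ncolumn".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse remaining line breaks and runs of whitespace into single spaces.
        joined.append(re.sub(r"\s+", " ", para).strip())
    return "\n\n".join(joined)
```

This does nothing for the missing images and tables, and it will merge paragraphs if the converter did not leave blank lines between them.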
Microsoft Word Online
Method: Upload PDF to OneDrive. Open that PDF from Word Online, click Edit in Word, click Convert, wait a bit, then click Edit. Then click File -> Save As -> Download a Copy.
Microsoft Word Online apparently has a size limit but I couldn’t find out what it was. When converting one of my smaller PDFs (480 KB), it said “Sorry, Word Online can’t open this document because it’s too big.” Oddly, it didn’t give me errors when converting the largest PDFs I tested (~7.5 MB).
Anyway, as with Adobe Acrobat, this method produced a document laid out like the source PDF, with images and text formatting preserved, and with text reproduced as text, including in tables (usually). Compared to Adobe Acrobat, however, Word produced files with layouts somewhat less faithful to the original PDF, and many images were lost.
Word didn’t handle the 3-column document well at all; the whole thing was mangled. Nor did it handle the scanned-and-OCRed book chapters well — one was mangled, the other just contained images of the pages rather than text. One other file was also just images of the pages rather than text.
Since Word Online’s output was strictly worse than Adobe Acrobat’s, I didn’t bother sending it to the Kindle app on my phone.
Google Drive
Method: Upload PDF to Google Drive, right-click the PDF and choose Open With -> Google Docs. Then click File -> Download As -> Microsoft Word.
This method produced documents that alternated between an image of the original page and Google’s attempts to represent the text from that page as text, but the text was typically more mangled than the text produced by either Send to Kindle or Adobe Acrobat. Images and tables were lost, except for appearing in the images of entire pages.
I didn’t bother sending these files to the Kindle app on my phone.
Zamzar
Method: Select PDFs for conversion, choose .docx or .mobi output format, provide email address, click Convert.
Zamzar failed to convert 1 of my 16 PDFs; I don’t know why. It was one of the simpler-looking PDFs.
Zamzar’s .docx files were similar to those produced by Word Online and Adobe Acrobat, but generally more mangled than those produced by Adobe Acrobat. Zamzar also failed badly on the 3-column paper, and on the scanned-and-OCRed book chapters. In several files, even fairly simple ones, two layers of text were pasted on top of each other.
I didn’t bother sending the .docx files to the Kindle app on my phone.
Zamzar’s .mobi files were a total shitshow; I won’t bother to say more.
ConvertPDFtoWord.net
Method: Leave the default format option, click ‘Upload a PDF file to convert,’ wait a bit, click Download.
This produced Word files similar to those produced by Adobe Acrobat. The 3-column file so mangled by Adobe Acrobat was actually reproduced more accurately by this service. Unlike Adobe Acrobat, it converted a table turned on its side to text, but then rotated the text to be horizontal, making the table unreadable. Overall, page layout was less accurately reproduced than with Adobe Acrobat, but ConvertPDFtoWord.net committed fewer gross errors than Adobe Acrobat had, except for the important fact that several files had two layers of overlapping (and thus unreadable) text.
I was curious to see whether the two layers of overlapping text would go away if I sent these files to the Kindle app on my phone. The overlapping text did go away, but the formatting was pretty messed up — worse than the output of converting the PDF directly via Send to Kindle.
Method: Click Choose File, choose file, click Convert.
Most of the files produced this way were totally broken, sometimes even mostly empty.
Method: Drag and drop PDF into interface, choose output format (.docx or .epub), and click Convert.
This produced .docx files kind of like Adobe Acrobat’s files but worse, and cut off after a seemingly random number of pages — often 5, but sometimes more. Maybe this was a limitation of the free trial version; the website doesn’t say.
The .epub files weren’t cut off before the end, but they were hopelessly mangled anyway.
The results of this test are pretty boring. They are:
- If you want to convert a PDF to a text-reflowing format that preserves images, tables, or equations… too bad. None of the programs I tested came remotely close to doing this well.
- If you have a PDF without images, tables, or equations, and with only one or two columns of text, just use the Send to Kindle program and it’s got a decent chance of converting pretty cleanly.
- If you want to convert a PDF to Word so you can edit it, Adobe Acrobat seems like the overall best choice.
1. Microsoft Word 2016 for Mac doesn’t allow PDF to Word conversion like its Windows counterpart does; that’s why I tried Word Online instead. Nuance’s PDF programs seem to be either not available for Mac, or not have a free trial. Nitro PDF to Word Converter is only for Windows, and the online version doesn’t support files larger than 5MB, so I didn’t include it. Lighten PDF to Word Converter has a free trial, but it only converts 3 pages per PDF, so I excluded it. Same problem for FreePDFconvert.com and iSkySoft PDF Converter, which have 2-page and 5-page limits for their trial versions, respectively. As far as I can tell, Qoppa PDF Studio Pro can extract plain text but not rich text from PDFs. I used Adobe Acrobat XI (instead of a newer version) because I already owned it.
If you relax the condition of the output being reflowable text, another possibility is k2pdfopt (http://www.willus.com/k2pdfopt/), which converts pdfs into pdfs with a different page size. Unfortunately:
a) its output is inconsistent: it deals surprisingly well with equations, but it seems to hate the vertical label on the first page of arXiv papers, and the OCR is rather poor.
b) it’s slow
c) if one wants to be secure, the download page does not have a valid HTTPS certificate (the certificate is for the hosting company)…
In the case of arXiv papers, and elsewhere when one has access to the LaTeX source, two further possibilities are to:
a) compile the LaTeX directly to XHTML* (see for example http://arxmliv.kwarc.info/ where they tried to convert all of arXiv to XHTML; the success rate is not very encouraging, but in the successful cases the output is quite pretty, at least in MathML-compliant browsers (Firefox and to some extent Safari)**) or
b) specify a different page geometry/number of columns for the PDF output (I tried this a year ago and the results were not too bad, the main stumbling blocks being atypical LaTeX geometry packages and excessively wide tables/equations that strayed into the margins; as I was doing this as an experiment I didn’t try to solve these issues properly, so they might be surmountable).
* which can then be trivially converted to epub
** though this could be circumvented by using MathJax etc. (however, I don’t know how many epub readers, as opposed to browsers, would actually support this)
I hope at least some of this might be helpful to you or other people reading your blog.
P.S. Congratulations for having the second-best designed website (with the best obviously being http://practicaltypography.com/ :p )
Thanks for these notes. They don’t really accomplish what I was hoping for, but they may be of use to me in the future or to readers of this blog post.
The vertical label on ArXiv papers is trivial to deal with in k2pdfopt. Just crop out the left-most 0.9 inches of each page:
(or whatever value works to get rid of the label)
The OCR engine provided by k2pdfopt is Tesseract, but OCR was not part of this test.
Have you considered trying Calibre? I think it has an OS X version. It’s worked very well for me, although I generally convert simpler documents, so no guarantees. It’s got a bewildering array of options to customize output — tabs upon tabs! — so maybe they thought of this sort of thing.
Not Will says
Hey, listen to this guy and his excellent solution to your text-flow problems.