Coder.Haus | Installing pdftotext through pip on Windows 10
Learn how to set up pdftotext on Win10 for extracting text from PDFs. We use the pip installed version, conda for installing poppler, and pull this all together in Python on Windows 10.

Installing pdftotext through pip on Windows 10

Edit: 10/7/2019 – Submitted PR to fix this.
https://github.com/jalan/pdftotext/pull/47

Edit: 10/1/2019 – A fix for this issue, if you’re inclined to clone the GitHub repo, is to add the code below to the setup.py file, in the if block that checks the platform.

This assumes that Anaconda and Build Tools for Visual Studio 2019 are installed, and that conda install poppler has been run.

Read on past the snippet for the full walkthrough.

# this goes at the top of setup.py
import os

# this goes with the rest of the platform.system() checks
elif platform.system() in ['Windows']:
    # CONDA_PREFIX is set by conda while an environment is active
    conda_dir = os.getenv('CONDA_PREFIX')
    # pass each path component separately to avoid backslash escapes
    anaconda_poppler_include_dir = os.path.join(conda_dir, 'Library', 'include')
    anaconda_poppler_library_dir = os.path.join(conda_dir, 'Library', 'lib')
    include_dirs = [anaconda_poppler_include_dir]
    library_dirs = [anaconda_poppler_library_dir]
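
Before patching setup.py, it can help to confirm that the conda poppler package actually put the files where the snippet expects them. The following is a minimal sketch, not part of the pdftotext project: the function names are made up for illustration, and the two file locations are the ones this post runs into (the poppler-document.h header and poppler-cpp.lib under the Library folder of the active environment).

```python
import os


def poppler_build_dirs(conda_prefix):
    """Return the (include_dir, library_dir) pair the setup.py snippet points at."""
    include_dir = os.path.join(conda_prefix, "Library", "include")
    library_dir = os.path.join(conda_prefix, "Library", "lib")
    return include_dir, library_dir


def missing_poppler_files(conda_prefix):
    """List the poppler files the build asks for that are not on disk yet."""
    include_dir, library_dir = poppler_build_dirs(conda_prefix)
    expected = [
        os.path.join(include_dir, "poppler", "cpp", "poppler-document.h"),
        os.path.join(library_dir, "poppler-cpp.lib"),
    ]
    return [path for path in expected if not os.path.isfile(path)]


if __name__ == "__main__":
    # CONDA_PREFIX is only set while a conda environment is active.
    prefix = os.getenv("CONDA_PREFIX")
    if prefix is None:
        print("CONDA_PREFIX is not set; activate a conda environment first.")
    else:
        for path in missing_poppler_files(prefix):
            print("missing:", path)
```

If nothing is printed inside an active environment, both files are in place and the patched setup.py should be able to find them.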

In this post we’ll explore installing the pdftotext library for Python using Anaconda Python on Windows 10.

Install Anaconda Python. We won’t cover how to do that here, as there are many articles on installing Anaconda.

First, try running

pip install pdftotext

You will get an error saying that Microsoft Visual C++ is required.

Navigate in a browser to http://visualstudio.microsoft.com/downloads. Under the Tools for Visual Studio 2019 tab, download the Build Tools for Visual Studio 2019. Then install the tools by checking the C++ build tools option box and clicking Install.

The pip install should now move past the VC++ error. Unfortunately, you’ll get a new error: “Cannot open include file: ‘poppler/cpp/poppler-document.h’”. This is because you’re missing the poppler libraries.

Head back to the internets! You’ll need poppler for Windows. At the time of this writing, your best option is http://blog.alivate.com.au/poppler-windows. Grab the latest binary and extract it. If you look at the error, pip is looking for the header file at {Anaconda3 directory}\include\poppler\cpp\poppler-document.h. So look in the archive you just extracted. In the include folder, you’ll see a poppler directory. If you go down into the cpp directory in there, you’ll find the poppler-document.h file.

I’ll copy the entire poppler directory and paste it into the Anaconda3\include folder.
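
If you would rather script that copy than drag folders around, here is a rough sketch using the standard library. The function name and the example paths are hypothetical; point them at wherever you extracted the poppler archive and wherever Anaconda lives on your machine. Note that dirs_exist_ok needs Python 3.8 or newer.

```python
import os
import shutil


def copy_poppler_headers(extracted_dir, anaconda_dir):
    """Copy include\\poppler from the extracted archive into Anaconda's include folder.

    Returns the path where poppler-document.h should end up.
    """
    src = os.path.join(extracted_dir, "include", "poppler")
    dst = os.path.join(anaconda_dir, "include", "poppler")
    # dirs_exist_ok (Python 3.8+) lets the copy succeed if the folder is already there.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return os.path.join(dst, "cpp", "poppler-document.h")


# Example call with made-up paths; adjust both to your machine:
# copy_poppler_headers(r"C:\Temp\poppler", r"C:\Users\IEUser\Anaconda3")
```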

Ok, so let’s run pip install again. We’re still getting a ton of errors! But now we’re not getting any of the errors we saw before. Instead, we’re getting an error for a missing link library, poppler-cpp.lib. A search through Conda installs on another machine shows that this file comes from the poppler package.

Looking at the full error, it wants to find the file in {Anaconda3 directory}\libs. So let’s find a copy of this file and move it!

Edit: 11/21 – For some folks, the linker looks for the library in the AppData/Local/Programs/Python/Python{PythonVersion}/libs directory, where {PythonVersion} could be something like 36 or 37, instead of the Anaconda/libs directory. If that’s the case, and you’re still receiving errors after copying the file as described in this section, try copying it into the appropriate Python version directory in the same way. If you look at the error, you should be able to see a /LIBPATH entry pointing to where this file should go, like /LIBPATH:C:\Users\IEUser\Anaconda3\libs. Big thank you to Katia Gil Guzman for help uncovering the possibility of the Python{PythonVersion} directory!
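
The pip output is long, so picking the /LIBPATH entries out by eye gets tedious. As a small sketch, assuming you have pasted or captured the build output as a string, a regular expression can list every directory the linker was told to search. The function name is made up for this post, and the pattern only handles the plain and double-quoted forms of the option.

```python
import re

# Matches /LIBPATH:C:\some\dir as well as /LIBPATH:"C:\dir with spaces".
LIBPATH_PATTERN = re.compile(r'/LIBPATH:(?:"([^"]+)"|(\S+))', re.IGNORECASE)


def find_libpaths(build_output):
    """Return every directory named in a /LIBPATH option, in order of appearance."""
    return [quoted or bare for quoted, bare in LIBPATH_PATTERN.findall(build_output)]


if __name__ == "__main__":
    # Hypothetical fragment of a linker command line, for illustration only.
    sample = r"link.exe /LIBPATH:C:\Users\IEUser\Anaconda3\libs poppler-cpp.lib"
    for directory in find_libpaths(sample):
        print(directory)
```

Any directory it prints is a place where dropping poppler-cpp.lib should satisfy the linker.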

Running

conda install -c conda-forge poppler

will install our poppler-cpp.lib file. Then we can copy the file from its home at {Anaconda3 directory}\Library\lib\poppler-cpp.lib and paste it where pdftotext is expecting it, at {Anaconda3 directory}\libs.
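
The same copy can be scripted. This is only a sketch under the assumptions of this post: the library sits in Library\lib under the Anaconda directory, and the build looks in the libs folder next to it. The function name is hypothetical, and the optional libs_dir argument is there for the case where your error points at a Python{PythonVersion}\libs directory instead.

```python
import os
import shutil


def copy_poppler_lib(anaconda_dir, libs_dir=None):
    """Copy poppler-cpp.lib from Library\\lib into the libs folder the linker searches.

    Returns the path of the copied file.
    """
    src = os.path.join(anaconda_dir, "Library", "lib", "poppler-cpp.lib")
    if libs_dir is None:
        libs_dir = os.path.join(anaconda_dir, "libs")
    if not os.path.isdir(libs_dir):
        # Refuse to guess: copying to a missing folder would create a file named "libs".
        raise FileNotFoundError(libs_dir)
    return shutil.copy2(src, libs_dir)


# Example call with a made-up path; adjust to your Anaconda directory:
# copy_poppler_lib(r"C:\Users\IEUser\Anaconda3")
```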

So let’s give it a whirl and do pip install pdftotext again!

There it is! Now our final test is to write a piece of code to be sure it works.
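
Something along these lines should do as a smoke test. It is a minimal sketch, not the exact script from the original run: the helper names are made up, and you need to supply a PDF of your own. It relies on pdftotext.PDF accepting a file opened in binary mode and on the resulting object being iterable over page strings.

```python
import importlib.util


def pdftotext_installed():
    """Report whether Python can locate the pdftotext module at all."""
    return importlib.util.find_spec("pdftotext") is not None


def extract_text(pdf_path):
    """Return the text of every page in the PDF, separated by blank lines."""
    import pdftotext  # imported here so this file still loads if the install failed

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    # Each item in the PDF object is the text of one page.
    return "\n\n".join(pdf)


if __name__ == "__main__":
    print("pdftotext importable:", pdftotext_installed())
```

If the first line prints True, call extract_text with the path to any PDF you have on hand and check that readable text comes back.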

And there we go! Installing the Python library pdftotext on Win10. I’m sure I’ll be able to refine this a bit, but for now we have a working pdftotext Python library on Win10.
