In this tutorial we will explore how to extract text from PDF files using Python.


Introduction

Extracting text from PDF files is a common task when working with reports and research papers.

It’s a tedious task if you do it manually for every file using the available software and online tools.

In this tutorial we will explore how to extract text from PDF files using Python with a few lines of code.

To continue following this tutorial we will need the following Python library: PyPDF2.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following command:


pip install PyPDF2
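
To confirm that the installation worked before moving on, you can check whether Python is able to find the library. This is only a minimal sketch using the standard library; the helper name is_installed is my own and not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Prints True or False depending on your environment
print("PyPDF2 installed:", is_installed("PyPDF2"))
```

If this prints False, run the pip command again in the same environment you use to run your scripts.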

Sample PDF file

Here is the PDF file we will use in this tutorial:

This PDF file will reside in the same folder as the main.py with our code.
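
Keep in mind that a bare file name like 'sample_file.pdf' is looked up in the current working directory, which is not always the folder that contains main.py. If you run the script from somewhere else, one possible approach is to build the path from the script's own location. The helper name pdf_path_next_to below is hypothetical and only used for illustration.

```python
from pathlib import Path


def pdf_path_next_to(script_path, pdf_name="sample_file.pdf"):
    """Return the path of pdf_name in the same folder as script_path."""
    return Path(script_path).parent / pdf_name
```

Inside main.py you would call it as pdf_path_next_to(__file__) and pass the result to open().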

Here is what my file structure looks like:


Extract text from PDF using Python

Now we have everything we need and can easily extract text from PDF using Python:


# Import the required dependency
# (PdfReader is the reader class in current PyPDF2 releases)
from PyPDF2 import PdfReader

# Define path to PDF file
pdf_file_name = 'sample_file.pdf'

# Open the file in binary mode for reading
with open(pdf_file_name, 'rb') as pdf_file:
    # Read the PDF file
    pdf_reader = PdfReader(pdf_file)
    # Iterate over each page in the PDF file
    for page in pdf_reader.pages:
        # Extract text from the given PDF file page
        text = page.extract_text()
        # Print text
        print(text)

And you should get:

Sample Page 1
Sample Page 2
Sample Page 3
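
If you would rather collect the text than print it page by page, the loop above can be wrapped in a small function. This is a rough sketch, assuming a PyPDF2 release that provides PdfReader, its pages attribute and extract_text(). The function names join_page_texts and extract_pdf_text are my own and not part of the library.

```python
def join_page_texts(pages):
    """Join the text of every page-like object into one string."""
    # "or ''" is a defensive guard in case a page yields no text
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(pdf_path):
    """Open a PDF with PyPDF2 and return the text of all its pages."""
    # Imported here so join_page_texts can be reused without PyPDF2 installed
    from PyPDF2 import PdfReader

    # Open the file in binary mode; pages are read while the file is open
    with open(pdf_path, "rb") as pdf_file:
        return join_page_texts(PdfReader(pdf_file).pages)
```

Calling extract_pdf_text('sample_file.pdf') returns a single string that you can search, process further, or write to a .txt file.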

Conclusion

In this article we explored how to extract text from PDF files using Python and PyPDF2.

Feel free to leave a comment below if you have any questions or suggestions for edits, and check out more of my Python for PDF tutorials.