In this tutorial we will explore how to extract text from PDF files using Python.
Introduction
Extracting text from PDF files is a common task when working with reports and research papers.
It’s tedious to do manually for every file using the available software and online tools.
In this tutorial we will explore how to extract text from PDF files using Python with a few lines of code.
To follow this tutorial we will need the following Python library: PyPDF2.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following command:
pip install PyPDF2
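To confirm the installation worked before moving on, a quick check along these lines can help. This is only a sketch: `is_installed` is a hypothetical helper name, and it relies on the standard library's `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec locates the module without executing its code
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; run: pip install PyPDF2")
```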
Sample PDF file
The PDF file we will use in this tutorial is a short sample document named sample_file.pdf.
This PDF file will reside in the same folder as the main.py file with our code.
My file structure is simply main.py and sample_file.pdf side by side in one folder.
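Since the code expects the PDF to sit next to main.py, it may be worth checking that the file is really there before trying to read it. The sketch below uses only `pathlib` from the standard library. `find_pdf` is a hypothetical helper, and the default file name is assumed to match the one used later in this tutorial.

```python
from pathlib import Path


def find_pdf(folder: Path, file_name: str = "sample_file.pdf") -> Path:
    """Return the path to file_name inside folder, or raise if it is absent."""
    pdf_path = folder / file_name
    if not pdf_path.is_file():
        # Fail early with a message that names the expected location
        raise FileNotFoundError(f"Expected {file_name} in {folder}")
    return pdf_path
```

Inside main.py, this could be called as `find_pdf(Path(__file__).resolve().parent)` so the lookup does not depend on the folder the script is launched from.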
Extract text from PDF using Python
Now we have everything we need and can easily extract text from PDF using Python:
# Import the required dependency
# (PyPDF2 3.x names; older releases used PdfFileReader, numPages, getPage and extractText)
from PyPDF2 import PdfReader

# Define path to PDF file
pdf_file_name = 'sample_file.pdf'

# Open the file in binary mode for reading
with open(pdf_file_name, 'rb') as pdf_file:
    # Read the PDF file
    pdf_reader = PdfReader(pdf_file)

    # Get number of pages in the PDF file
    page_nums = len(pdf_reader.pages)

    # Iterate over each page number
    for page_num in range(page_nums):
        # Read the given PDF file page
        page = pdf_reader.pages[page_num]

        # Extract text from the given PDF file page
        text = page.extract_text()

        # Print text
        print(text)
And you should get:
Sample Page 1
Sample Page 2
Sample Page 3
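Printing page by page is fine for a quick look, but often the text is needed as a single string or saved to a file. Here is a minimal sketch of that idea. `join_page_text` and `save_pdf_text` are hypothetical helpers, and the second one assumes the `PdfReader` class and `extract_text` method found in recent PyPDF2 releases.

```python
from typing import Iterable, Protocol


class HasExtractText(Protocol):
    """Anything with an extract_text() method, such as a PyPDF2 page."""

    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[HasExtractText], separator: str = "\n") -> str:
    """Join the text of every page into one string (hypothetical helper)."""
    # A page without a text layer may yield an empty value, so fall back to ""
    # and strip surrounding whitespace from each page before joining.
    return separator.join((page.extract_text() or "").strip() for page in pages)


def save_pdf_text(pdf_path: str, txt_path: str) -> None:
    """Write all text from pdf_path into txt_path (hypothetical helper)."""
    # Imported here so join_page_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        text = join_page_text(PdfReader(pdf_file).pages)
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
```

With the sample file from this tutorial, `save_pdf_text('sample_file.pdf', 'sample_file.txt')` would write the extracted text to a text file in the same folder.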
Conclusion
In this article we explored how to extract text from PDF files using Python and PyPDF2.
Feel free to leave comments below if you have questions or suggestions for edits, and check out more of my Python for PDF tutorials.