In this tutorial we will explore how to extract metadata from a PDF file using Python.
Introduction
PDF metadata is information about the PDF document itself, such as its title, author, and creation date. These are searchable fields of each PDF document and can be retrieved programmatically.
To continue following this tutorial we will need the following Python library: pikepdf.
If you don’t have it installed, open “Command Prompt” (on Windows) and install it using the following command:
pip install pikepdf
Sample PDF
To follow along with this tutorial we will need a PDF file to work with.
Let’s reuse one of the PDFs we created in one of our previous tutorials, saved as webpage.pdf.
Extract metadata from PDF using Python
In order to extract metadata from a PDF using Python, we will follow three simple steps:
- Open PDF using pikepdf
- Extract metadata from PDF
- Print out metadata
And now we can extract the metadata from the PDF using the following code:
import pikepdf

# Open PDF with pikepdf
pdf = pikepdf.Pdf.open('webpage.pdf')

# Extract metadata from PDF
pdf_info = pdf.docinfo

# Print out the metadata
for key, value in pdf_info.items():
    print(key, ':', value)
You should get:
/CreationDate : D:20220624153735-04'00'
/Creator : wkhtmltopdf 0.12.6
/Producer : Qt 4.8.7
/Title : wkhtmltopdf
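The keys in this output carry a leading slash, and the creation date is stored in the PDF date format rather than as a Python date. If you want to work with these values further, the following is a minimal sketch of one way to tidy them up using only the standard library. The helper names tidy_docinfo and parse_pdf_date are hypothetical (they are not part of pikepdf), and the sample dictionary simply repeats the output shown above so the snippet runs without a PDF file. With a real file you could try passing pdf.docinfo to tidy_docinfo instead.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for PDF date strings such as D:20220624153735-04'00'
# Everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?"
    r"(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?"
    r"(?P<minute>\d{2})?"
    r"(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?$"
)


def tidy_docinfo(docinfo):
    """Hypothetical helper: plain dict with the leading '/' stripped from keys."""
    return {str(key).lstrip("/"): str(value) for key, value in docinfo.items()}


def parse_pdf_date(value):
    """Hypothetical helper: convert a PDF date string into a datetime."""
    match = _PDF_DATE.match(str(value).strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    sign = parts["sign"]
    if sign is None:
        tzinfo = None  # no offset given, so the result is a naive datetime
    elif sign == "Z":
        tzinfo = timezone.utc
    else:
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tzinfo = timezone(-offset if sign == "-" else offset)

    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tzinfo,
    )


# Sample values copied from the output above (stand-in for pdf.docinfo)
sample = {
    "/CreationDate": "D:20220624153735-04'00'",
    "/Creator": "wkhtmltopdf 0.12.6",
    "/Producer": "Qt 4.8.7",
    "/Title": "wkhtmltopdf",
}

info = tidy_docinfo(sample)
created = parse_pdf_date(info["CreationDate"])
print(created.isoformat())  # 2022-06-24T15:37:35-04:00
```

Converting the date to a datetime object makes it easier to sort, compare, or reformat creation dates across several PDF files.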
Conclusion
In this article we explored how to extract metadata from a PDF using Python and pikepdf.
Feel free to leave comments below if you have any questions or suggestions for edits, and check out more of my Python for PDF tutorials.