In this article we will explore how to convert PDF files into Microsoft Word docx format and vice versa using Python.
Table of contents
- Introduction
- Sample files
- How to convert PDF files to docx format
- How to convert docx files to PDF format
- Conclusion
Introduction
In one of our tutorials explaining how to work with PDF files in Python, and specifically how to extract tables from PDF files, we focused on PDF files with tables. However, in a lot of instances, we don’t want to only export certain parts of the PDF, rather than convert the whole PDF file to docx to allow for editing. In this article we will see how to easily and efficiently perform this conversion.
To continue following this tutorial we will need the following Python libraries: pdf2docx and docx2pdf.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:
pip install pdf2docx
pip install docx2pdf
Sample files
In order to follow the examples shown below, you will need to have a PDF file to make the conversion to docx format.
The file I will be using for this article is here. And you can also download it from below:
Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:
For the docx to PDF conversion, you will need a sample docx file which you can download below:
Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:
How to convert PDF files to docx format using Python
Using pdf2docx library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.
Convert all pages from PDF file to docx format using Python
Method 1:
First step is to import the required dependencies:
from pdf2docx import Converter
Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):
pdf_file = 'sample.pdf'
docx_file = 'sample.docx'
Third step is to convert PDF file to .docx:
cv = Converter(pdf_file)
cv.convert(docx_file)
cv.close()
And you should see the sample.docx created in the same directory.
Method 2:
First step is to import the required dependencies:
from pdf2docx import parse
Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):
pdf_file = 'sample.pdf'
docx_file = 'sample.docx'
Third step is to convert PDF file to .docx:
parse(pdf_file, docx_file)
And you should see the sample.docx created in the same directory.
Convert a single page from PDF file to docx format using Python
First step is to import the required dependencies:
from pdf2docx import Converter
Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):
pdf_file = 'sample.pdf'
docx_file = 'sample.docx'
Third step is to define a list with the numbers of pages you would like to be converted. In our case, the PDF file has 2 pages and let’s say we would like to print the first one. We would define a pages_list and assign the index number of the page (first page is index 0).
pages_list = [0]
Fourth step is to convert PDF file specific pages to .docx:
cv = Converter(pdf_file)
cv.convert(docx_file, pages=pages_list)
cv.close()
How to convert docx files to PDF format using Python
Using docx2pdf library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.
First step is to import the required dependencies:
from docx2pdf import convert
Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):
docx_file = 'input.docx'
pdf_file = 'output.pdf'
Third step is to convert the .docx file to PDF:
convert(docx_file, pdf_file)
And you should see the output.pdf created in the same directory.
Conclusion
In this article we explored how to convert PDF files into Microsoft Word docx format and vice versa using Python.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python for PDF tutorials.