PDF is a widely-used document format for
digital publications. Python, on the other hand, is a versatile programming language with a vast range
of applications in today's digital world. When used together, Python can become an efficient tool in manipulating and extracting information from PDF documents. In this article, we will explore the different ways Python can be used for
PDF processing, and how it can help us improve our productivity and efficiency.
Python PDF Libraries
To work with PDF files in Python, there are various libraries available. Some of the popular libraries to use Python with PDF are PyPDF2, reportlab, and fpdf.
Reading PDF with Python
To read a PDF file
, you can use the PyPDF2 library. Here's an example:
import json
import PyPDF2
# Open the PDF file
pdf_file = open('example.pdf', 'rb')
# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
# Get the number of pages in the PDF file
num_pages = pdf_reader.numPages
# Loop through all the pages and extract the text
for page in range(num_pages):
page_obj = pdf_reader.getPage(page)
print(page_obj.extractText())
# Close the PDF file
pdf_file.close()
Generating PDF with Python
To generate new PDF files from scratch, you can use the reportlab or fpdf
library. Here's an example using reportlab
:
from reportlab.pdfgen import canvas
# Create a new PDF file
pdf_file = canvas.Canvas('example.pdf')
# Add text to the PDF
pdf_file.drawString(100, 750, "Hello World")
# Save and close the PDF file
pdf_file.save()
Similarly, you can use fpdf
library to create PDF.
Editing PDF with Python
To edit existing PDF files, you can use PyPDF2 library. Here's an example to rotate the pages in a PDF file:
import PyPDF2
# Open the PDF file
pdf_file = open('example.pdf', 'rb')
# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
# Create a PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()
# Rotate the pages and add them to the PDF writer
for page in range(pdf_reader.numPages):
page_obj = pdf_reader.getPage(page)
page_obj.rotateClockwise(90)
pdf_writer.addPage(page_obj)
# Save the rotated PDF file
with open('example_rotated.pdf', 'wb') as pdf_output:
pdf_writer.write(pdf_output)
# Close the PDF files
pdf_file.close()
pdf_output.close()
In summary, Python provides multiple libraries to work with PDF files, enabling you to read, generate, and edit PDFs programmatically.
How to Extract Text from PDF with Python
To extract text
from a PDF with Python, you can use the PyPDF2
or pdfminer libraries. These libraries allow you to parse the PDF and extract the text
content.
Example 1: Using PyPDF2
import PyPDF2
pdf_file = open('file.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
text = ''
for page_num in range(pdf_reader.numPages):
page = pdf_reader.getPage(page_num)
text += page.extractText()
print(text)
Example 2: Using pdfminer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def pdf_to_text(pdf_path):
manager = PDFResourceManager()
output = StringIO()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
with open(pdf_path, 'rb') as file:
for page in PDFPage.get_pages(file, check_extractable=True):
interpreter.process_page(page)
text = output.getvalue()
return text
Both of these methods will allow you to extract text
content from a PDF with Python.
How to Combine PDF Pages
Merging multiple PDF files into a single document is a common task in document processing. The PyPDF2
library in Python makes it easy to merge multiple PDF files into a single document.
Merge Two PDF Pages Using PyPDF2
import PyPDF2
# Open the first PDF file
pdf1 = PyPDF2.PdfFileReader(open('file1.pdf', 'rb'))
# Open the second PDF file
pdf2 = PyPDF2.PdfFileReader(open('file2.pdf', 'rb'))
# Merge the two PDF files
output = PyPDF2.PdfFileWriter()
output.addPage(pdf1.getPage(0))
output.addPage(pdf2.getPage(0))
# Save the merged PDF file
with open('merged.pdf', 'wb') as f:
output.write(f)
Merge entire PDF files Using PyPDF2
from PyPDF2 import PdfFileMerger
pdfs = ['file1.pdf', 'file2.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
merger.append(open(pdf, 'rb'))
with open('merged_pdf.pdf', 'wb') as f:
merger.write(f)
Using the above code examples, you can merge multiple PDF pages or entire PDF files in Python using the PyPDF2
library. By combining PDF files, you can easily create a single document that is easier to manage and distribute.
How to Remove Watermark from PDF
Removing watermark from PDF files in Python is easy and can be done using a number of libraries. Here are some solutions to remove watermarks using PyPDF2
and PyMuPDF
libraries.
# Solution 1
import PyPDF2
# Open the PDF file
pdf = open('filename.pdf', 'rb')
# Create a PDFReader object
pdf_reader = PyPDF2.PdfReader()
# Create a PDFWriter object
pdf_writer = PyPDF2.PdfWriter()
# Iterate over the pages in the PDF file
for page in pdf_reader:
# Remove the watermark
page.mergePage(None)
# Add the page to the PDFWriter object
pdf_writer.addPage(page)
# Save the PDF with the watermark removed
with open('filename_nw.pdf', 'wb') as f:
pdf_writer.write(f)
import fitz
# Solution 2
# Open the PDF file
pdf = fitz.open('filename.pdf')
# Iterate over the pages in the PDF file
for page in pdf:
# Get the annotations on the page
annotations = page.annots()
# Iterate through the annotations
for annotation in annotations:
# Check if the annotation is a watermark
if annotation.type[0] == 8:
# Remove the annotation
page.deleteAnnot(annotation)
# Save the PDF with the watermark removed
pdf.save('filename_nw.pdf')
With these simple solutions, you can easily remove watermarks from PDF files using Python and the PyPDF2
and PyMuPDF libraries.
How to convert HTML to PDF
Converting HTML
to PDF is a common task in web development. Fortunately, Python provides several libraries to accomplish this task effortlessly. Here are two examples of how to convert HTML
to PDF using popular Python libraries:
Using the pdfkit
library
import pdfkit
pdfkit.from_file('path/to/file.html', 'path/to/output.pdf')
Using the weasyprint
library
from weasyprint import HTML
HTML('path/to/file.html').write_pdf('path/to/output.pdf')
Both libraries provide the ability to convert HTML
to PDF with just a few lines of code, making it easy to incorporate into any Python project. Don't forget to install the required libraries using pip before implementing the solution.
Contribute with us!
Do not hesitate to contribute to Python tutorials on GitHub: create a fork, update content and issue a pull request.