PDF Files Handling in Python

PDF is a widely used document format for digital publications, and Python is a versatile language with several libraries for working with it. This article walks through common PDF tasks in Python: reading and extracting text, generating new documents, rotating and merging pages, removing watermarks, and converting HTML to PDF.

Python PDF Libraries

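Several libraries are available for working with PDF files in Python. Popular choices include PyPDF2, reportlab, and fpdf. PyPDF2 is no longer actively developed (its maintainers continue the project as pypdf), and version 3.0 replaced the older camelCase names such as PdfFileReader with names such as PdfReader.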
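Before running the examples, it can help to check which of these libraries are installed. The sketch below uses only the standard library; `LIBRARIES` and `installed_libraries` are illustrative names chosen for this example, not part of any of the packages.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above (PyPDF2 is case-sensitive).
LIBRARIES = ["PyPDF2", "reportlab", "fpdf"]


def installed_libraries(names):
    """Return a dict mapping each import name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in installed_libraries(LIBRARIES).items():
        status = "installed" if available else f"missing (try: pip install {name})"
        print(f"{name}: {status}")
```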

Reading PDF with Python

To read a PDF file, you can use the PyPDF2 library. Here's an example:

import PyPDF2

# Open the PDF file; the with-block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object (PyPDF2 3.x name)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Loop through all the pages and extract the text
    for page in range(num_pages):
        page_obj = pdf_reader.pages[page]
        print(page_obj.extract_text())

Generating PDF with Python

To generate new PDF files from scratch, you can use the reportlab or fpdf library. Here's an example using reportlab:

from reportlab.pdfgen import canvas

# Create a new PDF file
pdf_file = canvas.Canvas('example.pdf')

# Add text to the PDF
pdf_file.drawString(100, 750, "Hello World")

# Save and close the PDF file
pdf_file.save()

Similarly, you can use the fpdf library to create PDF files.
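A minimal fpdf sketch might look like the following. The function name `write_hello_pdf` and the output file name are assumptions made for this example, and fpdf is imported inside the function only so that the snippet loads even where the library is not installed.

```python
def write_hello_pdf(path):
    """Create a one-page PDF containing a single line of text."""
    # Imported here so the sketch can be loaded without fpdf installed.
    from fpdf import FPDF

    pdf = FPDF()                        # defaults: A4 page, portrait, millimetres
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)  # a built-in core font
    pdf.cell(40, 10, "Hello World")     # cell width, cell height, text
    pdf.output(path)                    # write the document to disk
```

Calling `write_hello_pdf("example_fpdf.pdf")` should produce a single-page document, comparable to the reportlab example.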

Editing PDF with Python

To edit existing PDF files, you can use the PyPDF2 library. Here's an example that rotates every page in a PDF file:

import PyPDF2

# Create a PDF writer object
pdf_writer = PyPDF2.PdfWriter()

# Open the PDF file; it must stay open until the output has been written
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Rotate each page 90 degrees clockwise and add it to the PDF writer
    for page_obj in pdf_reader.pages:
        page_obj.rotate(90)
        pdf_writer.add_page(page_obj)

    # Save the rotated PDF file; both with-blocks close their files on exit
    with open('example_rotated.pdf', 'wb') as pdf_output:
        pdf_writer.write(pdf_output)
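PDF page rotation is stored as a multiple of 90 degrees, so it can be worth validating an angle before passing it to the rotation call. The helper below is a small sketch; `normalize_rotation` is a hypothetical name and not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Map an angle in degrees to 0, 90, 180 or 270 (clockwise).

    Raises ValueError if the angle is not a multiple of 90.
    """
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360


print(normalize_rotation(450))  # 90
print(normalize_rotation(-90))  # 270
```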

In summary, Python provides multiple libraries to work with PDF files, enabling you to read, generate, and edit PDFs programmatically.

How to Extract Text from PDF with Python

To extract text from a PDF with Python, you can use the PyPDF2 or pdfminer libraries (the maintained pdfminer fork is installed as pdfminer.six and imported as pdfminer). Both parse the PDF and return its text content.

Example 1: Using PyPDF2

import PyPDF2

text = ''
# The with-block closes the file once all pages have been read
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Append the text of every page to one string
    for page in pdf_reader.pages:
        text += page.extract_text()

print(text)

Example 2: Using pdfminer

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def pdf_to_text(pdf_path):
    manager = PDFResourceManager()
    output = StringIO()
    # The converter writes the text of each processed page into `output`
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(pdf_path, 'rb') as file:
        # check_extractable=True raises an error if the PDF forbids text extraction
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue()
    output.close()
    return text

Both of these methods will allow you to extract text content from a PDF with Python.
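Text extracted from a PDF often contains words split across line breaks and irregular whitespace. The following is one possible clean-up pass using only the standard library; `clean_extracted_text` is a hypothetical helper, and its rules are assumptions that may need tuning per document (for example, it also joins genuine hyphenated compounds that happen to break at a line end).

```python
import re


def clean_extracted_text(text):
    """Tidy raw text extracted from a PDF.

    - rejoins words hyphenated across a line break
    - collapses runs of spaces and tabs into one space
    - reduces three or more newlines to a single blank line
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The function takes the string returned by either extraction example and returns a cleaned copy.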

How to Combine PDF Pages

Merging multiple PDF files into a single document is a common document-processing task. PyPDF2 supports it both page by page and file by file.

Merge Two PDF Pages Using PyPDF2

import PyPDF2

# Open both PDF files; they must stay open until the output is written
with open('file1.pdf', 'rb') as f1, open('file2.pdf', 'rb') as f2:
    pdf1 = PyPDF2.PdfReader(f1)
    pdf2 = PyPDF2.PdfReader(f2)

    # Combine the first page of each file
    output = PyPDF2.PdfWriter()
    output.add_page(pdf1.pages[0])
    output.add_page(pdf2.pages[0])

    # Save the merged PDF file
    with open('merged.pdf', 'wb') as f:
        output.write(f)

Merge Entire PDF Files Using PyPDF2

from PyPDF2 import PdfMerger

pdfs = ['file1.pdf', 'file2.pdf']
merger = PdfMerger()

# append() accepts a file path and adds every page of that file
for pdf in pdfs:
    merger.append(pdf)

with open('merged_pdf.pdf', 'wb') as f:
    merger.write(f)

# Release the input files held open by the merger
merger.close()

Using the above code examples, you can merge multiple PDF pages or entire PDF files in Python using the PyPDF2 library. By combining PDF files, you can easily create a single document that is easier to manage and distribute.
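When the files to merge live in one folder, the list of inputs can be built instead of typed out. This is a sketch under the assumption that sorting by file name gives the desired order; `collect_pdfs` is an illustrative helper, not part of PyPDF2.

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return the PDF files in `directory`, sorted by file name (case-insensitive)."""
    folder = Path(directory)
    pdfs = [p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=lambda p: p.name.lower())
```

Each returned path can then be passed to the merger, for example as `merger.append(str(path))`.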

How to Remove Watermark from PDF

How easily a watermark can be removed depends on how it was added to the file. The two solutions below, using the PyPDF2 and PyMuPDF libraries, handle the case where the watermark is stored as an annotation of type Watermark. A watermark drawn directly into the page content or embedded as an image is not affected by them.

# Solution 1
import PyPDF2
from PyPDF2.generic import ArrayObject, NameObject

# Create a PdfReader object for the input file and a PdfWriter for the output
pdf_reader = PyPDF2.PdfReader('filename.pdf')
pdf_writer = PyPDF2.PdfWriter()

# Iterate over the pages in the PDF file
for page in pdf_reader.pages:
    if '/Annots' in page:
        # Keep every annotation except those whose subtype is /Watermark
        kept = [annot for annot in page['/Annots']
                if annot.get_object().get('/Subtype') != '/Watermark']
        page[NameObject('/Annots')] = ArrayObject(kept)
    # Add the page to the PdfWriter object
    pdf_writer.add_page(page)

# Save the PDF with the watermark annotations removed
with open('filename_nw.pdf', 'wb') as f:
    pdf_writer.write(f)

# Solution 2
import fitz  # PyMuPDF

# Open the PDF file
pdf = fitz.open('filename.pdf')

# Iterate over the pages in the PDF file
for page in pdf:
    # Walk through the annotations on the page
    annotation = page.first_annot
    while annotation:
        # Check if the annotation is a watermark
        if annotation.type[0] == fitz.PDF_ANNOT_WATERMARK:
            # delete_annot() returns the annotation that follows the deleted one
            annotation = page.delete_annot(annotation)
        else:
            annotation = annotation.next

# Save the PDF with the watermark annotations removed
pdf.save('filename_nw.pdf')

With these solutions, you can remove annotation-based watermarks from PDF files using Python and the PyPDF2 and PyMuPDF libraries.

How to convert HTML to PDF

Converting HTML to PDF is a common task in web development, and several Python libraries handle it. Here are two examples using popular libraries:

Using the pdfkit library

import pdfkit

pdfkit.from_file('path/to/file.html', 'path/to/output.pdf')

Using the weasyprint library

from weasyprint import HTML

HTML('path/to/file.html').write_pdf('path/to/output.pdf')

Both libraries convert HTML to PDF in a few lines of code, which makes them easy to incorporate into a Python project. Install them with pip before use; pdfkit additionally requires the wkhtmltopdf command-line tool to be installed on the system.
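The HTML does not have to come from a file. As a sketch, the markup can be built in memory and handed to either library as a string; `build_html` is a hypothetical helper written for this example, and the conversion calls are left as comments so the snippet runs without either library installed.

```python
from html import escape


def build_html(title, paragraphs):
    """Return a minimal HTML document; all text is escaped before insertion."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n{body}\n</body></html>"
    )


html = build_html("Report", ["Totals are < 10 & rising."])

# Either library can then convert the string instead of a file:
# pdfkit.from_string(html, "report.pdf")
# HTML(string=html).write_pdf("report.pdf")
```

Escaping the text first keeps characters such as `<` and `&` from being interpreted as markup in the generated PDF.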

AliaksandrSumich
Python engineer, expert in third-party web services integration.
Evgeniy Melnikov (angarsky), contributor
Updated: 05/03/2024 - 21:52