Extract Annotations from PDF Files in Python Like a Pro

Annotations in PDFs provide valuable context, clarify viewpoints, and often contain crucial information, such as detailed explanations or comments on the content. Extracting them manually is time-consuming, especially with large files. This guide shows how to extract annotations from PDF files with Python, whether you need a specific annotation, all annotations on a page, or every annotation in the document.


Python Library to Extract Annotations from PDFs

To extract annotations from PDF documents in Python, third-party libraries such as Spire.PDF for Python, Aspose.PDF for Python via .NET, and PyMuPDF are commonly used. Among these, Spire.PDF (short for Spire.PDF for Python) stands out for its intuitive API and timely technical support, making it an excellent choice for annotation extraction. This guide uses Spire.PDF for the demonstrations. Other libraries follow broadly similar steps (open the document, select a page, read its annotations), so the approach can be adapted to them.

You can install it with the pip command: pip install Spire.PDF.
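Before running the examples, it can help to confirm that the package is importable. The following is a minimal sketch: the helper name `is_module_available` is hypothetical, and the module name `spire.pdf` is taken from the import statements used in the examples of this guide.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError is raised when a parent package (e.g. "spire") is missing
        return False


if __name__ == "__main__":
    # "spire.pdf" is the module imported by the examples in this guide
    if is_module_available("spire.pdf"):
        print("spire.pdf is available.")
    else:
        print("spire.pdf was not found; install the library with pip first.")
```

If the check fails inside a virtual environment, the package was most likely installed for a different interpreter than the one running the script.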

How to Extract Specified Annotations from PDF Files

Sometimes you need only one or a few specific annotations from a particular page rather than all of them. In such cases, you can use the PdfAnnotationCollection.get_Item() method provided by Spire.PDF. The process takes three steps: navigate to the desired PDF page, retrieve the annotation collection for that page, and extract the specific annotation by index. You can then read its details, including the annotation text and the time it was last modified.

Steps to extract the specified annotation from a PDF page:

  • Create an object of the PdfDocument class and read a PDF document from disk with the PdfDocument.LoadFromFile() method.

  • Get a certain page using the PdfDocument.Pages[] property.

  • Access the collection of annotations on the page with the PdfPageBase.AnnotationsWidget property.

  • Get the specified annotation by calling the PdfAnnotationCollection.get_Item() method.

  • Append the annotation details to a list.

  • Save the list as a text file.

The following code extracts the first annotation on the first page of a PDF:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load the file from disk
pdf.LoadFromFile("AI-Generated Art.pdf")

# Get the first page 
page = pdf.Pages[0]

# Access the annotations on the page
annotations = page.AnnotationsWidget

# Create a list to save information of annotations
sb = []

# Access the first annotation on the page
annotation = annotations.get_Item(0)

# Append the annotation details to the list
sb.append("Annotation information: ")
sb.append("Text: " + annotation.Text)
modifiedDate = annotation.ModifiedDate.ToString()
sb.append("ModifiedDate: " + modifiedDate)

# Save the list as a Text file
with open("GetSpecificAnnotation.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(sb))

# Close the PDF file
pdf.Close()

Extract Specific Annotation on a PDF Page
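Requesting an index that does not exist on the page is an easy mistake when picking annotations by position. The sketch below guards against it with a small hypothetical helper, `get_annotation_or_none`, which relies only on the `Count` property and `get_Item()` method that appear in this guide. The stub collection is a stand-in so the snippet runs without a PDF; with the library used here you would pass `page.AnnotationsWidget` instead.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_annotation_or_none(annotations: Any, index: int) -> Optional[Any]:
    """Return the annotation at `index`, or None if the index is out of range."""
    if 0 <= index < annotations.Count:
        return annotations.get_Item(index)
    return None


# Stand-in for an annotation collection (illustration only, not a library class)
_items = ["first note", "second note"]
stub = SimpleNamespace(Count=len(_items), get_Item=lambda i: _items[i])

print(get_annotation_or_none(stub, 0))  # prints: first note
print(get_annotation_or_none(stub, 5))  # prints: None
```

Returning `None` rather than raising keeps the calling script simple when a page may or may not carry the annotation you expect.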

Export All Annotations from Specified PDF Pages

The steps for extracting all annotations from a page are essentially the same as those for extracting a specific annotation. The difference is that you loop over the page's annotation collection and call the PdfAnnotationCollection.get_Item() method with each index, so that no annotation is missed. Let's take a look at the detailed steps and a code example.

Steps to export all annotations from a PDF page:

  • Create an instance of the PdfDocument class, and use the PdfDocument.LoadFromFile() method to load the source file.

  • Access the collection of annotations on a specified page with the PdfPageBase.AnnotationsWidget property.

  • Loop through all annotations.

    • Retrieve each annotation through the PdfAnnotationCollection.get_Item() method.
  • Add details of annotations to a list and save it.

Below is the code example of extracting all annotations from the second page of a PDF:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load the file from disk
pdf.LoadFromFile("AI-Generated Art.pdf")

# Get all annotations from the second page
annotations = pdf.Pages[1].AnnotationsWidget

# Create a list to maintain annotation details
sb = []

# Loop through annotations on the page
if annotations.Count > 0:
    for i in range(annotations.Count):
        # Get the current annotation
        annotation = annotations.get_Item(i)

        # Skip popup annotations; they only display the text of their parent annotation
        if isinstance(annotation, PdfPopupAnnotationWidget):
            continue

        # Append the annotation details to the list
        sb.append("Annotation information: ")
        sb.append("Text: " + annotation.Text)
        modifiedDate = annotation.ModifiedDate.ToString()
        sb.append("ModifiedDate: " + modifiedDate)

# Save annotations as a Text file
with open("GetAllAnnotationsFromPage.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(sb))

# Release resources
pdf.Close()

Extract All Annotations on a PDF Page
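If you repeat this loop in several scripts, it may be convenient to wrap the `Count`/`get_Item()` pattern in a generator and keep the formatting in one place. The following is a sketch under assumptions: `iter_annotations` and `format_annotation` are hypothetical helper names, and the stub objects only imitate the `Text`, `ModifiedDate.ToString()`, `Count`, and `get_Item()` members used in this guide so the code can run without a PDF.

```python
from types import SimpleNamespace
from typing import Any, Iterator, List


def iter_annotations(annotations: Any) -> Iterator[Any]:
    """Yield each item of a collection that exposes Count and get_Item()."""
    for i in range(annotations.Count):
        yield annotations.get_Item(i)


def format_annotation(annotation: Any) -> List[str]:
    """Return the three output lines used in this guide for one annotation."""
    return [
        "Annotation information: ",
        "Text: " + annotation.Text,
        "ModifiedDate: " + annotation.ModifiedDate.ToString(),
    ]


def _stub_annotation(text: str, date: str) -> SimpleNamespace:
    """Stand-in for an annotation object (illustration only)."""
    return SimpleNamespace(Text=text, ModifiedDate=SimpleNamespace(ToString=lambda: date))


# Stand-in for a page's annotation collection (illustration only)
_items = [
    _stub_annotation("Check this claim", "2024-05-01"),
    _stub_annotation("Add a source", "2024-05-02"),
]
collection = SimpleNamespace(Count=len(_items), get_Item=lambda i: _items[i])

lines: List[str] = []
for annotation in iter_annotations(collection):
    lines.extend(format_annotation(annotation))
print("\n".join(lines))
```

A type check such as the popup filter from the example above could be added inside the `for` loop before calling `format_annotation`.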

How to Extract All Annotations from the Entire PDF

After going through the previous two sections, you might want to extract every annotation in a PDF file, for example to archive or reuse comments and notes. You'll still use the PdfPageBase.AnnotationsWidget property and the PdfAnnotationCollection.get_Item() method, but this time you'll add a step that iterates through all the pages. Here are the detailed steps.

Steps to extract all annotations from a PDF:

  • Instantiate a PdfDocument object, and use the PdfDocument.LoadFromFile() method to load the PDF from disk.

  • Loop through all pages.

    • Access the annotation collection of each page through the PdfPageBase.AnnotationsWidget property.

    • Iterate through the collection, and get each annotation using the PdfAnnotationCollection.get_Item() method.

      • Append annotation details to a list.
  • Save the list.

The following code extracts all annotations from a PDF and saves them to a text file:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load the file from disk 
pdf.LoadFromFile("AI-Generated Art.pdf")

# Create a list to save annotation details
sb = []

# Iterate through all pages in the PDF document
for pageIndex in range(pdf.Pages.Count):
    sb.append(f"Page {pageIndex + 1}:")

    # Access the annotation collection of the current page
    annotations = pdf.Pages[pageIndex].AnnotationsWidget

    # Loop through annotations in the collection
    if annotations.Count > 0:
        for i in range(annotations.Count):
            # Get the current annotation
            annotation = annotations.get_Item(i)

            # Skip invalid annotations (empty text and default date)
            if not annotation.Text.strip() and annotation.ModifiedDate.ToString() == "0001/1/1 0:00:00":
                continue

            # Extract annotation information
            sb.append("Annotation information: ")
            sb.append("Text: " + (annotation.Text.strip() or "N/A"))
            modifiedDate = annotation.ModifiedDate.ToString()
            sb.append("ModifiedDate: " + modifiedDate)
    else:
        sb.append("No annotations found.")

    # Add a blank line after each page
    sb.append("")

# Save all annotations to a file
with open("GetAllAnnotationsFromDocument.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(sb))

# Close the PDF document
pdf.Close()

Extract All Annotations from a PDF File
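When the extracted comments feed another tool rather than a human reader, structured records can be easier to work with than flat text lines. The sketch below is one possible approach, not part of the library: `collect_records` is a hypothetical helper that takes one annotation collection per page and returns a list of dictionaries, which are then serialized with the standard `json` module. The stub collections stand in for real pages so the snippet runs on its own.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def collect_records(page_collections: Iterable[Any]) -> List[Dict[str, Any]]:
    """Build one dict per annotation, tagged with its 1-based page number."""
    records = []
    for page_number, annotations in enumerate(page_collections, start=1):
        for i in range(annotations.Count):
            annotation = annotations.get_Item(i)
            records.append({
                "page": page_number,
                "text": annotation.Text.strip(),
                "modified": annotation.ModifiedDate.ToString(),
            })
    return records


def _stub_collection(*pairs):
    """Stand-in for a page's annotation collection (illustration only)."""
    items = [
        SimpleNamespace(Text=text, ModifiedDate=SimpleNamespace(ToString=lambda d=date: d))
        for text, date in pairs
    ]
    return SimpleNamespace(Count=len(items), get_Item=lambda i: items[i])


# Two stub pages: the first has two annotations, the second has none
pages = [
    _stub_collection(("Check this claim ", "2024-05-01"), ("Add a source", "2024-05-02")),
    _stub_collection(),
]
records = collect_records(pages)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

With the objects used in this guide, the input could be built as `[pdf.Pages[i].AnnotationsWidget for i in range(pdf.Pages.Count)]`, and the JSON string can be written to a file the same way the text files are saved in the examples.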

Conclusion

This guide showed how to extract annotations from PDF files using Python, with step-by-step instructions for extracting a specific annotation from a page, all annotations on a single page, and all annotations in the entire document. Each scenario comes with a code example that you can adapt to your own files.