Automating PDF Table Extraction in Java: From Text to Excel/CSV for All Scenarios

As enterprise demand for data automation grows, extracting tables from PDF documents has become a common requirement. When working with structured documents such as financial reports or supply chain lists, developers need more than raw text: they need the row and column structure preserved so the data stays intact. This article uses Spire.PDF for Java to show how to convert PDF tables into text, CSV, and Excel output, covering the most common automated extraction scenarios.

Environment Setup

Before diving into the implementation, make sure your development environment is properly configured:

  • JDK Support: JDK 8 or later is recommended.

  • Maven Dependency: Add the following repository and dependency to your pom.xml. The version shown is the one used in this article; check the repository for newer releases.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.12.16</version>
    </dependency>
</dependencies>

Basic Extraction: Converting PDF Tables to Text

As an entry-level solution, extracting PDF tables as text is suitable for scenarios where formatting is not critical and only raw content is required.

Technical Logic:
Use PdfTableExtractor to detect tables, then iterate through rows and columns to retrieve cell text.

import com.spire.pdf.*;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTable {
    public static void main(String[] args) throws IOException {
        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        // Load the PDF file
        pdf.loadFromFile("/input/sample.pdf");

        // Create a StringBuilder to store extracted content
        StringBuilder builder = new StringBuilder();

        // Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        // Loop through all pages
        for (int page = 0; page < pdf.getPages().getCount(); page++)
        {
            // Extract tables from the current page
            PdfTable[] tableLists = extractor.extractTable(page);
            if (tableLists != null && tableLists.length > 0)
            {
                // Iterate through each table
                for (PdfTable table : tableLists)
                {
                    int row = table.getRowCount();      // Number of rows
                    int column = table.getColumnCount(); // Number of columns
                    for (int i = 0; i < row; i++)
                    {
                        for (int j = 0; j < column; j++)
                        {
                            // Get text from each cell
                            String text = table.getText(i, j);

                            // Append cell content to the buffer
                            builder.append(text).append(" ");
                        }
                        // Line break after each row
                        builder.append("\r\n");
                    }
                }
            }
        }

        // Save the result as a TXT file; try-with-resources closes the writer even if writing fails
        try (FileWriter fileWriter = new FileWriter("/output/totext.txt")) {
            fileWriter.write(builder.toString());
        }

        // Release resources held by the PDF document
        pdf.close();
    }
}

Code explanation (key points):

  • PdfTableExtractor detects table structures on a page and exposes each cell by row and column index, rather than returning a flat stream of text.

  • Nested loops ensure row–column traversal, preserving table order.

  • StringBuilder improves performance when concatenating large amounts of text.
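One limitation of separating cells with a single space is that columns cannot be told apart again when a cell itself contains spaces. The following is a minimal sketch of one alternative, joining cells with tabs instead. The `RowFormatter` class and its `formatRows` method are hypothetical names used for illustration, and the `String[][]` argument stands in for the values you would read with `table.getText(i, j)`.

```java
import java.util.ArrayList;
import java.util.List;

class RowFormatter {

    // Joins the cells of each row with tabs and ends each row with a line break.
    // Null cells become empty strings; tabs and line breaks inside a cell are
    // replaced with a single space so they cannot be mistaken for delimiters.
    static String formatRows(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            List<String> cells = new ArrayList<>();
            for (String cell : row) {
                String value = (cell == null) ? "" : cell;
                cells.add(value.replaceAll("[\\t\\r\\n]+", " ").trim());
            }
            out.append(String.join("\t", cells)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in the real program these values come from the extracted table
        String[][] sample = {
            {"Item", "Unit Price"},
            {"Steel bolt", "0.25"}
        };
        System.out.print(formatRows(sample));
    }
}
```

A tab-separated file can be split back into columns reliably, which a space-separated one cannot once cells such as "Steel bolt" contain spaces of their own.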

Solution 1: Lightweight Export — Convert PDF Tables to CSV

While text output is fast and simple, CSV is a better choice when data needs further processing, database import, or analytical computation. CSV is widely supported across databases, analytics tools, and programming languages, making it an ideal intermediate format for automated pipelines.

Technical Logic:
After extracting tables via PdfTableExtractor, write the content using comma-separated values while handling special characters properly.

import com.spire.pdf.*;  
import com.spire.pdf.utilities.*;  

import java.io.*;  

public class ExtractTable {  
    public static void main(String[] args) throws Exception {  
        // 1. Load the PDF document  
        PdfDocument pdf = new PdfDocument();  
        pdf.loadFromFile("/input/sample.pdf");  

        // StringBuilder to collect CSV content  
        StringBuilder sb = new StringBuilder();  

        // 2. Create the extractor once, then iterate through pages and extract tables
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            PdfTable[] tableLists = extractor.extractTable(i);

            if (tableLists != null) {  
                for (PdfTable table : tableLists) {  
                    for (int row = 0; row < table.getRowCount(); row++) {  
                        for (int col = 0; col < table.getColumnCount(); col++) {  
                            // Escape special characters for CSV compliance  
                            String cellText = escapeCsvField(table.getText(row, col));  
                            sb.append(cellText);  

                            // Separate columns with commas  
                            if (col < table.getColumnCount() - 1) {  
                                sb.append(",");  
                            }  
                        }  
                        // New line after each row  
                        sb.append("\n");  
                    }  
                }  
            }  
        }  

        // 3. Write CSV file with UTF-8 BOM  
        File outFile = new File("/output/tocsv.csv");  

        if (!outFile.getParentFile().exists()) {  
            outFile.getParentFile().mkdirs();  
        }  

        try (FileOutputStream fos = new FileOutputStream(outFile)) {  
            // Write UTF-8 BOM to prevent Excel encoding issues  
            fos.write(0xEF);  
            fos.write(0xBB);  
            fos.write(0xBF);  

            try (Writer writer = new OutputStreamWriter(fos, "UTF-8")) {  
                writer.write(sb.toString());  
            }  
        }  

        pdf.close();  
        System.out.println("PDF tables have been successfully exported to CSV.");  
    }  

    /**
     * Escapes a CSV field by:
     * 1. Replacing line breaks with spaces and trimming surrounding whitespace
     * 2. Doubling embedded double quotes
     * 3. Wrapping fields that contain a comma, semicolon, or double quote in quotes
     */
    private static String escapeCsvField(String text) {
        if (text == null) return "";

        // Flatten line breaks inside cells, then trim before any quoting is applied
        text = text.replaceAll("[\\n\\r]+", " ").trim();

        // Line breaks are already gone at this point, so only delimiters and quotes need checking
        boolean containsSpecialChar = text.contains(",") || text.contains(";") ||
                text.contains("\"");

        if (containsSpecialChar) {
            text = text.replace("\"", "\"\"");
            text = "\"" + text + "\"";
        }

        return text;
    }
}
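To see how these quoting rules behave on typical cell values, the following standalone sketch applies them to a sample row without needing a PDF. The `CsvEscapeDemo` class and its `escape` and `toCsvLine` methods are hypothetical names for illustration. Like the helper above, it flattens line breaks to spaces, which is a simplification; RFC 4180 also permits line breaks inside quoted fields.

```java
class CsvEscapeDemo {

    // Flattens line breaks, trims, and quotes the field when it contains
    // a comma, semicolon, or double quote (embedded quotes are doubled).
    static String escape(String text) {
        if (text == null) return "";
        String value = text.replaceAll("[\\r\\n]+", " ").trim();
        boolean needsQuotes = value.contains(",") || value.contains(";") || value.contains("\"");
        if (needsQuotes) {
            value = "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Escapes every cell and joins the results with commas
    static String toCsvLine(String[] cells) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) line.append(',');
            line.append(escape(cells[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Sample cells: an embedded comma, an embedded quote, a line break, and a null
        String[] row = {"Widget, large", "5\" pipe", "Line one\nLine two", null};
        System.out.println(toCsvLine(row));
    }
}
```

Only the first two cells end up quoted; the line break in the third is replaced by a space, and the null cell becomes an empty trailing field.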

Solution 2: Structured Export — Convert PDF Tables Directly to Excel

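CSV stores only plain text: it has no notion of worksheets, column widths, merged cells, or styles. For reports such as financial statements or audit documents, exporting PDF tables directly to Excel is often the better option. The example below writes each table to its own worksheet and auto-fits the column widths; it transfers cell text only, so any further styling has to be applied on the Excel side. It relies on the com.spire.xls classes from Spire.XLS for Java, a separate library from Spire.PDF that must also be on the classpath.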

Core Steps:

  1. Load the source PDF document.

  2. Extract tables and iterate through their cells.

  3. Write data into Excel worksheets.

  4. (Optional) Adjust column widths and styles.

  5. Save the workbook as .xlsx.

import com.spire.pdf.PdfDocument;  
import com.spire.pdf.utilities.PdfTable;  
import com.spire.pdf.utilities.PdfTableExtractor;  
import com.spire.xls.ExcelVersion;  
import com.spire.xls.Workbook;  
import com.spire.xls.Worksheet;  

public class ExtractTable {  

    public static void main(String[] args) {  

        // Load the PDF document  
        PdfDocument pdf = new PdfDocument("/input/sample.pdf");  

        // Create a table extractor  
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);  

        // Extract tables from the first page  
        PdfTable[] pdfTables  = extractor.extractTable(0);  

        // Create an Excel workbook  
        Workbook wb = new Workbook();  
        wb.getWorksheets().clear();  

        if (pdfTables != null && pdfTables.length > 0) {  
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {  
                Worksheet sheet = wb.getWorksheets().add("Table - " + (tableNum + 1));  

                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {  
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {  
                        String text = pdfTables[tableNum].getText(rowNum, colNum);  
                        sheet.get(rowNum + 1, colNum + 1).setText(text);  
                    }  
                }  

                // Auto-fit column width  
                for (int col = 0; col < sheet.getColumns().length; col++) {  
                    sheet.autoFitColumn(col + 1);  
                }  
            }  
        }  

        // Save as Excel file
        wb.saveToFile("/output/toexcel.xlsx", ExcelVersion.Version2016);

        // Release resources held by the PDF document
        pdf.close();
    }  
}
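The example above reads only the first page and names sheets "Table - N". If you extend it to loop over every page, each worksheet needs a unique name that also respects Excel's sheet-name rules: at most 31 characters, and none of the characters \ / ? * [ ] :. The following is a minimal sketch of a naming helper under those rules. The `SheetNames` class and its `forTable` method are hypothetical names for illustration and are not part of Spire.PDF or Spire.XLS.

```java
class SheetNames {

    // Excel limits worksheet names to 31 characters
    private static final int MAX_LENGTH = 31;

    // Builds a name such as "Table P1 T2" from zero-based page and table indexes.
    // Characters that Excel forbids in sheet names are replaced with underscores,
    // and the prefix is shortened if needed so the suffix always fits.
    static String forTable(String prefix, int pageIndex, int tableIndex) {
        String suffix = " P" + (pageIndex + 1) + " T" + (tableIndex + 1);
        String cleaned = (prefix == null ? "" : prefix)
                .replaceAll("[\\\\/?*\\[\\]:]", "_")
                .trim();
        if (cleaned.isEmpty()) {
            cleaned = "Table";
        }
        int room = MAX_LENGTH - suffix.length();
        if (cleaned.length() > room) {
            cleaned = cleaned.substring(0, room);
        }
        return cleaned + suffix;
    }

    public static void main(String[] args) {
        System.out.println(forTable("Table", 0, 1));
        System.out.println(forTable("Q1/Q2: summary", 2, 0));
    }
}
```

In the Excel example, the result could be passed to `wb.getWorksheets().add(...)` in place of the fixed "Table - N" string once an outer loop over pages supplies the page index.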

Summary

This article covered three ways to get table data out of a PDF with Spire.PDF for Java: plain text for quick inspection, CSV for pipelines and database imports, and Excel when the result needs separate worksheets and column formatting. Choose the output format based on how the data will be consumed downstream, and the same extraction loop can feed any of the three.