Automating PDF Table Extraction in Java: From Text to Excel/CSV for All Scenarios
As enterprise demand for data automation grows, extracting tables from PDF documents has become a common requirement. When working with structured documents such as financial reports or supply chain lists, developers need more than raw content: they need to preserve the logical structure and integrity of the data. This article uses Spire.PDF for Java to show how to export PDF tables to text, CSV, and Excel, covering the most common automated extraction scenarios.
Environment Setup
Before diving into the implementation, make sure your development environment is properly configured:
JDK Support: JDK 8 or later is recommended.
Maven Dependency: Add the following repository and dependency to your pom.xml. The version shown is the one used in this article; check the repository for newer releases.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.12.16</version>
    </dependency>
</dependencies>
Basic Extraction: Converting PDF Tables to Text
As an entry-level solution, extracting PDF tables as text is suitable for scenarios where formatting is not critical and only raw content is required.
Technical Logic:
Use PdfTableExtractor to detect tables, then iterate through rows and columns to retrieve cell text.
import com.spire.pdf.*;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTable {
    public static void main(String[] args) throws IOException {
        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        // Load the PDF file
        pdf.loadFromFile("/input/sample.pdf");
        // Create a StringBuilder to store extracted content
        StringBuilder builder = new StringBuilder();
        // Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        // Loop through all pages
        for (int page = 0; page < pdf.getPages().getCount(); page++) {
            // Extract tables from the current page
            PdfTable[] tableLists = extractor.extractTable(page);
            if (tableLists != null && tableLists.length > 0) {
                // Iterate through each table
                for (PdfTable table : tableLists) {
                    int row = table.getRowCount();       // Number of rows
                    int column = table.getColumnCount(); // Number of columns
                    for (int i = 0; i < row; i++) {
                        for (int j = 0; j < column; j++) {
                            // Get text from each cell
                            String text = table.getText(i, j);
                            // Append cell content, followed by a space separator
                            builder.append(text).append(" ");
                        }
                        // Line break after each row
                        builder.append("\r\n");
                    }
                }
            }
        }
        // Save the result as a TXT file; try-with-resources flushes and
        // closes the writer even if the write fails
        try (FileWriter fileWriter = new FileWriter("/output/totext.txt")) {
            fileWriter.write(builder.toString());
        }
        // Release the resources held by the document
        pdf.close();
    }
}
Code explanation (key points):
PdfTableExtractor identifies table structures instead of extracting plain text.
Nested loops ensure row–column traversal, preserving table order.
StringBuilder accumulates the output in a single buffer, avoiding the repeated copying that String concatenation inside a loop would cause.
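The traversal above joins cells with a single space, which makes column boundaries ambiguous as soon as a cell contains a space itself. A minimal sketch of an alternative, assuming the cell text has already been copied into a String[][]: join the cells with tabs instead. The class and method names here are hypothetical helpers for illustration, not part of Spire.PDF.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Spire.PDF: renders a grid of cell
// text as tab-separated lines so column boundaries stay unambiguous.
final class TableTextFormatter {

    static String toTabSeparated(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            StringJoiner line = new StringJoiner("\t");
            for (String cell : row) {
                // Treat missing cells as empty and flatten embedded line breaks
                String value = (cell == null) ? "" : cell.replaceAll("[\\r\\n]+", " ").trim();
                line.add(value);
            }
            out.append(line).append(System.lineSeparator());
        }
        return out.toString();
    }
}
```

In the extraction loop, each table could first be collected into a String[][] via table.getText(i, j) and then passed to this helper before writing the file.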
Solution 1: Lightweight Export — Convert PDF Tables to CSV
While text output is fast and simple, CSV is a better choice when data needs further processing, database import, or analytical computation. CSV is widely supported across databases, analytics tools, and programming languages, making it an ideal intermediate format for automated pipelines.
Technical Logic:
After extracting tables via PdfTableExtractor, write the content using comma-separated values while handling special characters properly.
import com.spire.pdf.*;
import com.spire.pdf.utilities.*;
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ExtractTable {
    public static void main(String[] args) throws Exception {
        // 1. Load the PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("/input/sample.pdf");
        // StringBuilder to collect CSV content
        StringBuilder sb = new StringBuilder();
        // Create the extractor once and reuse it for every page
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        // 2. Iterate through pages and extract tables
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            PdfTable[] tableLists = extractor.extractTable(i);
            if (tableLists != null) {
                for (PdfTable table : tableLists) {
                    for (int row = 0; row < table.getRowCount(); row++) {
                        for (int col = 0; col < table.getColumnCount(); col++) {
                            // Escape special characters for CSV compliance
                            String cellText = escapeCsvField(table.getText(row, col));
                            sb.append(cellText);
                            // Separate columns with commas
                            if (col < table.getColumnCount() - 1) {
                                sb.append(",");
                            }
                        }
                        // New line after each row
                        sb.append("\n");
                    }
                }
            }
        }
        // 3. Write CSV file with UTF-8 BOM
        File outFile = new File("/output/tocsv.csv");
        if (!outFile.getParentFile().exists()) {
            outFile.getParentFile().mkdirs();
        }
        try (FileOutputStream fos = new FileOutputStream(outFile)) {
            // Write UTF-8 BOM to prevent Excel encoding issues
            fos.write(0xEF);
            fos.write(0xBB);
            fos.write(0xBF);
            try (Writer writer = new OutputStreamWriter(fos, StandardCharsets.UTF_8)) {
                writer.write(sb.toString());
            }
        }
        pdf.close();
        System.out.println("PDF tables have been successfully exported to CSV.");
    }

    /**
     * Escapes CSV fields by:
     * 1. Replacing line breaks with spaces and trimming the result
     * 2. Escaping double quotes
     * 3. Wrapping fields containing special characters in quotes
     */
    private static String escapeCsvField(String text) {
        if (text == null) return "";
        // Flatten line breaks inside cells, then trim before any quoting
        // so that stray whitespace does not end up inside the quotes
        text = text.replaceAll("[\\n\\r]+", " ").trim();
        boolean containsSpecialChar = text.contains(",") || text.contains(";") ||
                text.contains("\"");
        if (containsSpecialChar) {
            text = "\"" + text.replace("\"", "\"\"") + "\"";
        }
        return text;
    }
}
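A few sample inputs make the escaping rules concrete. The sketch below re-implements the same rules in a standalone class so it can run without a PDF or the Spire library; CsvEscapeDemo is a hypothetical name used only for illustration.

```java
// Hypothetical standalone demo of the escaping rules described above:
// flatten line breaks, double embedded quotes, and quote fields that
// contain a comma, semicolon, or double quote.
final class CsvEscapeDemo {

    static String escapeCsvField(String text) {
        if (text == null) return "";
        String flat = text.replaceAll("[\\r\\n]+", " ").trim();
        boolean needsQuotes = flat.contains(",") || flat.contains(";") || flat.contains("\"");
        return needsQuotes ? "\"" + flat.replace("\"", "\"\"") + "\"" : flat;
    }

    public static void main(String[] args) {
        System.out.println(escapeCsvField("Acme, Inc."));    // "Acme, Inc."
        System.out.println(escapeCsvField("15\" monitor"));  // "15"" monitor"
        System.out.println(escapeCsvField("line1\nline2"));  // line1 line2
        System.out.println(escapeCsvField("  plain  "));     // plain
    }
}
```

Quoting fields that contain semicolons is not required for comma-delimited output, but it keeps the file readable in locales where spreadsheet applications treat the semicolon as the delimiter.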
Solution 2: Structured Export — Convert PDF Tables Directly to Excel
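CSV stores only plain text: it cannot hold formatting, merged cells, or styles. For reports such as financial statements or audit documents, writing the extracted cells into an Excel workbook is often the better option, since each table gets its own worksheet and column widths or styles can be applied afterwards. The example below imports classes from com.spire.xls, so it assumes Spire.XLS for Java is on the classpath alongside Spire.PDF.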
Core Steps:
Load the source PDF document.
Extract tables and iterate through their cells.
Write data into Excel worksheets.
(Optional) Adjust column widths and styles.
Save the workbook as .xlsx.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class ExtractTable {
    public static void main(String[] args) {
        // Load the PDF document
        PdfDocument pdf = new PdfDocument("/input/sample.pdf");
        // Create a table extractor
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        // Extract tables from the first page (page index 0)
        PdfTable[] pdfTables = extractor.extractTable(0);
        // Create an Excel workbook and remove its default worksheets
        Workbook wb = new Workbook();
        wb.getWorksheets().clear();
        if (pdfTables != null && pdfTables.length > 0) {
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {
                // One worksheet per extracted table
                Worksheet sheet = wb.getWorksheets().add("Table - " + (tableNum + 1));
                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {
                        String text = pdfTables[tableNum].getText(rowNum, colNum);
                        // Worksheet cells are 1-based, PDF table cells are 0-based
                        sheet.get(rowNum + 1, colNum + 1).setText(text);
                    }
                }
                // Auto-fit column width
                for (int col = 0; col < sheet.getColumns().length; col++) {
                    sheet.autoFitColumn(col + 1);
                }
            }
        }
        // Save as Excel file
        wb.saveToFile("/output/toexcel.xlsx", ExcelVersion.Version2016);
        // Release the resources held by the PDF document
        pdf.close();
    }
}
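The example above reads only the first page. When the same loop is extended to every page, each worksheet needs a unique name, and Excel limits sheet names to 31 characters and rejects the characters \ / ? * [ ] and :. A minimal sketch of a naming helper under those constraints follows; SheetNames and forTable are hypothetical names for illustration, not part of Spire.XLS.

```java
// Hypothetical helper, not part of Spire.XLS: builds a worksheet name
// that is unique per (page, table) pair and respects Excel's naming rules.
final class SheetNames {

    private static final int MAX_LENGTH = 31; // Excel's sheet-name limit

    static String forTable(String prefix, int pageIndex, int tableIndex) {
        // The suffix encodes 1-based page and table numbers, e.g. " P2-T1"
        String suffix = " P" + (pageIndex + 1) + "-T" + (tableIndex + 1);
        // Replace characters Excel does not allow in sheet names
        String base = (prefix == null ? "" : prefix)
                .replaceAll("[\\\\/?*\\[\\]:]", "_")
                .trim();
        if (base.isEmpty()) {
            base = "Table";
        }
        // Truncate the prefix so the suffix always fits within the limit
        int room = MAX_LENGTH - suffix.length();
        if (base.length() > room) {
            base = base.substring(0, room);
        }
        return base + suffix;
    }
}
```

For instance, forTable("Table", 0, 0) yields "Table P1-T1". Inside a loop over page indexes, the result could be passed to wb.getWorksheets().add(...) in place of the fixed "Table - n" name.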
Summary
From basic text extraction to CSV and Excel export, Spire.PDF for Java reduces PDF table processing to a few dozen lines of code. Choose plain text for quick inspection, CSV for lightweight pipelines and database imports, and Excel when each table should land in its own worksheet with adjustable column widths and styles.