How do you read the contents of a PDF file using Java?

11 months ago

Emily Johnson

2 minutes

Java can extract text from PDF files with Apache PDFBox, an open-source Java library for working with PDF documents. The following example shows how to use PDFBox to read the text content of a PDF file.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPDF {
    public static void main(String[] args) {
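        // Load the PDF file; try-with-resources calls document.close()
        // automatically, even if text extraction throws an exception
        try (PDDocument document = PDDocument.load(new File("path/to/your/pdf/file.pdf"))) {
            // Create a PDFTextStripper to extract the text
            PDFTextStripper stripper = new PDFTextStripper();

            // Extract the text content of the whole document
            String content = stripper.getText(document);

            // Print the extracted text
            System.out.println(content);
        } catch (IOException e) {
            // Thrown if the file cannot be read or parsed as a PDF
            e.printStackTrace();
        }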
    }
}

In the code above, replace "path/to/your/pdf/file.pdf" with the actual path to your PDF file. The getText() method of the PDFTextStripper class returns the plain text of the whole document as a single String. The PDDocument must be closed when you are finished with it, either by calling its close() method or by opening it in a try-with-resources statement, so that the underlying file is released.
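A wrong or mistyped path is a common reason for the IOException branch to fire. As a rough sketch (not part of PDFBox), the path can be checked before loading by confirming that the file exists and starts with the "%PDF-" header that PDF files normally begin with. The class name PdfPathCheck and the method looksLikePdf are hypothetical names chosen for this illustration, and the code assumes Java 11 or later. Some PDFs carry extra bytes before the header and would be rejected by this strict check, so treat it as an early warning rather than real validation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; it does not use or replace PDFBox.
public class PdfPathCheck {
    // PDF files normally begin with the ASCII bytes "%PDF-"
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the path is a readable regular file whose first bytes are "%PDF-". */
    public static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            // readNBytes returns fewer bytes if the file is shorter than the header
            byte[] header = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(header, PDF_MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Path.of(args.length > 0 ? args[0] : "path/to/your/pdf/file.pdf");
        if (looksLikePdf(path)) {
            System.out.println(path + " looks like a PDF");
        } else {
            System.out.println(path + " is missing, unreadable, or not a PDF");
        }
    }
}
```

If the check passes, the same path can be handed to PDDocument.load as in the example above. PDFBox still performs its own parsing, so the IOException handler remains necessary.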

Before running the code, make sure the PDFBox library is on your classpath. The example uses the PDFBox 2.x API; in PDFBox 3.x, PDDocument.load was replaced by Loader.loadPDF. In a Maven project, add the following dependency:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.26</version>
</dependency>

This way, you can read the content of a PDF file using Java.