How can Python read text from a PDF document?

1 year ago

Liam

In Python, you can use the PyPDF2 library to extract text from PDF files. First, install PyPDF2 with the following command:

pip install PyPDF2

Next, you can use the following code to read the text in a PDF file:

import PyPDF2

# Open the PDF file in binary mode; the with-block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object (PdfReader requires PyPDF2 2.0 or later)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Read the text content of each page
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        text = page.extract_text()
        print(text)

The code above opens a PDF file named example.pdf, extracts the text page by page, and prints it. You can also process the extracted text further or save it to a file.
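
As a rough illustration of the "save it to a file" option, here is a minimal sketch. The helper names extract_pages and save_pages, the "--- Page N ---" separator, and the output file name are assumptions made for this example; they are not part of PyPDF2. It also assumes PyPDF2 2.0 or later, where PdfReader and reader.pages are available.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return a list holding the extracted text of each page (hypothetical helper)."""
    # Imported here so save_pages() below can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.0 or later

    reader = PdfReader(pdf_path)
    # Image-only (scanned) pages may produce no text, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


def save_pages(page_texts, out_path):
    """Write page texts to a UTF-8 file, each under a '--- Page N ---' header."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"--- Page {number} ---")
        lines.append(text.strip())
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Return the number of pages written so the caller can report it.
    return len(page_texts)
```

With an example.pdf in the working directory, calling save_pages(extract_pages('example.pdf'), 'example.txt') should write one labeled section per page. Keeping extraction and saving in separate functions makes it easy to add a processing step between them, such as filtering out empty pages.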