This can be done with the following steps: install pdfminer.six, then run the script below.
pip install pdfminer.six
import io
import os

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF at pdf_path."""
    resource_manager = PDFResourceManager()
    output_string = io.StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(resource_manager, output_string, codec=codec, laparams=laparams)
    with open(pdf_path, 'rb') as file:
        interpreter = PDFPageInterpreter(resource_manager, device)
        # check_extractable=True rejects PDFs that do not permit text extraction
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)
    text = output_string.getvalue()
    device.close()
    output_string.close()
    return text
pdf_folder = 'path/to/pdf/folder'
output_folder = 'path/to/output/folder'
os.makedirs(output_folder, exist_ok=True)  # create the output folder if it is missing

for filename in os.listdir(pdf_folder):
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder, filename)
        text = extract_text_from_pdf(pdf_path)
        # replace only the final extension rather than every '.pdf' in the name
        output_path = os.path.join(output_folder, os.path.splitext(filename)[0] + '.txt')
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(text)
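The script above loops over every PDF file in the specified PDF folder and saves the text extracted from each one to the output folder as a text file with the same base name and a .txt extension.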
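For larger batches, a slightly more defensive version of the same loop may be useful. The sketch below is only one possible arrangement, not part of the original answer: `convert_folder` and its `extract` parameter are hypothetical names introduced here for illustration. When no extractor is passed, it falls back to `pdfminer.high_level.extract_text`, the high-level helper in pdfminer.six that wraps the resource manager, converter, and interpreter setup. It assumes that a single unreadable PDF should be skipped rather than abort the whole run.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_folder(
    pdf_folder: str,
    output_folder: str,
    extract: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Write one .txt file per PDF in pdf_folder and return the paths written."""
    if extract is None:
        # Imported lazily so that a custom extractor can be used without pdfminer.six.
        from pdfminer.high_level import extract_text as extract

    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf_path in sorted(Path(pdf_folder).iterdir()):
        # Skip subdirectories and match the extension case-insensitively.
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        try:
            text = extract(str(pdf_path))
        except Exception as exc:
            # Assumption: a damaged or protected PDF is reported and skipped.
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        out_path = out_dir / (pdf_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_folder('path/to/pdf/folder', 'path/to/output/folder')` would then process the same folders as the script above. Passing the extractor as a parameter also makes it possible to reuse `extract_text_from_pdf` from the script, or to substitute a stub when testing the loop.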
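Recommended Tencent Cloud products: Tencent Cloud Object Storage (COS) for storing the PDF files and the extracted text files, and Tencent Cloud Serverless Cloud Function (SCF) for hosting and running the Python script.
Tencent Cloud Object Storage (COS) product page: https://cloud.tencent.com/product/cos
Tencent Cloud Serverless Cloud Function (SCF) product page: https://cloud.tencent.com/product/scf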