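[PDF Splitting + Recognition + Renaming + Table Export] Splitting a PDF into Single Pages, Batch-Renaming Them from Extracted Content, and Exporting the Recognized Content to a Spreadsheet: An Implementation Based on WPF and Tencent Cloud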

Original

不负众望

Published 2025-03-06 08:58:15

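I. Project Background

Many business scenarios, such as document management and data extraction, call for fine-grained processing of PDF files. Splitting a PDF into single pages, giving each page a meaningful name, and extracting its key information into a table are traditionally done by hand, which is slow and error-prone, and manual processing cannot keep up as the volume of business data grows. An automated solution is therefore needed. This solution builds its user interface with WPF (Windows Presentation Foundation) so that it is easy to operate, and draws on Tencent Cloud services to cover the whole workflow: splitting the PDF, recognizing its content, renaming the pages, and exporting the extracted information to a table.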

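II. Detailed Steps

(1) Environment Setup

Install the WPF development environment: make sure Visual Studio is installed, and choose the WPF Application template when creating the project.
Configure the Tencent Cloud SDK: obtain the SDK for your language (C# here) from the Tencent Cloud website. In the Visual Studio project, install the Tencent Cloud SDK dependencies through the NuGet Package Manager, for example the SDK package for OCR (optical character recognition).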

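(2) Splitting the PDF File

Add a third-party PDF library: for example iTextSharp, installed as the iTextSharp package through NuGet.
Write the splitting code: create a method in the WPF project that splits the PDF file. Sample code: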

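using System.IO;
using iTextSharp.text;       // Document
using iTextSharp.text.pdf;   // PdfReader, PdfCopy

// Splits the input PDF into one file per page: page_1.pdf, page_2.pdf, ...
public void SplitPdf(string inputPdfPath, string outputFolder)
{
    // CreateDirectory does nothing if the folder already exists.
    Directory.CreateDirectory(outputFolder);

    PdfReader reader = new PdfReader(inputPdfPath);
    try
    {
        // iTextSharp page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string pagePath = Path.Combine(outputFolder, $"page_{i}.pdf");
            using (FileStream output = new FileStream(pagePath, FileMode.Create))
            {
                Document document = new Document(reader.GetPageSizeWithRotation(i));
                PdfCopy copy = new PdfCopy(document, output);
                document.Open();
                copy.AddPage(copy.GetImportedPage(reader, i));
                // Closing the document also finishes the PdfCopy writer.
                document.Close();
            }
        }
    }
    finally
    {
        // Release the input file even if a page fails to copy.
        reader.Close();
    }
}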
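Names such as page_2.pdf and page_10.pdf do not sort in page order when compared as text. A minimal sketch of one way around that, assuming you are free to change the output names, is to pad the page number with zeros. `PageNaming` and `BuildPageFileName` are hypothetical names used for illustration; they are not part of iTextSharp.

```csharp
using System;
using System.Globalization;

public static class PageNaming
{
    // Hypothetical helper: returns "page_007.pdf" for page 7 of a 120-page file,
    // so that a plain text sort of the names matches the page order.
    public static string BuildPageFileName(int pageNumber, int totalPages)
    {
        if (totalPages < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(totalPages));
        }
        if (pageNumber < 1 || pageNumber > totalPages)
        {
            throw new ArgumentOutOfRangeException(nameof(pageNumber));
        }

        // Pad to the number of digits in the last page number.
        int width = totalPages.ToString(CultureInfo.InvariantCulture).Length;
        string number = pageNumber.ToString(CultureInfo.InvariantCulture).PadLeft(width, '0');
        return $"page_{number}.pdf";
    }
}
```

Inside SplitPdf, the file name could then be built with PageNaming.BuildPageFileName(i, reader.NumberOfPages) in place of the interpolated page_{i}.pdf string.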

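(3) Content Recognition and Renaming

Call the Tencent Cloud OCR service: configure the OCR service's authentication information (such as the secret key) in the WPF project, then write code that calls the OCR API to recognize the text on each split PDF page. Sample code: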

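using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using TencentCloud.Common;
using TencentCloud.Common.Profile;
using TencentCloud.Ocr.V20181119;
using TencentCloud.Ocr.V20181119.Models;

// Sends one single-page PDF to GeneralBasicOCR and returns every recognized line.
public async Task<string> RecognizeTextFromPdfPage(string pdfPagePath)
{
    byte[] fileBytes = File.ReadAllBytes(pdfPagePath);
    string base64File = Convert.ToBase64String(fileBytes);

    // Placeholders: replace with your own credentials and keep real keys out of source control.
    Credential cred = new Credential
    {
        SecretId = "YOUR_SECRET_ID",
        SecretKey = "YOUR_SECRET_KEY"
    };

    HttpProfile httpProfile = new HttpProfile();
    httpProfile.Endpoint = "ocr.tencentcloudapi.com";
    ClientProfile clientProfile = new ClientProfile();
    clientProfile.HttpProfile = httpProfile;

    OcrClient client = new OcrClient(cred, "ap-guangzhou", clientProfile);

    GeneralBasicOCRRequest req = new GeneralBasicOCRRequest();
    req.ImageBase64 = base64File;
    // The input is a PDF rather than an image, so PDF recognition must be enabled.
    req.IsPdf = true;

    GeneralBasicOCRResponse resp = await client.GeneralBasicOCR(req);

    // Each TextDetections entry is one recognized line. Return all of them,
    // not only the first, and tolerate an empty result.
    if (resp.TextDetections == null || resp.TextDetections.Length == 0)
    {
        return string.Empty;
    }
    return string.Join(Environment.NewLine, resp.TextDetections.Select(t => t.DetectedText));
}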
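Hard-coding the secret ID and key in source code makes them easy to leak. As a rough sketch of one alternative, the credentials can be read from environment variables at startup. The class name `OcrCredentials` and the two variable names below are assumptions chosen for illustration; use whatever names your deployment defines.

```csharp
using System;

public static class OcrCredentials
{
    // Assumed variable names for this sketch; adjust them to your environment.
    public const string SecretIdVariable = "TENCENTCLOUD_SECRET_ID";
    public const string SecretKeyVariable = "TENCENTCLOUD_SECRET_KEY";

    // Reads both values and fails early with a clear message if either is missing.
    public static (string SecretId, string SecretKey) LoadFromEnvironment()
    {
        string id = Environment.GetEnvironmentVariable(SecretIdVariable);
        string key = Environment.GetEnvironmentVariable(SecretKeyVariable);

        if (string.IsNullOrWhiteSpace(id) || string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException(
                $"Set {SecretIdVariable} and {SecretKeyVariable} before calling the OCR service.");
        }

        // Trim stray whitespace that often sneaks in when values are pasted.
        return (id.Trim(), key.Trim());
    }
}
```

The returned pair can be assigned to the SecretId and SecretKey properties of the Credential object in place of the "YOUR_SECRET_ID" and "YOUR_SECRET_KEY" placeholders.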

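Rename based on the recognized content: extract key information from the recognized text and use it to rename the file. For example, if the text contains a date and a customer name, the file can be renamed to "date_customerName.pdf". The renaming method is as follows: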

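using System.IO;
using System.Text.RegularExpressions;

// Renames one page file after the key information found in its recognized text.
public void RenameFileBasedOnText(string pdfPagePath, string recognizedText)
{
    string folderPath = Path.GetDirectoryName(pdfPagePath);
    string baseName = ExtractKeyInfo(recognizedText);

    // Replace characters that are not allowed in file names,
    // in case custom extraction logic returns any.
    foreach (char c in Path.GetInvalidFileNameChars())
    {
        baseName = baseName.Replace(c, '_');
    }

    // File.Move throws if the target already exists, so add a numeric suffix
    // when several pages yield the same name (for example "default_name").
    string newFilePath = Path.Combine(folderPath, $"{baseName}.pdf");
    for (int n = 2; File.Exists(newFilePath); n++)
    {
        newFilePath = Path.Combine(folderPath, $"{baseName}_{n}.pdf");
    }

    File.Move(pdfPagePath, newFilePath);
}

private string ExtractKeyInfo(string text)
{
    // Put the key-information extraction logic here, for example matching
    // the date and the customer name with regular expressions.
    // Example: assume dates are written as YYYY-MM-DD and the customer name
    // follows the label "客户名称：" ("Customer name:").
    string datePattern = @"\d{4}-\d{2}-\d{2}";
    string clientNamePattern = @"客户名称：(\w+)";

    Match dateMatch = Regex.Match(text, datePattern);
    Match clientNameMatch = Regex.Match(text, clientNamePattern);

    if (dateMatch.Success && clientNameMatch.Success)
    {
        return $"{dateMatch.Value}_{clientNameMatch.Groups[1].Value}";
    }

    // Fallback when the page does not contain both fields.
    return "default_name";
}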
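OCR output rarely matches one exact layout: the colon after a label may be full-width or half-width, and dates may use different separators. The following is a sketch of a more tolerant extractor under those assumptions. `KeyInfoExtractor` and its patterns are illustrative only and would need tuning against real documents; the date is normalized but not checked for calendar validity.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class KeyInfoExtractor
{
    // Accepts 2025-03-06, 2025/3/6, 2025.03.06 and 2025年3月6日.
    // [0-9] is used instead of \d so that only ASCII digits are matched.
    private static readonly Regex DateRegex = new Regex(
        @"(?<y>[0-9]{4})\s*[-/.年]\s*(?<m>[0-9]{1,2})\s*[-/.月]\s*(?<d>[0-9]{1,2})");

    // Accepts a full-width or half-width colon after the label "客户名称".
    private static readonly Regex ClientRegex = new Regex(
        @"客户名称\s*[：:]\s*(?<name>\w+)");

    // Returns "YYYY-MM-DD_name", or the fallback when either field is missing.
    public static string Extract(string text, string fallback = "default_name")
    {
        if (string.IsNullOrEmpty(text))
        {
            return fallback;
        }

        Match date = DateRegex.Match(text);
        Match client = ClientRegex.Match(text);
        if (!date.Success || !client.Success)
        {
            return fallback;
        }

        int y = int.Parse(date.Groups["y"].Value, CultureInfo.InvariantCulture);
        int m = int.Parse(date.Groups["m"].Value, CultureInfo.InvariantCulture);
        int d = int.Parse(date.Groups["d"].Value, CultureInfo.InvariantCulture);

        // Normalize to zero-padded YYYY-MM-DD so the file names sort by date.
        return $"{y:D4}-{m:D2}-{d:D2}_{client.Groups["name"].Value}";
    }
}
```

For example, KeyInfoExtractor.Extract("日期 2025/3/6\n客户名称: 张三") yields "2025-03-06_张三", while text lacking either field yields "default_name".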

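(4) Exporting the Information to a Table

Create the data structure: define a class in the WPF project to hold the information to export, such as each page's file name and the key information recognized on it. Sample code: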

// Holds the information exported for one page.
public class PdfPageInfo
{
    public string FileName { get; set; }
    public string RecognizedText { get; set; }
}

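Populate the data and export the table: fill the data structure above with each page's information, then export the data to an Excel workbook with a third-party library such as ClosedXML. Sample code: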

using System.Collections.Generic;
using ClosedXML.Excel;

// Writes one row per page to a single worksheet and saves the workbook.
public void ExportToExcel(List<PdfPageInfo> pageInfos, string outputExcelPath)
{
    using (XLWorkbook wb = new XLWorkbook())
    {
        IXLWorksheet ws = wb.AddWorksheet("PDF Page Information");

        // Header row.
        ws.Cell(1, 1).Value = "File Name";
        ws.Cell(1, 2).Value = "Recognized Text";

        // Data rows start at row 2, directly below the header.
        for (int i = 0; i < pageInfos.Count; i++)
        {
            ws.Cell(i + 2, 1).Value = pageInfos[i].FileName;
            ws.Cell(i + 2, 2).Value = pageInfos[i].RecognizedText;
        }

        wb.SaveAs(outputExcelPath);
    }
}
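Where an .xlsx file is not required, the same two columns can be written as CSV with the base class library alone. This is a swapped-in alternative to the ClosedXML export, not the article's method, and `CsvExport` is a hypothetical helper. The sketch quotes fields that contain commas, quotes, or line breaks, which matters because recognized text often spans several lines.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvExport
{
    // Quotes a field when it contains a comma, a quote or a line break,
    // and doubles any embedded quotes.
    public static string EscapeField(string value)
    {
        if (value == null)
        {
            return string.Empty;
        }
        bool mustQuote = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        string escaped = value.Replace("\"", "\"\"");
        return mustQuote ? "\"" + escaped + "\"" : escaped;
    }

    // Builds the CSV text: a header line followed by one line per page.
    public static string BuildCsv(IEnumerable<(string FileName, string RecognizedText)> rows)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("File Name,Recognized Text\r\n");
        foreach ((string fileName, string recognizedText) in rows)
        {
            sb.Append(EscapeField(fileName))
              .Append(',')
              .Append(EscapeField(recognizedText))
              .Append("\r\n");
        }
        return sb.ToString();
    }

    // Writes UTF-8 with a byte order mark so that spreadsheet
    // applications are more likely to detect the encoding of Chinese text.
    public static void WriteCsv(string path, IEnumerable<(string FileName, string RecognizedText)> rows)
    {
        File.WriteAllText(path, BuildCsv(rows), new UTF8Encoding(true));
    }
}
```

A list of PdfPageInfo objects can be passed in by projecting each item to a (FileName, RecognizedText) tuple.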

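(5) WPF User Interface Interaction

Design the interface: lay out the user interface in the WPF XAML file, with a button for selecting the PDF file, a button for selecting the output folder, a button for starting the processing, and a text box or list box that shows progress and results.
Bind the event handlers: attach a handler to each button. For example, the Select PDF File button opens a file selection dialog, and the Start Processing button runs the splitting, recognition, renaming, and table export steps described above in sequence. Sample code: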

<Window x:Class="PdfProcessingApp.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="PDF Processing" Height="350" Width="525">
    <Grid>
        <Button Content="Select PDF File" HorizontalAlignment="Left" Margin="10,10,0,0" VerticalAlignment="Top" Width="120" Click="SelectPdfFile_Click"/>
        <Button Content="Select Output Folder" HorizontalAlignment="Left" Margin="10,40,0,0" VerticalAlignment="Top" Width="120" Click="SelectOutputFolder_Click"/>
        <Button Content="Start Processing" HorizontalAlignment="Left" Margin="10,70,0,0" VerticalAlignment="Top" Width="120" Click="StartProcessing_Click"/>
        <TextBox x:Name="ResultTextBox" HorizontalAlignment="Left" Margin="140,10,0,0" VerticalAlignment="Top" Width="375" Height="230" IsReadOnly="True"/>
    </Grid>
</Window>

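using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Windows;
// OpenFileDialog, FolderBrowserDialog and DialogResult come from Windows Forms,
// so the project needs a reference to System.Windows.Forms.
using System.Windows.Forms;

namespace PdfProcessingApp
{
    public partial class MainWindow : Window
    {
        private string pdfFilePath;
        private string outputFolderPath;

        public MainWindow()
        {
            InitializeComponent();
        }

        private void SelectPdfFile_Click(object sender, RoutedEventArgs e)
        {
            OpenFileDialog openFileDialog = new OpenFileDialog();
            openFileDialog.Filter = "PDF Files|*.pdf";
            if (openFileDialog.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                pdfFilePath = openFileDialog.FileName;
                ResultTextBox.Text += $"Selected PDF file: {pdfFilePath}\n";
            }
        }

        private void SelectOutputFolder_Click(object sender, RoutedEventArgs e)
        {
            FolderBrowserDialog folderBrowserDialog = new FolderBrowserDialog();
            if (folderBrowserDialog.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                outputFolderPath = folderBrowserDialog.SelectedPath;
                ResultTextBox.Text += $"Selected output folder: {outputFolderPath}\n";
            }
        }

        private async void StartProcessing_Click(object sender, RoutedEventArgs e)
        {
            if (string.IsNullOrEmpty(pdfFilePath) || string.IsNullOrEmpty(outputFolderPath))
            {
                ResultTextBox.Text += "Please select both a PDF file and an output folder.\n";
                return;
            }

            try
            {
                SplitPdf(pdfFilePath, outputFolderPath);

                // Take only the pages written by SplitPdf, in page order
                // (shorter names first, so page_2 comes before page_10).
                string[] pdfPageFiles = Directory.GetFiles(outputFolderPath, "page_*.pdf")
                    .OrderBy(f => f.Length)
                    .ThenBy(f => f, StringComparer.Ordinal)
                    .ToArray();

                List<PdfPageInfo> pageInfos = new List<PdfPageInfo>();
                foreach (string pdfPageFile in pdfPageFiles)
                {
                    string recognizedText = await RecognizeTextFromPdfPage(pdfPageFile);

                    // Find the renamed file by comparing the folder before and after
                    // the move, so the exported name is the new one, not page_N.pdf.
                    HashSet<string> before = new HashSet<string>(Directory.GetFiles(outputFolderPath, "*.pdf"));
                    RenameFileBasedOnText(pdfPageFile, recognizedText);
                    string newFilePath = Directory.GetFiles(outputFolderPath, "*.pdf")
                        .FirstOrDefault(f => !before.Contains(f)) ?? pdfPageFile;

                    pageInfos.Add(new PdfPageInfo { FileName = Path.GetFileName(newFilePath), RecognizedText = recognizedText });
                }

                ExportToExcel(pageInfos, Path.Combine(outputFolderPath, "PDF_Info.xlsx"));
                ResultTextBox.Text += "Processing completed. Information exported to Excel.\n";
            }
            catch (Exception ex)
            {
                // An async void handler has no caller to observe its exceptions,
                // so report failures (file, network or OCR errors) here.
                ResultTextBox.Text += $"Processing failed: {ex.Message}\n";
            }
        }
    }
}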
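The interface description mentions showing progress, but the handler only writes a message once everything has finished. A minimal sketch of per-page progress reporting follows; `BatchRunner`, `RunAsync`, and the callback shape are hypothetical names for illustration. It processes pages one at a time and invokes a callback after each one.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Runs processPage over the pages sequentially and calls reportProgress
    // with (pagesDone, pagesTotal) after each page completes.
    public static async Task<List<string>> RunAsync(
        IReadOnlyList<string> pagePaths,
        Func<string, Task<string>> processPage,
        Action<int, int> reportProgress)
    {
        List<string> results = new List<string>(pagePaths.Count);
        for (int i = 0; i < pagePaths.Count; i++)
        {
            // Await each page before starting the next one, which also
            // keeps the number of concurrent OCR requests at one.
            results.Add(await processPage(pagePaths[i]));
            reportProgress?.Invoke(i + 1, pagePaths.Count);
        }
        return results;
    }
}
```

When RunAsync is awaited from a button click handler, each continuation resumes on the UI thread, so a callback such as (done, total) => ResultTextBox.Text += $"Processed {done}/{total}\n" can update the text box directly. RecognizeTextFromPdfPage can be passed as the processPage argument.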

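Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.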
