The following steps show how to use Python to read every HTML file in a directory and write the extracted content to a CSV file:
import os
import csv
from bs4 import BeautifulSoup

def read_html_file(file_path):
    # Read one HTML file and parse it with BeautifulSoup
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract whatever you need here, depending on the HTML structure.
    # Example: extract the page title and the body text.
    title = soup.find('title').text
    body = soup.find('body').text
    return title, body
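Note that soup.find() returns None when a page has no <title> or <body> tag, so the two .text accesses above will raise an AttributeError on such files. If your directory may contain pages like that, a more defensive variant of the same function (a sketch, not part of the original answer) could look like this:

def read_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')
    # Fall back to an empty string when a tag is missing
    title_tag = soup.find('title')
    body_tag = soup.find('body')
    title = title_tag.get_text(strip=True) if title_tag else ''
    body = body_tag.get_text(separator=' ', strip=True) if body_tag else ''
    return title, body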
def process_html_files(directory):
    html_files = [f for f in os.listdir(directory) if f.endswith('.html')]
    data = []
    for file in html_files:
        file_path = os.path.join(directory, file)
        title, body = read_html_file(file_path)
        data.append([title, body])
    return data
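os.listdir only looks at the top level of the directory, and the .endswith('.html') check skips files named with the .htm extension. If you also need to pick up HTML files in subdirectories, a pathlib-based sketch (an alternative to the function above, assuming read_html_file from earlier) could be used instead:

from pathlib import Path

def process_html_files_recursive(directory):
    data = []
    # rglob walks the whole directory tree; the tuple covers both common extensions
    for path in sorted(Path(directory).rglob('*')):
        if path.suffix.lower() in ('.html', '.htm'):
            title, body = read_html_file(path)
            data.append([title, body])
    return data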
def write_to_csv(data, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Title', 'Body'])  # write the CSV header row
        writer.writerows(data)              # write the extracted content
directory = 'directory_path'        # replace with the actual directory path
output_file = 'output_path.csv'     # replace with the actual output file path
data = process_html_files(directory)
write_to_csv(data, output_file)
The code above iterates over all HTML files in the specified directory, extracts the title and body text of each one, and writes them to a CSV file. You can adapt the extraction logic and the CSV header row to your own needs.
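For example, if you also wanted the first <h1> heading and the meta description of each page (the field names and selectors here are illustrative assumptions; adjust them to your own documents), the extraction could be changed like this, with the header row and the unpacking in process_html_files updated to match:

def read_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else ''
    h1 = soup.find('h1')
    heading = h1.get_text(strip=True) if h1 else ''
    meta = soup.find('meta', attrs={'name': 'description'})
    description = meta.get('content', '') if meta else ''
    return title, heading, description

# process_html_files would then append all three values, e.g.
#     data.append(list(read_html_file(file_path)))
# and write_to_csv would use a matching header row:
#     writer.writerow(['Title', 'Heading', 'Description'])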