零代码编程：用ChatGPT批量下载某个网页中的特定网页链接

AIGC部落

发布于 2024-06-24 15:48:39

800

发布于 2024-06-24 15:48:39

文章被收录于专栏：Dance with GenAI

以lexfridman的官方网站为例，https://lexfridman.com/podcast/，如何批量下载网页呢?

查看网页源文件，播客transcript的链接是：

<a href="https://lexfridman.com/jared-kushner-transcript">Transcript</a>

文件标题在这里：<title>Transcript for Jared Kushner: Israel, Palestine, Hamas, Gaza, Iran, and the Middle East | Lex Fridman Podcast #399 - Lex Fridman</title>

所以，在ChatGPT中输入提示词如下：

你是一个Python编程专家，要完成一个批量下载网页的任务，具体步骤如下：

在本地电脑D盘新建一个文件夹：lexfridman-podtext;

打开网页https://lexfridman.com/podcast/，解析网页源代码；

提取网页源代码中所有<a>标签的内容；

如果<a>标签的内容为“Transcript”，就提取<a>标签的href属性值，作为网页下载地址；

然后解析这个网页源代码，提取其<title>标签内容，作为网页的文件标题名；

下载这个网页，保存到文件夹lexfridman-podtext；

注意：<title>标签内容提取后，要将其中的“: ”、“ | ”等Windows系统中不允许作为文件名的特殊符号替换为符号”-”;

每一步都要输出相关信息；

ChatGPT生成源代码如下：

import os

import requests

from bs4 import BeautifulSoup

# 函数用于清理文件名中的非法字符

def clean_filename(filename):

return filename.replace(':', '-').replace('|', '-')

# 在D盘新建文件夹

folder_path = "D:\\lexfridman-podtext"

if not os.path.exists(folder_path):

os.makedirs(folder_path)

print(f"新建文件夹：{folder_path}")

# 打开并解析网页

source_url = "https://lexfridman.com/podcast/"

print(f"打开并解析网页：{source_url}")

response = requests.get(source_url)

soup = BeautifulSoup(response.text, 'html.parser')

# 提取所有<a>标签

a_tags = soup.find_all('a')

# 遍历<a>标签，找出内容为“Transcript”的

for a_tag in a_tags:

if a_tag.text == "Transcript":

# 提取<a>标签的href属性值

href = a_tag['href']

print(f"找到Transcript网页地址：{href}")

# 打开并解析Transcript网页

transcript_response = requests.get(href)

transcript_soup = BeautifulSoup(transcript_response.text, 'html.parser')

# 提取<title>标签内容

title_tag = transcript_soup.find('title')

if title_tag:

title = clean_filename(title_tag.text)

print(f"找到Transcript网页标题：{title}")

# 下载网页并保存到文件

filename = os.path.join(folder_path, title + ".html")

with open(filename, 'w', encoding='utf-8') as f:

f.write(transcript_response.text)

print(f"网页已保存到：{filename}")

运行程序，全部网页下载保存到本地：

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2023-11-30，如有侵权请联系 cloudcommunity@tencent.com 删除

编程

本文分享自 Dance with GenAI 微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

登录后参与评论

0 条评论

热度

零代码编程：用ChatGPT批量下载某个网页中的特定网页链接

零代码编程：用ChatGPT批量下载某个网页中的特定网页链接

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐