AI网络爬虫：deepseek爬取百度新闻资讯的搜索结果

AIGC部落

发布于 2024-06-28 13:21:53

600

发布于 2024-06-28 13:21:53

文章被收录于专栏：Dance with GenAIDance with GenAI

打开百度的搜索页面：

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D

查看翻页规律：

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn=40

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn=30

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn=0

这三个URL都是百度搜索的链接，用于搜索特定关键词的新闻结果。它们之间的规律主要体现在URL参数中的`pn`（页面编号）参数上。

- 第一个URL的`pn`参数值为40，表示请求的是第40页的新闻结果。

- 第二个URL的`pn`参数值为30，表示请求的是第30页的新闻结果。

- 第三个URL的`pn`参数值为0，表示请求的是第1页的新闻结果。

其他参数在这三个URL中都是相同的，包括：

- `rtt`：搜索结果排序方式，这里设置为1，可能是指按时间排序。

- `bsst`：是否强制搜索，这里设置为1，表示强制搜索。

- `cl`：搜索类型，这里设置为2，表示搜索新闻。

- `tn`：搜索服务类型，这里设置为news，表示搜索新闻。

- `ie`：输入编码，这里设置为utf-8。

- `word`：搜索关键词，这里设置为“腾讯云智能语音金融”。

- `x_bfe_rqs`：可能是百度内部使用的请求参数。

- `x_bfe_tjscore`：可能是百度内部使用的参数，用于计算或记录搜索结果的相关性分数。

- `tngroupname`：可能是百度内部使用的参数，用于指定搜索结果的组名。

- `newVideo`：可能是百度内部使用的参数，用于控制视频新闻的显示。

- `goods_entry_switch`：可能是百度内部使用的参数，用于控制商品入口的开关。

- `rsv_dl`：可能是百度内部使用的参数，用于指定搜索结果的显示方式。

这些URL的规律在于它们都是请求相同关键词的新闻搜索结果，但是请求的页面不同，因此`pn`参数的值不同。每页显示的新闻数量可能由其他参数控制，但在这个例子中没有明确指出。通常，每页显示的新闻数量是固定的，比如每页20条或10条新闻。

Deepseek中输入提示词：

你是一个Python编程专家，要完成一个百度搜索页面爬取的Python脚本，具体任务如下：

解析网页：

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn={pagenumber}

{pagenumber}的值从0开始，以10递增，到40结束；

定位其中所有class="result-op c-container xpath-log new-pmd"的div标签，

定位div标签中class="news-title-font_1xS-F"的a标签，提取其href属性值作为网页下载URL，提取其aria-label属性值，作为网页文件名；

定位div标签中class="c-font-normal c-color-text"的span标签，作为网页的内容摘要；

网页下载URL、网页文件名、网页的内容摘要都写入到Excel文件中，Excel文件保存到文件夹：F:\aivideo\finance

下载网页，保存网页到文件夹：F:\aivideo\finance

注意：

每一步都要输出信息到屏幕上

每爬取一个网址后，随机暂停3-5秒；

每下载一个网页后，随机暂停3-6秒；

设置请求标头：

Accept:

text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7

Accept-Encoding:

gzip, deflate, br, zstd

Accept-Language:

zh-CN,zh;q=0.9,en;q=0.8

Connection:

keep-alive

Host:

http://www.baidu.com

Referer:

https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn=30

Sec-Ch-Ua:

"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"

Sec-Ch-Ua-Mobile:

?0

Sec-Ch-Ua-Platform:

"Windows"

Sec-Fetch-Dest:

document

Sec-Fetch-Mode:

navigate

Sec-Fetch-Site:

same-origin

Sec-Fetch-User:

?1

Upgrade-Insecure-Requests:

1

User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36

Deepseek生成的源代码：

import requests

from bs4 import BeautifulSoup

import pandas as pd

import time

import random

import os

# 设置请求头

headers = {

"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",

"Accept-Encoding": "gzip, deflate, br, zstd",

"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",

"Connection": "keep-alive",

"Host": "http://www.baidu.com",

"Referer": "https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn=30",

"Sec-Ch-Ua": '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',

"Sec-Ch-Ua-Mobile": "?0",

"Sec-Ch-Ua-Platform": '"Windows"',

"Sec-Fetch-Dest": "document",

"Sec-Fetch-Mode": "navigate",

"Sec-Fetch-Site": "same-origin",

"Sec-Fetch-User": "?1",

"Upgrade-Insecure-Requests": "1",

"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"

}

# 创建文件夹

output_folder = 'F:\\aivideo\\finance'

if not os.path.exists(output_folder):

os.makedirs(output_folder)

# 初始化DataFrame

df = pd.DataFrame(columns=["URL", "File Name", "Content Summary"])

# 循环请求

for pagenumber in range(0, 41, 10):

url = f"https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&ie=utf-8&word=%E8%85%BE%E8%AE%AF%E4%BA%91%E6%99%BA%E8%83%BD%E8%AF%AD%E9%9F%B3+++%E9%87%91%E8%9E%8D&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&rsv_dl=news_b_pn&pn={pagenumber}"

response = requests.get(url, headers=headers)

if response.status_code == 200:

soup = BeautifulSoup(response.text, 'html.parser')

results = soup.find_all('div', class_='result-op c-container xpath-log new-pmd')

for result in results:

a_tag = result.find('a', class_='news-title-font_1xS-F')

if a_tag:

href = a_tag.get('href')

file_name = a_tag.get('aria-label')

span_tag = result.find('span', class_='c-font-normal c-color-text')

content_summary = span_tag.text if span_tag else ''

df = pd.concat([df, pd.DataFrame([{"URL": href, "File Name": file_name, "Content Summary": content_summary}])], ignore_index=True)