Many newcomers to data analysis hit the same roadblocks when working with popular platforms such as Xiaohongshu: pages are rendered dynamically, the DOM structure changes often, and repeated requests quickly get throttled or blocked.
This article works through a real business scenario, market hot-topic tracking, and shows how to combine a browser automation tool (Selenium/Playwright) with a generative approach for inferring page structure to extract useful data. We will also bring in a proxy service to keep the access process stable.
If you plan to try this yourself, set up the following first.
You can choose either Selenium or Playwright; install one of them:
pip install selenium
or:
pip install playwright
playwright install
Taking a common crawler proxy service as an example, the configuration looks like this:
# Configure the crawler proxy (based on the 亿牛云 example)
proxy_host = "proxy.16yun.cn"
proxy_port = "3100"
proxy_user = "16YUN"
proxy_pass = "16IP"
proxy = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
Requests will then be routed through the proxy relay, which improves stability.
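Before wiring the proxy into a browser, it helps to confirm the credentials actually work. A minimal sketch using the requests library; it reuses the proxy string defined above, and httpbin.org/ip is just one convenient echo endpoint (my choice, not from the original):

import requests

# Quick sanity check: route one request through the proxy and print the exit IP.
# Assumes the proxy_* variables and the proxy string above are already defined.
proxies = {"http": proxy, "https": proxy}
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.json())  # should show the proxy's IP, not your own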
Selenium approach
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Route traffic through the proxy. Note that Chrome ignores a user:pass pair
# embedded in --proxy-server, so an authenticated proxy needs an extra step
# (see the selenium-wire sketch below).
chrome_options.add_argument(f'--proxy-server={proxy}')
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.xiaohongshu.com/explore")
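Because plain Chrome cannot consume the embedded username/password, one common workaround (an addition on my part, not something the original article uses) is the third-party selenium-wire package, which handles proxy authentication for you:

# pip install selenium-wire
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
        "https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
    }
}
driver = webdriver.Chrome(seleniumwire_options=seleniumwire_options)
driver.get("https://www.xiaohongshu.com/explore")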
Playwright approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Pass credentials as separate fields, which is Playwright's documented
    # proxy format and more reliable than embedding them in the server URL.
    browser = p.chromium.launch(
        proxy={"server": f"http://{proxy_host}:{proxy_port}",
               "username": proxy_user, "password": proxy_pass},
        headless=True)
    page = browser.new_page()
    page.goto("https://www.xiaohongshu.com/explore")
Rather than memorizing fixed XPaths, let the program "guess" likely paths from context. Here is an example function, a placeholder that stands in for a real generative model:
def ai_like_guess(html_snippet, target="title"):
    # Placeholder for a generative step: a real system would analyze
    # html_snippet to infer the XPath; here we return fixed heuristics.
    if target == "title":
        return '//div[contains(@class,"note-title")]/text()'
    elif target == "likes":
        return '//span[contains(@class,"like")]/text()'
    elif target == "comments":
        return '//span[contains(@class,"comment")]/text()'
    return None
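Guessed XPaths can easily be wrong, so it pays to verify them against the live DOM before trusting the output. A minimal guess-and-verify sketch; the candidate list and the pick_working_xpath helper are hypothetical additions, not part of the original article:

from lxml import etree

def pick_working_xpath(dom, candidates):
    # Try each candidate XPath in order and return the first one that
    # actually matches something in the parsed DOM.
    for xp in candidates:
        if dom.xpath(xp):
            return xp
    return None

# Example: several plausible guesses for the note title.
title_candidates = [
    '//div[contains(@class,"note-title")]/text()',
    '//a[contains(@class,"title")]/text()',
    '//h3/text()',
]
# dom = etree.HTML(html)  # parsed page HTML from the browser step
# best = pick_working_xpath(dom, title_candidates)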
# -*- coding: utf-8 -*-
from playwright.sync_api import sync_playwright
from lxml import etree

# Configure the crawler proxy (based on the 亿牛云 example)
proxy_host = "proxy.16yun.cn"
proxy_port = "3100"
proxy_user = "16YUN"
proxy_pass = "16IP"

def ai_like_guess(html_snippet, target="title"):
    # Placeholder for a generative model: returns a heuristic XPath per field.
    if target == "title":
        return '//div[contains(@class,"note-title")]/text()'
    elif target == "likes":
        return '//span[contains(@class,"like")]/text()'
    elif target == "comments":
        return '//span[contains(@class,"comment")]/text()'
    return None

def scrape_hot_notes(url="https://www.xiaohongshu.com/explore"):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            proxy={"server": f"http://{proxy_host}:{proxy_port}",
                   "username": proxy_user, "password": proxy_pass},
            headless=True)
        page = browser.new_page()
        page.goto(url, timeout=30000)
        html = page.content()
        browser.close()

    dom = etree.HTML(html)
    titles = dom.xpath(ai_like_guess(html, "title"))
    likes = dom.xpath(ai_like_guess(html, "likes"))
    comments = dom.xpath(ai_like_guess(html, "comments"))

    # Align the three lists; pages with missing fields simply yield fewer rows.
    results = []
    for i in range(min(len(titles), len(likes), len(comments))):
        results.append({
            "title": titles[i].strip(),
            "likes": likes[i].strip(),
            "comments": comments[i].strip(),
        })
    return results

if __name__ == "__main__":
    data = scrape_hot_notes()
    for d in data[:5]:
        print(d)
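To actually feed the output into hot-topic analysis, you will usually want to persist it. A small sketch using the standard csv module; the hot_notes.csv filename is arbitrary:

import csv

def save_notes(rows, path="hot_notes.csv"):
    # Write the scraped dicts to a CSV for downstream trend analysis.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "likes", "comments"])
        writer.writeheader()
        writer.writerows(rows)

# save_notes(scrape_hot_notes())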
Many people make a few typical mistakes at first: scraping only the first screen of results, visiting without any login cookies, and blindly trusting whatever rules the generator produces.
The more robust approach, sketched below, is to simulate scrolling to load more entries, prepare cookies in advance, and manually verify that the generated rules actually hold up.
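For the first two points, a minimal Playwright sketch; the cookie name and value are placeholders you would export from a logged-in browser session:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    # Inject pre-prepared login cookies (placeholder values).
    context.add_cookies([{
        "name": "session_cookie_name", "value": "YOUR_SESSION_VALUE",
        "domain": ".xiaohongshu.com", "path": "/",
    }])
    page = context.new_page()
    page.goto("https://www.xiaohongshu.com/explore")
    # Scroll a few times so lazily loaded entries get rendered.
    for _ in range(5):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # give new items time to load
    html = page.content()
    browser.close()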
To recap the overall approach: use a browser driver to fetch the dynamically rendered page → infer the structure with a generative step → add a proxy to raise the success rate.
Its biggest value is that you can quickly pull trending content off a social platform and run basic hot-topic market analysis.
If you extend the idea, the same method applies to other platforms, such as Weibo trending searches or short-video comment sections.
Original-work statement: This article is published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
For infringement concerns, contact cloudcommunity@tencent.com for removal.