For football fans, news editors, and data analysts, quickly and accurately grasping the core facts of a match matters most: the score, the key events (goals, penalties, red/yellow cards, substitutions, disputed calls), and player performances. Traditional rule-based crawlers are brittle against varied media layouts and frequent redesigns: XPath and regex selectors need constant maintenance, and dynamically loaded content is often missed entirely.
With generative AI (large language models, LLMs), we can hand the scraped match report or live-text transcript to a model, let it "read" the article, and have it emit structured output (JSON) alongside a concise natural-language summary, which greatly improves both the robustness and readability of the extraction. This tutorial takes a hands-on approach: we scrape match reports for the top five European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1) from three sites (ESPN, Hupu, Tencent Sports) and use AI to summarize scores, key events, and player performances.
pip install requests beautifulsoup4 lxml playwright
playwright install
(If you use an external LLM SDK, install its library as well; if you only scrape static pages, Playwright can be skipped.)
Domain: `proxy.16yun.cn` (example)
Port: `3100`
Username: `16YUN`, password: `16IP` (examples; replace with your own credentials)
ESPN Soccer: `https://www.espn.com/soccer/`
Hupu Soccer: `https://soccer.hupu.com/`
Tencent Sports soccer channel: `https://sports.qq.com/soccer/` (site structures change over time; adapt your selectors to the live pages)
Below is a minimal, runnable skeleton that fetches page text and calls a stub LLM interface to obtain structured output. Replace the stub with the LLM SDK or internal service you actually use, and swap the proxy credentials for production ones.
# file: football_ai_pipeline.py
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json, time, os

# --- Proxy settings (Yiniuyun/16yun example) ---
PROXY_HOST = os.getenv("PROXY_HOST", "proxy.16yun.cn")
PROXY_PORT = os.getenv("PROXY_PORT", "3100")
PROXY_USER = os.getenv("PROXY_USER", "16YUN")
PROXY_PASS = os.getenv("PROXY_PASS", "16IP")
proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {"http": proxy_url, "https": proxy_url}

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36")
}
def fetch_html(url, use_proxy=True, timeout=15):
    """Fetch raw HTML, optionally routing the request through the proxy."""
    kwargs = {"headers": HEADERS, "timeout": timeout}
    if use_proxy:
        kwargs["proxies"] = proxies
    resp = requests.get(url, **kwargs)
    resp.raise_for_status()
    return resp.text
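Transient network and proxy errors are common in practice, so it is worth wrapping `fetch_html` in a small retry helper with exponential backoff. A minimal sketch, assuming any exception is retryable (in real use you would narrow this to `requests.RequestException`); `with_retries` is a hypothetical helper, not part of requests:

```python
import time

def with_retries(func, attempts=3, base_delay=1.0):
    """Call func(); on failure, retry with exponential backoff (hypothetical helper)."""
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** i))  # back off: 1s, 2s, 4s, ...

# Usage: html = with_retries(lambda: fetch_html(url))
```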
def extract_text(html):
    """Strip boilerplate tags and return the page's visible text."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript", "header", "footer", "aside"]):
        tag.decompose()
    texts = [s.strip() for s in soup.stripped_strings]
    return "\n".join(texts)
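If lxml or BeautifulSoup is not available in your environment, the standard library's `html.parser` can serve as a rough fallback. A minimal sketch, not a drop-in replacement (`VisibleTextParser` and `extract_text_stdlib` are hypothetical names; it only skips script/style/noscript, not header/footer/aside):

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that appears outside of script/style/noscript tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text_stdlib(html):
    parser = VisibleTextParser()
    parser.feed(html)
    return "\n".join(parser.chunks)
```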
def call_llm_for_match(text_snippet):
    """
    Send a text snippet to the LLM and return a structured dict.
    This stub returns placeholder data; in production, replace it with a
    real API call and parse/validate the response.
    """
    # Example return structure (customize to your own schema)
    return {
        "league": "Premier League",
        "home_team": "Manchester United",
        "away_team": "Liverpool",
        "score": "2-1",
        "events": [
            {"minute": 14, "type": "goal", "team": "Manchester United", "player": "Rashford"},
            {"minute": 46, "type": "goal", "team": "Liverpool", "player": "Salah"},
            {"minute": 78, "type": "goal", "team": "Manchester United", "player": "Bruno Fernandes"}
        ],
        "player_summary": "Rashford was in excellent form, the keeper made several key saves, and the substitutes changed the game."
    }
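A long live-text transcript will not fit in one model context, so the text must be split into chunks, sent to the LLM one at a time, and the outputs merged afterwards. A minimal chunking sketch (`chunk_text` is a hypothetical helper; the overlap keeps events that straddle a chunk boundary visible to both chunks):

```python
def chunk_text(text, size=5000, overlap=500):
    """Split text into chunks of at most `size` chars, overlapping by `overlap` chars."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```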
def process_match_url(url):
    html = fetch_html(url)
    text = extract_text(html)
    # If the text is too long, split it into chunks and merge the LLM outputs
    snippet = text[:5000]  # example truncation; adjust to your model's context window
    result = call_llm_for_match(snippet)
    # Simple post-hoc validation (example)
    if "score" in result:
        pass  # add regex or numeric sanity checks here
    return result
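The post-hoc validation step can be made concrete with a small regex and a few sanity checks on the LLM's output. A sketch under the assumption that the dict follows the schema shown above (`validate_match` and `SCORE_RE` are hypothetical names):

```python
import re

SCORE_RE = re.compile(r"^\d{1,2}-\d{1,2}$")  # e.g. "2-1"

def validate_match(result):
    """Return a list of problems found in an LLM-produced match dict."""
    problems = []
    for key in ("league", "home_team", "away_team", "score", "events"):
        if key not in result:
            problems.append(f"missing key: {key}")
    score = result.get("score", "")
    if not SCORE_RE.match(score):
        problems.append(f"bad score format: {score!r}")
    for ev in result.get("events", []):
        minute = ev.get("minute")
        if not isinstance(minute, int) or not (0 <= minute <= 130):
            problems.append(f"implausible minute: {minute!r}")
    return problems
```

An empty list means the record passed; otherwise you can re-prompt the model or flag the match for manual review.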
if __name__ == "__main__":
    urls = [
        "https://www.espn.com/soccer/report?gameId=xxxx",  # replace with real match-report URLs
        "https://soccer.hupu.com/games/xxxx",
        "https://sports.qq.com/a/xxxx.htm"
    ]
    results = []
    for u in urls:
        try:
            r = process_match_url(u)
            results.append(r)
        except Exception as e:
            print("fetch/parse failed:", u, e)
        time.sleep(1.5)  # be polite: throttle requests between sites
    with open("matches_aggregated.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
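When one match is processed as several overlapping chunks, the per-chunk event lists need merging, since the same goal may appear in two chunks. A minimal sketch that deduplicates events by (minute, type, player) and sorts by minute (`merge_events` is a hypothetical helper):

```python
def merge_events(event_lists):
    """Merge per-chunk event lists, dropping duplicates from overlapping chunks."""
    seen, merged = set(), []
    for events in event_lists:
        for ev in events:
            key = (ev.get("minute"), ev.get("type"), ev.get("player"))
            if key not in seen:
                seen.add(key)
                merged.append(ev)
    return sorted(merged, key=lambda ev: ev.get("minute", 0))
```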
Originality statement: this article was published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
For infringement concerns, contact cloudcommunity@tencent.com for removal.