
In today's era of big data, an efficient web crawler is a key tool for data collection. Traditional synchronous crawlers (for example, ones built on the requests library) are limited by blocking I/O and struggle to reach high request concurrency. Python's aiohttp library, combined with asyncio, makes it straightforward to build an asynchronous, highly concurrent crawler capable of a thousand or more requests per second.
This article walks through building a high-performance crawler with aiohttp, covering the core aiohttp concepts, a basic asynchronous crawler, and the key optimizations: connection pooling, concurrency limits, timeouts, random User-Agent headers, and proxies.
Finally, we provide a complete code example and run a benchmark to show how to actually achieve a thousand page fetches per second.
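For contrast, here is a minimal sketch (my illustration, not part of the original article) of the blocking pattern described above: with requests, each download must finish before the next one starts.

```python
# Synchronous baseline for comparison (illustrative sketch, not from the original article).
# Each requests.get() blocks until the response arrives, so 1000 URLs take roughly
# 1000 x (per-request latency) in the worst case.
import requests

def crawl_sync(urls):
    pages = []
    for url in urls:
        resp = requests.get(url, timeout=10)  # blocks the whole program while waiting
        pages.append(resp.text)
    return pages

if __name__ == "__main__":
    urls = ["https://example.com"] * 10
    print(f"fetched {len(crawl_sync(urls))} pages sequentially")
```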
The crawler rests on three building blocks:

- ClientSession: manages the HTTP connection pool and reuses TCP connections, reducing handshake overhead.
- async/await syntax: Python 3.5+'s asynchronous programming style, which keeps the code concise.
- asyncio.gather(): runs multiple coroutine tasks concurrently.

A basic asynchronous crawler using these pieces:

```python
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def parse(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.title.string
        print(f"URL: {url} | Title: {title}")

async def main(urls):
    tasks = [parse(url) for url in urls]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = [
        "https://example.com",
        "https://python.org",
        "https://aiohttp.readthedocs.io",
    ]
    asyncio.run(main(urls))
```
Code walkthrough:

- fetch() issues the HTTP request and returns the HTML.
- parse() parses the HTML and extracts the page title.
- main() runs all the tasks concurrently with asyncio.gather().

By default, aiohttp reuses TCP connections automatically, but the connector can also be tuned by hand:

```python
conn = aiohttp.TCPConnector(limit=100, force_close=False)  # at most 100 pooled connections
async with aiohttp.ClientSession(connector=conn) as session:
    # issue requests here ...
```
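Beyond limit and force_close, TCPConnector exposes a few more knobs worth knowing. A minimal sketch; the parameter values below are illustrative assumptions, not recommendations from the original article:

```python
import asyncio
import aiohttp

async def demo():
    # Illustrative connector tuning; values are assumptions, not from the article.
    conn = aiohttp.TCPConnector(
        limit=100,           # total simultaneous connections across all hosts
        limit_per_host=20,   # cap connections to any single host
        ttl_dns_cache=300,   # cache DNS lookups for 5 minutes
        force_close=False,   # keep connections alive for reuse (HTTP keep-alive)
    )
    async with aiohttp.ClientSession(connector=conn) as session:
        async with session.get("https://example.com") as response:
            print(response.status)

asyncio.run(demo())
```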
To avoid being blocked by the target site for firing too many requests at once, cap the number of in-flight requests:

```python
semaphore = asyncio.Semaphore(100)  # limit concurrency to 100

async def fetch(session, url):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
```
To keep a few slow requests from stalling the entire crawler, set a timeout:

```python
timeout = aiohttp.ClientTimeout(total=10)  # 10-second total timeout
async with session.get(url, timeout=timeout) as response:
    # handle the response ...
```
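When a request does exceed the timeout, aiohttp raises asyncio.TimeoutError, and network problems surface as aiohttp.ClientError. A sketch of per-request error handling (my addition, not from the original article) so one bad URL does not crash the whole batch:

```python
import asyncio
import aiohttp

async def fetch_safe(session, url):
    # Sketch: swallow per-request failures and return None instead of raising.
    timeout = aiohttp.ClientTimeout(total=10)
    try:
        async with session.get(url, timeout=timeout) as response:
            return await response.text()
    except (asyncio.TimeoutError, aiohttp.ClientError) as exc:
        print(f"request failed for {url}: {exc!r}")
        return None
```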
Rotating the User-Agent header makes the traffic look less uniform; the fake_useragent library supplies random browser UA strings:

```python
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}

async def fetch(session, url):
    async with session.get(url, headers=headers) as response:
        return await response.text()
```
Putting the pieces together, here is a complete crawler with an authenticated proxy, a random User-Agent per request, a concurrency limit, and timeouts:

```python
import aiohttp
import asyncio
from fake_useragent import UserAgent

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Build the proxy URL with basic authentication
proxy_auth = aiohttp.BasicAuth(proxyUser, proxyPass)
proxy_url = f"http://{proxyHost}:{proxyPort}"

ua = UserAgent()
semaphore = asyncio.Semaphore(100)  # limit concurrency

async def fetch(session, url):
    headers = {"User-Agent": ua.random}
    timeout = aiohttp.ClientTimeout(total=10)
    async with semaphore:
        async with session.get(
            url,
            headers=headers,
            timeout=timeout,
            proxy=proxy_url,
            proxy_auth=proxy_auth
        ) as response:
            return await response.text()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["https://example.com"] * 1000
    asyncio.run(main(urls))
```
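Note that main() above discards whatever asyncio.gather() returns, and by default the first failed request propagates its exception out of main(). A small variant (my sketch, not from the original article) that keeps the pages and tolerates individual failures:

```python
# Variant of main() from the example above; assumes the same fetch() and imports.
async def main(urls):
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch(session, url) for url in urls]
        # return_exceptions=True turns per-request errors into results instead of raising
        results = await asyncio.gather(*tasks, return_exceptions=True)
    pages = [r for r in results if not isinstance(r, Exception)]
    print(f"{len(pages)}/{len(urls)} requests succeeded")
    return pages
```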
To measure throughput, time a batch of 1,000 requests and compute the queries per second (QPS):

```python
import time
import asyncio

async def benchmark():
    urls = ["https://example.com"] * 1000  # benchmark with 1000 requests
    start = time.time()
    await main(urls)  # main() from the complete example above
    end = time.time()
    qps = len(urls) / (end - start)
    print(f"QPS: {qps:.2f}")

asyncio.run(benchmark())
```
For reference, here is the same crawler without the proxy configuration:

```python
import aiohttp
import asyncio
from fake_useragent import UserAgent

ua = UserAgent()
semaphore = asyncio.Semaphore(100)  # limit concurrency

async def fetch(session, url):
    headers = {"User-Agent": ua.random}
    timeout = aiohttp.ClientTimeout(total=10)
    async with semaphore:
        async with session.get(url, headers=headers, timeout=timeout) as response:
            return await response.text()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["https://example.com"] * 1000
    asyncio.run(main(urls))
```

With aiohttp and asyncio, we can easily build a highly concurrent asynchronous crawler that fetches more than a thousand pages per second. The key optimizations are:
✅ Use ClientSession to manage the connection pool
✅ Control concurrency with a Semaphore
✅ Use proxy IPs and random User-Agent headers to avoid bans
✅ Set timeouts so slow requests cannot hang the crawler
Original-work statement: this article was published on the Tencent Cloud developer community with the author's authorization and may not be reproduced without permission.
For infringement concerns, contact cloudcommunity@tencent.com for removal.