In today's data-driven world, Python web scraping has become an important way to collect data from the web. This article starts with the basics of Python scraping and works up to practical applications in several domains, helping readers build a complete scraping system.
Make sure Python is installed on your machine; Python 3.6 or later is recommended. Install the required libraries (Selenium additionally needs a matching browser driver such as ChromeDriver):
pip install requests beautifulsoup4 lxml selenium
A scraper sends HTTP requests to fetch web pages and then parses the responses to extract useful data.
Send an HTTP request with the requests library:
import requests

def get_page(url):
    response = requests.get(url)
    return response.text

page = get_page('http://example.com')
print(page)
Parse the HTML with BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
print(soup.title.string)  # Print the page title
Use requests.Session to manage cookies:
session = requests.Session()
# Log in with a POST request; the session keeps the returned cookies for later requests
response = session.post('http://example.com/login', data={'username': 'user', 'password': 'pass'})
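Once the session holds the login cookies, later requests made through the same session carry them automatically. A minimal sketch, assuming a hypothetical /profile page that requires the login above:

# Reuse the same session; cookies from the login response are attached automatically
profile = session.get('http://example.com/profile')
print(profile.status_code)
print(session.cookies.get_dict())  # inspect the cookies the session is holding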
For content generated by JavaScript, use Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com')
element = driver.find_element(By.ID, 'dynamic-content')
print(element.text)
driver.quit()
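Dynamic content often appears only after the page's scripts have run, so it is usually safer to wait for the element explicitly instead of reading it immediately. A minimal sketch using Selenium's WebDriverWait, with the same assumed 'dynamic-content' id as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://example.com')
# Wait up to 10 seconds for the element to be present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'dynamic-content'))
)
print(element.text)
driver.quit()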
Handle exceptions that may occur while requesting and parsing:
try:
    response = requests.get('http://example.com')
    response.raise_for_status()  # Raise an exception if the request failed
    soup = BeautifulSoup(response.text, 'html.parser')
except requests.exceptions.RequestException as e:
    print(e)
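In practice it also helps to set a timeout and retry transient failures, otherwise a hung connection can stall the whole scraper. A minimal sketch, assuming three attempts with a 5-second timeout are acceptable for the target site:

import time
import requests

def get_with_retries(url, attempts=3, timeout=5):
    # Try the request a few times before giving up
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f'Attempt {attempt} failed: {e}')
            time.sleep(1)  # brief pause before retrying
    return None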
Suppose we want to scrape a page that lists book information:
def scrape_books(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('div', class_='book')
    for book in books:
        title = book.find('h3').text
        author = book.find('span', class_='author').text
        print(f'Title: {title}, Author: {author}')

scrape_books('http://books.example.com')
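Listing pages are often split across several pages. A minimal sketch of following a "next" link until there is none, assuming the site exposes it as an <a> tag with class 'next' (a hypothetical selector):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    url = start_url
    while url:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for book in soup.find_all('div', class_='book'):
            print(book.find('h3').text)
        # Follow the assumed "next page" link, if present
        next_link = soup.find('a', class_='next')
        url = urljoin(url, next_link['href']) if next_link else None

scrape_all_pages('http://books.example.com')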
Use Selenium to scrape a page that requires user interaction:
def scrape_dynamic_data(url):
    driver = webdriver.Chrome()
    driver.get(url)
    # Suppose a button has to be clicked to load the data
    button = driver.find_element(By.ID, 'load-data-button')
    button.click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = soup.find('div', id='data-container').text
    driver.quit()
    return data

data = scrape_dynamic_data('http://dynamic.example.com')
print(data)
Save the scraped data to a file:
def save_data(data, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(data)

save_data(page, 'scraped_data.html')
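For structured records such as the book titles and authors above, a CSV file is usually more convenient than raw HTML. A minimal sketch using Python's csv module; the field names are illustrative:

import csv

def save_books_csv(books, filename):
    # books is expected to be a list of dicts like {'title': ..., 'author': ...}
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['title', 'author'])
        writer.writeheader()
        writer.writerows(books)

save_books_csv([{'title': 'Example Book', 'author': 'Jane Doe'}], 'books.csv')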
Moving on to practical applications in several domains: first, fetching static HTML with requests:

import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    response = requests.get(url)
    return response.text

url = 'http://example.com'
html_content = fetch_html(url)
print(html_content)
Collecting social media data through the Twitter API with tweepy:

import tweepy
import json

# Twitter API credentials
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Fetch the authenticated user's home timeline
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(json.dumps(tweet._json, indent=4))
Rendering JavaScript-heavy pages with a real browser driven by Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the ChromeDriver binary
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
# Visit the page
driver.get('http://example.com')
# Implicit wait applies to element lookups made after this point
driver.implicitly_wait(10)
# Grab the rendered page source
html_content = driver.page_source
# Close the browser
driver.quit()
print(html_content)
Crawling a product listing at scale with Scrapy:

import scrapy
from scrapy.crawler import CrawlerProcess

class ProductSpider(scrapy.Spider):
    name = 'product_spider'
    start_urls = ['http://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('p.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

# Run the spider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; Scrapy/1.2; +http://example.com)'
})
process.crawl(ProductSpider)
process.start()
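To keep the scraped items instead of only yielding them, the settings passed to CrawlerProcess above can be extended with a feed export. A minimal sketch using Scrapy's FEEDS setting (available in Scrapy 2.1 and later); the output path is illustrative:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; Scrapy/1.2; +http://example.com)',
    # Write all yielded items to a JSON file
    'FEEDS': {
        'products.json': {'format': 'json', 'encoding': 'utf8'},
    },
})
process.crawl(ProductSpider)
process.start()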
Dealing with anti-scraping measures by randomizing the User-Agent and routing requests through proxies:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'https://10.10.1.10:1080',
}

def fetch_html_with_proxies(url):
    response = requests.get(url, headers=headers, proxies=proxies)
    return response.text

html_content = fetch_html_with_proxies('http://example.com')
print(html_content)
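If a single proxy gets blocked, rotating through a small pool spreads the requests out. A minimal sketch; the proxy addresses are placeholders, not working proxies:

import random
import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Hypothetical proxy pool; replace with proxies you actually control
proxy_pool = [
    {'http': 'http://10.10.1.10:3128', 'https': 'https://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'https://10.10.1.11:1080'},
]

def fetch_with_random_proxy(url):
    # Pick a proxy and a fresh User-Agent for each request
    proxy = random.choice(proxy_pool)
    headers = {'User-Agent': ua.random}
    response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
    return response.text

print(fetch_with_random_proxy('http://example.com'))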