Table scraping is the process of extracting structured tabular data from web pages or documents. In Python it can be done in several ways, depending mainly on where the table comes from and what format it is in.
Method 1: requests + BeautifulSoup
Advantages: fine-grained control over how the HTML is parsed; works for static pages and irregular table markup.
Example code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://example.com/table-page'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Locate the first <table> element on the page
table = soup.find('table')
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all(['th', 'td'])
    data.append([col.text.strip() for col in cols])
df = pd.DataFrame(data)
print(df)
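If the first row of the table holds header cells, it can be promoted to the DataFrame's column names instead of being stored as data. The snippet below is a minimal sketch that assumes row data shaped like the `data` list built above; the sample values are made up for illustration.

import pandas as pd

# Hypothetical row data in the shape produced by the loop above:
# the first entry holds the header cells, the rest hold the body cells.
data = [['Name', 'Score'], ['Alice', '90'], ['Bob', '85']]

# Promote the first row to column names instead of the default 0..N labels
df = pd.DataFrame(data[1:], columns=data[0])

# Save the cleaned table for later analysis
df.to_csv('table.csv', index=False, encoding='utf-8')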
Method 2: pandas.read_html
Advantages: parses every <table> on a page straight into DataFrames with a single call, so very little code is needed.
Example code:
import pandas as pd
# Read all tables from a web page (requires lxml or html5lib to be installed)
tables = pd.read_html('https://example.com/table-page')
df = tables[0]  # assume the first table is the one we need
print(df)

# Read tables from a local HTML file
tables = pd.read_html('table.html')
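When a page contains several tables, read_html can filter them by text content or by HTML attributes instead of indexing into the returned list. The sketch below shows the idea; the URL, the match text, and the id value are placeholders.

import pandas as pd

# Keep only tables whose text matches `match` or whose HTML attributes
# match `attrs`, and treat the first row as the header row.
tables = pd.read_html(
    'https://example.com/table-page',  # placeholder URL
    match='Price',                     # hypothetical text that appears in the target table
    attrs={'id': 'data-table'},        # hypothetical id attribute of the target <table>
    header=0,
)
df = tables[0]
print(df)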
Method 3: Selenium (for dynamically rendered tables)
Advantages: drives a real browser, so it can scrape tables that only appear after JavaScript has run.
Example code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
service = Service('chromedriver_path')  # replace 'chromedriver_path' with the path to your chromedriver executable
driver = webdriver.Chrome(service=service)
driver.get('https://example.com/dynamic-table')
# Wait for the table to load (an explicit wait may be needed; see the sketch below)
table = driver.find_element(By.TAG_NAME, 'table')
rows = table.find_elements(By.TAG_NAME, 'tr')
data = []
for row in rows:
    cols = row.find_elements(By.TAG_NAME, 'td')  # note: header cells (<th>) are not collected here
    data.append([col.text for col in cols])
df = pd.DataFrame(data)
print(df)
driver.quit()
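The comment above mentions an explicit wait; without one, find_element may run before the table has been rendered. Below is a minimal sketch using Selenium's WebDriverWait. The URL and the 10-second timeout are assumptions, and it assumes a recent Selenium release that can locate chromedriver on its own.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/dynamic-table')  # placeholder URL

# Block for up to 10 seconds until at least one <table> is present in the DOM
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
print(table.text)
driver.quit()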
Cause:
Solution:

Cause:
Solution:
Cause: the table data is spread across several pages rather than shown on a single one.
Solution: request each page in a loop and concatenate the results, for example:
import pandas as pd

base_url = 'https://example.com/table?page={}'
all_data = []
for page in range(1, 6):  # assume we scrape the first 5 pages
    url = base_url.format(page)
    tables = pd.read_html(url)
    all_data.append(tables[0])
final_df = pd.concat(all_data)
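When looping over many pages it is sensible to pause between requests and to skip pages where no table is found. A sketch under the same assumed URL pattern as above:

import time
import pandas as pd

base_url = 'https://example.com/table?page={}'  # assumed URL pattern from above
all_data = []
for page in range(1, 6):
    try:
        tables = pd.read_html(base_url.format(page))
        all_data.append(tables[0])
    except ValueError:
        # read_html raises ValueError when it finds no table on the page
        print(f'No table found on page {page}, skipping')
    time.sleep(1)  # pause briefly between requests to avoid hammering the server

if all_data:
    final_df = pd.concat(all_data, ignore_index=True)
    print(final_df)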
# Extract the hyperlinks that appear inside the table rows
# (reuses the `soup` object from the requests + BeautifulSoup example above)
from urllib.parse import urljoin

base_url = 'https://example.com'
links = []
for row in soup.find_all('tr'):
    link = row.find('a')
    if link:
        full_url = urljoin(base_url, link['href'])  # resolve relative URLs against the site root
        links.append(full_url)
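To keep each link next to the row it came from, the cell text and the resolved URL can be collected together. A minimal sketch, assuming the same `soup` and `base_url` objects as above; the `df_links` name is only illustrative.

import pandas as pd
from urllib.parse import urljoin

# Collect each row's cell text together with its resolved link, if any
records = []
for row in soup.find_all('tr'):
    cells = [c.get_text(strip=True) for c in row.find_all(['th', 'td'])]
    link = row.find('a')
    href = urljoin(base_url, link['href']) if link else None
    records.append(cells + [href])

df_links = pd.DataFrame(records)  # the last column holds the resolved link, or None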
# Download the images referenced inside the table (reuses `soup` from above);
# this simple version assumes absolute src URLs, see the sketch below for a more robust variant
import requests
from io import BytesIO
from PIL import Image

img_urls = [img['src'] for img in soup.find_all('img')]
for i, url in enumerate(img_urls):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img.save(f'image_{i}.jpg')
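In practice img src attributes are often relative, and a failed download would crash the loop, so it helps to resolve the URLs against the page URL and to check each response. A sketch, assuming the same `soup` object as above and a placeholder page URL:

import requests
from io import BytesIO
from urllib.parse import urljoin
from PIL import Image

page_url = 'https://example.com/table-page'  # assumed: the URL the HTML was fetched from
for i, img in enumerate(soup.find_all('img')):
    img_url = urljoin(page_url, img['src'])   # resolve relative src paths
    response = requests.get(img_url, timeout=10)
    if response.ok:
        image = Image.open(BytesIO(response.content))
        image.convert('RGB').save(f'image_{i}.jpg')  # convert so any image mode can be saved as JPEG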
With the methods and techniques above, you can efficiently scrape many kinds of tabular data with Python and process or analyze it further as needed.