Web scraping is the process of extracting data from web pages. It typically involves parsing HTML or XML documents, pulling out the desired information, and storing it in a structured format such as CSV or JSON. URL scraping is the process of extracting links from a page; those links may point to other pages or resources.
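As a small sketch of the "store it in a structured format" step mentioned above, the snippet below writes a list of extracted-link records to a JSON file. The records and the filename `links.json` are hypothetical placeholders, not part of the original article:

```python
import json

# Hypothetical scraped results: one dict per extracted link
records = [
    {"url": "https://example.com/a", "text": "Page A"},
    {"url": "https://example.com/b", "text": "Page B"},
]

# Persist the structured data as JSON (CSV via the csv module works the same way)
with open("links.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```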
The usual approach is to fetch the page content with an HTTP library (such as requests) and then parse it with an HTML parser (such as BeautifulSoup or lxml). Below is a simple Python example showing how to extract URLs from a page and crawl them recursively:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_urls(url):
    """Fetch a page and return the absolute URLs of all links on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Resolve relative hrefs (e.g. '/about') against the page URL
    return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]

def crawl(start_url, depth=2):
    """Breadth-first crawl starting at start_url, up to the given depth."""
    visited = set()
    to_visit = [(start_url, 0)]
    while to_visit:
        url, current_depth = to_visit.pop(0)
        if url in visited or current_depth > depth:
            continue
        visited.add(url)
        print(f'Crawling: {url}')
        try:
            new_urls = get_urls(url)
        except requests.RequestException:
            continue  # skip pages that fail to load
        for new_url in new_urls:
            if new_url not in visited:
                to_visit.append((new_url, current_depth + 1))

# Example usage
start_url = 'https://example.com'
crawl(start_url)
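The article mentions lxml as an alternative parser. As a minimal sketch (using a local HTML string so it runs without network access), the same link extraction can be done with an XPath query; the sample markup here is made up for illustration:

```python
from lxml import html

# Hypothetical page content, parsed offline
page = '<html><body><a href="/a">A</a> <a href="https://example.com/b">B</a></body></html>'
tree = html.fromstring(page)

# XPath: the href attribute of every <a> element
urls = tree.xpath("//a/@href")
print(urls)  # ['/a', 'https://example.com/b']
```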
A set is used to store the URLs that have already been visited, so the same page is never fetched twice. With the steps and methods above, you can effectively extract URLs from web pages and crawl them further.
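The visited-set idea can be shown in isolation. This sketch normalizes relative hrefs against the page URL with `urljoin` and then uses a set to drop duplicates; the base URL and href list are hypothetical examples:

```python
from urllib.parse import urljoin

base = "https://example.com/docs/"   # hypothetical page URL
found = ["/a", "page2.html", "/a"]   # hypothetical hrefs extracted from that page

visited = set()
queue = []
for href in found:
    absolute = urljoin(base, href)   # '/a' -> 'https://example.com/a'
    if absolute not in visited:     # the set makes membership checks O(1)
        visited.add(absolute)
        queue.append(absolute)

print(queue)  # ['https://example.com/a', 'https://example.com/docs/page2.html']
```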