A previous post on this public account introduced the crawling strategies used in Internet big-data collection and the principles behind them. The traversal strategy is one of the core problems of a crawler; here we look at a concrete implementation in Python.
Breadth-first traversal (BFS) is naturally implemented with a queue: newly discovered URLs are appended to the tail of the queue, while the next URL to crawl is taken from the head, i.e. first in, first out.
Depth-first traversal (DFS) can be implemented recursively, returning to the upper level once a page has no more child links. It can also be implemented non-recursively with a stack, i.e. first in, last out.
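For illustration only (the tiny in-memory link graph and its "URLs" below are hypothetical and not part of the crawler itself), a recursive DFS can be sketched as:

# Toy link graph used purely for illustration; keys and values are made-up "URLs".
toy_links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

# Recursive depth-first traversal: follow child links first and return to the
# upper level when a page has no more unvisited children.
def dfs(url, visited):
    if url in visited:
        return
    visited.append(url)
    for child in toy_links.get(url, []):
        dfs(child, visited)

order = []
dfs("A", order)
print(order)   # ['A', 'B', 'D', 'C']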
As can be seen, DFS and BFS are highly symmetric, so in Python we do not need two separate data structures; a single one is enough, namely the UrlSequence class in this program.
The unvisited list self.unvisited maintained in UrlSequence supports both the dequeue operation and the stack pop: the former is done with pop(0), the latter simply with pop().
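This symmetry is easy to see with a plain Python list: popping from the head gives queue (FIFO) order for BFS, popping from the tail gives stack (LIFO) order for DFS. A minimal sketch with placeholder URLs:

urls = ["url1", "url2", "url3"]

# Queue behaviour (BFS): pop(0) removes from the head, first in, first out.
queue = list(urls)
while queue:
    print(queue.pop(0))   # prints url1, url2, url3

# Stack behaviour (DFS): pop() removes from the tail, first in, last out.
stack = list(urls)
while stack:
    print(stack.pop())    # prints url3, url2, url1

In the UrlSequence class below, the same list self.unvisited plays both roles; only the pop call differs.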
The UrlSequence class gathers all functions related to the URL sequences. It maintains the two lists visited and unvisited and implements adding, removing and counting URL elements. Its definition, with detailed comments, is given below, followed by a short usage sketch.
class UrlSequence:
    # Initialization
    def __init__(self):
        # the set of visited URLs
        self.visited = []
        # the set of unvisited URLs
        self.unvisited = []

    # get "visited" list
    def getVisitedUrl(self):
        return self.visited

    # get "unvisited" list
    def getUnvisitedUrl(self):
        return self.unvisited

    # Add to visited URLs: once a URL has been crawled, append it to the
    # "visited" list to mark it as already visited.
    def Visited_Add(self, url):
        self.visited.append(url)

    # Remove from visited URLs
    def Visited_Remove(self, url):
        self.visited.remove(url)

    # Dequeue from unvisited URLs. By default pop() returns the last element,
    # so pop(0) is used here to take the URL from the head of the queue.
    def Unvisited_Dequeue(self):
        try:
            return self.unvisited.pop(0)
        except IndexError:
            return None

    # Pop from unvisited URLs (stack behaviour), suitable for depth-first traversal
    def Unvisited_Pop(self):
        try:
            return self.unvisited.pop()
        except IndexError:
            return None

    # Add new URLs: newly parsed URLs are appended to the unvisited list,
    # but only if they appear in neither "visited" nor "unvisited",
    # so that no URL is crawled twice.
    def Unvisited_Add(self, url):
        if url != "" and url not in self.visited and url not in self.unvisited:
            self.unvisited.append(url)

    # The count of visited URLs
    def Visited_Count(self):
        return len(self.visited)

    # The count of unvisited URLs
    def Unvisited_Count(self):
        return len(self.unvisited)

    # Determine whether the "unvisited" queue is empty
    def UnvisitedIsEmpty(self):
        return len(self.unvisited) == 0
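A short usage sketch of UrlSequence (the URLs are placeholders chosen purely for illustration):

seq = UrlSequence()
seq.Unvisited_Add("http://site/a")
seq.Unvisited_Add("http://site/b")
seq.Unvisited_Add("http://site/a")    # duplicate, silently ignored
print(seq.Unvisited_Count())          # 2
url = seq.Unvisited_Dequeue()         # queue order: "http://site/a"
seq.Visited_Add(url)
url = seq.Unvisited_Pop()             # stack order: "http://site/b"
seq.Visited_Add(url)
print(seq.Visited_Count())            # 2
print(seq.UnvisitedIsEmpty())         # True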
Next is the Crawler class, which initializes the URL sequence and drives the crawling process, together with functions for fetching page source code and extracting links. It contains four functions: the initializer, the crawling function, the link-extraction function and the page-source function. Its definition, with detailed comments, is given below; the listing assumes Python 2 and BeautifulSoup 4 for the imports it needs.
# Required imports (Python 2; BeautifulSoup 4 assumed, as the original listing omits imports)
import re
import socket
import urllib2
from bs4 import BeautifulSoup

class Crawler:
    def __init__(self, base_url):
        # Use base_url to initialize the URL sequence
        self.UrlSequence = UrlSequence()
        # Add base_url to the "unvisited" list
        self.UrlSequence.Unvisited_Add(base_url)
        print "Add the base url \"%s\" to the unvisited url list" % str(self.UrlSequence.unvisited)

    def crawling(self, base_url, max_count, flag):
        # Loop condition: the "unvisited" list is not empty and the number of
        # visited URLs is not bigger than max_count
        while self.UrlSequence.UnvisitedIsEmpty() is False and self.UrlSequence.Visited_Count() <= max_count:
            # Dequeue or pop, according to the flag
            if flag == 1:  # using BFS
                visitUrl = self.UrlSequence.Unvisited_Dequeue()
            else:  # using DFS
                visitUrl = self.UrlSequence.Unvisited_Pop()
            print "Pop out one url \"%s\" from unvisited url list" % visitUrl
            if visitUrl in self.UrlSequence.visited or visitUrl is None or visitUrl == "":
                continue
            # Get the links
            links = self.getLinks(visitUrl)
            print "Get %d new links" % len(links)
            # Now "visitUrl" has been visited. Add it to the "visited" list
            self.UrlSequence.Visited_Add(visitUrl)
            print "Visited url count: " + str(self.UrlSequence.Visited_Count())
            # Add the new URLs to the "unvisited" list
            for link in links:
                self.UrlSequence.Unvisited_Add(link)
            print "%d unvisited links:" % len(self.UrlSequence.getUnvisitedUrl())

    # Get links from the source code
    def getLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            # Create a BeautifulSoup object
            soup = BeautifulSoup(data[1], "html.parser")
            # Find all <a> tags; ".*" is a regular expression that matches whatever
            # appears between the quotation marks of href="", i.e. the URLs
            a = soup.findAll("a", {"href": re.compile(".*")})
            for i in a:
                if i["href"].find("http://") != -1 or i["href"].find("https://") != -1:
                    links.append(i["href"])
        return links

    # Get the page source code
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib2.Request(url)
            # Add HTTP header
            req.add_header('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib2.urlopen(req)
            if coding is None:
                page = response.read()
            else:
                page = response.read()
                page = page.decode(coding).encode('utf-8')
            # HTTP status code 200 means the request was accepted by the server
            return ["200", page]
        except Exception, e:
            print str(e)
            return [str(e), None]
The main function, finally, is very simple; a usage sketch follows it.
def main(base_url, max_count, flag):
    craw = Crawler(base_url)
    craw.crawling(base_url, max_count, flag)
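A possible invocation, purely as a sketch (the start URL, the limit of 100 pages and the flag value 1 for BFS are placeholder choices, not taken from the original):

if __name__ == "__main__":
    # flag == 1 selects breadth-first crawling (dequeue);
    # any other value selects depth-first crawling (pop).
    main("http://www.example.com", 100, 1)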
The principles behind the various crawler techniques (crawler kernels, general crawlers, topic crawlers, Deep Web crawlers, dynamic web page techniques, anti-crawling measures, etc.) are covered in Chapter 2 of 《互联网大数据处理技术与应用》.
The author's book 《互联网大数据处理技术与应用》 (Tsinghua University Press, 2017) and the WeChat public account of the same name are devoted to spreading the science and engineering behind big data technology, and also provide readers with supplementary reading material. Follow the account to read the original articles pushed earlier.