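I'm new to web scraping and to Python in general, and I'm stuck on how to fix my function. My task is to scrape a site for words that begin with a specific letter and return a list of the matching words, preferably using regular expressions. Thanks for your time; this is my code so far.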
import urllib
import re
def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line
webscraping("https://en.wikipedia.org/wiki/Web_scraping")
Answered 2016-12-03 14:02:32
Don't use regular expressions to parse HTML; you can use Beautiful Soup instead:
import urllib
from BeautifulSoup import *
todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)
while len(todo) > 0 :
    print "====== Todo list count is ",len(todo)
    url = todo.pop()
    if ( not url.startswith('http') ) :
        print "Skipping", url
        continue
    if ( url.find('facebook') > 0 ) :
        continue
    if ( url in visited ) :
        print "Visited", url
        continue
    print "===== Retrieving ", url
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    visited.append(url)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        newurl = tag.get('href', None)
        if ( newurl != None ) :
            todo.append(newurl)
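The crawler above collects links rather than words, so here is a minimal sketch of the task in the question itself (returning the words that begin with a given letter). It assumes Python 3 and uses only the standard library (html.parser in place of Beautiful Soup) so it runs without extra installs. The names TextExtractor and words_starting_with are made up for illustration. The HTML is handled by a parser, and the regular expression is applied only to the extracted text.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return every word in the page text that begins with `letter` (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps text from neighbouring tags from running together;
    # the trade-off is that a word split by inline tags is seen as two words.
    text = " ".join(parser.chunks)
    # \b anchors the letter to the start of a word; \w* takes the rest of the word.
    pattern = r"\b" + re.escape(letter) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


def webscraping(website, letter="h"):
    """Fetch a page and return the matching words instead of printing them."""
    with urlopen(website) as response:
        html = response.read().decode("utf-8", errors="replace")
    return words_starting_with(html, letter)
```

With this, webscraping("https://en.wikipedia.org/wiki/Web_scraping") would return a list rather than print lines. Some sites reject requests that lack a User-Agent header, so a real run may need a urllib.request.Request with headers set.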
https://stackoverflow.com/questions/40943789