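I'm trying to scrape this page with Scrapy. I can successfully scrape the data on the page itself, but I'd also like to scrape data from the other pages (the ones behind the "Next" link). Here is the relevant part of my code: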
def parse(self, response):
    item = TimemagItem()
    item['title']= response.xpath('//div[@class="text"]').extract()
    links = response.xpath('//h3/a').extract()
    crawledLinks=[]
    linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")
    for link in links:
        if linkPattern.match(link) and not link in crawledLinks:
            crawledLinks.append(link)
            yield Request(link, self.parse)
    yield item
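I'm getting the right information (the titles from the linked pages), but the spider isn't "navigating" at all. How do I tell Scrapy to navigate?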
Posted on 2014-10-31 19:51:51
Take a look at the Scrapy Link Extractors documentation. Link extractors are the right way to tell your crawler to follow the links on a page.
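One likely reason the hand-rolled approach in the question follows nothing: `response.xpath('//h3/a').extract()` returns each `<a>` element serialized as HTML, not its URL, so a pattern anchored on the URL scheme cannot match it. Here is a minimal sketch of that effect using only the standard library. The pattern is a simplified stand-in for the one in the question, and the example.com URL and title are made up for illustration.

```python
import re

# Simplified stand-in for the URL pattern in the question (assumption:
# like the original, it is anchored on the scheme at the start of the string).
link_pattern = re.compile(r"^(?:ftp|http|https)://\S+$")

# The kind of string '//h3/a' yields: the whole serialized <a> element.
anchor_html = '<a href="http://example.com/article/1">Some title</a>'

# The kind of string '//h3/a/@href' yields: only the attribute value.
href_value = "http://example.com/article/1"


def is_followable(candidate):
    """Return True if the candidate string looks like an absolute URL."""
    return link_pattern.match(candidate) is not None


print(is_followable(anchor_html))  # False: the string starts with "<a", not a scheme
print(is_followable(href_value))   # True
```

Selecting `@href` would give the pattern something it can match, but a link extractor does that selection, the de-duplication and the request scheduling for you.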
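Looking at the page you want to scrape, I believe you need two extractor rules. Here is an example of a simple spider whose rules fit what you need from the TIME pages: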
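# Current import paths; the scrapy.contrib ones were removed in later Scrapy releases.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]
    rules = (
        # Article links on a results page: fetch each one and pass it to parse_item.
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tout"]/h3/a',)),
             callback='parse_item'),
        # "Next" link: follow it so the rules are applied to the next results page too.
        Rule(LinkExtractor(restrict_xpaths=('//a[@title="Next"]',)),
             follow=True),
    )

    # Not named parse: CrawlSpider uses parse() internally to apply the rules.
    def parse_item(self, response):
        item = TimemagItem()  # item class from the question's project
        item['title'] = response.xpath('.//title/text()').extract()
        return item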
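To make the division of labour between the two rules concrete, here is a toy model of the traversal in plain Python. It is only a sketch of the idea, not how Scrapy is implemented: the `PAGES` dictionary, the example.com URLs and the `crawl` function are all hypothetical, and real requests are replaced by dictionary lookups.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": each results page lists article hrefs and an
# optional relative href for its "Next" link. None of these URLs are real.
PAGES = {
    "http://example.com/results?page=1": {
        "articles": ["/article/a", "/article/b"],
        "next": "/results?page=2",
    },
    "http://example.com/results?page=2": {
        "articles": ["/article/b", "/article/c"],
        "next": None,
    },
}


def crawl(start_url):
    """Collect article URLs from every results page reachable via "Next"."""
    seen_pages = set()
    articles = []
    page_url = start_url
    while page_url is not None and page_url not in seen_pages:
        seen_pages.add(page_url)
        page = PAGES[page_url]
        # Role of the first rule: article links go to a callback
        # (here they are simply collected, skipping duplicates).
        for href in page["articles"]:
            absolute = urljoin(page_url, href)
            if absolute not in articles:
                articles.append(absolute)
        # Role of the second rule: the "Next" link is only followed,
        # so the same two steps run again on the next results page.
        next_href = page["next"]
        page_url = urljoin(page_url, next_href) if next_href else None
    return articles


print(crawl("http://example.com/results?page=1"))
```

The spider expresses the same loop declaratively: the rule with a callback plays the part of the inner `for`, and the rule with `follow=True` plays the part of advancing `page_url`.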
https://stackoverflow.com/questions/26681957