I have 100+ spiders and I want to run 5 of them at a time from a script. To keep track of each spider's status, i.e. whether it has finished running, is running, or is still waiting to run, I created a table in the database.
I know how to run multiple spiders from a script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for the demo; instead of this I would loop over the spider names read from the database
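A minimal sketch of one way to do this, assuming the spider names come from the status table (here they are just a hard-coded placeholder list): CrawlerRunner can start a batch of up to five crawls, wait for the whole batch with a DeferredList, and then move on to the next batch.

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# Placeholder names; in practice these would be read from the status table.
pending_spiders = ['spider1', 'spider2', 'spider3', 'spider4', 'spider5', 'spider6']

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_in_batches(names, batch_size=5):
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        # start up to batch_size spiders concurrently and wait for all of them
        yield defer.DeferredList([runner.crawl(name) for name in batch])
        # a good place to update each spider's status row in the database
    reactor.stop()

crawl_in_batches(pending_spiders)
reactor.run()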
I made a small project with two spiders. I also created a test.py (inside this project) to run the spiders.
Code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('nameofspider1', domain='domain')
process.crawl('nameofspider2', domain='domain')
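For the script to actually crawl anything, both crawl() calls need to be followed by process.start(), which starts the Twisted reactor and blocks until every scheduled spider has finished. A minimal sketch of the complete test.py, keeping the placeholder names from the question:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# both spiders are scheduled first, then started together
process.crawl('nameofspider1', domain='domain')
process.crawl('nameofspider2', domain='domain')
process.start()  # blocks here until both crawls have finished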
My goal is to test spiders written with Scrapy (Python). I have tried using contracts, but in practice they are limited, because I cannot test things like pagination, or whether certain attributes are extracted correctly.
def parse(self, response):
""" This function parses a sample response. Some contracts are mingled
with this docstring.
@url http://someurl.com
@returns items 1 16
@returns requests 0 0
"""
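One way around the limits of contracts is to unit-test the callbacks directly against saved HTML, by building a fake response object and feeding it to parse(). A minimal sketch using unittest; MySpider, the import path and sample.html are placeholders:

import unittest
from scrapy.http import HtmlResponse, Request
from myproject.spiders.myspider import MySpider  # placeholder import path

def fake_response(file_path, url='http://someurl.com'):
    # Build an HtmlResponse from a local HTML file, as if it came from `url`.
    with open(file_path, 'rb') as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding='utf-8')

class ParseTest(unittest.TestCase):
    def test_items_extracted(self):
        spider = MySpider()
        results = list(spider.parse(fake_response('sample.html')))
        # assert on the extracted items and follow-up requests, e.g. pagination links
        self.assertTrue(len(results) >= 1)

This makes pagination testable: save one listing page, run parse() on it, and assert that the expected "next page" Request is among the results.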
I have this code, and even after both spiders have finished, the program keeps running.
#!C:\Python27\python.exe
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from carrefour.spiders.tesco import TescoSpider
from carrefour.spiders.carr import CarrSpider
from scrapy.utils.project import get_project_settings
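What usually keeps the process alive is that nothing ever stops the Twisted reactor once both crawls are done. A minimal sketch of the pattern from the Scrapy docs, using CrawlerRunner and stopping the reactor when the joined deferred fires (assuming a Scrapy version recent enough that scrapy.log is no longer needed):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from carrefour.spiders.tesco import TescoSpider
from carrefour.spiders.carr import CarrSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())
runner.crawl(TescoSpider)
runner.crawl(CarrSpider)
d = runner.join()                    # fires when all scheduled crawls have finished
d.addBoth(lambda _: reactor.stop())  # stop the reactor whether they succeed or fail
reactor.run()                        # blocks until reactor.stop() is called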
I am checking for (internet) connection errors in my spider.py with the following:
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)
def handle_error(self, failure):
if failure.check(DNSLookupError): # or failure.check(UnknownHostError):
r
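A slightly fuller errback, close to the example in the Scrapy docs, that distinguishes DNS failures from timeouts and HTTP errors (Twisted's DNSLookupError already covers the unresolved-host case, so a separate UnknownHostError check is not needed):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name
    start_urls = ['http://someurl.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        pass  # normal parsing logic goes here

    def handle_error(self, failure):
        self.logger.error(repr(failure))
        if failure.check(DNSLookupError):
            self.logger.error('DNSLookupError on %s', failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error('TimeoutError on %s', failure.request.url)
        elif failure.check(HttpError):
            self.logger.error('HttpError on %s', failure.value.response.url)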
I am running a single spider and everything works correctly. That is fine, but whenever the spider finishes, Heroku gets confused: it prints the following logs and starts the app again, because it thinks the app crashed.
app[worker.1]: 2020-11-23 00:04:10 [scrapy.core.engine] INFO: Spider closed (finished)
heroku[worker.1]: Process exited with status 0
heroku[worker.1]: State changed from up to crashed
heroku[worker.1]: State changed from crash
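Heroku restarts a worker dyno whenever its process exits, even with exit status 0, so the usual fix is to keep the worker alive and run the spider on a schedule instead of letting the script end. A minimal sketch; the spider name and the one-hour pause are placeholders, and launching each crawl in a subprocess sidesteps the fact that a Twisted reactor cannot be restarted inside the same process:

import subprocess
import time

# Keep the worker process alive: crawl, wait, crawl again,
# so the dyno never exits and Heroku never marks it as crashed.
while True:
    subprocess.run(['scrapy', 'crawl', 'nameofspider'])  # placeholder spider name
    time.sleep(3600)  # pause between runs; adjust as needed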