My goal is to extract every link from a page and store them, so I can build another spider that scrapes information from each of them and end up with an exhaustive list of related links. However, I don't seem to be pointing the spider in the right direction to extract those links, because I'm getting an empty list.
```
class ArticleSpider(scrapy.Spider):
    name = 'links'
    start_urls = [
        'https://abcnews.go.com/search?searchtext=Coronavirus&type=Story&sort=date'
    ]

    def parse(self, response):
        all_links = response.css("h2.selectorgadget_selected a.AnchorLink.selectorgadget_selected::attr(href)").extract()
        yield {'linktext': all_links}
```
Posted on 2020-11-26 13:51:23
You are getting an empty list because the items and their links are loaded by JavaScript. Press Ctrl + Shift + P in the Chrome DevTools and run "Disable JavaScript"; that way you can see exactly what Scrapy gets back when it requests the URL in start_urls.
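If you want to check this outside the browser, here is a minimal sketch (not part of the original answer) that fetches the raw HTML and runs a selector similar to the one from the question against it; the browser-like User-Agent header is just an assumption to reduce the chance of being blocked:

```
import urllib.request
from parsel import Selector  # parsel is installed together with Scrapy

url = "https://abcnews.go.com/search?searchtext=Coronavirus&type=Story&sort=date"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# A selector like the one in the question finds nothing in the raw HTML,
# because the search results are rendered client-side by JavaScript.
print(Selector(text=html).css("h2 a.AnchorLink::attr(href)").getall())  # expected: []
```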
Luckily for you, the page's script simply makes a request to an API, which you can easily reproduce yourself. There you can see the JSON response.
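To take a look at that JSON yourself, a minimal sketch (not part of the original answer) is to fetch the API URL directly; the browser-like User-Agent header is again just a precaution:

```
import json
import urllib.request

api = ("https://abcnews.go.com/meta/api/search"
       "?q=Coronavirus&limit=10&sort=date&type=Story&section=&totalrecords=true&offset=0")
req = urllib.request.Request(api, headers={"User-Agent": "Mozilla/5.0"})
data = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))

# Pretty-print the beginning of the payload to see its structure.
print(json.dumps(data, indent=2)[:2000])
```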
So in your parse method you just need to make a request to this URL: https://abcnews.go.com/meta/api/search?q=Coronavirus&limit=10&sort=date&type=Story&section=&totalrecords=true&offset=0. After that, all you have to do is parse the response and pull out the URLs you need.
```
import scrapy
import json
from scrapy import Request


class ArticleSpider(scrapy.Spider):
    name = 'links'
    start_urls = [
        'https://abcnews.go.com/search?searchtext=Coronavirus&type=Story&sort=date'
    ]
    api_url = "https://abcnews.go.com/meta/api/search?q=Coronavirus&limit=100&sort=date&type=Story&section=&totalrecords=true&offset=0"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0",
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        # We need this because otherwise Scrapy won't manage cookies for us.
        yield Request(self.api_url, self.parse_api, headers=self.headers)

    def parse_api(self, response):
        data = json.loads(response.text)
        # From data you can get your links
```
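If it helps, here is a hedged sketch of what parse_api might look like once the links are extracted. The key names "item" and "link" are purely illustrative assumptions, so inspect the actual JSON and substitute whatever the response really contains:

```
    # Drop-in replacement for the parse_api method of ArticleSpider above.
    def parse_api(self, response):
        data = json.loads(response.text)
        # "item" and "link" are hypothetical key names -- inspect the real
        # JSON payload and rename them to match its actual structure.
        for entry in data.get("item", []):
            link = entry.get("link")
            if link:
                yield {'linktext': link}
```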
https://stackoverflow.com/questions/65023050