Q: Web scraping a site does not return the correct values
Stack Overflow user

Asked 2021-06-26 19:31:51

2 answers · 59 views · 0 followers · 1 vote

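I am scraping a site (the URL is in the code below). My goal is to scrape each product ID/SKU and get its link. The elements are visible on the site, but when I scrape the data my output is either blank or an error. Current code: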

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
url = "https://www.adidas.com.sg/yeezy/"
productsource = requests.get(url,headers=headers,timeout=15)
productinfo = BeautifulSoup(productsource.content, "lxml")

for item in productinfo.select('div',class_='src-components-___coming-soon__row___NfXc3'):

    sku = item.find('div', class_="src-components-___coming-soon__product___2Gai4")['id']
    link = item.a['href']
    print(sku,'\n',link)

Result:

Traceback (most recent call last):
  File "c:\Users\matta\OneDrive\xeonon\testing monitors\test.py", line 14, in <module>
    sku = item.find('div', class_="src-components-___coming-soon__product___2Gai4")['id']
TypeError: 'NoneType' object is not subscriptable

Can anyone help? What am I doing wrong?

Update: how do I extract the first URL?

      "imageUrls": [
                    "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/3d37a43625ce413ea6d3ad44013560db_9366/GZ0954_01_standard.jpg",
                    "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/e1748ff26ad54f559ffbad4401356122_9366/GZ0954_01_standard1_hover.jpg",
                    "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/3da89e0f71064a958377ad4401355e12_9366/GZ0954_01_standard2.jpg",
                    "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/43136245b78840e9901bad44013561bf_9366/GZ0954_02_standard.jpg",
                    "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/c116076d86b34098bf9cad4401355ee8_9366/GZ0954_03_standard.jpg"
                ],

web-scraping

beautifulsoup

python-requests

python

2 Answers

Stack Overflow user

Answered 2021-06-26 21:18:17

import requests
from bs4 import BeautifulSoup
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}


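def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # The product data is not in the rendered HTML; it is a JSON object
    # assigned to a variable in the page's first <script> tag.
    goal = soup.select_one('script').string.split("=", 1)[1]
    print(json.loads(goal)['productIds'])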


main('https://www.adidas.com.sg/yeezy')

Output:

['GZ0953', 'GZ0954', 'GZ0955', 'GZ5551', 'GZ5554']
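The snippet above relies on the first `<script>` tag being the one that holds the data. A slightly more defensive variant is sketched below, run against a made-up inline HTML sample rather than the live site. It picks the script whose text contains `window.ENV`, the variable name used in the other answer; `SAMPLE_HTML` and `extract_env` are hypothetical names introduced only for illustration, and `html.parser` is used so the sketch does not depend on `lxml`.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real page is far larger.
SAMPLE_HTML = """
<html><head>
<script>var unrelated = 1;</script>
<script>window.ENV = {"productIds": ["GZ0953", "GZ0954"]}</script>
</head><body></body></html>
"""


def extract_env(html, marker="window.ENV"):
    """Return the JSON object assigned to `marker` in an inline script, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        text = script.string or ""
        if marker in text:
            # Everything after the first "=" is treated as the JSON payload.
            payload = text.split("=", 1)[1].strip().rstrip(";")
            return json.loads(payload)
    return None


env = extract_env(SAMPLE_HTML)
product_ids = env["productIds"] if env else []
print(product_ids)
```

Returning `None` when no matching script is found lets the caller decide what to do, instead of failing with the same kind of `'NoneType' object is not subscriptable` error shown in the question.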

Votes: 1

Stack Overflow user

Answered 2021-06-26 21:18:40

The data is embedded in the page inside JavaScript. You can use this example to parse it:

import re
import json
import requests


url = "https://www.adidas.com.sg/yeezy"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

html_doc = requests.get(url, headers=headers).text
data = re.search(r"window\.ENV = ({.*})", html_doc).group(1)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for id_, product in data["productData"].items():
    print(id_, product["shared"]["trackingName"], product["localized"]["color"])
    print("https://www.adidas.com.sg/yeezy/product/{}".format(id_))

Prints:

GZ0953 YEEZY SLIDE ADULTS ENFLAME ORANGE
https://www.adidas.com.sg/yeezy/product/GZ0953
GZ0954 YEEZY SLIDE KIDS ENFLAME ORANGE
https://www.adidas.com.sg/yeezy/product/GZ0954
GZ0955 YEEZY SLIDE INFANTS ENFLAME ORANGE
https://www.adidas.com.sg/yeezy/product/GZ0955
GZ5551 YEEZY SLIDE RESIN
https://www.adidas.com.sg/yeezy/product/GZ5551
GZ5554 YEEZY SLIDE PURE
https://www.adidas.com.sg/yeezy/product/GZ5554
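For the update in the question (taking the first entry of `imageUrls`), one possible approach is sketched below. The question does not show where `imageUrls` sits inside the parsed JSON, so the nesting in `SAMPLE` is an assumption, and the hypothetical helper `find_key` searches for the key by name instead of hard-coding a path.

```python
import json

# Hypothetical fragment shaped like the "imageUrls" excerpt in the question;
# the surrounding keys ("productData", "shared") are assumptions.
SAMPLE = json.loads("""
{
  "productData": {
    "GZ0954": {
      "shared": {
        "imageUrls": [
          "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/3d37a43625ce413ea6d3ad44013560db_9366/GZ0954_01_standard.jpg",
          "https://assets.adidas.com/images/w_840,h_840,q_auto:sensitive/e1748ff26ad54f559ffbad4401356122_9366/GZ0954_01_standard1_hover.jpg"
        ]
      }
    }
  }
}
""")


def find_key(node, key):
    """Depth-first search for the first value stored under `key`, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


image_urls = find_key(SAMPLE, "imageUrls") or []
# Index 0 is the first URL; guard against a missing or empty list.
first_url = image_urls[0] if image_urls else None
print(first_url)
```

With the real page, `data` from the code above would take the place of `SAMPLE`. If the exact path to `imageUrls` is known, plain indexing followed by `[0]` is simpler than a search.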