I want to get search results from Baidu, but I'm stuck here:
import sys
import urllib
import urllib2
from bs4 import BeautifulSoup
question_word = "Hello"
url = "http://www.baidu.com/s?wd=" + urllib.quote(question_word.decode(sys.stdin.encoding).encode('gbk'))
htmlpage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlpage)
for child in soup.findAll("h3", {"class": "t"}):
    print child.contents[0]
This returns all the tags containing the target URLs, but I don't know how to use .get('href') to list the actual URLs.
I'm still new to Python, so I may be confused about some basic concepts. I'd really appreciate any help.
Posted on 2017-03-12 21:57:58
for child in soup.findAll("h3", {"class": "t"}):
    print child.a.get('href')
Use .a to get the first a tag inside the h3 tag; then you can call .get('href') on it.
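For reference, here is a minimal Python 3 sketch of the same idea on a static HTML snippet (the markup below is made up for illustration, not real Baidu output):

```python
from bs4 import BeautifulSoup

html = """
<h3 class="t"><a href="http://example.com/1">First result</a></h3>
<h3 class="t"><a href="http://example.com/2">Second result</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for child in soup.find_all("h3", {"class": "t"}):
    # child.a is the first <a> tag nested inside the <h3>;
    # .get('href') reads that tag's href attribute
    links.append(child.a.get("href"))

print(links)  # ['http://example.com/1', 'http://example.com/2']
```

The same pattern applies to the live page: swap the static string for the downloaded HTML.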
Posted on 2021-03-31 23:15:49
There are a few options for grabbing the href; here are two: .a.get('href') or .a['href'].
You can also use SelectorGadget to find CSS selectors, use the select() or select_one() methods to match what you're looking for, and iterate over the results with a for loop.
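Both attribute-access options, plus select()/select_one(), can be sketched on a static HTML snippet (the markup here is illustrative, not real Baidu output); the main practical difference is how a missing attribute is handled:

```python
from bs4 import BeautifulSoup

html = '<div class="result"><h3 class="t"><a href="http://example.com/a">Title</a></h3></div>'
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first node matching a CSS selector
h3 = soup.select_one("h3.t")

# Option 1: .get() returns None if the attribute is missing
href_via_get = h3.a.get("href")

# Option 2: dict-style access raises KeyError if the attribute is missing
href_via_index = h3.a["href"]

assert href_via_get == href_via_index == "http://example.com/a"

# select() returns all matches for a selector; iterate with a for loop
for link in soup.select(".result h3.t a"):
    print(link["href"])
```

Prefer .get() when some tags may lack the attribute; dict-style access fails loudly, which can also be what you want.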
Code and full example:
from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    "User-Agent":
        "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

response = requests.get('https://www.baidu.com/s?tn=baidu&wd=lasagna', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for link in soup.select('.result .t a'):
    links = link['href']
    print(links)
Output:
http://www.baidu.com/link?url=HsRYa1X1Me8F9GY13KcqBcfFuLv1ST0OgGrTV-czCKVjRFlnzBIkhefpxX7Qy5a1vmfDn2sfZm_oJ9MB5dsthKQp8-oUt5QAQWjtTepQ46O
http://www.baidu.com/link?url=BShcWZT386OcLGiTRZZvHVYxk3c5RJg8xP7NmWyzzelEgMi4Udwsmkn_0HKwm3bAYwD4vijoRAAsPkfU03Gvv_
http://www.baidu.com/link?url=YYtyQWUvIzNvsul5yxH16yYtuvgQG7otyqUlc218FmFKR3It4jnLn69Iuk8h9wAx
http://www.baidu.com/link?url=l-_isPDZWm36Hu8otKlJYz6PM80wnr8gEXS9NcuQ-lj7xIWrIBCW6-s8J-ovtiZZeUqosqAOOMpSlclq2CHN7_
http://www.baidu.com/link?url=86Rh7aSzzMlVz6Upzra57OKMc4sMZkTgECNk9IQi493D6jDqtNYSSaYNHDSeXZ40zpU29RxRvq6W7xm4O8FNVq
http://www.baidu.com/link?url=91g1S6LBPAA3sXIU3b-OSsbNVYvBUS6iD6FzAoTdNS_NBQn1L5uHvvo8X8RialUm
http://www.baidu.com/link?url=wkVK0QjvpTMNxQuvIy--P8BHeW8oeSkBizDxP2JDtWcT4AZ5DCYd-DDLjWBwQ7Lj
Alternatively, you can use SerpApi. It's a paid API with a free trial.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "baidu",
    "q": "lasagna",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for organic_result in results["organic_results"]:
    if 'link' in organic_result:
        print(f"Link: {organic_result['link']}")
Output:
Link: http://www.baidu.com/link?url=bvvL93wtWMZN2N7XNDAdcHv7-V4xk_ltvtDwQpIEGi8Cikedd2OTlCkiJU_sAJY-nvitYCHfwj9yP9FoyM2YwTYcwygddL1LGqNnf4RjmX35OE5GYNM1TMo4FIA70pIpyaPodDHyAUGxAFZAdoD9Nq
Link: http://www.baidu.com/link?url=5bldYEmDCmryDDeWYPjUO87EvQUYyaS5R1MmXhm7t9pReijqCgbkqP0N4wxukwDRmNya1_HQKn1flaWGQ5s5zeBgvIv1sDmeXziyyksttWtCZv488feUpO6Su3aKLJm--5RKj78YbxM2MYlCblC-RhAzyIPwPmD4ujkVnx2I9mhiWYyOp2F5JhCTM4p50qFf
Link: http://www.baidu.com/link?url=tOaRrtH80vM4w3Nvj81lE-HTgS22ZT5dDFb4mZb_mydsXNT502JrvDeFtmv92DJ-iO4XQ2x_RWPT383yfPN_5q
Link: http://www.baidu.com/link?url=-Oftww54gKeV5Y9Arv7Pb98F_J_wMJ-ZjAT8nCiDUG1KP9GgTUI6rttZwk1GUVeimShwiXUb6XQ0Avg6t45Ssq
Link: http://www.baidu.com/link?url=PXtVGPa_zrRvxOWHOXEdfmPdBONvR0bKiQ18pqZssqf3MD3EyfgZl0DF90wNpSvoLwHYajnd6yW73yNvYY8kiSDfakZjA8vHuT0G4hKsi6S
Link: http://www.baidu.com/link?url=vsamKJMS3JWKjipxbRZadTfqBDMUsYxp_jUzXZbZ4mXWPxH2G3lkl7crUt8huAMAdy4R-_T0BGF7zQwU8JOPPK
Link: http://www.baidu.com/link?url=0uIAbYhNuwLDNvyoD2bUusKPqt9Yh5Kw763DhNSlHE2f3WS0CnY5tZZgwRsihjJZTp4bN_2HDfIJ2MFggYFNdABt6FyV-Rnb-YCjzAf80SC
Link: http://www.baidu.com/link?url=Uwm9NtMcJGu4Mj4BN1nKyDUUERR1KXv9y9PJHzuo8PUdEY8wE6jxsc1sSuDkzIIR
Link: http://www.baidu.com/link?url=_ey4jd5jd0vxvTwIRhrFleI-q4hVk7yZG-aaEdW0BpgaeM_3jcxqSN_jWw3w5syB3V4xFWxzZDS6lZNoqwv8aOo6U269AfBME86GxQ3b_F7
Link: http://www.baidu.com/link?url=B6gRmV70secoV7cM_KcJnbJFzgFCpWV0ShwONC9t5fFuZlOkbirx5qYdIxDUq8jPxP4HhaQvSOxX_qMEglAs2_
Disclaimer: I work for SerpApi.
https://stackoverflow.com/questions/42757310