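My task is to find the body of the article, <div id="bodyContent">, and within it compute the length of the longest sequence of links that have no other opening or closing tags between them. For example: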
<p>
<span><a></a></span>
**<a></a>
<a></a>**
</p>
<p>
**<a><span></span></a>
<a></a>
<a></a>**
</p>
Code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://en.wikipedia.org/wiki/Stone_Age')
soup = BeautifulSoup(html.text, "lxml")
body = soup.find(id="bodyContent")
# get first link
first_link = body.a
# find all links that are in the same level
first_link.find_next_siblings('a')
How do I move on to the links that follow?
Best regards!
Posted on 2018-09-15 12:51:37
My solution is:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://en.wikipedia.org/wiki/Stone_Age')
soup = BeautifulSoup(html.text, "lxml")
body = soup.find(id="bodyContent")
tag = body.find_next("a")
linkslen = -1
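while tag:
    curlen = 1
    # `tag` is rebound by this loop, so afterwards it points at the last
    # sibling visited (or stays unchanged if there were no siblings)
    for tag in tag.find_next_siblings():
        if tag.name != 'a':
            break
        curlen += 1
    if curlen > linkslen:
        linkslen = curlen
    # resume the search from wherever the sibling scan stopped
    tag = tag.find_next("a")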
print(linkslen)
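A note on why the sibling scan above works: when called without arguments, `find_next_siblings()` returns only tags, so plain text between two links does not interrupt a run, while any other tag does. The snippet below is a small sketch of that behavior on a made-up fragment (the HTML string is an assumption for illustration, not taken from the Wikipedia page), parsed with the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: text nodes and one <b> tag sit between the links.
fragment = "<p><a>one</a>, <a>two</a> and <b>x</b><a>three</a></p>"
soup = BeautifulSoup(fragment, "html.parser")
first = soup.a

# .next_siblings yields every node, including the text between the tags
all_nodes = [type(node).__name__ for node in first.next_siblings]

# find_next_siblings() with no arguments keeps only Tag objects
sibling_names = [sibling.name for sibling in first.find_next_siblings()]

print(all_nodes)
print(sibling_names)
```

Here the run starting at the first link has length 2, because the `<b>` tag breaks it before the third link.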
Posted on 2020-08-07 12:51:49
Another solution:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://en.wikipedia.org/wiki/Stone_Age')
soup = BeautifulSoup(html.text, "lxml")
body = soup.find(id="bodyContent")
all_links = body.find_all('a')
sequence = 0
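for link in all_links:
    # renamed from `len` and `l` to avoid shadowing the built-in len()
    run_length = 1
    for sibling in link.find_next_siblings():
        if sibling.name != 'a':
            break
        run_length += 1
    sequence = max(sequence, run_length)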
print(sequence)
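To check either approach without hitting the network, the same sibling scan can be wrapped in a function and run on the sample markup from the question. This is only a sketch under a few assumptions: the function name `longest_link_run`, the `container_id` parameter, and the `SAMPLE_HTML` string are made up for illustration, the sample is wrapped in a `<div id="bodyContent">` so the lookup matches, the `**` markers from the question are left out, and the built-in "html.parser" is used instead of "lxml".

```python
from bs4 import BeautifulSoup

# The question's example, wrapped in the container the code looks up.
SAMPLE_HTML = """
<div id="bodyContent">
<p>
<span><a></a></span>
<a></a>
<a></a>
</p>
<p>
<a><span></span></a>
<a></a>
<a></a>
</p>
</div>
"""


def longest_link_run(html, container_id="bodyContent"):
    """Return the longest run of adjacent sibling <a> tags in the container."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(id=container_id)
    if body is None:
        return 0  # container not found, so there is nothing to count
    longest = 0
    for link in body.find_all("a"):
        run_length = 1
        # count following sibling tags until something other than <a> appears
        for sibling in link.find_next_siblings():
            if sibling.name != "a":
                break
            run_length += 1
        longest = max(longest, run_length)
    return longest


print(longest_link_run(SAMPLE_HTML))  # prints 3
```

The result is 3 because the second paragraph holds three sibling links in a row; the `<span>` nested inside the first of them is a child, not a sibling, so it does not break the run. Returning 0 when there are no links is a choice made here, whereas the first answer would print -1 in that case.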
https://stackoverflow.com/questions/52291029