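I'm very new to BeautifulSoup.
How can I extract the text of a paragraph from HTML source, split that text wherever there is a <br/>, and store it in an array so that each element is one chunk of the paragraph text (as delimited by <br/>)?
For example, for the following paragraph: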
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like to store it in the following array:
['Pancakes', 'A delicious type of food']
What I tried:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
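print(p)
But this outputs an array with only one element:
['Pancakes A delicious type of food']
Is there a way to code this so that I get an array containing each chunk of the paragraph text, split at every <br/>?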
Posted on 2020-06-26 03:08:25
Try this:
from bs4 import BeautifulSoup, NavigableString
html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
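p = soup.find_all('p')
# Keep only the direct text children of the first <p>; the <br/> tags are skipped
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]
Update to recurse into nested tags: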
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
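# Collects every text node at any depth, but does not group them by <br/>
p = soup.find('p').find_all(text=True, recursive=True)
Updated again so that the text is split only at <br/>: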
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
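text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        # Keep the surrounding spaces so adjacent pieces do not run together
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            # Mark each <br/> as a split point
            text += '\n'
# Split at the <br/> markers and collapse extra whitespace in each chunk
result = [' '.join(part.split()) for part in text.strip().split('\n')]
print(result)
https://stackoverflow.com/questions/62587253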
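For reference, the same idea can be wrapped in a small helper that also copes with the multi-line HTML from the question, where the source itself contains newlines. This is only a sketch, not part of the original answer: the function name `split_paragraph_on_br` is made up for illustration, and it assumes that only the first <p> matters and that whitespace inside each chunk can be collapsed to single spaces. It walks `descendants` instead of direct children, so a <br/> nested inside another tag also counts as a split point.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def split_paragraph_on_br(html):
    """Return the text of the first <p> in `html`, split at each <br/>.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    if paragraph is None:
        return []

    chunks = [[]]  # one list of text pieces per <br/>-separated chunk
    for node in paragraph.descendants:
        if isinstance(node, Tag) and node.name == "br":
            chunks.append([])  # a <br/> starts a new chunk
        elif isinstance(node, NavigableString):
            chunks[-1].append(str(node))

    # Join the pieces, collapse runs of whitespace, and drop empty chunks.
    joined = (" ".join("".join(pieces).split()) for pieces in chunks)
    return [text for text in joined if text]


if __name__ == "__main__":
    html = """<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>"""
    print(split_paragraph_on_br(html))
    # ['Pancakes', 'A delicious type of food']
```

Collecting pieces per chunk, rather than joining everything with '\n' and splitting afterwards, means newlines that happen to be in the HTML source are never mistaken for <br/> boundaries.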