My web scraping script is returning duplicate results for some reason. I have tried many alternatives but just can't get it to work. Can anyone help?
import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv
soup = [ ]
pages = [ ]
csv_file = open('444.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])
for i in range(35899, 35909):
    url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup.append(bs(page.text, 'lxml'))
business = []
for items in soup:
    h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
    for i in h1Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h1':
                business.append(tag.text)
            else:
                print('no-business')
names = []
for items in soup:
    h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
    for i in h4Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h4':
                names.append(tag.text)
            else:
                print('no-name')
print(business, names)
csv_writer.writerow([business, names])
csv_file.close()
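It currently returns duplicate values for all of them.
What it needs to do is return one 'business' value and one 'names' value per URL call. If there is no 'business' or 'name', it needs to return the value 'no-business' or 'no-name'.
Can anyone help me?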
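The requirement above (exactly one row per URL, with a placeholder when a value is missing) can be sketched without any scraping library. This is only an illustration under stated assumptions: the helper name first_or_default and the sample data are hypothetical and stand in for whatever the selectors return for each page.

```python
def first_or_default(values, default):
    """Return the first non-empty string in values, or default if there is none."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Hypothetical per-page extraction results:
# (practice names found on the page, manager names found on the page).
extracted_pages = [
    (["Practice A"], ["Manager A"]),
    (["Practice B"], []),  # page without a Practice Manager
    ([], []),              # page with nothing usable at all
]

# Build exactly one [business, name] row per page, filling in placeholders.
rows = [
    [first_or_default(practices, "no-business"), first_or_default(managers, "no-name")]
    for practices, managers in extracted_pages
]

for row in rows:
    print(row)
```

Because each page is reduced to a single pair before anything is stored, the number of rows always equals the number of pages.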
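Posted on 2019-05-15 12:59:42
You can use the ids shown below to generate an initial list of lists. Alternatively, write each row to the CSV as you go rather than appending to a final list.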
import requests
from bs4 import BeautifulSoup as bs
results = []
with requests.Session() as s:
    for i in range(35899, 35909):
        r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
        soup = bs(r.content, 'lxml')
        row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
        if not row: row = ['no practice manager']
        practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)') else 'No practice name'
        row.insert(0, practice)
        results.append(row)
print(results)
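I'm not sure how you want multiple names listed. The version below writes each row to a CSV file, with every matching name in its own column: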
import requests
from bs4 import BeautifulSoup as bs
import csv
with open('output.csv', 'w', newline='') as csvfile:
    w = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    with requests.Session() as s:
        for i in range(35899, 35909):
            r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
            soup = bs(r.content, 'lxml')
            row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
            if not row: row = ['no practice manager']
            practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)') else 'No practice name'
            row.insert(0, practice)
            w.writerow(row)
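If a fixed two-column layout is preferred when a practice lists several managers, one option is to join the names into a single cell. The following is a sketch only, not part of the answer's code: the helper build_row, the "; " separator and the sample names are all assumptions, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def build_row(practice, managers, separator="; "):
    """Return a two-column row, joining any number of manager names into one cell."""
    names = separator.join(managers) if managers else "no practice manager"
    return [practice, names]


# In-memory buffer used instead of a real file so the sketch has no side effects.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Practice", "Practice Manager"])
writer.writerow(build_row("Practice A", ["Manager One", "Manager Two"]))
writer.writerow(build_row("Practice B", []))

csv_text = buffer.getvalue()
print(csv_text)
```

With this layout every row has the same number of columns, which tends to be easier to load into a spreadsheet than rows of varying width.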
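Posted on 2019-05-15 11:12:59
I don't know if this is the best approach, but I used a set instead of a list to remove the duplicates, and converted the set back to a list just before saving the file, like this: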
import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv
soup = [ ]
pages = [ ]
csv_file = open('444.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])
for i in range(35899, 35909):
    url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup.append(bs(page.text, 'lxml'))
business = set()
for items in soup:
    h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
    for i in h1Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h1':
                business.add(tag.text)
            else:
                print('no-business')
names = set()
for items in soup:
    h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
    for i in h4Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h4':
                names.add(tag.text)
            else:
                print('no-name')
print(business, names)
csv_writer.writerow([list(business), list(names)])
csv_file.close()
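Two caveats with sets are worth keeping in mind: a set does not preserve insertion order, and de-duplicating practices and names separately loses track of which name belongs to which practice. As an alternative sketch (not the code above), duplicates can be dropped from paired rows while keeping order. The helper dedupe_keep_order and the sample rows are hypothetical.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(items))


# Hypothetical (practice, manager) pairs, one per scraped page.
rows = [
    ("Practice A", "Manager A"),
    ("Practice B", "Manager A"),
    ("Practice B", "Manager A"),  # exact duplicate of the previous row
    ("Practice C", "Manager C"),
]

# De-duplicating whole rows keeps each name attached to its practice.
unique_rows = dedupe_keep_order(rows)

# De-duplicating names alone collapses "Manager A", although two
# different practices legitimately share that name.
unique_names = dedupe_keep_order(name for _, name in rows)

print(unique_rows)
print(unique_names)
```

Here three distinct rows survive but only two distinct names, which is why de-duplicating each column on its own can leave the columns with different lengths.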
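Posted on 2019-05-15 11:47:52
The problem seems to stem from the fact that some of these pages contain no information at all, and you get a "hidden profile" error. I modified your code slightly to cover the first 5 pages. Apart from saving to a file, it looks like this: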
[same imports]
pages = [ ]
for i in range(35899, 35904):
    url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
    pages.append(url)
soup = [ ]
for item in pages:
    page = requests.get(item)
    soup.append(bs(page.text, 'lxml'))
business = []
for items in soup:
    h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
    for i in h1Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h1':
                business.append(tag.text)
names = []
for items in soup:
    h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
    for i in h4Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag,Tag) and tag.name in 'h4':
                names.append(tag.text)
for bus, name in zip(business,names):
    print(bus,'---',name)
The output looks like this:
Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- Di Palfrey
Caversham Group Practice --- Di Palfrey
The Moorcroft Medical Ctr --- Ms Kim Stanyer
Brotton Surgery --- Mrs Gina Bayliss
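Note that only the second and third entries are duplicated; this is (somehow, I'm not sure why) caused by the "hidden profile" on the third page. So, if you change the main blocks of the code to: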
business = []
for items in soup:
    if "ProfileHiddenError.aspx" in (str(items)):
        business.append('Profile Hidden')
    else:
        h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
        for i in h1Obj:
            tagArray = i.findChildren()
            for tag in tagArray:
                if isinstance(tag,Tag) and tag.name in 'h1':
                    business.append(tag.text)
names = []
for items in soup:
    if "ProfileHiddenError.aspx" in (str(items)):
        names.append('Profile Hidden')
    elif "Practice Manager" not in str(items):
        names.append('No Practice Manager Specified')
    else:
        h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
        for i in h4Obj:
            tagArray = i.findChildren()
            for tag in tagArray:
                if isinstance(tag,Tag) and tag.name in 'h4':
                    names.append(tag.text)
for bus, name in zip(business,names):
    print(bus,'---',name)
This time the output is:
Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- No Practice Manager Specified
Profile Hidden --- Profile Hidden
The Moorcroft Medical Ctr --- Ms Kim Stanyer
Brotton Surgery --- Mrs Gina Bayliss
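One way to read the placeholder approach: zip pairs entries purely by position, so the two lists only line up if every page contributes an entry to each of them. The sketch below illustrates that general behaviour with made-up data; the helper pair_up and the sample values are hypothetical, and it does not claim to reproduce what the NHS pages actually return.

```python
def pair_up(business, names):
    """Pair entries positionally, refusing lists of different lengths."""
    if len(business) != len(names):
        raise ValueError(
            f"length mismatch: {len(business)} practices vs {len(names)} names"
        )
    return list(zip(business, names))


business = ["Practice A", "Practice B", "Practice C"]

# Without placeholders: the page for Practice B contributed no name,
# so every later name shifts up by one position.
names_without_placeholders = ["Manager A", "Manager C"]
misaligned = list(zip(business, names_without_placeholders))
print(misaligned)  # zip silently stops at the shorter list

# With placeholders: one entry per page keeps the positions in step.
names_with_placeholders = ["Manager A", "No Practice Manager Specified", "Manager C"]
aligned = pair_up(business, names_with_placeholders)
print(aligned)
```

Checking the lengths before zipping, as pair_up does, turns a silent misalignment into an immediate error.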
Hope this helps you solve the problem.
https://stackoverflow.com/questions/56145213