I wrote the following code to extract table data using BeautifulSoup:
import requests
website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')
for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]
    print(rows)
This is my output:
['Number', '@name', 'Name', 'Followers', 'Influence Rank']
[]
['1', '@mashable', 'Pete Cashmore', '2037840', '59']
[]
['2', '@cnnbrk', 'CNN Breaking News', '3224475', '71']
[]
['3', '@big_picture', 'The Big Picture', '23666', '92']
[]
['4', '@theonion', 'The Onion', '2289939', '116']
[]
['5', '@time', 'TIME.com', '2111832', '143']
[]
['6', '@breakingnews', 'Breaking News', '1795976', '147']
[]
['7', '@bbcbreaking', 'BBC Breaking News', '509756', '168']
[]
['8', '@espn', 'ESPN', '572577', '187']
[]
Please help me write this data to a .csv file (I'm a beginner).
Posted on 2020-10-03 17:50:36
Use a csv writer and write each row to the CSV file.
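As a minimal offline sketch of the csv writer idea, the snippet below skips the scraping step and starts from a few rows copied from the question's printed output. The helper name `rows_to_csv_text` and the `expected_columns` parameter are hypothetical, introduced only for illustration, and the text is written to an in-memory buffer rather than a file so it can be inspected directly.

```python
import csv
import io

# Sample rows copied from the question's output (empty lists included).
scraped_rows = [
    ['Number', '@name', 'Name', 'Followers', 'Influence Rank'],
    [],
    ['1', '@mashable', 'Pete Cashmore', '2037840', '59'],
    [],
    ['2', '@cnnbrk', 'CNN Breaking News', '3224475', '71'],
    [],
]


def rows_to_csv_text(rows, expected_columns=5):
    """Hypothetical helper: return CSV text for rows with exactly `expected_columns` cells."""
    # newline='' leaves the csv module's own line endings untouched.
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    for row in rows:
        # The empty lists have length 0, so this check drops them.
        if len(row) == expected_columns:
            writer.writerow(row)
    return buffer.getvalue()


csv_text = rows_to_csv_text(scraped_rows)
print(csv_text)
```

Writing to a real file works the same way: replace the buffer with `open('some_name.csv', 'w', newline='', encoding='utf-8')`, where the file name is whatever you choose.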
import csv
import requests
from bs4 import BeautifulSoup
website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')
csvfile = 'twitterusers2.csv'
# Python 2:
# with open(csvfile, 'wb') as outfile:
# Python 3: newline='' prevents blank lines between rows on Windows
with open(csvfile, 'w', newline='', encoding='utf-8') as outfile:
    wr = csv.writer(outfile)
    for tr in table_rows:
        td = tr.findAll('td')
        # In Python 2, .encode("utf8") is sometimes needed with Twitter data.
        # In Python 3, keep str values (encoding would write b'...' literals)
        # and let open(..., encoding='utf-8') handle the encoding.
        rows = [i.text for i in td]
        # skip empty rows and any row that does not have exactly 5 cells
        if len(rows) == 5:
            print(rows)
            wr.writerow(rows)
Posted on 2020-10-03 17:57:21
Another option is pandas, which takes care of building the table and exporting it to CSV. Here is the complete code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
website = requests.get('https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/').text
soup = BeautifulSoup(website, 'lxml')
table = soup.find('table')
table_rows = table.findAll('tr')
first = True          # True while processing the header row
details_dict = {}     # maps each column name to its list of values
count = 0             # index of the current column within a row
for tr in table_rows:
    td = tr.findAll('td')
    rows = [i.text for i in td]
    #print(rows)
    for i in rows:
        if first:
            # header row: create one empty column per header cell
            details_dict[i] = []
        else:
            # data row: append the cell to the column at position `count`
            key = list(details_dict.keys())[count]
            details_dict[key].append(i)
            count += 1
    count = 0
    first = False
#print(details_dict)
df = pd.DataFrame(details_dict)
df.to_csv('D:\\Output.csv', index=False)
Hope this helps!
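The column-by-column dictionary in this answer can also be avoided, since `pd.DataFrame` accepts a list of rows plus a `columns` argument. The sketch below assumes the rows have already been scraped and uses a few values copied from the question's output; it writes to an in-memory buffer instead of `D:\\Output.csv` so that it runs anywhere.

```python
import io

import pandas as pd

# Sample rows copied from the question's output (empty lists included).
rows = [
    ['Number', '@name', 'Name', 'Followers', 'Influence Rank'],
    [],
    ['1', '@mashable', 'Pete Cashmore', '2037840', '59'],
    [],
    ['2', '@cnnbrk', 'CNN Breaking News', '3224475', '71'],
]

# Drop the empty lists, then split the header row from the data rows.
header, *body = [row for row in rows if row]
df = pd.DataFrame(body, columns=header)

# Write the CSV text to a buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

All values stay as strings here; converting `Followers` to integers (for example with `df['Followers'].astype(int)`) is optional and only matters if you want to sort or compute on that column.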
Posted on 2020-10-03 18:03:54
The simplest way is to use pandas:
# pip install pandas lxml beautifulsoup4
import pandas as pd
URI = 'https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/'
# read the first table on the page and drop rows with missing values
data = pd.read_html(URI, flavor='lxml', skiprows=0, header=0)[0].dropna()
# save to a CSV file named data.csv
data.to_csv('data.csv', index=False, encoding='utf-8')
https://stackoverflow.com/questions/64182464