Scraping Baijiahao (百家号), Part 2

Centy Zhao
Published 2019-12-26 16:44:33
This article is part of the column: icecream小屋

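This article covers how to scrape the comment count and read count of Baijiahao articles.

Neither count is embedded in the article list itself. Both are served by a separate JSON (JSONP) endpoint: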

https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp&params=%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228997120757336896754%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171319%22%2C%22feed_id%22%3A%228997120757336896754%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229442416292259854102%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171220%22%2C%22feed_id%22%3A%229442416292259854102%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228994022518148142722%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221084000014170786%22%2C%22feed_id%22%3A%228994022518148142722%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229180210467318996709%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221110000014181138%22%2C%22feed_id%22%3A%229180210467318996709%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229470100560664750777%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221119000014172446%22%2C%22feed_id%22%3A%229470100560664750777%22%7D%5D&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1
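The percent-encoded `params` value is hard to read as is. As a minimal sketch (using only the Python standard library, and shortening the URL to its first entry for readability), it can be decoded to show that it is simply a JSON array of objects:

```python
import json
from urllib.parse import urlsplit, parse_qs

# The article's URL, shortened here to the first entry of the params array.
url = (
    "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"
    "&params=%5B%7B%22user_type%22%3A%223%22"
    "%2C%22dynamic_id%22%3A%229683117499664348209%22"
    "%2C%22dynamic_type%22%3A%222%22"
    "%2C%22dynamic_sub_type%22%3A%222001%22"
    "%2C%22thread_id%22%3A%221113000014175815%22"
    "%2C%22feed_id%22%3A%229683117499664348209%22%7D%5D"
    "&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1"
)

# parse_qs splits the query string and percent-decodes each value.
query = parse_qs(urlsplit(url).query)

# The decoded params value is plain JSON: a list with one dict per article.
params = json.loads(query["params"][0])
print(json.dumps(params, indent=2))
print(query["callback"][0])
```

In the full URL the array holds one such object per article, separated by commas.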

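Breaking down the URL

The params value is built mainly from fields taken from the JSON data obtained in the previous part:

"user_type"

"dynamic_id"

"dynamic_type"

"dynamic_sub_type"

"thread_id"

"feed_id"

For each article, these fields are concatenated into the percent-encoded params string.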
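An alternative to concatenating percent-encoded fragments by hand is to let the standard library do the encoding. The sketch below is not the author's method: `build_params` is a hypothetical helper, and it assumes the six field values have already been extracted into dicts. It serializes the list with `json.dumps` and encodes it with `urllib.parse.quote`, which yields the same encoding seen in the URL above:

```python
import json
from urllib.parse import quote

def build_params(entries):
    """Serialize a list of dicts as compact JSON and percent-encode it."""
    # separators=(",", ":") removes the spaces json.dumps adds by default,
    # so the output matches the compact form used in the URL.
    compact = json.dumps(entries, separators=(",", ":"))
    # safe="" forces every reserved character to be encoded.
    return quote(compact, safe="")

# Values copied from the first two entries of the example URL.
entries = [
    {"user_type": "3", "dynamic_id": "9683117499664348209",
     "dynamic_type": "2", "dynamic_sub_type": "2001",
     "thread_id": "1113000014175815", "feed_id": "9683117499664348209"},
    {"user_type": "3", "dynamic_id": "8997120757336896754",
     "dynamic_type": "2", "dynamic_sub_type": "2001",
     "thread_id": "1106000014171319", "feed_id": "8997120757336896754"},
]

encoded = build_params(entries)
print(encoded)
```

Because the separators and closing brackets come from the JSON serializer, there is no need to treat the last entry as a special case.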

The code is as follows:

import re

# title, url, date, cerate, publish, updated and asyncData are the lists built in the previous part.
# readjson is presumably the URL prefix ending in 'params=%5B%7B%22', and b a millisecond timestamp.
for i in range(len(title)):
    # Pull each field out of the raw JSON text of the i-th article.
    user_type = re.findall(r'"user_type":"(.+?)",', asyncData[i])[0]
    dynamic_id = re.findall(r'"dynamic_id":"(.+?)",', asyncData[i])[0]
    dynamic_type = re.findall(r'"dynamic_type":"(.+?)",', asyncData[i])[0]
    dynamic_sub_type = re.findall(r'"dynamic_sub_type":"(.+?)",', asyncData[i])[0]
    thread_id = re.findall(r'"thread_id":"(.+?)",', asyncData[i])[0]
    feed_id = re.findall(r'"feed_id":"(.+?)"', asyncData[i])[0]
    print(title[i], url[i], date[i], cerate[i], publish[i], updated[i])
    if i < len(title) - 1:
        # Not the last entry: close this object and open the next one ('},{"').
        readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
            + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
            + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
            + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
            + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
            + 'feed_id%22%3A%22' + feed_id + '%22%7D%2C%7B%22'
    else:
        # Last entry: close the object and the array ('}]').
        readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
            + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
            + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
            + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
            + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
            + 'feed_id%22%3A%22' + feed_id + '%22%7D%5D'

# After the loop, append the remaining query parameters.
readjson += '&uk=D0hHfmuMEVka02HZelKA7g&_=' + str(b)

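Note: the feed_id of the last entry is followed by %22%7D%5D, which decodes to "}] and closes the array, rather than the '%22%7D%2C%7B%22' separator (decoded: "},{") used after the earlier entries.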
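Since the URL requests `format=jsonp` with `callback=jsonp1`, the response is presumably wrapped in a `jsonp1(...)` call rather than being bare JSON. The following is a minimal sketch of stripping that wrapper before parsing. `unwrap_jsonp` is a hypothetical helper, and the sample body is invented for illustration: the article does not show the real response, so its field names are assumptions.

```python
import json
import re

def unwrap_jsonp(text):
    """Strip a `callback(...)` wrapper and parse the JSON inside it."""
    # Match: callback name, opening paren, payload, closing paren, optional ';'.
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Hypothetical response body; the real keys are not given in the article.
sample = 'jsonp1({"errno": 0, "data": {"example_key": 1}})'

payload = unwrap_jsonp(sample)
print(payload["data"])
```

Once the wrapper is removed, the counts can be read from the parsed object with ordinary dictionary access, using whatever keys the real response contains.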

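This article is part of the Tencent Cloud self-media syndication program and is shared from the author's personal site/blog.
In case of infringement, contact cloudcommunity@tencent.com to request removal.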
