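This article explains how to scrape the comment counts and read counts of Baijiahao articles.
Both counts are served from a separate JSON endpoint, requested with a URL like this one: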
https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp&params=%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228997120757336896754%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171319%22%2C%22feed_id%22%3A%228997120757336896754%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229442416292259854102%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171220%22%2C%22feed_id%22%3A%229442416292259854102%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228994022518148142722%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221084000014170786%22%2C%22feed_id%22%3A%228994022518148142722%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229180210467318996709%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221110000014181138%22%2C%22feed_id%22%3A%229180210467318996709%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229470100560664750777%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221119000014172446%22%2C%22feed_id%22%3A%229470100560664750777%22%7D%5D&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1
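To see what the params query parameter of this URL actually carries, it can be percent-decoded and parsed as JSON with the standard library. The following is only a sketch: sample_url is a shortened copy of the URL above that keeps just the first entry, and decode_params is a hypothetical helper name, not something from the original code.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the URL above, keeping only the first entry of params
# (shortened for illustration; the real URL carries one entry per article).
sample_url = (
    "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"
    "&params=%5B%7B%22user_type%22%3A%223%22"
    "%2C%22dynamic_id%22%3A%229683117499664348209%22"
    "%2C%22dynamic_type%22%3A%222%22"
    "%2C%22dynamic_sub_type%22%3A%222001%22"
    "%2C%22thread_id%22%3A%221113000014175815%22"
    "%2C%22feed_id%22%3A%229683117499664348209%22%7D%5D"
    "&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1"
)


def decode_params(url):
    """Return the list of dicts stored in the URL's params query parameter."""
    # parse_qs percent-decodes each value and returns a list per key
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["params"][0])


entries = decode_params(sample_url)
for entry in entries:
    print(entry)
```

Decoded this way, params turns out to be a JSON array with one object per article, each holding the same six string fields.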
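How this URL is built
Its params value comes mainly from fields found in the previous JSON data:
"user_type"
"dynamic_id"
"dynamic_type"
"dynamic_sub_type"
"thread_id"
"feed_id"
These six fields are extracted for each article and concatenated, in percent-encoded form, into the URL.
The code is: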
import re

# title, url, date, cerate, publish, updated and asyncData come from the previous JSON data.
# readjson presumably already holds the URL up to and including "params=%5B%7B%22",
# and b is presumably the millisecond timestamp used for the "_" parameter.
for i in range(len(title)):
    # Pull the six fields out of the raw text of the i-th entry
    user_type = re.findall(r'"user_type":"(.+?)",', asyncData[i])[0]
    dynamic_id = re.findall(r'"dynamic_id":"(.+?)",', asyncData[i])[0]
    dynamic_type = re.findall(r'"dynamic_type":"(.+?)",', asyncData[i])[0]
    dynamic_sub_type = re.findall(r'"dynamic_sub_type":"(.+?)",', asyncData[i])[0]
    thread_id = re.findall(r'"thread_id":"(.+?)",', asyncData[i])[0]
    feed_id = re.findall(r'"feed_id":"(.+?)"', asyncData[i])[0]
    print(title[i], url[i], date[i], cerate[i], publish[i], updated[i])
    if i < len(title) - 1:
        # Not the last entry: close this object and open the next one ("},{")
        readjson += ('user_type%22%3A%22' + user_type + '%22%2C%22'
                     + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22'
                     + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22'
                     + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22'
                     + 'thread_id%22%3A%22' + thread_id + '%22%2C%22'
                     + 'feed_id%22%3A%22' + feed_id + '%22%7D%2C%7B%22')
    else:
        # Last entry: close the object and the whole array ("}]")
        readjson += ('user_type%22%3A%22' + user_type + '%22%2C%22'
                     + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22'
                     + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22'
                     + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22'
                     + 'thread_id%22%3A%22' + thread_id + '%22%2C%22'
                     + 'feed_id%22%3A%22' + feed_id + '%22%7D%5D')
# Append the remaining query parameters after the params array
readjson += '&uk=D0hHfmuMEVka02HZelKA7g&_=' + str(b)
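Note: the last feed_id is followed by %22%7D%5D (the encoded "}] that closes the array), not by the '%22%7D%2C%7B%22' (the encoded "},{" that opens the next entry) used after the earlier ones.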
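As an alternative to concatenating the percent-encoded pieces by hand, the same params string can be produced by serializing a list of dicts as compact JSON and percent-encoding the result, which removes the need for a special case on the last entry. This is only a sketch under assumptions: build_params, FIELDS and entries are hypothetical names, and the sample values are the first two entries from the URL at the top of this article.

```python
import json
from urllib.parse import quote

# Field order as it appears in the URL at the top of this article
FIELDS = ("user_type", "dynamic_id", "dynamic_type",
          "dynamic_sub_type", "thread_id", "feed_id")


def build_params(entries):
    """Serialize entries as compact JSON, then percent-encode everything."""
    # Keep only the six expected fields, in a fixed order
    ordered = [{name: entry[name] for name in FIELDS} for entry in entries]
    # separators=(",", ":") drops the spaces json.dumps adds by default
    compact = json.dumps(ordered, separators=(",", ":"))
    # safe="" makes quote() encode every reserved character, including "/"
    return quote(compact, safe="")


# Sample values: the first two entries from the example URL
entries = [
    {"user_type": "3", "dynamic_id": "9683117499664348209",
     "dynamic_type": "2", "dynamic_sub_type": "2001",
     "thread_id": "1113000014175815", "feed_id": "9683117499664348209"},
    {"user_type": "3", "dynamic_id": "8997120757336896754",
     "dynamic_type": "2", "dynamic_sub_type": "2001",
     "thread_id": "1106000014171319", "feed_id": "8997120757336896754"},
]

encoded = build_params(entries)
print(encoded)
```

The JSON serializer places the "},{" separators and the closing "}]" itself, so the difference described in the note above no longer has to be handled manually.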