A web crawler is an Internet bot that automatically browses the World Wide Web, usually for the purpose of web indexing. Sites such as web search engines use crawler software to update their own web content or their indexes of other sites' content; a crawler can save the pages it visits so that a search engine can later build a searchable index from them. The workflow is as follows:

1. Find the URL pattern and build the URL list.
2. Send the requests and get the responses.
3. Extract the content.
4. Save the content.

Below we walk through these four steps from the previous chapter in detail, using a region-by-region crawl of Qichacha as the example.
The first step relies mainly on observation: by looking at how the page number appears in a region's listing URLs, we find the pattern and build the URL list. It can be implemented as follows:

url_list = []
url_temp = "https://www.qichacha.com/g_AH_{}.html"
for i in range(1, 501):
    url = url_temp.format(i)
    url_list.append(url)
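The same list can be built more compactly with a list comprehension, which is the more idiomatic form:

url_list = [url_temp.format(i) for i in range(1, 501)]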
The next step is to send the requests and get the responses, which raises a couple of problems to solve.
1. For sending the request we use the requests library; its requests.get() method does exactly this. A detailed account of how requests is used can be found at the link.
import requests
response = requests.get(url)
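Since the linked reference is not reproduced here, a minimal sketch of the requests calls this crawler relies on (the headers dict and timeout value are illustrative, not from the original):

import requests

headers = {"user-agent": "your user-agent string"}  # the site may reject requests without one
response = requests.get("https://www.qichacha.com/g_AH_1.html", headers=headers, timeout=10)

print(response.status_code)    # 200 on success
print(response.encoding)       # encoding guessed from the response headers
page_bytes = response.content  # raw bytes; this is what etree.HTML() is fed below
page_text = response.text      # decoded text, if you prefer strings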
There are several ways to extract content from the response, for example regular expressions or XPath; here we use XPath. Python's lxml library provides the parsing and extraction tools: its etree.HTML() method converts the response into an Element object, which we then traverse with XPath expressions to pull out the content.
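As a minimal, self-contained sketch of that parse-then-XPath workflow (the HTML fragment below is invented for illustration; the real XPath expressions used against Qichacha appear in the full program further down):

from lxml import etree

# A made-up fragment standing in for a downloaded listing page.
page = b"""
<div class="col-md-12">
  <section><span class="name">Example <em>Trading</em> Co.</span></section>
  <section><span class="name">Another Company Ltd.</span></section>
</div>
"""

html = etree.HTML(page)  # bytes (or str) -> Element tree
for section in html.xpath("//div[@class='col-md-12']/section"):
    # join all text() nodes inside the <span class="name"> element
    parts = section.xpath(".//span[@class='name']//text()")
    print("".join(parts).strip())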
For the details of XPath syntax, see the link.

The last step is to save the content. The extracted items could be stored in a database or in a plain text file; here we save them to a CSV file:
def save_content(self, item_list):
    name_list = ['company_name', 'company_master', 'company_time', 'company_capital', 'company_catagory', 'company_position']
    with open("HB.csv", "a") as f:
        f_csv = csv.DictWriter(f, fieldnames=name_list)
        f_csv.writeheader()
        for item in item_list:
            f_csv.writerow(item)
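Note that because the file is opened in append mode and writeheader() is called on every invocation, the header row ends up repeated once per scraped page in HB.csv. A minimal variation (the save_content_once helper is hypothetical, not part of the original code) writes the header only when the file is first created:

import csv
import os

def save_content_once(item_list, path="HB.csv"):
    # Hypothetical variant of save_content: the header is written only for a brand-new file.
    name_list = ['company_name', 'company_master', 'company_time',
                 'company_capital', 'company_catagory', 'company_position']
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        f_csv = csv.DictWriter(f, fieldnames=name_list)
        if write_header:
            f_csv.writeheader()
        f_csv.writerows(item_list)  # append this page's rows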
2. As for getting the response content, the value returned by requests.get() is the response itself, so assigning it to a variable, as in the snippet above, is all that is needed.

Putting the four steps together, the complete crawler looks like this:
# coding: utf-8
import requests
import csv
from lxml import etree
import time


class SpiderAnhui:
    def __init__(self):
        self.temp_url = "https://www.qichacha.com/g_HB_{}"  # listing-page URL template for one region
        self.headers = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9",
            "cache-control": "max-age=0",
            "cookie": "your cookies here",
            "referer": "https://www.qichacha.com/",
            "upgrade-insecure-requests": "1",
            "user-agent": "your information"
        }

    def get_url_list(self):  # build the list of page URLs
        url_list = []
        for i in range(1, 501):
            url = self.temp_url.format(i)
            url_list.append(url)
        return url_list

    def parse_url(self, url):  # send the request, parse the response
        print("parsing:", url)
        response = requests.get(url, headers=self.headers)
        html = etree.HTML(response.content)
        # print(html)
        return html

    def get_content(self, html):  # extract the fields of each company
        item_list = []
        company_list = html.xpath("//div[@class='col-md-12']/section")
        # print(company_list)
        for company in company_list:
            company_name_1 = company.xpath(".//span[@class='name']/em/text()")
            company_name_2 = company.xpath(".//span[@class='name']/text()")
            company_name = "".join(company_name_1 + company_name_2)
            company_master = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[1]")
            company_master = [i.replace("\t", "") for i in company_master]
            company_master = [i.replace(" ", "") for i in company_master]
            company_master = "".join(company_master)
            company_time = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[2]")
            company_time = [i.replace("\t", "") for i in company_time]
            company_time = [i.replace(" ", "") for i in company_time]
            company_time = [i.replace("\n", "") for i in company_time]
            company_time = "".join(company_time)
            company_capital = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[3]")
            company_capital = [i.replace("\t", "") for i in company_capital]
            company_capital = [i.replace("\n", "") for i in company_capital]
            company_capital = [i.replace(" ", "") for i in company_capital]
            company_capital = "".join(company_capital)
            company_catagory = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[4]")
            company_catagory = [i.replace("\t", "") for i in company_catagory]
            company_catagory = [i.replace("\n", "") for i in company_catagory]
            company_catagory = [i.replace(" ", "") for i in company_catagory]
            company_catagory = "".join(company_catagory)
            company_position = company.xpath("./a/span[2]/small[2]/text()")
            company_position = ("".join(company_position)).replace(" ", "")
            item = dict(
                company_name=company_name,
                company_master=company_master,
                company_time=company_time,
                company_capital=company_capital,
                company_catagory=company_catagory,
                company_position=company_position
            )
            print(item)
            item_list.append(item)
        return item_list

    def save_content(self, item_list):  # save the extracted items to CSV
        name_list = ['company_name', 'company_master', 'company_time', 'company_capital', 'company_catagory', 'company_position']
        with open("HB.csv", "a") as f:
            f_csv = csv.DictWriter(f, fieldnames=name_list)
            f_csv.writeheader()
            for item in item_list:
                f_csv.writerow(item)
        print("save success")

    def run(self):
        # 1. build the URL list
        url_list = self.get_url_list()
        for url in url_list:
            # 2. send the request, get the response
            html = self.parse_url(url)
            # 3. extract the content
            item_list = self.get_content(html)
            # 4. save the content
            self.save_content(item_list)
            time.sleep(1)


if __name__ == '__main__':
    HB = SpiderAnhui()
    HB.run()
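To run the crawler, fill in your own cookie and user-agent values in self.headers (the site may reject requests without a valid logged-in session), then execute the script. One quick way to sanity-check the resulting file, assuming HB.csv was written as above:

import csv

# Print the first few saved records to confirm the fields look right.
with open("HB.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row_number, row in enumerate(reader, start=1):
        print(row["company_name"], row["company_position"])
        if row_number >= 5:
            break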