A web crawler is an Internet bot that automatically browses the World Wide Web, usually for the purpose of web indexing. Sites such as web search engines use crawler software to update their own web content or their indexes of other sites' content; a crawler can save the pages it visits so that a search engine can later build a searchable index from them. The workflow is as follows:

1. Find the URL pattern and build the URL list.
2. Send the requests and get the responses.
3. Extract the content.
4. Save the content.

Below we walk through these four steps from the previous chapter in detail, using a region-by-region crawl of Qichacha as the example.
The first step relies mainly on observation: by looking at how the page number appears in a region's listing URLs, we find the pattern and build the URL list. It can be implemented as follows:

url_list = []
url_temp = "https://www.qichacha.com/g_AH_{}.html"
for i in range(1, 501):
    url = url_temp.format(i)
    url_list.append(url)
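The same list can be built more compactly with a list comprehension, which is the more idiomatic form:

url_list = [url_temp.format(i) for i in range(1, 501)]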
The next step is to send the requests and get the responses, which raises a couple of problems to solve.
1. For sending the request we use the requests library; its requests.get() method does exactly this. A detailed account of how requests is used can be found at the link.
import requests
response = requests.get(url)
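Since the linked reference is not reproduced here, a minimal sketch of the requests calls this crawler relies on (the headers dict and timeout value are illustrative, not from the original):

import requests

headers = {"user-agent": "your user-agent string"}  # the site may reject requests without one
response = requests.get("https://www.qichacha.com/g_AH_1.html", headers=headers, timeout=10)

print(response.status_code)    # 200 on success
print(response.encoding)       # encoding guessed from the response headers
page_bytes = response.content  # raw bytes; this is what etree.HTML() is fed below
page_text = response.text      # decoded text, if you prefer strings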
There are several ways to extract content from the response, for example regular expressions or XPath; here we use XPath. Python's lxml library provides the parsing and extraction tools: its etree.HTML() method converts the response into an Element object, which we then traverse with XPath expressions to pull out the content.
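As a minimal, self-contained sketch of that parse-then-XPath workflow (the HTML fragment below is invented for illustration; the real XPath expressions used against Qichacha appear in the full program further down):

from lxml import etree

# A made-up fragment standing in for a downloaded listing page.
page = b"""
<div class="col-md-12">
  <section><span class="name">Example <em>Trading</em> Co.</span></section>
  <section><span class="name">Another Company Ltd.</span></section>
</div>
"""

html = etree.HTML(page)  # bytes (or str) -> Element tree
for section in html.xpath("//div[@class='col-md-12']/section"):
    # join all text() nodes inside the <span class="name"> element
    parts = section.xpath(".//span[@class='name']//text()")
    print("".join(parts).strip())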
For the details of XPath syntax, see the link.

The last step is to save the content. The extracted items could be stored in a database or in a plain text file; here we save them to a CSV file:
def save_content(self, item_list):
    name_list = ['company_name', 'company_master', 'company_time', 'company_capital', 'company_catagory', 'company_position']
    with open("HB.csv", "a") as f:
        f_csv = csv.DictWriter(f, fieldnames=name_list)
        f_csv.writeheader()
        for item in item_list:
            f_csv.writerow(item)
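Note that because the file is opened in append mode and writeheader() is called on every invocation, the header row ends up repeated once per scraped page in HB.csv. A minimal variation (the save_content_once helper is hypothetical, not part of the original code) writes the header only when the file is first created:

import csv
import os

def save_content_once(item_list, path="HB.csv"):
    # Hypothetical variant of save_content: the header is written only for a brand-new file.
    name_list = ['company_name', 'company_master', 'company_time',
                 'company_capital', 'company_catagory', 'company_position']
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        f_csv = csv.DictWriter(f, fieldnames=name_list)
        if write_header:
            f_csv.writeheader()
        f_csv.writerows(item_list)  # append this page's rows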
2. As for getting the response content, the value returned by requests.get() is the response itself, so assigning it to a variable, as in the snippet above, is all that is needed.

Putting the four steps together, the complete crawler looks like this:
# coding: utf-8
import requests
import csv
from lxml import etree
import time


class SpiderAnhui:
    def __init__(self):
        self.temp_url = "https://www.qichacha.com/g_HB_{}"  # listing-page URL template for one region
        self.headers = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9",
            "cache-control": "max-age=0",
            "cookie": "your cookies here",
            "referer": "https://www.qichacha.com/",
            "upgrade-insecure-requests": "1",
            "user-agent": "your information"
        }

    def get_url_list(self):  # build the list of page URLs
        url_list = []
        for i in range(1, 501):
            url = self.temp_url.format(i)
            url_list.append(url)
        return url_list

    def parse_url(self, url):  # send the request, parse the response
        print("parsing:", url)
        response = requests.get(url, headers=self.headers)
        html = etree.HTML(response.content)
        # print(html)
        return html

    def get_content(self, html):  # extract the fields of each company
        item_list = []
        company_list = html.xpath("//div[@class='col-md-12']/section")
        # print(company_list)
        for company in company_list:
            company_name_1 = company.xpath(".//span[@class='name']/em/text()")
            company_name_2 = company.xpath(".//span[@class='name']/text()")
            company_name = "".join(company_name_1 + company_name_2)
            company_master = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[1]")
            company_master = [i.replace("\t", "") for i in company_master]
            company_master = [i.replace(" ", "") for i in company_master]
            company_master = "".join(company_master)
            company_time = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[2]")
            company_time = [i.replace("\t", "") for i in company_time]
            company_time = [i.replace(" ", "") for i in company_time]
            company_time = [i.replace("\n", "") for i in company_time]
            company_time = "".join(company_time)
            company_capital = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[3]")
            company_capital = [i.replace("\t", "") for i in company_capital]
            company_capital = [i.replace("\n", "") for i in company_capital]
            company_capital = [i.replace(" ", "") for i in company_capital]
            company_capital = "".join(company_capital)
            company_catagory = company.xpath("./a/span[2]/small[1]//i[@class='i i-user3']/following::text()[4]")
            company_catagory = [i.replace("\t", "") for i in company_catagory]
            company_catagory = [i.replace("\n", "") for i in company_catagory]
            company_catagory = [i.replace(" ", "") for i in company_catagory]
            company_catagory = "".join(company_catagory)
            company_position = company.xpath("./a/span[2]/small[2]/text()")
            company_position = ("".join(company_position)).replace(" ", "")
            item = dict(
                company_name=company_name,
                company_master=company_master,
                company_time=company_time,
                company_capital=company_capital,
                company_catagory=company_catagory,
                company_position=company_position
            )
            print(item)
            item_list.append(item)
        return item_list

    def save_content(self, item_list):  # save the extracted items to CSV
        name_list = ['company_name', 'company_master', 'company_time', 'company_capital', 'company_catagory', 'company_position']
        with open("HB.csv", "a") as f:
            f_csv = csv.DictWriter(f, fieldnames=name_list)
            f_csv.writeheader()
            for item in item_list:
                f_csv.writerow(item)
        print("save success")

    def run(self):
        # 1. build the URL list
        url_list = self.get_url_list()
        for url in url_list:
            # 2. send the request, get the response
            html = self.parse_url(url)
            # 3. extract the content
            item_list = self.get_content(html)
            # 4. save the content
            self.save_content(item_list)
            time.sleep(1)


if __name__ == '__main__':
    HB = SpiderAnhui()
    HB.run()
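To run the crawler, fill in your own cookie and user-agent values in self.headers (the site may reject requests without a valid logged-in session), then execute the script. One quick way to sanity-check the resulting file, assuming HB.csv was written as above:

import csv

# Print the first few saved records to confirm the fields look right.
with open("HB.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row_number, row in enumerate(reader, start=1):
        print(row["company_name"], row["company_position"])
        if row_number >= 5:
            break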