Python Web Scraping in Practice: Scraping a Novel

吉吉的机器学习乐园

Published 2022-07-13 08:41:20

Included in the column: 吉吉的机器学习乐园

Today I'm sharing a simple web scraper that downloads a novel.

Page analysis

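First, open the home page of a novel site and find the page that lists free, completed novels.

Then pick any novel, open its detail page, and click through to the table of contents.

Press F12, or right-click and choose Inspect, then use the element picker to locate each chapter on the page and look at the link attached to it.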

All of the chapter links turn out to be wrapped in a ul whose class is cf, so those are the links we need to collect.
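To make the structure concrete, here is a minimal sketch that runs against an inline HTML snippet instead of the live site. The markup, the example.com hrefs, and the helper name extract_chapter_links are all hypothetical; the only detail taken from the page inspection is that the links sit inside a ul with class cf. It uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# the real page's chapter titles and hrefs will differ.
SAMPLE_HTML = """
<div class="volume">
  <ul class="cf">
    <li><a href="//example.com/chapter/1">Chapter 1</a></li>
    <li><a href="//example.com/chapter/2">Chapter 2</a></li>
  </ul>
</div>
<ul class="other"><li><a href="//example.com/about">About</a></li></ul>
"""


def extract_chapter_links(html):
    """Return (title, href) pairs for every <a> inside a <ul class="cf">."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for ul in soup.find_all("ul", class_="cf"):
        for a in ul.find_all("a"):
            links.append((a.get_text(strip=True), a.get("href")))
    return links


chapter_links = extract_chapter_links(SAMPLE_HTML)
print(chapter_links)
```

The link inside the ul with class other is ignored, which is the point of anchoring the search on the cf list rather than collecting every a tag on the page.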

We use the requests library to fetch the page (the getPage function) and BeautifulSoup to extract the links, then return a list holding each chapter's title and URL.

GetChapterUrl.py

# coding=utf-8
import requests
from bs4 import BeautifulSoup


class GetChapterUrl:
    def getPage(self, url):
        """Return the page HTML, or None if the request fails."""
        try:
            res = requests.get(url, timeout=10)
            if res.status_code == 200:
                return res.text
            print("Unexpected status code:", res.status_code)
        except Exception as e:
            print(e)
        return None

    def getChapterUrl(self, url):
        """Return a list of [chapter title, chapter URL] pairs."""
        print("Fetching chapter URLs...")
        urlsList = []
        try:
            pageText = self.getPage(url)
            if pageText is None:
                return urlsList
            soup = BeautifulSoup(pageText, 'lxml')
            # Every chapter link sits inside a <ul class="cf">.
            for ul in soup.find_all(name="ul", attrs={"class": "cf"}):
                for a in ul.find_all(name="a"):
                    href = a.get('href')
                    if href:
                        # The hrefs on this page lack the scheme, so prepend it.
                        urlsList.append([a.text, "https:" + href])
        except Exception as e:
            print(e)
        print("Finished fetching chapter URLs.")
        return urlsList
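Prepending "https:" works only while every href on the page is protocol-relative (starting with //). As a hedged alternative, the standard library's urllib.parse.urljoin resolves protocol-relative, root-relative, and already-absolute hrefs against the page URL. In the sketch below, the helper name to_absolute and the example.com hrefs are hypothetical; only the base URL comes from this article.

```python
from urllib.parse import urljoin

# The catalog URL used in this article; the hrefs below are hypothetical.
BASE_URL = "https://book.qidian.com/info/1029575290/"


def to_absolute(href, base=BASE_URL):
    """Resolve an href found on the page at `base` into an absolute URL."""
    return urljoin(base, href)


protocol_relative = to_absolute("//example.com/chapter/1")
root_relative = to_absolute("/chapter/2")
already_absolute = to_absolute("https://example.com/chapter/3")
print(protocol_relative)
print(root_relative)
print(already_absolute)
```

A protocol-relative href inherits the scheme of the base URL, a root-relative one inherits the scheme and host, and an absolute URL passes through unchanged.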

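Once we have the chapter links, open a chapter page and inspect it again with F12 or right-click. The chapter text sits in a div whose class is read-content j_readContent, with each paragraph wrapped in a p tag. BeautifulSoup's find_all method returns all of those p tags as a list, so we only need to iterate over the list and write the text to a .txt file encoded as UTF-8.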
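One detail is worth a small illustration: passing the full string "read-content j_readContent" as the class filter matches only when the class attribute is exactly that string. The sketch below uses hypothetical markup in which the classes appear in a different order with an extra class added, and compares the exact-string filter with a CSS selector that requires both classes in any order. This is not a claim about the live page, where the exact string evidently matched when the article was written.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter markup; note the reversed class order and the extra class.
SAMPLE_HTML = """
<div class="j_readContent read-content extra">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>Footer text outside the content div.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: finds nothing for this markup.
exact = soup.find_all("div", attrs={"class": "read-content j_readContent"})

# CSS selector: requires both classes, in any order, and tolerates extras.
paragraphs = [
    p.get_text(strip=True)
    for p in soup.select("div.read-content.j_readContent p")
]

print(len(exact))
print(paragraphs)
```

If the site ever reorders or extends the class list, the selector form keeps working while the exact-string form silently returns nothing.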

GetChapterContent.py

# coding=utf-8
import os

import requests
from bs4 import BeautifulSoup

from SpiderQiDian.GetChapterUrl import GetChapterUrl


class GetChapterContent:
    def __init__(self, url):
        self.url = url
        self.urlsList = GetChapterUrl().getChapterUrl(self.url)

    def getPage(self, url):
        """Return the page HTML, or None if the request fails."""
        try:
            res = requests.get(url, timeout=10)
            if res.status_code == 200:
                return res.text
        except Exception as e:
            print(e)
        return None

    def getChapterContent(self):
        print("Fetching chapter contents, please wait...")
        # open() does not create missing directories, so make sure book/ exists.
        os.makedirs("book", exist_ok=True)
        cnt = 1
        for url in self.urlsList:
            try:
                pageText = self.getPage(url[1])
                if pageText is None:
                    print(url[0] + ": failed to fetch, skipped.")
                    continue
                soup = BeautifulSoup(pageText, 'lxml')
                # The chapter text is in <p> tags inside the read-content div.
                paragraphs = []
                for div in soup.find_all(name="div", attrs={"class": "read-content j_readContent"}):
                    paragraphs.extend(div.find_all(name="p"))
                # "/" is a path separator, so it cannot appear in a file name.
                url[0] = url[0].replace("/", " ")
                with open("book/" + str(cnt) + " " + url[0] + ".txt", "w", encoding='utf-8') as f:
                    f.write(url[0] + "\n")
                    for content in paragraphs:
                        f.write(content.text + "\n")
                print(url[0] + ": done.")
                cnt += 1
            except Exception as e:
                print(e)
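The listing above only replaces "/" in chapter titles. Titles can also contain other characters that Windows rejects in file names, such as ":" or "?". The following sketch shows one way to build the file name more defensively; the helper name safe_filename and the sample titles are hypothetical and are not part of the original code.

```python
import re

# Characters that Windows rejects in file names; "/" is also invalid on POSIX.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(index, title, ext=".txt"):
    """Build a name like '3 Chapter title.txt' with illegal characters replaced by spaces."""
    cleaned = _ILLEGAL_CHARS.sub(" ", title).strip()
    return f"{index} {cleaned}{ext}"


print(safe_filename(1, "Prologue"))
print(safe_filename(2, "a/b"))
```

Replacing with a space mirrors what the original code does for "/", so existing file names stay the same for titles that contain no other special characters.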

Running the spider

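With the chapter-URL class and the chapter-content class wrapped up, we write a launcher script and pass it the URL of the novel's table-of-contents page.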

StartSpider.py

from SpiderQiDian.GetChapterContent import GetChapterContent

if __name__ == '__main__':
    # URL of the novel to scrape (the table-of-contents page)
    url = "https://book.qidian.com/info/1029575290/#Catalog"
    GetChapterContent(url).getChapterContent()