前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >专栏 >18.scrapy_maitian

18.scrapy_maitian

作者头像
hankleo
发布于 2020-09-17 03:03:01
发布于 2020-09-17 03:03:01
29100
代码可运行
举报
文章被收录于专栏:Hank’s BlogHank’s Blog
运行总次数:0
代码可运行

ershoufang.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
# -*- coding: utf-8 -*-
import scrapy

class ErshoufangSpider(scrapy.Spider):
    name = 'ershoufang'
    allowed_domains = ['maitian.com']
    start_urls = ['http://maitian.com/']

    def parse(self, response):
        pass

zufang_spider.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
import scrapy
from maitian.items import MaitianItem

class MaitianSpider(scrapy.Spider):
    name = "zufang"
    start_urls = ['http://bj.maitian.cn/zfall/PG1']

    def parse(self, response):
        for zufang_itme in response.xpath('//div[@class="list_title"]'):
            yield {
                'title': zufang_itme.xpath('./h1/a/text()').extract_first().strip(),
                'price': zufang_itme.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip(),
                'area': zufang_itme.xpath('./p/span/text()').extract_first().replace('㎡', '').strip(),
                'district': zufang_itme.xpath('./p//text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')[0],
            }

        next_page_url = response.xpath(
            '//div[@id="paging"]/a[@class="down_page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

items.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MaitianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    district = scrapy.Field()

middlewares.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class MaitianSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MaitianDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.conf import settings

class MaitianPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        zufang = dict(item)
        self.post.insert(zufang)
        return item

settings.py

代码语言:javascript
代码运行次数:0
运行
AI代码解释
复制
# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maitian'

SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'maitian (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maitian.pipelines.MaitianPipeline': 300,
#}
ITEM_PIPELINES = {
   'maitian.pipelines.DuplicatesPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

ITEM_PIPELINES = {'maitian.pipelines.MaitianPipeline': 300,}

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'maitian'
MONGODB_DOCNAME = 'zufang'
本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2019-05-04 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
暂无评论
推荐阅读
编辑精选文章
换一批
拥有了这个, 天下的美图都是你的!!!
今天本狗就给大家分享一串神奇的 ” 东东“, 它可以下载任意多的图片,因为本狗很喜欢那个网站的图片了, 所以就,,,, 而且都是高清图哦!!在此分享给大家!!!
Python知识大全
2020/02/13
5030
拥有了这个, 天下的美图都是你的!!!
scapy 如何爬取妹子图全站
Scrapy是用纯Python实现一个为了爬取网站数据、提取结构性数据而编写的应用框架,用途非常广泛。
百里丶落云
2023/11/14
2440
安装和使用Scrapy
可以先创建虚拟环境并在虚拟环境下使用pip安装scrapy。 $ 项目的目录结构如下图所示。 (venv) $ tree . |____ scrapy.cfg |____ douban | |____ spiders | | |____ __init__.py | | |____ __pycache__ | |____ __init__.py | |____ __pycache__ | |____ middlewares.py | |____ settings.py | |____ items.py |
用户8442333
2021/05/21
5030
爬取4567电影网「建议收藏」
发布者:全栈程序员栈长,转载请注明出处:https://javaforall.cn/154556.html原文链接:https://javaforall.cn
全栈程序员站长
2022/09/06
2.3K0
Python网络爬虫(七)- 深度爬虫CrawlSpider1.深度爬虫CrawlSpider2.链接提取:LinkExtractor3.爬取规则:rules4.如何在pycharm中直接运行爬虫5.
目录: Python网络爬虫(一)- 入门基础 Python网络爬虫(二)- urllib爬虫案例 Python网络爬虫(三)- 爬虫进阶 Python网络爬虫(四)- XPath Python网络爬虫(五)- Requests和Beautiful Soup Python网络爬虫(六)- Scrapy框架 Python网络爬虫(七)- 深度爬虫CrawlSpider Python网络爬虫(八) - 利用有道词典实现一个简单翻译程序 深度爬虫之前推荐一个简单实用的库fake-useragent,可以伪装
Python攻城狮
2018/08/23
1.9K0
Python网络爬虫(七)- 深度爬虫CrawlSpider1.深度爬虫CrawlSpider2.链接提取:LinkExtractor3.爬取规则:rules4.如何在pycharm中直接运行爬虫5.
scrapy 爬取校花网,并作数据持久化处理
-:process_item方法中return item 的操作将item 传递给下一个即将被执行的管道类
百里丶落云
2023/11/14
4803
Scrapy 持续自动翻页爬取数据
概述 方案一: 根据URL寻找规律适用于没有下一页button的网页,或者button不是url的网页 [uhhxjjlim2.png] 方案二: 根据下一页button获取button内容 [pjnmr582t3.png] 修改代码 这里使用方案二 通过F12 得到下一页buton的Xpath [图片.png] # -*- coding: utf-8 -*- import scrapy from scrapy import Request from urllib.parse import urljoi
待你如初见
2019/03/21
5.4K0
Scrapy 持续自动翻页爬取数据
收藏| Scrapy框架各组件详细设置
大家好,关于Requests爬虫我们已经讲了很多。今天我们就说一下Scrapy框架各组件的详细设置方便之后更新Scrapy爬虫实战案例。
刘早起
2020/05/14
7560
Python爬虫Scrapy入门
Scrapy是Python开发的一个快速、高层次的屏幕抓取和web抓取框架,用于抓取web站点并从页面中提取结构化的数据。
里克贝斯
2021/05/21
6790
Python爬虫Scrapy入门
《手把手带你学爬虫──初级篇》第6课 强大的爬虫框架Scrapy
Scrapy是一个Python爬虫应用框架,爬取和处理结构性数据非常方便。使用它,只需要定制开发几个模块,就可以轻松实现一个爬虫,让爬取数据信息的工作更加简单高效。
GitOPEN
2019/01/29
1.2K0
《手把手带你学爬虫──初级篇》第6课  强大的爬虫框架Scrapy
Scrapy-Redis分布式抓取麦田二手房租房信息与数据分析准备工作租房爬虫二手房分布式爬虫数据分析及可视化
试着通过抓取一家房产公司的全部信息,研究下北京的房价。文章最后用Pandas进行了分析,并给出了数据可视化。 ---- 准备工作 麦田房产二手房页面(http://bj.maitian.cn/esfa
SeanCheney
2018/04/24
1.5K0
Scrapy-Redis分布式抓取麦田二手房租房信息与数据分析准备工作租房爬虫二手房分布式爬虫数据分析及可视化
利用Scrapy框架爬取LOL皮肤站高清壁纸
成品打包:点击进入 代码: 爬虫文件 # -*- coding: utf-8 -*- import scrapy from practice.items import PracticeItem from urllib import parse class LolskinSpider(scrapy.Spider): name = 'lolskin' allowed_domains = ['lolskin.cn'] start_urls = ['https://lolsk
SingYi
2022/07/14
4740
利用Scrapy框架爬取LOL皮肤站高清壁纸
爬虫之scrapy-splash
目前,为了加速页面的加载速度,页面的很多部分都是用JS生成的,而对于用scrapy爬虫来说就是一个很大的问题,因为scrapy没有JS engine,所以爬取的都是静态页面,对于JS生成的动态页面都无法获得
周小董
2019/03/25
2K0
爬虫之scrapy-splash
scrapy框架的介绍
Scrapy Engine(引擎): 负责Spider、ItemPipeline、Downloader、Scheduler中间的通讯,信号、数据传递等。
用户2337871
2019/07/19
6380
scrapy框架的介绍
Python自动化开发学习-Scrapy
讲师博客:https://www.cnblogs.com/wupeiqi/p/6229292.html 中文资料(有示例参考):http://www.scrapyd.cn/doc/
py3study
2020/01/08
1.6K0
python之crawlspider初探
<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; overflow-wrap: break-word; font-family: "Courier New" !important; font-size: 12px !important;">""" 1、用命令创建一个crawlspider的模板:scrapy genspider -t crawl <爬虫名> <all_domain>,也可以手动创建 2、CrawlSpider中不能再有以parse为名字的数据提取方法,这个方法被CrawlSpider用来实现基础url提取等功能 3、一个Rule对象接受很多参数,首先第一个是包含url规则的LinkExtractor对象, 常有的还有callback(制定满足规则的解析函数的字符串)和follow(response中提取的链接是否需要跟进) 4、不指定callback函数的请求下,如果follow为True,满足rule的url还会继续被请求 5、如果多个Rule都满足某一个url,会从rules中选择第一个满足的进行操作 """</pre>
用户5760343
2019/08/02
5020
python之crawlspider初探
Scrapy项目实战:爬取某社区用户详情
get_cookies.py from selenium import webdriver from pymongo import MongoClient from scrapy.crawler import overridden_settings # from segmentfault import settings import time import settings class GetCookies(object): def __init__(self): # 初始化组件
hankleo
2020/09/17
5740
Scrapy框架: middlewares.py设置
# -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # https://doc.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals class DownloadtestSpiderMiddleware(object): # Not all m
hankleo
2020/09/17
3770
scrapy-redis 分布式哔哩哔哩
scrapy里面,对每次请求的url都有一个指纹,这个指纹就是判断url是否被请求过的。默认是开启指纹即一个URL请求一次。如果我们使用分布式在多台机上面爬取数据,为了让爬虫的数据不重复,我们也需要一个指纹。但是scrapy默认的指纹是保持到本地的。所有我们可以使用redis来保持指纹,并且用redis里面的set集合来判断是否重复。
py3study
2020/01/16
4280
python爬虫scrapy框架_python主流爬虫框架
闲来无聊,写了一个爬虫程序获取百度疫情数据。申明一下,研究而已。而且页面应该会进程做反爬处理,可能需要调整对应xpath。
全栈程序员站长
2022/11/07
1.4K0
相关推荐
拥有了这个, 天下的美图都是你的!!!
更多 >
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档
本文部分代码块支持一键运行,欢迎体验
本文部分代码块支持一键运行,欢迎体验