
IndexError: list index out of range (in a Reddit data crawler)

Asked by a Stack Overflow user on 2020-04-14 03:54:19
2 answers · 240 views · 0 followers · 0 votes

The following is expected to run without problems.

Reddit data crawler:

    import requests
    import re
    import praw
    from datetime import date
    import csv
    import pandas as pd
    import time
    import sys

    class Crawler(object):
        '''
            basic_url is the reddit site.
            headers is for requests.get method
            REX is to find submission ids.
        '''
        def __init__(self, subreddit="apple"):
            '''
                Initialize a Crawler object.
                    subreddit is the topic you want to parse. default is r"apple"
                basic_url is the reddit site.
                headers is for requests.get method
                REX is to find submission ids.
                submission_ids save all the ids of submission you will parse.
                reddit is an object created using praw API. Please check it before you use.
            '''
            self.basic_url = "https://www.reddit.com"
            self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
            self.REX = re.compile(r"<div class=\" thing id-t3_[\w]+")
            self.subreddit = subreddit
            self.submission_ids = []
            self.reddit = praw.Reddit(client_id="your_id", client_secret="your_secret", user_agent="subreddit_comments_crawler")

        def get_submission_ids(self, pages=2):
            '''
                Collect all ids of submissions..
                One page has 25 submissions.
                page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
                    id(after) is the last submission from last page.
            '''
    #         This is page url.
            url = self.basic_url + "/r/" + self.subreddit

            if pages <= 0:
                return []

            text = requests.get(url, headers=self.headers).text
            ids = self.REX.findall(text)
            ids = list(map(lambda x: x[-6:], ids))
            if pages == 1:
                self.submission_ids = ids
                return ids

            count = 0
            after = ids[-1]
            for i in range(1, pages):
                count += 25
                temp_url = self.basic_url + "/r/" + self.subreddit + "?count=" + str(count) + "&after=t3_" + ids[-1]
                text = requests.get(temp_url, headers=self.headers).text
                temp_list = self.REX.findall(text)
                temp_list = list(map(lambda x: x[-6:], temp_list))
                ids += temp_list
                if count % 100 == 0:
                    time.sleep(60)
            self.submission_ids = ids
            return ids

        def get_comments(self, submission):
            '''
                Submission is an object created using praw API.
            '''
    #         Remove all "more comments".
            submission.comments.replace_more(limit=None)
            comments = []
            for each in submission.comments.list():
                try:
                    comments.append((each.id, each.link_id[3:], each.author.name, date.fromtimestamp(each.created_utc).isoformat(), each.score, each.body) )
                except AttributeError as e: # Some comments are deleted, we cannot access them.
    #                 print(each.link_id, e)
                    continue
            return comments

        def save_comments_submissions(self, pages):
            '''
                1. Save all the ids of submissions.
                2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
                3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
                4. Separately, save them to two csv file.
                Note: You can link them with submission_id.
                Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
            '''

            print("Start to collect all submission ids...")
            self.get_submission_ids(pages)
            print("Start to collect comments...This may cost a long time depending on # of pages.")
            submission_url = self.basic_url + "/r/" + self.subreddit + "/comments/"
            comments = []
            submissions = []
            count = 0
            for idx in self.submission_ids:
                temp_url = submission_url + idx
                submission = self.reddit.submission(url=temp_url)
                submissions.append((submission.name[3:], submission.num_comments, submission.score, submission.subreddit_name_prefixed, date.fromtimestamp(submission.created_utc).isoformat(), submission.title, submission.selftext))
                temp_comments = self.get_comments(submission)
                comments += temp_comments
                count += 1
                print(str(count) + " submissions have got...")
                if count % 50 == 0:
                    time.sleep(60)
            comments_fieldnames = ["comment_id", "submission_id", "author_name", "post_time", "comment_score", "text"]
            df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
            df_comments.to_csv("comments.csv")
            submissions_fieldnames = ["submission_id", "num_of_comments", "submission_score", "submission_subreddit", "post_date", "submission_title", "text"]
            df_submission = pd.DataFrame(submissions, columns=submissions_fieldnames)
            df_submission.to_csv("submissions.csv")
            return df_comments


    if __name__ == "__main__":
        args = sys.argv[1:]
        if len(args) != 2:
            print("Wrong number of args...")
            exit()

        subreddit, pages = args
        c = Crawler(subreddit)
        c.save_comments_submissions(int(pages))

but I got:

    UserAir:scrape_reddit user$ python reddit_crawler.py apple 2
    Start to collect all submission ids...
    Traceback (most recent call last):
      File "reddit_crawler.py", line 127, in <module>
        c.save_comments_submissions(int(pages))
      File "reddit_crawler.py", line 94, in save_comments_submissions
        self.get_submission_ids(pages)
      File "reddit_crawler.py", line 54, in get_submission_ids
        after = ids[-1]
    IndexError: list index out of range

2 Answers

Stack Overflow user · Accepted answer

Answered on 2020-04-14 18:49:31

Erik's answer diagnoses the specific cause of this error, but more broadly, I think it's happening because you aren't using PRAW to its fullest. Your script imports requests and performs a lot of manual requests for things PRAW already has methods for. The whole point of PRAW is to keep you from having to write these requests that do things such as paginate a listing, so I recommend you take advantage of that.

For example, your get_submission_ids function (which scrapes the web version of Reddit and handles pagination) could be replaced by

def get_submission_ids(self, pages=2):
    return [
        submission.id
        for submission in self.reddit.subreddit(self.subreddit).hot(
            limit=25 * pages
        )
    ]

because the .hot() function does everything you're trying to do.
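As a side note: .hot() is only one of PRAW's listing generators, and swapping in a different sort is a small change. A minimal sketch, assuming the same Crawler class and credentials as above (the method names here are just illustrative):

def get_new_submissions(self, pages=2):
    # Same pattern as get_submissions below, but using the "new" listing instead of "hot".
    return list(self.reddit.subreddit(self.subreddit).new(limit=25 * pages))

def get_top_submissions(self, pages=2, time_filter="week"):
    # The "top" listing also accepts a time_filter such as "day", "week", or "all".
    return list(
        self.reddit.subreddit(self.subreddit).top(
            time_filter=time_filter, limit=25 * pages
        )
    )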

I would go a step further here and have this function return a list of Submission objects, because the rest of your code does things that are better done by interacting with PRAW's Submission objects. Here's that code (I've renamed the function to reflect its updated purpose):

def get_submissions(self, pages=2):
    return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))

(I've updated this function to just return its result, because your version both returned the value and set it as self.submission_ids, unless pages was 0. That felt inconsistent, so I just have it return the value.)
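For illustration, a minimal usage sketch of the renamed method, assuming valid client_id and client_secret values have been filled in:

crawler = Crawler("apple")
submissions = crawler.get_submissions(pages=2)  # up to 25 * 2 = 50 "hot" submissions
for submission in submissions[:3]:
    # Submission objects expose id, title, score, etc. directly.
    print(submission.id, submission.title)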

Your get_comments function looks good.

Like get_submission_ids, the save_comments_submissions function does a lot of manual work that PRAW can handle. You construct a temp_url that has the full URL of a post and then use it to create a PRAW Submission object, but we can replace that by directly using the ones returned by get_submissions. You also have some calls to time.sleep(), which I removed because PRAW will automatically sleep the appropriate amounts for you. Lastly, I removed the return value of this function, because the point of the function is to save data to disk, not to return it to anything else, and the rest of the script doesn't use the return value. Here's the updated version of that function:

def save_comments_submissions(self, pages):
    """
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv file.
        Note: You can link them with submission_id.
        Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
    """

    print("Start to collect all submission ids...")
    submissions = self.get_submissions(pages)
    print(
        "Start to collect comments...This may cost a long time depending on # of pages."
    )
    comments = []
    pandas_submissions = []
    for count, submission in enumerate(submissions):
        pandas_submissions.append(
            (
                submission.name[3:],
                submission.num_comments,
                submission.score,
                submission.subreddit_name_prefixed,
                date.fromtimestamp(submission.created_utc).isoformat(),
                submission.title,
                submission.selftext,
            )
        )
        temp_comments = self.get_comments(submission)
        comments += temp_comments
        print(str(count) + " submissions have got...")

    comments_fieldnames = [
        "comment_id",
        "submission_id",
        "author_name",
        "post_time",
        "comment_score",
        "text",
    ]
    df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
    df_comments.to_csv("comments.csv")
    submissions_fieldnames = [
        "submission_id",
        "num_of_comments",
        "submission_score",
        "submission_subreddit",
        "post_date",
        "submission_title",
        "text",
    ]
    df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
    df_submission.to_csv("submissions.csv")

Here's an updated version of the whole script that uses PRAW fully:

from datetime import date
import sys


import pandas as pd
import praw


class Crawler:
    """
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
    """

    def __init__(self, subreddit="apple"):
        """
            Initialize a Crawler object.
                subreddit is the topic you want to parse. default is r"apple"
            basic_url is the reddit site.
            headers is for requests.get method
            REX is to find submission ids.
            submission_ids save all the ids of submission you will parse.
            reddit is an object created using praw API. Please check it before you use.
        """
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(
            client_id="your_id",
            client_secret="your_secret",
            user_agent="subreddit_comments_crawler",
        )

    def get_submissions(self, pages=2):
        """
            Collect all submissions..
            One page has 25 submissions.
            page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
                id(after) is the last submission from last page.
        """
        return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))

    def get_comments(self, submission):
        """
            Submission is an object created using praw API.
        """
        #         Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append(
                    (
                        each.id,
                        each.link_id[3:],
                        each.author.name,
                        date.fromtimestamp(each.created_utc).isoformat(),
                        each.score,
                        each.body,
                    )
                )
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                #                 print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        """
            1. Save all the ids of submissions.
            2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
            3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
            4. Separately, save them to two csv file.
            Note: You can link them with submission_id.
            Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
        """

        print("Start to collect all submission ids...")
        submissions = self.get_submissions(pages)
        print(
            "Start to collect comments...This may cost a long time depending on # of pages."
        )
        comments = []
        pandas_submissions = []
        for count, submission in enumerate(submissions):
            pandas_submissions.append(
                (
                    submission.name[3:],
                    submission.num_comments,
                    submission.score,
                    submission.subreddit_name_prefixed,
                    date.fromtimestamp(submission.created_utc).isoformat(),
                    submission.title,
                    submission.selftext,
                )
            )
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            print(str(count) + " submissions have got...")

        comments_fieldnames = [
            "comment_id",
            "submission_id",
            "author_name",
            "post_time",
            "comment_score",
            "text",
        ]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")
        submissions_fieldnames = [
            "submission_id",
            "num_of_comments",
            "submission_score",
            "submission_subreddit",
            "post_date",
            "submission_title",
            "text",
        ]
        df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()

    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))

I realize that my answer here gets into code-review territory, but I hope it helps show some of what PRAW can do. Your "list index out of range" error would have been avoided by using pre-existing library code, so I do consider this a solution to your problem.

Votes: 3

Stack Overflow user

Answered on 2020-04-14 03:58:55

When my_list[-1] throws an IndexError, it means that my_list is empty:

>>> ids = []
>>> ids[-1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> ids = ['1']
>>> ids[-1]
'1'
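If you do want to keep the requests-based get_submission_ids, a defensive check before indexing avoids the crash when the regex finds nothing (for example, if Reddit serves HTML the pattern doesn't match). A minimal sketch, not part of the original code:

ids = self.REX.findall(text)
ids = list(map(lambda x: x[-6:], ids))
if not ids:
    # Nothing matched; bail out instead of hitting ids[-1] and raising IndexError.
    print("No submission ids found; check the regex or the response HTML.")
    return []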
Votes: 2
The original content of this page is provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/61207061