Natural Language Processing with Python (2): Accessing Text Corpora and Lexical Resources

Source: 企鹅号 (Tencent Penguin account) - 宇哥情报局

First, an introduction to several corpora:

'''
name: pikachu
title: Natural Language Processing with Python (2): Accessing Text Corpora and Lexical Resources
version: 1.0
date: 20181227
'''

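import nltk
from nltk.book import *  # loads the NLTK book's sample texts; slow on first import
from nltk.corpus import gutenberg  # explicit, rather than relying on the star import above
from nltk.corpus import webtext
from nltk.corpus import nps_chat
from nltk.corpus import brown
from nltk.corpus import reuters
from nltk.corpus import inaugural
from nltk.corpus import udhr
from nltk.corpus import PlaintextCorpusReader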

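'''2.1.1 The Gutenberg corpus'''
print(nltk.corpus.gutenberg.fileids())
'''Loop over the Gutenberg file identifiers listed above and compute statistics for each text. To keep the output compact, int() truncates every value to an integer.'''
print('avg word length', 'avg sentence length', 'sentences', 'avg occurrences per word', 'fileid')
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))   # raw() counts every character, including spaces
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))  # sents() splits the text into sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # case-folded vocabulary
    print(int(num_chars / num_words), int(num_words / num_sents), int(num_sents), int(num_words / num_vocab), fileid)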
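To make the ratios above concrete without downloading the Gutenberg data, here is a minimal sketch that computes them for a tiny hand-written sample. The `sample_sents` list and the `corpus_stats` helper are hypothetical stand-ins for `gutenberg.sents(fileid)` and the loop body; real NLTK tokenization and the space-inclusive character count from `raw()` will give different numbers.

```python
# Hypothetical pre-tokenized sample standing in for gutenberg.sents(fileid).
sample_sents = [
    ["The", "cat", "sat", "on", "the", "mat", "."],
    ["The", "dog", "sat", "too", "."],
]


def corpus_stats(sents):
    """Return (avg word length, avg sentence length, avg occurrences per word)."""
    words = [w for sent in sents for w in sent]
    num_chars = sum(len(w) for w in words)       # unlike raw(), this excludes spaces
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})  # case-folded vocabulary size
    return (num_chars / num_words, num_words / num_sents, num_words / num_vocab)


avg_word_len, avg_sent_len, avg_occurrences = corpus_stats(sample_sents)
print(int(avg_word_len), int(avg_sent_len), int(avg_occurrences))
```

The last ratio is a rough lexical-diversity measure: the higher it is, the more often each distinct word is reused.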

'''2.1.2 Web and chat text'''
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:10])  # first 10 characters of each file
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[1])

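'''2.1.3 The Brown corpus'''
'''The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre.'''
print(brown.categories())
'''The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare how modal verbs are used across genres.'''
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m])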

'''Next, we obtain counts for each genre of interest, using NLTK's support for conditional frequency distributions.'''
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)  # tabulate() prints the table itself and returns None
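The idea behind the table above can be sketched in plain Python: keep one counter per condition (genre), then print only the conditions and samples of interest. The `pairs` data and the `tabulate` helper below are hypothetical and only approximate what `ConditionalFreqDist.tabulate` does; they are not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical (genre, word) pairs standing in for the Brown corpus.
pairs = [
    ("news", "will"), ("news", "can"), ("news", "will"), ("news", "The"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]

# One Counter per condition: the core of a conditional frequency distribution.
counts = defaultdict(Counter)
for genre, word in pairs:
    counts[genre][word] += 1


def tabulate(counts, conditions, samples):
    """Return table rows (strings) restricted to the given conditions and samples."""
    width = max(len(c) for c in conditions)
    rows = [" " * width + "".join(f"{s:>7}" for s in samples)]  # header row
    for cond in conditions:
        cells = "".join(f"{counts[cond][s]:>7}" for s in samples)
        rows.append(f"{cond:>{width}}" + cells)
    return rows


modals = ["can", "could", "will"]
table = tabulate(counts, ["news", "romance"], modals)
print("\n".join(table))
```

Words that never occur under a condition simply count as zero, which is why the table can list modals that a genre does not use.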

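'''2.1.4 The Reuters corpus'''
'''The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and grouped into two sets, "training" and "test"; thus the document with fileid "test/14826" belongs to the test set.'''
print(reuters.fileids())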

'''2.1.5 The Inaugural Address corpus'''
print(inaugural.fileids())
'''Let's look at how the words america and citizen are used over time.'''
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])  # the first four characters of each fileid are the year
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target)
)
cfd.plot()
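The nested comprehension above does two things at once: it matches words by prefix and it groups counts by the year taken from the fileid. A minimal sketch of that logic on made-up data follows; the `speeches` dictionary and its tokens are hypothetical, and only the `YYYY-Name.txt` fileid shape mirrors the real corpus.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: fileid -> tokens.
speeches = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "the", "Senate"],
    "2009-Obama.txt": ["America", ",", "Americans", "and", "citizens"],
}
targets = ["america", "citizen"]

cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    for w in words:
        for target in targets:
            # Prefix match, so 'Americans' counts toward 'america'.
            if w.lower().startswith(target):
                cfd[target][fileid[:4]] += 1  # condition = target, sample = year

for target in targets:
    print(target, sorted(cfd[target].items()))
```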

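'''2.1.6 Annotated text corpora'''
'''Many text corpora contain linguistic annotations such as part-of-speech tags, named entities, syntactic structures, and semantic roles. NLTK provides convenient ways to access several of these corpora, and has a data package containing corpora and corpus samples that is freely downloadable for teaching and research.'''
'''2.1.7 Corpora in other languages'''
'''NLTK includes corpora in many languages. In some cases you will need to learn how to handle character encodings in Python before using them.'''
# udhr fileids carry an encoding suffix; confirm the exact names with udhr.fileids()
fileids = ['Chinese_Mandarin-UTF8', 'English-Latin1']
cfd = nltk.ConditionalFreqDist(
    (fileid, len(word))
    for fileid in fileids
    for word in udhr.words(fileid)
)
cfd.plot(cumulative=True)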
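The encoding suffix on each fileid matters because the same text is stored as different bytes under different encodings. The following standard-library sketch illustrates this with a sample string of my own choosing; it does not read any NLTK corpus.

```python
text = "语料库"  # "corpus" in Chinese, three characters

utf8_bytes = text.encode("utf-8")   # three bytes per character for this string
gb_bytes = text.encode("gb2312")    # two bytes per character

print(len(text), len(utf8_bytes), len(gb_bytes))

# Decoding with the matching codec restores the original string.
roundtrip_ok = (
    utf8_bytes.decode("utf-8") == text and gb_bytes.decode("gb2312") == text
)

# Decoding with the wrong codec fails for these bytes.
try:
    gb_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print("round trip ok:", roundtrip_ok)
print("decoding GB2312 bytes as UTF-8 failed:", wrong_codec_failed)
```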

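'''2.1.8 Loading your own corpus'''
corpus_root = 'C:/Users/'  # replace with the directory that holds your own text files
wordlists = PlaintextCorpusReader(corpus_root, '.*')  # '.*' matches every file under corpus_root
print(wordlists.fileids())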
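As a self-contained variant of the step above, this sketch writes two throwaway files into a temporary directory and reads them back with `PlaintextCorpusReader`. The file names and contents are hypothetical, and the import guard is my addition so that the sketch skips itself when NLTK is not installed. Only `fileids()` and `words()` are used, since `sents()` additionally needs the Punkt tokenizer data.

```python
import os
import tempfile

try:
    from nltk.corpus import PlaintextCorpusReader
except (ImportError, LookupError):  # NLTK unavailable: skip the demo below
    PlaintextCorpusReader = None

fileids, words_a = [], []

if PlaintextCorpusReader is not None:
    with tempfile.TemporaryDirectory() as corpus_root:
        # Two hypothetical documents written just for this example.
        samples = [("a.txt", "Hello corpus world."), ("b.txt", "A second file.")]
        for name, text in samples:
            with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as f:
                f.write(text)

        # Match only .txt files instead of everything under the root.
        wordlists = PlaintextCorpusReader(corpus_root, r".*\.txt")
        fileids = wordlists.fileids()
        words_a = list(wordlists.words("a.txt"))  # read fully before the directory is removed

print(fileids)
print(words_a)
```

Restricting the pattern to `.*\.txt` is a design choice: a catch-all pattern such as `.*` on a large directory also picks up binary files that cannot be tokenized as text.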

'''2.3 Wordlist corpora'''
names = nltk.corpus.names
male_names = names.words('male.txt')
female_names = names.words('female.txt')
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])  # condition = file (male/female), sample = last letter of the name
    for fileid in names.fileids()
    for name in names.words(fileid)
)
cfd.plot()

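Structure of text corpora:

Basic corpus functions defined in NLTK:

2.2 Conditional frequency distributions

2.5 Introduction to WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets (synsets). We start by looking up synonyms and seeing how they are accessed in WordNet.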
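A minimal sketch of a synonym lookup through NLTK's WordNet interface follows. The `lookup_synonyms` helper is hypothetical; it assumes the WordNet data has been installed (for example with `nltk.download('wordnet')`) and returns `None` otherwise, rather than raising.

```python
def lookup_synonyms(word):
    """Return {synset name: lemma names} for word, or None if WordNet is unavailable."""
    try:
        from nltk.corpus import wordnet as wn
        # Each synset groups the lemmas (words) that share one meaning.
        return {s.name(): s.lemma_names() for s in wn.synsets(word)}
    except (ImportError, LookupError):
        # NLTK or its WordNet data is missing.
        return None


result = lookup_synonyms("motorcar")
print(result)
```

For `motorcar`, the NLTK book reports a single synset, `car.n.01`, whose lemma names are the synonyms of the word.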

Published: 2018-12-27 18:28:03
Original link: https://kuaibao.qq.com/s/20181227G162M600?refer=cp_1026
