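In this article you will learn how to get started with spaCy quickly and easily. It is aimed especially at beginners and enthusiasts in the NLP field, and provides step-by-step instructions and clear examples.
spaCy is an NLP framework released by Explosion AI in February 2015. It is often described as one of the fastest NLP libraries available. Ease of use and support for neural network models are further advantages.
Step 1: Install spaCy
Open a terminal (command prompt) and run: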
pip install spacy
Step 2: Download a language model
Run the following command:
python -m spacy download en_core_web_lg
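This model (en_core_web_lg) is spaCy's largest English model, at 788 MB. Smaller English models are also available, and spaCy provides models for several languages (English, German, French, Spanish, Portuguese, Italian, Dutch, Greek).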
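Because the large model is a sizeable download, it can help to check which model is already installed before loading one. The following is a minimal sketch, assuming that spaCy models are installed as ordinary Python packages and can therefore be detected with the standard library. The names en_core_web_md and en_core_web_sm are assumptions about the smaller English models and do not come from this article, and first_installed is a hypothetical helper.

```python
import importlib.util

# Candidate model packages, largest first. en_core_web_lg comes from the
# article; the two smaller names are assumptions about available alternatives.
CANDIDATES = ["en_core_web_lg", "en_core_web_md", "en_core_web_sm"]


def first_installed(package_names):
    """Return the first importable package name, or None if none is installed."""
    for name in package_names:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


model_name = first_installed(CANDIDATES)
if model_name is None:
    print("No English model found; run: python -m spacy download en_core_web_lg")
else:
    print("Using model:", model_name)
```

The returned name can then be passed to spacy.load in the next step.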
Step 3: Import the library and load the model
After writing the following lines in your Python editor, you are ready for some NLP fun:
import spacy

nlp = spacy.load("en_core_web_lg")
Step 4: Create the sample text
sample_text = (
    "Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook. "
    "He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake. "
    "He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company. "
    "Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week. "
    "The CEO was startled by the question, and he took about 7 seconds to respond with no."
)
doc = nlp(sample_text)
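Step 5: Split the paragraph into sentences
Split the text into sentences and print the character length of each sentence after it:
# doc.sents yields one span per detected sentence
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text)
    print("Number of characters:", len(sentence.text))
    print("-----------------------------------")
Output: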
Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook.
Number of characters: 130
-----------------------------------
He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake.
Number of characters: 118
-----------------------------------
He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company.
Number of characters: 140
-----------------------------------
Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week.
Number of characters: 197
-----------------------------------
The CEO was startled by the question, and he took about 7 seconds to respond with no.
Number of characters: 85
-----------------------------------
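Step 6: Entity recognition
Entity recognition performance is an important evaluation criterion for NLP models. spaCy performs it with a single line of code, and does so quite successfully:
from spacy import displacy

# jupyter=True renders the highlighted entities inline in a notebook
displacy.render(doc, style="ent", jupyter=True)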
Output:
[Figure: displaCy rendering of the sample text with the recognized entities highlighted]
Step 7: Tokenization and part-of-speech tagging
Tokenize the text and look at some attributes of each token:
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_,
    ))
Output:
Mark      0   mark     False  False  Xxxx   PROPN  NNP
Zucker.   5   zucker.  False  False  Xxxxx  PROPN  NNP
took      16  take     False  False  xxxx   VERB   VBD
two       21  two      False  False  xxx    NUM    CD
days      25  day      False  False  xxxx   NOUN   NNS
to        30  to       False  False  xx     PART   TO
testify   33  testify  False  False  xxxx   VERB   VB
before    41  before   False  False  xxxx   ADP    IN
members   48  member   False  False  xxxx   NOUN   NNS
of        56  of       False  False  xx     ADP    IN
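Again, it is easy to apply and gives satisfying results right away. A brief explanation of the printed attributes:
text: the token itself
idx: the character offset at which the token starts in the document
lemma_: the base form (lemma) of the word
is_punct: whether the token is a punctuation symbol
is_space: whether the token is whitespace
shape_: the shape of the token, showing which letters are capitalized
pos_: the coarse-grained part-of-speech tag
tag_: the fine-grained part-of-speech tag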
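To make the idx attribute concrete, the sketch below recomputes start offsets for the first ten words of the sample text using only the standard library. It assumes plain whitespace splitting, which is a simplification and not spaCy's tokenizer; the two only agree here because this snippet contains no punctuation. The helper name token_offsets is hypothetical.

```python
import re


def token_offsets(text):
    """Return (token, start_offset) pairs for whitespace-delimited tokens."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


# First ten words of the sample text from Step 4.
snippet = "Mark Zuckerberg took two days to testify before members of"
offsets = token_offsets(snippet)
for token, start in offsets:
    print(token, start)
```

The printed offsets line up with the second column of the output above, where each value counts characters from the start of the text.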
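What is part-of-speech tagging?
It is the process of assigning a tag such as noun, verb, or adjective to each token after the whole text has been split into tokens.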
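Once every token carries a tag, a common next step is to tally the tags. The sketch below does this for the ten rows shown in the Step 7 output. The Row tuple is a hypothetical stand-in for spaCy tokens that carries only the text and pos_ attributes, so the example runs without a model.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy tokens: only the attributes used here.
Row = namedtuple("Row", ["text", "pos_"])

# Coarse tags copied from the Step 7 output.
rows = [
    Row("Mark", "PROPN"), Row("Zuckerberg", "PROPN"), Row("took", "VERB"),
    Row("two", "NUM"), Row("days", "NOUN"), Row("to", "PART"),
    Row("testify", "VERB"), Row("before", "ADP"), Row("members", "NOUN"),
    Row("of", "ADP"),
]


def pos_counts(tokens):
    """Count how often each coarse part-of-speech tag occurs."""
    return Counter(token.pos_ for token in tokens)


counts = pos_counts(rows)
for tag, count in sorted(counts.items()):
    print(tag, count)
```

Because the function only reads the pos_ attribute, passing the doc object from Step 4 instead of rows should work the same way.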
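Step 8: Nothing but numbers
When we are working with language and text, where do numbers come in?
Because machines need to convert everything into numbers in order to understand the world, each word in NLP is represented by an array of numbers (a word vector). Here is the word vector for "man" in spaCy's vocabulary: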
[-1.7310e-01, 2.0663e-01, 1.6543e-02, ....., -7.3803e-02]
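spaCy's word vectors have a length of 300. This can differ in other frameworks.
Once word vectors have been built, we can observe that words that are similar in context are also similar mathematically. Here are some examples: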
from scipy import spatial


def cosine_similarity(x, y):
    # SciPy returns the cosine distance, so subtract it from 1
    return 1 - spatial.distance.cosine(x, y)


# 'tomatos' is spelled as in the original run so that the scores below match
print("apple vs banana: ", cosine_similarity(nlp.vocab["apple"].vector, nlp.vocab["banana"].vector))
print("car vs banana: ", cosine_similarity(nlp.vocab["car"].vector, nlp.vocab["banana"].vector))
print("car vs bus: ", cosine_similarity(nlp.vocab["car"].vector, nlp.vocab["bus"].vector))
print("tomatos vs banana: ", cosine_similarity(nlp.vocab["tomatos"].vector, nlp.vocab["banana"].vector))
print("tomatos vs cucumber: ", cosine_similarity(nlp.vocab["tomatos"].vector, nlp.vocab["cucumber"].vector))
Output:
apple vs banana: 0.5831844210624695
car vs banana: 0.16172660887241364
car vs bus: 0.48169606924057007
tomatos vs banana: 0.38079631328582764
tomatos vs cucumber: 0.5478045344352722
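Impressive, isn't it? Similarity is higher when two fruits or vegetables, or two vehicles, are compared. When two unrelated objects such as a car and a banana are compared, similarity is quite low. The similarity between tomatoes and a banana is higher than between a car and a banana, but lower than for tomatoes vs cucumber and apple vs banana, which matches reality.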
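For readers curious about what the similarity score measures, here is a minimal sketch of cosine similarity written with the standard library only: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors are made up for illustration and are not taken from any spaCy model.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical toy vectors: two "fruits" point in a similar direction,
# the "vehicle" points elsewhere.
fruit_a = [0.9, 0.1, 0.0]
fruit_b = [0.8, 0.2, 0.1]
vehicle = [0.0, 0.1, 0.9]

print("fruit_a vs fruit_b:", cosine_similarity(fruit_a, fruit_b))
print("fruit_a vs vehicle:", cosine_similarity(fruit_a, vehicle))
```

Vectors pointing in the same direction score close to 1 and perpendicular vectors score close to 0, which is the same pattern the real word vectors show above.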
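Step 9: King = Queen + (Man - Woman)?
If everything is represented by numbers, and similarity can be calculated mathematically, can we do other calculations too? For example, if we subtract "woman" from "man" and add the difference to "queen", do we find "king"? Let's try: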
from scipy import spatial


def cosine_similarity(x, y):
    # SciPy returns the cosine distance, so subtract it from 1
    return 1 - spatial.distance.cosine(x, y)


man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector
king = nlp.vocab["king"].vector

# Add the man/woman difference to "queen" and compare with the real "king"
calculated_king = man - woman + queen
print("similarity between our calculated king vector and real king vector:", cosine_similarity(calculated_king, king))
Output:
similarity between our calculated king vector and real king vector: 0.771614134311676
You can try other word combinations and observe similarly promising results.
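The same arithmetic can be tried without a model on a toy vocabulary. The sketch below assumes hand-made two-dimensional vectors, where the first axis loosely stands for gender and the second for royalty. Both the vectors and the analogy helper are hypothetical and chosen so that the arithmetic works out exactly, which real word vectors only approximate.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical 2-D toy vocabulary: axis 0 ~ gender, axis 1 ~ royalty.
toy_vectors = {
    "man": (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "king": (1.0, 1.0),
    "queen": (-1.0, 1.0),
    "banana": (0.0, -1.0),
}


def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


result = analogy("man", "woman", "queen", toy_vectors)
print("man - woman + queen is closest to:", result)
```

Excluding the three input words from the candidates is a design choice: otherwise one of the inputs can end up as its own nearest neighbour.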
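Conclusion
The aim of this article was to give a short and simple introduction to the spaCy framework and to show a few easy examples of NLP applications. Hopefully it was useful. You can find further details and many more examples on the well-designed and informative spaCy website.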