Serverless in Practice: How Can NLP Be Used for Text Summarization and Keyword Extraction?

Tencent Cloud Serverless Team

Published 2020-06-06 14:10:19

Included in the column: Tencent Serverless Official Column

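Automatically extracting a summary and keywords from a text falls within natural language processing (NLP). A summary lets readers judge from a minimal amount of information whether an article is meaningful or valuable to them and whether it deserves a closer read. Keywords link articles to one another and let readers quickly locate article content related to a given keyword.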

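Both text summarization and keyword extraction can be combined with a traditional CMS: by modifying the publishing function for articles, news and so on, keywords and a summary can be extracted at publish time and placed in the HTML page as the Description and Keywords. To some extent this helps search engines index the page, so it falls under SEO.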
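As a rough illustration of the CMS integration described above, the following sketch renders a summary and a keyword list as HTML meta tags. The function name `build_meta_tags` and the sample strings are hypothetical and only serve the example; a real CMS would plug this into its own template layer.

```python
from html import escape


def build_meta_tags(summary, keywords):
    """Render description/keywords <meta> tags for a page's <head>."""
    # Escape quotes and angle brackets so the text is safe inside attributes.
    description = escape(summary, quote=True)
    keyword_list = escape(",".join(keywords), quote=True)
    return (
        f'<meta name="description" content="{description}">\n'
        f'<meta name="keywords" content="{keyword_list}">'
    )


print(build_meta_tags('A "monster" at sea', ["monster", "submarine"]))
```

Escaping matters here because extracted sentences often contain quotation marks that would otherwise break the attribute value.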

Keyword extraction

There are many ways to extract keywords, but the most common is probably TF-IDF.

Keyword extraction based on TF-IDF can be done with jieba:

import jieba.analyse
jieba.analyse.extract_tags(text, topK=5, withWeight=False, allowPOS=('n', 'vn', 'v'))
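To make the idea behind TF-IDF concrete, here is a minimal pure-Python sketch. It is not how jieba computes its scores (jieba relies on its own bundled IDF data); the tiny pre-tokenized corpus, the function name `tf_idf`, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word in doc_tokens by term frequency x inverse document frequency."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Number of corpus documents that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Add-one smoothing keeps the denominator non-zero.
        idf = math.log(len(corpus) / (1 + df))
        scores[word] = tf * idf
    return scores


# Hypothetical toy corpus of already-tokenized documents.
corpus = [
    ["ship", "sea", "monster"],
    ["ship", "port", "cargo"],
    ["ship", "sea", "storm"],
    ["ship", "crew", "captain"],
]
scores = tf_idf(corpus[0], corpus)
top = sorted(scores, key=scores.get, reverse=True)
print(top)  # ['monster', 'sea', 'ship']
```

A word that appears in every document ("ship") ends up at the bottom, while a word unique to one document ("monster") ranks first, which is the behavior that makes TF-IDF useful for keywords.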

Text summarization

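There are also many approaches to text summarization. Broadly speaking, they fall into extractive and abstractive methods. An extractive method uses an algorithm such as TextRank to find key sentences in the article and assembles them into a summary; this is relatively simple, but it struggles to capture the real semantics. An abstractive method uses techniques such as deep learning to extract the semantics of the text and then generates a summary.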

Put simply, every sentence in an extractive summary comes from the source text, whereas an abstractive summary is generated independently.
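The extractive idea can be sketched with a simplified TextRank-style ranking. This is only an illustration under stated assumptions: the similarity measure is plain word overlap rather than the formula from the TextRank paper, the sentences are hypothetical and pre-tokenized, and the names `similarity` and `text_rank` are invented for this example.

```python
def similarity(a, b):
    """Word-overlap similarity between two tokenized sentences."""
    overlap = len(set(a) & set(b))
    return overlap / (len(a) + len(b)) if overlap else 0.0


def text_rank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence via PageRank-style iteration."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    ["the", "monster", "sank", "a", "ship"],
    ["the", "ship", "was", "fast"],
    ["the", "monster", "was", "fast"],
    ["rain", "fell", "yesterday"],
]
scores = text_rank(sentences)
# Keep the two best sentences, restored to their original order.
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
top_two = sorted(ranked[:2])
summary = [sentences[i] for i in top_two]
print(summary)
```

Every sentence in `summary` is taken verbatim from `sentences`, which is exactly the property that distinguishes an extractive summary from an abstractive one. The unrelated sentence about rain shares no words with the others, so it only receives the baseline score and is never selected.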

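To keep things simple, this article uses the extractive approach, implementing TextRank-based summarization with the third-party library SnowNLP. We use part of Twenty Thousand Leagues Under the Sea as the source text and generate a summary from it.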

Original text

Using the algorithm provided by SnowNLP:

from snownlp import SnowNLP
 
text = "the original text above, omitted here"
s = SnowNLP(text)
print("。".join(s.summary(5)))

Output:

自然就分成观点截然不同的两派：一派说这是一个力大无比的怪物。这种假设也不能成立。我到纽约时。说它是一块浮动的船体或是一堆大船残片。另一派说这是一艘动力极强的“潜水船”

At first glance the result is not very good. Next we compute sentence weights ourselves to implement a simple summarizer, which requires jieba:

import re
import jieba.analyse
import jieba.posseg
 
 
class TextSummary:
    def __init__(self, text):
        self.text = text
 
    def splitSentence(self):
        sectionNum = 0
        self.sentences = []
        for eveSection in self.text.split("\n"):
            if eveSection:
                sentenceNum = 0
                for eveSentence in re.split("!|。|？", eveSection):
                    if eveSentence:
                        mark = []
                        if sectionNum == 0:
                            mark.append("FIRSTSECTION")
                        if sentenceNum == 0:
                            mark.append("FIRSTSENTENCE")
                        self.sentences.append({
                            "text": eveSentence,
                            "pos": {
                                "x": sectionNum,
                                "y": sentenceNum,
                                "mark": mark
                            }
                        })
                        sentenceNum = sentenceNum + 1
                sectionNum = sectionNum + 1
                self.sentences[-1]["pos"]["mark"].append("LASTSENTENCE")
        for i in range(0, len(self.sentences)):
            if self.sentences[i]["pos"]["x"] == self.sentences[-1]["pos"]["x"]:
                self.sentences[i]["pos"]["mark"].append("LASTSECTION")
 
    def getKeywords(self):
        self.keywords = jieba.analyse.extract_tags(self.text, topK=20, withWeight=False, allowPOS=('n', 'vn', 'v'))
 
    def sentenceWeight(self):
        # Compute the position weight of each sentence
        for sentence in self.sentences:
            mark = sentence["pos"]["mark"]
            weightPos = 0
            if "FIRSTSECTION" in mark:
                weightPos = weightPos + 2
            if "FIRSTSENTENCE" in mark:
                weightPos = weightPos + 2
            if "LASTSENTENCE" in mark:
                weightPos = weightPos + 1
            if "LASTSECTION" in mark:
                weightPos = weightPos + 1
            sentence["weightPos"] = weightPos

        # Compute the cue-word weight of each sentence
        # (cue words stay in Chinese because the input text is Chinese)
        index = ["总之", "总而言之"]
        for sentence in self.sentences:
            sentence["weightCueWords"] = 0
            sentence["weightKeywords"] = 0
        for i in index:
            for sentence in self.sentences:
                if sentence["text"].find(i) >= 0:
                    sentence["weightCueWords"] = 1
 
        for keyword in self.keywords:
            for sentence in self.sentences:
                if sentence["text"].find(keyword) >= 0:
                    sentence["weightKeywords"] = sentence["weightKeywords"] + 1
 
        for sentence in self.sentences:
            sentence["weight"] = sentence["weightPos"] + 2 * sentence["weightCueWords"] + sentence["weightKeywords"]
 
    def getSummary(self, ratio=0.1):
        self.keywords = list()
        self.sentences = list()
        self.summary = list()
 
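        # Call the methods to extract keywords, split sentences and compute weights
        self.getKeywords()
        self.splitSentence()
        self.sentenceWeight()

        # Sort sentences by weight, highest first
        self.sentences = sorted(self.sentences, key=lambda k: k['weight'], reverse=True)

        # Take the top `ratio` fraction of the ranked sentences as the summary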
        for i in range(len(self.sentences)):
            if i < ratio * len(self.sentences):
                sentence = self.sentences[i]
                self.summary.append(sentence["text"])
 
        return self.summary

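This code extracts keywords with TF-IDF, uses them (along with sentence position and cue words) to assign a weight to each sentence, and then produces the final result. Run it: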

testSummary = TextSummary(text)
print("。".join(testSummary.getSummary()))

This gives the result:

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/yb/wvy_7wm91mzd7cjg4444gvdjsglgs8/T/jieba.cache
Loading model cost 0.721 seconds.
Prefix dict has been built successfully.
看来，只有政府才有可能拥有这种破坏性的机器，在这个灾难深重的时代，人们千方百计要增强战争武器威力，那就有这种可能，一个国家瞒着其他国家在试制这类骇人听闻的武器。于是，我就抓紧这段候船逗留时间，把收集到的矿物和动植物标本进行分类整理，可就在这时，斯科舍号出事了。同样的道理，说它是一块浮动的船体或是一堆大船残片，这种假设也不能成立，理由仍然是移动速度太快

As we can see, the overall result is somewhat better than before.
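The scoring rule inside sentenceWeight can be isolated as a small function to see how a single sentence gets its weight. This is a sketch, not a replacement for the class: the function name `sentence_weight` and the sample sentence and keywords are hypothetical, and it assumes the cue words are matched without surrounding spaces.

```python
def sentence_weight(marks, text, keywords, cue_words=("总之", "总而言之")):
    """Combine position, cue-word and keyword scores with the same formula as the class."""
    position = 0
    if "FIRSTSECTION" in marks:
        position += 2
    if "FIRSTSENTENCE" in marks:
        position += 2
    if "LASTSENTENCE" in marks:
        position += 1
    if "LASTSECTION" in marks:
        position += 1
    # A sentence containing any cue word gets a cue score of 1.
    cue = 1 if any(word in text for word in cue_words) else 0
    # Each distinct keyword found in the sentence adds 1.
    keyword_hits = sum(1 for keyword in keywords if keyword in text)
    return position + 2 * cue + keyword_hits


weight = sentence_weight(
    marks=["FIRSTSENTENCE"],
    text="总之，这个怪物速度很快",
    keywords=["怪物", "速度", "政府"],
)
print(weight)  # 2 (position) + 2 * 1 (cue word) + 2 (keywords) = 6
```

Because the keyword term is unbounded while the position term is at most 6, long sentences that mention many keywords tend to dominate the ranking, which is consistent with the long sentences in the output above.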

Publishing the API

Using a Serverless architecture, tidy up the code above and publish it.

The tidied code:

import re, json
import jieba.analyse
import jieba.posseg
 
 
class NLPAttr:
    def __init__(self, text):
        self.text = text
 
    def splitSentence(self):
        sectionNum = 0
        self.sentences = []
        for eveSection in self.text.split("\n"):
            if eveSection:
                sentenceNum = 0
                for eveSentence in re.split("!|。|？", eveSection):
                    if eveSentence:
                        mark = []
                        if sectionNum == 0:
                            mark.append("FIRSTSECTION")
                        if sentenceNum == 0:
                            mark.append("FIRSTSENTENCE")
                        self.sentences.append({
                            "text": eveSentence,
                            "pos": {
                                "x": sectionNum,
                                "y": sentenceNum,
                                "mark": mark
                            }
                        })
                        sentenceNum = sentenceNum + 1
                sectionNum = sectionNum + 1
                self.sentences[-1]["pos"]["mark"].append("LASTSENTENCE")
        for i in range(0, len(self.sentences)):
            if self.sentences[i]["pos"]["x"] == self.sentences[-1]["pos"]["x"]:
                self.sentences[i]["pos"]["mark"].append("LASTSECTION")
 
    def getKeywords(self):
        self.keywords = jieba.analyse.extract_tags(self.text, topK=20, withWeight=False, allowPOS=('n', 'vn', 'v'))
        return self.keywords
 
    def sentenceWeight(self):
        # Compute the position weight of each sentence
        for sentence in self.sentences:
            mark = sentence["pos"]["mark"]
            weightPos = 0
            if "FIRSTSECTION" in mark:
                weightPos = weightPos + 2
            if "FIRSTSENTENCE" in mark:
                weightPos = weightPos + 2
            if "LASTSENTENCE" in mark:
                weightPos = weightPos + 1
            if "LASTSECTION" in mark:
                weightPos = weightPos + 1
            sentence["weightPos"] = weightPos

        # Compute the cue-word weight of each sentence
        # (cue words stay in Chinese because the input text is Chinese)
        index = ["总之", "总而言之"]
        for sentence in self.sentences:
            sentence["weightCueWords"] = 0
            sentence["weightKeywords"] = 0
        for i in index:
            for sentence in self.sentences:
                if sentence["text"].find(i) >= 0:
                    sentence["weightCueWords"] = 1
 
        for keyword in self.keywords:
            for sentence in self.sentences:
                if sentence["text"].find(keyword) >= 0:
                    sentence["weightKeywords"] = sentence["weightKeywords"] + 1
 
        for sentence in self.sentences:
            sentence["weight"] = sentence["weightPos"] + 2 * sentence["weightCueWords"] + sentence["weightKeywords"]
 
    def getSummary(self, ratio=0.1):
        self.keywords = list()
        self.sentences = list()
        self.summary = list()
 
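        # Call the methods to extract keywords, split sentences and compute weights
        self.getKeywords()
        self.splitSentence()
        self.sentenceWeight()

        # Sort sentences by weight, highest first
        self.sentences = sorted(self.sentences, key=lambda k: k['weight'], reverse=True)

        # Take the top `ratio` fraction of the ranked sentences as the summary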
        for i in range(len(self.sentences)):
            if i < ratio * len(self.sentences):
                sentence = self.sentences[i]
                self.summary.append(sentence["text"])
 
        return self.summary
 
 
def main_handler(event, context):
    nlp = NLPAttr(json.loads(event['body'])['text'])
    return {
        "keywords": nlp.getKeywords(),
        "summary": "。".join(nlp.getSummary())
    }
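The handler above assumes that `event['body']` is always a JSON object with a `text` field, so a malformed request would raise an exception. One possible way to guard against that is sketched below; the helper name `extract_text` and the error messages are hypothetical and not part of the original code.

```python
import json


def extract_text(event):
    """Return (text, error) parsed from an API Gateway style event dict."""
    try:
        body = json.loads(event.get("body") or "")
    except (TypeError, ValueError):
        return None, "request body must be valid JSON"
    if not isinstance(body, dict):
        return None, "request body must be a JSON object"
    text = body.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "field 'text' must be a non-empty string"
    return text, None


print(extract_text({"body": json.dumps({"text": "hello"})}))
print(extract_text({"body": "not json"}))
```

A handler could call this first and return the error message to the client instead of failing, then pass the validated text to NLPAttr.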

Write the project's serverless.yaml file:

nlpDemo:
  component: "@serverless/tencent-scf"
  inputs:
    name: nlpDemo
    codeUri: ./
    handler: index.main_handler
    runtime: Python3.6
    region: ap-guangzhou
    description: Text summarization / keyword extraction
    memorySize: 256
    timeout: 10
    events:
      - apigw:
          name: nlpDemo_apigw_service
          parameters:
            protocols:
              - http
            serviceName: serverless
            description: Text summarization / keyword extraction
            environment: release
            endpoints:
              - path: /nlp
                method: ANY
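Once the service is deployed, a client can call it by posting JSON with a `text` field, since that is what main_handler reads from the request body. The sketch below only builds the request; the URL is a placeholder, not a real endpoint, and the helper name `build_request` is hypothetical.

```python
import json
import urllib.request


def build_request(url, text):
    """Build a POST request whose JSON body matches what main_handler reads."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; substitute the URL reported by your own deployment.
request = build_request("http://example.com/release/nlp", "text to summarize")
# To actually send it:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read().decode("utf-8")))
```

The response is expected to carry the `keywords` and `summary` fields returned by the handler, though the exact envelope depends on how the API gateway is configured.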