
Cut a long string into paragraphs containing whole sentences

Stack Overflow user
Asked 2018-03-04 20:28:38
1 answer · 420 views · 0 followers · 1 vote

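My task is to translate a very long text (more than 50k characters) with online translation APIs (Google, Yandex, etc.). All of them limit the request length. So I want to cut my text into a list of strings, each shorter than that limit, while keeping every sentence uncut.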

For example, if I process this text with a limit of 300 characters:

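The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.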

I should get this output:

```python
['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.',
'These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java.',
'Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.)',
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages.',
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
```

What is the best way to do this? Is there a regexp that can achieve it?

1 Answer

Stack Overflow user
Accepted answer
Answered 2018-03-04 20:42:39

Regex is not the right tool for splitting a paragraph into sentences. You should take a look at nltk:

```python
import nltk

# this line only needs to be run once per environment
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
nltk.download('punkt')

text = """The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages."""

sents = nltk.sent_tokenize(text)

sents
# outputs:
['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!',
 'We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.',
 'This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis.',
 'All our supported software distributions are written in Java.',
 'Current versions of our software from October 2014 forward require Java 8+.',
 '(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+.',
 'The Stanford Parser was first written in Java 1.1.)',
 'Distribution packages include components for command-line invocation, jar files, a Java API, and source code.',
 'You can also find us on GitHub and Maven.',
 'A number of helpful people have extended our work, with bindings or translations for other languages.',
 'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
```
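Note that the second item in the tokenizer output still holds two sentences, because the input reads "needs.These" with no space after the period. One possible workaround, not part of the original answer, is to repair such spots before tokenizing. The sketch below assumes that a period squeezed between a lowercase and an uppercase letter marks a missing space; `repair_missing_spaces` is a hypothetical helper name, and the heuristic would also break dotted identifiers such as `object.Method`, so it only suits plain prose.

```python
import re

def repair_missing_spaces(text):
    """Insert a space after a period that directly joins a lowercase
    letter to an uppercase letter, e.g. 'needs.These' -> 'needs. These'."""
    # Lookbehind/lookahead keep the letters themselves out of the match,
    # so only the period is replaced.
    return re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)

sample = "technology needs.These packages work with .NET and Java 1.6+."
print(repair_missing_spaces(sample))
# prints: technology needs. These packages work with .NET and Java 1.6+.
```

Tokens such as `.NET` and `1.6+` are left alone because the period is not preceded by a lowercase letter. The repaired text can then be passed to the sentence tokenizer as before.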

One way to aggregate the sentences by cumulative length is to use a generator function:

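Here, the function g yields a joined string whenever adding the next sentence would push the length past 300 characters, or when the end of the iterable is reached. The function assumes that no single sentence exceeds the 300-character limit.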

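```python
def g(sents, limit=300):
    """Yield chunks of whole sentences, each at most `limit` characters."""
    idx = 0          # index of the first sentence in the current chunk
    text_length = 0  # length of the current chunk once joined with spaces
    for i, s in enumerate(sents):
        # +1 accounts for the space that ' '.join inserts before s
        extra = len(s) + (1 if i > idx else 0)
        if i > idx and text_length + extra > limit:
            yield ' '.join(sents[idx:i])
            text_length = len(s)
            idx = i
        else:
            text_length += extra
    if idx < len(sents):
        yield ' '.join(sents[idx:])
```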
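The generator above relies on the assumption that no sentence is longer than the limit. A minimal sketch of one way to relax that assumption follows, assuming it is acceptable to cut an oversized sentence at whitespace. `chunk_sentences` is a hypothetical name that does not appear in the original answer, and the fallback uses the standard library's `textwrap.wrap`.

```python
import textwrap

def chunk_sentences(sents, limit=300):
    """Group sentences into chunks of at most `limit` characters.

    A sentence longer than `limit` is split at whitespace with
    textwrap.wrap, so only such oversized sentences are ever cut.
    """
    current = []  # sentences (or pieces of one) in the chunk being built
    length = 0    # length of ' '.join(current)
    for sent in sents:
        # Oversized sentences become several pieces; others stay whole.
        pieces = textwrap.wrap(sent, width=limit) if len(sent) > limit else [sent]
        for piece in pieces:
            added = len(piece) + (1 if current else 0)  # +1 for the joining space
            if current and length + added > limit:
                yield ' '.join(current)
                current, length = [], 0
                added = len(piece)
            current.append(piece)
            length += added
    if current:
        yield ' '.join(current)

demo = ["One two.", "Three four five six seven eight nine ten.", "End."]
for chunk in chunk_sentences(demo, limit=20):
    print(len(chunk), chunk)
```

With a small limit such as 20, the long middle sentence is cut at word boundaries while the short sentences stay intact. Note that `textwrap.wrap` normalizes whitespace inside the pieces it produces.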

The sentence aggregator can be used like this:

```python
for s in g(sents):
    print(s)
```

Output:

```
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!
We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.
This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+.
(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.
```

Checking the length of each text segment shows that every chunk is shorter than 300 characters:

```python
[len(s) for s in g(sents)]
# outputs:
[100, 268, 244, 276, 289]
```
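The length check above could be extended to confirm that no text was lost while chunking. The following is a small sketch under the assumption that sentences were joined with single spaces; `check_chunks` is a hypothetical helper, not part of the original answer.

```python
def check_chunks(sents, chunks, limit=300):
    """Return (oversized_chunks, text_preserved) for a chunking result."""
    chunks = list(chunks)  # allow a generator to be passed in
    oversized = [c for c in chunks if len(c) > limit]
    # Sentences were joined with single spaces, so rejoining the chunks
    # the same way should reproduce the rejoined sentences exactly.
    text_preserved = ' '.join(chunks) == ' '.join(sents)
    return oversized, text_preserved

example_sents = ["First sentence.", "Second sentence.", "Third one."]
example_chunks = ["First sentence. Second sentence.", "Third one."]
print(check_chunks(example_sents, example_chunks, limit=40))  # prints ([], True)
```

An empty first element means every chunk fits within the limit, and `True` means the chunks contain exactly the original sentences in order.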
Votes: 3
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/49100086
