
Generating Pokémon Anime Lines with GPT-2

代码医生工作室

Published 2019-11-12 07:37:14


Included in the column: 相约机器人


Author | Thiago Lira

Source | Medium

Ludicolo was a salsa master, he would teach Ash how to move like a god. He would make fun of Ash for being unable to move so quickly, and would even attack him for being weak.

OpenAI's GPT-2 model is a game changer for AI-generated text. This article shows how to use the model to generate Pokémon anime text.

The final result:

http://pokegen.thiagolira.com.br/

All of the code can be found in the GitHub repository.

https://github.com/ThiagoLira/pkm-ep-generator

This article walks through the main challenges of an end-to-end machine learning project.

Data

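Machine learning methods extract information and infer patterns from data. Classical statistical methods come with many parameters and assumptions chosen by the statistician during modeling, whereas machine learning methods let the data speak for itself. This is a trade-off between interpretable (classical) models and accurate (machine learning) models. The predictive power of machine learning comes essentially from having a lot of data and a model complex enough to capture highly subtle patterns in it. There is a "smoothness" assumption: the model is trained on a large enough sample of reality that it can extrapolate to inputs it has not seen directly, on the premise that they lie close to examples it was actually trained on.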

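Recent NLP (natural language processing) models are no different: they need a great deal of text and compute to train. These new models start with zero knowledge of language and end up very good at extracting contextual information from word sequences. They capture the fact that the same word can mean different things at different positions in a sentence, something classical NLP models could not do well.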

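GPT-2 comes pretrained on a very large corpus of web pages gathered from links shared on Reddit. Here the model is fine-tuned on a much more specific slice of the internet: community-written summaries of Pokémon anime episodes.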

A crawler written for this project downloaded about 400 community-written episode summaries. Here is an excerpt from one of them:

Ash declares to himself and the Pokémon of the world that he will become a Pokémon Master. His speech, however, is interrupted by his mom who tells him to get to bed as he has a big day tomorrow. Ash protests that he’s too excited to sleep, so his mom tells him that if he won’t sleep then to at least get ready for the next day as she switches on a program hosted by the town’s Pokémon expert, Professor Oak. Ash watches as Oak explains that new Trainers get to pick one of three Pokémon to start their journey; the Grass-type Bulbasaur, the Fire-type Charmander or the Water-type Squirtle.

The crawler lives in crawler_bulbapedia.py. When run, it creates a folder named data/pokeCorpusBulba and stores each episode in a separate text file.

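At this point the data is not yet ready to feed to the model. Another script, prepare_corpus.py, cleans the texts and merges them all into a single file named train.txt, ready to use with GPT-2.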
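
The cleanup-and-merge step could look roughly like the sketch below. The article does not show prepare_corpus.py, so the cleaning rule (collapsing whitespace, skipping empty files) and the function names clean_text and build_corpus are assumptions made for illustration; only the folder name and train.txt come from the text above.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces (an assumed cleaning rule)."""
    return re.sub(r"\s+", " ", raw).strip()


def build_corpus(episode_dir: str, output_file: str) -> int:
    """Merge every .txt file in episode_dir into output_file, one episode per line.

    Returns the number of episodes written.
    """
    written = 0
    with open(output_file, "w", encoding="utf-8") as out:
        # Sort the paths so the merged corpus is reproducible across runs.
        for path in sorted(Path(episode_dir).glob("*.txt")):
            text = clean_text(path.read_text(encoding="utf-8"))
            if text:  # skip empty episode files
                out.write(text + "\n")
                written += 1
    return written
```

Calling build_corpus("data/pokeCorpusBulba", "train.txt") would then produce the single training file. Keeping the output file outside the episode folder avoids it being picked up as an episode on a second run.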

The model

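GPT-2 is a Transformer-based model that uses a technique called self-attention to learn, in a remarkably natural way, how words complete or continue a sentence. This section offers some insight from a purely programming perspective: how to use the pretrained model as if it were a text generation API. An excellent resource for this is the gpt-2-simple Python library, which hides almost all of the TensorFlow complexity and provides a few very simple functions for downloading, fine-tuning, and sampling from GPT-2 models.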

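Basically, a language model tries to predict the next word of a sentence. To generate new text, keep asking the model for predictions, feeding each prediction back in as new input to get more and more words. As an example, the model can be given the prefix "Ash and Pikachu were".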
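
The predict-append-repeat loop can be sketched with a toy model. The bigram counter below is a deliberately tiny stand-in, not GPT-2: it only looks at the last word, and the corpus and the function names train_bigram and generate are made up for illustration. What carries over to the real model is the shape of the loop.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows: dict, prefix: str, max_new_words: int) -> str:
    """Greedy autoregressive loop: predict one word, append it, repeat."""
    words = prefix.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:  # the toy model has never seen this word
            break
        # Pick the most frequent continuation and feed it back in as input.
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# A made-up three-sentence corpus, with "." treated as a word of its own.
corpus = (
    "Ash and Pikachu were walking . "
    "Misty and Pikachu were walking . "
    "Brock and Pikachu were tired ."
)
model = train_bigram(corpus)
print(generate(model, "Ash and Pikachu were", 2))
```

GPT-2 replaces the lookup table with a neural network that conditions on the whole prefix and usually samples from the predicted distribution instead of always taking the top word.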

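What GPT-2 does with its attention mechanism is dynamically evaluate how important each of the preceding words is for predicting the next one. Inside the model is something called a "transformer cell", which computes an attention value for every word in the input sequence relative to every other word. All of these values are passed along to produce the output, which is the prediction of the next word in the sentence.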

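In a slightly simplified illustration, with attention strength shown in shades of purple (the deeper the purple, the stronger the attention), it is clear that "Ash" and "Pikachu" matter for deciding what comes after "were". This is something classic word-counting methods such as Naive Bayes cannot do.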
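
To make the idea of an attention value concrete, here is a minimal sketch of scaled dot-product attention for a single position. The 2-dimensional vectors are invented for the example (a real model learns them, uses many more dimensions, several attention heads, and computes this for every position), so the numbers only illustrate the mechanics.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention of one query vector over a list of key vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Made-up key vectors, one per word of the prefix.
tokens = ["Ash", "and", "Pikachu", "were"]
keys = [[2.0, 0.0], [0.0, 0.1], [1.8, 0.2], [0.1, 0.3]]
# Made-up query vector for the last word, the one predicting what comes next.
query_for_last_word = [1.5, 0.0]

weights = attention_weights(query_for_last_word, keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.2f}")
```

With these vectors the largest weights land on "Ash" and "Pikachu", mirroring the pattern described above: the words that carry the most information about what should follow get the most attention.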

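Training is done by hiding words from sentences in the corpus and fine-tuning the model to predict them correctly (in GPT-2's case the hidden word is always the next one in the sequence). The end result is a checkpoint folder, which is the only thing needed to generate text from the model. This folder, created by TensorFlow, contains the entire model state after fine-tuning on the Pokémon corpus, and the gpt-2-simple library looks for it when generating new text.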
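
With gpt-2-simple, the fine-tuning step comes down to a handful of calls. The sketch below is an assumption about how it could be wired up, not the article's training script: the wrapper name finetune_on_corpus, the model size "124M", and the step count are illustrative choices, while download_gpt2, start_tf_sess, and finetune are the library's functions.

```python
import os


def finetune_on_corpus(corpus_path="train.txt", model_name="124M", steps=1000):
    """Fine-tune a pretrained GPT-2 on a text file using gpt-2-simple.

    model_name and steps are assumed values, not the article's settings.
    """
    if not os.path.isfile(corpus_path):
        raise FileNotFoundError(f"corpus not found: {corpus_path}")

    # Imported lazily because the library pulls in TensorFlow, which is slow to load.
    import gpt_2_simple as gpt2

    # Download the pretrained weights once; they are stored under ./models/<model_name>.
    if not os.path.isdir(os.path.join("models", model_name)):
        gpt2.download_gpt2(model_name=model_name)

    sess = gpt2.start_tf_sess()
    # By default the library writes its checkpoints under ./checkpoint/run1.
    gpt2.finetune(sess, dataset=corpus_path, model_name=model_name, steps=steps)
    return sess
```

After this has run, gpt2.load_gpt2(sess) in a fresh session restores the fine-tuned weights from the checkpoint folder, which is what the server code relies on.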

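The server

This was BY FAR the most challenging part. Serving inference from this model over the internet is no easy task, because text generation is very memory-intensive.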

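Basically, the server answers GET requests sent to port 5000. A single function handles each request: it reads the parameter (the user's input), initializes the model, generates a fixed amount of text, and returns everything as JSON. The hard part is that the model takes up to 1 GB of memory for inference, so the first requirement is a server with a decent amount of RAM. The final choice was an EC2 t2.medium instance on AWS.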

The web server structure on the EC2 instance was copied wholesale from an article by Gabriela Melo.

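The web server running on this EC2 instance is nginx. It listens for requests and forwards them to the uWSGI application server, which talks to the Flask app over the WSGI protocol. The structure is basically: client request, then nginx, then uWSGI, then the Flask app.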

Diagram by Gabriela Melo

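The purpose of the WSGI protocol is to give web applications written in Python a common interface. That makes it possible to swap the application framework (say, Flask for Django) or the application server (say, uWSGI for Gunicorn) with the change being largely invisible to the other parts.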
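
The common interface is small enough to show in full: a WSGI application is just a callable that takes the request environment and a start_response callback and returns an iterable of bytes. The sketch below uses only the standard library; fake_generate is a hypothetical stand-in for the real model call, and the default prefix mirrors the one used elsewhere in this article.

```python
import json
from urllib.parse import parse_qs


def fake_generate(prefix):
    """Hypothetical stand-in for the real model call."""
    return prefix + " walking through the forest."


def application(environ, start_response):
    """A WSGI app: the server passes the request in `environ` and a callback
    for the status line and headers; the app returns an iterable of bytes."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    sample = params.get("sample", ["Ash and Pikachu were"])[0]
    body = json.dumps({"sample_text": fake_generate(sample)}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Any WSGI server can host this callable unchanged. For a quick local check, the standard library's wsgiref.simple_server.make_server("", 5000, application).serve_forever() is enough; a Flask app object exposes the same callable interface, which is why the server behind it can be swapped.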

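Why not just expose the uWSGI server directly to the network? Why add another layer, nginx? The short answer is that nginx handles some of the problems that come with server load, which uWSGI on its own is not well suited to deal with.

All of this software had to be packaged into a Docker container.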

Flask App

The Flask app (where the model runs on the server) has a single request entry point, the generate function:

import gc
import json
import re

import gpt_2_simple as gpt2
import tensorflow as tf
from flask import Flask, request

# Module setup is not shown in the article; this is the usual Flask boilerplate
app = Flask(__name__)


@app.route('/', methods=['GET'])
def generate():
    # Since Flask forks the Python process to answer requests,
    # we need to do this to avoid errors with TensorFlow
    tf.reset_default_graph()

    # Start a TensorFlow session and load the fine-tuned model into memory
    sess = gpt2.start_tf_sess(threads=1)
    gpt2.load_gpt2(sess)

    # Get our params from the GET request
    callback = request.args.get('callback')
    sample = request.args.get('sample')

    # If the user did not input anything, feed the model a default prefix
    if not sample:
        sample = "Ash and Pikachu were"

    samples = gpt2.generate(sess, prefix=sample, return_as_list=True, length=256)

    # The model generates a fixed number of tokens, so the text usually
    # stops mid-sentence. Split on periods and drop the trailing fragment.
    lst = re.split(r'\.', samples[0])
    generated_text = '.'.join(lst[:-1]) + "."

    # Our return data
    data = {
        'sample_text': generated_text
    }

    # Garbage collect since memory doesn't grow on trees
    gc.collect()

    # JSONP response: the client-supplied callback name wraps the JSON payload
    return '{0}({1})'.format(callback, json.dumps(data))
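
The last two steps of generate, trimming the text to whole sentences and wrapping the result for the caller, can be isolated into two small helpers. This is a sketch with hypothetical function names; unlike the handler above, it returns the text unchanged when it contains no period at all, which is a design choice made here and not taken from the article.

```python
import json


def keep_complete_sentences(text):
    """Drop whatever follows the last period; leave the text alone if there is none."""
    end = text.rfind(".")
    return text if end == -1 else text[: end + 1]


def to_jsonp(callback, payload):
    """Wrap a JSON payload in a call to the client-supplied callback name (JSONP)."""
    return "{0}({1})".format(callback, json.dumps(payload))


raw = "Ash ran home. Pikachu followed him. Then they sudd"
print(to_jsonp("handleText", {"sample_text": keep_complete_sentences(raw)}))
```

Serializing with json.dumps keeps the payload valid no matter which quotes or special characters the model produces, which matters because the generated text is not under the server's control.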

Lessons learned

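Data matters a great deal if you want to do something new with machine learning.
GPT-2 is impractical as an on-demand text generation tool: it needs too much memory and CPU to run. A service that needs 1 GB of RAM to answer each request is very expensive to operate.
docker system prune is your friend.
The Python ecosystem for web servers is not hard to use, and there are plenty of examples.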

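Some sample outputs

Each sample begins with the input prompt given to the model (set in bold in the original). Some cherry-picking was involved, of course, but that is how it goes with generative models.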

Ash and Misty were dating when they fell in love. As they both recall their respective first experiences, ash’s first brush with pokémon is all he ever remembers, as he was just a child. later, after ash had made his first poké ball, he skipped lunch and pursued a friend and switched trainers. This ended in them both falling in love, leaving dawn and brock in tears. When they were out searching for ash’s bulbasaur, a wild gyarados swatted them off. (…)

Pikachu was tired of all this sh*t. He runs in fear of the grass, and runs in fear of the trainers, too. Jessie and james run outside and run outside. (…)

Ash wanted to be the very best and trained all of his pokémon to get there. He told his trainer the whole story and promised to be a great trainer. He told his parents and his friends that he would train them as best he could. They were surprised and were ready to give up on him, when his parents started crying. His mother told him to come back home and find his friends. They had no choice but to go with him.

Pikachu was being arrested for tax evasion. Once the trio get out of the olivine city pokémon center, they are immediately attacked by a former police officer, a police detective, a nurse joy, and a nurse joy’s glameow.

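This article is part of the Tencent Cloud self-media sync exposure program and is shared from a WeChat official account.

Originally published 2019-11-05. To report infringement, contact cloudcommunity@tencent.com for removal.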
