
How to add a text preprocessing tokenization step to a TensorFlow model

Stack Overflow user

Asked on 2022-07-13 22:34:34 · 1 answer · 227 views · 3 votes

I have a TensorFlow SavedModel, consisting of a saved_model.pb file and a variables folder. The preprocessing steps are not yet included in this model, which is why I need to preprocess the data (tokenization etc.) before feeding it to the model for prediction.

I am looking for a way to combine the preprocessing steps into the model. I have seen examples here and here, but they are for image data.

To show how the training part is done, here is part of the code we use for training (if you need the implementation of any function I use here, let me know; I did not include them to keep the question easier to follow).

Training:

processor = IntentProcessor(FLAGS.data_path, FLAGS.test_data_path,
                            FLAGS.test_proportion, FLAGS.seed, FLAGS.do_early_stopping)


bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
tokenizer = tokenization.FullTokenizer(
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)

train_examples = None
num_train_steps = None
num_warmup_steps = None
if FLAGS.do_train:
    train_examples = processor.get_train_examples()
    num_iter_per_epoch = int(len(train_examples) / FLAGS.train_batch_size)
    num_train_steps = num_iter_per_epoch * FLAGS.num_train_epochs
    num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
    run_config = tf.estimator.RunConfig(
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=num_iter_per_epoch)

best_temperature = 1.0  # Initiate the best T value as 1.0 and will
# update this during the training

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(processor.le.classes_),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    best_temperature=best_temperature,
    seed=FLAGS.seed)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_config)
# add parameters by passing a params variable

if FLAGS.do_train:
    train_features = convert_examples_to_features(
        train_examples, FLAGS.max_seq_length, tokenizer)
    train_labels = processor.get_train_labels()
    train_input_fn = input_fn_builder(
        features=train_features,
        is_training=True,
        batch_size=FLAGS.train_batch_size,
        seed=FLAGS.seed,
        labels=train_labels
    )
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
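As a sanity check on the step arithmetic in the `do_train` branch above, here is a small worked example with made-up flag values (1000 training examples, batch size 32, 3 epochs, 10% warmup):

```python
# Worked example of the training-step arithmetic above.
# All values are made up for illustration.
num_examples = 1000        # len(train_examples)
train_batch_size = 32      # FLAGS.train_batch_size
num_train_epochs = 3       # FLAGS.num_train_epochs
warmup_proportion = 0.1    # FLAGS.warmup_proportion

num_iter_per_epoch = int(num_examples / train_batch_size)    # 31
num_train_steps = num_iter_per_epoch * num_train_epochs      # 93
num_warmup_steps = int(num_train_steps * warmup_proportion)  # 9

print(num_iter_per_epoch, num_train_steps, num_warmup_steps)  # 31 93 9
```

Note that `int(len / batch_size)` floors, so the last partial batch of each epoch does not count toward the step budget.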

This is the preprocessing I use for training:

import numpy as np

LABEL_LIST = ['negative', 'neutral', 'positive']
INTENT_MAP = {i: LABEL_LIST[i] for i in range(len(LABEL_LIST))}
BATCH_SIZE = 1
MAX_SEQ_LEN = 70
def convert_examples_to_features(texts, max_seq_length, tokenizer):
    """Converts a list of input texts into a dict of BERT input features.

       texts is the list of input texts.
    """
    features = {}
    input_ids_list = []
    input_mask_list = []
    segment_ids_list = []

    for (ex_index, text) in enumerate(texts):
        tokens_a = tokenizer.tokenize(str(text))
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        # print(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        input_ids_list.append(input_ids)
        input_mask_list.append(input_mask)
        segment_ids_list.append(segment_ids)

    features['input_ids'] = np.asanyarray(input_ids_list)
    features['input_mask'] = np.asanyarray(input_mask_list)
    features['segment_ids'] = np.asanyarray(segment_ids_list)

    # tf.data.Dataset.from_tensor_slices needs to pass numpy array not
    # tensor, or the tensor graph (shape) should match

    return features
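The core of `convert_examples_to_features` (insert `[CLS]`/`[SEP]`, truncate, zero-pad, build the mask) is plain list manipulation. Here is a minimal framework-free sketch with a hypothetical whitespace tokenizer and a toy vocabulary (both made up), just to make the resulting shapes explicit:

```python
# Toy vocabulary standing in for vocab.txt; the ids are arbitrary.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "the": 4, "movie": 5, "is": 6, "ok": 7}

def featurize(text, max_seq_length):
    # Whitespace "tokenizer"; truncate to leave room for [CLS] and [SEP].
    tokens = text.lower().split()[:max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)      # 1 = real token, 0 = padding
    segment_ids = [0] * len(input_ids)     # single sentence: all zeros
    pad = max_seq_length - len(input_ids)  # zero-pad up to the fixed length
    return (input_ids + [0] * pad,
            input_mask + [0] * pad,
            segment_ids + [0] * pad)

ids, mask, segs = featurize("The movie is ok", 8)
print(ids)   # [1, 4, 5, 6, 7, 2, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```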

Inference looks like this:

def inference(texts,MODEL_DIR, VOCAB_FILE):
    if not isinstance(texts, list):
        texts = [texts]
    tokenizer = FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=False)
    features = convert_examples_to_features(texts, MAX_SEQ_LEN, tokenizer)

    predict_fn = predictor.from_saved_model(MODEL_DIR)
    response = predict_fn(features)
    #print(response)
    return get_sentiment(response)

def preprocess(texts):
    if not isinstance(texts, list):
        texts = [texts]
    tokenizer = FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=False)
    features = convert_examples_to_features(texts, MAX_SEQ_LEN, tokenizer)

    return features

def get_sentiment(response):
    idx = response['intent'].tolist()
    print(idx)
    print(INTENT_MAP.get(idx[0]))
    outputs = []
    for i in range(0, len(idx)):
        outputs.append({
            "sentiment": INTENT_MAP.get(idx[i]),
            "confidence": response['prob'][i][idx[i]]
        })
    return outputs

sentence = 'The movie is ok'
inference(sentence, args.model_path, args.vocab_path)
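The `get_sentiment` post-processing above is just an argmax-index-to-label lookup plus picking out the matching probability. A framework-free sketch with a mocked predictor response (plain lists standing in for the NumPy arrays; all numbers are made up):

```python
LABEL_LIST = ['negative', 'neutral', 'positive']
INTENT_MAP = {i: LABEL_LIST[i] for i in range(len(LABEL_LIST))}

# Mocked predictor response: 'intent' holds argmax indices and 'prob'
# holds one softmax row per input (values made up for illustration).
response = {'intent': [2, 0],
            'prob': [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]}

def get_sentiment(response):
    idx = response['intent']
    return [{"sentiment": INTENT_MAP.get(i),
             "confidence": response['prob'][n][i]}
            for n, i in enumerate(idx)]

print(get_sentiment(response))
# [{'sentiment': 'positive', 'confidence': 0.7},
#  {'sentiment': 'negative', 'confidence': 0.8}]
```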

This is the implementation of model_fn_builder:

def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, best_temperature, seed):
    """Returns multi-intents `model_fn` closure for Estimator"""

    def model_fn(features, labels, mode,
                 params):  # pylint: disable=unused-argument
        """The `model_fn` for Estimator."""

        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info(
                "  name = %s, shape = %s" % (name, features[name].shape))

        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]

        is_training = (mode == tf.estimator.ModeKeys.TRAIN)

        (total_loss, per_example_loss, logits) = create_intent_model(
            bert_config, is_training, input_ids, input_mask, segment_ids,
            labels, num_labels, mode, seed)

        tvars = tf.trainable_variables()

        # Use an empty dict so the membership check below works even
        # when no checkpoint is given.
        initialized_variable_names = {}
        if init_checkpoint:
            (assignment_map,
             initialized_variable_names) = \
                modeling.get_assignment_map_from_checkpoint(
                    tvars, init_checkpoint)

            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            init_string = ""
            if var.name in initialized_variable_names:
                init_string = ", *INIT_FROM_CKPT*"
            tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                            init_string)

        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:

            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps)

            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op)

        elif mode == tf.estimator.ModeKeys.EVAL:

            def metric_fn(per_example_loss, labels, logits):
                predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                accuracy = tf.metrics.accuracy(labels, predictions)
                loss = tf.metrics.mean(per_example_loss)
                return {
                    "eval_accuracy": accuracy,
                    "eval_loss": loss
                }

            eval_metrics = metric_fn(per_example_loss, labels, logits)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics)

        elif mode == tf.estimator.ModeKeys.PREDICT:
            predictions = {
                'intent': tf.argmax(logits, axis=-1, output_type=tf.int32),
                'prob': tf.nn.softmax(logits / tf.constant(best_temperature)),
                'logits': logits
            }
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions=predictions)

        return output_spec

    return model_fn

This is the implementation of create_intent_model:

def create_intent_model(bert_config, is_training, input_ids, input_mask,
                        segment_ids,
                        labels, num_labels, mode, seed):
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False,
        seed=seed
    )
    output_layer = model.get_pooled_output()

    hidden_size = output_layer.shape[-1].value

    with tf.variable_scope("loss"):
        output_weights = tf.get_variable(
            "output_weights", [num_labels, hidden_size],
            initializer=tf.truncated_normal_initializer(stddev=0.02, seed=seed))
        output_bias = tf.get_variable(
            "output_bias", [num_labels], initializer=tf.zeros_initializer())

        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9, seed=seed)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)

        loss = None
        per_example_loss = None

        if mode == tf.estimator.ModeKeys.TRAIN or mode == \
                tf.estimator.ModeKeys.EVAL:
            log_probs = tf.nn.log_softmax(logits, axis=-1)

            one_hot_labels = tf.one_hot(labels, depth=num_labels,
                                        dtype=tf.float32)

            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs,
                                              axis=-1)

            loss = tf.reduce_mean(per_example_loss)

        return loss, per_example_loss, logits

These are the TensorFlow-related libraries:

tensorboard==1.15.0
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0

There is good documentation here; however, it uses Keras. Besides, I do not know how to incorporate a preprocessing layer there either, even with Keras.

Again, my final goal is to incorporate the preprocessing steps into the model-building phase, so that when I later load the model I can directly pass "The movie is ok" to it.

I just need an idea of how to incorporate a preprocessing layer into this function-based (non-Keras) code.

Thanks in advance!


1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-07-23 12:57:57

You can use the TextVectorization layer as shown below. But to fully answer your question, I would need to know what is inside the model_fn_builder() function. I will show you how to do this with the Keras model-building API.

import tensorflow as tf

class BertTextProcessor(tf.keras.layers.Layer):

  def __init__(self, max_length):
    super().__init__()
    self.max_length = max_length
    # Disable standardization: by default this layer lower-cases text and
    # removes punctuation, so tokens like [CLS] would become cls.
    self.vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=max_length, standardize=None)

  def call(self, inputs):

    inputs = "[CLS] " + inputs + " [SEP]"
    tok_inputs = self.vectorizer(inputs)

    return {
        "input_ids": tok_inputs, 
        "input_mask": tf.cast(tok_inputs != 0, 'int32'),
        "segment_ids": tf.zeros_like(tok_inputs)
        }

  def adapt(self, data):
    data = "[CLS] " + data + " [SEP]"
    self.vectorizer.adapt(data)

  def get_config(self):
    return {
        "max_length": self.max_length
    }

Usage:

input_str = tf.constant(["movie is okay good plot very nice", "terrible movie bad actors not good"])

proc = BertTextProcessor(8)
# You need to call this so that the vectorizer layer learns the vocabulary
proc.adapt(input_str)
print(proc(input_str))

Output:

{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[ 5,  2, 12,  9,  3,  8,  6, 11,  4,  0],
       [ 5,  7,  2, 13, 14, 10,  3,  4,  0,  0]])>, 'input_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>, 'segment_ids': <tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
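Note how `call` derives both extra features from the padded ids alone: `input_mask` is 1 wherever the id is non-zero, and `segment_ids` is all zeros for single-sentence input. The same element-wise rule in plain Python:

```python
# Padded token ids as produced by the vectorizer (0 = padding slot).
tok_inputs = [5, 2, 12, 9, 0, 0]

input_mask = [int(t != 0) for t in tok_inputs]  # 1 for real tokens, 0 for pad
segment_ids = [0] * len(tok_inputs)             # single sentence: all zeros

print(input_mask)  # [1, 1, 1, 1, 0, 0]
```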

You can use this layer as an input to a Keras model, just like any other layer.

You can also get the vocabulary with proc.vectorizer.get_vocabulary(), which returns:

['',
 '[UNK]',
 'movie',
 'good',
 '[SEP]',
 '[CLS]',
 'very',
 'terrible',
 'plot',
 'okay',
 'not',
 'nice',
 'is',
 'bad',
 'actors']

Alternative: tf-models-official

To get data into the format BERT accepts, you can also use the tf-models-official library. Specifically, you can use the object

I recently updated the code of one of my books; in the classification section you can see how it is used. The part that generates the correct input format for BERT shows how to do this.

Edit: how to do this with tensorflow==1.15.0

To achieve this in TensorFlow 1.x you will need some reworking, because a lot of the functionality used in the original answer is missing there. Here is an example of how to do it; you will need to adapt this code to your specific use case/approach.

lookup_layer = tf.lookup.StaticHashTable(
    tf.lookup.TextFileInitializer(
      "vocab.txt", tf.string, tf.lookup.TextFileIndex.WHOLE_LINE,
      tf.int64, tf.lookup.TextFileIndex.LINE_NUMBER, delimiter=" "),
      100
) 

text = tf.constant(["bad film", "movie is okay good plot very nice", "terrible movie bad actors not good"])
text = "[CLS] " + text + " [SEP]"  # spaces so the split keeps [CLS]/[SEP] as separate tokens
text = tf.strings.split(text, result_type="RaggedTensor")
text_dense = text.to_tensor("[PAD]")

out = lookup_layer.lookup(text_dense)

with tf.Session() as sess:
  sess.run(tf.tables_initializer())
  print(sess.run(out))
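The `StaticHashTable` above maps each line of `vocab.txt` to its line number and returns a default id (100 in this snippet) for out-of-vocabulary tokens. A framework-free sketch of the same lookup, with a made-up vocabulary order:

```python
# Vocabulary order is made up; in the real setup it comes from vocab.txt,
# where each token's id is its line number.
vocab_lines = ["[PAD]", "[CLS]", "[SEP]", "movie", "bad", "film"]
table = {tok: i for i, tok in enumerate(vocab_lines)}
DEFAULT_ID = 100  # returned for tokens missing from the vocabulary

def lookup(tokens):
    return [table.get(t, DEFAULT_ID) for t in tokens]

# Analogue of splitting "[CLS] bad film [SEP]" and padding the ragged
# batch out with "[PAD]" via RaggedTensor.to_tensor("[PAD]").
padded = "[CLS] bad film [SEP]".split() + ["[PAD]"] * 2
print(lookup(padded))  # [1, 4, 5, 2, 0, 0]
```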
Original content from Stack Overflow. Original link: https://stackoverflow.com/questions/72973349