
Using side features: feature preprocessing

By XianxinMao · Last edited 2021-07-30

One of the great advantages of using a deep learning framework to build recommender models is the freedom to build rich, flexible feature representations.

These need to be appropriately transformed in order to be useful in building models:

  • User and item ids have to be translated into embedding vectors: high-dimensional numerical representations that are adjusted during training to help the model predict its objective better.
  • Raw text needs to be tokenized (split into smaller parts such as individual words) and translated into embeddings.
  • Numerical features need to be normalized so that their values lie in a small interval around 0.

The MovieLens dataset

Let's first have a look at what features we can use from the MovieLens dataset:

import pprint

import tensorflow_datasets as tfds

ratings = tfds.load("movielens/100k-ratings", split="train")

for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

There are a couple of key features here:

  • Movie title is useful as a movie identifier.
  • User id is useful as a user identifier.
  • Timestamps will allow us to model the effect of time.

The first two are categorical features; timestamps are a continuous feature.

Turning categorical features into embeddings

A categorical feature is a feature that does not express a continuous quantity, but rather takes on one of a set of fixed values.

Most deep learning models express these features by turning them into high-dimensional vectors. During model training, the value of that vector is adjusted to help the model predict its objective better.

For example, suppose that our goal is to predict which user is going to watch which movie. To do that, we represent each user and each movie by an embedding vector. Initially, these embeddings will take on random values, but during training we will adjust them so that the embeddings of users and the movies they watch end up closer together.

Taking raw categorical features and turning them into embeddings is normally a two-step process:

  1. Firstly, we need to translate the raw values into a range of contiguous integers, normally by building a mapping (called a "vocabulary") that maps raw values ("Star Wars") to integers (say, 15).
  2. Secondly, we need to take these integers and turn them into embeddings.

Defining the vocabulary

The first step is to define a vocabulary. We can do this easily using Keras preprocessing layers.

import numpy as np
import tensorflow as tf

movie_title_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()

The layer itself does not have a vocabulary yet, but we can build it using our data.

movie_title_lookup.adapt(ratings.map(lambda x: x["movie_title"]))

print(f"Vocabulary: {movie_title_lookup.get_vocabulary()[:3]}")

Once we have this we can use the layer to translate raw tokens to embedding ids:

movie_title_lookup(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])

Note that the layer's vocabulary includes one (or more!) unknown (or "out of vocabulary", OOV) tokens. This is really handy: it means that the layer can handle categorical values that are not in the vocabulary. In practical terms, this means that the model can continue to learn about and make recommendations even using features that have not been seen during vocabulary construction.
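
As a quick check, here is a minimal sketch (using a made-up title that is certainly not in the dataset) confirming that unseen values fall into the OOV bucket instead of raising an error:

# A hypothetical title absent from the vocabulary maps to the "[UNK]" OOV entry.
movie_title_lookup(["A Movie That Does Not Exist (2099)"])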

Using feature hashing

We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the tf.keras.layers.experimental.preprocessing.Hashing layer.

# We set up a large number of bins to reduce the chance of hash collisions.
num_hashing_bins = 200_000

movie_title_hashing = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins
)
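
To get a feel for the trade-off, here is a rough birthday-problem estimate (MovieLens 100K contains 1,682 distinct movies) of how many pairs of titles will end up sharing a bin:

n_titles = 1_682  # distinct movies in MovieLens 100K
# Expected number of colliding pairs under uniform hashing: n^2 / (2 * bins).
print(n_titles ** 2 / (2 * num_hashing_bins))  # around 7

A handful of collisions among ~1,700 titles is usually tolerable; adding bins reduces collisions at the cost of a larger embedding table.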

We can do the lookup as before without the need to build vocabularies:

movie_title_hashing(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
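
Because the layer is just a hash function, it also copes with values it has never seen anywhere; a minimal sketch with a hypothetical title:

# No adapt() step was needed: any string, seen or unseen, maps
# deterministically to one of the num_hashing_bins ids.
movie_title_hashing(["A Movie That Does Not Exist (2099)"])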

Defining the embeddings

Now that we have integer ids, we can use the Embedding layer to turn those into embeddings.

An embedding layer has two dimensions: the first dimension tells us how many distinct categories we can embed; the second tells us how large the vector representing each of them can be.

When creating the embedding layer for movie titles, we are going to set the first value to the size of our title vocabulary (or the number of hashing bins). The second is up to us: the larger it is, the higher the capacity of the model, but the slower it is to fit and serve.

movie_title_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=movie_title_lookup.vocab_size(),
    output_dim=32
)
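
The capacity cost mentioned above is easy to quantify: the embedding table stores one trainable row per category, so parameters grow linearly in both the vocabulary size and the output dimension. A minimal sketch (the dimensions here are illustrative, not a recommendation):

# Each candidate output_dim multiplies the whole table size.
vocab_size = movie_title_lookup.vocab_size()
for dim in (16, 32, 64):
  print(f"output_dim={dim}: {vocab_size * dim} trainable parameters")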

We can put the two together into a single layer which takes raw text in and yields embeddings.

movie_title_model = tf.keras.Sequential([movie_title_lookup, movie_title_embedding])

Just like that, we can directly get the embeddings for our movie titles:

movie_title_model(["Star Wars (1977)"])

We can do the same with user embeddings:

user_id_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
user_id_lookup.adapt(ratings.map(lambda x: x["user_id"]))

user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32)

user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])

Normalizing continuous features

Continuous features also need normalization. For example, the timestamp feature is far too large to be used directly in a deep model:

for x in ratings.take(3).as_numpy_iterator():
  print(f"Timestamp: {x['timestamp']}.")

We need to process it before we can use it. While there are many ways in which we can do this, discretization and standardization are two common ones.

Standardization

Standardization rescales features to normalize their range by subtracting the feature's mean and dividing by its standard deviation. It is a common preprocessing transformation.

This can be easily accomplished using the tf.keras.layers.experimental.preprocessing.Normalization layer:

timestamp_normalization = tf.keras.layers.experimental.preprocessing.Normalization()
timestamp_normalization.adapt(ratings.map(lambda x: x["timestamp"]).batch(1024))

for x in ratings.take(3).as_numpy_iterator():
  print(f"Normalized timestamp: {timestamp_normalization(x['timestamp'])}.")

Discretization

Another common transformation is to turn a continuous feature into a number of categorical features. This makes good sense if we have reasons to suspect that a feature's effect is non-continuous.

To do this, we first need to establish the boundaries of the buckets we will use for discretization. The easiest way is to identify the minimum and maximum value of the feature, and divide the resulting interval equally:

max_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    tf.cast(0, tf.int64), tf.maximum).numpy().max()
min_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    np.int64(1e9), tf.minimum).numpy().min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000)

print(f"Buckets: {timestamp_buckets[:3]}")
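
To confirm the bucketing behaves as expected, here is a minimal sketch using np.digitize, which assigns values to buckets the same way the Discretization layer does with these boundaries; the midpoint of the time range should land near bucket 500:

midpoint = (min_timestamp + max_timestamp) // 2
print(np.digitize(midpoint, timestamp_buckets))  # roughly 500 of 1000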

Given the bucket boundaries we can transform timestamps into embeddings:

timestamp_embedding_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
  tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32)
])

for timestamp in ratings.take(1).map(lambda x: x["timestamp"]).batch(1).as_numpy_iterator():
  print(f"Timestamp embedding: {timestamp_embedding_model(timestamp)}.")

Processing text features

We may also want to add text features to our model. Usually, things like product descriptions are free-form text, and we can hope that our model can learn to use the information they contain to make better recommendations, especially in a cold-start or long-tail scenario.

While the MovieLens dataset does not give us rich textual features, we can still use movie titles. This may help us capture the fact that movies with very similar titles are likely to belong to the same series.

The first transformation we need to apply to text is tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.

The Keras tf.keras.layers.experimental.preprocessing.TextVectorization layer can do the first two steps for us:

title_text = tf.keras.layers.experimental.preprocessing.TextVectorization()
title_text.adapt(ratings.map(lambda x: x["movie_title"]))

Let's try it out:

for row in ratings.batch(1).map(lambda x: x["movie_title"]).take(1):
  print(title_text(row))

Each title is translated into a sequence of tokens, one for each piece we've tokenized.
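
To see which words those ids stand for, a minimal sketch can map them back through the layer's own vocabulary:

vocab = title_text.get_vocabulary()
for row in ratings.batch(1).map(lambda x: x["movie_title"]).take(1):
  # Invert the lookup: token id -> word string.
  print([vocab[token] for token in title_text(row)[0].numpy()])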

We can check the learned vocabulary to verify that the layer is using the correct tokenization:

title_text.get_vocabulary()[40:45]

This looks correct: the layer is tokenizing titles into individual words.

To finish the processing, we now need to embed the text. Because each title contains multiple words, we will get multiple embeddings for each title. For use in a downstream model these are usually compressed into a single embedding. Models like RNNs or Transformers are useful here, but averaging all the words' embeddings together is a good starting point.
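
To make the averaging step concrete before building the full model, here is a minimal sketch that reuses the adapted title_text layer, embeds each word, and pools the word vectors into a single vector per title:

title_pooling_demo = tf.keras.Sequential([
  title_text,
  tf.keras.layers.Embedding(input_dim=len(title_text.get_vocabulary()),
                            output_dim=32, mask_zero=True),
  # mask_zero=True makes the pooling ignore padding tokens, so each title
  # is averaged over its real words only.
  tf.keras.layers.GlobalAveragePooling1D(),
])
print(title_pooling_demo(["Star Wars (1977)"]).shape)  # (1, 32)

This is the same structure the MovieModel below uses, only with a fresh TextVectorization layer capped at max_tokens.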

Putting it all together

With these components in place, we can build a model that does all the preprocessing together.

User model

The full user model may look like the following:

class UserModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    self.user_embedding = tf.keras.Sequential([
        user_id_lookup,
        tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),
    ])
    self.timestamp_embedding = tf.keras.Sequential([
      tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
      tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32)
    ])
    self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()

  def call(self, inputs):
    # Take the input dictionary, pass it through each input layer,
    # and concatenate the result.
    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        self.normalized_timestamp(inputs["timestamp"])
    ], axis=1)

Let's try it out:

user_model = UserModel()

user_model.normalized_timestamp.adapt(
    ratings.map(lambda x: x["timestamp"]).batch(128))

for row in ratings.batch(1).take(1):
  print(f"Computed representations: {user_model(row)[0, :3]}")

Movie model

We can do the same for the movie model:

class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      movie_title_lookup,
      tf.keras.layers.Embedding(movie_title_lookup.vocab_size(), 32)
    ])
    self.title_text_embedding = tf.keras.Sequential([
      tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=max_tokens),
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      # We average the embedding of individual words to get one embedding vector
      # per title.
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

  def call(self, inputs):
    return tf.concat([
        self.title_embedding(inputs["movie_title"]),
        self.title_text_embedding(inputs["movie_title"]),
    ], axis=1)

Let's try it out:

movie_model = MovieModel()

movie_model.title_text_embedding.layers[0].adapt(
    ratings.map(lambda x: x["movie_title"]))

for row in ratings.batch(1).take(1):
  print(f"Computed representations: {movie_model(row)[0, :3]}")

Notebook code: https://codechina.csdn.net/csdn_codechina/enterprise_technology/-/blob/master/NLP_recommend/Using%20side%20features:%20feature%20preprocessing.ipynb
