Q: In Keras with SGD, why does model.fit() train smoothly while a step-by-step training method gives exploding gradients and loss?

Stack Overflow user

Asked on 2021-08-06 12:36:10

Answers: 1, Views: 46, Followers: 0, Votes: 0

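Because the exploding gradients and loss only appear when the network is large, I am not posting the whole network here. I have done my best, though: over the past two weeks I dug into every detail of the source code to monitor some of the weights, and I wrote the update step by hand so that I could track the loss, weights, updates, gradients, and hyperparameters and compare them with the internal state. I believe I did my homework before asking.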

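The issue is that the Keras API offers two ways to train: model.fit(), and a more customizable approach intended for more complex training and networks. Although I kept almost everything the same, model.fit() shows no exploding loss, while the custom approach does. Interestingly, when I monitored many details on a much smaller network, the two approaches looked identical.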

Environment:

# tensorflow 1.14
import numpy as np  # used below for np.random.randint
import tensorflow as tf
from tensorflow.keras import backend as K

For the model.fit() approach:

# I have omitted the details of the next two lines because I can't share them, but x is [10000, 32, 32, 3] image data, y is [10000, 10, 1] labels, and model is a regular Keras model.

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

loss_fn = tf.keras.losses.CategoricalCrossentropy()
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

model.compile(loss=loss_fn, optimizer=sgd, metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=128, epochs=100, validation_data=(x_test, y_test))

Custom approach:

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

input = model.inputs[0]
y_true = tf.placeholder(dtype = tf.int32, shape = [None, 10])
y_pred = model.outputs[0]

loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
weights = model.trainable_weights
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

training_updates = sgd.get_updates(loss, weights)
training_fn = K.function([y_true, input], [loss], training_updates)

num_train = 10000
steps_per_epoch = int(num_train / 128) # batch size 128
total_steps = steps_per_epoch * 100 # epoch 100

for step in range(total_steps):
    idx = np.random.randint(0, 10000, 128)
    input_img = x_train[idx]
    ground_true = y_train[idx]

    cur_loss = training_fn([ground_true, input_img])

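In short: the same model, the same loss function, the same SGD optimizer, and the same image feed (I do control the order in which images are fed, even though the code here picks them at random from the training data). Is there anything inside model.fit() that prevents the loss or the gradients from exploding?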
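As a side note on the "same image feed" point: the custom loop draws indices with np.random.randint, which samples with replacement, whereas model.fit() with NumPy arrays by default shuffles the data once per epoch and then walks through it, including a smaller final batch. The framework-free sketch below contrasts the two schemes. The helper names are hypothetical, and this difference is not the cause identified in the accepted answer; it is only something to rule out when comparing the two runs.

```python
import random


def sample_with_replacement(num_examples, batch_size, rng):
    """Mimics np.random.randint(0, num_examples, batch_size): indices may repeat."""
    return [rng.randrange(num_examples) for _ in range(batch_size)]


def epoch_batches(num_examples, batch_size, rng):
    """Yields index batches from one shuffled pass, so each example appears exactly once."""
    order = list(range(num_examples))
    rng.shuffle(order)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)

# Scheme used by the custom loop in the question (hypothetical re-implementation).
random_batch = sample_with_replacement(10000, 128, rng)

# Scheme resembling one shuffled epoch over the same data.
batches = list(epoch_batches(10000, 128, rng))
num_batches = len(batches)          # 78 full batches plus one partial batch
last_batch_size = len(batches[-1])  # 10000 - 78 * 128 examples remain
seen = sorted(i for batch in batches for i in batch)
```

With 10000 examples and a batch size of 128, one shuffled epoch has 79 batches, while int(num_train / 128) in the question gives 78 steps per epoch.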

keras

deep-learning

gradient-exploding

python

tensorflow

1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-08-29 15:09:11

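After digging into the source code, I found the cause of the exploding gradients. The corrected code, with minimal changes, is shown below: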

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

input = model.inputs[0]
y_true = tf.placeholder(dtype = tf.int32, shape = [None, 10])
y_pred = model.outputs[0]

loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
weights = model.trainable_weights
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

training_updates = sgd.get_updates(loss, weights)

# Correct:
training_fn = K.function([y_true, input, K.symbolic_learning_phase()], [loss], training_updates)

# Before:
# training_fn = K.function([y_true, input], [loss], training_updates)

num_train = 10000
steps_per_epoch = int(num_train / 128) # batch size 128
total_steps = steps_per_epoch * 100 # epoch 100

for step in range(total_steps):
    idx = np.random.randint(0, 10000, 128)
    input_img = x_train[idx]
    ground_true = y_train[idx]

    # Correct:
    cur_loss = training_fn([ground_true, input_img, True])

    # Before:
    # cur_loss = training_fn([ground_true, input_img])

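My understanding of this special tensor, K.symbolic_learning_phase(), is that it defaults to False (you can see this where it is initialized in the source code), and that layers such as BatchNormalization and Dropout behave differently in the training and test phases. In this case the BatchNormalization layers were what caused the exploding gradients (several other posts also report exploding gradients with BatchNormalization layers). Their two trainable weights, batch_normalization_1/gamma:0 and batch_normalization_1/beta:0, depend on this tensor; with the default value False they did not learn, and their values quickly became nan during training.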
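To make the training/inference difference concrete, here is a minimal framework-free sketch of what a batch-normalization layer computes in each mode. It is an illustration under stated assumptions, not Keras' implementation: the function name and the sample activations are hypothetical, and the defaults mirror a freshly created Keras BatchNormalization layer (moving mean 0, moving variance 1, epsilon 1e-3). In inference mode the layer normalizes with its stored moving statistics rather than the current batch statistics, so with untouched moving statistics large activations pass through almost unchanged. That may be part of why training with the learning phase left at False becomes unstable in a deep network.

```python
import math
import statistics

EPSILON = 1e-3  # numerical-stability constant; assumed to match Keras' default


def batch_norm(values, training, moving_mean=0.0, moving_var=1.0, gamma=1.0, beta=0.0):
    """Normalize one feature across a batch.

    training=True  -> use the mean and variance of the current batch
    training=False -> use the stored moving statistics
    """
    if training:
        mean = statistics.mean(values)
        var = statistics.pvariance(values)
    else:
        mean, var = moving_mean, moving_var
    return [gamma * (v - mean) / math.sqrt(var + EPSILON) + beta for v in values]


# Hypothetical large pre-normalization activations for a single feature.
activations = [50.0, 60.0, 70.0, 80.0]

train_out = batch_norm(activations, training=True)   # centered, roughly unit scale
infer_out = batch_norm(activations, training=False)  # moving stats still at 0 and 1
```

In training mode the outputs are centered around zero with magnitude near 1, while in inference mode they stay close to the raw values of 50 to 80.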

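I noticed that Keras code that uses this training_updates approach does not actually put K.symbolic_learning_phase() into the code explicitly; however, this is what the Keras API does behind the scenes.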
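The "behind the scenes" pattern can be sketched as a small wrapper that appends the learning-phase flag on the caller's behalf. This is only an illustration of the idea, not Keras' actual internals: with_training_phase and fake_backend_fn are hypothetical names, and the fake function merely stands in for the K.function built above so that the snippet runs without TensorFlow.

```python
def with_training_phase(backend_fn):
    """Wraps a function expecting [*inputs, learning_phase] so callers pass only the inputs."""
    def train_step(inputs):
        # Always run in training mode by appending True as the last feed value.
        return backend_fn(list(inputs) + [True])
    return train_step


def fake_backend_fn(inputs):
    """Hypothetical stand-in for K.function([y_true, input, learning_phase], ...)."""
    *data, learning_phase = inputs
    return {"num_data_inputs": len(data), "learning_phase": learning_phase}


train_step = with_training_phase(fake_backend_fn)
result = train_step(["labels", "images"])
```

Wrapping the function once means the training loop cannot forget the flag, which is the mistake the original custom loop made.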

Votes: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68687237
