(Header image from Wikipedia: Sigmoid function)
Someone brought up this question today: why do deep-learning classification models nowadays generally use Softmax in the final output layer rather than a plain Sigmoid?
Googling turned up two relevant answers.
Sigmoid + cross-entropy (eq.57) follows the Bernoulli distribution, while softmax + log-likelihood (eq.80) follows the multinomial distribution with one observation (which is a multiclass version of the Bernoulli). For binary classification problems, the softmax function outputs two values (between 0 and 1 and sum up to 1), to represent the probabilities of each class. While the sigmoid function outputs one value between 0 and 1, to represent the probability of one class (so the probability of the other class is just 1-p). dontloo ( neural networks )
Sigmoid + cross-entropy yields a Bernoulli distribution,
while Softmax yields a multinomial distribution with a single observation (a multiclass generalization of the Bernoulli).
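Concretely, the two distributions are (standard definitions):

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p \qquad \text{(Bernoulli)}$$

$$P(X = k) = p_k, \qquad \sum_{k=1}^{K} p_k = 1 \qquad \text{(multinomial with one observation)}$$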
For a binary classification problem, Softmax outputs two values, and they sum to 1.
Sigmoid can also be used to output two values, but there is no such additivity: each value is independently some number between 0 and 1, and for a given value p, the corresponding complementary probability is simply 1 - p.
For example:
Suppose we want to predict whether something is or is not the case. We could encode it like this:
the output (0, 1) means "yes", and the output (1, 0) means "no".
Softmax might output (0.3, 0.7), meaning the model thinks the probability of "yes" is 0.7 and the probability of "no" is 0.3; the two sum to 1.
Sigmoid might output (0.4, 0.8), which do not sum to 1. The reading is that the model thinks the first position is 1 with probability 0.4 (so not 1 with probability 0.6, i.e. 1 - p), and the second position is 1 with probability 0.8 (not 1 with probability 0.2).
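To make the contrast concrete, here is a minimal NumPy sketch (mine, not from the answers above; the logit values are made up for illustration). Softmax normalizes the two scores into probabilities that sum to 1, while applying a sigmoid to each score independently gives two numbers whose sum is unconstrained.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two arbitrary logits for the ("no", "yes") positions (illustrative values only).
z = np.array([0.2, 1.05])

p_soft = softmax(z)   # about [0.30, 0.70]; the entries always sum to 1
p_sig = sigmoid(z)    # each entry is in (0, 1), but the sum is unconstrained

print(p_soft, p_soft.sum())  # sum is exactly 1
print(p_sig, p_sig.sum())    # sum is about 1.29 here, not 1
```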
Geoff Hinton covered exactly this topic in his coursera course on neural nets. The problem with sigmoids is that as you reach saturation (values get close to 1 or 0), the gradients vanish. This is detrimental to optimization speed. Softmax doesn't have this problem, and in fact if you combine softmax with a cross entropy error function the gradients are just (z-y), as they would be for a linear output with least squares error. nkorslund ( https://www.reddit.com/r/MachineLearning/comments/32iyt9/question_comparison_between_softmax_and_sigmoid/ )
This answer notes that Hinton covered exactly this topic in his Coursera course on neural nets; unfortunately I never took it (though the course is being prepared for a re-run in September 2016, https://www.coursera.org ). Hinton's point is that when a Sigmoid output gets close to 1 or 0, the gradient vanishes, which severely hurts optimization speed, whereas Softmax does not have this problem.
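The quote's (z - y) reads as (output minus target); in terms of the logits, the gradient of softmax + cross-entropy is softmax(logits) - y. The following NumPy sketch (mine, with arbitrary example logits) checks that claim numerically, and also prints the sigmoid derivative s(x)(1 - s(x)) at increasingly large inputs to show it collapsing toward zero as the unit saturates.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # Cross-entropy between a one-hot target y and softmax(z).
    return -np.sum(y * np.log(softmax(z)))

# Arbitrary example logits and a one-hot target for class 1.
z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 1.0, 0.0])

# Central-difference numerical gradient of the loss w.r.t. the logits.
eps = 1e-6
num_grad = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y)
     - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(num_grad)        # about [ 0.25, -0.94,  0.69]
print(softmax(z) - y)  # matches: the analytic gradient is softmax(z) - y

# Sigmoid saturation: the derivative s(x) * (1 - s(x)) shrinks toward 0
# once the input is driven far from the midpoint.
s = lambda x: 1.0 / (1.0 + np.exp(-x))
for x in (0.0, 5.0, 10.0):
    print(x, s(x) * (1 - s(x)))  # 0.25, ~6.6e-3, ~4.5e-5
```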