How to implement the Softmax function in Python: answers

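【Question title】: How to implement the Softmax function in Python
【Posted】: 2016-04-30 08:28:10
【Question description】:

From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

$$S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$$

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I tried the following: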

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

which returns:

[ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

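It produces the same output as the first implementation, even though the first implementation explicitly takes the difference between each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which one is more efficient?

【Discussion】:

I'm curious why you tried to implement it this way with the max function. What made you think of it?
I don't know, I thought treating the maximum as 0 would help, sort of like shifting the graph to the left and clipping at 0. Then my range shrinks from (-inf, +inf) to (-inf, 0]. I guess I was overthinking it. Hahaha
I still have one sub-question that doesn't seem to be answered below. What is the significance of axis = 0 in the answer suggested by Udacity?
If you look at the numpy documentation, it discusses what sum(x, axis=0), and similarly axis=1, does. In short, it gives the direction in which to sum an array. In this case, it tells it to sum along the vectors, which corresponds to the denominator in the softmax function.
It's like every other week there is a more correct answer, up to the point where my math isn't good enough to decide who is right =) Could any math whiz who hasn't provided an answer help decide which one is correct?

Tags: python numpy machine-learning logistic-regression softmax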

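【Solution 1】:

They are both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c) we have

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

Which is what the other answer says. You could replace max(x) with any variable and it would cancel out.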
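As a quick numerical check of the cancellation above, here is a minimal sketch that compares the plain formula with shifted versions. The helper names `softmax_plain` and `softmax_shifted` and the chosen constants are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def softmax_plain(x):
    """Direct formula: e^x / sum(e^x)."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_shifted(x, c):
    """Same formula after subtracting an arbitrary constant c from every score."""
    e_x = np.exp(x - c)
    return e_x / e_x.sum()

scores = np.array([3.0, 1.0, 0.2])
reference = softmax_plain(scores)

# The shift cancels between numerator and denominator, so any constant works.
for c in (scores.max(), 0.5, -2.0):
    assert np.allclose(softmax_shifted(scores, c), reference)

print(reference)
```

Choosing c = max(x) is only one convenient option; its benefit is numerical (the largest exponent becomes 0), not mathematical.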

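【Discussion】:

Just reformatting your answer, @TrevorM, for further clarification: e ^ (x - max(x)) / sum(e ^ (x - max(x))); using a^(b - c) = (a^b)/(a^c) we have = e ^ x / {e ^ max(x) * sum(e ^ x / e ^ max(x))} = e ^ x / sum(e ^ x)
@Trevor Merrifield, I don't think the first method has any "unnecessary term". In fact it is better than the second method. I have added this point as a separate answer.
@Shagun You are correct. The two are mathematically equivalent, but I hadn't considered numerical stability.
Hope you don't mind: I edited out "unnecessary term" in case people don't read the comments (or the comments disappear). This page gets quite a bit of traffic from search engines, and this is currently the first answer people see.
I wonder why you subtract max(x) and not max(abs(x)) (fixing the sign after determining the value). If all your values are below zero and very large in absolute value, and only one value (the maximum) is close to zero, subtracting the maximum will not change anything. Wouldn't it still be numerically unstable?

【Solution 2】:

(Well... there is a lot of confusion here, both in the question and in the answers...)

First, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array in the example provided by the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument: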

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

As I said, for a 1-D score array the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

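The results are different: the second one is indeed identical to the one expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually about an implementation detail: the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

whereas here we want to sum over the rows (that is, down each column), hence axis=0. For a 1-D array, the sum along axis 0 and the sum of all the elements happen to be identical, hence your identical results in that case...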
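To make the effect of the axis argument concrete, here is a small sketch on a toy 2-D array. The array `a` is an illustrative assumption and not taken from the quiz.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(a)               # axis=None: one scalar over all elements
per_column = np.sum(a, axis=0)  # collapses the rows, one sum per column
per_row = np.sum(a, axis=1)     # collapses the columns, one sum per row

print(total)       # 21
print(per_column)  # [5 7 9]
print(per_row)     # [ 6 15]
```

With axis=0 each column is normalized separately, which matches the quiz's convention that every column holds one set of scores.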

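The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function; see here for the justification (numerical stability, also pointed out by some other answers here).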

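【Discussion】:

Well, if you are just talking about multi-dimensional arrays: the first solution can easily be fixed by adding the axis argument to both max and sum. However, the first implementation is still better, since you can easily overflow when taking exp.
@LouisYang I'm not following; which is the "first" solution? Which one does not use exp? What else has been modified here other than adding an axis argument?
The first solution refers to the solution from @alvas. The difference is that the suggested solution in alvas's question is missing the part that subtracts the max. This can easily cause overflow; for example, exp(1000) / (exp(1000) + exp(1001)) and exp(-1) / (exp(-1) + exp(0)) are mathematically the same, but the first one will overflow.
@LouisYang still, I'm not sure I understand the necessity of your comment; all of this has already been addressed explicitly in the answer.
@LouisYang please do not let the (subsequent) popularity of the thread fool you, and try to imagine the context in which this answer was offered: a puzzled OP ("both give the same result") and a (still!) accepted answer claiming that "both are correct" (well, they are not). The answer was never meant to be "this is the most correct and efficient way to compute softmax in general"; it was only meant to justify why, in the specific Udacity quiz discussed, the two solutions are not equivalent.

【Solution 3】:

So, this is really a comment on desertnaut's answer, but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that in one place he takes a 1-dimensional input and in another he takes a 2-dimensional input. Let me show you.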

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div

Let's use desertnaut's example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

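You can see that desertnaut's version fails in this situation. (It would not if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)

Now let's use 3 samples, since that is the reason we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.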

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

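This input consists of a batch with 3 samples. But sample one and sample three are essentially the same. We now expect 3 rows of softmax activations, where the first should be the same as the third and also the same as our activation of x1!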

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that this is only the case with my solution.

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)

Additionally, here are the results of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)
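One way to sanity-check the "one row per sample" convention without TensorFlow is to apply a 1-D softmax to each row separately and compare. The sketch below does this under the assumption that rows are samples; the helper names `softmax_1d` and `softmax_rows` are illustrative and not from the answer.

```python
import numpy as np

def softmax_1d(v):
    """Numerically stable softmax of a single 1-D sample."""
    e_v = np.exp(v - np.max(v))
    return e_v / e_v.sum()

def softmax_rows(z):
    """Softmax over the last axis, treating each row as one sample."""
    e_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e_z / e_z.sum(axis=-1, keepdims=True)

x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [1, 2, 3, 6]])

batched = softmax_rows(x2)
# Apply the 1-D version to every row (axis 1) independently.
one_by_one = np.apply_along_axis(softmax_1d, 1, x2)

assert np.allclose(batched, one_by_one)
assert np.allclose(batched.sum(axis=1), 1.0)
print(batched)
```

If the batched version and the per-sample loop agree, identical samples in a batch are guaranteed to produce identical rows.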

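【Discussion】:

That would have been one hell of a comment ;-)
np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True) reaches the same result as your softmax function. The steps with s are unnecessary.
There are so many incorrect/inefficient solutions on this page. Do yourselves a favour and use PabTorre's.
@PabTorre did you mean axis=-1? axis=1 won't work for single-dimensional input
The "s" operations are required to ensure the softmax function is numerically stable. Skipping them may be fine for school projects, but they are invaluable when building models for production.

【Solution 4】:

I would say that while both are correct mathematically, implementation-wise the first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick, which is essentially what you are doing.

【Discussion】:

The effects of catastrophic cancellation cannot be underestimated.

【Solution 5】:

sklearn also offers an implementation of softmax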

from sklearn.utils.extmath import softmax
import numpy as np

x = np.array([[ 0.50839931,  0.49767588,  0.51260159]])
softmax(x)

# output
array([[ 0.3340521 ,  0.33048906,  0.33545884]])

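【Discussion】:

How exactly does this answer the specific question, which is about the implementation itself and not about availability in some third-party library?
I was looking for a third-party implementation to verify the results of both approaches. This is how this comment helps.

【Solution 6】:

From a mathematical point of view, both sides are equal.

And you can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to
$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} / e^{m}}{\sum_j e^{x_j} / e^{m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Notice that this works for any m, because e^m != 0 for all (even complex) numbers

From the point of view of computational complexity they are also equivalent, and both run in O(n) time, where n is the size of the vector.
From the point of view of numerical stability, the first solution is preferred, because e^x grows very fast and overflows even for fairly small values of x. Subtracting the maximum value gets rid of this overflow. To experience in practice what I am talking about, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities; the other will overflow and return nan.
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0).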
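The experiment suggested above can be sketched as follows. The names `softmax_naive` and `softmax_stable` are illustrative assumptions, and `np.errstate` is used only to silence the warnings the naive version would otherwise emit.

```python
import numpy as np

def softmax_naive(x):
    """Exponentiates the raw scores, so large inputs overflow."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    """Subtracts the max first, so the largest exponent is exactly 0."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1000.0, 5.0])

# Silence the overflow / invalid-value warnings so only the results are shown.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)

stable = softmax_stable(x)

print(naive)   # first entry is nan because exp(1000) overflows to inf
print(stable)
```

In float64, exp(1000) is inf, so the naive version computes inf / inf for the first entry, while the stable version only ever exponentiates non-positive numbers.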

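【Discussion】:

When is it useful to be able to calculate softmax on a matrix rather than on a vector? i.e. what models output a matrix? Can it have even more dimensions?
Did you mean the first solution in "from the point of view of numerical stability, the second solution is preferred..."?

【Solution 7】:

Edit. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function that applies softmax along an arbitrary axis: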

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats. 
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the 
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter, 
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

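Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.

【Discussion】:

【Solution 8】:

Here you can find out why they used - max.

From there:

"When you're writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

【Discussion】:

【Solution 9】:

A more concise version is: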

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

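【Discussion】:

This can run into arithmetic overflow

【Solution 10】:

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude, such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end, where you can trust that the result will be well-behaved.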

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))
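For readers without scipy, the same log-space idea can be sketched with numpy alone. This is a hand-rolled log-sum-exp, not scipy's implementation; the `axis` parameter and the function name `log_softmax` are assumptions added for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Log of the softmax, computed without leaving log space."""
    m = np.max(x, axis=axis, keepdims=True)
    # Identity used: log(sum(exp(x))) = m + log(sum(exp(x - m)))
    lse = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return x - lse

def softmax(x, axis=-1):
    """Exponentiate only at the very end."""
    return np.exp(log_softmax(x, axis=axis))

x = np.array([[1000.0, 1001.0],
              [-1000.0, -1001.0]])
print(softmax(x))
```

Keeping `log_softmax` as a separate function is convenient when the next step is a log-likelihood, since the final exponentiation can then be skipped entirely.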

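【Discussion】:

To make it equal to the poster's code, you need to add axis=0 as an argument to logsumexp.
Alternatively, one could unpack extra args to pass on to logsumexp.

【Solution 11】:

I was curious to see the performance difference between these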

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)



x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]

Using

print("----- softmax")
%timeit  a=softmax(x)
print("----- softmaxv2")
%timeit  a=softmaxv2(x)
print("----- softmaxv3")
%timeit  a=softmaxv2(x)
print("----- softmaxv4")
%timeit  a=softmaxv2(x)

Increasing the values inside x (+100 +200 +500...), I consistently get better results with the original numpy version (this is just one test):

----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop

Until.... the values inside x reach ~800, then I get

----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
  after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
  after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop

As some said, your version is more numerically stable "for large numbers". For small numbers it could be the other way around.
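Note that the `%timeit` cell above calls softmaxv2 under the softmaxv3 and softmaxv4 labels as well, so three of the four timings measure the same function. A hedged sketch of the same comparison with the standard `timeit` module, calling each variant by its own name, could look like this; the repeat counts are arbitrary assumptions and the numbers will differ between machines.

```python
import timeit
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)

x = [10, 10, 18, 9, 15, 3, 1, 2, 1, 10, 10, 10, 8, 15]

n_calls = 2_000
timings = {}
for fn in (softmax, softmaxv2, softmaxv3, softmaxv4):
    # Best of 3 repeats of n_calls calls, reported in microseconds per call.
    best = min(timeit.repeat(lambda: fn(x), number=n_calls, repeat=3))
    timings[fn.__name__] = best / n_calls * 1e6

for name, micros in timings.items():
    print(f"{name}: {micros:.1f} µs per call")
```

For an input this small, most of the time is fixed per-call overhead, so differences of a few microseconds say little about behaviour on large arrays.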

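【Discussion】:

【Solution 12】:

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases: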

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

logits = np.asarray([
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921], # 1
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921]  # 2
])

print(softmax(logits))

#[[0.2492037  0.24858153 0.25393605 0.24827873]
# [0.2492037  0.24858153 0.25393605 0.24827873]]

Reference: Tensorflow softmax
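One caveat worth illustrating: the snippet above shifts by the max of the whole array rather than by the max along the normalized axis. The result is mathematically the same, but rows far below the global max can underflow. The sketch below contrasts the two choices; both function names and the test logits are illustrative assumptions.

```python
import numpy as np

def softmax_global_max(x, axis=-1):
    """Shifts by the max of the whole array (as in the snippet above)."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def softmax_axis_max(x, axis=-1):
    """Shifts each slice by its own max along `axis`."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

logits = np.array([[0.0, 1.0],
                   [1000.0, 1001.0]])

# Row 0 is shifted by 1001 in the global version, so both exponentials
# underflow to 0 and the division becomes 0 / 0.
with np.errstate(invalid="ignore"):
    global_result = softmax_global_max(logits)

axis_result = softmax_axis_max(logits)

print(global_result)
print(axis_result)
```

For logits of similar magnitude across the batch, as in the example above, both versions agree; the per-axis max only matters when rows differ by hundreds.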

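【Discussion】:

Just keep in mind that the answer refers to the very specific setting described in the question; it was never meant to be "how to compute the softmax in general under any circumstances, or in the data format of your liking"...
I see. I put this here because the question refers to "Udacity's deep learning class", and it would not work if you are using Tensorflow to build your model. Your solution is cool and clean, but it only works in a very specific scenario. Thanks anyway.

【Solution 13】:

I would suggest this: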

def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))

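It works for stochastic (single-sample) as well as batch input.
For more detail see: https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d

【Discussion】:

【Solution 14】:

In order to maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

def softmax(x):
    # Work on a float copy so the caller's array is not modified in place.
    x = np.array(x, dtype=float)
    if len(x.shape) > 1:
        # 2-D input: normalize each row separately
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # 1-D input: normalize the whole vector
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp
    return x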

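【Discussion】:

【Solution 15】:

This has already been answered in much detail in the answers above. The max is subtracted to avoid overflow. I am adding here one more implementation in python3.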

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

【Discussion】:

【Solution 16】:

Everybody seems to post their solution, so I'll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get exactly the same results as the one imported from sklearn:

from sklearn.utils.extmath import softmax

【Discussion】:

【Solution 17】:

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

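【Discussion】:

Welcome to SO. An explanation of how your code answers the question is always helpful.

【Solution 18】:

Based on all the responses and the CS231n notes, allow me to summarise:

def softmax(x, axis):
    # Shift by the per-axis max without modifying the caller's array in place
    x = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x)  # computed once and reused
    return e_x / e_x.sum(axis=axis, keepdims=True)

Usage: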

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8], 
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])

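【Discussion】:

【Solution 19】:

I would like to add a little more understanding of the problem. Here it is correct to subtract the max of the array. But if you run the code in the other post, you will find that it does not give the right answer when the array is 2-D or higher-dimensional.

Here I give you some suggestions:

To get the max, try to do it along the x-axis; you will get a 1-D array.
Reshape your max array to the original shape.
Use np.exp to get the exponential values.
Do np.sum along the axis.
Get the final results.

Follow these steps in a vectorized way and you will get the correct answer. Since this is related to college homework, I cannot post the exact code here, but I would be happy to give more suggestions if something is unclear.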
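The steps above can be sketched as follows. This is one possible reading, not the answerer's own code: it assumes a 2-D input whose rows are samples and interprets "along the x-axis" as axis=1. The function name `softmax_steps` is hypothetical.

```python
import numpy as np

def softmax_steps(x):
    """Row-wise softmax of a 2-D array, following the listed steps."""
    # 1. Max along each row: a 1-D array with one entry per row.
    row_max = np.max(x, axis=1)
    # 2. Reshape to a column so it broadcasts against the original shape.
    row_max = row_max.reshape(-1, 1)
    # 3. Exponentiate the shifted scores.
    e_x = np.exp(x - row_max)
    # 4. Sum along the same axis and reshape the same way.
    row_sum = np.sum(e_x, axis=1).reshape(-1, 1)
    # 5. Final result: each row sums to 1.
    return e_x / row_sum

x = np.array([[1, 2, 3, 6],
              [2, 4, 5, 6]])
print(softmax_steps(x))
```

If the samples were stored as columns instead, the same steps would apply with axis=0 and a reshape to a row, `reshape(1, -1)`.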

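【Discussion】:

It is not related to any college homework, only to an ungraded practice quiz in a non-accredited course, where the correct answer is provided in the next step...

【Solution 20】:

The goal was to achieve similar results using Numpy and Tensorflow. The only change from the original answer is the axis parameter of the np.sum API.

Initial approach: axis=0. This, however, does not provide the intended results when the input has N dimensions.

Modified approach: axis=len(e_x.shape)-1. Always sum over the last dimension. This provides results similar to tensorflow's softmax function.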

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts per slice
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)
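When summing over the last dimension, the shape of the sum decides whether the division broadcasts as intended. The following sketch, on an illustrative 3 x 4 array that is not from the answer, shows the difference between dropping and keeping the summed axis.

```python
import numpy as np

e_x = np.exp(np.array([[1.0, 2.0, 3.0, 6.0],
                       [2.0, 4.0, 5.0, 6.0],
                       [1.0, 2.0, 3.0, 6.0]]))

flat_sum = e_x.sum(axis=-1)                 # shape (3,)
kept_sum = e_x.sum(axis=-1, keepdims=True)  # shape (3, 1)

print(flat_sum.shape, kept_sum.shape)

# (3, 4) / (3, 1) broadcasts one divisor per row.
probs = e_x / kept_sum

# (3, 4) / (3,) cannot be broadcast, because the trailing sizes 4 and 3 differ.
try:
    e_x / flat_sum
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(broadcast_failed)
```

For a square input the division without keepdims would not raise at all; it would silently divide column j by the sum of row j, which is harder to notice.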

【Discussion】:

【Solution 21】:

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

Softmax using tensorflow:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy):

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

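【Discussion】:

【Solution 22】:

The softmax function is an activation function that turns numbers into probabilities that sum to one. It outputs a vector that represents the probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.

The softmax function is used when we have multiple classes.

It is useful for finding the class with the maximum probability.

The softmax function is ideally used in the output layer, where we are actually trying to obtain the probabilities that define the class of each input.

It ranges from 0 to 1.

The softmax function turns logits [2.0, 1.0, 0.1] into probabilities of approximately [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is, in fact, a smooth approximation of the arg max function: it returns neither the largest value of the input nor a hard index, but a probability vector whose mass concentrates at the position of the largest value.

For example:

Before softmax

X = [13, 31, 5]

After softmax

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])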
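As a hedged illustration of how softmax relates to a hard arg max on this example, the sketch below compares the two and then scales the scores up. The scaling factor of 100 is an arbitrary assumption chosen to make the effect visible.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D score vector."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

X = np.array([13.0, 31.0, 5.0])

probs = softmax(X)
print(probs)             # a probability vector, not an index
print(np.argmax(X))      # the hard arg max: position of the largest score
print(np.argmax(probs))  # softmax keeps the ordering, so the position agrees

# Scaling the scores up pushes softmax towards a one-hot vector at the arg max.
sharp = softmax(100.0 * X)
print(sharp)
```

Softmax is differentiable everywhere, which is why it is used for training, while the hard arg max is typically applied only when reading off the predicted class.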

Code:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

【Discussion】:

【Solution 23】:

This also works with np.reshape.

def softmax(scores):
    """
    Compute softmax scores given the raw output from the model

    :param scores: raw scores from the model (N, num_classes)
    :return:
        prob: softmax probabilities (N, num_classes)
    """
    # Subtract the largest score in each row, see
    # https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/
    exponential = np.exp(scores - np.max(scores, axis=1).reshape(-1, 1))
    prob = exponential / exponential.sum(axis=1).reshape(-1, 1)
    return prob

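【Discussion】:

【Solution 24】:

The purpose of the softmax function is to preserve the ratio of the vector components, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistic)). This is because it preserves more information about the rate of change at the end-points and is thus more applicable to neural nets with 1-of-N output encoding (i.e. if we squashed the end-points, it would be harder to differentiate the 1-of-N output classes, because we could not tell which one is the "biggest" or "smallest" once they were squashed). It also makes the total output sum to 1: a clear winner will be closer to 1, while other numbers that are close to each other will each come out near 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you compute the e^y exponents you may get very high values that clip the float at its maximum value, leading to a tie, which is not the case in this example. This becomes a big problem if you subtract the max value to get a negative number: you then have a negative exponent that rapidly shrinks the values and alters the ratio, which is what occurred in the poster's question and yielded the incorrect answer.

The answer supplied by Udacity is horribly inefficient. The first thing we need to do is calculate e^y_j for all vector components, keep those values, then sum them up, and divide. Where Udacity messed up is that they calculate e^y_j twice! Here is the correct answer: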

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

【Discussion】:

【Solution 25】:

This generalizes and assumes you are normalizing the trailing dimension.

def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - np.max(x, axis=-1)[..., None])
    e_y = e_x.sum(axis=-1)[..., None]
    return e_x / e_y

【Discussion】: