Applying Sklearn Gaussian Mixture algorithm to fit GM curves (answers)

【Question title】: Applying Sklearn Gaussian Mixture algorithm to fit GM curves
【Posted】: 2017-03-04 02:54:56
【Question description】:

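I have been looking at the Sklearn library, and it seems to fit the wide components in Gaussian mixture distributions very accurately:

I wanted to try this approach on my astronomical data (slightly modified, because the earlier example is deprecated and does not work in the current version).

However, my data is a curve rather than a distribution of points. So I use the numpy random.choice function to generate a distribution of points weighted by the shape of the curve. After that I run the sklearn fit: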

import numpy as np
# GMM was deprecated and later removed from sklearn.mixture; only GaussianMixture is used below
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
from scipy.stats import norm

#Raw data
data = np.array([[6535.62597656, 7.24362260936e-17],
        [6536.45898438, 6.28683338273e-17],
        [6537.29248047, 5.84596729207e-17],
        [6538.12548828, 8.13193914837e-17],
        [6538.95849609, 6.70583742068e-17],
        [6539.79199219, 7.8511483881e-17],
        [6540.625, 9.22121293063e-17],
        [6541.45800781, 7.81353615478e-17],
        [6542.29150391, 8.58095991639e-17],
        [6543.12451172, 9.30569784967e-17],
        [6543.95800781, 9.92541957936e-17],
        [6544.79101562, 1.1682282379e-16],
        [6545.62402344, 1.21238102142e-16],
        [6546.45751953, 1.51062780724e-16],
        [6547.29052734, 1.92193416858e-16],
        [6548.12402344, 2.12669644265e-16],
        [6548.95703125, 1.89356624109e-16],
        [6549.79003906, 1.62571112976e-16],
        [6550.62353516, 1.73262984876e-16],
        [6551.45654297, 1.79300635724e-16],
        [6552.29003906, 1.93990357551e-16],
        [6553.12304688, 2.15530881856e-16],
        [6553.95605469, 2.13273711105e-16],
        [6554.78955078, 3.03175829363e-16],
        [6555.62255859, 3.17610250166e-16],
        [6556.45556641, 3.75917668914e-16],
        [6557.2890625, 4.64631505826e-16],
        [6558.12207031, 6.9828152092e-16],
        [6558.95556641, 1.19680535606e-15],
        [6559.78857422, 2.18677945421e-15],
        [6560.62158203, 4.07692754678e-15],
        [6561.45507812, 5.89089137849e-15],
        [6562.28808594, 7.48005986578e-15],
        [6563.12158203, 7.49293900174e-15],
        [6563.95458984, 4.59418727426e-15],
        [6564.78759766, 2.25848015792e-15],
        [6565.62109375, 1.04438093017e-15],
        [6566.45410156, 6.61019482779e-16],
        [6567.28759766, 4.45881319808e-16],
        [6568.12060547, 4.1486649376e-16],
        [6568.95361328, 3.69435405178e-16],
        [6569.78710938, 2.63747028003e-16],
        [6570.62011719, 2.58619514057e-16],
        [6571.453125, 2.28424298265e-16],
        [6572.28662109, 1.85772271843e-16],
        [6573.11962891, 1.90082094593e-16],
        [6573.953125, 1.80158097764e-16],
        [6574.78613281, 1.61992695352e-16],
        [6575.61914062, 1.44038495311e-16],
        [6576.45263672, 1.6536593789e-16],
        [6577.28564453, 1.48634721076e-16],
        [6578.11914062, 1.28145245545e-16],
        [6578.95214844, 1.30889102898e-16],
        [6579.78515625, 1.42521644591e-16],
        [6580.61865234, 1.6919170778e-16],
        [6581.45166016, 2.35394744146e-16],
        [6582.28515625, 2.75400454352e-16],
        [6583.11816406, 3.42150435774e-16],
        [6583.95117188, 3.06301301529e-16],
        [6584.78466797, 2.01059337187e-16],
        [6585.61767578, 1.36484708427e-16],
        [6586.45068359, 1.26422274651e-16],
        [6587.28417969, 9.79250952203e-17],
        [6588.1171875, 8.77299287344e-17],
        [6588.95068359, 6.6478752208e-17],
        [6589.78369141, 4.95864370066e-17]])


#Get the data
obs_wave, obs_flux = data[:,0], data[:,1]

#Center the x data on zero and normalize the y data so that it sums to 1
n_wave = obs_wave - obs_wave[np.argmax(obs_flux)]
n_flux = obs_flux / np.sum(obs_flux)

#Generate a distribution of points matching the curve
line_distribution   = np.random.choice(a = n_wave, size = 100000, p = n_flux)
number_points       = len(line_distribution)

#Run the fit
gmm = GaussianMixture(n_components = 4)
gmm.fit(np.reshape(line_distribution, (number_points, 1)))
gauss_mixt = np.array([p * norm.pdf(n_wave, mu, sd) for mu, sd, p in zip(gmm.means_.flatten(), np.sqrt(gmm.covariances_.flatten()), gmm.weights_)])
gauss_mixt_t = np.sum(gauss_mixt, axis = 0)  

#Plot the data
fig, axis = plt.subplots(1, 1, figsize=(10, 12))
axis.plot(n_wave, n_flux, label = 'Normalized observed flux')
axis.plot(n_wave, gauss_mixt_t, label = '4 components fit')

for i in range(len(gauss_mixt)):
    axis.plot(n_wave, gauss_mixt[i], label = 'Gaussian '+str(i))

axis.set_xlabel('normalized wavelength')
axis.set_ylabel('normalized flux')
axis.set_title('Sklearn fit GM fit')

axis.legend()
plt.show()

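This gives me:

And zoomed in:

In case anyone has tried to use this library for this purpose, I have two questions:

1) Is there a class in sklearn that can perform this kind of fit without generating a distribution of data points as an intermediate step?

2) How should I improve the fit? Is there a way to constrain the variables, for example by setting all the narrow components to the same standard deviation?

Thanks for any advice.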

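【Comments】:

As far as I know, there is unfortunately no way to avoid this intermediate step. To improve the fit, I would look at the docs for the class and see what you can do there. You can, for example, set the initial weights or lower the convergence threshold.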
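
The suggestion in the comment above can be sketched as follows. This is a minimal illustration, not the asker's setup: the synthetic samples (a strong line near 0 and a weaker one near +20) and the particular initial values are assumptions chosen only to show where `weights_init`, `means_init` and `tol` go.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the sampled line profile (NOT the question's data):
# a strong line near 0 and a weaker line near +20.
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(0.0, 1.5, 6000),
    rng.normal(20.0, 3.0, 4000),
]).reshape(-1, 1)

gmm = GaussianMixture(
    n_components=2,
    weights_init=[0.7, 0.3],        # assumed initial guess for the mixing weights
    means_init=[[0.0], [20.0]],     # assumed initial guess for the component means
    tol=1e-6,                       # tighter convergence threshold than the default 1e-3
    max_iter=1000,                  # allow more EM iterations to reach that threshold
    random_state=0,
)
gmm.fit(samples)

print("converged:", gmm.converged_)
print("weights:", gmm.weights_)
print("means:", gmm.means_.ravel())
print("sigmas:", np.sqrt(gmm.covariances_.ravel()))
```

Because the initial means are supplied, the fitted components keep the same order as the initial guesses, which makes it easier to tell which Gaussian corresponds to which line.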

Tags: python numpy scikit-learn curve-fitting gaussian

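【Solution 1】:

For question 1:

So I use the numpy random.choice function to generate a distribution of points weighted by the shape of the curve. After that I run the sklearn fit:

That sounds right to me. Another possible answer is given in Fast arbitrary distribution random sampling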
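
For reference, the usual way to sample from an arbitrary discrete distribution is inverse-transform sampling, sketched below. The curve here is a hypothetical stand-in for the question's `n_wave` and `n_flux`, and this is only an assumption about what the linked answer describes. It draws from the same distribution as `np.random.choice(..., p=n_flux)`.

```python
import numpy as np

# Hypothetical stand-in for the normalized curve (n_wave, n_flux) from the question.
n_wave = np.linspace(-25.0, 25.0, 61)
n_flux = np.exp(-0.5 * (n_wave / 2.0) ** 2)
n_flux /= n_flux.sum()

rng = np.random.default_rng(0)

# Build the discrete cumulative distribution of the curve.
cdf = np.cumsum(n_flux)
cdf /= cdf[-1]                      # guard against rounding: last value is exactly 1

# Map uniform draws in [0, 1) through the inverse CDF.
u = rng.random(100_000)
idx = np.searchsorted(cdf, u, side="right")
idx = np.minimum(idx, len(n_wave) - 1)   # defensive clip; idx should already be in range

line_distribution = n_wave[idx]
print(line_distribution[:10])
```

The resulting `line_distribution` can be reshaped to `(-1, 1)` and passed to `GaussianMixture.fit` exactly as in the question.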

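For question 2:

When fitting models such as GMMs, there is a technique called "variance flooring" that stops components from becoming very narrow (which can happen when a component (over)fits just a few points). From Schlapbach et al., A Writer Identification System for On-line Whiteboard Data, 2001: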

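[...] Variance flooring is used to avoid overfitting of the variance parameters. The idea of variance flooring is to impose a lower bound on the variance parameters, as a variance estimated from only a few data points can be very small and might not be representative of the underlying distribution of the data. The minimal variance value is defined as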

min_sigma**2 = phi * sigma_global**2

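where phi denotes the variance flooring factor and the global variance sigma_global**2 is calculated on the complete training set. The minimal variance min_sigma**2 is used to initialize the variance parameters of the model. During the EM update step, if a calculated variance parameter falls below min_sigma**2, the variance parameter is set to this value.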
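
To make the quoted procedure concrete, here is a minimal sketch of a 1-D EM loop with a variance floor. It is not sklearn code and not the paper's implementation: the function name `fit_gmm_1d_with_floor`, the defaults and the test data are all illustrative assumptions; only the floor rule `min_sigma**2 = phi * sigma_global**2` comes from the quote.

```python
import numpy as np

def fit_gmm_1d_with_floor(x, n_components, phi=0.01, n_iter=200, seed=0):
    """Minimal 1-D EM for a Gaussian mixture with a variance floor (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    min_var = phi * x.var()                      # min_sigma**2 = phi * sigma_global**2

    weights = np.full(n_components, 1.0 / n_components)
    means = rng.choice(x, size=n_components, replace=False)
    variances = np.full(n_components, min_var)   # initialize at the floor, as in the quote

    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for numerical stability
        log_prob = (-0.5 * (x[:, None] - means) ** 2 / variances
                    - 0.5 * np.log(2.0 * np.pi * variances)
                    + np.log(weights))
        log_prob -= log_prob.max(axis=1, keepdims=True)
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-12            # small constant avoids division by zero
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk

        # Variance floor: any variance below min_var is set to min_var
        variances = np.maximum(variances, min_var)

    return weights, means, variances

# Synthetic data (assumption): a broad population plus three identical points,
# which could drive an unfloored component toward zero variance.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 2.0, 5000), np.full(3, 10.0)])
phi = 0.01
weights, means, variances = fit_gmm_1d_with_floor(x, n_components=2, phi=phi)
print("weights:", weights)
print("means:", means)
print("variances:", variances, "floor:", phi * x.var())
```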

However, this means modifying the code. You may be able to achieve a similar effect by increasing the reg_covar parameter of sklearn.mixture.GaussianMixture.
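
A minimal sketch of the `reg_covar` idea, on synthetic data that is an assumption and not the question's spectrum. Note that `reg_covar` is added to the diagonal of every covariance rather than applied as a hard floor, so it only approximates variance flooring. The sketch also shows `covariance_type="tied"`, which forces all components (not only the narrow ones) to share one covariance and is the closest built-in option to the "same standard deviation" constraint asked about.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): a broad population plus a few nearly identical points.
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(0.0, 3.0, 5000),
    rng.normal(10.0, 1e-3, 20),
]).reshape(-1, 1)

# Default regularization (reg_covar=1e-6).
default_fit = GaussianMixture(n_components=2, random_state=0).fit(samples)

# Larger reg_covar: this value is added to every variance, so none can be smaller.
reg_value = 0.5
regularized = GaussianMixture(n_components=2, reg_covar=reg_value,
                              random_state=0).fit(samples)

# Tied covariance: a single covariance matrix shared by all components.
tied = GaussianMixture(n_components=2, covariance_type="tied",
                       random_state=0).fit(samples)

print("default variances:    ", default_fit.covariances_.ravel())
print("regularized variances:", regularized.covariances_.ravel())
print("tied variance:        ", tied.covariances_.ravel())
```

The value of `reg_covar` is in the squared units of the data, so a sensible choice depends on the wavelength scale of the curve being fitted.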

【Comments】: