【Question Title】: scipy.interpolate.UnivariateSpline not smoothing regardless of parameters
【Posted】: 2012-02-01 22:25:53
【Question Description】:

I can't get scipy.interpolate.UnivariateSpline to apply any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing via the s parameter.

Here is my code:

# Imports
import scipy.interpolate
import pylab

# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")

# Plot estimates using splines with a range of degrees
for k in range(1, 4):
    mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
    xi = range(0, 15100, 20)
    yi = mySpline(xi)
    pylab.plot(xi, yi, label="Predicted k=%d" % k)

# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()

Here is the result:

I have tried a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights set either to the same value (1.0) or randomized. I still can't get any smoothing, and the number of knots is always equal to the number of data points. In particular, I'm looking for outliers like the 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.

Is it because I have so little data? If so, is there a similar spline function or clustering technique I can apply to achieve smoothing with this few data points?

【Question Discussion】:

    Tags: python scipy splines


    【Solution 1】:

    While I'm not aware of any library that will do it for you, I'd try a more DIY approach: I'd start by making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot between the 4th and 5th points should do the trick, since it would remove the huge derivative around x=8000.
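    A minimal sketch of this manual-knot idea, using scipy.interpolate.LSQUnivariateSpline (which takes the interior knots explicitly); the knot location x=8900 is my assumption for "between the 4th and 5th points", not a value from the original post:

```python
# Sketch of the manual-knot suggestion: fit a least-squares spline with a
# single interior knot between the 4th and 5th data points, so the fit
# averages through the outlier near x ~ 7990 instead of interpolating it.
import numpy
import scipy.interpolate

x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
     9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
     6056.3852496014106, 7895.2332350173638, 9154.2956175610598]

# One interior knot (assumed location), placed between points 4 and 5.
knots = [8900.0]
spline = scipy.interpolate.LSQUnivariateSpline(x, y, t=knots, k=3)

xi = numpy.linspace(0, 15100, 200)
yi = spline(xi)
```

    Unlike UnivariateSpline's s parameter, LSQUnivariateSpline fixes the knot set up front, so the number of knots no longer tracks the number of data points.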

    【Discussion】:

    • I can give that a try, but then I don't understand the point of this function having a smoothing parameter if it's to be ignored. Why did they include it at all?
    【Solution 2】:

    @Zhenya's answer of manually setting knots between data points was too rough to give good results on noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs automatic determination of the cluster count, and seems to do a fairly good smoothing job (very smooth, in fact).

    # Imports
    import numpy
    import pylab
    import scipy.interpolate
    import sklearn.cluster
    
    # Set up original data - note that it's monotonically increasing by X value!
    data = {}
    data['original'] = {}
    data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
    data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
    
    # Cluster the data, sort it, and save
    inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
    meanShift = sklearn.cluster.MeanShift()
    meanShift.fit(inputNumpy)
    clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
    clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
    data['clustered'] = {}
    data['clustered']['x'] = [pair[0] for pair in clusteredData]
    data['clustered']['y'] = [pair[1] for pair in clusteredData]
    
    # Build a spline using the clustered data and predict
    mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
    xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
    yi = mySpline(xi)
    
    # Plot the datapoints
    pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
    pylab.plot(xi, yi, label="Predicted (%s)" %  'clustered')
    pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
    
    # Show the plot
    pylab.grid(True)
    pylab.xticks(rotation=45)
    pylab.legend( loc="lower right" )
    pylab.show()
    

    【Discussion】:

    • Would you mind posting your code then? I imagine it would benefit the community, since this topic seems to come up regularly in different contexts.
    • @Zhenya: Sure. I'll need to extract it from a few classes/functions into a self-contained example; I'll post it in my answer within a day or so.
    【Solution 3】:

    Short answer: you need to choose the value of s more carefully.

    From the documentation of UnivariateSpline:

    Positive smoothing factor used to choose the number of knots. Number of
    knots will be increased until the smoothing condition is satisfied:
    sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s
    

    From this one can deduce that a 'reasonable' ballpark value, if you don't pass explicit weights, is s = m * v, where m is the number of data points and v is the variance of the data. In this case, s_good ~ 5e7.

    EDIT: a sensible value of s of course also depends on the noise level in the data. The docs seem to suggest choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2, where std is the standard deviation of the 'noise' you want to smooth out.
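    As a sketch of that back-of-the-envelope calculation (the variable names are mine), picking s on the scale of m * v does produce actual smoothing on the question's data:

```python
# Sketch: choose s on the scale suggested above (s ~ m * variance, with
# the default unit weights) instead of a small constant like s=2.
import numpy
import scipy.interpolate

x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
     9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
     6056.3852496014106, 7895.2332350173638, 9154.2956175610598]

m = len(x)            # number of data points
v = numpy.var(y)      # variance of the data
s_good = m * v        # on the order of 5e7 for this data

spline = scipy.interpolate.UnivariateSpline(x, y, k=3, s=s_good)

# With s this large the fit no longer interpolates every point, so the
# spline ends up with fewer knots than data points.
print(len(spline.get_knots()), len(x))
```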

    【Discussion】:

    • Ah, I hadn't tried anything anywhere near that order of magnitude. Thanks for the clarification.
    【Solution 4】:

    I couldn't get BigChef's answer to run; here is a variation that works on Python 3.6:

    # Imports
    import pylab
    import scipy.interpolate
    import sklearn.cluster
    
    # Set up original data - note that it's monotonically increasing by X value!
    data = {}
    data['original'] = {}
    data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
    data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
    
    # Cluster the data, sort it, and save
    import numpy
    inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
    meanShift = sklearn.cluster.MeanShift()
    meanShift.fit(inputNumpy)
    clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
    
    clusteredData.sort(key=lambda li: li[0])
    data['clustered'] = {}
    data['clustered']['x'] = [pair[0] for pair in clusteredData]
    data['clustered']['y'] = [pair[1] for pair in clusteredData]
    
    # Build a spline using the clustered data and predict
    mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
    xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20)
    yi = mySpline(xi)
    
    # Plot the datapoints
    pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
    pylab.plot(xi, yi, label="Predicted (%s)" %  'clustered')
    pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
    
    # Show the plot
    pylab.grid(True)
    pylab.xticks(rotation=45)
    pylab.show()
    

    【Discussion】:
