Numpy way to sort out a messy array for plotting

【Question title】: Numpy way to sort out a messy array for plotting
【Posted】: 2016-06-07 00:17:19
【Question】:

I have plotting data in two arrays, stored in an unsorted order, so the plot jumps discontinuously from one place to another. I tried one example of finding the closest point in a 2D array:

import numpy as np

def distance(pt_1, pt_2):
    pt_1 = np.array((pt_1[0], pt_1[1]))
    pt_2 = np.array((pt_2[0], pt_2[1]))
    return np.linalg.norm(pt_1-pt_2)

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    dist_2 = np.sum((nodes - node)**2, axis=1)
    return np.argmin(dist_2)

a = []
for x in range(50000):
    a.append((np.random.randint(0,1000),np.random.randint(0,1000)))
some_pt = (1, 2)

closest_node(some_pt, a)

Can I use it to "clean up" my data? (In the code above, a could be my data.)

Sample data from my calculation:

array([[  2.08937872e+001,   1.99020033e+001,   2.28260611e+001,
          6.27711094e+000,   3.30392288e+000,   1.30312878e+001,
          8.80768833e+000,   1.31238275e+001,   1.57400130e+001,
          5.00278061e+000,   1.70752624e+001,   1.79131456e+001,
          1.50746185e+001,   2.50095731e+001,   2.15895974e+001,
          1.23237801e+001,   1.14860312e+001,   1.44268222e+001,
          6.37680265e+000,   7.81485403e+000],
       [ -1.19702178e-001,  -1.14050879e-001,  -1.29711421e-001,
          8.32977493e-001,   7.27437322e-001,   8.94389885e-001,
          8.65931116e-001,  -6.08199292e-002,  -8.51922900e-002,
          1.12333841e-001,  -9.88131292e-324,   4.94065646e-324,
         -9.88131292e-324,   4.94065646e-324,   4.94065646e-324,
          0.00000000e+000,   0.00000000e+000,   0.00000000e+000,
         -4.94065646e-324,   0.00000000e+000]])

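After using radial_sort_line (from Joe Kington), I got the following plot:

【Comments】:

Could you post your data, or explain how to obtain it?
a is not what your plot shows
Also, do you really need a line plot? Can't you just plot the data points (without lines)?
It looks like all you need is to sort the data by its y values, but posting some sample data would help.
@Ohm, I think a nearest-neighbor approach with a suitable distance metric might be the best solution; see the edit.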

Tags: python arrays sorting numpy matplotlib

【Solution 1】:

Sorting the data by angle relative to the center, as in @JoeKington's solution, can cause problems for some parts of the data:

In [1]:

import scipy.spatial as ss
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline
In [2]:

data=np.array([[  2.08937872e+001,   1.99020033e+001,   2.28260611e+001,
                  6.27711094e+000,   3.30392288e+000,   1.30312878e+001,
                  8.80768833e+000,   1.31238275e+001,   1.57400130e+001,
                  5.00278061e+000,   1.70752624e+001,   1.79131456e+001,
                  1.50746185e+001,   2.50095731e+001,   2.15895974e+001,
                  1.23237801e+001,   1.14860312e+001,   1.44268222e+001,
                  6.37680265e+000,   7.81485403e+000],
               [ -1.19702178e-001,  -1.14050879e-001,  -1.29711421e-001,
                  8.32977493e-001,   7.27437322e-001,   8.94389885e-001,
                  8.65931116e-001,  -6.08199292e-002,  -8.51922900e-002,
                  1.12333841e-001,  -9.88131292e-324,   4.94065646e-324,
                 -9.88131292e-324,   4.94065646e-324,   4.94065646e-324,
                  0.00000000e+000,   0.00000000e+000,   0.00000000e+000,
                 -4.94065646e-324,   0.00000000e+000]])
In [3]:

plt.plot(data[0], data[1])
plt.title('Unsorted Data')
Out[3]:
<matplotlib.text.Text at 0x10a5c0550>

Note that the x values between 15 and 20 are not ordered correctly.

In [10]:

#Calculate the angle in degrees in [0, 360); np.angle returns radians unless deg=True
sort_index = np.angle(np.dot((data.T-data.mean(1)), np.array([1.0, 1.0j])), deg=True)
sort_index = np.where(sort_index>=0, sort_index, sort_index+360)

#sort the data by angle and plot it
sort_index = sort_index.argsort()
plt.plot(data[0][sort_index], data[1][sort_index])
plt.title('Data sorted by angle relative to the centroid')

plt.plot(data[0], data[1], 'r+')
Out[10]:
[<matplotlib.lines.Line2D at 0x10b009e10>]
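
The same angle can also be computed without complex numbers. The following is a minimal sketch, assuming a 2 x N array laid out like data above; the helper name angle_about_centroid and the four test points are made up for illustration.

```python
import numpy as np

def angle_about_centroid(data):
    """Angle in degrees, in [0, 360), of each column of a 2 x N array about the centroid."""
    dx = data[0] - data[0].mean()
    dy = data[1] - data[1].mean()
    # arctan2 returns radians in [-pi, pi]; convert and wrap into [0, 360)
    return np.degrees(np.arctan2(dy, dx)) % 360.0

# Hypothetical test points on the unit circle (centroid at the origin)
pts = np.array([[1.0, 0.0, -1.0, 0.0],
                [0.0, 1.0, 0.0, -1.0]])
angles = angle_about_centroid(pts)
order = angles.argsort()
print(angles, order)
```

The modulo keeps the units consistent, which is easy to get wrong when shifting negative angles by hand.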

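We can sort the data with a nearest-neighbor approach, but because x and y are on very different scales, the choice of distance metric becomes an important issue. We will try all the distance metrics available in scipy to get an idea: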

In [7]:

def sort_dots(metrics, ax, start):
    dist_m = ss.distance.squareform(ss.distance.pdist(data.T, metrics))

    total_points = data.shape[1]
    points_index = set(range(total_points))
    sorted_index = []
    target    = start
    ax.plot(data[0, target], data[1, target], 'o', markersize=16)

    points_index.discard(target)
    while len(points_index)>0:
        candidate = list(points_index)
        nneigbour = candidate[dist_m[target, candidate].argmin()]
        points_index.discard(nneigbour)
        points_index.discard(target)
        #print points_index, target, nneigbour
        sorted_index.append(target)
        target    = nneigbour
    sorted_index.append(target)

    ax.plot(data[0][sorted_index], data[1][sorted_index])
    ax.set_title(metrics)
In [6]:

dmetrics = re.findall(r"pdist\(X,\s+'(.*)'", ss.distance.pdist.__doc__)
In [8]:

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True)
axes = axes.ravel()
for metrics, ax in zip(dmetrics, axes):
    try:
        sort_dots(metrics, ax, 5)
    except:
        ax.set_title(metrics + '(unsuitable)')

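It looks like the standardized Euclidean and Mahalanobis metrics give the best results. Note that we chose the 6th data point (index 5) as the starting point; it is the point with the largest y value (you can of course use argmax to get the index).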
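
As a rough sketch of the same idea in reusable form, the function below returns an index array instead of plotting, and the starting point is picked with argmax. The name nearest_neighbour_order and the shuffled half-ellipse test curve are assumptions made for illustration, not part of the answer's code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def nearest_neighbour_order(points, metric="seuclidean", start=0):
    """Greedy nearest-neighbour ordering of an (N, 2) array; returns an index array."""
    dist = squareform(pdist(points, metric))
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    order = [start]
    for _ in range(n - 1):
        # Distances from the current point, with visited points masked out
        d = np.where(visited, np.inf, dist[order[-1]])
        nxt = int(d.argmin())
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

# Hypothetical test curve: a shuffled half ellipse with very different x and y scales
rng = np.random.default_rng(0)
t = np.linspace(-np.pi / 2, np.pi / 2, 30)
perm = rng.permutation(t.size)
t_shuffled = t[perm]
pts = np.column_stack([20 * np.cos(t_shuffled), 0.9 * np.sin(t_shuffled)])

start = int(pts[:, 1].argmax())          # start at the point with the largest y
order = nearest_neighbour_order(pts, "seuclidean", start)
sorted_pts = pts[order]
```

Returning indices rather than sorted coordinates makes it possible to reorder any other per-point arrays with the same permutation.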

In [9]:

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True)
axes = axes.ravel()
for metrics, ax in zip(dmetrics, axes):
    try:
        sort_dots(metrics, ax, 13)
    except:
        ax.set_title(metrics + '(unsuitable)')

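This is what happens if you start from the point with the largest x value (index 13). It looks like the Mahalanobis metric is better than the standardized Euclidean one, since it is not affected by the choice of starting point.

【Comments】:

Nice idea. Could it also be used to separate the data into different curves? The real data should contain a roughly semicircular curve, opening to the right, plus a horizontal straight line..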
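
One likely reason these two metrics cope with the different x and y scales is that both normalize by the spread of the data. The check below is a small sketch on randomly generated points (not the data from the question), with a made-up rescaling factor of 1000 on the x axis; it relies on pdist estimating the variances or covariance from the input when none are passed.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
pts = rng.normal(size=(20, 2))            # hypothetical points, one per row
scaled = pts * np.array([1000.0, 1.0])    # stretch the x axis only

unchanged = {}
for metric in ("euclidean", "seuclidean", "mahalanobis"):
    # Compare all pairwise distances before and after rescaling
    unchanged[metric] = bool(np.allclose(pdist(pts, metric), pdist(scaled, metric)))
    print(metric, "unchanged by rescaling:", unchanged[metric])
```

Plain Euclidean distances change under the rescaling, whereas the standardized Euclidean and Mahalanobis distances should not, so the nearest neighbor of a point does not depend on the units of each axis.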

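【Solution 2】:

This is actually a harder problem in general than you might think.

In your exact case, you might be able to get away with sorting by the y values. It's hard to tell for sure from the plot.

Therefore, a better approach for roughly circular shapes like this is to do a radial sort.

For example, let's generate some data somewhat similar to yours: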

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(.2, 1.6 * np.pi)
x, y = np.cos(t), np.sin(t)

# Shuffle the points...
i = np.arange(t.size)
np.random.shuffle(i)
x, y = x[i], y[i]

fig, ax = plt.subplots()
ax.plot(x, y, color='lightblue')
ax.margins(0.05)
plt.show()

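Okay, now let's try to undo that shuffle with a radial sort. We'll use the centroid of the points as the center, compute the angle to each point, and then sort by that angle: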

x0, y0 = x.mean(), y.mean()
angle = np.arctan2(y - y0, x - x0)

idx = angle.argsort()
x, y = x[idx], y[idx]

fig, ax = plt.subplots()
ax.plot(x, y, color='lightblue')
ax.margins(0.05)
plt.show()

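Okay, pretty close! If we were working with a closed polygon, we'd be done.

However, we have one problem: this closes the wrong gap. We'd rather have the angle start at the position of the largest gap in the line.

Therefore, we need to compute the gap between each pair of adjacent points on the new line and reorder the points based on a new starting angle: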

# Append the first point so the gap from the last point back to the first is included
dx = np.diff(np.append(x, x[0]))
dy = np.diff(np.append(y, y[0]))
max_gap = np.abs(np.hypot(dx, dy)).argmax() + 1

x = np.append(x[max_gap:], x[:max_gap])
y = np.append(y[max_gap:], y[:max_gap])

Result:

As a complete, standalone example:

import numpy as np
import matplotlib.pyplot as plt

def main():
    x, y = generate_data()
    plot(x, y).set(title='Original data')

    x, y = radial_sort_line(x, y)
    plot(x, y).set(title='Sorted data')

    plt.show()

def generate_data(num=50):
    t = np.linspace(.2, 1.6 * np.pi, num)
    x, y = np.cos(t), np.sin(t)

    # Shuffle the points...
    i = np.arange(t.size)
    np.random.shuffle(i)
    x, y = x[i], y[i]

    return x, y

def radial_sort_line(x, y):
    """Sort unordered verts of an unclosed line by angle from their center."""
    # Radial sort
    x0, y0 = x.mean(), y.mean()
    angle = np.arctan2(y - y0, x - x0)

    idx = angle.argsort()
    x, y = x[idx], y[idx]

    # Split at opening in line
    # Append the first point so the gap from the last point back to the first is included
    dx = np.diff(np.append(x, x[0]))
    dy = np.diff(np.append(y, y[0]))
    max_gap = np.abs(np.hypot(dx, dy)).argmax() + 1

    x = np.append(x[max_gap:], x[:max_gap])
    y = np.append(y[max_gap:], y[:max_gap])
    return x, y

def plot(x, y):
    fig, ax = plt.subplots()
    ax.plot(x, y, color='lightblue')
    ax.margins(0.05)
    return ax

main()
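
If other per-point arrays need to be reordered along with x and y, an index-returning variant may be handy. This is only a sketch under the same assumptions as above (an open, roughly circular line whose centroid lies inside the curve); the name radial_sort_order is hypothetical, and the gap computation wraps around from the last point to the first.

```python
import numpy as np

def radial_sort_order(x, y):
    """Return indices that order the points by angle, starting after the largest gap."""
    angle = np.arctan2(y - y.mean(), x - x.mean())
    idx = angle.argsort()
    xs, ys = x[idx], y[idx]
    # Gap from each point to the next one, wrapping from the last point to the first
    gaps = np.hypot(np.roll(xs, -1) - xs, np.roll(ys, -1) - ys)
    start = (int(gaps.argmax()) + 1) % len(idx)
    # Rotate so the point just after the largest gap comes first
    return np.roll(idx, -start)

# Same kind of shuffled arc as in generate_data, keeping t to check the result
rng = np.random.default_rng(0)
t = np.linspace(0.2, 1.6 * np.pi, 50)
perm = rng.permutation(t.size)
x, y, t_shuffled = np.cos(t)[perm], np.sin(t)[perm], t[perm]

order = radial_sort_order(x, y)
t_sorted = t_shuffled[order]
print(np.allclose(t_sorted, t))
```

Keeping the parameter t alongside the shuffled points gives a simple way to confirm that the original order was recovered, which is harder to judge from a plot alone.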

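【Comments】:

This works well, but in some cases, when I plot my data, it fills in regions where there is more than one y value in a given direction..

【Solution 3】:

If we assume the data is two-dimensional and the x axis should be increasing, then you can:

1. Sort the x-axis data, e.g. x_old, and store the result in a different variable, e.g. x_new
2. For each element in x_new, find its index in the x_old array
3. Rearrange the elements of the y-axis array according to the indices obtained in the previous step

Since the list.index method is easier to work with than numpy.where, I would use Python lists rather than numpy arrays for this.

For example (assuming x_old and y_old are the numpy variables holding your previous x and y axes, respectively):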

import numpy as np

x_new_tmp = x_old.tolist()
y_new_tmp = y_old.tolist()

x_new = sorted(x_new_tmp)

y_new = [y_new_tmp[x_new_tmp.index(i)] for i in x_new]

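Then you can plot x_new and y_new.

【Comments】:

Sorting by x value won't work here for the image shown in the question. It might work for the y values.
As Tom said, this won't work for x in this case, but might work for y. In any case, if you have numpy arrays, don't use lists for this. Instead do x, y = x[y.argsort()], y[y.argsort()].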
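
The three steps above can also be sketched with a single np.argsort call, which avoids the repeated list.index lookups and keeps points with equal x values distinct (list.index always returns the first match). The sample values below are made up for illustration.

```python
import numpy as np

# Hypothetical data with a repeated x value
x_old = np.array([3.0, 1.0, 2.0, 1.0])
y_old = np.array([30.0, 10.0, 20.0, 11.0])

# Indices that sort x; a stable sort keeps equal x values in their original order
order = np.argsort(x_old, kind="stable")
x_new = x_old[order]
y_new = y_old[order]
print(x_new, y_new)
```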
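
For data stored as a single 2 x N array, as in the question, the same idea reorders whole columns at once. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical 2 x N array: row 0 holds x, row 1 holds y
data = np.array([[5.0, 1.0, 3.0],
                 [0.2, 0.9, -0.1]])

order = data[1].argsort()        # indices that sort the y row
data_sorted = data[:, order]     # apply the same permutation to both rows
x, y = data_sorted
print(x, y)
```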