Binning/grouping a dataset of geographic coordinates: answers

【Question title】: Binning/grouping a dataset of geographic coordinates
【Posted】: 2014-04-03 13:00:19
【Question】:

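I have a large dataset with two columns: timestamp and lat/lon. I'd like to group the coordinates in some way to determine how many distinct places were recorded, treating everything within a certain distance of each other as one location. In essence, I want to figure out how many different "places" are in this dataset. A good visual example is this, which is where I want to end up, but I don't know where the clusters in my dataset are.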

【Comments】:

You need a clustering algorithm; see here, for example.

Tags: python pandas geometry leaflet geo

【Solution 1】:

Elaborating on behzad.nouri's reference:

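from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# X = your geo array, shape (n_samples, 2)

# Standardize features by removing the mean and scaling to unit variance
X = StandardScaler().fit_transform(X)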

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=3).fit(X)

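# The two parameters to tune:
# eps -- The maximum distance between two samples
#  for them to be considered as in the same neighborhood.
#  After standardization this is in scaled units, not kilometres.
# min_samples -- The number of samples in a neighborhood for a point
#  to be considered as a core point (the point itself is included).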

core_samples = db.core_sample_indices_
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
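
Because the question asks for "everything within a certain distance" to count as one place, it may be more convenient to give DBSCAN a distance threshold in real units rather than in standardized units. The following is a minimal sketch of that variation, not the answer's original method: it assumes scikit-learn's `metric="haversine"` with `algorithm="ball_tree"`, which expects `[lat, lon]` pairs in radians. The function name `count_places`, the `max_km` parameter, and the Earth radius constant are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to turn km into radians


def count_places(latlon_deg, max_km=0.5, min_samples=3):
    """Cluster (lat, lon) pairs given in degrees.

    Returns (number of clusters, label per point); noise points get label -1.
    """
    # The haversine metric expects [lat, lon] in radians
    coords = np.radians(np.asarray(latlon_deg, dtype=float))
    db = DBSCAN(
        eps=max_km / EARTH_RADIUS_KM,  # distance threshold expressed as an angle
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    ).fit(coords)
    labels = db.labels_
    # Number of clusters, ignoring noise if present
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, labels
```

With a pandas DataFrame, something like `count_places(df[["lat", "lon"]].to_numpy())` would apply, assuming the coordinates are already split into two numeric columns with those (hypothetical) names.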

【Comments】:

Thanks for the extra detail. It was enough to take me from "oh.. that's a lot of math" to "okay, I can do this".

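【Solution 2】:

This pseudocode demonstrates how to reduce a set of points to a single point per grid partition while tallying the number of points in each partition. This is useful when you have a set of points that is sparse in some areas and dense in others, but you want the displayed points to be evenly distributed (for example, on a map).

To use the function, pass in the set of points and the number of partitions along one axis (e.g., X). The other axis (e.g., Y) uses the same number of partitions, so specifying 3 creates 9 (3*3) equally sized partitions. The function first walks the point set to find the outermost X and Y coordinates (minimum and maximum) that bound the whole set. The extent along each axis (maximum minus minimum) is then divided by the number of partitions to determine the grid cell size.

The function then steps through each grid partition and checks, for every point in the set, whether it lies inside that partition. If it does, the function checks whether this is the first point encountered in the partition. If so, a flag is set to indicate that the first point has been found. Otherwise the point is not the first one in the partition, and it is removed from the point set.

For every point found in a partition, the function increments a tally. Finally, once the reduction and tallying are complete for a grid partition, the tallied point can be visualized (e.g., a marker on the map at the single remaining point, with the tally shown alongside):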

function TallyPoints( array points, int npartitions )
{
    // one entry per grid partition: npartitions * npartitions in total
    array partition = new Array();

    // start from the extremes so that negative coordinates are handled too
    float max_x = -MAX_FLOAT, max_y = -MAX_FLOAT;
    float min_x =  MAX_FLOAT, min_y =  MAX_FLOAT;

    // Find the bounding box of the points
    foreach point in points
    {
        if ( point.X > max_x )
            max_x = point.X;
        if ( point.X < min_x )
            min_x = point.X;
        if ( point.Y > max_y )
            max_y = point.Y;
        if ( point.Y < min_y )
            min_y = point.Y;
    }

    // Get the X and Y axis lengths of the partitions
    float partition_length_x = ( max_x - min_x ) / npartitions;
    float partition_length_y = ( max_y - min_y ) / npartitions;

    // Reduce the points to one point in each grid partition
    for ( int i = 0; i < npartitions; i++ )     // columns along X
    for ( int j = 0; j < npartitions; j++ )     // rows along Y
    {
        int n = ( i * npartitions ) + j;        // index of this grid partition

        // Get the boundary of this grid partition
        float cell_min_x = min_x + ( i * partition_length_x );
        float cell_min_y = min_y + ( j * partition_length_y );
        float cell_max_x = min_x + ( ( i + 1 ) * partition_length_x );
        float cell_max_y = min_y + ( ( j + 1 ) * partition_length_y );

        // the last column/row also includes its upper edge, so that points
        // lying exactly on max_x or max_y are not lost
        boolean last_x = ( i == npartitions - 1 );
        boolean last_y = ( j == npartitions - 1 );

        // reduce and tally points
        int     tally  = 0;
        boolean reduce = false; // set to true after finding the first point in the partition
        foreach point in copy of points     // iterate over a copy so that removal is safe
        {
            // the point is in the grid partition
            if ( point.X >= cell_min_x && ( point.X < cell_max_x || last_x ) &&
                 point.Y >= cell_min_y && ( point.Y < cell_max_y || last_y ) )
            {
                // first point found
                if ( false == reduce )
                {
                    reduce = true;
                    partition[ n ].point = point;   // keep this as the single point for the grid
                }
                else
                    points.Remove( point ); // remove the point from the list

                // increment the tally count
                tally++;
            }
        }

        // store the tally for the grid
        partition[ n ].tally = tally;

        // visualize the tallied point here (e.g., marker on Google Map)
    }
}
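
Since the question is tagged python, here is a minimal Python sketch of the same grid reduction. It is a variation rather than a line-by-line port: instead of scanning all points once per grid partition, it computes each point's cell index directly in a single pass, and it returns a new dictionary rather than removing points from the input. The names `tally_points` and `Cell` are illustrative assumptions, and points are assumed to be `(x, y)` tuples (for example `(lon, lat)`).

```python
from collections import namedtuple

# One kept point per grid cell, plus how many points fell into that cell
Cell = namedtuple("Cell", ["point", "tally"])


def tally_points(points, npartitions):
    """Reduce (x, y) points to one representative per grid cell, with a count.

    Returns a dict mapping (column, row) -> Cell(point, tally).
    Empty cells are simply absent from the dict.
    """
    if not points:
        return {}

    # Find the bounding box of the points
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Cell size along each axis; fall back to 1.0 if all points share a coordinate
    len_x = (max_x - min_x) / npartitions or 1.0
    len_y = (max_y - min_y) / npartitions or 1.0

    cells = {}
    for p in points:
        # Clamp so that points on the upper edge land in the last cell
        col = min(int((p[0] - min_x) / len_x), npartitions - 1)
        row = min(int((p[1] - min_y) / len_y), npartitions - 1)
        # Keep the first point seen in the cell and increment its tally
        first, count = cells.get((col, row), (p, 0))
        cells[(col, row)] = Cell(first, count + 1)
    return cells
```

Each value in the result can then be drawn as one marker at `cell.point`, labelled with `cell.tally`.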

【Comments】: