Find the highest concentration areas

【Question title】: Find the highest concentration areas
【Posted】: 2016-10-26 17:28:19
【Question description】:

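I have a large dataset (2 million rows) in which each row represents a point, with its spatial coordinates in metres (x and y) and its score. It looks like this: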

my_points <- data.frame(ID = 1:2e6, 
    x = sample(x = 1:1e6, size = 2e6, replace = TRUE), 
    y = sample(x = 1:1e6, size = 2e6, replace = TRUE), 
    Score = sample(x = 1:1e3, size = 2e6, replace = TRUE))

head(my_points)
# ID      x      y Score
#  1  21984 628151    54
#  2 675714  27715   431
#  3 273248 127287    47
#  4 659750 795394   921
#  5 478142 417083   416
#  6 783249 440782   253

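All points lie within one large area (1,000 x 1,000 km).

I am trying to find the group of points, all within a radius of 100 m, that has the highest total score.

So far I have tried two solutions, but neither can handle this much data (even with parallel computing or the data.table package):

First solution:

I built a spatial grid covering the whole area. I chose a small step (10 m) for the grid so that no possible solution is missed. For each grid node, I sum the scores of the points less than 100 m away. This takes far too long (probably weeks or months on my computer)...

Second solution:

I built a function that, for a pair (x, y), returns the total score contained in the circle with centre (x, y) and radius 100 m. I tried to find the maximum of this function, but I could not find a suitable method for this kind of non-continuous function...

Any ideas for a faster solution (less than a day)?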
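A rough count shows why the grid search is so slow. The sketch below assumes the grid spans the full 1,000 km square at a 10 m step; the variable names are only illustrative.

```r
# Back-of-the-envelope size of the search grid.
# Assumption: the grid covers the whole 1,000 km x 1,000 km area.
side_m <- 1e6                      # side of the area in metres
step_m <- 10                       # grid step in metres

nodes_per_side <- side_m / step_m  # grid nodes along one side
n_nodes <- nodes_per_side^2        # total candidate circle centres
n_nodes
```

That is on the order of 10^10 candidate centres, each of which still has to be compared against the nearby points.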
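The question does not show this function, so the following is only a minimal sketch of what it might look like. The name `score_in_circle` and the toy data are hypothetical; the column names follow `my_points`.

```r
# Hypothetical helper: total Score of all points within `radius` metres
# of the centre (cx, cy). `pts` needs the columns x, y and Score.
score_in_circle <- function(cx, cy, pts, radius = 100) {
  # compare squared distances to avoid taking a square root per point
  inside <- (pts$x - cx)^2 + (pts$y - cy)^2 <= radius^2
  sum(pts$Score[inside])
}

# Toy data (not from the question)
toy <- data.frame(x = c(0, 50, 100, 500),
                  y = c(0, 0, 0, 500),
                  Score = c(10, 20, 30, 40))

score_in_circle(0, 0, toy)      # counts the three points on the x axis
score_in_circle(500, 500, toy)  # counts only the isolated point
```

A function like this is piecewise constant: its value changes only when a point crosses the circle boundary, which is one reason general-purpose optimisers struggle with it.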

【Question discussion】:

Tags: r

【Solution 1】:

OK, I think my solution works, but it is slow.

library(Rcpp)

sourceCpp(code = '
  #include <Rcpp.h>

  using namespace Rcpp;

  // determine, if a point is in a polygon
  bool pnp(NumericVector vertx, NumericVector verty, double testx, double testy) {

    int nvert = vertx.size();
    bool c = false;
    int i, j = 0;

    for (i = 0, j = nvert-1; i < nvert; j = i++) {
      if ( ((verty[i]>testy) != (verty[j]>testy)) &&
           (testx < (vertx[j]-vertx[i]) * (testy-verty[i]) / (verty[j]-verty[i]) + vertx[i]) )
        c = !c;
    }

    return c;
  }

  // create a circle polygon (36 corners) around a point with a certain radius
  NumericMatrix circle(double centerx, double centery, double radius){

    int pnum = 36;
    double rotation = 2 * M_PI / pnum;
    NumericMatrix res(pnum, 2);

    for (int p1 = 0; p1 < pnum; ++p1) {
        double rot = p1 * rotation;
        res(p1, 0) = centerx + cos(rot) * radius;
        res(p1, 1) = centery + sin(rot) * radius;
    }

    return res;
  }

  // create a vector with the circle score sum of each point 
  // [[Rcpp::export]]
  NumericVector searchmaxclust(DataFrame points) {

    Function asMatrix("as.matrix");

    SEXP points2m = points;
    NumericMatrix pm = asMatrix(points2m);

    NumericVector co(pm.nrow());

    for (int p1 = 0; p1 < pm.nrow(); p1++) {
      NumericVector curp = pm(p1,_);
      NumericMatrix circ = circle(curp(1), curp(2), 100.0);

      for (int p2 = 0; p2 < pm.nrow(); p2++) {
        NumericVector curp2 = pm(p2,_);
        bool isin = pnp(circ(_,0), circ(_,1), curp2(1), curp2(2));

        if (isin) {
          co(p1) = co(p1) + curp2(3);
        }

      }

    }

    return co;
  }
')

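I used Rcpp to speed things up. The algorithm is very simple:

1. Create a circular polygon around each point.
2. Check whether each of the other points lies inside that polygon, and sum the scores of all points that do.

1,000 points take about 0.6 s. Because every point is tested against every other point, the run time grows with the square of the number of points, so I guess your 2,000,000 points would need about a month. Hmm. I decided to post this anyway. Maybe it helps someone.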
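The one-month estimate can be reproduced with a short calculation. It assumes the run time scales purely quadratically from the 0.6 s measured for 1,000 points, which ignores memory and cache effects.

```r
# Extrapolate the measured timing, assuming O(n^2) scaling.
t_small <- 0.6   # seconds measured for 1,000 points (from the answer)
n_small <- 1e3
n_full  <- 2e6   # number of points in the question

# every point is tested against every other point, so work grows with n^2
t_full_s    <- t_small * (n_full / n_small)^2
t_full_days <- t_full_s / 86400   # 86,400 seconds per day
t_full_days
```

This comes out at a little under 28 days, in line with the "about a month" figure.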

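【Discussion】:

Thanks @nevrome! However, your solution only tests the initial points as circle centres. There is no reason for the centre of the highest-scoring area to be one of the points in my initial data...
Oh, OK. Another idea: maybe image analysis with the raster package is the way to go. It implements very fast algorithms for raster data.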
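The comment gives no code, so here is a rough illustration of the raster idea in base R rather than with the raster package itself. Scores are binned into square cells, then each cell receives the sum of all cells whose offset lies within the radius. The function `focal_circle_sum` and the toy points are hypothetical, and the result is only accurate to the cell size.

```r
# Sketch only: moving-window (focal) sum with a circular window, base R.
# Assumes coordinates lie in [0, extent] and that the window is smaller
# than the grid.
focal_circle_sum <- function(x, y, score, extent, cell, radius) {
  n  <- ceiling(extent / cell)            # cells per side
  ix <- pmin(floor(x / cell) + 1, n)      # column index of each point
  iy <- pmin(floor(y / cell) + 1, n)      # row index of each point

  # sum the scores per cell (column-major linear index into the matrix)
  grid <- matrix(0, n, n)
  s <- tapply(score, (ix - 1) * n + iy, sum)
  grid[as.integer(names(s))] <- as.vector(s)

  k <- ceiling(radius / cell)             # window half-width in cells
  stopifnot(k < n)
  out <- matrix(0, n, n)
  for (dx in -k:k) for (dy in -k:k) {
    # skip offsets outside the circular window
    if ((dx * cell)^2 + (dy * cell)^2 > radius^2) next
    # add the grid shifted by (dy, dx); cells shifted off the edge are dropped
    rows <- max(1, 1 + dy):min(n, n + dy)
    cols <- max(1, 1 + dx):min(n, n + dx)
    out[rows, cols] <- out[rows, cols] + grid[rows - dy, cols - dx]
  }
  out
}

# Toy data (not from the question): a small cluster and one distant point
px <- c(500, 520, 490, 100)
py <- c(500, 510, 530, 100)
ps <- c(5, 6, 7, 10)

out  <- focal_circle_sum(px, py, ps, extent = 1000, cell = 10, radius = 100)
best <- which(out == max(out), arr.ind = TRUE)  # cells with the highest sum
head(best)
```

The candidate centres here are cell centres rather than data points. At a 10 m cell size the full 1,000 km area would need about 10^10 cells, far too many for one dense matrix, so a real run would have to work tile by tile or start from coarser cells.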