【Question Title】: How to get distances in pyspark?
【Posted】: 2020-03-28 12:45:39
【Question Description】:

I have a table like the following:

+--------------------+--------------------+-------------------+
|                  ID|               point|          timestamp|
+--------------------+--------------------+-------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|
+--------------------+--------------------+-------------------+

I want to calculate the distance between all the points, but I haven't been able to.

However, I can calculate the distance from each point in the point column to one specific fixed point, like this:

# Distance from every point in the table to a fixed reference point
distances = spark.sql(
    """
        SELECT ID, timestamp, point,
        ST_Distance(point, ST_PointFromText('-74.00672149658203, 40.73177719116211', ',')) AS distance
        FROM myTable
    """)
distances.show(5)



+--------------------+-------------------+--------------------+------------------+
|                  ID|          timestamp|               point|          distance|
+--------------------+-------------------+--------------------+------------------+
|679ac975acc4bdec9...|2020-01-01 17:10:49|POINT (-73.267631...|0.7485722629444987|
|679ac975acc4bdec9...|2020-01-01 02:12:31|POINT (-73.271446...|0.7452303978930688|
|679ac975acc4bdec9...|2020-01-01 17:10:40|POINT (-73.265991...|0.7503403834426271|
|679ac975acc4bdec9...|2020-01-01 02:54:15|POINT (-73.271446...|0.7452310193408604|
|679ac975acc4bdec9...|2020-01-01 17:10:24|POINT (-73.265609...|0.7511492495935203|
+--------------------+-------------------+--------------------+------------------+

How can I calculate the distance from the point in one row to the point in the next row?

【Question Comments】:

    Tags: python sql pyspark geospatial azure-databricks


    【Solution 1】:

    If I understand the question correctly, you want the distance between adjacent rows in the point column. You can achieve this with the lag function (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lag) and a Window (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window):

    from pyspark.sql.functions import lag, col, expr
    from pyspark.sql.window import Window

    # Order each ID's rows by timestamp so lag() yields the chronologically previous point
    window = Window.partitionBy("ID").orderBy("timestamp")

    # Geometry points cannot be subtracted with '-'; carry the previous point
    # forward with lag() and measure it with ST_Distance
    df = df.withColumn("prev_point", lag(col("point")).over(window))
    df = df.withColumn("distance", expr("ST_Distance(point, prev_point)"))
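
    Equivalently, since the question already runs its queries through spark.sql with ST_Distance registered, the same adjacent-row distance can be expressed with the LAG window function directly in SQL (a minimal sketch, assuming myTable is registered as a temporary view as in the question):

    distances = spark.sql(
        """
            SELECT ID, timestamp, point,
            ST_Distance(point, LAG(point) OVER (PARTITION BY ID ORDER BY timestamp)) AS distance
            FROM myTable
        """)
    distances.show(5)

    The first row of each ID has no previous point, so its distance comes back NULL.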
    

    【Discussion】:
