【Question Title】: What is the difference between zipWithUniqueId and zipWithIndex in pyspark?
【Posted】: 2022-02-13 14:33:55
【Question】:

The documentation says one of them triggers a Spark job while the other does not. I'm not sure I understand what that means. Can you help me understand the difference between the two?

【Comments】:

标签: apache-spark pyspark apache-spark-sql


【Solution 1】:

The source of truth is the latest code:

  /**
   * Zips this RDD with its element indices. The ordering is first based on the partition index
   * and then the ordering of items within each partition. So the first item in the first
   * partition gets index 0, and the last item in the last partition receives the largest index.
   *
   * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
   * This method needs to trigger a spark job when this RDD contains more than one partitions.
   *
   * @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The index assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithIndex(): RDD[(T, Long)] = withScope {
    new ZippedWithIndexRDD(this)
  }

  /**
   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
   *
   * @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithUniqueId(): RDD[(T, Long)]

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1396
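In short: `zipWithIndex` assigns contiguous indices 0..N-1, which requires knowing the size of every earlier partition, so Spark must first run a job to count them (when there is more than one partition). `zipWithUniqueId` gives item i in partition k the id `i*n + k` (n = number of partitions), which each partition can compute locally without any job, at the cost of possible gaps. A minimal pure-Python sketch of the two assignment schemes (the helper names and the list-of-lists partition model are illustrative, not pyspark API):

```python
# Model an RDD's partitions as a list of lists.

def zip_with_index(partitions):
    # Needs the size of every earlier partition to compute each
    # element's global offset -- this is why Spark must run an
    # extra job first when there is more than one partition.
    sizes = [len(p) for p in partitions]
    offsets = [sum(sizes[:k]) for k in range(len(partitions))]
    return [
        (item, offsets[k] + i)
        for k, part in enumerate(partitions)
        for i, item in enumerate(part)
    ]

def zip_with_unique_id(partitions):
    # Item i in partition k gets id i*n + k, where n is the number
    # of partitions -- computable locally, so no job is triggered,
    # but the ids may have gaps.
    n = len(partitions)
    return [
        (item, i * n + k)
        for k, part in enumerate(partitions)
        for i, item in enumerate(part)
    ]

parts = [["a", "b", "c"], ["d"]]  # 2 partitions of uneven size
print(zip_with_index(parts))      # contiguous: a->0, b->1, c->2, d->3
print(zip_with_unique_id(parts))  # gapped:     a->0, b->2, c->4, d->1
```

With uneven partitions the difference is visible: `zip_with_index` produces the dense range 0..3, while `zip_with_unique_id` yields {0, 2, 4, 1}, skipping 3 entirely, exactly the gaps the scaladoc warns about.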

【Discussion】:
