Parent-child Spark Scala recursive UDF with Hive: answers

【Question title】: Parent-Child spark scala recursive udf with hive
【Posted】: 2020-09-01 09:01:10
【Question description】:

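I have a problem: I want to select a parent organization together with all of its child and sub-child organizations. For example, the parent organization ID is 63261.

I have the following organization table (org_id, parent_id):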

   **org_id ||  parent_id**
    63549   ||  63261

If I check the children of 63549:

   **org_id ||  parent_id**
    58765   ||  63549
    58766   ||  63549
    58803   ||  63549

If I check the children of 58765, 58766, and 58803, they have none. So I want to retrieve the following IDs:

63261, 63549, 58765, 58766, 58803. All of them.

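I tried a recursive query, but Hive does not support recursive queries, so I am thinking of developing a Spark Scala UDF that takes a parent ID and returns all of its children and sub-children, down to the last child that is not itself a parent.

Any ideas? Thanks.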

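【Comments】:

Could you add more sample input data and the expected output? With only one example you may end up with wrong results.
I think the example is clear: I need to start from the parent organization ID and find its children, then for each child check whether it has children, and so on.
Any ideas?
I don't understand your input and output. If possible, please include a few more parent and child samples and add the expected output.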

Tags: scala apache-spark recursion hive user-defined-functions

【Solution 1】:

I developed these specific functions using Scala and Spark:

package com.specific

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.Map

/**
 * Utilities class for generic and specific functions
 */
object Utils {


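  /**
   * Gets the direct (first-level) children of a given organization.
   * @param data   map from organization id to its parent id
   * @param parent id of the parent organization
   * @return list of child organization ids
   */
  def getFirstLevelChildren(data: Map[Int, Int], parent: Int): List[Int] = {
    // Keep every organization whose parent id matches the requested parent
    data.collect { case (orgId, parentId) if parentId == parent => orgId }.toList
  }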

  /**
   * Recursively collects all children and sub-children of a parent organization.
   * Assumes the hierarchy contains no cycles; a cycle would cause infinite recursion.
   * @param data   map from organization id to its parent id
   * @param parent id of the organization to start from
   * @return ids of all descendants, direct children first
   */
  def getChildren(data: Map[Int, Int], parent: Int): List[Int] = {
    // Scan the map once per node for its direct children, then recurse into each child
    val directChildren = getFirstLevelChildren(data, parent)
    directChildren ++ directChildren.flatMap(child => getChildren(data, child))
  }



  def main(args: Array[String]): Unit = {
    val map2: Map[Int, Int] = Map(63549 -> 63261,
                                  58765 -> 63549,
                                  58766 -> 63549,
                                  58803 -> 63549,
                                  10243 -> 6011,
                                  10257 -> 5996,
                                  10596 -> 6071,
                                  10652 -> 6076,
                                  10782 -> 6154,
                                  10873 -> 6134,
                                  1125  -> 197,
                                  11430 -> 6244,
                                  11692 -> 58803,
                                  1174  -> 204,
                                  11951 -> 6324,
                                  12369 -> 6367,
                                  12544 -> 6407,
                                  12759 -> 5477,
                                  1280  -> 11692,
                                  1300  -> 11692,
                                  13183 -> 6950)
    val id = 63261
    // Print the parent itself followed by all of its descendants
    (id :: getChildren(map2, id)).foreach(println)
  }
}

With this, I get the desired output: 63261 63549 58803 58766 58765 11692 1300 1280
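The recursive lookup scans the whole map once for every organization it visits, and it would never terminate if the data contained a cycle. As a possible variation (a minimal sketch, not part of the original answer), the child-to-parent map can be inverted once into a parent-to-children index and walked breadth-first with a visited set. The names `OrgTree`, `childrenIndex`, and `withDescendants` are hypothetical, and children are sorted only to make the output order deterministic.

```scala
import scala.collection.mutable

object OrgTree {
  /** Inverts a child -> parent map into a parent -> children index (built once). */
  def childrenIndex(parentOf: Map[Int, Int]): Map[Int, List[Int]] =
    parentOf.toList
      .groupBy { case (_, parent) => parent }
      .map { case (parent, pairs) => parent -> pairs.map(_._1).sorted }

  /** Breadth-first walk returning the root followed by all of its descendants. */
  def withDescendants(parentOf: Map[Int, Int], root: Int): List[Int] = {
    val index   = childrenIndex(parentOf)
    // LinkedHashSet keeps discovery order and doubles as a guard against cycles
    val visited = mutable.LinkedHashSet(root)
    val queue   = mutable.Queue(root)
    while (queue.nonEmpty) {
      val current = queue.dequeue()
      // add() returns false for ids already seen, so each id is enqueued at most once
      for (child <- index.getOrElse(current, Nil) if visited.add(child)) {
        queue.enqueue(child)
      }
    }
    visited.toList
  }
}
```

Called with the relevant rows from the sample data, `OrgTree.withDescendants(Map(63549 -> 63261, 58765 -> 63549, 58766 -> 63549, 58803 -> 63549, 11692 -> 58803, 1280 -> 11692, 1300 -> 11692), 63261)` should return the same eight ids, level by level. The index costs one pass over the data, after which each lookup is a single map access.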

Does anyone have a better solution? I would like to use this approach as a Hive UDF.
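One direction that avoids loading the whole table into a driver-side map is to let Spark expand the hierarchy level by level with repeated joins. This is a different technique from a UDF, and the following is only a sketch under assumptions: the organization data is available as a DataFrame with integer columns `org_id` and `parent_id`, Spark 2.4 or later is used (for `Dataset.isEmpty`), and the names `OrgDescendants` and `withDescendants` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object OrgDescendants {
  /**
   * Level-by-level expansion: start from the root id and repeatedly join the
   * current frontier with the organization table until no new ids appear.
   * Assumes `orgs` has integer columns `org_id` and `parent_id`.
   */
  def withDescendants(spark: SparkSession, orgs: DataFrame, rootId: Int): DataFrame = {
    import spark.implicits._

    val edges = orgs.select(col("org_id"), col("parent_id")).cache()
    var frontier: DataFrame = Seq(rootId).toDF("org_id")
    var result: DataFrame = frontier

    while (!frontier.isEmpty) {
      val next = edges
        .join(frontier.withColumnRenamed("org_id", "frontier_id"),
              col("parent_id") === col("frontier_id"))
        .select(col("org_id"))
        .except(result) // drop ids already collected, so a cycle cannot loop forever
        .cache()
      result = result.union(next)
      frontier = next
    }
    result
  }
}
```

With a Hive-enabled session this could be called as `OrgDescendants.withDescendants(spark, spark.table("org"), 63261)`, where the table name `org` is an assumption. Each loop iteration triggers a Spark job and lengthens the query plan, so for very deep hierarchies it may be worth checkpointing intermediate results.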

【Comments】: