【Posted】: 2022-01-09 10:17:33
【Problem Description】:
I have a CSV file, dbname1.table1.csv:
target             | source        | source_table                               | relation_type
-------------------|---------------|--------------------------------------------|--------------
avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | direct
avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust | direct
avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | indirect
The same table in raw CSV format:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a DataFrame by reading it:
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
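Since header is set and inferSchema is not, Spark reads all four columns as strings; a quick check confirms the shape:

dfDL.printSchema()
// root
//  |-- target: string (nullable = true)
//  |-- source: string (nullable = true)
//  |-- source_table: string (nullable = true)
//  |-- relation_type: string (nullable = true)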
Now I need to create a new DataFrame based on dfDL. Its rows should have the following structure:
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
The fields of the new DataFrame are derived from the CSV file as follows (pseudocode; a sketch follows below):
schema_from = source_table.split("\\.")(0)           // Example: custom_cib_ml_stg
table_from  = source_table.split("\\.")(1)           // Example: p_overall_part_tend_cust
column_from = source                                 // Example: inn_num
link_type   = relation_type                          // Example: direct
schema_to   = "dbname1.table1.csv".split("\\.")(0)   // Example: dbname1
table_to    = "dbname1.table1.csv".split("\\.")(1)   // Example: table1
column_to   = target                                 // Example: avg_ensure_sum_12m
(Note: Scala's String.split takes a regex, so the dot must be escaped as "\\.".)
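For illustration, a minimal sketch of this mapping with Spark's built-in split and lit column functions, so the transformation stays inside the DataFrame instead of going through collect (dfLinks is a name introduced here; schemaTo and tableTo are assumed to come from the file name, as in my code further down):

import org.apache.spark.sql.functions.{col, lit, split}

// Hypothetical: in the real code the name comes from file.getPath.getName
val Array(schemaTo, tableTo, _*) = "dbname1.table1.csv".split("\\.")

val dfLinks = dfDL.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit(schemaTo).as("schema_to"),
  lit(tableTo).as("table_to"),
  col("target").as("column_to")
)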
I need to create this new DataFrame, but I can't manage it on my own.
P.S. I need this DataFrame so that I can later produce a JSON file from it. Sample JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))

  val dfDL = spark.read.option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)

  //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))

  // Collects every row to the driver and indexes columns by position
  dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
                                   row.getString(2).split("\\.")(1),
                                   row.getString(1),
                                   row.getString(3),
                                   schemaTo,
                                   tableTo,
                                   row.getString(0)))
}

// Requires an implicit org.json4s.Formats in scope for Extraction.decompose
def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
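For comparison, a sketch that replaces the positional getString calls with a typed Dataset over the dfLinks DataFrame from the earlier snippet; the column names there match the DataLink fields, and json4s' Extraction.decompose needs an implicit Formats in scope:

import org.json4s.{DefaultFormats, Extraction, JValue}
import spark.implicits._ // provides the Encoder for DataLink

implicit val formats: DefaultFormats.type = DefaultFormats

// Typed view of the transformed DataFrame; columns are matched by name
val dataLinks: Array[DataLink] = dfLinks.as[DataLink].collect()

val json: Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)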
【Comments】:
Tags: dataframe scala apache-spark apache-spark-sql