Assuming the column names match the values in the currency column:
import org.apache.spark.sql.functions.{lit, col, coalesce, when}
import org.apache.spark.sql.Column
// Dummy data
val df = sc.parallelize(Seq(
  (49.5, "EUR", 99, 79, 69), (100.0, "GBP", 80, 120, 50)
)).toDF("paid", "currency", "EUR", "USD", "GBP")
// A list of available currencies
val currencies: List[String] = List("EUR", "USD", "GBP")
// Select listed value
val listedPrice: Column = coalesce(
  currencies.map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)
df.select($"*", (listedPrice - $"paid").alias("difference")).show
// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5|     EUR| 99| 79| 69|      49.5|
// |100.0|     GBP| 80|120| 50|     -50.0|
// +-----+--------+---+---+---+----------+
The SQL equivalent of the listedPrice expression is:
COALESCE(
  CASE WHEN (currency = 'EUR') THEN EUR ELSE null END,
  CASE WHEN (currency = 'USD') THEN USD ELSE null END,
  CASE WHEN (currency = 'GBP') THEN GBP ELSE null END
)
An alternative using foldLeft:
import org.apache.spark.sql.functions.when
val listedPriceViaFold = currencies.foldLeft(
  lit(null))((acc, c) => when($"currency" === c, col(c)).otherwise(acc))
df.select($"*", (listedPriceViaFold - $"paid").alias("difference")).show
// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5|     EUR| 99| 79| 69|      49.5|
// |100.0|     GBP| 80|120| 50|     -50.0|
// +-----+--------+---+---+---+----------+
listedPriceViaFold translates to the following SQL:
CASE
  WHEN (currency = 'GBP') THEN GBP
  ELSE CASE
    WHEN (currency = 'USD') THEN USD
    ELSE CASE
      WHEN (currency = 'EUR') THEN EUR
      ELSE null
    END
  END
END
Unfortunately, I am not aware of any built-in function that can directly express SQL like this:
CASE currency
  WHEN 'EUR' THEN EUR
  WHEN 'USD' THEN USD
  WHEN 'GBP' THEN GBP
  ELSE null
END
but you can use this construct in raw SQL.
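One way to use the raw-SQL form from the DataFrame API is to pass it to expr, which parses a SQL expression string into a Column. This is a minimal sketch, assuming Spark 2.x or later; the local SparkSession setup and the name listedPriceSql are illustrative and not part of the code above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("case-expr").getOrCreate()
import spark.implicits._

// Same dummy data as above
val df = Seq(
  (49.5, "EUR", 99, 79, 69), (100.0, "GBP", 80, 120, 50)
).toDF("paid", "currency", "EUR", "USD", "GBP")

// Simple CASE form written as raw SQL and parsed into a Column
val listedPriceSql = expr(
  """CASE currency
    |  WHEN 'EUR' THEN EUR
    |  WHEN 'USD' THEN USD
    |  WHEN 'GBP' THEN GBP
    |  ELSE null
    |END""".stripMargin)

val result = df.select($"*", (listedPriceSql - $"paid").alias("difference"))
result.show()
```

The resulting column can be used anywhere the coalesce or foldLeft versions are used, since all three are plain Column expressions.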
If my assumption is not correct, you can simply add a mapping between the column names and the values in the currency column.
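Such a mapping could look like the following sketch. The data, the column names (price_eur and so on), the currency values and the name currencyToColumn are all hypothetical, chosen only to show a case where the values and the column names differ; Spark 2.x or later with a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, when}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("currency-mapping").getOrCreate()
import spark.implicits._

// Hypothetical data: currency values no longer match the price column names
val df = Seq(
  (49.5, "euro", 99, 79, 69), (100.0, "pound", 80, 120, 50)
).toDF("paid", "currency", "price_eur", "price_usd", "price_gbp")

// Hypothetical mapping: value in the currency column -> name of the price column
val currencyToColumn: Map[String, String] = Map(
  "euro" -> "price_eur", "dollar" -> "price_usd", "pound" -> "price_gbp")

// when without otherwise yields null if the condition is false,
// so coalesce picks the single matching price
val listedPrice: Column = coalesce(
  currencyToColumn.toSeq.map { case (value, name) =>
    when($"currency" === value, col(name))
  }: _*)

val result = df.select($"*", (listedPrice - $"paid").alias("difference"))
result.show()
```

Only the construction of the conditions changes; the rest of the query is the same as in the coalesce version.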
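Edit:
Another option, which could be efficient if the source supports predicate pushdown and efficient column pruning, is to subset by currency and union the results: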
currencies.map(
  // for each currency, filter and add the difference (listed price minus paid)
  c => df.where($"currency" === c).withColumn("difference", col(c) - $"paid")
).reduce((df1, df2) => df1.unionAll(df2)) // Union all subsets (union on Spark 2.0+)
This is equivalent to SQL like this:
SELECT *, EUR - paid AS difference FROM df WHERE currency = 'EUR'
UNION ALL
SELECT *, USD - paid AS difference FROM df WHERE currency = 'USD'
UNION ALL
SELECT *, GBP - paid AS difference FROM df WHERE currency = 'GBP'