You can do it like this:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"..." column syntax
// create a DataFrame with demo data
val df = spark.sparkContext.parallelize(Seq(
(1, "Fname1", "Lname1", "Belarus"),
(2, "Fname2", "Lname2", "Belgium"),
(3, "Fname3", "Lname3", "Austria"),
(4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname","lname", "country")
// create a new column holding the first letter of the country column
val result = df.withColumn("countryFirst", split($"country", "")(0))
// save the data, partitioned by the first letter of country
result.write.partitionBy("countryFirst").format("com.databricks.spark.csv").save("outputpath")
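As a quick local sanity check of the split logic itself (no Spark needed): Spark's split uses Java regex semantics, and on a modern JVM splitting on the empty string yields one token per character, so index 0 is the first letter:

```scala
object SplitCheck extends App {
  // splitting on "" gives per-character tokens; element 0 is the first letter
  val first = "Belarus".split("")(0)
  println(first)
}
```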
Edited:
You can also use substring, as suggested by Raphel, which can perform better:
substring(Column str, int pos, int len) — returns the substring that starts at pos and is of length len when str is of String type, or the slice of the byte array that starts at pos (in bytes) and is of length len when str is of Binary type.
val result = df.withColumn("firstCountry", substring($"country",1,1))
and then use partitionBy and write as above.
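As a minimal sketch of that last step (assuming the same df as above; "outputpath" is still just a placeholder path):

```scala
// derive the partition column with substring instead of split
val result = df.withColumn("firstCountry", substring($"country", 1, 1))

// write CSV output partitioned by the first letter of country
// ("csv" is the built-in source name in Spark 2.x+)
result.write
  .partitionBy("firstCountry")
  .format("csv")
  .save("outputpath")
```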
Hope this solves your issue!