【Posted】:2021-02-14 10:45:57
【Question】:
I want to use Spark to extract data from the following rows.
DataSetTest ro1 = new DataSetTest("apple", "fruit", "red", 3);
DataSetTest ro2 = new DataSetTest("apple", "fruit", "red", 4);
DataSetTest ro3 = new DataSetTest("car", "toy", "red", 1);
DataSetTest ro4 = new DataSetTest("bike", "toy", "white", 2);
DataSetTest ro5 = new DataSetTest("bike", "toy", "red", 5);
DataSetTest ro6 = new DataSetTest("apple", "fruit", "red", 3);
DataSetTest ro7 = new DataSetTest("car", "toy", "white", 7);
DataSetTest ro8 = new DataSetTest("apple", "fruit", "green", 1);
Dataset<Row> df = session.getSqlContext().createDataFrame(Arrays.asList(ro1, ro2, ro3, ro4, ro5, ro6, ro7, ro8), DataSetTest.class);
private void process() {
    // 1) group by key and sum the counts
    Dataset<Row> df2 = df.groupBy("keyword", "opt1", "prt2").sum("count");
    // 2) count per option combination and calculate the total per keyword
    Dataset<Row> df3 = df2.withColumn("fruit_red", ???)
        .withColumn("fruit_green", ???)
        .withColumn("toy_red", ???)
        .withColumn("toy_white", ???)
        .withColumn("total_count", ???);
    // 3) calculate the percentage
    Dataset<Row> df4 = df3.withColumn("percent", df3.col("total_count").divide("??sum of total_count??"));
}
Do you know how to compute parts 2) and 3)?
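To make the target of steps 2) and 3) concrete, here is a plain-Java sketch (no Spark involved) that computes the same numbers over the eight sample rows. It reflects one reading of the question: one row per keyword, one bucket per opt1_opt2 combination, and a percentage taken against the grand total. The class name `ExpectedPivot` and its helper methods are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedPivot {

    // The sample rows from the question: keyword, opt1, opt2, count.
    static final Object[][] ROWS = {
        {"apple", "fruit", "red", 3},
        {"apple", "fruit", "red", 4},
        {"car", "toy", "red", 1},
        {"bike", "toy", "white", 2},
        {"bike", "toy", "red", 5},
        {"apple", "fruit", "red", 3},
        {"car", "toy", "white", 7},
        {"apple", "fruit", "green", 1},
    };

    // Step 2: for each keyword, sum the counts into an "opt1_opt2" bucket.
    static Map<String, Map<String, Integer>> pivot() {
        Map<String, Map<String, Integer>> out = new LinkedHashMap<>();
        for (Object[] r : ROWS) {
            String bucket = r[1] + "_" + r[2];
            out.computeIfAbsent((String) r[0], k -> new LinkedHashMap<>())
               .merge(bucket, (Integer) r[3], Integer::sum);
        }
        return out;
    }

    // Step 2 (continued): total_count is the sum over a keyword's buckets.
    static int total(Map<String, Integer> buckets) {
        return buckets.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Step 3: a keyword's total divided by the grand total over all keywords.
    static double percent(String keyword) {
        Map<String, Map<String, Integer>> pivoted = pivot();
        int grandTotal = pivoted.values().stream().mapToInt(ExpectedPivot::total).sum();
        return (double) total(pivoted.get(keyword)) / grandTotal;
    }

    public static void main(String[] args) {
        pivot().forEach((keyword, buckets) ->
            System.out.println(keyword + " " + buckets
                + " total_count=" + total(buckets)
                + " percent=" + percent(keyword)));
    }
}
```

Under this reading, apple ends up with fruit_red = 10 and fruit_green = 1 (total 11), car with a total of 8, bike with a total of 7, and the grand total is 26.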
【Discussion】:
- For 2), use pivot; for 3), use a window function to get the overall total.
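The suggestion above could look roughly like the following Java sketch. It is not the asker's code: the column names `keyword`, `opt1`, `opt2` and `count` are assumptions (the `DataSetTest` bean is not shown, and `prt2` in the question is taken here to mean `opt2`), the class and method names are hypothetical, and it presumes Spark SQL 2.1 or later on the classpath. The DataFrame is built from an explicit schema so the example does not depend on the bean.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.sum;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PivotPercentSketch {

    // Rebuilds the question's sample rows; the column names are assumed.
    static Dataset<Row> sampleData(SparkSession spark) {
        StructType schema = new StructType()
            .add("keyword", DataTypes.StringType)
            .add("opt1", DataTypes.StringType)
            .add("opt2", DataTypes.StringType)
            .add("count", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
            RowFactory.create("apple", "fruit", "red", 3),
            RowFactory.create("apple", "fruit", "red", 4),
            RowFactory.create("car", "toy", "red", 1),
            RowFactory.create("bike", "toy", "white", 2),
            RowFactory.create("bike", "toy", "red", 5),
            RowFactory.create("apple", "fruit", "red", 3),
            RowFactory.create("car", "toy", "white", 7),
            RowFactory.create("apple", "fruit", "green", 1));
        return spark.createDataFrame(rows, schema);
    }

    static Dataset<Row> pivotWithPercent(Dataset<Row> df) {
        // 2) One column per opt1_opt2 combination, summed per keyword.
        //    Listing the pivot values up front fixes the output columns.
        Dataset<Row> pivoted = df
            .withColumn("opt", concat_ws("_", col("opt1"), col("opt2")))
            .groupBy("keyword")
            .pivot("opt", Arrays.<Object>asList(
                "fruit_red", "fruit_green", "toy_red", "toy_white"))
            .sum("count")
            .na().fill(0L); // combinations that never occur become 0

        Dataset<Row> withTotal = pivoted.withColumn("total_count",
            col("fruit_red").plus(col("fruit_green"))
                .plus(col("toy_red")).plus(col("toy_white")));

        // 3) An unpartitioned window spanning every row yields the grand total.
        WindowSpec allRows = Window.rowsBetween(
            Window.unboundedPreceding(), Window.unboundedFollowing());
        return withTotal.withColumn("percent",
            col("total_count").divide(sum("total_count").over(allRows)));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]").appName("pivot-percent-sketch").getOrCreate();
        pivotWithPercent(sampleData(spark)).show();
        spark.stop();
    }
}
```

A window without a partition moves all rows to a single partition. That is usually acceptable here because the pivoted result has only one row per keyword, but for a large result it may be preferable to aggregate the grand total separately and join it back.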
Tags: java scala dataframe apache-spark dataset