[Posted]: 2020-11-06 15:14:21
[Question]:
I have a dataframe containing sales of different items across different sales outlets; the one shown below covers only a few items and a few outlets. Each item has a benchmark of 100 units sold per day. For each item, sales above 100 units are flagged "Yes" and sales below 100 units are flagged "No".
val df1 = Seq(
  ("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", .....),
  ("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
  ("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
  ("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
  ("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", .....),
  ("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
  ("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", .....))
  .toDF("Outlet", "Boys_Toys", "Girls_Toys", "Men_Shoes", "Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)
Now I want to add a column "Count_of_Yes" whose value, for each outlet (each row), is the total number of "Yes" flags in that row. How do I go over each row to get the count of "Yes"?
My expected dataframe would be:
val output_df = Seq(
  ("Mumbai", 90, 109, 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
  ("Singapore", 149, 129, 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
  ("Hawaii", 127, 101, 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
  ("New York", 146, 130, 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
  ("Los Angeles", 94, 99, 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
  ("Dubai", 201, 229, 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
  ("Bangalore", 56, 89, 61, 77, ............., "No", "No", "No", "No", ....., 0))
  .toDF("Outlet", "Boys_Toys", "Girls_Toys", "Men_Shoes", "Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")
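One possible approach, sketched below under the assumption that the "Yes"/"No" flag column names are known (here only the four shown above are listed; the elided columns would need to be added to `flagCols`): instead of iterating over rows, build a column expression that adds 1 for every flag column equal to "Yes".

```scala
import org.apache.spark.sql.functions.{col, when, lit}

// Flag columns to count over -- extend this list with the elided ">100" columns.
val flagCols = Seq("BT>100", "GT>100", "MS>100", "LS>100")

// Map each flag column to 1 when it is "Yes", else 0, then sum the results
// column-wise into a single expression.
val countOfYes = flagCols
  .map(c => when(col(c) === "Yes", 1).otherwise(0))
  .reduceOption(_ + _)
  .getOrElse(lit(0))

val output_df = df1.withColumn("Count_of_Yes", countOfYes)
```

This stays entirely in the DataFrame API, so Spark can optimize it; no row-by-row loop or UDF is needed. Note that if you reference these columns through `expr`/`selectExpr` instead of `col`, names containing `>` would need backticks.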
[Discussion]:
Tags: scala apache-spark apache-spark-sql