How to use the substr() function on a column in SparkR

【Question title】: How to use substr() function to a column in sparkR
【Posted】: 2016-02-11 21:25:36
【Question description】:

How can I use the substr() function on a DataFrame column in SparkR?

+----------+----------------+-----------+
|   cust_id|   tran_datetime|Total_trans|
+----------+----------------+-----------+
|CQ98901297|2015-06-06 09:00|          1|
|CQ98901297|2015-05-01 09:25|          1|
|CQ98901297|2015-05-02 10:45|          1|
|CQ98901297|2015-05-03 11:01|          1|
+----------+----------------+-----------+

I need to strip the time from the tran_datetime column, keeping only the date.

【Question discussion】:

What have you tried, and why didn't it work?

Tags: apache-spark sparkr

【Solution 1】:

# Use substr(column, start, stop) inside select()
df_new <- select(df, df$cust_id , substr(df$tran_datetime, 1, 10), df$Total_trans)
# The substr() column gets an auto-generated name in df_new, so use rename() to give it the desired name
df_new <- rename(df_new, date = df_new[[2]])

showDF(df_new)

+----------+----------+-----------+
|   cust_id|      date|Total_trans|
+----------+----------+-----------+
|CQ98901297|2015-06-06|          1|
|CQ98901297|2015-05-01|          1|
|CQ98901297|2015-05-02|          1|
|CQ98901297|2015-05-03|          1|
+----------+----------+-----------+
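
The select() plus rename() steps above can also be collapsed by naming the new column as it is created. The following is only a sketch under assumptions that are not in the original answer: it assumes SparkR 2.3 or later with a local Spark installation (sparkR.session() is the 2.x entry point, and the SparkR migration notes say that before 2.3.0 substr() treated the start position as 0-based, so older versions may return different characters). The names local_df, df_a and df_b are illustrative, and the sample rows are copied from the question.

```r
library(SparkR)

# Assumption: Spark >= 2.3 is installed locally and reachable by SparkR.
sparkR.session(master = "local[1]", appName = "substr-example")

# Sample rows copied from the question.
local_df <- data.frame(
  cust_id = rep("CQ98901297", 4),
  tran_datetime = c("2015-06-06 09:00", "2015-05-01 09:25",
                    "2015-05-02 10:45", "2015-05-03 11:01"),
  Total_trans = 1L,
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# Option A: add a named column with withColumn(), then keep the wanted columns.
df_a <- withColumn(df, "date", substr(df$tran_datetime, 1, 10))
df_a <- select(df_a, "cust_id", "date", "Total_trans")

# Option B: name the substr() expression inline with alias().
df_b <- select(df, df$cust_id,
               alias(substr(df$tran_datetime, 1, 10), "date"),
               df$Total_trans)

showDF(df_a)
showDF(df_b)
```

Either option avoids referring to the generated column by position, which is what df_new[[2]] does in the answer above.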

【Discussion】:

【Solution 2】:

I think the best solution is to apply strsplit.

x <- data.frame(lin=c('+----------+----------------+-----------+',
                      '|   cust_id|  tran_datetime |Total_trans|',
                      '+----------+----------------+-----------+',
                      '|CQ98901297|2015-06-06 09:00|          1|',
                      '|CQ98901297|2015-05-01 09:25|          1|',
                      '|CQ98901297|2015-05-02 10:45|          1|'),
                id = 1:6,
                stringsAsFactors = FALSE)
# Remove the border lines that start with "+"
x <- x[substr(x$lin, 1, 1) != "+", ]
# Split each remaining line into pipe-separated fields
y <- strsplit(x$lin, split = "\\|")
# Trim whitespace around each field after the split
library(stringr)
y <- lapply(y, function(fields) str_trim(fields, "both"))
# [, -1] because the first column is empty (every line starts with "|")
y <- do.call(rbind, y)[, -1]
# Use the first row as the column names, then drop it from the data
colnames(y) <- y[1, ]
y <- data.frame(y[-1, ], stringsAsFactors = FALSE)
y

【Discussion】:

I just want to remove the time from that column; the table represents a DataFrame.
The answer needs to be in SparkR, not in base R.
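
Since the comments ask for a SparkR answer, one further possibility is to parse the column as a date instead of cutting characters. This is a hedged sketch rather than anything proposed in the thread: it assumes SparkR 2.3 or later (where to_date() accepts a format string) and a local Spark installation, it assumes every value follows the "yyyy-MM-dd HH:mm" pattern seen in the question, and the names local_df and df_dates are illustrative.

```r
library(SparkR)

# Assumption: Spark >= 2.3 is installed locally and reachable by SparkR.
sparkR.session(master = "local[1]", appName = "to-date-example")

# Two sample rows copied from the question.
local_df <- data.frame(
  cust_id = rep("CQ98901297", 2),
  tran_datetime = c("2015-06-06 09:00", "2015-05-01 09:25"),
  Total_trans = 1L,
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# Parse the string with an explicit pattern; the result is a date column,
# so the time of day is discarded.
df_dates <- withColumn(df, "tran_date",
                       to_date(df$tran_datetime, "yyyy-MM-dd HH:mm"))
df_dates <- select(df_dates, "cust_id", "tran_date", "Total_trans")

printSchema(df_dates)
showDF(df_dates)
```

Unlike substr(), this yields a date-typed column rather than a string, which may or may not be what is wanted downstream.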