【Question title】: How to use the substr() function on a column in SparkR
【Posted】: 2016-02-11 21:25:36
【Question description】:

How do I use the substr() function on a DataFrame column in SparkR?

+----------+----------------+-----------+
|   cust_id|  tran_datetime |Total_trans|
+----------+----------------+-----------+
|CQ98901297|2015-06-06 09:00|          1|
|CQ98901297|2015-05-01 09:25|          1|
|CQ98901297|2015-05-02 10:45|          1|
|CQ98901297|2015-05-03 11:01|          1|

I need to strip the time portion from the tran_datetime column, keeping only the date.

【Question discussion】:

  • What have you tried, and why didn't it work?

Tags: apache-spark sparkr


【Solution 1】:
# Use substr(column, start, stop) inside select(); positions are 1-based and inclusive
df_new <- select(df, df$cust_id, substr(df$tran_datetime, 1, 10), df$Total_trans)
# The substr() column gets an auto-generated name in df_new, so use rename() to give it the desired name
df_new <- rename(df_new, date = df_new[[2]])

showDF(df_new)

+----------+----------+-----------+
|   cust_id|      date|Total_trans|
+----------+----------+-----------+
|CQ98901297|2015-06-06|          1|
|CQ98901297|2015-05-01|          1|
|CQ98901297|2015-05-02|          1|
|CQ98901297|2015-05-03|          1|
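
The same result can be reached without the separate rename() step by naming the new column directly. The following is only a sketch under some assumptions: it uses the Spark 2.0+ session API (sparkR.session() and createDataFrame() without a sqlContext argument), which is newer than the release available when the question was asked, and it builds `df` locally from the sample rows in the question so that the example is self-contained.

```r
library(SparkR)

# Assumption: Spark 2.0+ with SparkR installed; older releases use
# sparkR.init() and pass a sqlContext to createDataFrame().
sparkR.session()

# Sample rows copied from the question, built locally for illustration
local_df <- data.frame(
  cust_id = rep("CQ98901297", 4),
  tran_datetime = c("2015-06-06 09:00", "2015-05-01 09:25",
                    "2015-05-02 10:45", "2015-05-03 11:01"),
  Total_trans = 1L,
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# Add a "date" column holding the first 10 characters of tran_datetime
df_new <- withColumn(df, "date", substr(df$tran_datetime, 1, 10))

# Keep only the columns wanted in the result, in the desired order
df_new <- select(df_new, "cust_id", "date", "Total_trans")

showDF(df_new)

# Bring the small result back to a local R data.frame for inspection
result <- collect(df_new)
```

withColumn() names the derived column in one step, so the column never receives an auto-generated name and there is nothing to rename afterwards.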

【Discussion】:

    【Solution 2】:

    I think the best solution is to apply strsplit.

    x <- data.frame(lin=c('+----------+----------------+-----------+',
                          '|   cust_id|  tran_datetime |Total_trans|',
                          '+----------+----------------+-----------+',
                          '|CQ98901297|2015-06-06 09:00|          1|',
                          '|CQ98901297|2015-05-01 09:25|          1|',
                          '|CQ98901297|2015-05-02 10:45|          1|'),
                    id = 1:6,
                    stringsAsFactors = FALSE)
    # remove the border lines that start with "+"
    x <- x[substr(x$lin, 1, 1) != "+", ]
    # split each line into pipe-separated columns
    y <- strsplit(x$lin, split = "\\|")
    # trim whitespace around each field after the split
    library(stringr)
    y <- lapply(y, function(fields) str_trim(fields, "both"))
    # drop the first column, which is empty because every line starts with "|"
    y <- do.call(rbind, y)[, -1]
    # the first row holds the column names
    colnames(y) <- y[1, ]
    y <- data.frame(y[-1, ], stringsAsFactors = FALSE)
    y
    

    【Discussion】:

    • I just want to remove the time from that column; the table represents a DataFrame
    • The answer needs to be in SparkR, not in plain R