【Question Title】:R: How can I extract an element from a column of data in a Spark connection (sparklyr) in a pipe?
【Posted】:2018-10-30 12:48:51
【Question】:

I have a dataset like the one below.

Because the data is large, I loaded it through the sparklyr package, so I can only use pipe statements.

pos <- str_sub(csj$helpful,2)
neg1 <- str_sub(csj$helpful,4)
csj <- csj %>% mutate(neg=replace(helpful,stringr::str_sub(csj$helpful,4)==1,0))
csj <- csj %>% mutate(help=pos/neg)
csj
is.null(csj$helpful)

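I want to create a column named "help", defined as "the first number in the helpful column / the second number in the helpful column".

If the second number is 0, I need to change it to 1 before dividing.

The data frame is named csj.

But it doesn't work.

I would be glad if someone could help me solve this.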

After following @Sebastian Hoyos's suggestion, I still get NaN for col1, col2, and col3 (although the example he gave me worked). How should I fix this?

+) After trying it without as.numeric, I got this result.

> csj %>%
+   mutate(col1 = stringi::stri_extract_first_regex(csj$helpful, pattern = "[0-9]"),#extract first number
+          col2 = stringi::stri_extract_last_regex(csj$helpful, pattern = "[0-9]"),#extract second
+          col3 = ifelse(col2 == 0, 1, col2 ),#change 0s to 1
+          help = col1/col3) #divide row1 and 3


# Source:   lazy query [?? x 12]
# Database: spark_connection
   `_c0` reviewerID     asin  helpful length_of_review overall unixReviewTime category   col1  col2  col3   help
   <int> <chr>          <chr> <chr>              <dbl> <chr>   <chr>          <chr>      <chr> <chr> <chr> <dbl>
 1     0 A1KLRMWW2FWPL4 31887 [0, 0]               172 5       1297468800     Clothes_s~ ""    ""    NA      NaN
 2     1 A2G5TCU2WDFZ65 31887 [0, 0]               306 5       1358553600     Clothes_s~ ""    ""    NA      NaN
 3     2 A1RLQXYNCMWRWN 31887 [0, 0]               312 5       1357257600     Clothes_s~ ""    ""    NA      NaN
 4     3 A8U3FAMSJVHS5  31887 [0, 0]               405 5       1398556800     Clothes_s~ ""    ""    NA      NaN
 5     4 A3GEOILWLK86XM 31887 [0, 0]               453 5       1394841600     Clothes_s~ ""    ""    NA      NaN
 6     5 A27UF1MSF3DB2  31887 [0, 0]               375 4       1396224000     Clothes_s~ ""    ""    NA      NaN
 7     6 A16GFPNVF4Y816 31887 [0, 0]               334 5       1399075200     Clothes_s~ ""    ""    NA      NaN
 8     7 A2M2APVYIB2U6K 31887 [0, 0]               158 5       1356220800     Clothes_s~ ""    ""    NA      NaN
 9     8 A1NJ71X3YPQNQ9 31887 [0, 0]                96 4       1384041600     Clothes_s~ ""    ""    NA      NaN
10     9 A3EERSWHAI6SO  31887 [7, 8]               532 5       1349568000     Clothes_s~ ""    ""    NA      NaN
# ... with more rows
> 

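【Comments】:

    Tags: r sparklyr dplyr


    【Solution 1】:

    Although this is not the most elegant piece of code, it should get the job done. Since no sample data was provided apart from the screenshot, I created an example containing the elements that matter for your question.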

    csj <- tibble(helpful = rep(c("[0,0]","[0,1]","[0,2]","[1,3]"),100),
                                overall = rep(c(5,4,3,2),100))
    # Extract both numbers from helpful and create the help column
    csj %>%
          mutate(col1 = as.numeric(stringi::stri_extract_first_regex(helpful, pattern = "[0-9]")), # extract first number
                 col2 = as.numeric(stringi::stri_extract_last_regex(helpful, pattern = "[0-9]")), # extract second number
                 col3 = ifelse(col2 == 0, 1, col2), # change 0s to 1
                 help = col1 / col3) %>% # divide col1 by col3
          select(helpful, help) # select the columns you wish to keep
    

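    This should work as long as you adapt the functions to your dataset. Also note that helpful is a character column in the dataset, which is why it has to be converted to numeric.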
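One caveat worth checking on real data: the pattern "[0-9]" matches a single digit, so a value such as "[12, 15]" would yield 1 and 5 rather than 12 and 15. The following is a minimal base-R sketch of the same parse-then-divide logic using "[0-9]+" instead. The sample vector and the variable names (digits, denom, help_ratio) are made up for illustration, and this runs on a local vector, not on a Spark table.

```r
# Hypothetical sample values, including a multi-digit pair
helpful <- c("[0, 0]", "[7, 8]", "[12, 15]")

# "[0-9]+" matches a whole run of digits; "[0-9]" alone matches one digit
digits <- regmatches(helpful, gregexpr("[0-9]+", helpful))

# First and last number found in each string, converted from character to numeric
first  <- as.numeric(vapply(digits, function(d) d[1], character(1)))
second <- as.numeric(vapply(digits, function(d) d[length(d)], character(1)))

# Replace a zero denominator with 1 before dividing
denom <- ifelse(second == 0, 1, second)
help_ratio <- first / denom
print(help_ratio)
```

If the counts in the data never exceed 9, both patterns give the same result.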

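    Edit: I looked into sparklyr and realized why the code was not working, so I built an example of my own to test against. I did not replicate your data exactly, but it should be close enough to provide a working solution.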

    library(sparklyr)
    library(dplyr)
    library(ggplot2)
    library(magrittr) 
    sc <- spark_connect(master="local")
    #create dataframe
    cjs <- tibble(helpful = rep(c("[0,  0]","[0, 1]","[0, 2]","[1, 3]","[,1]",NA,"a"),100),
                  overall = rep(c(6,5,4,3,2,1,0),100))
    
    #copy the local data frame (cjs) to Spark; csj is the remote table reference
    csj <- copy_to(sc, cjs, "cjs")
    
    #this should do the trick
    csj %>% 
      mutate(newcol2 = regexp_replace(helpful, "[^0-9,]", " "), 
             newcol3 = as.numeric(substring_index(newcol2, ",", 1)),
             newcol4 = as.numeric(substring_index(newcol2,",",-1)),
             newcol5 = ifelse(newcol4 == 0, 1, newcol4),
             help = newcol3/newcol5) %>% 
      select(starts_with("new"),help) #select the columns you need with help calculated appropriately
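To see what the intermediate columns contain without a Spark connection, the string steps can be mimicked on a local vector. This is only a rough base-R emulation under my own assumptions: gsub and sub stand in for regexp_replace and substring_index, the variable names are invented, and Spark's own casting rules for malformed strings may differ from R's as.numeric.

```r
# Local sample, including malformed entries like those in the test data above
helpful <- c("[0,  0]", "[0, 1]", "[1, 3]", "[7, 8]", "[,1]", NA, "a")

# Stand-in for regexp_replace(helpful, "[^0-9,]", " "): keep digits and commas only
cleaned <- gsub("[^0-9,]", " ", helpful)

# Stand-in for substring_index(cleaned, ",", 1): text before the first comma
first_part <- sub(",.*$", "", cleaned)

# Stand-in for substring_index(cleaned, ",", -1): text after the last comma
last_part <- sub("^.*,", "", cleaned)

# Blank strings and missing values become NA when converted
numerator <- as.numeric(first_part)
denominator <- as.numeric(last_part)

# Replace a zero denominator with 1 before dividing
denominator <- ifelse(denominator == 0, 1, denominator)
help_ratio <- numerator / denominator
print(help_ratio)
```

Rows that do not contain two numbers end up as NA here, which is a reasonable place to look first if the real data produces unexpected missing values.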
    

    【Discussion】:

    • Hi @Sebastian, I tried your edited code and it works well! I had been unable to solve this for so long. Thank you very much! This is the result:
      # Source: lazy query [?? x 5]
      # Database: spark_connection
         newcol1   newcol2 newcol3 newcol4  help
         <chr>       <dbl>   <dbl>   <dbl> <dbl>
       7 " 0, 0 "        0       0       1 0
       8 " 0, 0 "        0       0       1 0
       9 " 0, 0 "        0       0       1 0
      10 " 7, 8 "        7       8       8 0.875
      # ... with more rows
      Perfect!
    • @AliceShin If the answer solved your problem, please consider accepting it.