【Question title】: Speed up splitting and merging dataframe rows in R
【Posted】: 2020-02-27 06:07:05
【Question】:

I have a data frame whose rows I want to split.

df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)

I want to split the text column into sentences and end up with the following:

df <- data.frame (text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..", 
                            "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment.", 
                            "Lately, I haven't been able to view my Online Payment Card.", 
                            "It's prompting me to have to upgrade my account whereas before it didn't.", 
                            "I have used the Card at various online stores before and have successfully used it.", 
                            "But now it's starting to get very frustrating that I have to said upgrade my account.", 
                            "Do fix this|", "**I noticed some users have the same issue|", 
                            "I've been using this app for almost 2 years without any problems.", 
                            "Until, their system just blocked my virtual paying card without any notice.", 
                            "So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs.", 
                            "This app has been a big disappointment."), id = c(1, 2, 1, 1, 
                                                                               1, 1, 1, 1, 2, 2, 2, 2), tag = c("DONE", "DONE", NA, NA, NA, 
                                                                                                                NA, NA, NA, NA, NA, NA, NA), stringsAsFactors = FALSE)

I got this working with the code below, but I think the for loop is too slow. I need to do this for 73,000 rows, so I need a faster approach.

Attempt 1:

library("qdap")
df$tag <- NA
for (review_num in 1:nrow(df)) {
  x = sent_detect(df$text[review_num])
  if (length(x) > 1) {
    for (sentence_num in 1:length(x)) {
      df <- rbind(df, df[review_num,])
      df$text[nrow(df)]   <- x[sentence_num]
    }
    df$tag[review_num] <- "DONE"
  }
}

Attempt 2 (rows: 73,000; time taken: 252 minutes, or ~4 hours):

reviews_df1 <- data.frame(id=character(0), text=character(0))
for (review_num in 1:nrow(df)) {
  preprocess_sent <- sent_detect(df$text[review_num])
  if (length(preprocess_sent) > 0) {
    x <- data.frame(id=df$id[review_num],
                    text=preprocess_sent)
    # accumulate into reviews_df1 so rows from earlier iterations are kept
    reviews_df1 <- rbind(reviews_df1, x)
  }
}
colnames(reviews_df1) <- c("id", "text")

Attempt 3 (rows: 29,000; time taken: 170 minutes, or ~2.8 hours):

library(qdap)
library(dplyr)
library(tidyr)

df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)

df %>%
  group_by(text) %>% 
  mutate(sentences = list(sent_detect(df$text))) %>% 
  unnest(cols=sentences) -> out.df

out.df

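【Comments】:

  • Try data.table. It may be faster, though I don't know by how much.
  • Have you tried "vectorizing" your loop by applying a user-defined function to the strings you want to split? See ?apply() or some of the blogs about this "family" of functions. Such constructs tend to be faster than for loops.

Tags: r performance for-loop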
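
The suggestion in the comments can be sketched as follows. This is a minimal illustration, not the asker's or answerer's code: `split_sentences` is a hypothetical base-R stand-in for `qdap::sent_detect` (so the example runs without qdap), and the two short reviews are made-up sample data. The idea is to split every review once with `lapply` and then build the long data frame in one step, rather than calling `rbind` once per sentence.

```r
# Hypothetical stand-in for qdap::sent_detect (base R only):
# split after ".", "!" or "?" when followed by whitespace.
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

# Made-up sample data with the same columns as the question's df
df <- data.frame(
  id   = c(1, 2),
  text = c("First review. It has two sentences.",
           "Second review! Short one. Three sentences here."),
  stringsAsFactors = FALSE
)

# One character vector of sentences per review
sentences <- lapply(df$text, split_sentences)

# Build the result once: repeat each id as many times as its review has sentences
out <- data.frame(
  id   = rep(df$id, lengths(sentences)),
  text = unlist(sentences),
  stringsAsFactors = FALSE
)
out
```

Growing a data frame with `rbind` inside a loop copies the accumulated rows on every iteration, so avoiding that is likely to matter as much as the choice between `for` and `lapply`.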


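【Solution 1】:

It baffles me that it takes this long. You can turn your input into a list and use mclapply (if you're not on Windows) to speed things up further. Here is an example using data.table and parallel::mclapply on Womens Clothing E-Commerce Reviews.csv (23k rows). With lapply it takes about 21 seconds; with mclapply on 4 cores it takes about 5.5 seconds. Admittedly, these are not very long reviews and sentences, but it demonstrates the benefit of running in parallel.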

library(data.table)
library(parallel)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#> Loading required package: qdapTools
#> 
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> Loading required package: RColorBrewer
#> Registered S3 methods overwritten by 'qdap':
#>   method               from
#>   t.DocumentTermMatrix tm  
#>   t.TermDocumentMatrix tm
#> 
#> Attaching package: 'qdap'
#> The following object is masked from 'package:base':
#> 
#>     Filter

dt <- fread("https://raw.githubusercontent.com/NadimKawwa/WomeneCommerce/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")
system.time({
dfl <- setNames(as.list(dt$`Review Text`), dt$V1)
makeDT <- function(x) data.table(text = sent_detect(x))
out.dt <- rbindlist(mclapply(dfl, makeDT, mc.cores=4L), idcol = "id")
out.dt[, tag := NA_character_]
out.dt <- rbind(data.table(id=dt$V1, text=dt$`Review Text`, tag = "DONE"), out.dt)
})
#>    user  system elapsed 
#>  21.078   0.482   5.467
out.dt
#>            id
#>      1:     0
#>      2:     1
#>      3:     2
#>      4:     3
#>      5:     4
#>     ---      
#> 137388: 23484
#> 137389: 23484
#> 137390: 23484
#> 137391: 23485
#> 137392: 23485
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
#>      1:                                                                                                                                                                                                                                                                                                                                                                                                                                                                Absolutely wonderful - silky and sexy and comfortable
#>      2:                                                                                                                                                                                                     Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8"".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
#>      3: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
#>      4:                                                                                                                                                                                                                                                                                                                                                                                         I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
#>      5:                                                                                                                                                                                                                                                                                                                     This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!
#>     ---                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
#> 137388:                                                                                                                                                                                                                                                                                                                                                                                                                      the medium fits my waist perfectly, but was way too long and too big in the bust and shoulders.
#> 137389:                                                                                                                                                                                                                                                                                                                                                                                                              if i wanted to spend the money, i could get it tailored, but i just felt like it might not be worth it.
#> 137390:                                                                                                                                                                                                                                                                                                                                                                                               side note - this dress was delivered to me with a nordstrom tag on it and i found it much cheaper there after looking!
#> 137391:                                                                                                                                                                                                                                                                                                                                                                                                                         This dress in a lovely platinum is feminine and fits perfectly, easy to wear and comfy, too!
#> 137392:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    highly recommend!
#>          tag
#>      1: DONE
#>      2: DONE
#>      3: DONE
#>      4: DONE
#>      5: DONE
#>     ---     
#> 137388: <NA>
#> 137389: <NA>
#> 137390: <NA>
#> 137391: <NA>
#> 137392: <NA>
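
Since `mclapply` relies on forking and so cannot run in parallel on Windows, a socket cluster from the same `parallel` package is one possible alternative. The following is only a sketch under assumptions: `split_sentences` is a hypothetical base-R stand-in for `qdap::sent_detect`, the reviews are made-up sample data, and two workers are used for illustration.

```r
library(parallel)

# Hypothetical stand-in for qdap::sent_detect (base R only)
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

# Made-up sample data
ids     <- c(1, 2)
reviews <- c("Nice dress. Fits well!", "Too small. Returned it. Sad.")

cl <- makeCluster(2L)  # socket cluster with two worker processes
# If the real sent_detect were used, each worker would need the package first:
# clusterEvalQ(cl, library(qdap))
sentences <- tryCatch(
  parLapply(cl, reviews, split_sentences),
  finally = stopCluster(cl)  # always shut the workers down
)

# Assemble the long table once, repeating each id per sentence
out <- data.frame(
  id   = rep(ids, lengths(sentences)),
  text = unlist(sentences, use.names = FALSE),
  stringsAsFactors = FALSE
)
out
```

Starting the workers and shipping data to them has a fixed cost, so this is only likely to pay off on inputs of roughly the size discussed here, not on a handful of rows.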

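On second thought, your code may be the problem. Inside mutate(), df$text refers to the entire column of the original data frame rather than the text of the current group, so sent_detect is run on every review for every group. Try changing

df %>%
    group_by(text) %>% 
    mutate(sentences = list(sent_detect(df$text))) %>% 
    unnest(cols=sentences) -> out.df

to

df %>%
    group_by(text) %>% 
    mutate(sentences = list(sent_detect(text))) %>% 
    unnest(cols=sentences) -> out.df

and see whether that is the culprit (I think it is).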
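
To see the difference between `df$text` and `text` inside a grouped `mutate()`, here is a small sketch. It assumes dplyr and tidyr are installed; `split_sentences` is a hypothetical base-R stand-in for `qdap::sent_detect`, and the two reviews are made-up sample data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for qdap::sent_detect (base R only)
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

# Made-up sample data: 2 reviews, 5 sentences in total
df <- data.frame(
  id   = c(1, 2),
  text = c("First review. It has two sentences.",
           "Second review! Short one. Three sentences here."),
  stringsAsFactors = FALSE
)

# df$text ignores the grouping: every group receives the sentences of ALL reviews
wrong <- df %>%
  group_by(text) %>%
  mutate(sentences = list(split_sentences(df$text))) %>%
  unnest(cols = sentences)

# text is evaluated per group: each review receives only its own sentences
right <- df %>%
  group_by(text) %>%
  mutate(sentences = list(split_sentences(text))) %>%
  unnest(cols = sentences)

nrow(wrong)  # 10: each of the 2 reviews is paired with all 5 sentences
nrow(right)  # 5: one row per sentence
```

With n reviews, the `df$text` version splits all n reviews once per group, so the work and the output grow roughly quadratically, which would be consistent with the multi-hour run times reported in the question.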
