【Question title】: How to wrangle data efficiently to automate an Excel process?
【Posted】: 2019-09-20 10:42:27
【Question】:

We run far too many macros in Excel. Can we automate this with R?

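I am not an R expert yet, so I am not sure what I need. Functions like gather, spread, or reshape do not seem to fit here. stringr and regex might work, but I am not skilled enough to try them.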

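Basically,

  1. We have to remove the IDs, e.g. beet_root and apple, from the "list" column (they usually start with a letter; values such as ot_p are known exceptions that are not IDs). Of course, we can also drop those rows altogether.
  2. Create a new column "names" holding the ID that each value belongs to, e.g. the orange ID for the list values 9734 and 75R4, but not for 123/R90, which belongs to the grapes ID.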

Please find the reprex (a simplified version of the Excel process) below.

library(tidyverse)



> input
        list other_col1 other_col2
1  beet_root          a          u
2    123/92R          b          b
3    123/92R          c          q
4    10.1233          d          p
5      9.485          e          e
6       ot_p          f          f
7      apple          g          b
8    12X0893          z          z
9      123fg          h          8
10     038q4          i          i
11    orange          j          j
12      9734          k          9
13      75R4          l          l
14    grapes          m          m
15   123/R90          n          5
16   90X83.6          o          o


> expected_output
      list other_col1 other_col2     names
1  123/92R          b          b beet_root
2  123/92R          c          q beet_root
3  10.1233          d          p beet_root
4    9.485          e          e beet_root
5     ot_p          f          f beet_root
6  12X0893          z          z     apple
7    123fg          h          8     apple
8    038q4          i          i     apple
9     9734          k          9    orange
10    75R4          l          l    orange
11 123/R90          n          5    grapes
12 90X83.6          o          o    grapes


Data:

# Actual data 
input = data.frame("list" = c("beet_root","123/92R","123/92R","10.1233","9.485","ot_p",
                            "apple","12X0893","123fg","038q4",
                            "orange","9734","75R4",
                           "grapes", "123/R90","90X83.6"),
                "other_col1" = c("a","b","c","d","e","f","g","z",
                                "h","i","j","k","l","m","n","o"),
                "other_col2" = c("u","b","q","p","e","f","b","z",
                                "8","i","j","9","l","m","5","o"))
expected_output = data.frame("list" = c("123/92R","123/92R","10.1233","9.485","ot_p",
                                      "12X0893","123fg","038q4",
                                      "9734","75R4",
                                      "123/R90","90X83.6"),
                           "other_col1" = c("b","c","d","e",
                                            "f","z","h","i",
                                            "k","l",
                                            "n","o"
                                            ),
                           "other_col2" = c("b","q","p","e",
                                            "f","z","8","i",
                                            "9","l",
                                            "5","o"),
                           "names" = c("beet_root","beet_root","beet_root","beet_root","beet_root",
                                       "apple","apple","apple",
                                       "orange","orange",
                                       "grapes","grapes"))

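【Comments】:

  • Sounds like a merge?
  • Really? There is only one table as both input and output.
  • The problem is that the input and expected output are hidden, which makes the question a bit hard to understand.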

Tags: r string dplyr


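【Solution 1】:

We create a new column names that takes the value from list when that value does not match a particular pattern (starts with a digit or contains "ot_p"), and is NA otherwise. We then fill the NA values in names downward and filter the rows with the same regular expression, keeping only the rows that match.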

library(dplyr)

input %>%
  mutate(names = ifelse(!grepl("^[0-9]|ot_p", list), list, NA)) %>% 
  tidyr::fill(names) %>%
  filter(grepl("^[0-9]|ot_p", list))

#      list other_col1 other_col2     names
#1  123/92R          b          b beet_root
#2  123/92R          c          q beet_root
#3  10.1233          d          p beet_root
#4    9.485          e          e beet_root
#5     ot_p          f          f beet_root
#6  12X0893          z          z     apple
#7    123fg          h          8     apple
#8    038q4          i          i     apple
#9     9734          k          9    orange
#10    75R4          l          l    orange
#11 123/R90          n          5    grapes
#12 90X83.6          o          o    grapes

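Run input[] <- lapply(input, as.character) first to convert the factor columns to character. This matters on R versions before 4.0.0, where data.frame() turns strings into factors by default.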
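
To illustrate why that conversion matters, here is a minimal base R sketch. The three-value sample and the variable names (demo, pattern, from_factor, from_character) are made up for illustration, and stringsAsFactors = TRUE is set explicitly to mimic the older default.

```r
# Hypothetical three-value sample; stringsAsFactors = TRUE mimics the
# data.frame() default of R versions before 4.0.0.
demo <- data.frame(list = c("beet_root", "123/92R", "apple"),
                   stringsAsFactors = TRUE)
pattern <- "^[0-9]|ot_p"

# On a factor, ifelse() returns the underlying integer codes, not the labels.
from_factor <- ifelse(!grepl(pattern, demo$list), demo$list, NA)

# After converting every column to character, the labels survive.
demo[] <- lapply(demo, as.character)
from_character <- ifelse(!grepl(pattern, demo$list), demo$list, NA)

print(from_factor)
print(from_character)
```

Without the conversion, the names column would end up holding numbers instead of beet_root, apple, and so on.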


【Solution 2】:

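An option using tidyverse:

library(tidyverse)

input %>%
  # Value rows (starting with a digit or with ot_p) get NA; ID rows keep their own label
  mutate(names = case_when(str_detect(list, "^([0-9]|ot_p)") ~ NA_character_,
                           TRUE ~ as.character(list))) %>%
  # Carry each ID label down over the value rows that follow it
  fill(names) %>%
  # Keep only the value rows, using the same pattern as above
  filter(str_detect(list, "^([0-9]|ot_p)"))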
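
As a dependency-free variation on the two answers above, the same idea can be sketched in base R. This is only a sketch under stated assumptions: the function name label_rows and its exceptions argument are hypothetical, and it assumes that every value starts with a digit unless it is listed in exceptions.

```r
# Input from the question, built with character columns.
input <- data.frame(
  list = c("beet_root", "123/92R", "123/92R", "10.1233", "9.485", "ot_p",
           "apple", "12X0893", "123fg", "038q4",
           "orange", "9734", "75R4",
           "grapes", "123/R90", "90X83.6"),
  other_col1 = c("a", "b", "c", "d", "e", "f", "g", "z",
                 "h", "i", "j", "k", "l", "m", "n", "o"),
  other_col2 = c("u", "b", "q", "p", "e", "f", "b", "z",
                 "8", "i", "j", "9", "l", "m", "5", "o"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `exceptions` lists values that do not start with a
# digit but should still be treated as values rather than as IDs.
label_rows <- function(df, exceptions = c("ot_p")) {
  vals <- as.character(df$list)
  is_value <- grepl("^[0-9]", vals) | vals %in% exceptions
  # Every ID row opens a new group; cumsum() numbers the groups 1, 2, 3, ...
  group <- cumsum(!is_value)
  ids <- vals[!is_value]
  # Rows before the first ID fall in group 0 and get NA.
  df$names <- c(NA_character_, ids)[group + 1]
  out <- df[is_value, , drop = FALSE]
  rownames(out) <- NULL
  out
}

result <- label_rows(input)
print(result)
```

Passing the known exceptions as a vector avoids editing the regular expression whenever a new exception such as ot_p turns up.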
    
