【Question title】: R loop/function to find matches from a list
【Posted】: 2020-09-27 13:30:53
【Question description】:

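I have a list of salespeople, spread across three columns. I want to go through my list and count, for each person:
a) how many times their name appears in any of the three columns
b) how many times their name appears alongside a trainee salesperson (trainees' names are not in the list)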

ilist <- c("SP1","SP2","SP3","SP4","SP5")
    
df2 <- 
    data.frame(sales1 = c("SP5","SP5","SP4","SP3","SP2","SP1","SP3"), 
               sales2 = c("","SP4","SP1","SP1","SP5","SP3",""), 
               sales3 = c("","SP9","","SP6","","",""))

I'm hoping for output something like the following (although I'll accept any output):

      A     B   
SP1   3     1 
SP2   1     0 
SP3   3     1 
SP4   1     1 
SP5   3     1

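I've tried creating a loop and a function, but I can't seem to get either to work. Once it works, the goal is to make it part of a group_by, so I can break the results down by type and year: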

data %>%
group_by(type,year) %>%
your helpful answer here

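Edit: a selection of the columns I'm looking at. My ilist will be something like the following. (Of the 3 columns, columns 2 and 3 will contain blanks where a salesperson appears only in column 1; there is also no set position in which a salesperson or trainee may appear.)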

ilist <- c("SJ","KW","MOLC","FERB","BACC")



structure(list(iYear = structure(c(1L, 4L, 3L, 4L, 4L, 
4L, 5L, 5L, 6L, 9L), .Label = c("2020-07-01", "2020-07-02", "2020-07-03", 
"2020-07-04", "2020-07-06", "2020-07-07", "2020-07-08", "2020-07-09", 
"2020-07-10", "2020-07-11", "2020-07-12", "2020-07-13", "2020-07-14", 
"2020-07-15", "2020-07-16", "2020-07-17", "2020-07-18", "2020-07-19", 
"2020-07-20", "2020-07-21", "2020-07-22", "2020-07-23", "2020-07-24", 
"2020-07-25", "2020-07-27", "2020-07-28", "2020-07-29", "2020-07-30", 
"2020-07-31"), class = "factor"), iType = structure(c(4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "ZB", "BS", 
"CFN", "CTR", "MJ", "UK", "EFH", "ENOC", "EY", "F", "G", "CD", 
"HAEM", "HN", "IC", "LB", "LY", "MNN", "MOS", "NERO", "ZZZ", 
"ZZZQE", "GFT", "PG", "RE", "SK", "UR"), class = "factor"), 
    Sales.1 = structure(c(74L, 20L, 74L, 16L, 
    3L, 3L, 3L, 16L, 58L, 41L), .Label = c("", "ABUE", "AHMEM", 
    "AJOS", "ANNS", "AOK", "BACC", "BH", "BLAFM", "BLOCA", "BRAD", 
    "BROWNJ", "BRT", "BUIH", "BURDA", "BURYA", "CANRJ", "CAVM", 
    "CHAMBA", "COOSNP", "COUPSI", "CPH", "CTT", "DARA", "DILP", 
    "EXPAT", "FCH", "FERB", "FERMA", "GT", "GT", "HAEM", "HAMJR", 
    "HENJ", "HENJA", "HOWRA", "HUSA", "ILINC", "JONG", "KC", 
    "KNOT", "KW", "LAUC", "LOOP", "LYEJO", "LYNN", "MAJJ", "MCGREA", 
    "MENT", "MKB", "MOLC", "MUDHS", "MULLM", "NC", "NODS", 
    "O'BSG", "OLIT", "OLIVK", "PAEI", "PARKD", "PATEF", "PERT", 
    "POL", "PTRHUS", "RAMACN", "RAMS", "REYMA", "ROBCM", "ROBINE", 
    "SAMJN", "SAYC", "SHARMM", "SHEG", "SJ", "SJN", "SKINT", 
    "SLOP", "SORT", "SOUBIO", "SPOE", "TELED", "THAN", "THEL", 
    "TURH", "TURHJ", "UCONS", "UPH", "UT", "VALK", "WALJ"
    ), class = "factor"), Sales.2 = structure(c(1L, 
    12L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 45L), .Label = c("", "ABUE", 
    "AHMEM", "AJOS", "AOK", "BACC", "BH", "BLAFM", "BROWNJ", 
    "BUIH", "BURDA", "BURYA", "CANRJ", "CAVM", "CHAMBA", "COOSNP", 
    "COUPSI", "DARA", "DILP", "FCH", "FERB", "FERMA", "GYNT", 
    "HOWRA", "HUSA", "ILINC", "KW", "LAUC", "LOOP", "LYNN", "MAJJ", 
    "MOLC", "MULLM", "NC", "OLIVK", "PARKD", "POL", "PTRHUS", 
    "RAMS", "REYMA", "ROBCM", "ROBINE", "SAMJN", "SHARMM", "SJ", 
    "SJN", "SKINT", "SLOP", "SORT", "SPOE", "TELED", "THAN", 
    "THEL", "TURH", "VALK"), class = "factor"), Sales.3 = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "AHMEM", 
    "AOK", "BACC", "BLAFM", "CHAMBA", "COUPSI", "DILP", "FCH", 
    "KW", "LOOP", "MAJJ", "PTRHUS", "RAMS", "ROBCM", "SAMJN", 
    "SHARMM", "SJ", "TELED", "THAN", "VALK"), class = "factor")), row.names = c(NA, 
10L), class = "data.frame")

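【Comments】:

  • It's not very clear what you want. For example, SP1 appears at position 6 of df2$sales1, and at positions 3 and 4 of df2$sales2. So when you say you want where they appear in any of the columns, I would imagine you want the result for SP1 in column A to be (3, 4, 6). Why is the result 3? You also say you want to group by type and year... fine, but you don't have any type or year columns, so it's unclear what you're trying to achieve.
  • Hum, do you want to know in which rows of df2 a particular salesperson appears together with a trainee?
  • Updated with live data.
  • What is iType used for in your live data? Is it ignored in this process? Is it used for the "break it down by type and year" part?
  • @Ben Type is the department/section it falls under. So: for this period, for these sections, by this person.

Tags: r list function loops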


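【Solution 1】:

I'm not sure this is what you're looking for, but I thought it might help. If you're interested in using group_by, it sounds like you may want a tidyverse approach.

Here, row numbers are added so you can group_by each row and see whether a trainee appears in the same row as a salesperson.

Then pivot_longer puts the data into long format, and the empty strings are removed.

When grouping by row number, you can add an indicator of whether those people appear with a trainee salesperson. It checks whether anyone in the row is not included in ilist.

Finally, you can group_by each salesperson, filter to keep only those in ilist, and total the number of appearances (assuming each person appears only once per row in the initial data) and the number of appearances with a trainee.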

library(tidyverse)

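df2 %>%
  mutate(rn = row_number()) %>%           # row id: same rn = same row of df2
  pivot_longer(cols = -rn) %>%            # one row per (rn, column) pair
  # na_if() is applied to the column itself, because recent dplyr
  # versions expect a vector here rather than a whole data frame
  mutate(value = na_if(value, "")) %>%
  na.omit() %>%                           # drop the blanks
  group_by(rn) %>%
  mutate(with_trainee = ifelse(any(!value %in% ilist), 1, 0)) %>%
  group_by(value) %>%
  filter(value %in% ilist) %>%
  summarise(A = n(),                      # appearances
            B = sum(with_trainee))        # appearances with a trainee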

Output

  value     A     B
  <chr> <int> <dbl>
1 SP1       3     1
2 SP2       1     0
3 SP3       3     1
4 SP4       2     1
5 SP5       3     1
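
As a cross-check on the counts above, here is a minimal base R sketch that computes the same A and B columns without the tidyverse. It is not part of the original answer; the helper names `m`, `row_has_trainee`, and `counts` are made up for illustration, and it assumes blanks are stored as empty strings, as in the question's sample data.

```r
ilist <- c("SP1", "SP2", "SP3", "SP4", "SP5")
df2 <- data.frame(
  sales1 = c("SP5", "SP5", "SP4", "SP3", "SP2", "SP1", "SP3"),
  sales2 = c("", "SP4", "SP1", "SP1", "SP5", "SP3", ""),
  sales3 = c("", "SP9", "", "SP6", "", "", "")
)

m <- as.matrix(df2)

# TRUE for rows holding at least one non-blank name that is not in ilist
row_has_trainee <- apply(m, 1, function(r) any(r != "" & !(r %in% ilist)))

# For each listed person: A = rows they appear in, B = those rows with a trainee
counts <- t(sapply(ilist, function(p) {
  in_row <- rowSums(m == p) > 0
  c(A = sum(in_row), B = sum(in_row & row_has_trainee))
}))
counts
```

On the sample data this agrees with the tibble above, including A = 2 for SP4 (rows 2 and 3), which differs from the 1 shown in the question's expected output.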

Edit 1: With your "live data", and grouping the results by the year taken from iYear and by iType, you could try this:

library(tidyverse)

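df2 %>%
  mutate(rn = row_number(),
         iYear = substr(iYear, 1, 4)) %>%              # keep only the year
  pivot_longer(cols = -c(rn, iYear, iType)) %>%
  # na_if() is applied to the column itself, because recent dplyr
  # versions expect a vector here rather than a whole data frame
  mutate(value = na_if(value, "")) %>%
  na.omit() %>%
  group_by(rn, iYear, iType) %>%
  mutate(with_trainee = ifelse(any(!value %in% ilist), 1, 0)) %>%
  group_by(value, iYear, iType) %>%
  filter(value %in% ilist) %>%
  summarise(A = n(),
            B = sum(with_trainee))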

Edit 2: A more detailed explanation:

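The row number (rn, via row_number) is helpful here because you want to know whether salespeople were present at the same time (meaning "within the same row"). So if two salespeople share the same rn, they were present together.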

iYear is changed to just the year. It uses substr() (substring) to take the first through fourth characters of iYear, which is the year in the XXXX-XX-XX date format.
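
A small base R illustration of that step, using two made-up dates stored as a factor (as iYear is in the posted data); `substr()` coerces the factor to character before slicing:

```r
# Two sample dates, stored as a factor like iYear in the question
iYear <- factor(c("2020-07-01", "2020-07-04"))

# Characters 1 through 4 of each date string, i.e. the year
year <- substr(iYear, 1, 4)
year
```

This relies on the dates always being in the XXXX-XX-XX layout; if that is not guaranteed, parsing with `as.Date()` and then `format(..., "%Y")` would be a safer assumption.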

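pivot_longer (and its friend, pivot_wider) are very powerful for converting data between wide and long formats. From the tidyr package, pivot_longer takes all the columns (except rn, iYear, and iType) and puts them into two columns (name and value). value now holds the salespeople in a single column, rather than the multiple columns it started with.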
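
To see the reshaping on its own, here is a tiny sketch with a made-up two-row table (`wide` and `long` are illustrative names, not from the original data):

```r
library(tidyr)

# Two rows, two sales columns, plus the row id to keep
wide <- data.frame(
  rn     = 1:2,
  sales1 = c("SP5", "SP4"),
  sales2 = c("", "SP9")
)

# Every column except rn is stacked into a name/value pair
long <- pivot_longer(wide, cols = -rn)
long
```

Each original row becomes one row per sales column, so the two rows here turn into four, with the former column names in `name` and the salespeople (or blanks) in `value`.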

na_if, given "", turns empty strings "" into NA (missing data). The subsequent na.omit then removes the rows containing NA.
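
A minimal sketch of the blank-removal step on a made-up long table. It applies `na_if()` to the `value` column inside `mutate()`; as far as I can tell, dplyr 1.1.0 and later expect a vector there and reject a whole data frame, so this form should work on both older and newer versions.

```r
library(dplyr)

long <- data.frame(
  rn    = c(1, 1, 2),
  value = c("SP5", "", "SP4")
)

cleaned <- long %>%
  mutate(value = na_if(value, "")) %>%  # "" becomes NA
  na.omit()                             # rows containing NA are dropped

cleaned
```

If NA is not needed for anything else, `filter(value != "")` would be a shorter way to get the same rows.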

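The group_by on rn ensures you look together at the salespeople who share the same rn. I added iYear and iType so that they also appear in the final summary. Then with_trainee is a new column recording whether the salesperson was with a trainee (within the group_by, any is used to check whether any row in the group, sharing the same rn, is not in the ilist vector). It is coded 1 if so and 0 if not.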
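
The grouped flag can be tried in isolation. In this sketch the data and the shortened `ilist` are made up: row 1 holds only SP5, while row 2 holds SP5, SP4, and a trainee SP9.

```r
library(dplyr)

ilist <- c("SP1", "SP4", "SP5")

long <- data.frame(
  rn    = c(1, 2, 2, 2),
  value = c("SP5", "SP5", "SP4", "SP9")
)

flagged <- long %>%
  group_by(rn) %>%
  # 1 for every member of a row in which anyone is missing from ilist
  mutate(with_trainee = ifelse(any(!value %in% ilist), 1, 0)) %>%
  ungroup()

flagged
```

Note that the flag is set for everyone in the row, including the trainee; that is why the later filter on ilist is needed before counting.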

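The next group_by is on value (the salesperson), followed by a filter because you only want results for the people in ilist. (If you want everyone, including trainees who are not in ilist, you can omit this line.)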

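The final summarise works together with the group_by: n() gives the number of rows of data for each value (each salesperson), which is the same as the number of distinct rn values in which that salesperson appears overall. sum(with_trainee) is the total number of times with_trainee is 1 for a given value (salesperson).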
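
A short sketch of that last step, on a made-up table that is already flagged and filtered. The extra `rows_seen` column is only there to illustrate the claim that `n()` equals the number of distinct rn values when each person appears at most once per row.

```r
library(dplyr)

flagged <- data.frame(
  rn           = c(1, 2, 2, 3),
  value        = c("SP5", "SP5", "SP4", "SP4"),
  with_trainee = c(0, 1, 1, 0)
)

result <- flagged %>%
  group_by(value) %>%
  summarise(A = n(),                  # rows per salesperson
            rows_seen = n_distinct(rn),  # distinct rows they appear in
            B = sum(with_trainee))    # rows shared with a trainee

result
```

If the same person could appear twice in one row, `n()` and `n_distinct(rn)` would diverge, and `n_distinct(rn)` would probably be the count to keep.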

Output

  value iYear iType     A     B
  <fct> <chr> <fct> <int> <dbl>
1 SJ    2020  CFN       3     1

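【Comments】:

  • I've added some live data, if that helps.
  • Please see the edited answer. It takes the year from iYear and groups by year and iType. Is this closer to what you need?
  • This looks like it works, and it's great. A huge help. Looking at what you've created, I still have a long way to go. If I'm reading it correctly, you have: added a row ID number > reduced iYear to just the year > (pivot_longer I need to research, as I haven't come across it) > NA if blank > (na.omit I need to research) > grouped by three things (row number, year, and type) > added an extra column that flags values not in my list, to identify trainees, giving them a 1 > filtered to what is in my list > output the data with the number of matches in the list plus the count with trainees. Do I have this right?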
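【Solution 2】:

To be honest, I don't really understand the expected result, since you say you want SP2 | 1 | 0 but SP2 doesn't appear in row 1. The following may do what you want... or not.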

library(data.table)

sales <- data.table(sale = c("SP1", "SP2", "SP3", "SP4", "SP5"))

sales_group <- 
  data.table(
    sales1 = c("SP5", "SP5", "SP4", "SP3", "SP2", "SP1", "SP3"),
    sales2 = c("", "SP4", "SP1", "SP1", "SP5", "SP3", ""),
    sales3 = c("", "SP9", "", "SP6", "", "", "")
  )

all <- sort(sales_group[, unique(c(sales1, sales2, sales3))])
all <- all[all != ""]
trainees <- all[!all %in% c(sales$sale, "")]

sales_group[, pos := seq(.N)]

sales1 <- merge(sales, sales_group, by.x = "sale", by.y = "sales1")
sales2 <- merge(sales, sales_group, by.x = "sale", by.y = "sales2")
sales3 <- merge(sales, sales_group, by.x = "sale", by.y = "sales3")
setnames(sales1, c("sale", "plusone", "plustwo", "sales_pos"))
setnames(sales2, c("sale", "plusone", "plustwo", "sales_pos"))
setnames(sales3, c("sale", "plusone", "plustwo", "sales_pos"))
sales_visit_by_sale <- rbind(sales1, sales2, sales3)
sales_visit_by_sale[, with_trainee := FALSE]
sales_visit_by_sale[(plusone %in% trainees) | (plustwo %in% trainees), with_trainee := TRUE]
sales_visit_by_sale[(order(sale, sales_pos)), .(sale, sales_pos, with_trainee)]
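
The code above lists one row per appearance rather than the A/B counts asked for in the question. If counts are wanted, a grouped aggregation along these lines might be one way to get them. This is a sketch, not part of the original answer: `visits` is a small hand-typed stand-in with the same three columns as the final table (sale, sales_pos, with_trainee).

```r
library(data.table)

# Stand-in for the per-appearance table: one row per appearance
visits <- data.table(
  sale         = c("SP1", "SP1", "SP1", "SP2"),
  sales_pos    = c(3L, 4L, 6L, 5L),
  with_trainee = c(FALSE, TRUE, FALSE, FALSE)
)

# A = number of appearances, B = appearances alongside a trainee
counts <- visits[, .(A = .N, B = sum(with_trainee)), by = sale]
counts
```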

【Comments】:

  • I've added some live data, if that helps.