Generate a new variable based on the time of another variable value (answers)

【Question title】: Generate a new variable based on the time of another variable value
【Posted】: 2020-02-09 03:16:36
【Question】:

I have a dataset that looks like this:

ID. Invoice. Date of Invoice.  paid or not.  

1    1         10/31/2019       yes
1    1         10/31/2019       yes
1    2         11/30/2019       no
1    3         12/31/2019       no

2    1         09/30/2019       no
2    2         10/30/2019       no
2    3         11/30/2019       yes

3    1         7/31/2019        no
3    2         9/30/2019        yes
3    3         12/31/2019       no

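I want to know whether a customer is willing to pay. If a customer pays a newer invoice while an older invoice is still unpaid, I give them a good score. So customers 2 and 3 get "good", and customer 1 gets "bad".

The final data will therefore have one extra column, whose values are good or bad.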

ID. Invoice. Date of Invoice.  paid or not.  Bad or good

1    1         10/31/2019       yes          bad
1    1         10/31/2019       yes          bad
1    2         11/30/2019       no           bad
1    3         12/31/2019       no           bad

2    1         09/30/2019       no           good
2    2         10/30/2019       no           good
2    3         11/30/2019       yes          good

3    1         7/31/2019        no           good
3    2         9/30/2019        yes          good
3    3         12/31/2019       no           good

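【Comments】:

How is this different from the question you posted yesterday, other than the data sample being slightly smaller?
Hi @camille, I noticed that too. What's going on? Dear 玉芳, if someone answered your question in the previous post, please accept that answer. If there is a separate question here, I hope I have addressed it in some way. Having three versions of the code floating around and duplicating work is not good.
Hi, sorry. I posted yesterday, but the answer did not solve the problem, so I added more details in this post. Apologies if it is a duplicate. Should I delete the other post?

Tags: r dataframe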

【Solution 1】:

Your data:

df = structure(list(ID. = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Invoice. = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Date.of.Invoice. = structure(c(3L, 
3L, 4L, 5L, 1L, 2L, 4L, 6L, 7L, 5L), .Label = c("09/30/2019", 
"10/30/2019", "10/31/2019", "11/30/2019", "12/31/2019", "7/31/2019", 
"9/30/2019"), class = "factor"), paid.or.not. = structure(c(2L, 
2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

You can try something like this:

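# i holds one customer's numeric factor codes in row order: 1 = "no", 2 = "yes"
label_func = function(i){
  if (all(i==2)) {
    # every invoice was paid
    "good"
  } else if (any(diff(i)>0)) {
    # somewhere a "no" is followed by a "yes"
    "good"
  } else {
    "bad"
  }
}

library(dplyr)
# fix the level order so that "no" -> 1 and "yes" -> 2
df$paid.or.not. = factor(df$paid.or.not.,levels=c("no","yes"))
df %>% group_by(ID.) %>% 
  mutate(score=label_func(as.numeric(paid.or.not.)))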

# A tibble: 10 x 5
# Groups:   ID. [3]
     ID. Invoice. Date.of.Invoice. paid.or.not. score
   <int>    <int> <fct>            <fct>        <chr>
 1     1        1 10/31/2019       yes          bad  
 2     1        1 10/31/2019       yes          bad  
 3     1        2 11/30/2019       no           bad  
 4     1        3 12/31/2019       no           bad  
 5     2        1 09/30/2019       no           good 
 6     2        2 10/30/2019       no           good 
 7     2        3 11/30/2019       yes          good 
 8     3        1 7/31/2019        no           good 
 9     3        2 9/30/2019        yes          good 
10     3        3 12/31/2019       no           good
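
One thing the code above relies on is that the rows of each customer are already in invoice-date order. If that is not guaranteed, a rough sketch of sorting first could look like the following. The toy data and the helper column inv_date are made up for illustration, and the dates are assumed to be in month/day/year format as in the question. Sorting the raw strings would not work, since "9/30/2019" sorts after "10/31/2019" as text.

```r
library(dplyr)

# Hypothetical toy data, deliberately out of date order
toy <- data.frame(
  ID.              = c(1, 1, 1),
  Date.of.Invoice. = c("12/31/2019", "9/30/2019", "10/31/2019"),
  paid.or.not.     = c("yes", "no", "no")
)

# Parse the month/day/year strings into real dates, then sort within ID
sorted <- toy %>%
  mutate(inv_date = as.Date(as.character(Date.of.Invoice.),
                            format = "%m/%d/%Y")) %>%
  arrange(ID., inv_date)
```

After this step the grouped mutate() with label_func can be applied to sorted in the same way as to df.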

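To explain how it works: in your data frame, the column paid.or.not. is coded as a factor (as is usual). In the code above I enforce this explicitly, setting "no" as the first level and "yes" as the second. If we apply as.numeric() to this column: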

df %>% mutate(score=as.numeric(paid.or.not.))
   ID. Invoice. Date.of.Invoice. paid.or.not. score
1    1        1       10/31/2019          yes     2
2    1        1       10/31/2019          yes     2
3    1        2       11/30/2019           no     1
4    1        3       12/31/2019           no     1
5    2        1       09/30/2019           no     1
6    2        2       10/30/2019           no     1
7    2        3       11/30/2019          yes     2
8    3        1        7/31/2019           no     1
9    3        2        9/30/2019          yes     2
10   3        3       12/31/2019           no     1

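We can see that it takes the value 1 or 2. You want to label a customer good when a "no" is followed by a "yes", which means the difference between consecutive values is +1.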
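
As a small illustration of that idea on a single customer, here is a minimal sketch using a made-up vector (the names paid, codes, steps and has_late_payment are only for this example):

```r
# Numeric codes of a "no"/"yes" factor: "no" -> 1, "yes" -> 2
paid  <- factor(c("no", "no", "yes"), levels = c("no", "yes"))
codes <- as.numeric(paid)

# diff() gives the change between consecutive invoices;
# a positive step means a "no" was followed by a "yes"
steps <- diff(codes)
has_late_payment <- any(steps > 0)
```

Here codes is 1, 1, 2, steps is 0, 1, and has_late_payment is TRUE.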

We can look at it this way:

df %>% mutate(score=as.numeric(paid.or.not.)-lag(as.numeric(paid.or.not.)))

   ID. Invoice. Date.of.Invoice. paid.or.not. score
1    1        1       10/31/2019          yes    NA
2    1        1       10/31/2019          yes     0
3    1        2       11/30/2019           no    -1
4    1        3       12/31/2019           no     0
5    2        1       09/30/2019           no     0
6    2        2       10/30/2019           no     0
7    2        3       11/30/2019          yes     1
8    3        1        7/31/2019           no    -1
9    3        2        9/30/2019          yes     1
10   3        3       12/31/2019           no    -1
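
Note that lag() in the call above is not grouped, so the first row of customers 2 and 3 is compared with the last row of the previous customer. A sketch of the per-customer version, assuming the same "no"/"yes" pattern as in the question (toy and step are names invented for this example):

```r
library(dplyr)

# Toy data with the same payment pattern as the question, in invoice order
toy <- data.frame(
  ID.          = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  paid.or.not. = factor(c("yes", "yes", "no", "no",
                          "no", "no", "yes",
                          "no", "yes", "no"),
                        levels = c("no", "yes"))
)

# With group_by(), lag() restarts inside each ID,
# so the first row of every customer is NA
steps <- toy %>%
  group_by(ID.) %>%
  mutate(step = as.numeric(paid.or.not.) - lag(as.numeric(paid.or.not.))) %>%
  ungroup()
```

In this version the only +1 steps fall in rows 7 and 9, which belong to customers 2 and 3.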

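You can see that the customers you want to label "good" have at least one +1, while the "bad" ones have none. The remaining special cases are when everything is "yes" or everything is "no":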

test=data.frame(ID.=1:2,Invoice.=1,
Date.of.Invoice.="12/31/2019",paid.or.not.=c("yes","no"))
test$paid.or.not. = factor(test$paid.or.not.,levels=c("no","yes"))
test %>% group_by(ID.) %>% 
mutate(score=label_func(as.numeric(paid.or.not.)))

# A tibble: 2 x 5
# Groups:   ID. [2]
    ID. Invoice. Date.of.Invoice. paid.or.not. score
  <int>    <dbl> <fct>            <fct>        <chr>
1     1        1 12/31/2019       yes          good 
2     2        1 12/31/2019       no           bad
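
The test above also covers customers with a single invoice. A short sketch of why no NA appears in that case, repeating the function so the snippet runs on its own (single_yes and single_no are names made up for this example):

```r
label_func <- function(i) {
  if (all(i == 2)) {
    "good"
  } else if (any(diff(i) > 0)) {
    "good"
  } else {
    "bad"
  }
}

# A single "yes" (code 2) is caught by the all(i == 2) branch
single_yes <- label_func(2)

# A single "no" (code 1): diff() of a length-one vector is empty,
# and any() of an empty logical vector is FALSE, so we fall through to "bad"
single_no <- label_func(1)
```

So a lone "yes" gives "good" and a lone "no" gives "bad".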

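【Comments】:

Yes, if everything is "yes", I would label the customer "good". I am wondering: if some customers have only one invoice, whether paid or not, will they be labeled NA?
With a single "yes" it should be "good". With a single "no" it should be "bad". I ask because my dataset contains too many IDs and I could not find an example to check whether the code works this way.
OK, the code works just the way you want. See above.
Thank you very much, the explanation is very clear. One more question: how do I remove the customers who simply paid their invoices from the start?
You have to find the IDs first: rmv_IDs = df %>% group_by(ID.) %>% summarise(yes=mean(paid.or.not.=="yes")) %>% filter(yes==1) %>% pull(ID.). Then subset(df, !ID. %in% rmv_IDs).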
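
For anyone who wants to try that last comment, here is the same idea as a small self-contained sketch. The data frame toy and the result name kept are made up for illustration, and "only paid invoices" is read as "every invoice of that customer is paid":

```r
library(dplyr)

# Hypothetical toy data: customer 1 paid everything, 2 and 3 did not
toy <- data.frame(
  ID.          = c(1, 1, 2, 2, 3),
  paid.or.not. = c("yes", "yes", "no", "yes", "no")
)

# IDs whose share of paid invoices is exactly 1, i.e. every invoice paid
rmv_IDs <- toy %>%
  group_by(ID.) %>%
  summarise(yes = mean(paid.or.not. == "yes")) %>%
  filter(yes == 1) %>%
  pull(ID.)

# Keep only the rows of the remaining customers
kept <- subset(toy, !ID. %in% rmv_IDs)
```

Here rmv_IDs contains only customer 1, and kept holds the rows of customers 2 and 3.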