R data cleaning in sequence: answers

【Question title】: R data cleaning in sequence
【Posted】: 2021-02-04 09:42:36
【Question】:

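I have data like this, but sometimes the mileage is wrong. Mileage should only increase, but occasionally a wrong number appears - too low or too high. Is it possible to clean this data in R? Do you have any ideas? To fix an error I could use the mean of the records below and above it, but how do I catch the errors in sequence?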

CarID   FuelTransactionDate Mileage  
AAA555  05.01.2019      5060     
AAA555  30.01.2019      7800     
AAA555  14.02.2019      9100     
AAA555  24.02.2019      9900     
AAA555  07.04.2019      101110  <- mistake
AAA555  12.04.2019      12500    
AAA555  15.05.2019      13000    
AAA555  09.06.2019      13422    
BBB788  15.05.2018      15000    
BBB788  04.06.2018      15200    
BBB788  19.06.2018      16150    
BBB788  16.07.2018      100    <- mistake
BBB788  27.08.2018      17500    
BBB788  10.09.2018      17999    
BBB788  13.10.2018      18200    
BBB788  02.11.2018      18555

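【Comments】:

Hi :) You can use descriptive statistics functions such as summary() and barplot() to check for some of the errors. That should be enough to spot mistakes that produce very high or very low numbers. Then look for the cases where Mileage n > Mileage n+1.
And since your data is tidy, I would do it with dplyr::summarise(), dplyr::group_by() and ggplot2.
Hi. Thanks. The idea of spotting them by group and by percentage change between rows, when Mileage n > Mileage n+1, sounds good. Can you tell me more about how to do that?
You can use this: df %>% group_by(CarID) %>% mutate(rate = Mileage/lag(Mileage, n = 1, default = NA)), or this: df %>% group_by(CarID) %>% mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)). df is your data as a data.frame.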
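
For readers without dplyr, the per-car difference from the last comment can be sketched in base R. This is only an illustration: the data frame below retypes the two relevant columns from the question, and using ave() with diff() is my own choice, not something the commenter proposed.

```r
# The question's data, reduced to the two columns the check needs
df <- data.frame(
  CarID   = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422,
              15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
)

# Difference to the previous reading within each car; NA for a car's first row
df$rate <- ave(df$Mileage, df$CarID, FUN = function(x) c(NA, diff(x)))

# Rows where the mileage went down compared with the previous reading
df[which(df$rate < 0), ]
```

Note that a too-high reading shows up as a drop on the row that follows it, so a negative difference alone does not say which of the two rows is the wrong one.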

Tags: r data-cleaning

【Solution 1】:

If you want to locate where the errors occur, here is a base R option using ave + cummax + cummin:

within(
  df,
  err <- ave(
    Mileage,
    CarID,
    FUN = function(x) replace(cummax(x) == rev(cummax(rev(x))), length(x), 0) + replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
  )
)

which gives

    CarID FuelTransactionDate Mileage err
1  AAA555          05.01.2019    5060   0
2  AAA555          30.01.2019    7800   0
3  AAA555          14.02.2019    9100   0
4  AAA555          24.02.2019    9900   0
5  AAA555          07.04.2019  101110   1
6  AAA555          12.04.2019   12500   0
7  AAA555          15.05.2019   13000   0
8  AAA555          09.06.2019   13422   0
9  BBB788          15.05.2018   15000   0
10 BBB788          04.06.2018   15200   0
11 BBB788          19.06.2018   16150   0
12 BBB788          16.07.2018     100   1
13 BBB788          27.08.2018   17500   0
14 BBB788          10.09.2018   17999   0
15 BBB788          13.10.2018   18200   0
16 BBB788          02.11.2018   18555   0
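
To make the one-liner easier to follow, here is the same logic unrolled for a single car. It is a sketch with variable names of my own (running_max, suffix_max and so on); the values are the AAA555 readings from the question.

```r
x <- c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422)  # AAA555

running_max <- cummax(x)            # largest value seen up to each row
suffix_max  <- rev(cummax(rev(x)))  # largest value from each row to the end
too_high <- running_max == suffix_max
too_high[length(x)] <- FALSE        # the last reading may legitimately be the maximum

running_min <- cummin(x)            # smallest value seen up to each row
suffix_min  <- rev(cummin(rev(x)))  # smallest value from each row to the end
too_low <- running_min == suffix_min
too_low[1] <- FALSE                 # the first reading may legitimately be the minimum

err <- too_high + too_low           # 1 where the car's overall max or min is out of place
err
```

As far as I can tell, this means the approach marks a car's overall maximum when it is not the last reading, and its overall minimum when it is not the first. It would therefore find at most one too-high and one too-low value per car, which fits the example data but may not be enough for longer series with several mistakes.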

【Comments】:

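【Solution 2】:

Here is an approach that shows how to identify the outliers and then fill them in using approx. I look first for decreases in mileage - you can add any other conditions you want to check inside the if_else to identify outliers: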

library(dplyr)

# dd is the question's data as a data frame
dd %>%
  group_by(CarID) %>%
  dplyr::mutate(
    # replace mistakes with NA
    MileageNA = if_else(Mileage < lag(Mileage, 1, default = 0), NA_integer_, Mileage),
    # fill in missing values with approx
    # approx is nicely robust in case you have multiple mistakes in a row
    #   See the help page and the rule argument to control behavior
    #   in case you have mistakes as the first or last observations
    MileageCorrected = approx(MileageNA, xout = 1:n())$y
  )
# # A tibble: 16 x 5
# # Groups:   CarID [2]
#    CarID  FuelTransactionDate Mileage MileageNA MileageCorrected
#    <chr>  <chr>                 <int>     <int>            <dbl>
#  1 AAA555 05.01.2019             5060      5060             5060
#  2 AAA555 30.01.2019             7800      7800             7800
#  3 AAA555 14.02.2019             9100      9100             9100
#  4 AAA555 24.02.2019             9900      9900             9900
#  5 AAA555 07.04.2019           101110    101110           101110
#  6 AAA555 12.04.2019            12500        NA            57055
#  7 AAA555 15.05.2019            13000     13000            13000
#  8 AAA555 09.06.2019            13422     13422            13422
#  9 BBB788 15.05.2018            15000     15000            15000
# 10 BBB788 04.06.2018            15200     15200            15200
# 11 BBB788 19.06.2018            16150     16150            16150
# 12 BBB788 16.07.2018              100        NA            16825
# 13 BBB788 27.08.2018            17500     17500            17500
# 14 BBB788 10.09.2018            17999     17999            17999
# 15 BBB788 13.10.2018            18200     18200            18200
# 16 BBB788 02.11.2018            18555     18555            18555

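【Comments】:

Gregor, I think it is the 101110 value (row 5) that should be replaced.
I only replaced decreases, because that is what the question states explicitly; I leave it to the OP to define any other conditions they want to use for finding errors.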
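
One possible extra condition, sketched in base R, is to flag a reading only when it breaks the order while its two neighbours are consistent with each other. That would mark the 101110 row itself rather than the row after it. The helpers flag_spikes() and fill_spikes() are hypothetical names of my own, not part of the answer above, and the data frame retypes the question's values.

```r
df <- data.frame(
  CarID   = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422,
              15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
)

# Hypothetical helper: TRUE where a value lies outside [previous, next]
# although previous <= next, i.e. the neighbours agree and the value does not
flag_spikes <- function(x) {
  n <- length(x)
  prev <- c(NA, x[-n])      # previous reading
  nxt  <- c(x[-1], NA)      # next reading
  bad <- prev <= nxt & (x < prev | x > nxt)
  bad[is.na(bad)] <- FALSE  # first and last rows have only one neighbour
  bad
}

# Hypothetical helper: replace flagged values by linear interpolation
fill_spikes <- function(x) {
  x[flag_spikes(x)] <- NA
  approx(seq_along(x), x, xout = seq_along(x))$y
}

df$spike <- as.logical(ave(df$Mileage, df$CarID, FUN = flag_spikes))
df$MileageCorrected <- ave(df$Mileage, df$CarID, FUN = fill_spikes)
df[df$spike, ]
```

This sketch assumes isolated mistakes. Two wrong readings in a row, or a mistake in a car's first or last row, would need a different rule.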

【Solution 3】:

I'm just putting my comments into an answer so the output displays better. Here is the code:

library(dplyr)
library(ggplot2)

df %>% group_by(CarID) %>% 
  summarise(min = min(Mileage),
            max = max(Mileage))

df %>% group_by(CarID) %>% mutate(rate = Mileage/lag(Mileage, n = 1, default = NA)) # if < 1 then the previous value was higher.
df %>% group_by(CarID) %>% mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)) # if < 0 then the previous value was higher.

ggplot(data = df, aes(x = CarID, y = Mileage)) +
  geom_boxplot()

Some output you can use:

Removing the cases where Mileage n < Mileage n-1 with dplyr. Note that you may need to remove outliers first!

> df %>% 
+   group_by(CarID) %>% 
+   mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)) %>% 
+   filter(rate > 0)
# A tibble: 12 x 4
# Groups:   CarID [2]
   CarID  FuelTransactionDate Mileage  rate
   <chr>  <chr>                 <int> <int>
 1 AAA555 30.01.2019             7800  2740
 2 AAA555 14.02.2019             9100  1300
 3 AAA555 24.02.2019             9900   800
 4 AAA555 07.04.2019           101110 91210
 5 AAA555 15.05.2019            13000   500
 6 AAA555 09.06.2019            13422   422
 7 BBB788 04.06.2018            15200   200
 8 BBB788 19.06.2018            16150   950
 9 BBB788 27.08.2018            17500 17400
10 BBB788 10.09.2018            17999   499
11 BBB788 13.10.2018            18200   201
12 BBB788 02.11.2018            18555   355

Data:

df <- structure(
  list(
    CarID = c("AAA555", "AAA555", "AAA555", "AAA555",
              "AAA555", "AAA555", "AAA555", "AAA555",
              "BBB788", "BBB788", "BBB788", "BBB788",
              "BBB788", "BBB788", "BBB788", "BBB788"),
    FuelTransactionDate = c("05.01.2019", "30.01.2019", "14.02.2019", "24.02.2019",
                            "07.04.2019", "12.04.2019", "15.05.2019", "09.06.2019",
                            "15.05.2018", "04.06.2018", "19.06.2018", "16.07.2018",
                            "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"),
    Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
                15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L)
  ),
  class = "data.frame", row.names = c(NA, -16L)
)
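
Since FuelTransactionDate is stored as text in day.month.year form, any lag-based check relies on the rows already being in chronological order. If that is not guaranteed, a sketch like the following could sort them first. The three shuffled rows and the extra Date column are made up for illustration.

```r
# Three AAA555 readings from the question, deliberately out of order
df <- data.frame(
  CarID = c("AAA555", "AAA555", "AAA555"),
  FuelTransactionDate = c("14.02.2019", "05.01.2019", "30.01.2019"),
  Mileage = c(9100L, 5060L, 7800L)
)

# Parse dd.mm.yyyy text into real dates, then sort by car and date
df$Date <- as.Date(df$FuelTransactionDate, format = "%d.%m.%Y")
df <- df[order(df$CarID, df$Date), ]
df
```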

【Comments】:

【Solution 4】:

Maybe you could try removing the outliers?

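# dataframe is your data; keep rows within 1.5 * IQR of the quartiles
Q <- quantile(dataframe$Mileage, probs = c(.25, .75), na.rm = FALSE)
iqr <- IQR(dataframe$Mileage)  # was used below without being defined
eliminated <- subset(dataframe, dataframe$Mileage > (Q[1] - 1.5 * iqr) & dataframe$Mileage < (Q[2] + 1.5 * iqr))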

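【Comments】:

I'm not sure this is a good idea, because the values form an increasing sequence for each car, so the first value can be 100 and the last 100 000 and both are correct. The problem is when values are "out of order".
I see, you're right, I misunderstood. It's not a good idea. Thanks for pointing it out.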