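【Question title】: R data cleaning in sequence
【Posted】: 2021-02-04 09:42:36
【Question】:

I have data like this, but sometimes the mileage is wrong. Mileage should only increase, but occasionally a wrong number appears, either too low or too high. Is it possible to clean this data in R? Do you have any ideas? For a bad value I could use the average of the records just before and after it, but how do I catch the errors in sequence?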

CarID   FuelTransactionDate Mileage  
AAA555  05.01.2019      5060     
AAA555  30.01.2019      7800     
AAA555  14.02.2019      9100     
AAA555  24.02.2019      9900     
AAA555  07.04.2019      101110  <- mistake
AAA555  12.04.2019      12500    
AAA555  15.05.2019      13000    
AAA555  09.06.2019      13422    
BBB788  15.05.2018      15000    
BBB788  04.06.2018      15200    
BBB788  19.06.2018      16150    
BBB788  16.07.2018      100    <- mistake
BBB788  27.08.2018      17500    
BBB788  10.09.2018      17999    
BBB788  13.10.2018      18200    
BBB788  02.11.2018      18555    

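【Comments】:

  • Hello :) You can use descriptive statistics functions such as summary() or barplot() to check for some of the errors. That should be enough to spot mistakes that produce very high or very low numbers. Then check where Mileage n > Mileage n+1.
  • And since your data is tidy, I would do it with dplyr::summarise(), dplyr::group_by() and ggplot2.
  • Hi, thanks. The idea of spotting the errors by group, using the percent change between rows and the Mileage n > Mileage n+1 check, sounds good. Can you tell me more about how to do that?
  • You can use this: df %>% group_by(CarID) %>% mutate(rate = Mileage/lag(Mileage, n = 1, default = NA)), or this: df %>% group_by(CarID) %>% mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)). Here df is your data as a data.frame.

Tags: r data-cleaning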


【Solution 1】:

If you want to identify where the errors occur, here is one option using base R's ave + cummax + cummin:

within(
  df,
  err <- ave(
    Mileage,
    CarID,
    FUN = function(x) replace(cummax(x) == rev(cummax(rev(x))), length(x), 0) + replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
  )
)

which gives

    CarID FuelTransactionDate Mileage err
1  AAA555          05.01.2019    5060   0
2  AAA555          30.01.2019    7800   0
3  AAA555          14.02.2019    9100   0
4  AAA555          24.02.2019    9900   0
5  AAA555          07.04.2019  101110   1
6  AAA555          12.04.2019   12500   0
7  AAA555          15.05.2019   13000   0
8  AAA555          09.06.2019   13422   0
9  BBB788          15.05.2018   15000   0
10 BBB788          04.06.2018   15200   0
11 BBB788          19.06.2018   16150   0
12 BBB788          16.07.2018     100   1
13 BBB788          27.08.2018   17500   0
14 BBB788          10.09.2018   17999   0
15 BBB788          13.10.2018   18200   0
16 BBB788          02.11.2018   18555   0
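
To see why this works, here is a minimal sketch that breaks the one-liner into its intermediate vectors for a single car (AAA555). The variable names are mine, not part of the answer. It also exposes a limitation: assuming no tied readings, the check can in effect flag only a car's overall largest or smallest reading when it sits in the wrong place, so a smaller out-of-order value would go undetected.

```r
x <- c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422)  # AAA555 readings

prefix_max <- cummax(x)            # largest value seen so far, left to right
suffix_max <- rev(cummax(rev(x)))  # largest value from each position to the end

# A reading that is both the prefix and the suffix maximum is the overall maximum.
# In an increasing series only the last reading may legitimately be that.
too_high <- prefix_max == suffix_max
too_high[length(x)] <- FALSE

prefix_min <- cummin(x)            # smallest value seen so far
suffix_min <- rev(cummin(rev(x)))  # smallest value from each position to the end

# Likewise, only the first reading may legitimately be the overall minimum.
too_low <- prefix_min == suffix_min
too_low[1] <- FALSE

which(too_high | too_low)  # 5, the position of 101110
```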

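【Comments】:

    【Solution 2】:

    Here is an approach that shows how to identify the outliers and then fill them in with approx. I start by looking for decreases in mileage; you can add any other conditions you want to check to the if_else to identify outliers: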

    library(dplyr)

    # dd is the data frame holding the question's data
    dd %>%
      group_by(CarID) %>%
      dplyr::mutate(
        # replace mistakes with NA
        MileageNA = if_else(Mileage < lag(Mileage, 1, default = 0), NA_integer_, Mileage),
        # fill in missing values with approx
        # approx is nicely robust in case you have multiple mistakes in a row
        #   See the help page and the rule argument to control behavior
        #   in case you have mistakes as the first or last observations
        MileageCorrected = approx(MileageNA, xout = 1:n())$y
      )
    # # A tibble: 16 x 5
    # # Groups:   CarID [2]
    #    CarID  FuelTransactionDate Mileage MileageNA MileageCorrected
    #    <chr>  <chr>                 <int>     <int>            <dbl>
    #  1 AAA555 05.01.2019             5060      5060             5060
    #  2 AAA555 30.01.2019             7800      7800             7800
    #  3 AAA555 14.02.2019             9100      9100             9100
    #  4 AAA555 24.02.2019             9900      9900             9900
    #  5 AAA555 07.04.2019           101110    101110           101110
    #  6 AAA555 12.04.2019            12500        NA            57055
    #  7 AAA555 15.05.2019            13000     13000            13000
    #  8 AAA555 09.06.2019            13422     13422            13422
    #  9 BBB788 15.05.2018            15000     15000            15000
    # 10 BBB788 04.06.2018            15200     15200            15200
    # 11 BBB788 19.06.2018            16150     16150            16150
    # 12 BBB788 16.07.2018              100        NA            16825
    # 13 BBB788 27.08.2018            17500     17500            17500
    # 14 BBB788 10.09.2018            17999     17999            17999
    # 15 BBB788 13.10.2018            18200     18200            18200
    # 16 BBB788 02.11.2018            18555     18555            18555
    

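    【Comments】:

    • Gregor, I think the 101110 value (row 5) is the one that should be replaced.
    • I only replaced decreases because that is what the question states explicitly; I'll leave it to the OP to define any other conditions they want to use to find errors.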
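
If readings that are too high should be caught as well, one possible condition, sketched below in base R, is to compare each reading with both of its neighbours: a reading is treated as suspect when the previous and next readings are consistent with each other but the reading itself falls outside them. The helper name fix_mileage is hypothetical, and like the answer above this interpolates by row position rather than by date. It assumes isolated mistakes; two bad readings in a row would not be caught by this rule.

```r
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422,
              15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
)

# Hypothetical helper: blank out readings that fall outside their neighbours,
# then fill the gaps by linear interpolation over the row index.
fix_mileage <- function(x) {
  prev <- c(NA, head(x, -1))  # previous reading (NA for the first row)
  nxt  <- c(tail(x, -1), NA)  # next reading (NA for the last row)
  bad <- prev <= nxt & (x < prev | x > nxt)
  bad[is.na(bad)] <- FALSE    # first and last rows cannot be judged this way
  x[bad] <- NA
  approx(seq_along(x), x, xout = seq_along(x))$y
}

# Apply the helper separately to each car
df$MileageCorrected <- ave(df$Mileage, df$CarID, FUN = fix_mileage)
df
```

With the question's data this replaces 101110 with 11200 and 100 with 16825, and leaves every other reading unchanged.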
    【Solution 3】:

    I'm just putting my comments into an answer so the output displays better. Here is the code:

    library(dplyr)
    library(ggplot2)
    
    df %>% group_by(CarID) %>% 
      summarise(min = min(Mileage),
                max = max(Mileage))
    
    df %>% group_by(CarID) %>% mutate(rate = Mileage/lag(Mileage, n = 1, default = NA)) # if < 1 then the previous value was higher.
    df %>% group_by(CarID) %>% mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)) # if < 0 then the previous value was higher.
    
    ggplot(data = df, aes(x = CarID, y = Mileage)) +
      geom_boxplot()
    

    Some output you can use:

    Removing, with dplyr, the cases where the mileage did not increase. Note that you may need to remove outliers first!

    > df %>% 
    +   group_by(CarID) %>% 
    +   mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)) %>% 
    +   filter(rate > 0)
    # A tibble: 12 x 4
    # Groups:   CarID [2]
       CarID  FuelTransactionDate Mileage  rate
       <chr>  <chr>                 <int> <int>
     1 AAA555 30.01.2019             7800  2740
     2 AAA555 14.02.2019             9100  1300
     3 AAA555 24.02.2019             9900   800
     4 AAA555 07.04.2019           101110 91210
     5 AAA555 15.05.2019            13000   500
     6 AAA555 09.06.2019            13422   422
     7 BBB788 04.06.2018            15200   200
     8 BBB788 19.06.2018            16150   950
     9 BBB788 27.08.2018            17500 17400
    10 BBB788 10.09.2018            17999   499
    11 BBB788 13.10.2018            18200   201
    12 BBB788 02.11.2018            18555   355
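
A small base R sketch of the same difference check, assuming the question's data, that flags decreases instead of filtering on them. It is only meant to illustrate two side effects of filter(rate > 0): the first reading of each car has no previous value (rate is NA) and is dropped, and the row flagged after a spike is the correct reading that follows it, not the spike itself. The column name decreased is illustrative.

```r
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422,
              15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
)

# Difference to the previous reading within each car; NA for a car's first row
df$rate <- ave(df$Mileage, df$CarID, FUN = function(x) c(NA, diff(x)))

# Flag decreases rather than dropping rows, so first readings are kept
df$decreased <- !is.na(df$rate) & df$rate < 0

which(df$decreased)  # rows 6 and 12: 12500 (after the spike) and 100
```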
    

    Data:

    df <- structure(list(
      CarID = c("AAA555", "AAA555", "AAA555", "AAA555", "AAA555", "AAA555",
                "AAA555", "AAA555", "BBB788", "BBB788", "BBB788", "BBB788",
                "BBB788", "BBB788", "BBB788", "BBB788"),
      FuelTransactionDate = c("05.01.2019", "30.01.2019", "14.02.2019", "24.02.2019",
                              "07.04.2019", "12.04.2019", "15.05.2019", "09.06.2019",
                              "15.05.2018", "04.06.2018", "19.06.2018", "16.07.2018",
                              "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"),
      Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
                  15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L)
    ), class = "data.frame", row.names = c(NA, -16L))
    

    【Comments】:

      【Solution 4】:

      Maybe you could try removing outliers?

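      Q <- quantile(dataframe$Mileage, probs = c(.25, .75), na.rm = FALSE)
      # Interquartile range of the mileage, used for the 1.5 * IQR fences
      iqr <- IQR(dataframe$Mileage)
      eliminated <- subset(dataframe, dataframe$Mileage > (Q[1] - 1.5 * iqr) & dataframe$Mileage < (Q[2] + 1.5 * iqr))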
      

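      【Comments】:

      • I'm not sure this is a good idea, because each car's readings form an increasing sequence, so the first value can be 100 and the last 100 000 and both are correct. The problem is when values are "out of order".
      • I see, you're right, I misunderstood. This isn't a good idea; thanks for pointing it out.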