【Title】: Finding repetitions in big data sets
【Posted】: 2017-05-24 16:55:34
【Question】:

I have a data set containing data about control system failures. The data have the following structure:

TYPE OF FAILURE (string), START DATE (dd/mm/yyyy), START TIME (hh/mm/ss), DURATION (ss), LOCALIZATION (string), WORKING TEAM (A,B,C), SHIFT (morning, afternoon, night)

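The table containing the data has 555,000 rows. First, I would like to analyse whether there are repeating failure sequences with respect to the START DATE parameter. Basically, I want to find something like this: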

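Failure 1 appeared on 10 March. Failure 2 appeared on 15 March, 5 days later. Then Failure 1 and Failure 2 appeared on 10 April and 15 April, again 5 days apart, and once more on 10 May and 15 May, again 5 days apart. Failure 1 may also appear on other dates, but what interests me is that there is a higher probability that Failure 2 will appear 5 days after Failure 1, and that these F1->F2 pairs are one month apart.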

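I don't know whether my explanation is clear enough. I am looking for suitable methods or algorithms that would let me extract such sequences from the data described above. Can you point me to some methods? Or let's simply brainstorm together :). Any help is appreciated.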

PS: I plan to implement this in C# or MATLAB (depending on which method is suitable). Thanks.

【Comments】:

    Tags: c# algorithm matlab sequence data-mining


    【Solution 1】:

    Your file looks like a large CSV, and MATLAB has a good implementation for that case, the datastore:

    https://es.mathworks.com/help/matlab/import_export/what-is-a-datastore.html

    It also has tools for working with large files:

    https://es.mathworks.com/help/matlab/large-files-and-big-data.html

    Also take a look at working with tables in MATLAB.

    In your case, you could work like this:

    Example file airlinesmall.csv (123524 rows):

    Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
    1987,10,21,3,642,630,735,727,PS,1503,NA,53,57,NA,8,12,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,26,1,1021,1020,1124,1116,PS,1550,NA,63,56,NA,8,1,SJC,BUR,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,23,5,2055,2035,2218,2157,PS,1589,NA,83,82,NA,21,20,SAN,SMF,480,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,23,5,1332,1320,1431,1418,PS,1655,NA,59,58,NA,13,12,BUR,SJC,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,22,4,629,630,746,742,PS,1702,NA,77,72,NA,4,-1,SMF,LAX,373,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,28,3,1446,1343,1547,1448,PS,1729,NA,61,65,NA,59,63,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,8,4,928,930,1052,1049,PS,1763,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,10,6,859,900,1134,1123,PS,1800,NA,155,143,NA,11,-1,SEA,LAX,954,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    
    ...
    

    With a datastore you can handle the data as a table and select only the variables you need, for example to compute the mean arrival delay:

    >> ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
    >> ds.MissingValue = 0;
    >> ds.SelectedVariableNames = 'ArrDelay';
    >> data = preview(ds)
    
    data = 
    
        ArrDelay
        ________
    
         8      
         8      
        21      
        13      
         4      
        59      
         3      
        11      
    
    >> data % this is a table
    
    data = 
    
        ArrDelay
        ________
    
         8      
         8      
        21      
        13      
         4      
        59      
         3      
        11      
    
    >> sums = [];
    counts = [];
    while hasdata(ds)
        T = read(ds); % T is a table holding one chunk; the file is never loaded into memory all at once
    
        sums(end+1) = sum(T.ArrDelay);
        counts(end+1) = length(T.ArrDelay);
    end
    
    >> avgArrivalDelay = sum(sums)/sum(counts)
    
    avgArrivalDelay =
    
        6.9670
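
    The chunked loop above can also be written with a tall array, one of the tools covered on the large-files page linked earlier. This is only a sketch, assuming MATLAB R2016b or later (where `tall` is available) and the same missing-value settings as above:

```matlab
% Sketch: mean arrival delay via a tall array instead of an explicit read loop.
% Assumes MATLAB R2016b or later; airlinesmall.csv is the example file used above.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.MissingValue = 0;                     % same choice as above: missing delays count as 0
ds.SelectedVariableNames = 'ArrDelay';

tt = tall(ds);                           % tall table backed by the datastore, evaluated lazily
avgArrivalDelay = gather(mean(tt.ArrDelay));   % gather triggers the chunk-by-chunk computation
```

    The design trade-off is that the explicit loop gives full control over what happens per chunk, while the tall array hides the chunking behind ordinary table syntax.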
    

    Let's work with your example. Consider this file:

    sample.csv

    TYPE OF FAILURE, START DATE, START TIME, DURATION, LOCALIZATION, WORKING TEAM, SHIFT
    failure 1, 06/01/2017, 12/13/20, 300,  Area 1, A, morning
    failure 2, 06/01/2017, 12/13/20, 300,  Area 1, A, night
    failure 3, 06/01/2017, 12/13/20, 400,  Area 1, A, afternoon
    failure 1, 08/01/2017, 12/13/20, 300,  Area 1, A, morning
    failure 2, 09/01/2017, 12/13/20, 300,  Area 1, A, morning
    failure 3, 09/01/2017, 12/13/20, 300,  Area 1, A, night
    failure 3, 09/01/2017, 14/13/20, 200,  Area 1, A, morning
    failure 1, 10/01/2017, 12/13/20, 300,  Area 1, A, morning
    failure 1, 12/01/2017, 12/13/20, 300,  Area 1, A, afternoon
    failure 2, 12/01/2017, 12/13/20, 500,  Area 1, A, morning
    failure 1, 14/01/2017, 12/13/20, 300,  Area 1, A, night
    

    You can see that failure 1 occurs every two days. Let's check this:

    >> ds = tabularTextDatastore('sample.csv')
    Warning: Variable names were modified to make them valid MATLAB identifiers. 
    
    ds = 
    
      TabularTextDatastore with properties:
    
                          Files: {
                                 '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                                 }
                   FileEncoding: 'UTF-8'
              ReadVariableNames: true
                  VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
    
      Text Format Properties:
                 NumHeaderLines: 0
                      Delimiter: ','
                   RowDelimiter: '\r\n'
                 TreatAsMissing: ''
                   MissingValue: NaN
    
      Advanced Text Format Properties:
                TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
             ExponentCharacters: 'eEdD'
                   CommentStyle: ''
                     Whitespace: ' \b\t'
        MultipleDelimitersAsOne: false
    
      Properties that control the table returned by preview, read, readall:
          SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
                SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                       ReadSize: 20000 rows
    
    >> ds.SelectedVariableNames = {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME', 'DURATION', 'LOCALIZATION', 'WORKINGTEAM', 'SHIFT'}
    
    ds = 
    
      TabularTextDatastore with properties:
    
                          Files: {
                                 '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                                 }
                   FileEncoding: 'UTF-8'
              ReadVariableNames: true
                  VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
    
      Text Format Properties:
                 NumHeaderLines: 0
                      Delimiter: ','
                   RowDelimiter: '\r\n'
                 TreatAsMissing: ''
                   MissingValue: NaN
    
      Advanced Text Format Properties:
                TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
             ExponentCharacters: 'eEdD'
                   CommentStyle: ''
                     Whitespace: ' \b\t'
        MultipleDelimitersAsOne: false
    
      Properties that control the table returned by preview, read, readall:
          SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
                SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                       ReadSize: 20000 rows
    
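    >> reset(ds)
    dates = NaT(0,1); % start dates of 'failure 1', collected across all chunks
    while hasdata(ds)
        T = read(ds); % one chunk of at most ReadSize rows
        isF1 = strcmp(T.TYPEOFFAILURE,'failure 1');
        dates = [dates; datetime(T.STARTDATE(isF1), 'InputFormat','dd/MM/yyyy')];
    end
    mean(diff(dates)) % computed once, so the result also holds when the file spans several chunks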
    
    ans = 
    
       48:00:00
    

    % Exactly every 48 hours; from here you can try whatever analysis you like
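
    For the pattern described in the question (failure 2 following failure 1 after a fixed number of days), one possible next step is to measure, for each failure 1, the delay until the next failure 2 and count how often each delay occurs. The following is a minimal sketch, not a complete sequence-mining method. It uses the rows of sample.csv reduced to two columns and typed inline so that it runs on its own; the names `f1`, `f2`, `lagDays` and `lagTable` are illustrative:

```matlab
% Sketch: delay in days from each 'failure 1' to the next 'failure 2'.
% The data below are the rows of sample.csv, reduced to type and start date.
types = {'failure 1';'failure 2';'failure 3';'failure 1';'failure 2';'failure 3'; ...
         'failure 3';'failure 1';'failure 1';'failure 2';'failure 1'};
dates = {'06/01/2017';'06/01/2017';'06/01/2017';'08/01/2017';'09/01/2017';'09/01/2017'; ...
         '09/01/2017';'10/01/2017';'12/01/2017';'12/01/2017';'14/01/2017'};

d  = datetime(dates, 'InputFormat', 'dd/MM/yyyy');
f1 = sort(d(strcmp(types, 'failure 1')));
f2 = sort(d(strcmp(types, 'failure 2')));

lagDays = NaN(size(f1));
for k = 1:numel(f1)
    nxt = f2(find(f2 >= f1(k), 1));      % first failure 2 on or after this failure 1
    if ~isempty(nxt)
        lagDays(k) = days(nxt - f1(k));  % a same-day pair counts as a lag of 0
    end
end

lagDays = lagDays(~isnan(lagDays));      % drop failure 1 events with no later failure 2
[lags, ~, idx] = unique(lagDays);
counts = accumarray(idx, 1);             % how many times each lag was observed
lagTable = table(lags, counts)
```

    On the real data set, a lag that clearly dominates `lagTable` (5 days in the question's example) would suggest a repeating F1->F2 sequence. Applying `diff` to the failure 1 dates of those matching pairs, as done above for failure 1 alone, would then show whether the pairs themselves recur about a month apart.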

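    【Comments】:

    • Thanks for the good tip :). I will definitely look at MATLAB's datastore. But I am also looking for the algorithmic part of solving my problem. Could you point me to something related? :)
    • Sure, I'll give it a try. Could you paste a sample file of 5-6 lines, together with an example of what should be applied to it?
    • Edited to use your example data. If it helps, don't forget to upvote or accept the answer.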