【Question title】: Regular Expression to capture text
【Posted】: 2012-07-17 10:17:01
【Question description】:

I have a log file with the following content:

2012-07-16 03:20:41,23796160897,Text,id:SAR-23796160897-c0-2-1 sub:000 dlvrd:001 submit date:120715220216 done date:120716032038 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-1
2012-07-16 03:20:48,23796160897,Text,id:SAR-23796160897-c0-2-2 sub:000 dlvrd:001 submit date:120715220216 done date:120716032045 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-2
2012-05-04 00:07:46,23777603300,Text,id:4FA23EB0 sub:000 dlvrd:001 submit date:120503225018 done date:120504000744 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23EB0
2012-05-04 01:50:18,23796726987,Text,id:4FA23E95 sub:000 dlvrd:001 submit date:120503225014 done date:120504015016 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23E95
2012-05-04 01:50:22,23799757015,Text,id:4FA23EB2 sub:000 dlvrd:001 submit date:120503225018 done date:120504015021 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23EB2
2012-05-04 01:50:48,23799907239,Text,id:4FA23F38 sub:000 dlvrd:001 submit date:120503225042 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F38
2012-05-04 01:50:48,23799896455,Text,id:4FA23D1C sub:000 dlvrd:001 submit date:120503175232 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23D1C
2012-05-04 01:50:48,23799896455,Text,id:4FA23F04 sub:000 dlvrd:001 submit date:120503225031 done date:120504015046 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F04
2012-05-04 01:50:50,23794105044,Text,id:4FA23F55 sub:000 dlvrd:001 submit date:120503225046 done date:120504015048 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F55
2012-05-04 01:51:19,23796029764,Text,id:4FA23FEE sub:000 dlvrd:001 submit date:120503225114 done date:120504015117 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23FEE
2012-05-04 02:17:51,23775461594,Text,id:4FA24025 sub:000 dlvrd:001 submit date:120503225125 done date:120504021749 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA24025
2012-05-04 04:08:02,23777437781,Text,id:4FA23F23 sub:000 dlvrd:001 submit date:120503225037 done date:120504040800 stat:DELIVRD err:000 text:,FLP,SMSCReceiptMsgId=4FA23F23
2012-05-04 04:50:12,23777970013,Text,id:4FA23E70 sub:000 dlvrd:000 submit date:120503225005 done date:120504045011 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E70
2012-05-04 04:50:15,23775182832,Text,id:4FA23E7E sub:000 dlvrd:000 submit date:120503225008 done date:120504045014 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E7E
2012-05-04 04:50:17,23777789644,Text,id:4FA23E80 sub:000 dlvrd:000 submit date:120503225010 done date:120504045016 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E80
2012-05-04 04:50:21,23777529371,Text,id:4FA23E8F sub:000 dlvrd:000 submit date:120503225013 done date:120504045019 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E8F
2012-05-04 04:50:21,23777613852,Text,id:4FA23E97 sub:000 dlvrd:000 submit date:120503225014 done date:120504045020 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23E97
2012-05-04 04:50:24,23777407598,Text,id:4FA23EAE sub:000 dlvrd:000 submit date:120503225017 done date:120504045023 stat:EXPIRED err:032 text:,FLP,SMSCReceiptMsgId=4FA23EAAE
2012-05-04 04:50:26,23777736950,Text,id:4FA23EAF sub:000 dlvrd:000 submit date:120503225018 done date:120504045024 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23EAF
2012-05-04 04:50:31,23775834128,Text,id:4FA23ED6 sub:000 dlvrd:000 submit date:120503225024 done date:120504045030 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23ED6
2012-05-04 04:50:36,23777486441,Text,id:4FA23EF3 sub:000 dlvrd:000 submit date:120503225029 done date:120504045035 stat:EXPIRED err:027 text:,FLP,SMSCReceiptMsgId=4FA23EF3

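Now I want to capture the values of a few specific fields, such as "id", "done date" and "stat", from this content using a regular expression with C#.NET and LINQ.

If anyone has any ideas, please help.

【Question discussion】:

  • Do you want to use any particular language?
  • Which regex engine do you plan to use?
  • Yes Keppil, using C#.NET and LINQ.

Tags: c# .net regex linq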


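【Solution 1】:

I don't think a regular expression will help you much here. Instead, split the data into lines and then each line into columns. As far as I can see, the data can be broken into a matrix from which you can easily extract the information you are looking for, and you can do that in JavaScript, C#, Java or any other language.

In practice I do it like this:

  • Split the data into lines
  • Split each line into columns
  • Then iterate over the lines and pick out the columns you are looking for.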

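    // Split the data into lines, then each line into comma-separated columns.
    var content = data.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var line in content)
    {
        var cols = line.Split(',');
        var c1 = cols[0];   // timestamp
        var c2 = cols[1];   // phone number
        var c3 = cols[2];   // message type
        var c4 = cols[3];   // "id:... sub:... stat:... err:... text:" key/value part
    }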
    

You can adapt the excerpt above to your needs. In my view this is the simplest approach.
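
The excerpt above stops at the raw columns, while the fields the question asks about ("id", "done date", "stat") all sit inside the fourth column as space-separated `key:value` pairs. One possible way to finish the job without a regex is sketched below. The names `SplitParser`, `ValueAfter` and `Parse` are hypothetical helpers made up for this illustration, and the sketch assumes that values never contain a space, which holds for the sample data but may not hold for every log.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitParser
{
    // Returns the text that follows `key` up to the next space (or the end of
    // the string), or an empty string when the key is absent.
    public static string ValueAfter(string field, string key)
    {
        int start = field.IndexOf(key, StringComparison.Ordinal);
        if (start < 0) return string.Empty;
        start += key.Length;
        int end = field.IndexOf(' ', start);
        return end < 0 ? field.Substring(start) : field.Substring(start, end - start);
    }

    // Splits the log into lines and columns, then reads three values
    // from the fourth column (index 3) of every well-formed line.
    public static List<(string Id, string DoneDate, string Stat)> Parse(string data)
    {
        return data
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Split(','))
            .Where(cols => cols.Length > 3)
            .Select(cols => (
                Id: ValueAfter(cols[3], "id:"),
                DoneDate: ValueAfter(cols[3], "done date:"),
                Stat: ValueAfter(cols[3], "stat:")))
            .ToList();
    }
}
```

Calling `SplitParser.Parse(data)` on the log text yields one tuple per line, which can then be filtered with ordinary LINQ operators, for example `.Where(r => r.Stat == "EXPIRED")`.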

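【Discussion】:

    【Solution 2】:

    It is not clear what all the fields mean, or whether the delimiters are constant. With the test data you provided, this puts most of the information into named groups.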

    /// <summary>
    ///  Regular expression built for C# on: Tue, Jul 17, 2012, 12:08:12 PM
    ///  Using Expresso Version: 3.0.4334, http://www.ultrapico.com
    ///  
    ///  A description of the regular expression:
    ///  
    ///  Beginning of line or string
    ///  [Date]: A named capture group. [[^,]+]
    ///      Any character that is NOT in this class: [,], one or more repetitions
    ///  ,
    ///  [Number]: A named capture group. [[^,]+]
    ///      Any character that is NOT in this class: [,], one or more repetitions
    ///  ,
    ///  [Text1]: A named capture group. [[^,]+]
    ///      Any character that is NOT in this class: [,], one or more repetitions
    ///  ,
    ///  id:
    ///      id:
    ///  [ID]: A named capture group. [[^\s]+]
    ///      Any character that is NOT in this class: [\s], one or more repetitions
    ///  Whitespace
    ///  sub:
    ///      sub:
    ///  [Sub]: A named capture group. [\w+]
    ///      Alphanumeric, one or more repetitions
    ///  Whitespace
    ///  dlvrd:
    ///      dlvrd:
    ///  [Dlvrd]: A named capture group. [\w+]
    ///      Alphanumeric, one or more repetitions
    ///  Whitespace
    ///  submit\sdate:
    ///      submit
    ///      Whitespace
    ///      date:
    ///  [SubmitDate]: A named capture group. [\w+]
    ///      Alphanumeric, one or more repetitions
    ///  Whitespace
    ///  done\sdate:
    ///      done
    ///      Whitespace
    ///      date:
    ///  [DoneDate]: A named capture group. [\w+]
    ///      Alphanumeric, one or more repetitions
    ///  Whitespace
    ///  stat:
    ///      stat:
    ///  [Status]: A named capture group. [\w+]
    ///      Alphanumeric, one or more repetitions
    ///  Whitespace
    ///  err:
    ///      err:
    ///  [Error]: A named capture group. [\d+]
    ///      Any digit, one or more repetitions
    ///  Whitespace
    ///  
    ///
    /// </summary>
    public static Regex regex = new Regex(
          "^(?<Date>[^,]+),\r\n(?<Number>[^,]+),\r\n(?<Text1>[^,]+),\r\nid:(?"+
          "<ID>[^\\s]+)\\s\r\nsub:(?<Sub>\\w+)\\s\r\ndlvrd:(?<Dlvrd>\\w+)\\s"+
          "\r\nsubmit\\sdate:(?<SubmitDate>\\w+)\\s\r\ndone\\sdate:(?<DoneD"+
          "ate>\\w+)\\s\r\nstat:(?<Status>\\w+)\\s\r\nerr:(?<Error>\\d+)\\s",
        RegexOptions.Multiline
        | RegexOptions.ExplicitCapture
        | RegexOptions.CultureInvariant
        | RegexOptions.IgnorePatternWhitespace
        | RegexOptions.Compiled
        );
    

    So you can call:

    var matches = regex.Matches(inputData);
    

    Personally, I would suggest limiting the match to a single line of data and calling this instead:

    var match = regex.Match(inputLineOfData);
    

    That means you can:

    if ( match.Success )
    {
       var id = match.Groups["ID"].Value;
       var submitDate = match.Groups["SubmitDate"].Value;  // Parse to DateTime
       var doneDate = match.Groups["DoneDate"].Value;  // Parse to DateTime
    
       // etc for 'sub', 'dlvrd', 'Status', 'Error'..
    }
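
The "Parse to DateTime" comments above can be filled in roughly as follows. This is a sketch that assumes the `submit date` and `done date` values use the layout `yyMMddHHmmss`. That layout is inferred from the sample data (for instance `120716032038` on a line logged at `2012-07-16 03:20:41`) and is not stated anywhere in the question. `ReceiptDates` is a hypothetical helper name.

```csharp
using System;
using System.Globalization;

public static class ReceiptDates
{
    // Assumed layout: two-digit year, month, day, hour, minute, second.
    private const string Format = "yyMMddHHmmss";

    // Returns true and sets `result` when `value` matches the assumed layout;
    // returns false (instead of throwing) for anything else.
    public static bool TryParse(string value, out DateTime result)
    {
        return DateTime.TryParseExact(
            value, Format, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out result);
    }
}
```

With this in place, `ReceiptDates.TryParse(match.Groups["DoneDate"].Value, out var done)` gives a `DateTime` only when the captured text really has that shape, so a malformed line does not throw.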
    

    【Discussion】:

      【Solution 3】:

      A CSV parser would probably be better, but you can use this regex and replace id: with whichever other field you want, e.g. done date:(?<donedate>.*?)\s

      string strRegex = @"id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s";
      RegexOptions myRegexOptions = RegexOptions.IgnoreCase | RegexOptions.Multiline;
      Regex myRegex = new Regex(strRegex, myRegexOptions);
      string strTargetString = @"2012-07-16 03:20:41,23796160897,Text,id:SAR-23796160897-c0-2-1 sub:000 dlvrd:001 submit date:120715220216 done date:120716032038 stat:DELIVRD err:000 text:,FOTSO TOKAM,SMSCReceiptMsgId=SAR-23796160897-c0-2-1";
      foreach (Match myMatch in myRegex.Matches(strTargetString))
      {
        if (myMatch.Success)
        {
          // Add your code here 
          //myMatch.Groups["id"].Value;
          //myMatch.Groups["donedate"].Value;
          //myMatch.Groups["stat"].Value;
        }
      }
      

      You can use a single regex, id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s, and then read the values through groups such as myMatch.Groups["id"].Value.
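
Since the question asks for LINQ, the single-regex idea can be combined with a query so that all three fields come back together for every line. The sketch below reuses the pattern from this answer unchanged. `ReceiptQuery` and `Extract` are hypothetical names chosen for the illustration, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ReceiptQuery
{
    // Same pattern as in the answer: three lazy named groups, each ended by whitespace.
    private static readonly Regex FieldRegex = new Regex(
        @"id:(?<id>.*?)\s.*?done date:(?<donedate>.*?)\s.*?stat:(?<stat>.*?)\s",
        RegexOptions.IgnoreCase);

    // Matches each line once and projects the three groups into a tuple;
    // lines that do not match are skipped.
    public static List<(string Id, string DoneDate, string Stat)> Extract(IEnumerable<string> lines)
    {
        return lines
            .Select(line => FieldRegex.Match(line))
            .Where(m => m.Success)
            .Select(m => (
                Id: m.Groups["id"].Value,
                DoneDate: m.Groups["donedate"].Value,
                Stat: m.Groups["stat"].Value))
            .ToList();
    }
}
```

A typical call might be `ReceiptQuery.Extract(File.ReadLines(path))`, where `path` points at the log file. Matching line by line keeps the lazy `.*?` parts from running across line boundaries.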

      【Discussion】:

      • Is it possible to read all the specific fields, such as "id, done date, stat", at the same time?