Regex help: My regex pattern will match invalid Dictionary

【Question Title】: Regex help: My regex pattern will match invalid Dictionary
【Posted】: 2011-04-01 13:40:20
【Question】:

I hope you can help me. I'm using C# with .NET 4.0.

I want to validate a file structure like this:

 
const string dataFileScr = @"
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
/*  Description = simple */
}

PZ 11
{
IA_return()
}

GDC 7
{
    Message = 6
    Message = 7
        Message = 8
        Message = 8
    RepeatCount = 2
    ErrorMessage = 10
    ErrorMessage = 11
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}
";

So far I have managed to build this regex pattern:

 
const string patternFileScr = @"
^                           
((?:\[|\s)*                  

     (?<Section>[^\]\r\n]*)     
 (?:\])*                     
 (?:[\r\n]{0,}|\Z))         
(
    (?:\{)                  ### !! improve for .ini file, dont take { 
    (?:[\r\n]{0,}|\Z)           
        (                          # Begin capture groups (Key Value Pairs)
        (?!\}|\[)                    # Stop capture groups if a } is found; new section  

          (?:\s)*                     # Line with space
          (?<Key>[^=]*?)            # Any text before the =, matched few as possible
          (?:[\s]*=[\s]*)                     # Get the = now
          (?<Value>[^\r\n]*)        # Get everything that is not an Line Changes


         (?:[\r\n]{0,})
         )*                        # End Capture groups
    (?:[\r\n]{0,})
    (?:\})?
    (?:[\r\n\s]{0,}|\Z)
)*

                ";

and this C#:


  Dictionary <string, Dictionary<string, string>> DictDataFileScr
            = (from Match m in Regex.Matches(dataFileScr,
                                            patternFileScr,
                                            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline)
               select new
               {
                   Section = m.Groups["Section"].Value,

                   kvps = (from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select((a, i) => new { a.Value, i })
                           join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select((b, i) => new { b.Value, i }) on cpKey.i equals cpValue.i
                           select new KeyValuePair<string, string>(cpKey.Value, cpValue.Value)).OrderBy(_ => _.Key)
                           .ToDictionary(kvp => kvp.Key, kvp => kvp.Value)

               }).ToDictionary(itm => itm.Section, itm => itm.kvps);

It works for:

 
const string dataFileScr = @"
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
/*  Description = simple */
}

GDC 7
{
    Message = 6
    RepeatCount = 2
    ErrorMessage = 10
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}
";

In other words:

 
Section1
{
key1=value1
key2=value2
}

Section2
{
key1=value1
key2=value2
}

but:

1. It does not handle repeated key names. I want to group by key and output:


DictDataFileScr["GDC 7"]["Message"] = "6|7|8|8"
DictDataFileScr["GDC 7"]["ErrorMessage"] = "10|11"

2. It does not work for .ini files like:


....
[Section1]
key1 = value1
key2 = value2

[Section2]
key1 = value1
key2 = value2
...

3. It cannot see the next section after:


....
PZ 11
{
IA_return()
}
.....

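【Comments】:

If you can reduce your case to a few lines, people may be able to help you better.
Can you post some other examples?
Soo ah, you want to tell me why \s*(\[[^\S\n]*)?(?<Section>\w+(?:[^\S\n]+ \w+)*)(?(1)[^\S\n]*\]|)\s*(?(1)|\{)(?:\s*(?:\/\*.*?\*\/|(?<Key>\w[\w\[\]]*(?:[^\S\n]+[\w\[\]]+)*)[^\S\n]*=[^\S\n]*(?<Value>[^\n]*)|(?(1)|[^{}\n]*))\s*)*(?(1)|\}) with just single-line mode (where the '.' dot also matches newlines) isn't working for you? I mean, I'm throwing you a bone here. I read up on Collections in dot net, and this should definitely do it. I can be hired to do the most challenging thing you can imagine. The subtlety of this regex is sublime. If you know what you're looking at, its flow is simple and powerful.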

Tags: c# .net regex linq dictionary

【Solution 1】:

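Here is a complete rework of the regex in C#.

Assumptions (tell me if one or all of them are false):

An INI file section can only contain key/value pair lines in its body
In a non-INI file section, function calls cannot have any parameters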

Regex flags:
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.Singleline
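
As a quick illustration of what two of these flags change, here is a minimal sketch. The `FlagDemo` class and the tiny inputs are made up for this example and are not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlagDemo
{
    public static void Main()
    {
        // Without Singleline, '.' does not match a newline.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                            // False

        // With Singleline, '.' matches '\n' as well.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline));   // True

        // By default a space in the pattern is a literal space.
        Console.WriteLine(Regex.IsMatch("ab", "a b"));                              // False

        // With IgnorePatternWhitespace, unescaped pattern whitespace is ignored,
        // which is what allows the multi-line, indented pattern layout.
        Console.WriteLine(Regex.IsMatch("ab", "a b", RegexOptions.IgnorePatternWhitespace)); // True
    }
}
```

Singleline matters for the pattern because a `/* ... */` comment matched with `.+?` may span several lines.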

Input test:


const string dataFileScr = @"
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
/*  Description = simple */
}

PZ 11
{
IA_return()
}

GDC 7
{
    Message = 6
    Message = 7
        Message = 8
        Message = 8
    RepeatCount = 2
    ErrorMessage = 10
    ErrorMessage = 11
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}

[Section1]
key1 = value1
key2 = value2

[Section2]
key1 = value1
key2 = value2
";

Reworked regex:


const string patternFileScr = @"
(?<Section>                                                              (?# Start of a non ini file section)
  (?<SectionName>[\w ]+)\s*                                              (?# Capture section name)
     {                                                                   (?# Match but don't capture beginning of section)
        (?<SectionBody>                                                  (?# Capture section body. Section body can be empty)
         (?<SectionLine>\s*                                              (?# Capture zero or more lines in the section body)
         (?:                                                             (?# A line can be either a key/value pair, a comment or a function call)
            (?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]*))    (?# Capture key/value pair. Key and value are sub-captured separately)
            |
            (?<Comment>/\*.+?\*/)                                        (?# Capture comment)
            |
            (?<FunctionCall>[\w]+\(\))                                   (?# Capture function call. A function can't have parameters though)
         )\s*                                                            (?# Match but don't capture white characters)
         )*                                                              (?# Zero or more lines, previously mentioned in comments)
        )
     }                                                                   (?# Match but don't capture end of section)
)
|
(?<Section>                                                              (?# Start of an ini file section)
  \[(?<SectionName>[\w ]+)\]                                             (?# Capture section name)
  (?<SectionBody>                                                        (?# Capture section body. Section body can be empty)
     (?<SectionLine>                                                     (?# Capture zero or more lines in the section body. Only key/value pairs allowed.)
        \s*(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]+))\s*  (?# Capture key/value pair. Key and value are sub-captured separately)
     )*                                                                  (?# Zero or more lines, previously mentioned in comments)
  )
)
";

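Discussion: the regex is built to match either a non-INI file section (1) or an INI file section (2).

(1) Non-INI file section: these sections consist of a section name followed by a body enclosed in { and }. The section name can contain letters, digits, or spaces. The section body consists of zero or more lines. A line can be a key/value pair (key = value), a comment (/* Here is a comment */), or a function call without parameters (my_function()).

(2) INI file section: these sections consist of a section name enclosed in [ and ], followed by zero or more key/value pairs. Each pair is on its own line.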
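
Matching the sections is only half of what the question asks for; the captures still have to be turned into a nested dictionary in which repeated keys are joined with `|`. The following is a minimal sketch of that step. It uses a deliberately simplified pattern written for this example rather than the pattern above; the `SectionParser` class, its `Pattern` constant, and the decision to trim values and accept any text up to the end of the line as a value are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SectionParser
{
    // Hypothetical, simplified pattern for this sketch: either "[Name]" followed by
    // key/value lines, or "Name { ... }" whose lines are pairs, comments, or calls.
    private const string Pattern = @"
        (?:
            \[ [ \t]* (?<SectionName> \w+ (?: [ ] \w+ )* ) [ \t]* \]
            (?: \s* (?<Key> \w [\w\[\]]* ) [ \t]* = [ \t]* (?<Value> [^\r\n]* ) )*
          |
            (?<SectionName> \w+ (?: [ ] \w+ )* ) \s* \{
            (?: \s*
                (?: (?<Key> \w [\w\[\]]* ) [ \t]* = [ \t]* (?<Value> [^\r\n]* )
                  | /\* .*? \*/
                  | \w+ \( \)
                )
            )*
            \s* \}
        )";

    public static Dictionary<string, Dictionary<string, string>> Parse(string text)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

        foreach (Match m in Regex.Matches(text, Pattern, options))
        {
            // Every matched pair adds one capture to Key and one to Value,
            // so the two capture lists line up index by index.
            var keys = m.Groups["Key"].Captures.Cast<Capture>().Select(c => c.Value);
            var values = m.Groups["Value"].Captures.Cast<Capture>().Select(c => c.Value.Trim());

            result[m.Groups["SectionName"].Value] = keys
                .Zip(values, (k, v) => new { Key = k, Value = v })
                .GroupBy(p => p.Key)
                .ToDictionary(g => g.Key, g => string.Join("|", g.Select(p => p.Value)));
        }
        return result;
    }

    public static void Main()
    {
        string text = string.Join("\n", new[] {
            "GDC 7", "{", "    Message = 6", "    Message = 7",
            "    ErrorMessage = 10", "    ErrorMessage = 11", "    onKey[5] = 6", "}", "",
            "[Section1]", "key1 = value1", "" });

        var d = Parse(text);
        Console.WriteLine(d["GDC 7"]["Message"]);      // 6|7
        Console.WriteLine(d["GDC 7"]["ErrorMessage"]); // 10|11
        Console.WriteLine(d["Section1"]["key1"]);      // value1
    }
}
```

Pairing the captures with `Zip` and then grouping keeps the values in file order. A section that appears twice overwrites its earlier entry in this sketch.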

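【Comments】:

@sln I would return the same question for your Perl code. I did all this to elaborate the answer. I didn't want to just drop a response the OP couldn't understand. All the named match groups are there for the sake of clarity.
@Stephen By 'for so little' I mean that <Key> \s* = *\s <Value> doesn't constrain the pair to one line. To extract meaningful data there have to be three alternations, XXX { (?: comments | key = value | junk_line )* }, where the k/v pairs and junk_line must be processed (consumed) line by line, otherwise the section will fail on backtracking. Also, VALUE can be anything or nothing (on a per-line basis); it can't be restricted to [\w-]+ the way you have it here, and \s*=\s* is not sustainable.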

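【Solution 2】:

Do yourself a favor, keep your sanity, and learn how to use GPLex and GPPG. They are the closest thing C# has to Lex and Yacc (or Flex and Bison, if you prefer), which are the right tools for this job.

Regular expressions are a great tool for robust string matching, but when you want to match the structure of a string, you need a "grammar". That's what parsers are for. GPLex takes a bunch of regular expressions and generates a super-fast lexer. GPPG takes the grammar you write and generates a super-fast parser.

Trust me, learn how to use these tools... or any other tools like them. You'll be glad you did.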
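
To give a feel for what tracking structure explicitly looks like, here is a minimal hand-written sketch. It is not GPLex or GPPG output and does not use their APIs; it is a small line-oriented state machine, and the `LineParser` class, its state variables, and the simplifications (one token per line, no nested braces, no `TOC{}` on a single line) are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class LineParser
{
    public static Dictionary<string, Dictionary<string, string>> Parse(string text)
    {
        var sections = new Dictionary<string, Dictionary<string, string>>();
        Dictionary<string, string> current = null; // section being filled, if any
        string pendingName = null;                 // name seen, waiting for '{'
        bool inBraces = false;                     // inside a "Name { ... }" body
        bool inComment = false;                    // inside a multi-line /* ... */

        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim();
            if (inComment)
            {
                if (line.Contains("*/")) inComment = false;
                continue;
            }
            if (line.Length == 0) continue;
            if (line.StartsWith("/*"))
            {
                inComment = !line.Contains("*/");
                continue;
            }

            if (!inBraces && line.StartsWith("[") && line.EndsWith("]"))
            {
                // INI header: "[Name]"
                current = new Dictionary<string, string>();
                sections[line.Substring(1, line.Length - 2).Trim()] = current;
                pendingName = null;
            }
            else if (!inBraces && line == "{" && pendingName != null)
            {
                current = new Dictionary<string, string>();
                sections[pendingName] = current;
                pendingName = null;
                inBraces = true;
            }
            else if (inBraces && line == "}")
            {
                current = null;
                inBraces = false;
            }
            else
            {
                int eq = line.IndexOf('=');
                if (current != null && eq > 0)
                {
                    string key = line.Substring(0, eq).Trim();
                    string value = line.Substring(eq + 1).Trim();
                    string old;
                    // Repeated keys are joined with '|', as the question requests.
                    current[key] = current.TryGetValue(key, out old) ? old + "|" + value : value;
                }
                else if (!inBraces)
                {
                    current = null;     // a non-pair line ends an INI section
                    pendingName = line; // and may be the name of a braces section
                }
                // Inside braces, any other line (e.g. IA_return()) is skipped.
            }
        }
        return sections;
    }
}
```

Calling `LineParser.Parse(dataFileScr)` on the question's sample would be expected to yield `"6|7|8|8"` for `["GDC 7"]["Message"]`. A generated lexer and parser would replace the ad hoc string checks with token rules and grammar productions.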

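【Comments】:

Without knowing whether a grammar is involved, that is a premature statement.
@sln, what do you mean? What is the regex in the OP supposed to be? I'd say the regex describes (or tries to describe) the structure of the input text: in other words, there is a grammar.
@Bart, there are {} characters in the sample data. I can't jump to the conclusion that there is a grammar. Is there if/then/else involved in the grammar? Because regex has that, and possessive quantifiers too.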

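【Solution 3】:

# 2. It does not work for .ini files like

It doesn't work because, as your regex states, a { is required after [Section]. Your regex would match if you had something like this:

[Section] { key = value }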

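【Comments】:

No, it won't stop at the closing }. Because of his character class, his regex goes straight to the final }.
I ran a test. I put my example section at the end of r4ph's code and that section was matched. Without the { it is not. Tested in C#.
Oh, maybe he fixed it then. One of his bullet points used to describe that as a problem. The regex has so many problems that it can't be fixed. Better to rewrite it. Unfortunately, people who know regex don't get paid.
Confirmed, nothing has changed; it still overruns all boundaries.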

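【Solution 4】:

Here is a Perl example. Perl doesn't have named capture arrays, probably because of backtracking.
Maybe you can pick something out of the regex. This assumes there are no nested {} brackets.

Edit: never content to leave well enough alone, below is a revised version.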

use strict;
use warnings;

my $str = '
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
 /*  Description = simple
 */
}

asdfasdf

PZ 11
{
IA_return()
}

[ section 5 ]
  this = that
[ section 6 ]
  this_ = _that{hello() hhh = bbb}

TOC{}

GDC 7
{
    Message = 6
    Message = 7
        Message = 8
        Message = 8
    RepeatCount = 2
    ErrorMessage = 10
    ErrorMessage = 11
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}
';


use re 'eval';

my $rx = qr/

\s*
( \[ [^\S\n]* )?                     # Grp 1  optional ini section delimiter '['
(?<Section> \w+ (?:[^\S\n]+ \w+)* )  # Grp 2  'Section'
(?(1) [^\S\n]* \] |)                 # Condition, if we matched '[' then look for ']'
\s* 

(?<Body>                   # Grp 3 'Body' (for display only)
   (?(1)| \{ )                   # Condition, if we're not a ini section then look for '{'

   (?{ print "Section: '$+{Section}'\n" })  # SECTION debug print, remove in production

   (?:                           # _grp_
       \s*                           # whitespace
       (?:                              # _grp_
            \/\* .*? \*\/                    # some comments
          |                               # OR ..
                                             # Grp 4 'Key'  (tested with print, Perl doesn't have named capture arrays)
            (?<Key> \w[\w\[\]]* (?:[^\S\n]+ [\w\[\]]+)* )
            [^\S\n]* = [^\S\n]*              # =
            (?<Value> [^\n]* )               # Grp 5 'Value' (tested with print)

            (?{ print "  k\/v: '$+{Key}' = '$+{Value}'\n" })  # KEY,VALUE debug print, remove in production
          |                               # OR ..
            (?(1)| [^{}\n]* )                # any chars except newline and [{}] on the condition we're not a ini section
        )                               # _grpend_
        \s*                          # whitespace
    )*                           # _grpend_  do 0 or more times 
   (?(1)| \} )                   # Condition, if we're not a ini section then look for '}'
)
/x;


while ($str =~ /$rx/xsg)
{
    print "\n";
    print "Body:\n'$+{Body}'\n";
    print "=========================================\n";
}

__END__

Output:

Section: 'Start 0'
  k/v: 'Next' = '1'
  k/v: 'Author' = 'rk'
  k/v: 'Date' = '2011-03-10'

Body:
'{
    Next = 1
    Author = rk
    Date = 2011-03-10
 /*  Description = simple
 */
}'
=========================================
Section: 'PZ 11'

Body:
'{
IA_return()
}'
=========================================
Section: 'section 5'
  k/v: 'this' = 'that'

Body:
'this = that
'
=========================================
Section: 'section 6'
  k/v: 'this_' = '_that{hello() hhh = bbb}'

Body:
'this_ = _that{hello() hhh = bbb}

'
=========================================
Section: 'TOC'

Body:
'{}'
=========================================
Section: 'GDC 7'
  k/v: 'Message' = '6'
  k/v: 'Message' = '7'
  k/v: 'Message' = '8'
  k/v: 'Message' = '8'
  k/v: 'RepeatCount' = '2'
  k/v: 'ErrorMessage' = '10'
  k/v: 'ErrorMessage' = '11'
  k/v: 'onKey[5]' = '6'
  k/v: 'onKey[6]' = '4'
  k/v: 'onKey[9]' = '11'

Body:
'{
    Message = 6
    Message = 7
        Message = 8
        Message = 8
    RepeatCount = 2
    ErrorMessage = 10
    ErrorMessage = 11
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}'
=========================================
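
The conditional construct used in the Perl pattern, `(?(1) ... | ... )`, is also available in .NET, where it can test a named group. The following is a minimal sketch of just the header part of the idea: if an opening `[` was captured, require `]`, otherwise require `{`. The `HeaderDemo` class, the `Open` and `Name` group names, and the sample text are assumptions made for this example; it does not port the whole Perl regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderDemo
{
    // (?<Open>\[)?                 an optional '[' marks an INI-style header
    // (?(Open)[ \t]*\]|\s*\{)      if '[' was captured require ']', otherwise require '{'
    private static readonly Regex Header = new Regex(
        @"(?<Open>\[)?[ \t]*(?<Name>\w+(?:[ \t]+\w+)*)(?(Open)[ \t]*\]|\s*\{)");

    public static List<string> Names(string text)
    {
        var names = new List<string>();
        foreach (Match m in Header.Matches(text))
            names.Add(m.Groups["Name"].Value);
        return names;
    }

    public static void Main()
    {
        // "asdfasdf" is followed by neither '{' nor ']' and is therefore skipped.
        string text = "Start 0\n{\n}\nasdfasdf\n[ section 5 ]\nTOC{}\n";
        Console.WriteLine(string.Join(", ", Names(text))); // Start 0, section 5, TOC
    }
}
```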

【Comments】: