【Question title】: pcre2 conditional replacement regex
【Posted】: 2017-10-21 18:48:32
【Question】:

I'm trying to write a regex that conditionally inserts box-drawing characters, but I keep getting the compile error "subpattern name expected".

Here is my code:

int match_pkg_details(char **pkgdetail, char *pkginfo)
{
    PCRE2_SPTR pattern = (PCRE2_SPTR)"^(?!Name|Architecture|URL|Licenses|"\
                    "Installed Size|Packager|Build Date|"\
                    "Install Date|Install Script|Validated By| *$).*$";
    *pkgdetail = malloc(4096); // FIXME malloc in initializer
    char *worker = *pkgdetail;
    size_t pattern_length = strlen((char *)pattern);
    int errornumber;
    PCRE2_SIZE erroroffset;
    pcre2_code *regex = pcre2_compile(
            pattern,
            pattern_length,
            PCRE2_MULTILINE,
            &errornumber,
            &erroroffset,
            NULL);
    if (regex == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
        printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset,
            buffer);
        return 1;
    }

    PCRE2_SPTR replacement = (PCRE2_SPTR)"(?(?=^Install Reason) a | ((?=(\\w) b | ((?=(\\s) c )))))";
                                                                                    // if starts with Install Reason replace with bottom line arrow }}}
    size_t replacement_length = strlen((char*)replacement);
    pcre2_code *replacement_regex = pcre2_compile(
            replacement,
            replacement_length,
            PCRE2_EXTENDED,
            &errornumber,
            &erroroffset,
            NULL);
    if (replacement_regex == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
        printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset,
               buffer);
        return 1;
    }
    pcre2_match_data *match_data =
            pcre2_match_data_create_from_pattern(regex, NULL);

    PCRE2_SPTR subject = (PCRE2_SPTR)pkginfo;
    size_t length = strlen((char *)subject);

    PCRE2_SIZE *ovector = pcre2_get_ovector_pointer(match_data);
    ovector[1] = 0;

    int rc;
    PCRE2_SIZE offset = 0;
    uint32_t options = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
    while (offset < length - 1 && (rc =
         pcre2_match(regex, subject, length, offset, options, match_data, NULL)))
    {
        offset = ovector[1];
        options = 0;

        if (rc == PCRE2_ERROR_NOMATCH)
        {
            ovector[1] = offset + 1;
            continue;
        }

        for (int i = 0; i < rc; i++)
        {
            PCRE2_SIZE worker_len = strlen(worker);
            PCRE2_UCHAR output[4096];
            PCRE2_SIZE outlen;
            int rs = pcre2_substitute(
                    replacement_regex,
                    subject,
                    length,
                    offset,
                    PCRE2_SUBSTITUTE_EXTENDED,
                    NULL,
                    NULL,
                    (PCRE2_SPTR)"@",
                    1,
                    output,
                    &outlen);
            PCRE2_SPTR substring_start = subject + ovector[2*i];
            size_t substring_length = ovector[2*i+1] - ovector[2*i];
            snprintf(worker, 4096, "%.*s\n", (int)substring_length, (char*)substring_start);
            worker += (int)substring_length + 1;
        }
    }

    pcre2_match_data_free(match_data);
    pcre2_code_free(regex);
    return 0;
}

The string I want to match:

Name            : cinnamon 
Version         : 3.4.6-1 
Description     : Linux desktop which provides advanced innovative features and 
                  a traditional user experience 
Architecture    : x86_64 
URL             : https://github.com/linuxmint/Cinnamon 
Licenses        : GPL2 
Groups          : None 
Provides        : None 
Depends On      : accountsservice  caribou  cinnamon-settings-daemon  
                  cinnamon-session cinnamon-translations  cjs  clutter-gtk 
                  gnome-backgrounds  gnome-themes-standard  gstreamer  
                  libgnome-keyring  libkeybinder3  librsvg  muffin  
                  python2-cairo  python-dbus  python2-dbus  python2-pillow  
                  python2-pam  python2-pexpect  python2-pyinotify  python2-lxml  
                  cinnamon-control-center  cinnamon-screensaver  cinnamon-menus                   
                  libgnomekbd  network-manager-applet  nemo  polkit-gnome  xapps  
                  python2-gobject 
Optional Deps   : blueberry: Bluetooth support [installed]
                  gnome-panel: fallback mode
                  metacity: fallback mode
                  system-config-printer: printer settings [installed] 
Required By     : cinnamon-sound-effects 
Optional For    : None
Conflicts With  : None 
Replaces        : None 
Installed Size  : 8.31 MiB 
Packager        : Antonio Rojas <arojas@archlinux.org> 
Build Date      : Sat 09 Sep 2017 05:38:21 AM CDT 
Install Date    : Sat 09 Sep 2017 11:37:44 AM CDT 
Install Reason  : Installed as a dependency for another package 
Install Script  : No 
Validated By    : Signature

Currently, if I remove the replacement group, I get:

Version         : 3.4.6-1
Description     : Linux desktop which provides advanced innovative features
                    and a traditional user experience
Provides        : None
Depends On      : accountsservice  caribou  cinnamon-settings-daemon
                  cinnamon-session  cinnamon-translations  cjs  clutter-gtk  gnome-backgrounds
                  gnome-themes-standard  gstreamer  libgnome-keyring  libkeybinder3  librsvg
                  muffin  python2-cairo  python-dbus  python2-dbus  python2-pillow  python2-pam
                  python2-pexpect  python2-pyinotify  python2-lxml  cinnamon-control-center
                  cinnamon-screensaver  cinnamon-menus  libgnomekbd  network-manager-applet
                  nemo  polkit-gnome  xapps  python2-gobject
Optional Deps   : blueberry: Bluetooth support [installed]
Required By     : cinnamon-sound-effects
Optional For    : None
Conflicts With  : None
Replaces        : None
Install Reason  : Installed as a dependency for another package

The expected output looks like this:

├─ Version         : 3.4.6-1
├─ Description     : Linux desktop which provides advanced innovative features
│                    and a traditional user experience
├─ Provides        : None
├─ Depends On      : accountsservice  caribou  cinnamon-settings-daemon
│                    cinnamon-session  cinnamon-translations  cjs  clutter-gtk  gnome-backgrounds
│                    gnome-themes-standard  gstreamer  libgnome-keyring  libkeybinder3  librsvg
│                    muffin  python2-cairo  python-dbus  python2-dbus  python2-pillow  python2-pam
│                    python2-pexpect  python2-pyinotify  python2-lxml  cinnamon-control-center
│                    cinnamon-screensaver  cinnamon-menus  libgnomekbd  network-manager-applet
│                    nemo  polkit-gnome  xapps  python2-gobject
├─ Optional Deps   : blueberry: Bluetooth support [installed]
├─ Required By     : cinnamon-sound-effects
├─ Optional For    : None
├─ Conflicts With  : None
├─ Replaces        : None
└─ Install Reason  : Installed as a dependency for another package

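a, b and c are for testing purposes only (I think I should replace them with named capture groups). Once the replacement works, I'll factor the regex_compile part out into its own function. How do I replace named groups with pcre2_substitute?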

【Comments on the question】:

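  • What is (?(^Install Reason) supposed to be? I don't see it among the conditional operators listed at regular-expressions.info/refadv.html
  • I based it on the if-then-else construct in the pcre2 documentation: pcre.org/current/doc/html/pcre2pattern.html#SEC21. With ^Install Reason I'm checking whether those are the characters at the start of the line. If they aren't, check for whitespace, and if there is no whitespace, match a letter. Then I'll use pcre2_substitute to replace the characters.
  • That page lists 5 kinds of conditions. Which kind is this supposed to be? It doesn't seem to match any of them.
  • Well, many people misuse the word "conditional". Some just mean that a regular "or" is conditional, or that a substring is optional. What text do you want to match? What is the point of fixing the current code if it won't help later? Look, your current statement most likely has 2 kinds of typos: 1) remove the unnecessary spaces, because they are significant; 2) the double backslashes.
  • You can't just put anything in the condition; it has to be a capture reference or a lookaround expression (DEFINE and VERSION are special cases). Did you mean (?(?=^Install Reason)...|...)?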

Tags: c regex pcre


【Solution 1】:

You're trying to put your logic in the wrong place. It needs to go in the replacement string, not in the regex pattern itself.

First, let's write a pattern that identifies the different parts of the string:

^(?:
    (?<remove>(?:
        Name|Architecture|URL|Licenses|
        Installed[ ]Size|Packager|Build[ ]Date|
        Install[ ]Date|Install[ ]Script|Validated[ ]By
    )\s*:[^\n]*\n)
    |(?<last>(?=Install[ ]Reason\s*:))
    |(?<field>(?=\S))
    |(?<cont>(?=\s))
)

Use it with the m and x options (PCRE2_MULTILINE | PCRE2_EXTENDED), although the C code below doesn't really need PCRE2_EXTENDED.

Each match identifies one part of the string and sets exactly one named capture group:

  • remove for the lines to be removed
  • last for the last field
  • field for the other fields
  • cont for value continuations (lines without a field label)

Next, we have to replace these parts with different strings:

  • remove => (empty string)
  • last => └─ (I'll use \- in the program below)
  • field => ├─ (I'll use +- in the program below)
  • cont => │ (I'll use | in the program below)

We can let PCRE2 handle this through the PCRE2_SUBSTITUTE_EXTENDED option, which the documentation describes as follows:

The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more flexibility to group substitution. The syntax is similar to that used by Bash:

${<n>:-<string>}
${<n>:+<string1>:<string2>}

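As before, <n> may be a group number or a name. The first form specifies a default value. If group <n> is set, its value is inserted; if not, <string> is expanded and the result inserted. The second form specifies strings that are expanded and inserted when group <n> is set or unset, respectively. The first form is just a convenient shorthand for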

${<n>:+${<n>}:<string>}

So, with that syntax, our replacement string looks like this:

${remove:+:${last:+\\- :${field:++- :${cont:+|  :}}}}

Here is a complete demo:

#include <stdio.h>
#include <stdlib.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

PCRE2_SPTR input = (PCRE2_SPTR)
    "Name            : cinnamon\n"
    "Version         : 3.4.6-1\n"
    "Description     : Linux desktop which provides advanced innovative features and\n"
    "                  a traditional user experience\n"
    "Architecture    : x86_64\n"
    "URL             : https://github.com/linuxmint/Cinnamon\n"
    "Licenses        : GPL2\n"
    "Groups          : None\n"
    "Provides        : None\n"
    "Depends On      : accountsservice  caribou  cinnamon-settings-daemon\n"
    "                  cinnamon-session cinnamon-translations  cjs  clutter-gtk\n"
    "                  gnome-backgrounds  gnome-themes-standard  gstreamer \n"
    "                  libgnome-keyring  libkeybinder3  librsvg  muffin \n"
    "                  python2-cairo  python-dbus  python2-dbus  python2-pillow\n"
    "                  python2-pam  python2-pexpect  python2-pyinotify  python2-lxml\n"
    "                  cinnamon-control-center  cinnamon-screensaver  cinnamon-menus\n"
    "                  libgnomekbd  network-manager-applet  nemo  polkit-gnome  xapps\n"
    "                  python2-gobject\n"
    "Optional Deps   : blueberry: Bluetooth support [installed]\n"
    "                  gnome-panel: fallback mode\n"
    "                  metacity: fallback mode\n"
    "                  system-config-printer: printer settings [installed]\n"
    "Required By     : cinnamon-sound-effects\n"
    "Optional For    : None\n"
    "Conflicts With  : None\n"
    "Replaces        : None\n"
    "Installed Size  : 8.31 MiB\n"
    "Packager        : Antonio Rojas <arojas@archlinux.org>\n"
    "Build Date      : Sat 09 Sep 2017 05:38:21 AM CDT\n"
    "Install Date    : Sat 09 Sep 2017 11:37:44 AM CDT\n"
    "Install Reason  : Installed as a dependency for another package\n"
    "Install Script  : No\n"
    "Validated By    : Signature\n";

PCRE2_SPTR pattern = (PCRE2_SPTR)
    "^(?:"
        "(?<remove>(?:"
            "Name|Architecture|URL|Licenses|"
            "Installed Size|Packager|Build Date|"
            "Install Date|Install Script|Validated By"
        ")\\s*:[^\n]*\n)"
        "|(?<last>(?=Install Reason\\s*:))"
        "|(?<field>(?=\\S))"
        "|(?<cont>(?=\\s))"
    ")";

PCRE2_SPTR replacement = (PCRE2_SPTR)
    "${remove:+:${last:+\\\\- :${field:++- :${cont:+|  :}}}}";

static void print_error(int code)
{
    PCRE2_UCHAR message[256];
    /* Pass the buffer itself (not its address); the size is in code units. */
    if (pcre2_get_error_message(code, message, sizeof(message) / sizeof(PCRE2_UCHAR)))
        puts((const char *)message);
}

int main()
{
    pcre2_code *re;
    pcre2_match_context *match_context;
    int result, error;
    PCRE2_SIZE erroffset, outlength;
    PCRE2_UCHAR* outbuf;

    re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, PCRE2_MULTILINE, &error, &erroffset, 0);
    if (!re)
    {
        print_error(error);
        return 1;
    }

    match_context = pcre2_match_context_create(0);

    outlength = 0;
    result = pcre2_substitute(
        re,
        input,
        PCRE2_ZERO_TERMINATED,
        0,
        PCRE2_SUBSTITUTE_GLOBAL | PCRE2_SUBSTITUTE_OVERFLOW_LENGTH | PCRE2_SUBSTITUTE_EXTENDED,
        0,
        match_context,
        replacement,
        PCRE2_ZERO_TERMINATED,
        0,
        &outlength
    );

    if (result != PCRE2_ERROR_NOMEMORY)
    {
        print_error(result);
        return 1;
    }

    outbuf = malloc(outlength * sizeof(PCRE2_UCHAR));

    result = pcre2_substitute(
        re,
        input,
        PCRE2_ZERO_TERMINATED,
        0,
        PCRE2_SUBSTITUTE_GLOBAL | PCRE2_SUBSTITUTE_EXTENDED,
        0,
        match_context,
        replacement,
        PCRE2_ZERO_TERMINATED,
        outbuf,
        &outlength
    );

    if (result < 0)
    {
        print_error(result);
        return 1;
    }

    puts((const char *)outbuf);

    free(outbuf);
    pcre2_match_context_free(match_context);
    pcre2_code_free(re);

    return 0;
}

The output is:

+- Version         : 3.4.6-1
+- Description     : Linux desktop which provides advanced innovative features and
|                    a traditional user experience
+- Groups          : None
+- Provides        : None
+- Depends On      : accountsservice  caribou  cinnamon-settings-daemon
|                    cinnamon-session cinnamon-translations  cjs  clutter-gtk
|                    gnome-backgrounds  gnome-themes-standard  gstreamer
|                    libgnome-keyring  libkeybinder3  librsvg  muffin
|                    python2-cairo  python-dbus  python2-dbus  python2-pillow
|                    python2-pam  python2-pexpect  python2-pyinotify  python2-lxml
|                    cinnamon-control-center  cinnamon-screensaver  cinnamon-menus
|                    libgnomekbd  network-manager-applet  nemo  polkit-gnome  xapps
|                    python2-gobject
+- Optional Deps   : blueberry: Bluetooth support [installed]
|                    gnome-panel: fallback mode
|                    metacity: fallback mode
|                    system-config-printer: printer settings [installed]
+- Required By     : cinnamon-sound-effects
+- Optional For    : None
+- Conflicts With  : None
+- Replaces        : None
\- Install Reason  : Installed as a dependency for another package

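I should probably mention that in your case, doing the string manipulation by hand would certainly be easier than going through a regex pattern.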
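
To illustrate that last remark, here is a minimal sketch of the manual approach in plain C, with no PCRE2 involved. It is not taken from the original answer: draw_tree, prefix_for and has_label are hypothetical names, and the sketch assumes field labels are followed by spaces and a colon, as in the pacman output shown in the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Field labels whose lines are dropped, copied from the question's pattern. */
static const char *const removed_fields[] = {
    "Name", "Architecture", "URL", "Licenses", "Installed Size", "Packager",
    "Build Date", "Install Date", "Install Script", "Validated By"
};

/* Returns 1 if the line starts with `label`, optional spaces, then ':'. */
static int has_label(const char *line, size_t len, const char *label)
{
    size_t n = strlen(label);
    if (len < n || strncmp(line, label, n) != 0)
        return 0;
    while (n < len && line[n] == ' ')
        n++;
    return n < len && line[n] == ':';
}

/* Picks the 3-byte prefix for one line, or NULL if the line is dropped. */
static const char *prefix_for(const char *line, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof removed_fields / sizeof removed_fields[0]; i++)
        if (has_label(line, len, removed_fields[i]))
            return NULL;
    if (has_label(line, len, "Install Reason"))
        return "\\- ";                      /* last field */
    if (len > 0 && line[0] != ' ' && line[0] != '\t')
        return "+- ";                       /* any other field */
    return "|  ";                           /* continuation or empty line */
}

/* Hypothetical helper: returns a malloc'd, prefixed copy of `in`, or NULL
 * if allocation fails. The caller frees the result. */
char *draw_tree(const char *in)
{
    assert(in != NULL);

    size_t lines = 1;
    for (const char *p = in; *p != '\0'; p++)
        if (*p == '\n')
            lines++;

    /* Every kept line grows by a 3-byte prefix, so this bound is sufficient. */
    char *out = malloc(strlen(in) + 3 * lines + 1);
    if (out == NULL)
        return NULL;

    char *w = out;
    const char *line = in;
    while (*line != '\0') {
        const char *nl = strchr(line, '\n');
        size_t len = nl ? (size_t)(nl - line) : strlen(line);
        size_t full = nl ? len + 1 : len;   /* line length including '\n' */
        const char *prefix = prefix_for(line, len);
        if (prefix != NULL) {
            memcpy(w, prefix, 3);
            w += 3;
            memcpy(w, line, full);
            w += full;
        }
        line += full;
    }
    *w = '\0';
    return out;
}
```

Passing the pacman output to draw_tree and printing the result with puts should produce the same kind of layout as the PCRE2 program. The fixed 3-byte prefix length is an assumption tied to the ASCII stand-ins; switching to the UTF-8 box-drawing characters would require sizing the buffer and the memcpy calls from the actual prefix lengths.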

【Comments】:

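  • Works perfectly; it only needed minor changes to fit my program.
  • I tried it, and all it does is strip the matched characters. It doesn't replace them with the replacement string.
  • @RichardMcFriendOluwamuyiwa I suggest you ask a new question detailing your specific case.
  • Thanks @LucasTrzesniewski, I've got it working. I had to change a few things.
  • Hi @LucasTrzesniewski, since the latest PCRE2 release, the first pcre2_substitute call now always returns PCRE2_ERROR_BADOPTION. Any idea why that happens?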