How to apply regex properly in the Jenssegers raw() function: answers

【Question title】: How to apply regex properly in Jenssegers raw() function
【Posted】: 2021-08-09 09:44:04
【Question description】:

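I am trying to implement a diacritic-insensitive, whole-word search in one of my applications. I wrote this query and it works fine in the MongoDB shell (I used Robo3T).

[Here I passed the Unicode conversion of the word 'Irène']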

db.getCollection('rvh_articles').aggregate([
  {
    "$match":{
       "art_xml_data.article.article_title":{
          "$regex":/( |^)[i\x{00ec}\x{00ed}\x{00ee}\x{00ef}]r[e\x{00e8}\x{00e9}\x{00ea}\x{00eb}\x{00e6}][n\x{00f1}][e\x{00e8}\x{00e9}\x{00ea}\x{00eb}\x{00e6}]( |$)/,
          "$options":"i"
       }
    }
  }
])

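To implement this query through the jenssegers raw() function, I wrote a PHP function that builds a regular expression for the search string. It replaces each letter of the string with a character class containing that letter and the Unicode escapes of its accented variants, and returns the resulting regex.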

public function makeComp($input) 
{
    $accents = array(
        /*
            I include json_encode here because:
            json_encode, used in the jenssegers query-building function, converts diacritic characters to
            hexadecimal escapes (\u). But '\u' is not supported in MongoDB regex. It shows this error:
            "Regular expression is invalid: PCRE does not support \\L, \\l, \\N{name}, \\U, or \\u"

            So I first apply json_encode to each string and then replace '{\u' with '\x{'. Problem solved.
        */
        "a" => json_encode('[a{à}{á}{â}{ã}{ä}{å}{æ}]'),
        "c" => json_encode('[c{ç}]'),
        "e" => json_encode('[e{è}{é}{ê}{ë}{æ}]'),
        "i" => json_encode('[i{ì}{í}{î}{ï}]'),
        "n" => json_encode('[n{ñ}]'),
        "o" => json_encode('[o{ò}{ó}{ô}{õ}{ö}{ø}]'),
        "s" => json_encode('[s{ß}]'),
        "u" => json_encode('[u{ù}{ú}{û}{ü}]'),
        "y" => json_encode('[y{ÿ}]'),
    );
    $out = strtr($input, $accents); // replace each mapped letter in the input with its character class from $accents
    $out = str_replace('{\u', '\x{', $out); // turn {\uXXXX} into \x{XXXX} because PCRE does not support the \uXXXX syntax
    $out = str_replace('"', "", $out); // remove the double quotes added by json_encode
    return '/( |^)' . $out . '( |$)/';
}

This is the function where I run the MongoDB query through the jenssegers raw() function.

public function getall_articles(Request $request)
{
    extract($request->all());

    if (!empty($search_key)) {
        DB::connection()->enableQueryLog();

        $search_key = $this->makeComp($search_key);

        $data = Article::raw()->aggregate([
            array(
                '$match' => array(
                    "art_xml_data.article.article_title" => array(
                        '$regex' => $search_key,
                        '$options' => 'i'
                    )
                )
            )
        ])->toArray();

        dd(DB::getQueryLog());
    }
}

This is the printed query log:

array:1 [
    0 => array:3 [
        "query" => rvh_articles.aggregate([{
            "$match":{
                "art_xml_data.article.article_title":{
                    "$regex":"\/( |^)[i\\x{00ec}\\x{00ed}\\x{00ee}\\x{00ef}]r[e\\x{00e8}\\x{00e9}\\x{00ea}\\x{00eb}\\x{00e6}][n\\x{00f1}][e\\x{00e8}\\x{00e9}\\x{00ea}\\x{00eb}\\x{00e6}]( |$)\/",
                    "$options":"i"
                }
            }
        }])
        "bindings" => []
        "time" => 620.14
    ]
]

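The regex I built does not reach the query as written, so Mongo returns zero results. Can anyone help me fix this? I need an alternative way to run a diacritic-insensitive and case-insensitive search through the jenssegers raw() function.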

【Comments on the question】:

What if you remove the /s? return '( |^)' . $out . '( |$)';, or even return '(?<!\S)' . $out . '(?!\S)';
@WiktorStribiżew This is the regex part of the query log after removing the '/': {"$regex":"( |^)[i\\x{00ec}\\x{00ed}\\x{00ee}\\x{00ef}]r[e\\x{00e8}\\x{00e9}\\x{00ea}\\x{00eb}\\x{00e6}][n\\x{00f1}][e\\x{00e8}\\x{00e9}\\x{00ea}\\x{00eb}\\x{00e6}]( |$)"
@WiktorStribiżew This change works well: return '(?<!\S)' . $out . '(?!\S)'; . Thank you very much. Could you post it as an answer so I can accept it?

Tags: regex mongodb laravel-5 robo3t jenssegers-mongodb

【Solution 1】:

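In your public function makeComp($input) method, drop the / delimiters (when $regex is given as a string, the slashes are treated as literal characters to match) and use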

return '(?<!\S)' . $out . '(?!\S)';

If $out can contain (now or in the future) several alternatives separated by |, you should group the pattern:

return '(?<!\S)(?:' . $out . ')(?!\S)';
#              ^^^            ^
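
As a rough illustration of why the grouping matters, the sketch below compares the two return values for a hypothetical $out of 'cat|dog' (makeComp() as written never emits a | itself). It uses PHP's preg_match() as a local stand-in for MongoDB's regex engine, which is an assumption made only for this demo; the # delimiters belong to preg_match() and are not part of the pattern sent to MongoDB.

```php
<?php
// Hypothetical alternation, used only to show the effect of grouping.
$out = 'cat|dog';

// Without the group, "|" splits the whole pattern in two:
//   (?<!\S)cat   OR   dog(?!\S)
$ungrouped = '(?<!\S)' . $out . '(?!\S)';

// With the group, both boundaries apply to every alternative:
//   (?<!\S)(?:cat|dog)(?!\S)
$grouped = '(?<!\S)(?:' . $out . ')(?!\S)';

// "category" starts with "cat" but is not the whole word "cat".
$matchesUngrouped = preg_match('#' . $ungrouped . '#u', 'category');
$matchesGrouped   = preg_match('#' . $grouped . '#u', 'category');

var_dump($matchesUngrouped); // int(1): the left alternative ignores the right boundary
var_dump($matchesGrouped);   // int(0): "cat" is followed by a non-whitespace character
```
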

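Note that (?<!\S) is a left-hand whitespace boundary: it matches a position that is not immediately preceded by a non-whitespace character. (?!\S) is a right-hand whitespace boundary: it matches a position that is not immediately followed by a non-whitespace character.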
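
Putting the pieces together, here is a minimal standalone sketch of the question's makeComp() with the delimiter-free, grouped return value. The function name buildWholeWordPattern() is hypothetical, the file is assumed to be saved as UTF-8, and preg_match() with the i and u modifiers is used only as a local approximation of how MongoDB would evaluate the pattern with "$options":"i".

```php
<?php
// Hypothetical standalone version of makeComp(); same mapping as in the question.
function buildWholeWordPattern(string $input): string
{
    $accents = [
        'a' => json_encode('[a{à}{á}{â}{ã}{ä}{å}{æ}]'),
        'c' => json_encode('[c{ç}]'),
        'e' => json_encode('[e{è}{é}{ê}{ë}{æ}]'),
        'i' => json_encode('[i{ì}{í}{î}{ï}]'),
        'n' => json_encode('[n{ñ}]'),
        'o' => json_encode('[o{ò}{ó}{ô}{õ}{ö}{ø}]'),
        's' => json_encode('[s{ß}]'),
        'u' => json_encode('[u{ù}{ú}{û}{ü}]'),
        'y' => json_encode('[y{ÿ}]'),
    ];

    $out = strtr($input, $accents);          // letter -> character class
    $out = str_replace('{\u', '\x{', $out);  // {\uXXXX} -> \x{XXXX}
    $out = str_replace('"', '', $out);       // drop the quotes json_encode adds

    // No "/" delimiters: the result is meant to be passed as a plain $regex string.
    return '(?<!\S)(?:' . $out . ')(?!\S)';
}

$pattern = buildWholeWordPattern('irene');

// Local check only; "#" is preg_match()'s delimiter, not part of $pattern.
$wholeWord = preg_match('#' . $pattern . '#iu', 'Bonjour Irène Dupont');
$partial   = preg_match('#' . $pattern . '#iu', 'La Sirène');

var_dump($wholeWord); // int(1): "Irène" stands between whitespace
var_dump($partial);   // int(0): "irène" is preceded by "S"
```

In the question's getall_articles(), the returned string would go into '$regex' => $search_key unchanged, with '$options' => 'i' kept as it is.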

【Comments】: