Answers: PHP tag system with preg_match and foreach

【Question title】: PHP tag system with preg_match and foreach
【Posted】: 2017-05-19 13:25:45
【Question】:

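I'm trying to build a tag system for my website. It checks a written article (maybe 400-1000 words) for specific words and builds a string containing every keyword from an array that was found.

The one I made works, but there are a few problems I'd like to fix.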

$a = "This is my article and it's about apples and pears. I like strawberries as well though.";

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
   if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
   }
}
echo $b[0].",".$b[1].",".$b[2].",".$b[3];
$tags = $b[0].",".$b[1].",".$b[2].",".$b[3];

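First, I'd like to know whether there is any way to make this more efficient. I have a database with about 5,000 keywords, and it grows every day.

As you can see, I don't know how to get all the matches. I'm writing out $b[0], $b[1], and so on.

I'd like it to build a single string with all the matches, but with each match only once. If apples is mentioned 5 times, it should appear in the string only once.

That said, this works. I just don't feel it's the best solution.

Edit:

I'm trying this now, but I can't get it to work at all.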

$a = "This is my article and it's about apples and pears. I like strawberries as well though.";

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);
$b = [];
preg_match("/\b(" . $targets . ")\b/i", $a, $b);

echo $b;

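【Comments】:

Why would it run 2.5 million times? It only checks $a against each $target, so it runs just count($targets) times.
If your articles are 400-1000 words, you should first of all do it the other way round. Don't look for the tags in the article; look up the words from the article in the tags (that would be 5-10 times more efficient). With this approach you can also filter out short words first (a, an, the, is ...) and not search for them.
Since you only add entries to $b when a target word is found, you can use echo implode(',', $b); to display the words you found.
OK, Devon - I edited that out - thanks. Autista_z - how would I get started on that? roberto06 - great! Thanks!
Why don't you simply use the in_array function to compare against your $a variable and then foreach?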
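
The "other way round" suggestion from the comments could look roughly like the sketch below: split the article into words and look each one up in the tag list, rather than scanning the article once per tag. The helper name find_tags() is hypothetical and not from the thread, and it assumes all tags are single words.

```php
<?php
// Hypothetical helper (not from the original thread).
function find_tags(string $article, array $tags): array
{
    // Flip the tag list so every tag becomes an array key; isset() on a key is a cheap lookup.
    $tagSet = array_flip(array_map('strtolower', $tags));

    $found = [];
    // str_word_count(..., 1) returns the words of the text as an array.
    foreach (str_word_count(strtolower($article), 1) as $word) {
        if (isset($tagSet[$word])) {
            $found[$word] = true;   // using the word as key keeps each tag only once
        }
    }
    return array_keys($found);
}

$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = array('apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes');

// implode() joins however many tags were found, so no $b[0], $b[1], ... is needed.
echo implode(',', find_tags($a, $targets)), "\n";   // prints: apples,pears,strawberries
```

Whether this is actually faster than a regex depends on the real data, so it is worth timing against your own articles and keyword list.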

Tags: php regex preg-match-all keyword-search word-boundary

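【Solution 1】:

First I'd like to offer a non-regex method, then I'll go into some lengthy regex considerations.

Because your search "needles" are whole words, you can leverage the magic of str_word_count() like this:

Code: (Demo)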

$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];  // all lowercase
$input="Apples, pears, and strawberries are delicious. I probably favor the flavor of strawberries most. My brother's favorites are crabapples and grapes.";
$lowercase_input=strtolower($input);                      // eliminate case-sensitive issue
$words=str_word_count($lowercase_input,1);                // split into array of words, permitting: ' and -
$unique_words=array_flip(array_flip($words));             // faster than array_unique()
$targeted_words=array_intersect($targets,$unique_words);  // retain matches
$tags=implode(',',$targeted_words);                       // glue together with commas
echo $tags;

echo "\n\n";
// or as a one-liner
echo implode(',',array_intersect($targets,array_flip(array_flip(str_word_count(strtolower($input),1)))));

Output:

apples,pears,strawberries,grapes

apples,pears,strawberries,grapes

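Now about regex...

While matiaslauriti's answer may get you correct results, it makes little attempt to deliver any real efficiency gains.

I'll make two points:

Do NOT use preg_match() in a loop when preg_match_all() is specifically designed to capture multiple occurrences in a single call. (The code is provided later in this answer.)
Condense your pattern logic as much as possible...

Say you have an input like this: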

$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";

If you use this array of tags:

$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];

to generate a simple piped regex pattern such as:

/\b(?:apple|apples|pear|pears|strawberry|strawberries|grape|grapes)\b/i

the regex engine takes 677 steps to match all of the fruits in $input. (Demo)

By contrast, if you condense the tag elements with the ? quantifier like this:

\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b

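your pattern becomes brief and more efficient, giving the same expected result in just 501 steps. (Demo)

This kind of condensed pattern can be generated programmatically for simple associations (including plurals and verb conjugations).

Here is a way to handle singular/plural relationships: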

foreach($targets as $v){
    if(substr($v,-1)=='s'){                       // if tag ends in 's'
        if(in_array(substr($v,0,-1),$targets)){   // if same words without trailing 's' exists in tag list
            $condensed_targets[]=$v.'?';          // add '?' quantifier to end of tag
        }else{
            $condensed_targets[]=$v;              // add tag that is not plural (e.g. 'dress')
        }
    }elseif(!in_array($v.'s',$targets)){          // if tag doesn't end in 's' and no regular plural form
            $condensed_targets[]=$v;              // add tag with irregular pluralization (e.g. 'strawberry')
    }
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n";
// /\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i

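This technique only handles the simplest cases. You can really boost performance by reviewing the tag list, identifying related tags, and condensing them.

Running my above method to condense the piped pattern on every page load will cost your users load time. My very strong recommendation is to keep a database table of your ever-growing tags, stored as regex-ready tags. When new tags are encountered or generated, add them to the table individually and automatically. You should periodically review the ~5000 keywords and look for tags that can be merged without losing accuracy.

It may even help you maintain the table logic if you have one column for the regex pattern and another column showing a csv of the tags that the row's pattern covers: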

---------------------------------------------------------------
|  Pattern               |   Tags                             |
---------------------------------------------------------------
|  apples?               |  apple,apples                      |
---------------------------------------------------------------
|  walk(?:s|er|ed|ing)?  |  walk,walks,walker,walked,walking  |
---------------------------------------------------------------
|  strawberry            |  strawberry                        |
---------------------------------------------------------------
|  strawberries          |  strawberries                      |
---------------------------------------------------------------

To improve efficiency, you can update the table data by merging the strawberry and strawberries rows like this:

---------------------------------------------------------------
|  strawberr(?:y|ies)    |  strawberry,strawberries           |
---------------------------------------------------------------

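With such a simple improvement, if you check $input for only these two tags, the steps required drop from 59 to 40.

Because you are dealing with >5000 tags, the performance gain will be significant. This kind of refinement is best handled by a human, but you might use some programmatic techniques to identify tags that share an inner substring.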
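
One simple way to surface merge candidates for that human review is to bucket tags by a shared leading substring. This is only a rough sketch under stated assumptions: the helper group_by_prefix() and the prefix length of 4 are made up for illustration, and it only catches tags that share a prefix, not arbitrary inner substrings.

```php
<?php
// Hypothetical helper (not from the original answer): bucket tags by their
// first $prefixLength characters so a person can review each bucket.
function group_by_prefix(array $tags, int $prefixLength = 4): array
{
    $groups = [];
    foreach ($tags as $tag) {
        $groups[substr($tag, 0, $prefixLength)][] = $tag;
    }
    // Keep only buckets holding more than one tag: those are the merge candidates.
    return array_filter($groups, function (array $group): bool {
        return count($group) > 1;
    });
}

$tags = ['strawberry', 'strawberries', 'walk', 'walks', 'walked', 'grape'];
print_r(group_by_prefix($tags));
// 'stra' => strawberry, strawberries
// 'walk' => walk, walks, walked
```

The buckets are only suggestions. Two tags with the same prefix (say, "pear" and "pearl") are not necessarily related, which is why the final merge decision is better left to a person.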

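When you want to use your Pattern column values, just pull them from your database, pipe them together, and place them inside preg_match_all().

*Keep in mind that you should use non-capturing groups when condensing tags into a single pattern, because the code that follows reduces memory usage by avoiding capture groups.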
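
As a minimal sketch of that last step, assume the $patterns array below stands in for the Pattern column values already fetched from the database (the actual query is left out, and the sample rows are illustrative):

```php
<?php
// Assumption: these strings stand in for rows of the Pattern column.
$patterns = ['apples?', 'pears?', 'strawberr(?:y|ies)', 'grapes?'];

$input = "Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";

// Pipe the patterns together inside one non-capturing group between word boundaries.
$regex = '/\b(?:' . implode('|', $patterns) . ')\b/i';

// preg_match_all() returns the number of full matches; $out[0] holds the matched strings.
$found = preg_match_all($regex, $input, $out) ? $out[0] : [];

echo implode(',', $found), "\n";   // prints: apple,pear,strawberry,apples,pears,strawberries
```

Because the stored values are inserted into the regex as-is, they have to be valid pattern fragments, so anything added to the table automatically should be escaped or validated first.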

Code (Demo Link):

$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";
$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];
//echo '/\b(?:',implode('|',$targets),")\b/i\n";

// condense singulars & plurals forms using ? quantifier
foreach($targets as $v){
    if(substr($v,-1)=='s'){                       // if tag ends in 's'
        if(in_array(substr($v,0,-1),$targets)){   // if same words without trailing 's' exists in tag list
            $condensed_targets[]=$v.'?';          // add '?' quantifier to end of tag
        }else{
            $condensed_targets[]=$v;              // add tag that is not plural (e.g. 'dress')
        }
    }elseif(!in_array($v.'s',$targets)){          // if tag doesn't end in 's' and no regular plural form
            $condensed_targets[]=$v;              // add tag with irregular pluralization (e.g. 'strawberry')
    }
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n\n";

// use preg_match_all and call it just once without looping!
$tags=preg_match_all("/\b(?:".implode('|',$condensed_targets).")\b/i",$input,$out)?$out[0]:null;
echo "Found tags: ";
var_export($tags);

Output:

/\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i

Found tags: array (
  0 => 'apple',
  1 => 'pear',
  2 => 'strawberry',
  3 => 'apples',
  4 => 'pears',
  5 => 'strawberries',
)

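...if you have made it all the way through my post, you probably have a problem like the OP's and want to move forward without regrets/mistakes. Please head over to my related Code Review post for more on edge-case considerations and method logic.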

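【Discussion】:

@Morten I got an upvote on my answer today and figured I'd revisit my old post. I was able to correct my earlier patterns (they were all missing the non-capturing group inside the word boundary characters). I also added a non-regex method that may prove more efficient (non-regex is often faster than regex, but you'll have to test that claim with your actual project data). If you have any questions or concerns, just ask. p.s. My old regex method didn't say so, but you'll need to "double array_flip()" the output array and implode it with commas.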
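
The "double array_flip()" step mentioned in that comment could be sketched as follows. The $matches array is sample data standing in for $out[0] from preg_match_all(), and the lowercasing step is an added assumption so that differently cased hits collapse into one tag.

```php
<?php
// Sample match list, standing in for $out[0] from preg_match_all().
$matches = ['Apples', 'pears', 'apples', 'strawberries', 'pears'];

// Lowercase first so 'Apples' and 'apples' count as the same tag.
$lower = array_map('strtolower', $matches);

// Flipping twice drops duplicates (array keys are unique); array_values() reindexes.
$unique = array_values(array_flip(array_flip($lower)));

$tags = implode(',', $unique);
echo $tags, "\n";   // prints: apples,pears,strawberries
```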

【Solution 2】:

preg_match already stores the matches. So:

int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )

The 3rd parameter already stores the matches, so change this:

if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
}

to this:

$matches = [];
preg_match("/\b" . $t . "\b/i", $a, $matches);
$b = array_merge($b, $matches);

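However, if you are comparing words directly, the documentation recommends using strpos().

Tip
Do not use preg_match() if you only want to check whether one string is contained in another string. Use strpos() instead, as it will be faster.

Edit

If you still want to use preg_match, you can improve your code (performance-wise) by replacing this: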

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
   if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
   }
}

with this:

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);

preg_match("/\b(" . $targets . ")\b/i", $a, $matches);

Here you join all the $targets with | (pipe), so your regex looks like (target1|target2|target3|targetN), and you do a single search instead of that foreach.
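
One caveat with the piped pattern is that preg_match() stops at the first hit, whereas preg_match_all() collects every hit. The sketch below contrasts the two on the question's sample text. The preg_quote() call and the non-capturing group are additions for illustration, not part of this answer.

```php
<?php
$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = array('apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes');

// preg_quote() escapes regex metacharacters in each keyword; '/' is the delimiter used below.
$quoted = array_map(function ($t) { return preg_quote($t, '/'); }, $targets);
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

// preg_match() stops after the first hit, so $first holds one match only.
preg_match($pattern, $a, $first);

// preg_match_all() keeps scanning; every full match ends up in $all[0].
preg_match_all($pattern, $a, $all);

echo implode(',', $first), "\n";    // prints: apples
echo implode(',', $all[0]), "\n";   // prints: apples,pears,strawberries
```

This also explains why echo $b; in the question's edit is not useful: the third argument is always an array, so it has to be joined with implode() before printing.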

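【Discussion】:

Thanks. I know strpos() is faster, but it is case-sensitive. stripos isn't, but the problem with stripos is that it doesn't find exact matches, e.g. Applecider. I want this to be as fast as possible, but strpos() and stripos() won't work because of those two factors.
@Morten You could use stripos and look for the keyword with a space as prefix and suffix, e.g. look for " Apple ". That is an exact word match.
@matiaslauriti I get 2 errors from this: Notice: Undefined variable: b and Warning: array_merge(): Argument #1 is not an array.
@Morten You have to define the variable $b before the preg_match, so put $b = []; there. If you use my last edit, remove the $b, that was my mistake, or you can rename $matches to $b.
@matiaslauriti - Sorry, I missed the answer. Check my edit above.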
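
The whole-word concern raised in this discussion can be illustrated with a short sketch. The sample sentences are made up for the demonstration.

```php
<?php
$text = "I had some Applecider with lunch.";

// stripos() is case-insensitive but matches substrings, so "apple" is found inside "Applecider".
$substringHit = stripos($text, 'apple') !== false;

// The \b anchors require a word boundary on both sides, so "Applecider" is rejected.
$wholeWordHit = preg_match('/\bapple\b/i', $text) === 1;

// Padding the keyword with spaces only works when it is surrounded by spaces,
// so it misses a keyword followed by punctuation or sitting at the start/end of the text.
$paddedHit = stripos("I like apple.", ' apple ') !== false;

var_dump($substringHit, $wholeWordHit, $paddedHit);   // bool(true), bool(false), bool(false)
```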