Fastest way to search for whole words in a 20 MB flat-file database (PHP)

【Question title】: Fastest way to search for whole words in 20mb flat file database (PHP)
【Posted】: 2015-03-23 01:56:23
【Question】:

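I have a 20 MB flat-file database of about 500k lines. Only the characters [a-z0-9-] are allowed, a line averages 7 words, and there are no empty or duplicate lines:

Flat-file database: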

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

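I am searching for whole words only and extracting the first 10k results from this database.

So far this code works fine when the 10k matches are found within the first 20k lines of the database, but if the word is rare the script has to search all 500k lines, which is 10 times slower.

Setup: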

$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;

Search:

foreach($cats as $cat) {
    if(preg_match("/\b$search\b/", $cat)) {
        $cats_found[] = $cat;
        if(isset($cats_found[$limit])) break;
    }
}

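My PHP skills and knowledge are limited, and I do not know how to use SQL, so this is the best I can manage, but I need some advice:

Is this the right code? Is there a problem with foreach and preg_match?
Should I split the large file into smaller files? If so, what size?
Finally, how much faster would SQL be? (A future option.)

Thanks for reading, and sorry for the bad English; it is my third language.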

【Question discussion】:

Tags: php search flat-file

【Solution 1】:

If most lines do not contain the searched word, you can run preg_match() less often, like this:

foreach ($lines as $line) {
    // fast prefilter...
    if (strpos($line, $word) === false) {
        continue;
    }
    // ... then proper search if the line passed the prefilter
    if (preg_match("/\b{$word}\b/", $line)) {
        // found
    }
}

That said, this needs to be benchmarked under real conditions.
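For reference, the prefilter idea combined with the asker's result limit might look roughly like the sketch below. The function name findWholeWord and the sample data are illustrative and not part of the original answer; preg_quote() is added only as a defensive measure in case a search term ever contains regex metacharacters.

```php
<?php
// Hypothetical helper (not from the original answer): returns up to $limit
// lines that contain $word as a whole word.
function findWholeWord(array $lines, string $word, int $limit): array
{
    $found = [];
    // Build the pattern once, outside the loop.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';

    foreach ($lines as $line) {
        // Cheap substring prefilter: skip lines that cannot match at all.
        if (strpos($line, $word) === false) {
            continue;
        }
        // Word-boundary check only for lines that passed the prefilter.
        if (preg_match($pattern, $line) === 1) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break; // stop as soon as the limit is reached
            }
        }
    }
    return $found;
}

$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
];
print_r(findWholeWord($lines, 'end', 10000));
```

Counting results with count($found) >= $limit stops at exactly $limit lines, whereas the isset($cats_found[$limit]) check in the question stops one match later.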

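【Discussion】:

Yes, only 5 search words are contained in 30% of the lines; the other words are contained in 1 to 5000 lines. strpos is very fast, but it cannot match whole words only, which is why I use a regex with word boundaries.
I edited the comments in my code above to make this more explicit. Also, you should definitely benchmark those 5 slowest words; 30% is quite substantial.
OK, I tested adding strpos before preg_match. Searching for most words, such as "auto", is 2 times faster, because strpos finds 100 lines out of 500k and then preg_match filters those and extracts 95 lines :) Awesome.
I will accept your answer, since 95% of searches are faster, which is a big improvement :) Also, do you think it would be faster if I split the .txt file into smaller files?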
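On the point that strpos alone cannot match whole words: if the hyphen really is the only word separator (an assumption based on the [a-z0-9-] character set stated in the question), one possible regex-free check is to pad both the line and the word with hyphens before calling strpos. The helper name containsWholeWord is made up for this sketch, and whether it actually beats preg_match on the 5 frequent words would have to be benchmarked.

```php
<?php
// Hypothetical helper: whole-word test without a regex.
// Assumes words are separated only by '-' and lines contain only [a-z0-9-].
function containsWholeWord(string $line, string $word): bool
{
    // Padding with '-' makes the first and last word of the line
    // look like every other word: delimited by hyphens on both sides.
    return strpos('-' . $line . '-', '-' . $word . '-') !== false;
}

var_dump(containsWholeWord('for-linebreak-add-2-spaces-at-end', 'end'));   // whole word
var_dump(containsWholeWord('for-linebreak-add-2-spaces-at-end', 'space')); // only part of "spaces"
```

Under that assumption the padded strpos call accepts the same lines as /\bword\b/, because the only non-word positions in a line are hyphens and the two line ends.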

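【Solution 2】:

This should work for you, reading line by line, although you may still run out of memory:

(You may need to adjust memory_limit and max_execution_time in php.ini, or run it from the CLI.)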

$rFile = fopen( 'inputfile.txt', 'r' );
$iLineNumber = 0;
$sSearch = '123';
$iLimit  = 5000;
while( !feof( $rFile ) )
{
    if( $iLineNumber > $iLimit )
    {
        break;
    }
    $sLine = fgets( $rFile );
    if( preg_match("/\b$sSearch\b/", $sLine, $aMatches ) ) 
    {
        $aCats[] = $aMatches[ 0 ];
    }
    ++$iLineNumber;
}
var_dump( $aCats );

My suggestion is to reformat the file as an SQL import and use a database. Flat-file searches are noticeably slower.

File:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
123
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
123
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

Output:

array(2) {
  [0]=>
  string(3) "123"
  [1]=>
  string(3) "123"
}

It wraps an extra array around the matches, so we have to use [0].
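To give a rough idea of what the database suggestion above could look like, here is a minimal sketch using PDO with an in-memory SQLite database. It assumes the pdo_sqlite extension is enabled; the table names (lines, words) and function names are invented for illustration, and a real import would use a database file rather than sqlite::memory:. Each line is split on hyphens and every distinct word is stored in an indexed table, so a whole-word lookup becomes an index search instead of a scan. Note that this only covers single-word terms, not hyphenated phrases.

```php
<?php
// Sketch only. Assumption: pdo_sqlite is available. All names are illustrative.
function buildWordIndex(PDO $pdo, iterable $lines): void
{
    $pdo->exec('CREATE TABLE lines (id INTEGER PRIMARY KEY, line TEXT NOT NULL)');
    $pdo->exec('CREATE TABLE words (word TEXT NOT NULL, line_id INTEGER NOT NULL)');

    $insertLine = $pdo->prepare('INSERT INTO lines (line) VALUES (?)');
    $insertWord = $pdo->prepare('INSERT INTO words (word, line_id) VALUES (?, ?)');

    // One transaction for the whole import keeps bulk inserts fast.
    $pdo->beginTransaction();
    foreach ($lines as $line) {
        $insertLine->execute([$line]);
        $lineId = (int) $pdo->lastInsertId();
        // Store each distinct word of the line once.
        foreach (array_unique(explode('-', $line)) as $word) {
            $insertWord->execute([$word, $lineId]);
        }
    }
    $pdo->commit();

    // The index is what makes whole-word lookups cheap.
    $pdo->exec('CREATE INDEX idx_words_word ON words (word, line_id)');
}

function searchWord(PDO $pdo, string $word, int $limit): array
{
    $stmt = $pdo->prepare(
        'SELECT l.line FROM words w JOIN lines l ON l.id = w.line_id
         WHERE w.word = ? ORDER BY l.id LIMIT ?'
    );
    $stmt->bindValue(1, $word);
    $stmt->bindValue(2, $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
buildWordIndex($pdo, [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
]);
print_r(searchWord($pdo, 'end', 10000));
```

How much faster this is than the flat-file scan depends on the data and the server, so it would need to be measured rather than assumed.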

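【Discussion】:

var_dump( $aCats ); gives me array(3) { [0]=> array(1) { [0]=> string(3) "123" } [1]=> array(1) { [0]=> string(3) "123" } [2]=> array(1) { [0]=> string(3) "123" } }, and I can confirm there are at least 50 matches for 123.
You need to change the search to what you want.
If I put just one line, 123-test, in inputfile.txt and search for 123, the result should be 123-test because that is the line. Right now I get array(1) { [0]=> array(1) { [0]=> string(3) "123" } }
Oh my, you have a bug: it should be $aCats[] = $sLine; :) :)
That is right, since it loads line by line. This approach is ideal for large files that might otherwise exhaust PHP's memory, but it is less efficient than loading the whole file into memory.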
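Putting the correction from this discussion together, a line-by-line version that stores the whole matching line and counts matches (rather than lines read) against the limit might look like the sketch below. The function name searchStream is hypothetical, and a php://memory stream stands in for inputfile.txt so the example is self-contained.

```php
<?php
// Hypothetical helper: reads from an open stream line by line and returns
// up to $limit full lines that contain $word as a whole word.
function searchStream($handle, string $word, int $limit): array
{
    $found = [];
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';

    // Stop when enough matches were collected or fgets() hits end of file.
    while (count($found) < $limit && ($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n"); // drop the trailing newline
        if (preg_match($pattern, $line) === 1) {
            $found[] = $line; // keep the whole line, not just the matched text
        }
    }
    return $found;
}

// In-memory stream used in place of a real file for demonstration.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "put-returns-between-paragraphs\n123\n123-test\nfor-linebreak-add-2-spaces-at-end\n");
rewind($handle);
print_r(searchStream($handle, '123', 5000));
fclose($handle);
```

Checking the return value of fgets() directly also avoids the extra iteration that a while( !feof() ) loop can make after the last line has been read.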