【Question title】: Calculate word frequency in multiple files
【Posted】: 2012-08-09 06:11:02
【Question】:
<?php

$filename = "largefile.txt";

/* get content of $filename in $content */
$content = strtolower(file_get_contents($filename));

/* split $content into array of substrings of $content i.e wordwise */
$wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY);

/* "stop words", filter them */
$filteredArray = array_filter($wordArray, function($x){
    return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x);
});

/* get associative array of values from $filteredArray as keys and their frequency count as value */
$wordFrequencyArray = array_count_values($filteredArray);

/* Sort array from higher to lower, keeping keys */
arsort($wordFrequencyArray);

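This is the code I wrote to find the frequency of each distinct word in a file. It works.

Now suppose there are 10 text files. I want to count each word's frequency across all 10 files. For example, the frequency of the word "stack" should be the total number of times "stack" appears in all of the files combined. I then want to do the same for every distinct word.

I have done this for a single file, but I don't know how to extend it to multiple files. Thanks for your help, and sorry for my bad English.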

【Comments on the question】:

  • Have you tried wrapping the whole thing in a loop that runs once per file?

Tags: php word word-frequency


【Solution 1】:

Put what you have into a function, and call it in a foreach loop for each filename in an array:

<?php

$wordFrequencyArray = array();

/* The totals array is passed by reference so every call adds to the same array.
   (A "use" clause is only valid on anonymous functions, not on named ones.) */
function countWords($filename, array &$wordFrequencyArray) {
    /* get content of $filename in $content */
    $content = strtolower(file_get_contents($filename));

    /* split $content into array of substrings of $content i.e wordwise */
    $wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY);

    /* "stop words", filter them */
    $filteredArray = array_filter($wordArray, function($x){
        return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x);
    });

    /* add this file's word counts to the running totals */
    foreach (array_count_values($filteredArray) as $word => $count) {
        if (!isset($wordFrequencyArray[$word])) $wordFrequencyArray[$word] = 0;
        $wordFrequencyArray[$word] += $count;
    }
}

$filenames = array('file1.txt', 'file2.txt', 'file3.txt', 'file4.txt' /* ... */);
foreach ($filenames as $file) {
    countWords($file, $wordFrequencyArray);
}

/* Sort array from higher to lower, keeping keys (as in the question) */
arsort($wordFrequencyArray);

print_r($wordFrequencyArray);
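
A variant of the same idea, for anyone who would rather not share one array between calls: each file's counts are returned as a value and merged into a running total. This is only a sketch, assuming PHP 7 or later (for the type declarations and the `??` operator). The function names `countWordsInText`, `mergeCounts` and `countWordsInFiles` are made up for illustration, the stop-word list is copied from the regex in the question, and skipping unreadable files is my own assumption about what is wanted.

```php
<?php

/* Count the words in one string of text, ignoring stop words.
   Returns an associative array of word => count. */
function countWordsInText(string $text): array
{
    /* stop words taken from the regex in the question */
    $stopWords = array('a', 'an', 'and', 'the', 'this', 'at', 'in', 'or', 'of', 'is', 'for', 'to');

    /* split on anything that is not a lowercase letter */
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    /* drop single-character tokens and stop words */
    $words = array_filter($words, function ($w) use ($stopWords) {
        return strlen($w) > 1 && !in_array($w, $stopWords, true);
    });

    return array_count_values($words);
}

/* Add the counts in $counts to the running $totals and return the result. */
function mergeCounts(array $totals, array $counts): array
{
    foreach ($counts as $word => $count) {
        $totals[$word] = ($totals[$word] ?? 0) + $count;
    }
    return $totals;
}

/* Total word frequencies over several files, most frequent first.
   Assumption: files that cannot be read are silently skipped. */
function countWordsInFiles(array $filenames): array
{
    $totals = array();
    foreach ($filenames as $file) {
        if (!is_readable($file)) {
            continue;
        }
        $totals = mergeCounts($totals, countWordsInText(file_get_contents($file)));
    }
    arsort($totals);
    return $totals;
}
```

Calling `countWordsInFiles($filenames)` with the array of filenames from above should give the combined totals, and `$totals['stack'] ?? 0` would then be the number of times "stack" appears in all files together. Because nothing here depends on a shared variable, `countWordsInText` can also be tried on a plain string without touching the disk.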

【Comments】:
