【Question title】: Elegant ways to count the frequency of words in a file
【Posted】: 2011-02-03 16:40:02
【Question description】:

What are elegant and efficient ways to count the frequency of each "English" word in a file?

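【Question discussion】:

Define "word". Do you mean "an English word", "an unbroken sequence of alphabetic characters", "an unbroken sequence of characters", or something else?
For what purpose? Just for fun?
Again, what does "English" mean? Actual English words, or sequences that match [A-Za-z]+? What about hyphenated words or other words containing punctuation?
Do contractions and possessives matter? For example, can't and The cat's toy.
Does a sequence of letters have to be a valid English word? For example, a is a valid word, but t is not.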

Tags: c++ file-io

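【Solution 1】:

First, I define a letter_only facet for std::locale so that punctuation coming from the stream is ignored and only valid "English" letters are read from the input stream. That way, the stream treats "ways", "ways." and "ways!" as the same word "ways", because it skips punctuation such as "." and "!".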

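#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only: std::ctype<char> 
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        // Every character starts out classified as whitespace.
        static std::vector<std::ctype_base::mask> 
            rc(std::ctype<char>::table_size,std::ctype_base::space);

        // Mark A-Z and a-z separately; a single 'A'..'z' range would also
        // classify [ \ ] ^ _ ` as letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};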
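To see the effect of such a facet without touching a file, the same idea can be tried on an in-memory string. The following is only a sketch: the names letters_only_facet and count_words_in are hypothetical, and it assumes ASCII input, with A-Z and a-z marked as letters and every other character treated as a separator.

```cpp
#include <algorithm>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Facet that classifies only A-Z and a-z as letters; everything else
// (digits, punctuation, control characters) is treated as whitespace.
struct letters_only_facet : std::ctype<char>
{
    letters_only_facet() : std::ctype<char>(get_table()) {}

    static const std::ctype_base::mask* get_table()
    {
        static std::vector<std::ctype_base::mask>
            table(std::ctype<char>::table_size, std::ctype_base::space);
        // Two separate ranges, so the characters between 'Z' and 'a'
        // in ASCII ([ \ ] ^ _ `) remain separators.
        std::fill(table.begin() + 'A', table.begin() + 'Z' + 1, std::ctype_base::alpha);
        std::fill(table.begin() + 'a', table.begin() + 'z' + 1, std::ctype_base::alpha);
        return &table[0];
    }
};

// Hypothetical helper: counts the words found in an in-memory string.
std::map<std::string, int> count_words_in(const std::string& text)
{
    std::istringstream input(text);
    // The locale takes ownership of the facet and deletes it when done.
    input.imbue(std::locale(std::locale(), new letters_only_facet()));

    std::map<std::string, int> counts;
    std::string word;
    while (input >> word)
        ++counts[word];
    return counts;
}
```

Calling count_words_in("ways ways. ways!") yields a single entry for "ways". Note that this sketch does not fold case, so "Ways" and "ways" would still be counted separately.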

Solution 1

int main()
{
     std::map<std::string, int> wordCount;
     std::ifstream input;
     input.imbue(std::locale(std::locale(), new letter_only())); // treat everything except letters as whitespace
     input.open("filename.txt");
     std::string word;
     while(input >> word)
     {
         ++wordCount[word];
     }
     for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
     {
           std::cout << it->first <<" : "<< it->second << std::endl;
     }
}

Solution 2

struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string & item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
     std::ifstream input;
     input.imbue(std::locale(std::locale(), new letter_only())); // treat everything except letters as whitespace
     input.open("filename.txt");
     std::istream_iterator<std::string> start(input);
     std::istream_iterator<std::string> end;
     // for_each returns the functor, which converts to the map it filled
     std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
     for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
     {
          std::cout << it->first <<" : "<< it->second << std::endl;
     }
}

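【Discussion】:

However, the answers also make it clear that "a sequence of non-whitespace characters delimited by whitespace" is not the definition of "word" the OP is after.
I think this is the correct answer, since he wants the frequency of repeated words.
The input loop of the first solution is wrong. The eof flag is only set by an input operation that fails because it reached eof.
Again, this is not the correct answer. The OP is not asking for whitespace-separated words. This would treat "end_of_sentence." and "end_of_sentence!" as two different words, which is not what the OP wants.
@Nawaz: Why not just use the idiomatic while (input >> word)? As written it is still wrong, because the other flags are not checked.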

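【Solution 2】:

Perl is arguably not that elegant, but it is very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase: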
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrences of each word. Print the results sorted first by frequency, then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3GB text file containing 580,000,000 words.
Perl 5.22 finished within 3 minutes.
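Since the question is tagged c++, the ordering used in step 2 (descending frequency, ties broken alphabetically) can be sketched in C++ as well. This is an illustration, not a port of the Perl one-liner: the names WordCount, by_frequency_then_word and sorted_by_frequency are hypothetical, and the counts are assumed to already sit in a std::map.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Orders by descending count; ties are broken alphabetically.
bool by_frequency_then_word(const WordCount& a, const WordCount& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Hypothetical helper: copies the map entries into a vector and sorts them.
// A std::map is ordered by key only, so sorting by value needs a copy.
std::vector<WordCount> sorted_by_frequency(const std::map<std::string, int>& counts)
{
    std::vector<WordCount> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(), by_frequency_then_word);
    return result;
}
```

Printing each entry as count, tab, word would then give output in the same shape as the freq file above.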

【Discussion】:

【Solution 3】:

Here is a working solution. It should work with real text (including punctuation):

#include <iterator>
#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <cctype>

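std::string getNextToken(std::istream &in)
{
    std::string ans;
    // get() returns an int so that EOF stays distinct from every character;
    // storing it in a char and passing that to isalpha is not safe.
    int c=in.get();
    while(!std::isalpha(c) && in)//skip non-letter characters
    {
        c=in.get();
    }
    while(std::isalpha(c))
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c=in.get();
    }
    return ans;
}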

int main()
{
    std::map<std::string,int> words;
    std::ifstream fin("input.txt");

    std::string s;
    std::string empty ="";
    while((s=getNextToken(fin))!=empty )
            ++words[s];

    for(std::map<std::string,int>::iterator iter = words.begin(); iter!=words.end(); ++iter)
        std::cout<<iter->first<<' '<<iter->second<<std::endl;
}

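Edit: my code now calls tolower for each letter.

【Discussion】:

This undoubtedly works for English (which is what the OP asked for, I know), but not for other languages. It also won't work if there are digits in the input text.
@Baltasarq The question asks about "English" words. isalpha also doesn't return true for digits.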

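【Solution 4】:

My solution is as follows. First, all symbols are converted to spaces. Then, words are extracted using essentially the same approach as the solutions given earlier: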

#include <fstream>
#include <map>
#include <sstream>
#include <string>

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
typedef std::map<std::string, unsigned int> WCCollection;

void countWords(const std::string &fileName, WCCollection &wcc)
{
    std::ifstream input( fileName.c_str() );

    if ( input.is_open() ) {
        std::string line;
        std::string word;

        while( std::getline( input, line ) ) {
            // Substitute punctuation symbols with spaces
            // (a non-const iterator is needed to modify the line in place)
            for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if ( Symbols.find( *it ) != std::string::npos ) {
                    *it = ' ';
                }
            }

            // Let std::operator>> separate by spaces
            std::istringstream filter( line );
            while( filter >> word ) {
                ++( wcc[word] );
            }
        }
    }
}

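【Discussion】:

I have improved the algorithm and fixed some minor bugs.

【Solution 5】:

Pseudocode for an algorithm that I believe is close to what you want: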

counts = defaultdict(int)
for line in file:
  for word in line.split():
    if any(x.isalpha() for x in word):
      counts[word.upper()] += 1

freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
  print("%d\t%s" % (count, word))

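Case-insensitive comparison is handled naively here, and it probably combines words you would not want combined in an absolutely general sense. Watch out for non-ASCII characters when implementing the above. Depending on what you need, false positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km". Missed words include "911 emergency services" (which I would probably want counted as three words).

In short, natural language parsing is hard: you can probably get by with some approximation that fits your actual use case.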
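For readers who want the counting step of the pseudocode in C++, a minimal sketch follows. The helper names has_letter, to_upper and count_tokens are hypothetical, and the code assumes ASCII text: it splits on whitespace, keeps only tokens that contain at least one letter, and upper-cases them before counting.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// True if the token contains at least one alphabetic character.
bool has_letter(const std::string& token)
{
    for (std::string::size_type i = 0; i < token.size(); ++i)
        if (std::isalpha(static_cast<unsigned char>(token[i])))
            return true;
    return false;
}

// Upper-cases a token byte by byte (only meaningful for ASCII text).
std::string to_upper(std::string token)
{
    for (std::string::size_type i = 0; i < token.size(); ++i)
        token[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(token[i])));
    return token;
}

// Counts whitespace-separated tokens that contain at least one letter.
std::map<std::string, int> count_tokens(std::istream& in)
{
    std::map<std::string, int> counts;
    std::string token;
    while (in >> token)
        if (has_letter(token))
            ++counts[to_upper(token)];
    return counts;
}
```

As in the pseudocode, punctuation stays attached to the token, so "end." and "end!" are counted separately, and "0xDEADBEEF" passes the letter test while "42" does not.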

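【Discussion】:

An interesting way to answer a C++ question: provide Python code and then declare it pseudocode. Considering that this uses a type from the Python stdlib without importing it, as well as comprehensions, and that any C++ person reading this has to guess a lot, I'm surprised it got upvoted. Maybe this is a secret experiment to see how many C++ programmers can be silently turned into Python lovers without noticing?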

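【Solution 6】:

1) Decide exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust them!"), whether capitalization matters, and so on.
2) Create a set of test cases, so you can be sure all the decisions you made in step 1 are handled correctly.
3) Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on std::istream's extraction operator (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.
4) Choose a data structure for holding the words and counts. In modern C++, you would probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.
5) Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.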
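The steps above can be sketched as follows. The word definition used here is an assumption made purely for illustration: a word is a run of ASCII letters, an apostrophe or hyphen is kept only when the next character is a letter, and the standard form is lowercase. The names next_word and histogram are hypothetical, and std::unordered_map requires C++11.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Tokenizer (step 3): reads the next word and returns it in lowercase,
// or returns an empty string once the input is exhausted.
std::string next_word(std::istream& in)
{
    int c = in.get();
    while (in && !std::isalpha(c))  // skip separators before the word
        c = in.get();

    std::string word;
    // Accept letters, plus an apostrophe or hyphen that is followed by a letter.
    while (in && (std::isalpha(c) ||
                  ((c == '\'' || c == '-') && std::isalpha(in.peek()))))
    {
        word.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return word;
}

// Steps 4 and 5: pull words from the tokenizer and count them.
std::unordered_map<std::string, unsigned> histogram(std::istream& in)
{
    std::unordered_map<std::string, unsigned> counts;
    for (std::string word = next_word(in); !word.empty(); word = next_word(in))
        ++counts[word];
    return counts;
}
```

Under this definition "Don't" and "don't." both count as "don't", "able-bodied" is one word, and a free-standing " - " is not a word. A different answer to step 1 would only change next_word; the counting loop stays the same.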

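【Discussion】:

【Solution 7】:

A simpler way is to count the number of spaces in the file until more than one space is found, if you only consider single spaces between words...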

【Discussion】:

【Solution 8】:

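#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Returns the most frequently used whitespace-separated token in the file.
std::string mostCommon( const std::string &filename ) {

    std::ifstream input( filename.c_str() );
    std::string mostFreqUsedWord;
    std::string token;
    std::map< std::string, int > wordFreq;
    int highestCount = 0;

    if ( input.is_open() ) {

        while ( input >> token ) {
            // Track the running maximum instead of looking up
            // wordFreq[ mostFreqUsedWord ], which would insert an empty key.
            int count = ++wordFreq[ token ];
            if ( count > highestCount ) {
                highestCount = count;
                mostFreqUsedWord = token;
            }
        }
    } else {
        std::cout << "Unable to open file." << std::endl;
    }
    return mostFreqUsedWord;
}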
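An alternative is to build the whole frequency map first and pick the maximum afterwards with std::max_element. The sketch below reads from any std::istream rather than a file name; the names Frequencies, less_frequent and most_common are hypothetical.

```cpp
#include <algorithm>
#include <istream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, int> Frequencies;

// Compares two map entries by their counts only.
bool less_frequent(const Frequencies::value_type& a, const Frequencies::value_type& b)
{
    return a.second < b.second;
}

// Returns the most frequent whitespace-separated token, or "" for empty input.
std::string most_common(std::istream& in)
{
    Frequencies freq;
    std::string token;
    while (in >> token)
        ++freq[token];

    if (freq.empty())
        return std::string();
    // max_element returns the first maximum; because the map iterates in
    // key order, ties go to the alphabetically smallest word.
    return std::max_element(freq.begin(), freq.end(), less_frequent)->first;
}
```

The tie-breaking differs from the function above, where the word that reaches the highest count first wins.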

【Discussion】: