Find the most used words in both text files: answers

【Question title】: Find the most used words in both text files
【Posted】: 2013-04-03 16:33:35
【Question】:

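I have two txt files that both use the same words many times. I managed to read them both into arrays, and I formatted the one unformatted txt file with an insertion sort.

Now I need to compare the two formatted arrays to find the most used words and how many times each is used.

I know I can use a for loop to go through each array, but I'm not sure how. Any help?

Edit: Here is what I have so far.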

#include<iostream>
#include<fstream>
#include<string>
#include<cstdlib>
using namespace std;

const int size = 100;
void checkIF(string x)
{
    fstream infile;
    cout << "Attempting to open ";
    cout << x;
    cout << "\n";
    infile.open(x);
    if( !infile )
    {
        cout << "Error: File couldn't be opened.\n";
    }
    else
    {
        cout << "File opened successfully.\n";
    }
}
void checkFile()
{
    string f1 = "text1.txt", f2 = "abbreviations.txt";
    checkIF(f1);
    checkIF(f2);
}

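string* readFiles(string txt1[],string abb[])
{
    fstream intxt1("text1.txt");
    fstream inabb("abbreviations.txt");
    int i = 0;
    // Stop when the array is full or extraction fails
    // (testing eof() before reading processes one bogus extra word)
    while (i < size && intxt1 >> txt1[i])
    {
        //cout << txt1[i];
        i++;
    }
    // Use a separate index so abb is filled from position 0
    int j = 0;
    while (j < size && inabb >> abb[j])
    {
        //cout << abb[j];
        j++;
    }

    // Only one return can execute; both arrays are filled in place
    return txt1;
}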

string* insertionSort(string txt1[], int arraySize)
{
    int i, j;
    string insert;

    for (i = 1; i < arraySize; i++)
    {
        insert = txt1[i];
        j = i;
        while ((j > 0) && (txt1[j - 1] > insert))
        {
            txt1[j] = txt1[j - 1];
            j = j - 1;
        }
        txt1[j] = insert;
    }
    return txt1;
}


void compare(string txt1[],string abb[])
{

}

int main()
{
    string txt1Words[size];
    string abbWords[size];
    checkFile();
    readFiles(txt1Words,abbWords);
    insertionSort(txt1Words,100);
    compare(txt1Words,abbWords);
    system("Pause");
}

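【Comments on the question】:

Show us what you have tried. You are more likely to get an answer if you do.
Starting to fix your code from what you have provided is not an effective use of time for any of us.

Tags: c++ arrays string file fstream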

【Solution 1】:

Maybe you should start with a hash map that maps each word to the number of times it is used.
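A minimal sketch of that idea, assuming the standard library's `std::unordered_map` is the hash map in question and that words are separated by whitespace. The helper name `countWords` is hypothetical and not part of the original answer.

```cpp
#include <istream>
#include <sstream>   // only needed if you feed the helper a std::istringstream
#include <string>
#include <unordered_map>

// Hypothetical helper: counts how often each whitespace-separated word
// occurs in the given input stream.
std::unordered_map<std::string, int> countWords(std::istream& in)
{
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (in >> word)      // the loop ends when extraction fails (end of input)
    {
        ++counts[word];     // operator[] creates a missing entry with the value 0
    }
    return counts;
}
```

The same helper should accept an opened `std::ifstream` for `text1.txt` or `abbreviations.txt`, since both stream types derive from `std::istream`.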

【Comments】:

How would I implement that? I'm not familiar with hash maps.

【Solution 2】:

Use vectors instead of arrays.

Not

string txt1Words[size];

but

vector<string> txt1Words;

Then you can simply use

std::count(txt1Words.begin(), txt1Words.end(), word_to_search);
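As a rough sketch of how that call might be used in a loop, the helper below runs `std::count` once per abbreviation and adds up the results. The name `countAbbreviationUses` and its parameters are assumptions made for illustration; they do not come from the original answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: total number of times any entry of `abbreviations`
// occurs in `words`.
std::size_t countAbbreviationUses(const std::vector<std::string>& words,
                                  const std::vector<std::string>& abbreviations)
{
    std::size_t total = 0;
    for (const std::string& abb : abbreviations)
    {
        // std::count scans the whole vector for exact matches of abb
        total += static_cast<std::size_t>(
            std::count(words.begin(), words.end(), abb));
    }
    return total;
}
```

Each call to `std::count` walks the entire vector, so this costs roughly (number of abbreviations) × (number of words) comparisons, and an abbreviation listed twice would be counted twice. For two small files that is probably acceptable; a map of counts avoids the repeated scans.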

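【Comments】:

If I implement a vector, how can I then write a loop to find the number of abbreviations used in the txt1 file?
How many abbreviations? I don't understand.
How do I write a loop to find out how many of the items in the abb array are in the txt1 array?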

【Solution 3】:

You can use a map that holds an entry for each word found.

std::map<std::string, int> wordmap;

for (int i = 0; i < arraylength; ++i)
{
   ++wordmap[array[i]];
}

I assume array is an array of std::string. After that, you can query the map with a specific word and get the count for that word.

wordmap[word] // returns count for word
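One caveat worth sketching: `wordmap[word]` inserts a new entry with the value 0 when the word is not in the map yet, so it cannot be used on a `const` map and it grows the map on every miss. The hypothetical helper `countOf` below (not part of the original answer) looks a word up with `find` instead.

```cpp
#include <map>
#include <string>

// Hypothetical helper: returns the count stored for `word`, or 0 if the word
// was never seen. Unlike wordmap[word], it does not add an entry to the map.
int countOf(const std::map<std::string, int>& wordmap, const std::string& word)
{
    auto it = wordmap.find(word);
    return it == wordmap.end() ? 0 : it->second;
}
```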

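【Comments】:

【Solution 4】:

First, let's address the question of "the most used words in both text files". It really depends on how you define most used. You basically have two sets of words with counts.

For example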

File A: "apple apple apple apple apple banana"

File B: "apple apple banana orange orange orange orange orange"

If you store it as a set of names and counts, you get

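File A: {("apple",5), ("banana",1)}

File B: {("apple",2), ("banana",1), ("orange",5)}

Note: this is not code, it is just a modeling notation.

So in this small example, what is the most used word across the two files? The question is whether "apple" should be the most used because it appears in both files, or whether "orange" should be the most used because it is used the most in one of the files.

I assume you want some kind of intersection of the two sets, so that only words appearing in both files count. Also, if I were you, I would rank each word by the minimum of its two counts, so that the 5 occurrences of "apple" in file A do not weigh "apple" too heavily, given that it appears only twice in file B.

So if we write that in code, you get something like this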

class Word
{
public:
    std::string Token;
    int Count;

    Word (const std::string &token, int count)
        : Token(token), Count(count) {}
};

and

    std::map<std::string, int> FileA;
    std::map<std::string, int> FileB;

    std::vector<Word> intersection;

    for (auto i = FileA.begin(); i != FileA.end (); ++i)
    {
        auto bentry = FileB.find (i->first); //Look up the word from A in B
        if (bentry == FileB.end ())
        {
            continue; //The word from file A was not in file B, try the next word
        }

        //We found the word from A in B
        intersection.push_back(Word (i->first,std::min(i->second,bentry->second))); //You can replace the std::min call with whatever method you want to use to quantify "most common"
    }

    //Now sort the intersection by Count
    std::sort (intersection.begin(),intersection.end(), [](const Word &a, const Word &b) { return a.Count > b.Count;});

    for (auto i = intersection.begin (); i != intersection.end (); ++i)
    {
        std::cout << (*i).Token << ": " << (*i).Count << std::endl;
    }

See it run: http://ideone.com/jbPm1g

Hope this helps.

【Comments】:
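The fragment above leaves `FileA` and `FileB` empty, so here is a hedged sketch of how the two maps might be filled and ranked end to end. The helper names `countWordsInText` and `rankCommonWords` are hypothetical, and a `std::pair` stands in for the `Word` class to keep the example short.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: word -> count for a whitespace-separated text.
std::map<std::string, int> countWordsInText(const std::string& text)
{
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
    {
        ++counts[word];
    }
    return counts;
}

// Hypothetical helper: words present in both maps, scored by the smaller of
// the two counts and ordered from highest to lowest score.
std::vector<std::pair<std::string, int>> rankCommonWords(
    const std::map<std::string, int>& a,
    const std::map<std::string, int>& b)
{
    std::vector<std::pair<std::string, int>> ranked;
    for (const auto& entry : a)
    {
        auto match = b.find(entry.first);   // look up the word from a in b
        if (match != b.end())
        {
            ranked.emplace_back(entry.first,
                                std::min(entry.second, match->second));
        }
    }
    // stable_sort keeps alphabetical order among words with equal scores
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<std::string, int>& x,
                        const std::pair<std::string, int>& y)
                     { return x.second > y.second; });
    return ranked;
}
```

Fed with a file A text containing five "apple" and one "banana" and the file B text from the example, `rankCommonWords` ranks "apple" first with a score of 2 and "banana" second with a score of 1; "orange" is dropped because it appears in only one file.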