Count words from a text file of unknown but massive size in C using linked list: answers

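【Question title】: Count words from a text file of unknown but massive size in C using linked list
【Posted】: 2015-11-25 17:34:11
【Question】:

I know how to count words. I'm using a linked list of structs containing the word and the count. It works on small files but requires that I define the max text length. Text files could exceed multiple gigabytes for all I know. How do I change this so it doesn't need #define MAX_TEXT_LENGTH? Should I use malloc(), and if so, what should I apply malloc() to? The end goal is to then sort everything alphabetically and print the words by frequency, but that should be easy once I have read in the words and have my counts.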

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAX_WORD 512
#define MAX_TEXT_LENGTH 10000

typedef struct word
{
    char *str;              /* Stores the word */
    int freq;               /* Stores the frequency */
    struct word *pNext;     /* Pointer to the next word counter in the list */
} Word;

// ===========================================
//             FUNCTION PROTOTYPES
//============================================

int getNextWord(FILE *fp, char *buf, int bufsize);   /* Given function to get words */
void addWord(char *pWord);                          /* Adds a word to the list or updates existing word */
void show(Word *pWordcounter);        /* Outputs a word and its count of occurrences */
Word* createWordCounter(char *word);  /* Creates a new WordCounter structure */

// ===========================================
//             GLOBAL VARIABLES
//============================================

Word *pStart = NULL;                  /* Pointer to first word counter in the list */
int totalcount = 0;                  /* Total amount of words */
int uniquecount = 0;                /* Amount of unique words */


// ===========================================
//                 MAIN
//============================================      

int main (int argc, char *argv[]) {

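    FILE * fp;          /* File pointer */
    char buf[MAX_WORD];     /* buf to hold the words */
    int size = MAX_TEXT_LENGTH; /* Size */

    /* Read text from here; stop early if no file was given or it cannot be opened */
    if (argc < 2 || (fp = fopen(argv[1], "r")) == NULL)
    {
        fprintf(stderr, "usage: %s <textfile>\n", argc > 0 ? argv[0] : "wordcount");
        return 1;
    }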

    Word *pCounter = NULL;  /* Pointer to Word counter */

    /* Read all words from text file */
    while (getNextWord(fp, buf, size))
    {
        /* Add the word to the list */
        addWord(buf); 
        /* Increment the total words counter */
        totalcount++;
    }

    /* Loop through list and figure out the number of unique words */
    pCounter = pStart;
    while(pCounter != NULL)
    {
        uniquecount++;
        pCounter = pCounter->pNext;
    }

    /* Print Summary */
    printf("\nSUMMARY:\n\n");
    printf("   %d words\n", totalcount);    /* Print total words */
    printf("   %d unique words\n", uniquecount); /* Print unique words */

    /* List the words and their counts */
    pCounter = pStart;
    while(pCounter != NULL)
    {
        show(pCounter);
        pCounter = pCounter->pNext;
    }
    printf("\n");

    /* Free the allocated  memory*/
    pCounter = pStart;
    while(pCounter != NULL)
    {
        free(pCounter->str);        
        pStart = pCounter;           
        pCounter = pCounter->pNext;  
        free(pStart);                  
    }
    /* Close file */
    fclose(fp);
    return 0;
}

// ===========================================
//                 FUNCTIONS
//============================================

void show(Word *pWordcounter)
{
    printf("\n%-30s   %5d", pWordcounter->str,pWordcounter->freq);
}

void addWord(char *word)
{
    Word *pCounter = NULL;
    Word *pLast = NULL;

    if(pStart == NULL)
    {
        pStart = createWordCounter(word);
        return;
    }

    /* If the word is in the list, increment its count */
    pCounter = pStart;
    while(pCounter != NULL)
    {
        if(strcmp(word, pCounter->str) == 0)
        {
            ++pCounter->freq;
            return;
        }
        pLast = pCounter;
        pCounter = pCounter->pNext;
    }

    /* Word is not in the list, add it */
    pLast->pNext = createWordCounter(word);
}

Word* createWordCounter(char *word)
{
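    Word *pCounter = malloc(sizeof(Word));
    char *copy = malloc(strlen(word)+1);
    if (pCounter == NULL || copy == NULL)   /* Out of memory: stop instead of dereferencing NULL */
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    strcpy(copy, word);
    pCounter->str = copy;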
    pCounter->freq = 1;
    pCounter->pNext = NULL;
    return pCounter;
}

int getNextWord(FILE *fp, char *buf, int bufsize) {
    char *p = buf;
    int c;      /* int, not char, so EOF can be told apart from every real character */

    //skip all non-word characters
    do
    {
        c = fgetc(fp);
        if (c == EOF) 
            return 0;
    } while (!isalpha(c));

    //read word chars
    do
    {
        if (p - buf < bufsize - 1)
            *p++ = tolower(c);
        c = fgetc(fp);
    } while (isalpha(c));

    //finalize word
    *p = '\0';
    return 1;
}

【Question comments】:

What's with the downvotes... did I ask a duplicate question or something?

Tags: c memory text count linked-list

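【Solution 1】:

Elaborating on the error John pointed out in his answer... and adding some comments on your overall question...

> I know how to count words. I'm using a linked list of structs
> containing the word and the count. It works on small files
> but requires that I define the max text length.
> Text files could exceed multiple gigabytes for all I know.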

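Your code looks solid; apart from the point below I wouldn't change much. I believe you do have to put an upper limit on the size of any particular "word". But the overall size of the file? What you've written is fine; it is limited only by memory.

True, a text file can be multiple gigabytes (or even larger). But your code already appears to handle an unlimited number of UNIQUE words.

BTW: I like that you lowercase your words; it minimizes the list size and finds more "common" words.

Your MAX_WORD size is already 512 characters. Suggestion 1: consider counting how many times (if any) the MAX_WORD size is exceeded, and print that statistic at the end of the run.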
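
A minimal sketch of Suggestion 1, assuming a variant of getNextWord() is acceptable. The names getNextWordCounted and truncatedcount are hypothetical and not part of the original program; the reading logic otherwise follows the question's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define MAX_WORD 512

/* Hypothetical counter (not in the original program): number of words
   that were longer than the buffer and had to be truncated. */
static long truncatedcount = 0;

/* Variant of getNextWord() that notes when a word did not fit.
   Returns 1 if a word was stored in buf, 0 at end of file. */
int getNextWordCounted(FILE *fp, char *buf, int bufsize)
{
    char *p = buf;
    int c;               /* int, so EOF is distinguishable from any char */
    int truncated = 0;

    assert(fp != NULL && buf != NULL && bufsize > 1);

    /* Skip all non-word characters */
    do {
        c = fgetc(fp);
        if (c == EOF)
            return 0;
    } while (!isalpha(c));

    /* Read word characters; store only what fits in buf */
    do {
        if (p - buf < bufsize - 1)
            *p++ = (char)tolower(c);
        else
            truncated = 1;
        c = fgetc(fp);
    } while (c != EOF && isalpha(c));

    /* Finalize word */
    *p = '\0';
    if (truncated)
        truncatedcount++;
    return 1;
}
```

In the read loop this would be called as getNextWordCounted(fp, buf, MAX_WORD), and truncatedcount could be printed in the summary next to the total and unique counts.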

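> How do I change this so it doesn't need #define MAX_TEXT_LENGTH?
As for MAX_TEXT_LENGTH, I think it is a bug. John mentions this in his answer too. Read on... :-)

> Should I use malloc(), and if so, what should I apply malloc() to?
I don't see a need for any more malloc(); you already have a nice self-growing linked list.

Suggestion 2: just delete MAX_TEXT_LENGTH; I can't see where you need it. In fact, the original code looks like it allows a buffer overflow of your "buf" variable (this would be the "bug" part).

More specifically, "buf" has a capacity of only MAX_WORD (512), but your original code tells getNextWord() to use MAX_TEXT_LENGTH (10000), which is much (much) larger than MAX_WORD.

Consider modifying your code to look like this: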

/* Read all words from text file */

/* original: */
/* while (getNextWord(fp, buf, size)) */
/* NOTE: remember size was originally MAX_TEXT_LENGTH (error?). */

/* You could just use MAX_WORD here and delete "size" while you're at it. */
while (getNextWord(fp, buf, MAX_WORD))
{
    /* Add the word to the list */
    addWord(buf); 
    /* Increment the total words counter */
    totalcount++;
}

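> The end goal is to then sort everything alphabetically and print
> the words by frequency, but that should be easy once I have read in
> the words and have my counts.
Re: sorting, for a "stretch goal", see Suggestion 4 below.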
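
For the print-by-frequency half of that goal, one possible sketch is to sort an array of pointers into the finished list with qsort(), leaving the list itself untouched. The names byFreqThenAlpha and sortedView are hypothetical, and breaking ties alphabetically is an assumption about the desired output, not something the question specifies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct word {
    char *str;              /* the word */
    int freq;               /* its frequency */
    struct word *pNext;     /* next node in the list */
} Word;

/* qsort comparator: higher frequency first, ties broken alphabetically. */
static int byFreqThenAlpha(const void *a, const void *b)
{
    const Word *wa = *(const Word *const *)a;
    const Word *wb = *(const Word *const *)b;

    if (wa->freq != wb->freq)
        return (wa->freq < wb->freq) - (wa->freq > wb->freq);
    return strcmp(wa->str, wb->str);
}

/* Builds a sorted array of pointers to the list nodes.
   count must equal the number of nodes (uniquecount in the question).
   Returns NULL if count is 0 or memory runs out.
   The caller frees the array, not the nodes. */
Word **sortedView(Word *pStart, size_t count)
{
    Word **view;
    size_t i = 0;

    if (count == 0)
        return NULL;
    view = malloc(count * sizeof *view);
    if (view == NULL)
        return NULL;

    for (; pStart != NULL && i < count; pStart = pStart->pNext)
        view[i++] = pStart;
    assert(i == count);     /* caller promised an accurate count */

    qsort(view, count, sizeof *view, byFreqThenAlpha);
    return view;
}
```

After the unique-word count is known, the array returned by sortedView(pStart, uniquecount) could be walked with the existing show() function and then released with free().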

Suggestion 3: for fun, you might also want to print a frequency chart of word lengths at the end. For example:

Freq  : Word Length
 4851 :       1
  205 :       2
  104 :       3
...etc...  
    1 :     406
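
A chart like the one above could be produced with a small table indexed by word length. This is only a sketch: lengthfreq, recordLength and showLengths are hypothetical names, and it assumes every occurrence is counted (not only unique words), which is what the large count for one-letter words in the example suggests.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 512

/* Hypothetical table: lengthfreq[n] counts the words of length n.
   Index 0 stays unused; a MAX_WORD buffer caps lengths at MAX_WORD - 1. */
static long lengthfreq[MAX_WORD];

/* Call once per word read, e.g. next to totalcount++ in the read loop. */
void recordLength(const char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);
    if (len >= MAX_WORD)        /* defensive; cannot happen with a MAX_WORD buffer */
        len = MAX_WORD - 1;
    lengthfreq[len]++;
}

/* Print the chart, skipping lengths that never occurred. */
void showLengths(FILE *out)
{
    int len;

    fprintf(out, "Freq  : Word Length\n");
    for (len = 1; len < MAX_WORD; len++)
        if (lengthfreq[len] > 0)
            fprintf(out, "%5ld : %7d\n", lengthfreq[len], len);
}
```

The table costs a fixed MAX_WORD entries regardless of file size, so it adds no meaningful memory pressure even for multi-gigabyte inputs.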

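Suggestion 4: rather than growing the list by appending each new word at the end, grow the linked list in sorted order, using strcmp() to tell when you have passed the point where the new word could be. The idea is that you avoid walking the whole list every time you add a word that belongs somewhere in the middle. This approach is more complicated, though, because you must be careful about splicing something into the middle of the list, as well as edge cases (such as handling a new insertion at the start).

However, when you reach the end of the input file, the list is already sorted, so this design approach may be worth considering.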
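
A minimal sketch of Suggestion 4. It assumes the list head is passed by address rather than kept in the global pStart, and the name addWordSorted is hypothetical. Walking a pointer to the link (instead of a pointer to the node) is one way to make insertion at the start, in the middle, and at the end the same operation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct word {
    char *str;              /* the word */
    int freq;               /* its frequency */
    struct word *pNext;     /* next node in the list */
} Word;

/* Same job as createWordCounter() in the question, but returns NULL
   instead of crashing when memory runs out. */
static Word *createWordCounter(const char *word)
{
    Word *pNew = malloc(sizeof *pNew);
    if (pNew == NULL)
        return NULL;
    pNew->str = malloc(strlen(word) + 1);
    if (pNew->str == NULL) {
        free(pNew);
        return NULL;
    }
    strcpy(pNew->str, word);
    pNew->freq = 1;
    pNew->pNext = NULL;
    return pNew;
}

/* Inserts word at its alphabetical position, or bumps its count if it
   is already present. Returns 1 on success, 0 if memory ran out. */
int addWordSorted(Word **ppStart, const char *word)
{
    Word **ppLink = ppStart;    /* the link that may need redirecting */
    Word *pNew;
    int cmp = 1;

    assert(ppStart != NULL && word != NULL);

    /* Advance while the current node sorts before the new word */
    while (*ppLink != NULL && (cmp = strcmp((*ppLink)->str, word)) < 0)
        ppLink = &(*ppLink)->pNext;

    /* Found it: just count the occurrence */
    if (*ppLink != NULL && cmp == 0) {
        ++(*ppLink)->freq;
        return 1;
    }

    /* Not present: splice a new node in front of *ppLink */
    pNew = createWordCounter(word);
    if (pNew == NULL)
        return 0;
    pNew->pNext = *ppLink;
    *ppLink = pNew;
    return 1;
}
```

The search stops as soon as it passes the insertion point, but it is still a linear scan, so a file with a very large number of unique words will remain slow with any plain linked list.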

Good luck, I think the code you wrote is pretty good.

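【Comments】:

It might be a 4 gig file on a Linux VM.
@ReiHinoX, I'm playing with numbers for a worst-case "ugly file". :-) One could generate a 4gb file of unique 7-character "words" (a-z only, but 8 characters in total because you need a space or something as a delimiter). Assuming 1 byte per char, 6-char words only get to 1.7gb, but 7-char words would generate about 13gb of unique "words", so 7 is more than enough for 4gb. Those numbers are hard on a linked list built for the analysis, in both memory consumption and compute operations. A 64-bit machine with 16GB could handle this with some memory to spare, but it looks like it could run for a very long time.
Sounds fun. I encourage you to benchmark various inputs: a 4gb "random" file, e.g. "aaaaaaa aaaaaab aaaaaac" and so on, filled with non-repeating 7-character "words". You could also look at pulling some text from Project Gutenberg to benchmark against, or an offline Wikipedia (bit.ly/1L9SWG6). Your multi-threaded speedup may be limited by IO capacity, especially on a machine with a single spinning hard disk. Instead of (or in addition to) multithreading, consider format-aware text stripping that can handle html, xml, or ms-office formats.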

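【Solution 2】:

The code has a bug; MAX_TEXT_LENGTH is not needed. The limit should make sure that a single word does not exceed the length of the word buffer. Rename 'buf' to 'nextWordInFileBuffer' in your program and see whether it makes more sense.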
