Anagram tester in C: answers

【Question title】: Anagram tester in C
【Posted】: 2019-07-10 04:15:24
【Question】:

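I am trying to implement an anagram tester in C. When calling the program, the user enters two words in double quotes, such as "listen" and "silent". I have almost got it working, but I am having trouble with a helper function I wrote to remove the spaces from the two input words. Here is the code for that function: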

void noSpaces(char word[100]) {
    /*
    This is a function to get rid of spaces in a word
    It does this by scanning for a space and shifting the
    array elements at indices > where the space is
    down by 1 as long as there is still a space
    there. 
    */
    for (int i = 0; i < 100; i++) {
        while (word[i] == ' ') {
            for (int j = i; j < 100; j++) {
                word[j] = word[j+1];
            }
        }
    }
}

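This works fine when I pass the first input word from main to the helper. The problem is the second call. When I call the function on the second input, if k is the number of spaces in the first input, the function erases the first k letters of the second input. For example, running ./anagram " banana" "banana" gives me a false negative, and if I add a print statement to see what the inputs look like after noSpaces has been called on them, I get the following: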

banana
anana

Here is the code for the complete program:

#include <stdio.h>

int main(int argc, char *argv[]) {
    //this if statement checks for empty entry
    if (isEmpty(argv[1]) == 0 || isEmpty(argv[2]) == 0) {
        //puts("one of these strings is empty");
        return 1;
    }
    //call to noSpaces to eliminate spaces in each word
    noSpaces(argv[1]);
    noSpaces(argv[2]);
    //call to sortWords
    sortWords(argv[1]);
    sortWords(argv[2]);
    int result = compare(argv[1], argv[2]);
    /*
    if (result == 1) {
        puts("Not anagrams");
    } else {
        puts("Anagrams");
    }
    */
    return result;
}

int compare(char word1[100], char word2[100]) {
    /*
    This is a function that accepts two sorted 
    char arrays (see 'sortWords' below) and
    returns 1 if it finds a different character
    at entry i in either array, or 0 if at no 
    index the arrays have a different character.
    */
    int counter = 0;
    while (word1[counter] != '\0' && word2[counter] != '\0') {
        if (word1[counter] != word2[counter]) {
            //printf("not anagrams\n");
            return 1;
        }
        counter++;
    }
    // printf("anagrams\n");
    return 0;
}

void sortWords(char word[100]) {
    /*
    This is a function to sort the input char arrays
    it's a simple bubble sort on the array elements.
    'sortWords' function accepts a char array and returns void,
    sorting the entries in alphabetical order
    being careful about ignoring the 'special character'
    '\0'.
    */
    for (int j = 0; j < 100; j++) {
        int i = 0;
        while (word[i + 1] != '\0') {
            if (word[i] > word[i + 1]) {
                char dummy = word[i + 1];
                word[i + 1] = word[i];
                word[i] = dummy;
            }
            i++;
        }
    }
}

void noSpaces(char word[100]) {
    /*
    This is a function to get rid of spaces in a word
    It does this by scanning for a space and shifting the
    array elements at indices > where the space is
    down by 1 as long as there is still a space there. 
    */
    for (int i = 0; i < 100; i++) {
        while (word[i] == ' ') {
            for (int j = i; j < 100; j++) {
                word[j] = word[j + 1];
            }
        }
    }
}

int isEmpty(char word[100]) {
    // if a word consists of the empty character, it's empty
    //otherwise, it isn't
    if (word[0] == '\0') {
        return 0;
    }
    return 1;
}

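I know there is a library for handling strings, but I would really like to avoid using it. I have come this far, and I feel the problem is basically solved except for one small thing I cannot see.

I come from a Java background and I am new to C, if that explains any of the mistakes I have made.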

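【Comments】:

When j is 99, you access word[j+1], which is word[100]. But there is no word[100], because word only has 100 entries.
Why the double quotes? Is this homework?
@David Schwartz Thanks for catching that, I had not noticed it. But if a word is far shorter than 100 characters, does that explain the odd effect I am seeing?
@n.m Yes. I decided to use command-line arguments instead of scanf because I thought I might have trouble getting the second part of this assignment to work. I will just tell the TAs to put double quotes around their inputs in case they have entries with spaces. When I first wrote this, I did not think I would have to handle that case.
@P.Gillich There is no way to know. The effects of an out-of-bounds access can be unpredictable.

Tags: c anagram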

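【Answer 1】:

In C, a string is an array of char with a null terminator, that is, a byte with the value 0, usually written '\0'. You should not assume any particular length such as 100. In fact, the compiler ignores the array size in a function parameter. You can determine the length by scanning for the null terminator, which is effectively what strlen() does, or you can write the code so that it avoids repeated scans and simply stops at the null terminator. You should also make sure your functions work for the empty string, which is an array holding a single null byte. Here are the problems in your code:

In the function noSpaces, you iterate past the end of the string and modify memory that may belong to the next string. The program has undefined behavior.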
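One plausible explanation for the "anana" output in the question, offered as an assumption rather than a guarantee, is that many systems happen to store the argv strings back to back in memory. The sketch below simulates that layout inside a single 128-byte buffer, so the 100-byte walk stays in bounds and the demo itself has no undefined behavior. The names noSpaces_original, demo_buf and simulate_adjacent_args are hypothetical and exist only for this illustration.

```c
#include <string.h>

/* The question's routine, unchanged: it always walks 100 bytes,
   no matter how short the string really is. */
static void noSpaces_original(char word[100]) {
    for (int i = 0; i < 100; i++) {
        while (word[i] == ' ') {
            for (int j = i; j < 100; j++) {
                word[j] = word[j + 1];
            }
        }
    }
}

/* Assumption for this demo: the two arguments sit next to each other,
   as argv strings often (but not necessarily) do. */
static char demo_buf[128];

const char *simulate_adjacent_args(void) {
    memset(demo_buf, 0, sizeof demo_buf);
    memcpy(demo_buf, " banana", 8);      /* first argument, with its '\0' */
    memcpy(demo_buf + 8, "banana", 7);   /* second argument, right behind it */
    const char *second = demo_buf + 8;   /* plays the role of argv[2] */

    /* Removing the one leading space shifts 100 bytes down by one,
       dragging the second string along with the first. */
    noSpaces_original(demo_buf);

    return second;                       /* now points at "anana" */
}
```

In this simulation every space removed from the first string slides the second string one byte toward the front, while the pointer to the second string stays where it was. That matches the k missing letters described in the question.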

You should stop at the end of the string. Using two index variables also lets the function run in linear time:

void noSpaces(char word[]) {
    /*
    This is a function to get rid of spaces in a word
    It does this by scanning for a space and shifting the
    array elements at indices > where the space is
    down by 1 as long as there is still a space
    there. 
    */
    int i, j;
    for (i = j = 0; word[i] != '\0'; i++) {
        if (word[i] != ' ') {
            word[j++] = word[i];
        }
    }
    word[j] = '\0';
}

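You can simplify compare so that it performs about a third fewer tests per character (two instead of three). This version also reports a mismatch when one word is a prefix of the other, a case the original loop missed: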

int compare(const char word1[], const char word2[]) {
    /*
    This is a function that accepts two sorted 
    char arrays (see 'sortWords' below) and
    returns 1 if it finds a different character
    at entry i in either array, or 0 if at no 
    index the arrays have a different character.
    */
    for (int i = 0; word1[i] == word2[i]; i++) {
        if (word1[i] == '\0') {
            //printf("anagrams\n");
            return 0;
        }
    }
    // printf("not anagrams\n");
    return 1;
}

sortWords has undefined behavior for the empty string, because you read the char at index 1, which is past the end of the array. Here is a corrected version:

void sortWords(char word[]) {
    /*
    This is a function to sort the input char arrays
    it's a simple bubble sort on the array elements.
    'sortWords' function accepts a char array and returns void,
    sorting the entries in alphabetical order
    being careful about ignoring the 'special character'
    '\0'.
    */
    for (int j = 0; word[j] != '\0'; j++) {
        /* one full bubble pass; word[1] is only read when word[0] != '\0' */
        for (int i = 1; word[i] != '\0'; i++) {
            if (word[i - 1] > word[i]) {
                char dummy = word[i - 1];
                word[i - 1] = word[i];
                word[i] = dummy;
            }
        }
    }
}

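You should declare functions before you use them, or else define them before use. Your code compiles because the compiler accepts old-style C, in which the prototype of a function that has not been seen yet is inferred from the arguments passed at the first call site. This practice is error-prone and obsolete.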
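As a minimal sketch of that advice: putting prototypes at the top of the file lets the compiler check every call, even when the definitions come later. The helper names is_blank and both_have_letters are hypothetical and are not taken from the question's program. Implicit function declarations were removed from the language in C99, although some compilers still accept them with a warning.

```c
#include <stddef.h>

/* Prototype first: every later call is checked against this signature. */
int is_blank(const char word[]);

/* This caller appears before the definition it relies on, which is fine
   because the prototype above is already visible. */
int both_have_letters(const char first[], const char second[]) {
    return !is_blank(first) && !is_blank(second);
}

/* Returns 1 if the string is empty or contains only spaces, 0 otherwise. */
int is_blank(const char word[]) {
    for (size_t i = 0; word[i] != '\0'; i++) {
        if (word[i] != ' ') {
            return 0;
        }
    }
    return 1;
}
```

With the prototype in place, a call that passes the wrong number or type of arguments is diagnosed at compile time instead of misbehaving at run time.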

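Your sorting function has quadratic time complexity, which could be very slow for very long strings, but words should not be that long, so this is not a problem.

It would be better not to modify the argument strings. You can perform the test on a copy of one of the strings, with the same time complexity.

Here is a direct approach: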

#include <stdio.h>

int check_anagrams(const char word1[], const char word2[]) {
    /*
       This function accepts two strings and returns 1 if they
       are anagrams of one another, ignoring spaces.
       The strings are not modified.
     */
    int i, j, len1, letters1, letters2;

    /* compute the length and number of letters of word1 */
    for (len1 = letters1 = 0; word1[len1] != '\0'; len1++) {
        if (word1[len1] != ' ')
            letters1++;
    }

    /* create a copy of word1 in automatic storage */
    char copy[len1 + 1];    /* an array, not a string; +1 avoids a zero-length VLA */
    for (i = 0; i < len1; i++)
        copy[i] = word1[i];

    for (j = letters2 = 0; word2[j] != '\0'; j++) {
        char temp = word2[j];
        if (temp != ' ') {
            letters2++;
            for (i = 0; i < len1; i++) {
                if (copy[i] == temp) {
                    copy[i] = '\0';
                    break;
                }
            }
            if (i == len1) {
                /* letter was not found */
                return 0;
            }
        }
    }
    if (letters1 != letters2)
        return 0;
    return 1;
}

int main(int argc, char *argv[]) {
    const char *s1 = " listen";
    const char *s2 = "silent   ";
    if (argc >= 3) {
        s1 = argv[1];
        s2 = argv[2];
    }
    int result = check_anagrams(s1, s2);
    if (result == 0) {
        printf("\"%s\" and \"%s\" are not anagrams\n", s1, s2);
    } else {
        printf("\"%s\" and \"%s\" are anagrams\n", s1, s2);
    }
    return result;
}

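【Comments】:

Thanks for the detailed response. How could I be modifying memory that belongs to the next array (assuming I have not made an off-by-one error)? Both arrays hold 100 elements, and I never touch an out-of-bounds index (again, barring an off-by-one error). This is an assignment where the input is assumed to be short, and while I appreciate how inefficient my sort is, all I need is for it to work. I am not worried about the algorithm taking an extra 4 milliseconds. The same goes for functions defined after they are called from main: does that really matter from the compiler's point of view?
I have not implemented all of the fixes you suggested, but I do seem to have solved my particular problem after writing a function that returns the length of the input word and not iterating past it in the subsequent functions. Maybe my understanding of how C behaves is flawed, but I would like to know why this matters, since as far as I know each array is allocated 100 slots of memory, so even if I make changes to the empty slots at indices
@P.Gillich: I am afraid your understanding is flawed: where did you hear that each array is allocated 100 elements? That is not the case at all. Arrays can be allocated with any size, and you must avoid accessing elements beyond the end of a null-terminated C string because doing so has undefined behavior. For example, a C string allocated with malloc() may have only the slots it needs, and accessing an element at an invalid index can cause a segmentation fault.
What I mean is, if I declare an array of size 100, does that not mean a block of memory is reserved somewhere with enough room for up to 100 elements? So even if I make changes at indices past the one storing the last character, as long as I do not try to change indices beyond the bounds of the array, why would that affect a different array?
@P.Gillich: Declaring the array size in a function parameter has no effect at all. The function only receives a pointer to the first element of the array passed by the caller. The array's size is whatever it is at the call site. Accessing elements beyond the end of the array has undefined behavior, and modifying them even more so, with possible side effects such as changing another array, or worse.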
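The point about parameter sizes can be sketched in a few lines. The names takes_array, same_type and word_length are hypothetical. The function pointer assignment compiles without a cast because a parameter written as char word[100] is adjusted to char *, and the hand-written length scan shows one way to find the real end of a string without the string library.

```c
#include <stddef.h>

/* The 100 is documentation only: the parameter type is adjusted to char *. */
void takes_array(char word[100]) {
    (void)word;  /* nothing to do, only the signature matters here */
}

/* No cast needed: takes_array really has type void (char *). */
void (*same_type)(char *) = takes_array;

/* Length by scanning for the terminator, never assuming a fixed size. */
size_t word_length(const char *word) {
    size_t n = 0;
    while (word[n] != '\0') {
        n++;
    }
    return n;
}
```

Bounding each loop by word_length(word), or by the terminator directly, keeps every access inside the memory the caller actually passed in.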

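【Answer 2】:

There is a logic error in your helper function. You start copying from word[j] rather than from the beginning of the second word, so you remove as many leading characters as there are leading spaces, as you see in your output.

Note that j=i, and that i counts the number of leading spaces in the outer loop.

By the way, you should only have two loops. Put the while condition into the first for loop, like this: for (int i = 0; i<100 && word[i]==' '; i++).

To fix the logic error, you need another iterator k, initialized to zero in the innermost loop, and you should use word[k] = word[j+1]. I think that would work.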

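【Comments】:

I do not understand. Should i and j not be reset on separate calls to the function? Am I not running from 0 to 99, and entering the second loop each time word[i] is a space?
@P.Gillich I am not talking about separate calls to the function, I am talking about a single call. And no, that is not how the loops work. For i=0, you iterate over 0 <= j < 100, and so on up to i=99.
@P.Gillich I may have time to look at the rest of your code later. Good luck.
Then why does it work perfectly on the first call? I do not understand: if I call the function once, I see the desired output, and it is only the second call that misbehaves.
@P.Gillich Coincidence, probably. It happens all the time. Try more cases.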

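【Answer 3】:

You have a buffer overflow on argv[1] and argv[2] whenever the argv[1] buffer is shorter than 100 bytes. I think a for loop bounded by strlen(word) is enough. When you use a fixed length of 100 in the for loop, the word sometimes picks up data from another memory location and puts your program into undefined behavior. The other functions, sortWords and compare, have the same problem.

Here is my modification of your noSpaces function, which should work.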

#include <string.h>  /* for strlen */

void noSpaces(char word[]) {
    /*
    This is a function to get rid of spaces in a word
    It does this by scanning for a space and shifting the
    array elements at indices > where the space is
    down by 1 as long as there is still a space
    there.
    */
    for (size_t i = 0; i < strlen(word); i++) {
        while (word[i] == ' ') {
            /* shift the tail, including the terminator, down by one */
            for (size_t j = i; j < strlen(word); j++) {
                word[j] = word[j + 1];
            }
        }
    }
}

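【Comments】:

【Answer 4】:

Instead of removing spaces and sorting, which takes O(N lg N) time at best, you can do an O(N) pass by building an array that holds the count of each letter in a word, ignoring spaces as you go.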

#include <stdbool.h>  // bool, true, false
#include <string.h>   // strlen

// Iterate over each character in the string
// For each char in string, increment the count of that character
// in the lettercount array.
// Return the number of unfiltered letters that were counted
int fillLetterCountTable(const char* string, int* lettercount)
{
    size_t len = strlen(string);
    int valid = 0;

    for (size_t i = 0; i < len; i++)
    {
       unsigned char index = (unsigned char)(string[i]);
       if (index ==  ' ')  // ignore spaces
       {
           continue;
       }
       lettercount[index] += 1;
       valid++;
    }

    return valid;
}

// compare if two strings are anagrams of each other
// return true if string1 and string2 are anagrams, false otherwise
bool compare(const char* string1, const char* string2)
{
    int lettercount1[256] = {0};
    int lettercount2[256] = {0};

    int valid1 = fillLetterCountTable(string1, lettercount1);
    int valid2 = fillLetterCountTable(string2, lettercount2);

    if (valid1 != valid2)
        return false;

    // memcmp(lettercount1, lettercount2, sizeof(lettercount1));
    for (int i = 0; i < 256; i++)
    {
        if (lettercount1[i] != lettercount2[i])
            return false;
    }
    return true;
}
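The commented-out memcmp line above hints at a shorter comparison. The following is a compact sketch of the same counting idea that compares the two tables with memcmp. The names count_letters and are_anagrams are hypothetical, and the sketch assumes plain single-byte char text, as in the question. Unlike the question's code, it does use the string library.

```c
#include <stdbool.h>
#include <string.h>

/* Add one to the table entry of every non-space character in s. */
static void count_letters(const char *s, int counts[256]) {
    for (; *s != '\0'; s++) {
        if (*s != ' ') {
            counts[(unsigned char)*s]++;
        }
    }
}

/* Two strings are anagrams, ignoring spaces, exactly when their
   letter-count tables are identical. */
bool are_anagrams(const char *a, const char *b) {
    int counts_a[256] = {0};
    int counts_b[256] = {0};

    count_letters(a, counts_a);
    count_letters(b, counts_b);

    return memcmp(counts_a, counts_b, sizeof counts_a) == 0;
}
```

Because equal tables imply equal letter totals, the separate valid1 != valid2 check becomes unnecessary in this variant. Neither input string is modified.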

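【Comments】:

FYI: this does not work for Unicode text, so it is basically unusable in real applications
The OP is clearly using the char type. I think this solution is fair game. For 16-bit Unicode (a la Windows), just switch from char to wchar_t and from 256 to 65536. For 32-bit Unicode (a la Mac), you are correct: a better hash-table approach is needed so as not to blow the stack. The way to do that is to have a sorted hash-table algorithm that can be used as a "set", plus a way to compare two sets for equality. In C++ you could just use std::set<wchar_t>.
More generally, I think this is mostly a question about his current implementation, and the issue at hand is C rather than anagram testing. Offering a good solution is valuable, though!
@Alexander Sorting the characters in the words does not work for Unicode text either. Anagrams in general are probably not well defined for Unicode anyway.
@selbie While I do like this solution and it had crossed my mind, I thought doing it my way would be more instructive. By the way, I did end up getting it to work. Thanks!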