Separate punctuation marks in C from a word

【Question title】: Separate punctuation marks in C from a word
【Posted】: 2017-12-28 19:58:28
【Question】:

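I'm trying to separate all the words of a text, and I also need to separate out the punctuation marks.

What is the best way to split them apart and save them in an array of strings?

Here is an example:

Input:
- "Hi, my name is Sara!"
Expected output:
- "Hi"
- ","
- "my"
- "name"
- "is"
- "Sara"
- "!"
Actual output:
- "Hi,"
- "my"
- "name"
- "is"
- "Sara!"

My code: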

palavra_linha[i] = strtok (linhas[i], " \n\r");

while (palavra_linha[i] != NULL) {
    palavras_finais[j] = palavra_linha[i];
    j++;
    palavra_linha[i] = strtok (NULL, " \n\r");
}

I know I have to use something like the following, but it doesn't work; the condition always evaluates to false:

if (strlen(palavra_linha[i])-1) == '.') {
    palavras_finais[j] = palavra_linha[i];
}

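【Comments】:

But I need to save the punctuation marks in the array as well. @DeiDei
What needs to happen with "It isn't true that Bill O'Reilly came in 3rd!" as input? Splitting on spaces is fine; you should also consider using strspn() and strcspn() to analyze the substrings. You will also have to copy the material, because you cannot simply write a null byte after the word and before the trailing punctuation.
I'd suggest not using strtok. Just examine each character and take the appropriate action for each one.
Please don't vandalize your question.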

Tags: c arrays string

【Solution 1】:

Now it works correctly and gives me this output:

[Hello] 
[,] 
[Sara] 
[!] 
[How] 
[are] 
[You] 
[?]

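Don't forget to free the array after use. You can also keep the original string around by assigning it to a tmp pointer (for example) at the start of the program.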

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

static void skip_copied_bytes(char *str, int *i)
{
    char c = 0;
    while (*str == ' ' && str)
    {
        str++;
        (*i)++;
    }

    while (*str != ' ' && *str != '\0' && !ispunct(c))
    {
        str++;
        (*i)++;
        c = *str;
    }

    while (*str == ' ' && str)
    {
        str++;
        (*i)++;
    }
}

static int count_char(char *str)
{

    int count = 0;
    char c = 0;
    while (*str == ' ' && str)
        str++;


    while (*str != ' ' && *str != '\0' && !ispunct(c))
    {
        count++;
        str++;
        c = *str;
    }

    while (*str == ' ' && str)
    {
        str++;
    }

    return count;
}

static int count_word(char const *s, char c)
{
    int i;
    int count;

    count = 0;
    i = 0;
    while (s[i] != '\0')
    {
        while (s[i] == c)
            i++;
        if (s[i] != '\0')
            count++;
        while (s[i] != c && s[i] != '\0') {
            if (ispunct(s[i]))
                count++;
            i++;
        }
    }

    return count;
}
int main(void)
{
    char *str = "Hello, Sara! How are You?";
    char **array;
    int i = 0;
    int j = 0;
    int size = 0;

    size = count_word(str, ' ');
    if((array = malloc(sizeof(char *) * (size + 1))) == NULL)
        return -1;

    while (str[i])
    {
        size = count_char(&str[i]);
        if ((array[j] = malloc(sizeof(char) * (size + 1))) == NULL) /* +1 for the '\0' stored below */
            return -1;

        strncpy(array[j], &str[i], size);
        array[j][size] = '\0';

        skip_copied_bytes(&str[i], &i);
        j++;
    }

    array[j] = 0;

    for(i = 0; array[i]; i++) {
        printf("[%s] \n", array[i]);
    }
}

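【Comments】:

【Solution 2】:

One idea is to keep the original text and create a copy of it. While building the copy, iterate over each character of the text and check whether the current character is a punctuation mark. If it is, insert a space before the punctuation mark in the copy, then continue from the character after it until you reach the text's terminating null character. Finally, you can tokenize the copy with strtok() the same way you already do. The following example implements this idea.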

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(void) {
    char *stnc_org, *stnc_new;
    int size;
    printf("What is the expected size of the sentence: ");
    scanf("%d", &size);
    stnc_org = (char *)malloc(size * sizeof(char));

    printf("Input: \n");
    scanf(" %[^\n]", stnc_org);

    // get the number of punct
    int punct_num = 0;
    int i;
    for (i = 0; *(stnc_org + i) != '\0'; i++) {
        if (ispunct(*(stnc_org + i))) {
            punct_num++;
        }
    }

    char * stnc_backup = (char *)malloc((size + punct_num * 2) * sizeof(char));
    stnc_new = stnc_backup;

    // copy the original str to the new one, adding a space around each punct
    // wherever one is missing; strtok() later skips any run of spaces
    for (i = 0; *(stnc_org + i) != '\0'; i++) {
        if (ispunct((unsigned char)*(stnc_org + i))) { // found a punct
            // space before, unless the punct is the first character
            // or already follows a space
            if (i > 0 && *(stnc_org + i - 1) != ' ') {
                *stnc_new++ = ' ';
            }
            *stnc_new++ = *(stnc_org + i);
            // space after, unless the punct is the last character
            // or is already followed by a space
            if (*(stnc_org + i + 1) != '\0' && *(stnc_org + i + 1) != ' ') {
                *stnc_new++ = ' ';
            }
        } else {
            // ordinary character: copy it unchanged
            *stnc_new++ = *(stnc_org + i);
        }
    }
    // terminate the copy so that strtok() sees a valid string
    *stnc_new = '\0';

    //printf("%s\n", stnc_backup);

    printf("\nOutput:\n");
    char *str;
    str = strtok(stnc_backup, " \n\r");

    while(str != NULL) {
        printf("%s\n", str);
        str = strtok(NULL, " \n\r");
    }
}

Sample output:

Running "/home/ubuntu/workspace/replace.c"
What is the expected size of the sentence: 300
Input: 
"Isn't it true that Bill O'Reilly didn't win (he came in 3rd!)? 'Tain't necessarily so!"

Output:
"
Isn
'
t
it
true
that
Bill
O
'
Reilly
didn
'
t
win
(
he
came
in
3rd
!
)
?
'
Tain
'
t
necessarily
so
!
"


Process exited with code: 0

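【Comments】:

Note that a scan set is written %[…] and is not a modifier for %s. You have scanf(" %[^\n]s", stnc_org);. The %[^\n] conversion is complete on its own; the s is just a literal letter, and matching it always fails (because the character after the scan set is a newline). In this context you never find out that the s failed to match, because scanf() has no way to report that failure. You should test the result. The leading space in the format string is entirely defensible and correct.
Your code compiles cleanly under stringent compilation options. That's good. It produces the correct output on the sample data, which is also good. There's some fun to be had with input strings/lines such as: Isn't it true that Bill O'Reilly didn't win (he came in 3rd!)? 'Tain't necessarily so! It doesn't break, but I was interested to see that one of the words it produced was 'Tain. (If it's any consolation, my code doesn't handle that either; apostrophes at the start or end of a word are tricky. My treatment of those is no different. I do handle apostrophes inside words.)
The reason is that I assumed a punctuation mark is followed by a space or by the terminating null character. So for (he came..., for example, where the punctuation mark is followed by a word, what my code did (inserting a space before the punctuation mark) does not help. To fix this, we need to consider several cases:
1. The punctuation mark is the first character of the text (e.g., "hello"); 2. the punctuation mark is followed by a space or the terminating null character (e.g., hello, world!); 3. the punctuation mark comes after a space and is followed by a word (e.g., hello (world).); 4. there is no space before or after the punctuation mark (e.g., It's). Once the code covers all of these cases, it should work.
The new code seems to work well, as did the old code. What I was pointing out was an edge case. Sorry I can't upvote twice.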