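You should test a number of conditions to make sure you match only whole words. Below is one way to search for `jury` that matches only `jury` and `jury's`, but not a longer word that merely contains `jury`. You should also consider whether you want to match plural forms of a word (e.g. `review` and `reviews`). A single set of delimiters (`delim`) is used below to ensure a whole-word match. You can easily split it into two sets, one for the beginning of the word and one for the end, if you want to match plurals or various other suffixes.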
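The code expects the filename to search as the first argument and the search term (`sterm`) as the second. (If no arguments are given, it searches for 'the' in the text read from `stdin`.) The code reads each line of the file into a temporary buffer named `line`, then scans every character in `line` for the first character of `sterm`. When one is found, it checks that the preceding character is a delimiter (or that the match begins the line), and that the character following the word (`slen` characters later, where `slen` is the length of `sterm`) is also a delimiter. If a word starting with the same character as `sterm` has a delimiter before and after it, its contents are compared against `sterm` with `strncmp`.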
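If all conditions are met, `count` is incremented and the match is printed along with its line number and its zero-based position within the matching `line`. This is only a basic whole-word search that has not been optimized, but it should give you a starting point for distinguishing whole words from longer words that merely contain the search term (i.e. a search for 'the' will not also match 'them', 'then', 'they', etc.). You could also turn this code into a function that saves the line number and position of each match in an array of structs and returns a pointer to that array. That way you could parse the text and get back the line and position of every match. (That is for another day.)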
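Look over the code and let me know if you have any questions. If you do not care about matching only whole words, you can simply call `strstr` repeatedly on each line, advancing a pointer past each match, to count the occurrences of the search term. Use whichever best fits your needs.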
#include <stdio.h>
#include <string.h>

#define MAXS 256

int main (int argc, char **argv)
{
    char line[MAXS] = {0};          /* line buffer for fgets */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
    char *sterm = argc > 2 ? argv[2] : "the";   /* search term */
    char *delim = " \t\n\'\".";     /* chars allowed before/after a word */
    size_t count = 0, idx = 0, slen = strlen (sterm);

    if (!fp) {
        fprintf (stderr, "error: file open failed '%s'\n", argv[1]);
        return 1;
    }

    while (fgets (line, MAXS, fp))
    {
        size_t i, llen = strlen (line);

        idx++;                      /* 1-based line number */
        if (llen < slen)
            continue;   /* line shorter than search term */

        for (i = 0; i < llen - slen + 1; i++) {
            if (line[i] != *sterm)
                continue;   /* char != first char in sterm */
            if (i && !strchr (delim, line[i-1]))
                continue;   /* prior char is not a delim */
            if (!strchr (delim, line[i+slen]))
                continue;   /* next char is not a delim (nor end of line) */
            if (strncmp (&line[i], sterm, slen))
                continue;   /* chars don't match sterm */

            /* i is already the zero-based offset of the match in line */
            printf (" line[%2zu] match %2zu. '%s' at location %zu\n",
                    idx, ++count, sterm, i);
        }
    }
    if (fp != stdin) fclose (fp);

    printf ("\n total occurrences of '%s' in '%s' : %zu\n\n",
            sterm, argc > 1 ? argv[1] : "stdin", count);

    return 0;
}
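
The suggestion above about splitting `delim` into a leading and a trailing set could be sketched roughly as follows. The helper name `match_at` and its parameters `lead` and `trail` are assumptions made for illustration; they are not part of the program above. Adding `s` to the trailing set is a deliberately crude way to accept a plural: it lets `review` match inside `reviews`, but it would also accept any longer word that continues with an `s`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if sterm occurs in line at offset i,
 * preceded by a char from lead (or the start of the line) and followed
 * by a char from trail (or the end of the line); otherwise return 0.
 * The caller must ensure i <= strlen (line).
 */
int match_at (const char *line, size_t i, const char *sterm,
              const char *lead, const char *trail)
{
    size_t slen = strlen (sterm);

    if (strncmp (line + i, sterm, slen))
        return 0;               /* chars don't match sterm */
    if (i && !strchr (lead, line[i-1]))
        return 0;               /* prior char is not a leading delim */
    if (!strchr (trail, line[i+slen]))
        return 0;               /* next char is not a trailing delim */

    return 1;
}
```

With `lead` set to the original delimiters and `trail` set to the same characters plus `s`, a call at the offset of `reviews` would succeed for the term `review`, while passing the original set for both arguments reproduces the strict whole-word test.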
Example File
$ cat dat/damages.txt
Personal injury damage awards are unliquidated
and are not capable of certain measurement; thus, the
jury has broad discretion in assessing the amount of
damages in a personal injury case. Yet, at the same
time, a factual sufficiency review insures that the
evidence supports the jury's award; and, although
difficult, the law requires appellate courts to conduct
factual sufficiency reviews on damage awards in
personal injury cases. Thus, while a jury has latitude in
assessing intangible damages in personal injury cases,
a jury's damage award does not escape the scrutiny of
appellate review.
Because Texas law applies no physical manifestation
rule to restrict wrongful death recoveries, a
trial court in a death case is prudent when it chooses
to submit the issues of mental anguish and loss of
society and companionship. While there is a
presumption of mental anguish for the wrongful death
beneficiary, the Texas Supreme Court has not indicated
that reviewing courts should presume that the mental
anguish is sufficient to support a large award. Testimony
that proves the beneficiary suffered severe mental
anguish or severe grief should be a significant and
sometimes determining factor in a factual sufficiency
analysis of large non-pecuniary damage awards.
Output
$ ./bin/searchterm dat/damages.txt jury
line[ 3] match 1. 'jury' at location 0
line[ 6] match 2. 'jury' at location 22
line[ 9] match 3. 'jury' at location 37
line[11] match 4. 'jury' at location 2
total occurrences of 'jury' in 'dat/damages.txt' : 4
or
$ ./bin/searchterm <dat/damages.txt
line[ 2] match 1. 'the' at location 50
line[ 3] match 2. 'the' at location 39
line[ 4] match 3. 'the' at location 43
line[ 5] match 4. 'the' at location 48
line[ 6] match 5. 'the' at location 18
line[ 7] match 6. 'the' at location 11
line[11] match 7. 'the' at location 38
line[17] match 8. 'the' at location 10
line[19] match 9. 'the' at location 34
line[20] match 10. 'the' at location 13
line[21] match 11. 'the' at location 42
line[23] match 12. 'the' at location 12
total occurrences of 'the' in 'stdin' : 12
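
The idea mentioned earlier, of turning the search into a function that returns an array of structs holding the line and position of each match, might look something like the sketch below. The names `match_t` and `find_matches`, the growth strategy, and the choice to return `NULL` when nothing is found are all assumptions for illustration, not part of the original program.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXS 256

typedef struct {
    size_t line;    /* 1-based line number of the match */
    size_t pos;     /* zero-based offset within that line */
} match_t;

/* Hypothetical helper: collect every whole-word match of sterm in fp.
 * Returns a malloc'd array (caller frees) and stores the number of
 * matches in *n. Returns NULL with *n == 0 if there are no matches
 * or if allocation fails.
 */
match_t *find_matches (FILE *fp, const char *sterm, const char *delim,
                       size_t *n)
{
    char line[MAXS];
    match_t *m = NULL;
    size_t cap = 0, idx = 0, slen = strlen (sterm);

    *n = 0;
    if (!slen)
        return NULL;                /* nothing to search for */

    while (fgets (line, MAXS, fp)) {
        size_t i, llen = strlen (line);

        idx++;
        for (i = 0; i + slen <= llen; i++) {
            if (strncmp (&line[i], sterm, slen))
                continue;           /* chars don't match sterm */
            if (i && !strchr (delim, line[i-1]))
                continue;           /* prior char is not a delim */
            if (!strchr (delim, line[i+slen]))
                continue;           /* next char is not a delim */

            if (*n == cap) {        /* grow the array when it is full */
                size_t ncap = cap ? cap * 2 : 8;
                match_t *tmp = realloc (m, ncap * sizeof *tmp);
                if (!tmp) {
                    free (m);
                    *n = 0;
                    return NULL;
                }
                m = tmp;
                cap = ncap;
            }
            m[*n].line = idx;
            m[*n].pos = i;
            (*n)++;
        }
    }
    return m;
}
```

A caller would pass an open `FILE *`, the search term, and the delimiter string, then loop over the returned array from `0` to `n - 1` and `free` it when done.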
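Using Pointers Instead of Array-Index Notation
You may find pointer notation a bit more natural than array-index notation (e.g. declaring `char *p = line;` and advancing `p`, rather than using `line[X]` notation). If so, you can replace the read loop with the following: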
    while (fgets (line, MAXS, fp))
    {
        char *p = line;
        size_t llen = strlen (line);

        idx++;                      /* 1-based line number */
        if (llen < slen)
            continue;   /* line shorter than search term */

        for (; p < (line + llen - slen + 1); p++) {
            if (*p != *sterm)
                continue;   /* char != first char in sterm */
            if (p > line && !strchr (delim, *(p - 1)))
                continue;   /* prior char is not a delim */
            if (!strchr (delim, *(p + slen)))
                continue;   /* next char is not a delim (nor end of line) */
            if (strncmp (p, sterm, slen))
                continue;   /* chars don't match sterm */

            /* p - line is a ptrdiff_t; cast it to match %zu */
            printf (" line[%2zu] match %2zu. '%s' at location %zu\n",
                    idx, ++count, sterm, (size_t)(p - line));
        }
    }
Pointer notation can feel a bit more natural in C. Let me know if you have any questions.
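
For the case where whole-word matching does not matter, the repeated `strstr` approach mentioned near the top could be sketched as below. The function name `count_substr` is hypothetical, and the sketch assumes non-overlapping matches are wanted, so the pointer is advanced by the full length of the term after each hit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count non-overlapping occurrences of term in
 * line, whether or not they form whole words.
 */
size_t count_substr (const char *line, const char *term)
{
    size_t n = 0, tlen = strlen (term);
    const char *p = line;

    if (!tlen)                  /* an empty term would never advance p */
        return 0;

    while ((p = strstr (p, term))) {
        n++;
        p += tlen;              /* step past this match before searching again */
    }

    return n;
}
```

Called once per line inside the `fgets` loop, the return values can simply be summed. Unlike the whole-word search, this counts the `the` inside `them` or `then` as a match.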