Reading a specific part of a text file as a string in C?

【Question title】: Reading a specific part of a text file as a string in C?
【Posted】: 2019-06-19 12:50:23
【Question】:

I am writing code that reads a huge text file containing DNA bases, and I need to be able to extract specific sections. The file looks like this:

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGGGG

...

Each line has 30 characters.

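I have a separate file that specifies the sections, meaning I have a start value and an end value for each one. For every start/end pair, I need to extract the corresponding string from the file. For example, with start=10, end=45, I need to store, in a separate temporary file, the string that starts at the 10th character of the first line (C) and ends at the 15th character of the second line (C).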

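I tried using the fread function, as shown below, on a test file containing the lines of letters above. With the parameters start=1, end=90, the resulting file looks like this: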

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGG™eRV

Every run produces random characters at the end.

Code:


FILE* fp;
fp=fopen(filename, "r");
if (fp==NULL) puts("Failed to open file");

int start=1, end=90;
char string[end-start+2]; //characters from start to end = end-start+1

fseek(fp, start-1, SEEK_SET);

fread(exon,1, end-start+1, fp);

FILE* tp;
tp=fopen("exon", "w");
if (tp==NULL) puts("Failed to make tmp file");

fprintf(tp, "%s\n", string);
fclose(tp);

I don't understand how fread handles \n characters, so I tried replacing it with the following:

int i=0;
char ch;
while (!feof(fp))
{
            ch=fgetc(fp);

            if (ch != '\n') 
            {
                string[i]=ch;
                i++;
                if (i==end-start) break;
            }

}
string[end-start+1]='\0';

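It created the following file: TGTTCCAGGCTGTCAGATGCTAACCTGGGGTCACTGGGGGTGTGCGTGCTGCTCCAGCCTGTTCCAGGATATCAGATGCTCACCTGGGGô

(Without any line breaks, which I don't mind.) On every run I get a different random character in place of the final "G".

What am I doing wrong? Is there a way to do this with fread or some other function?

Thank you in advance.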

【Comments】:

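You have to account for 31 characters per line (30 letters followed by \n), or possibly even 32 characters per line (30 letters followed by \r\n). That means you may want to check the format of the input file first. Either way, using fseek followed by fread is the better approach.
FWIW, fread doesn't care about EOL characters at all.
See "Why is while (!feof(fp)) always wrong?". fread does not treat a newline "specially"; it is just another character. It also returns the number of characters read, and the resulting data is not null-terminated.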
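I think there are two problems here: (1) You are not accounting for the fact that every line ends with a newline, and the newline is a character. So to read 2 lines you need to read 30 + 1 + 30 = 61 characters, not 60. You may also want to strip the newlines and add your own line break after every 30 characters. And (2) you are not adding a null character at the end of the buffer, so when you try to print it as a string, it runs past the end until it happens to hit a random zero byte in memory.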
In your own loop using fgetc you do add a null at the end of the string, but I think your index is off: you should be adding it at position i when […].
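The offset arithmetic described in these comments can be sketched as a small function. This is only an illustration: the name `letter_to_offset` and its parameters are hypothetical, and it assumes every line holds exactly `line_len` letters followed by `eol_len` line-ending bytes (1 for \n, 2 for \r\n).

```c
/* Hypothetical helper (not from the thread): map a 1-based letter index,
 * counted without line breaks, to a 0-based byte offset in the file.
 * Assumes fixed-width lines of line_len letters, each followed by
 * eol_len line-ending bytes (1 for "\n", 2 for "\r\n"). */
long letter_to_offset(long index, long line_len, long eol_len)
{
    long zero_based = index - 1;                 /* 1-based -> 0-based */
    long full_lines_before = zero_based / line_len;
    return zero_based + full_lines_before * eol_len;
}
```

With 30-letter lines and \n endings, letter 1 maps to offset 0 and letter 90 maps to offset 91, so reading letters 1 through 90 takes 92 bytes: 90 letters plus the 2 newlines in between.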

Tags: c

【Solution 1】:

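I have modified your code and added comments to explain it.

Please go through it. You left out error checking, and the code has a few undefined variables.

I return from the if blocks on failure; a goto would be more appropriate.

Please refer to this comment for whether to add 1 or 2 characters to start and end.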

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
        FILE* fp;
        // fp = fopen(filename, "r");
        // since the filename is undeclared i have used hard coded file name
        fp = fopen("dna.txt", "r");
        // Nothing wrong in performing error checking
        if (fp == NULL) {
                puts("Failed to open file");
                return -1; 
        }

        // Make sure start is not 0 if you want to use indices starting from 1
        int start = 1, end = 90; 

        // I would adjust the start and end index by adding count of '\n' or '\r\n' to the start and end
        // Here I am adjusting for '\n' i.e 1 char
        // since you have 30 chars so hardcoding it.
        int m = 1; // m depends on whether it is \n or \r\n
                   // 1 for \n and 2 for \r\n
        --start; --end; // adjusting indexes to be 0 based
        if (start != 0)
                start = start + (start / 30) * m;   // start will be 0
        if (end != 0)
                end = end + (end / 30) * m;         // end will be 91

        // lets declare the chars to read
        int char_to_read = end - start + 1;

        // need only 1 extra char to append null char
        // If start and end is going to change, then i would suggest using malloc instead of static buffer
        // because compiler cannot predict the memory to allocate to the buffer if it is dependent on external factor
        // char string[char_to_read + 1]; //characters from start to end = end-start+1

        char *string = malloc(char_to_read + 1); 
        if (string == NULL) {
                printf("malloc failed\n");
                fclose(fp);
                return -2;
        }

        // zero the buffer
        memset(string, 0, char_to_read + 1); 

        // fseek returns 0 on success and a nonzero value on failure
        int rc = fseek(fp, start, SEEK_SET);
        if (rc != 0) {
                printf("fseek failed\n");
                free(string);
                fclose(fp);
                return -1;
        }

        // exon is not defined, and btw we wanted to read in string.
        // fread returns a size_t count of the items read; it never returns -1
        size_t bytes_read = fread(string, 1, char_to_read, fp);

        // Lets check if there is any error after reading
        if (ferror(fp)) {
                puts("fread failed");
                free(string);
                fclose(fp);
                return -1; 
        }

        // Now append the null char to the end
        string[bytes_read] = 0;
        printf("%s\n", string);
        fclose(fp);

        // free the memory once you are done with it
        // (string is known to be non-NULL here, and free(NULL) is a no-op anyway)
        free(string);


// Now you can write it back to a file (do this before the free(string) above).
//      FILE* tp;
//      tp=fopen("exon", "w");
//      if (tp==NULL) puts("Failed to make tmp file");

//      fprintf(tp, "%s\n", string);
//      fclose(tp);
}
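The code above copies the line breaks into the buffer along with the letters. If the goal is a plain string of bases with no line breaks, as in the question's second attempt, a character-by-character variant might look like the sketch below. The function `read_letters` and its parameters are hypothetical and not part of the original answer. It assumes fixed-width lines and a stream opened in binary mode ("rb"), so that byte offsets passed to fseek are meaningful. It stores the result of fgetc in an int so EOF can be detected, and terminates the string at the number of letters actually stored.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not from the original answer.
 * Reads letters start..end (1-based, inclusive, counted without line breaks)
 * from fp, skipping '\n' and '\r'. Returns a malloc'd, NUL-terminated string
 * that the caller must free, or NULL on failure. The result is shorter than
 * requested if the file ends early.
 * Assumes each line holds line_len letters followed by eol_len line-ending
 * bytes, and that fp was opened in binary mode. */
char *read_letters(FILE *fp, long start, long end, long line_len, long eol_len)
{
    if (fp == NULL || start < 1 || end < start || line_len < 1 || eol_len < 0)
        return NULL;

    long count = end - start + 1;
    /* byte offset of the first wanted letter, counting earlier line endings */
    long offset = (start - 1) + ((start - 1) / line_len) * eol_len;

    char *out = malloc((size_t)count + 1);   /* +1 for the terminator */
    if (out == NULL)
        return NULL;

    if (fseek(fp, offset, SEEK_SET) != 0) {
        free(out);
        return NULL;
    }

    long i = 0;
    int ch;  /* int, not char, so EOF can be told apart from real data */
    while (i < count && (ch = fgetc(fp)) != EOF) {
        if (ch != '\n' && ch != '\r')
            out[i++] = (char)ch;
    }
    out[i] = '\0';  /* terminate right after the last stored letter */
    return out;
}
```

A call such as `read_letters(fp, 10, 45, 30, 1)` on the question's sample file would be expected to return the 36 letters from the 10th character of line 1 through the 15th character of line 2.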

【Comments】:

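Thanks for the detailed answer! One thing still confuses me, though. Suppose we have 90 characters on one line, with start=1 (the first character) and end=90 (the last one). Then the number of characters is not end-start but end-start+1. So, if that is true, it should be char_to_read=end-start+1. What am I missing? By the way, as for the undefined variables: they are either missing because the code is only part of a function, or because I forgot to rename them when copy-pasting (for example, 'exon' is actually 'string').
Yes, char_to_read should be end - start + 1. I forgot that in your case the indices start at 1. I will change the code.
There was a problem with the start and end indices; I fixed it by subtracting 1 from each.
@Shubham After the edit, start will now be 0, so shouldn't it be fseek(fp, start, SEEK_SET) in this example (start=1, end=90)? Also, my start/end values are usually not 0, so I guess start+=(start/30)*m (same for end) and char_to_read=end-start+1 will work fine, right?
Yes, you are right about fseek; it should be fseek(fp, start, SEEK_SET). If start and end are non-zero, it will work. I have updated the answer, thanks for pointing it out.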
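The answer's `m` (1 for \n, 2 for \r\n) is hardcoded, and the first comment on the question suggests checking the input file's format first. One possible way to do that check is sketched below. The function `detect_eol_len` is a hypothetical name, and the sketch assumes the stream was opened in binary mode ("rb") so that any \r bytes are visible to the program.

```c
#include <stdio.h>

/* Hypothetical helper: inspect the first line of fp and report the length
 * of its line ending. Returns 2 for "\r\n", 1 for "\n", 0 if the file has
 * no newline at all, and -1 if fp is NULL or seeking fails.
 * Leaves the stream positioned at the start of the file.
 * Assumes fp was opened in binary mode. */
int detect_eol_len(FILE *fp)
{
    if (fp == NULL || fseek(fp, 0L, SEEK_SET) != 0)
        return -1;

    int result = 0;
    int prev = 0;
    int ch;  /* int so that EOF is distinguishable from data */
    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n') {
            result = (prev == '\r') ? 2 : 1;
            break;
        }
        prev = ch;
    }

    /* go back to the beginning so the caller can seek from a known state */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return result;
}
```

The returned value could then be used in place of the hardcoded `m`, provided it is 1 or 2.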