Parsing comma-separated values with line breaks in C

【Question title】: C parsing a comma-separated-values with line breaks
【Posted】: 2017-08-29 17:55:13
【Question】:

I have a CSV data file that contains the following data:

H1,H2,H3
a,"b
c
d",e

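When I open it in Excel as a CSV file, it shows a sheet with the column headings H1, H2, H3 and the column values: a for H1,

a multi-line value of
b
c
d
for H2

and e for H3. I need to parse this file with a C program and pick up the values the same way. However, the following code snippet will not work, because one column has a multi-line value: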

char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch; 
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
  *pch = 0; 
  strcpy(tokens[i++], ptok);
  ptok = pch+1;
}
strcpy(tokens[i++], ptok);

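How can I modify this code snippet to accommodate multi-line column values? Please don't get hung up on the hard-coded string buffer sizes; this is test code for a POC. Rather than use any third-party library, I would like to do this the hard way, from first principles. Please help.

【Question comments】: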

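Parsing CSV files is deceptively simple: there are many special cases that are hard to remember to handle, or just plain hard to handle. For example, what if the multi-line string contains a comma? Try to find a library that handles it for you.
For starters, you should consider making your code able to read in additional lines, and letting buff be of arbitrary size rather than limited to 199 characters.
Please don't get hung up on the hard-coded string buffer sizes; this is test code for a POC. Rather than use any third-party library, I would like to do this the hard way, from first principles.
If you want to do it all yourself, start by creating a lot of unit tests, so you can be sure it is correct when you are done. As for the actual parsing, you really do have to parse the content; you can't just read line by line and split the content with strtok. I suggest using a larger buffer and reading into it, then processing it character by character, handling commas (when not inside a string) and handling strings and possible escapes as you find them.
In stackoverflow.com/questions/32349263/… I provided a basic CSV parser in C. If newlines are inside a quoted string, they are copied into the field being parsed.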
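
The character-by-character approach suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions, not the parser linked above: `parse_record`, `put_char`, `MAX_FIELDS` and `MAX_FIELD_LEN` are hypothetical names, the input is assumed to already sit in a NUL-terminated buffer, and the fixed-size arrays mirror the question's POC. It is lenient (a quote in the middle of an unquoted field simply toggles quoting) and it does not treat `\r\n` specially.

```c
#include <stddef.h>

#define MAX_FIELDS 10      /* hypothetical limits, in the spirit of the POC */
#define MAX_FIELD_LEN 64

/* Stores c in the current field; silently drops whatever does not fit. */
static void put_char(char out[][MAX_FIELD_LEN], int field, size_t *len, char c) {
    if (field < MAX_FIELDS && *len < MAX_FIELD_LEN - 1)
        out[field][(*len)++] = c;
}

/* Parses one CSV record starting at *pp and advances *pp past it.
 * Returns the number of fields stored in out, or -1 if a quoted
 * field is never closed. Newlines inside quotes stay in the field. */
int parse_record(const char **pp, char out[][MAX_FIELD_LEN]) {
    const char *p = *pp;
    int nfields = 0;
    size_t len = 0;
    int in_quotes = 0;

    for (;;) {
        char c = *p;
        if (in_quotes) {
            if (c == '\0') return -1;            /* unterminated quote */
            if (c == '"' && p[1] == '"') {       /* "" is a literal quote */
                put_char(out, nfields, &len, '"');
                p += 2;
            } else if (c == '"') {               /* closing quote */
                in_quotes = 0;
                p++;
            } else {                             /* includes ',' and '\n' */
                put_char(out, nfields, &len, c);
                p++;
            }
        } else if (c == '"') {                   /* opening quote */
            in_quotes = 1;
            p++;
        } else if (c == ',' || c == '\n' || c == '\0') {
            /* End of field: terminate it and start the next one */
            if (nfields < MAX_FIELDS) out[nfields++][len] = '\0';
            len = 0;
            if (c != '\0') p++;
            if (c != ',') break;                 /* end of record */
        } else {
            put_char(out, nfields, &len, c);
            p++;
        }
    }
    *pp = p;
    return nfields;
}
```

Calling `parse_record` repeatedly on the question's data should yield one record per call, with the second record's middle field holding the embedded newlines. Assertions like the ones a unit test would contain are a natural way to pin that behavior down before handling harder cases.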

Tags: c excel csv parsing strchr

【Solution 1】:

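The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays, which you are avoiding by using fixed-length strings and arrays. (The other complication is handling CSV that is not well formed.)

Without those complications, the parsing is really quite simple:

(untested)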

/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
  for (;;) {
    int ch = getchar();
    if (ch == ',' || ch == '\n' || ch == EOF) return ch;
    stringAppend(s, ch);
  }
}

/* Appends a quoted field to s and returns the delimiter.
 * Assumes the open quote has already been read.
 * If the field is not terminated, returns ERROR, which
 * should be a value different from any character or EOF.
 * The delimiter returned is the character after the closing quote
 * (or EOF), which may not be a valid delimiter. Caller should check.
 */
int readQuotedField(struct String* s) {
  for (;;) {
    int ch = getchar();
    if (ch == EOF) return ERROR;
    if (ch == '"') {
      /* A doubled quote is a literal quote; anything else ends the field */
      ch = getchar();
      if (ch != '"') return ch;
    }
    stringAppend(s, ch);
  }
}

/* Reads a single field into s and returns the following delimiter,
 * which might be invalid.
 */
int readField(struct String* s) {
  stringClear(s);
  int ch = getchar();
  if (ch == '"') return readQuotedField(s);
  /* An empty field is followed directly by its delimiter */
  if (ch == ',' || ch == '\n' || ch == EOF) return ch;
  stringAppend(s, ch);
  return readSimpleField(s);
}

/* Reads a single row into row and returns the following delimiter,
 * which might be invalid.
 */
int readRow(struct Row* row) {
  struct String field = {0};
  rowClear(row);
  /* Make sure there is at least one field */
  int ch = getchar();
  if (ch != '\n' && ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readField(&field);
      rowAppend(row, &field);
    } while (ch == ',');
  }
  return ch;
}

/* Reads an entire CSV file into table.
 * Returns true if the parse was successful.
 * If an error is encountered, returns false. If the end-of-file
 * indicator is set, the error was an unterminated quoted field; 
 * otherwise, the next character read will be the one which
 * triggered the error.
 */
bool readCSV(struct Table* table) {
  tableClear(table);
  struct Row row = {0};
  /* Make sure there is at least one row */
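  int ch = getchar();
  if (ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readRow(&row);
      tableAppend(table, &row);
    } while (ch == '\n');
  }
  /* Push an invalid delimiter back so it is the next character read */
  if (ch != EOF && ch != ERROR) ungetc(ch, stdin);
  return ch == EOF;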
}

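The above is "first principles": it doesn't even use the standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although that is a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C would still be needed as a first step.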
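
The answer leaves `struct String`, `stringAppend` and `stringClear` undefined. One possible shape for them, purely as an assumption and not the answerer's own definition, is a buffer that doubles when it fills up. The `stringFree` helper and the `int` return value of `stringAppend` are additions for this sketch; the code above ignores the return value.

```c
#include <stdlib.h>

/* Hypothetical growable string; zero-initialize it, e.g.
 * struct String s = {0}; before first use. */
struct String {
    char *data;   /* NUL-terminated once anything has been appended */
    size_t len;   /* characters stored, not counting the NUL */
    size_t cap;   /* bytes allocated for data */
};

/* Empties the string but keeps its buffer for reuse. */
void stringClear(struct String *s) {
    s->len = 0;
    if (s->data) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full.
 * Returns 0 on success, or -1 if allocation fails, in which case
 * the string is left unchanged. */
int stringAppend(struct String *s, int ch) {
    if (s->len + 2 > s->cap) {            /* room for ch plus the NUL */
        size_t newcap = s->cap ? s->cap * 2 : 16;
        char *p = realloc(s->data, newcap);
        if (!p) return -1;
        s->data = p;
        s->cap = newcap;
    }
    s->data[s->len++] = (char)ch;
    s->data[s->len] = '\0';
    return 0;
}

/* Releases the buffer and resets the string to its empty state. */
void stringFree(struct String *s) {
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}
```

Because `readRow` reuses a single `field` for every column, a matching `rowAppend` would presumably have to copy the characters rather than keep the pointer. `struct Row` and `struct Table` could follow the same doubling pattern with arrays of strings and rows.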

【Comments】: