Read a log file and write another file with the specified format

【Question title】: Read a log file and write another with the specified format
【Posted】: 2019-05-19 04:00:07
【Question description】:

I have a log text file (*.txt) with about 2.5 million entries. Using C, I have to read it and write another file in a specific format.

The file that has to be read looks like this:

202.32.92.47 - - [01/Jun/1995:00:00:59 -0600] "GET /~scottp/publish.html" 200 271 - -
ix-or7-27.ix.netcom.com RFC-1413 John Thomas [01/Jun/1995:00:02:51 -0600] "GET /~ladd/ostriches.html" 200 205908 - "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - John Thomas [07/Dec/1995:13:20:28 -0600] "GET /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" 500 - "http://www.wikipedia.org/" "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" 
ppp-4.pbmo.net - - [07/Dec/1995:13:20:37 -0600] "GET /dcs/courses/cai/html/index.html HTTP/1.0" 500 4528 - - 
lbm2.niddk.nih.gov RFC-1413 - [07/Dec/1995:13:21:03 -0600] "GET /~ladd/vet_libraries.html" 200 11337 "http://www.wikipedia.org/" -

Each line of this (raw) log file has the format: IP ID NAME [DATE:TIME TIMEZONE] "METHOD DIR" STATUS MB "WEB" "FROM". To make that easier to see, here are the sample lines above split with ||:

|| ix-or7-27.ix.netcom.com || RFC-1413 || John Thomas || [01/Jun/1995 || :00:02:51 || -0600] || "GET || /~ladd/ostriches.html" || 200 || 205908 || - || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || John Thomas || [07/Dec/1995 || :13:20:28 || -0600] || "GET || /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0" || 500 || - || "http://www.wikipedia.org/" || "Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)" ||
|| ppp-4.pbmo.net || - || - || [07/Dec/1995 || :13:20:37 || -0600] || "GET || /dcs/courses/cai/html/index.html HTTP/1.0" || 500 || 4528 || - || - ||
|| lbm2.niddk.nih.gov || RFC-1413 || - || [07/Dec/1995 || :13:21:03 || -0600] || "GET || /~ladd/vet_libraries.html" || 200 || 11337 || "http://www.wikipedia.org/" || - ||

So, for example, for the first split line:

IP = ix-or7-27.ix.netcom.com 
ID = RFC-1413 
NAME = John Thomas 
DATE = 01/Jun/1995
TIME = 00:02:51 
TIMEZONE = -0600 
METHOD = GET 
DIR = /~ladd/ostriches.html
STATUS = 200 
MB = 205908 
WEB = -
FROM = Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)

(The value of each field can be text or -.)

The expected output is:

ix-or7-27.ix.netcom.com | RFC-1413 | John Thomas | 01/Jun/1995 | 00:02:51 | -06 | GET | /~ladd/ostriches.html | 200 | 205908 | - | Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)
ppp-4.pbmo.net | - | John Thomas | 07/Dec/1995 | 13:20:28 | -06 | GET | /dcs/courses/cai/html/introduction_lesson/index.html HTTP/1.0 | 500 | - | http://www.wikipedia.org/ | Mozilla/5.0 (X11; U; Linux i686; es-ES;rv:1.7.5)
ppp-4.pbmo.net | - | - | 07/Dec/1995 | 13:20:37 | -06 | GET | /dcs/courses/cai/html/index.html HTTP/1.0 | 500 | 4528 | - | -
lbm2.niddk.nih.gov | RFC-1413 | - | 07/Dec/1995 | 13:21:03 | -06 | GET | /~ladd/vet_libraries.html | 200 | 11337 | http://www.wikipedia.org/ | -

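So the output format is the original line split into fields, with | added between each field. Each field is captured as follows:

First parameter (IP): catch everything up to a space
Second parameter (ID): catch everything up to a space (can be a string or -)
Third parameter (NAME): catch everything up to [ (can be a string with spaces or -)
Fourth parameter (DATE): catch everything up to :
Fifth parameter (TIME): catch everything up to a space
Sixth parameter (TIMEZONE): catch everything up to ] (-dddd must be converted to -dd)
Seventh parameter (METHOD): catch everything up to a space
Eighth parameter (DIR): catch everything up to a space
Ninth parameter (STATUS): catch everything up to a space
Tenth parameter (MB): catch everything up to a space
Eleventh parameter (WEB): catch everything inside "" (or -)
Twelfth parameter (FROM): catch everything inside "" (or -)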
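
The last two rules differ from the others because WEB and FROM are either quoted text or a bare -. A minimal sketch of reading such a field by hand, assuming fields are separated by blanks and that quotes are never escaped inside the text. The name read_quoted_or_dash is hypothetical and not part of the original post:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies one WEB/FROM style field from *p into out.
 * The field is either a lone '-' or text wrapped in double quotes.
 * On success *p is advanced past the field and 0 is returned; on malformed
 * input or a too-small buffer, -1 is returned and *p is left unchanged. */
int read_quoted_or_dash(const char **p, char *out, size_t outsz)
{
    const char *s = *p;

    while (*s == ' ')                       /* skip leading blanks */
        s++;

    if (*s == '"') {
        const char *end = strchr(s + 1, '"');   /* find the closing quote */
        if (end == NULL)
            return -1;
        size_t len = (size_t)(end - (s + 1));
        if (len >= outsz)
            return -1;
        memcpy(out, s + 1, len);            /* copy the text without quotes */
        out[len] = '\0';
        *p = end + 1;
        return 0;
    }

    if (*s == '-' && outsz >= 2) {          /* empty field marker */
        out[0] = '-';
        out[1] = '\0';
        *p = s + 1;
        return 0;
    }

    return -1;
}
```

Calling it twice in a row on the tail of a line would yield WEB and then FROM, since each call leaves the cursor just after the field it read.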

Any idea how I can achieve this?

Thanks.

EDIT 1:

The code I use to read/write the file is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

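int main(void) {
    // variables
    char line[255];   // fgets() and strtok() need a char buffer, not int
    char *token;

    // open files
    FILE *fpr = fopen("myLogFile.txt","r");
    FILE *fpw = fopen("myFormattedLogFile.txt","w");
    if (fpr == NULL || fpw == NULL) {
        perror("fopen");
        return 1;
    }

    // read file
    while (fgets(line, sizeof line, fpr) != NULL) {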
        token = strtok(line, " ");
        while (token != NULL) {
            // write file
            fprintf(fpw, "%s | ", token);
            token = strtok(NULL, " ");
        }
        fprintf(fpw, "\n");
    }

    // close files
    fclose(fpr);
    fclose(fpw);

    return 0;
}

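But it does not work, because John Thomas is split into two values, and I don't know how to produce the correct format (remove [, ], ", change the number format, split the date and time, handle whether a field is a string or -, ...).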
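
For the number format in particular, one option is to treat the time zone as text and keep only the sign and the hour digits. A minimal sketch, assuming the input is always a sign followed by four digits as in the sample log. The name shorten_timezone is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keeps the sign and the two hour digits of a
 * "+hhmm" / "-hhmm" offset, e.g. "-0600" becomes "-06".
 * Returns 0 on success, -1 if the input does not have that shape
 * or out cannot hold three characters plus the terminator. */
int shorten_timezone(const char *tz, char *out, size_t outsz)
{
    if (outsz < 4 || strlen(tz) != 5)
        return -1;
    if (tz[0] != '+' && tz[0] != '-')
        return -1;
    for (int i = 1; i < 5; i++)
        if (tz[i] < '0' || tz[i] > '9')
            return -1;

    memcpy(out, tz, 3);     /* sign + two hour digits */
    out[3] = '\0';
    return 0;
}
```

Working on the text avoids the leading zero being lost, which is what happens when the offset is read as an integer and printed back.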

EDIT 2: @CHUX's solution

I have a few questions:

// 6th pattern. How can I recover it as a string?
// 7th pattern. How can I remove the first "?
// 8th pattern. How can I remove the last "?
// How can I catch everything inside ""? Which pattern should I use?
// What is the variable n?
// What is Invalid_Input? It appears as undeclared

The updated code after the solution is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LINE_LENGTH 255

// First parameter (IP): catch all up to space
#define IP_FMT "%s"
char IP[LINE_LENGTH];

// Second parameter (ID): catch all up to space (can be a string or a -)
#define ID_FMT "%s"
char ID[LINE_LENGTH];

// Third parameter (NAME): catch all up to [ (can be a string with spaces or a -)
#define NAME_FMT " %[^[]["
char NAME[LINE_LENGTH];

// Fourth parameter (DATE): catch all up to :
#define DATE_FMT " %11[^:]:"
char DATE[11+1];

// Fifth parameter (TIME): catch all up to space
#define TIME_FMT "%8s"
char TIME[8+1];

// Sixth parameter (TIMEZONE): catch all up to ] (-dddd must be converted in -dd)
#define TIMEZONE_FMT "%5d]"
int TIMEZONE;

// Seventh parameter (METHOD): catch all up to space
#define METHOD_FMT "%s"
char METHOD[LINE_LENGTH];

// Eighth parameter (DIR): catch all up to space
#define DIR_FMT "%s"
char DIR[LINE_LENGTH];

// Ninth parameter (STATUS): catch all up to space
#define STATUS_FMT "%s"
char STATUS[LINE_LENGTH];

// Tenth parameter (MB): catch all up to space
#define MB_FMT "%s"
char MB[LINE_LENGTH];

// Eleventh parameter (WEB): catch all inside "" (or -)

// Twelfth parameter (FROM): catch all inside "" (or -)



int main() {
    // variables
    char *line = malloc(LINE_LENGTH);
    char *token;
    int position = 0;

    // open files
    FILE *fpr = fopen("log.txt","r");
    FILE *fpw = fopen("myFormattedLogFile.txt","w");

    // read file
    while (fgets(line, LINE_LENGTH, fpr) != NULL) {

        int n = 0; 

        sscanf
            (
                line, 
                IP_FMT ID_FMT NAME_FMT DATE_FMT TIME_FMT TIMEZONE_FMT METHOD_FMT DIR_FMT STATUS_FMT MB_FMT " %n", 
                IP, ID, NAME, DATE, TIME, &TIMEZONE, METHOD, DIR, STATUS, MB, &n
            ); 

        NAME[strlen(NAME)-1] = '\0';

        fprintf
            (
                fpw, 
                "%s | %s | %s | %s | %s | %d | %s | %s | %s | %s\n", 
                IP, ID, NAME, DATE, TIME, TIMEZONE, METHOD, DIR, STATUS, MB
            );

    }

    // close files
    fclose(fpr);
    fclose(fpw);

    return 0;
}

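【Comments】:

Don't use C; use Perl or another scripting language (maybe awk). It will be as fast as the C code, more flexible, and easier to write.
@BurnsBA I have tried, but I thought that since my code does not work it was not important and would not help. I don't know whether C is the best option; I use it because it is the language I know. I have never heard of scripting tools or scripting languages. I will read about them.
@JuMoGar It would help if you posted your code; you should edit it into your post.
@JonathanLeffler I have never written any code in Perl. I would have to learn it, and I don't have time right now :(. Also, I use Windows, so I can't use awk either.
If you are on a POSIXy system (Linux, Mac), you can use POSIX regular expressions to split the input into fields: compile the expression with regcomp(), read each line with getline(), and apply the expression with regexec(), with each field as a submatch. (Each submatch is given as an offset, length tuple referring to the input line.) Then I would probably just use fwrite() to output the fields, and fputs() for the separators and the newline. It should not be many lines of code at all.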
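
The regular expression approach from the last comment could look roughly like the sketch below. The pattern, the NFIELDS constant and the format_line function are assumptions written for this illustration, not code from the commenter. It relies on POSIX <regex.h>, which is typically not shipped with the Microsoft C runtime, and it writes into a buffer instead of calling fwrite() so the result is easy to check. regexec() reports each submatch in a regmatch_t as start and end offsets (rm_so, rm_eo):

```c
#include <regex.h>
#include <stdio.h>

#define NFIELDS 12

/* Assumed pattern: one capture group per field of
 * IP ID NAME [DATE:TIME TIMEZONE] "METHOD DIR" STATUS MB WEB FROM.
 * The TIMEZONE group captures only the sign and the hour digits. */
static const char *LOG_PATTERN =
    "^([^ ]+) ([^ ]+) ([^[]+) "                          /* IP, ID, NAME */
    "\\[([^:]+):([^ ]+) ([+-][0-9][0-9])[0-9][0-9]] "    /* DATE, TIME, TZ */
    "\"([^ \"]+) ([^\"]*)\" "                            /* METHOD, DIR */
    "([^ ]+) ([^ ]+) "                                   /* STATUS, MB */
    "(\"[^\"]*\"|-) (\"[^\"]*\"|-)";                     /* WEB, FROM */

/* Formats the 12 fields of line into out, separated by " | ".
 * re must have been compiled from LOG_PATTERN with REG_EXTENDED.
 * Returns 0 on success, -1 if the line does not match or out is too small. */
int format_line(const regex_t *re, const char *line, char *out, size_t outsz)
{
    regmatch_t m[NFIELDS + 1];          /* m[0] is the whole match */
    size_t used = 0;

    if (regexec(re, line, NFIELDS + 1, m, 0) != 0)
        return -1;

    for (int i = 1; i <= NFIELDS; i++) {
        regoff_t so = m[i].rm_so, eo = m[i].rm_eo;
        if (eo - so >= 2 && line[so] == '"') {   /* drop surrounding quotes */
            so++;
            eo--;
        }
        int w = snprintf(out + used, outsz - used, "%s%.*s",
                         i > 1 ? " | " : "", (int)(eo - so), line + so);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

The expression would be compiled once with regcomp(&re, LOG_PATTERN, REG_EXTENDED) before the read loop and released with regfree(&re) afterwards.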

Tags: c windows file

【Solution 1】:

sscanf() and "%n" can do the job. NAME may need some post-processing.

With such a complicated format, I recommend using string literal concatenation:

// First parameter (IP): catch all up to space
#define IP_FMT "%s"
char IP[sizeof line];

// Second parameter (ID): catch all up to space (can be a string or a -)
#define ID_FMT "%s"
char ID[sizeof line];

// Third parameter (NAME): catch all up to [ (can be a string with spaces or a -)
#define NAME_FMT " %[^[]["
char NAME[sizeof line];

// Fourth parameter (DATE): catch all up to :
#define DATE_FMT " %11[^:]:"
char DATE[11+1];

// Fifth parameter (TIME): catch all up to space
#define TIME_FMT "%8s"
char TIME[8+1];

// Sixth parameter (TIMEZONE): catch all up to ] (-dddd must be converted in -dd)
#define TIMEZONE_FMT "%5d]"
int TIMEZONE;

// Other fields left for OP

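int n = 0;
// Adjacent string literals are concatenated into a single format string.
// " %n" stores the number of characters consumed so far; n stays 0 if any
// earlier directive fails to match.
sscanf(s, IP_FMT ID_FMT NAME_FMT DATE_FMT TIME_FMT TIMEZONE_FMT " %n",
    IP, ID, NAME, DATE, TIME, &TIMEZONE, &n);

// Invalid_Input and trim() are placeholders to be defined by the caller.
if (n == 0) return Invalid_Input;
trim(NAME);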
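
The answer does not define trim(). Because NAME_FMT reads everything up to [, NAME keeps the blank that precedes the bracket, so some trimming is needed. One possible implementation, given here as an assumption about what the helper is meant to do:

```c
#include <ctype.h>
#include <string.h>

/* Assumed implementation of the answer's undeclared trim():
 * removes leading and trailing white-space from s, in place. */
void trim(char *s)
{
    size_t len = strlen(s);

    /* Cut trailing white-space by moving the terminator left. */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';

    /* Count leading white-space, then shift the rest to the front. */
    size_t start = 0;
    while (isspace((unsigned char)s[start]))
        start++;
    if (start > 0)
        memmove(s, s + start, len - start + 1);   /* +1 copies the '\0' */
}
```

With this helper, a NAME read as "John Thomas " becomes "John Thomas", and "- " becomes "-".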

【Discussion】:

Thank you very much. This is amazing. I have updated the question (with EDIT 2) with some questions about your code. Could you take a look, please?
@JuMoGar 6: #define TIMEZONE_FMT "%5d]" int TIMEZONE; --> #define TIMEZONE_FMT "%5[^]]]" char TIMEZONE[5+1];
@JuMoGar 7: #define METHOD_FMT " \"%s" char METHOD[sizeof line];
@JuMoGar 8: #define DIR_FMT " %[^\"]\"" char DIR[sizeof line];
@JuMoGar n is used to detect whether the scan succeeded, as shown by if (n == 0) return Invalid_Input; in this answer. For more details, research scan sets and "%n" in sscanf(). The rest is left to you.
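
Putting the answer and the follow-up comments together, a complete line parser might look like the sketch below. It is an illustration under assumptions rather than the answerer's final code: the struct entry type, the buffer sizes, and the names scan_quoted_or_dash, parse_line and format_entry are all hypothetical, and it assumes that quoted fields are non-empty and contain no embedded quotes.

```c
#include <stdio.h>
#include <string.h>

/* One format fragment per field; adjacent literals are concatenated. */
#define IP_FMT     "%127s"
#define ID_FMT     " %63s"
#define NAME_FMT   " %127[^[]["       /* up to '[', may contain blanks */
#define DATE_FMT   "%11[^:]:"         /* up to ':' */
#define TIME_FMT   "%8s"
#define TZ_FMT     " %5[^]]]"         /* up to ']', kept as text */
#define METHOD_FMT " \"%15s"          /* skips the opening quote */
#define DIR_FMT    " %255[^\"]\""     /* up to the closing quote */
#define STATUS_FMT " %7s"
#define MB_FMT     " %15s"

/* Buffer sizes are assumptions chosen to match the widths above. */
struct entry {
    char ip[128], id[64], name[128], date[12], time[9], tz[6];
    char method[16], dir[256], status[8], mb[16], web[256], from[256];
};

/* Reads a field that is either "quoted text" or a lone '-' into out
 * (at least 256 bytes). Returns characters consumed, or 0 on failure. */
static int scan_quoted_or_dash(const char *s, char *out)
{
    int n = 0;
    if (sscanf(s, " \"%255[^\"]\"%n", out, &n) == 1 && n > 0)
        return n;
    n = 0;
    if (sscanf(s, " -%n", &n) != EOF && n > 0)
        strcpy(out, "-");
    return n;
}

/* Splits one log line into e. Returns 0 on success, -1 on bad input. */
int parse_line(const char *line, struct entry *e)
{
    int n = 0;      /* stays 0 unless every directive before %n matched */
    int got = sscanf(line,
        IP_FMT ID_FMT NAME_FMT DATE_FMT TIME_FMT TZ_FMT
        METHOD_FMT DIR_FMT STATUS_FMT MB_FMT " %n",
        e->ip, e->id, e->name, e->date, e->time, e->tz,
        e->method, e->dir, e->status, e->mb, &n);
    if (got != 10 || n == 0)
        return -1;

    /* NAME_FMT stops at '[', so NAME keeps the blank before it. */
    size_t len = strlen(e->name);
    while (len > 0 && e->name[len - 1] == ' ')
        e->name[--len] = '\0';

    e->tz[3] = '\0';                    /* "-0600" becomes "-06" */

    int used = scan_quoted_or_dash(line + n, e->web);
    if (used == 0 || scan_quoted_or_dash(line + n + used, e->from) == 0)
        return -1;
    return 0;
}

/* Writes the fields joined by " | " into out. Returns 0, or -1 if
 * out is too small. */
int format_entry(const struct entry *e, char *out, size_t outsz)
{
    int w = snprintf(out, outsz,
        "%s | %s | %s | %s | %s | %s | %s | %s | %s | %s | %s | %s",
        e->ip, e->id, e->name, e->date, e->time, e->tz,
        e->method, e->dir, e->status, e->mb, e->web, e->from);
    return (w >= 0 && (size_t)w < outsz) ? 0 : -1;
}
```

Inside the fgets() loop from the question, each line would go through parse_line() and then format_entry(), and lines for which parse_line() returns -1 could be skipped or logged. Every conversion carries a width so that an unexpectedly long field cannot overflow its buffer.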