How to read and extract information from a file that is being continuously updated? Answers

【Question title】: How to read and extract information from a file that is being continuously updated?
【Posted】: 2010-09-09 20:14:01
【Question】:

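This is how I plan to build the utilities for a project:

logdump dumps log results to the file log. If the file already exists, the results are appended to it (for example, if a new file is created every month, the results are appended to that month's file).
extract reads the log results file and extracts the relevant results, depending on the arguments provided.
The problem is that I do not want to wait for logdump to finish writing to log before I start processing it. Doing it that way would also mean remembering how far into log I have already read before extracting more information, which is not what I want.
I need live results, so that whenever something is added to the log results file, extract picks up the required results.
The processing extract does will be generic (it will depend on some of its command-line arguments), but it will certainly happen line by line.

This means reading the file while it is being written and continuing to monitor it for new updates even after reaching the end of the log file.

How can I do this using C, C++, shell scripting, or Perl?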

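【Question comments】:

In these cases, I try to change the logging so that it goes to a database. It is then easy to fetch the records you have not processed yet. If you have not designed the logging part yet, that might be the way to go.

Tags: c++ c perl shell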

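【Solution 1】:

tail -f reads the file and, when it reaches EOF, keeps monitoring it for updates instead of exiting. It is an easy way to read a log file in "real time". It could be as simple as: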

tail -f log.file | extract

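Or perhaps tail -n 0 -f, so that it prints only new lines and not the existing ones. Or tail -n +0 -f to display the entire file and then continue with updates as they arrive.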

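【Discussion】:

This does what I need, but is there a way to do the same thing using C or C++?
@Lazer: You could always "cheat" and look at the "Hacker's Man Page", that is, the source code for tail. IIRC, it is pretty simple C code. Look here: stackoverflow.com/questions/1439799/…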

【Solution 2】:

The traditional unix tool for this is tail -f, which keeps reading data appended to its argument until you kill it. So you can do

tail -c +1 -f log | extract

In the unix world, reading from a file that is continuously appended to is known as "tailing". In Perl, the File::Tail module performs the same task.

use File::Tail;
my $log_file = File::Tail->new("log");
while (defined (my $log_line = $log_file->read)) {
    process_line($log_line);
}

【Discussion】:

【Solution 3】:

Using a simple stand-in for logdump

#! /usr/bin/perl

use warnings;
use strict;

open my $fh, ">", "log" or die "$0: open: $!";
select $fh;
$| = 1;  # disable buffering

for (1 .. 10) {
  print $fh "message $_\n" or warn "$0: print: $!";
  sleep rand 5;
}

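and the following skeleton for extract, you can get the processing you want. When logfile hits end-of-file, logfile.eof() is true. Calling logfile.clear() resets all error state, and then we sleep and try again.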

#include <iostream>
#include <fstream>
#include <cerrno>
#include <cstring>
#include <string>
#include <unistd.h>

int main(int argc, char *argv[])
{
  const char *path;
  if      (argc == 2) path = argv[1];
  else if (argc == 1) path = "log";
  else {
    std::cerr << "Usage: " << argv[0] << " [ log-file ]\n";
    return 1;
  }

  std::ifstream logfile(path);
  std::string line;
  next_line: while (std::getline(logfile, line))
    std::cout << argv[0] << ": extracted [" << line << "]\n";

  if (logfile.eof()) {
    sleep(3);
    logfile.clear();
    goto next_line;
  }
  else {
    std::cerr << argv[0] << ": " << path << ": " << std::strerror(errno) << '\n';
    return 1;
  }

  return 0;
}

It is not as interesting as watching it live, but the output is

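./extract: extracted [message 1]
./extract: extracted [message 2]
./extract: extracted [message 3]
./extract: extracted [message 4]
./extract: extracted [message 5]
./extract: extracted [message 6]
./extract: extracted [message 7]
./extract: extracted [message 8]
./extract: extracted [message 9]
./extract: extracted [message 10]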
^C

I left the interrupt in the output to emphasize that this is an infinite loop.

Using Perl as a glue language, extract can get lines from the log by way of tail:

#! /usr/bin/perl

use warnings;
use strict;

die "Usage: $0 [ log-file ]\n" if @ARGV > 1;
my $path = @ARGV ? shift : "log";

open my $fh, "-|", "tail", "-c", "+1", "-f", $path
  or die "$0: could not start tail: $!";

while (<$fh>) {
  chomp;
  print "$0: extracted [$_]\n";
}
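The same glue idea can be sketched in C++ with POSIX popen. The helper name for_each_line_from_command is made up for this example. Unlike the list form of open in the Perl version, the command string here goes through the shell, so a path containing spaces or shell metacharacters would need quoting.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <stdio.h>  // popen, pclose (POSIX, not standard C++)

// Hypothetical helper: run `command` through the shell, call `on_line`
// for every line it writes to standard output, and return the status
// from pclose (or -1 if the command could not be started).
int for_each_line_from_command(
    const std::string& command,
    const std::function<void(const std::string&)>& on_line)
{
  FILE* pipe = popen(command.c_str(), "r");
  if (!pipe) return -1;

  std::string line;
  int ch;
  while ((ch = std::fgetc(pipe)) != EOF) {
    if (ch == '\n') {
      on_line(line);
      line.clear();
    } else {
      line.push_back(static_cast<char>(ch));
    }
  }
  if (!line.empty()) on_line(line);  // last line had no newline

  return pclose(pipe);
}
```

Called with "tail -c +1 -f log" as the command, the loop would keep delivering lines until tail exits, which mirrors the Perl version above.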

Finally, if you insist on doing the heavy lifting yourself, there is a related Perl FAQ:

How do I do a tail -f in perl?

First try
seek(GWFILE, 0, 1);
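The statement seek(GWFILE, 0, 1) does not change the current position, but it does clear the end-of-file condition on the handle, so that the next <GWFILE> makes Perl try again to read something.

If that does not work (it relies on features of your stdio implementation), then you need something more like this: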
for (;;) {
  for ($curpos = tell(GWFILE); <GWFILE>; $curpos = tell(GWFILE)) {
    # search for some stuff and put it into files
  }
  # sleep for a while
  seek(GWFILE, $curpos, 0);  # seek to where we had been
}
If this still does not work, look into the clearerr method from IO::Handle, which resets the error and end-of-file states on the handle.

There is also a File::Tail module from CPAN.
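For readers who want the FAQ's second approach in C or C++, here is a rough analogue using stdio. The helper name read_appended_lines is hypothetical. It seeks back to the last remembered position (a successful fseek also clears the end-of-file indicator), reads whatever has been appended, and advances the position only past complete lines, so a half-written line is read again on the next call. It assumes the file is only ever appended to, never truncated or rotated.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: starting at byte offset `pos`, append every
// newline-terminated line found in `fp` to `out`. `pos` is moved to
// just after the last complete line. Returns the number of lines read.
std::size_t read_appended_lines(std::FILE* fp, long& pos,
                                std::vector<std::string>& out)
{
  // Seek to where we had been; this also clears the EOF indicator.
  if (std::fseek(fp, pos, SEEK_SET) != 0) return 0;

  std::size_t count = 0;
  std::string line;
  int ch;
  while ((ch = std::fgetc(fp)) != EOF) {
    if (ch == '\n') {
      out.push_back(line);
      line.clear();
      pos = std::ftell(fp);  // remember the end of this complete line
      ++count;
    } else {
      line.push_back(static_cast<char>(ch));
    }
  }
  return count;
}
```

A caller would open the log with std::fopen(path, "rb"), start with pos = 0, and call the helper in a loop with a sleep between calls, much like the Perl loop above. std::clearerr(fp) is the stdio counterpart of the clearerr method mentioned above, should the seek alone not be enough.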

【Discussion】: