Multithreading to parse a stream of data from multiple files: answers

【Question title】: Multithreading to parse stream of data from multiple files
【Posted】: 2021-01-23 04:01:26
【Question】:

I have a Python program that parses a stream of data. It is invoked like this:

tail -F /path1/restapi.log -F /path2/restapi.log | parse.py

parse.py parses the data it reads with sys.stdin.readline:

import re
import sys
import json
from functools import reduce  # a builtin on Python 2, must be imported on Python 3

def deep_get(dictionary, keys, default=None):
    # Walk a dotted key path such as "a.b.c" through nested dictionaries.
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)

# Timestamp followed by the status word at the start of a request line.
regexp_date_status = re.compile(r'(\d+-\d+-\d+ \d+:\d+:\d+.\d+\+\d+) (\w+)')

while True:
    line = sys.stdin.readline()
    if not line:
        break
    if re.search(r'Request #\d+: {', line):
        date_status = regexp_date_status.match(line)
        json_str = '{\n'
        while True:
            next_line = sys.stdin.readline()
            if not next_line:
                break  # EOF inside a JSON body: stop instead of looping forever
            json_str += next_line
            try:
                d = json.loads(json_str) # we have our dictionary, perhaps
            except ValueError:
                pass  # the body is still incomplete, keep reading
            else:
                username = (deep_get(d,"context.authorization.authUserName", default="Username not found"))
                hostname = (deep_get(d,"context.headers.X-Forwarded-For"))
                uri      = (deep_get(d,"context.uri"))
                verb     = (deep_get(d,"context.verb"))

                print("State->{} : Date->{} : User->{} : Host->{} : URI->{} : Verb->{}".format(date_status.group(2), date_status.group(1), username,hostname,uri,verb))

                break

I want to make this multithreaded, because the number of files can grow to as many as 30:

tail -F /path1/restapi.log -F /path2/restapi.log /path3/restapi.log -F /path4/restapi.log .... | parse.py

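In this case, how do I distribute the work between threads, given that the data is streamed and parsed until I get a valid dictionary in the try block? Do I also need to use a queue here?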

【Comments】:

Tags: python multithreading python-2.7 python-multiprocessing python-multithreading

【Solution 1】:

Let bash handle multiple instances of parse.py. Something like:

echo -e 'file1.log\nfile2.log\nfile3.log'| xargs -n 1 --max-procs 10 -I % sh -c 'tail -f % |parse.py'

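xargs starts one pipeline per file name and runs up to --max-procs of them at once, so each parse.py instance reads only one file's stream.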

Note the "\n" separators in the file list: xargs needs one file name per line.
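The same idea can also be kept inside one Python process, which speaks to the queue part of the question. The following is only a sketch under stated assumptions, not part of the original answer: it assumes Python 3 (the question is tagged python-2.7, where the module is named Queue), and the names parse_stream, follow, and run are hypothetical. Each worker thread runs its own tail -F for a single file, so JSON bodies from different files can never interleave, and a queue hands finished records to the main thread.

```python
import json
import re
import subprocess
import threading
from queue import Queue  # on Python 2.7 this module is named Queue

REQUEST_START = re.compile(r'Request #\d+: {')

def parse_stream(lines, results):
    """Collect the JSON body after each 'Request #N: {' line and queue it as a dict."""
    lines = iter(lines)
    for line in lines:
        if not REQUEST_START.search(line):
            continue
        json_str = '{\n'
        # Keep consuming the same iterator until the body parses.
        for body_line in lines:
            json_str += body_line
            try:
                record = json.loads(json_str)
            except ValueError:
                continue  # body is still incomplete, keep reading
            results.put(record)
            break

def follow(path, results):
    """Worker: run a private `tail -F` for one file and parse only that stream."""
    proc = subprocess.Popen(['tail', '-F', path], stdout=subprocess.PIPE,
                            universal_newlines=True)
    parse_stream(proc.stdout, results)

def run(paths):
    """Start one worker thread per file and print records as they arrive."""
    results = Queue()
    for path in paths:
        worker = threading.Thread(target=follow, args=(path, results))
        worker.daemon = True  # do not keep the process alive on Ctrl-C
        worker.start()
    while True:
        print(results.get())
```

Calling run(sys.argv[1:]) with the log paths would then replace the shell pipeline. The work here is mostly waiting on I/O, so threads are adequate; the deep_get formatting from the question could be applied to each record before it is printed.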

A toy example:

echo -e "hello\nto\nyou" |xargs -n 1 --max-procs 2 -I % sh -c 'sleep 3; echo %'

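This runs the sleep and echo in at most two processes at a time (xargs starts processes, not threads). 'hello' and 'to' appear first, after about 3 seconds, and 'you' follows after a further delay.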
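For comparison, the effect of --max-procs can be imitated in Python with a bounded worker pool. This is a rough analogue under assumptions, not the answer's method: it uses threads rather than the separate processes xargs starts, it assumes Python 3, and delayed_echo and run_bounded are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def delayed_echo(word, delay=3.0):
    """Sleep, then return the word, like `sleep 3; echo %`."""
    time.sleep(delay)
    return word

def run_bounded(words, max_workers=2, delay=3.0):
    """Run delayed_echo over words with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map returns results in input order, whatever order the tasks finish in.
        return list(pool.map(lambda w: delayed_echo(w, delay), words))
```

With three words and two workers, the third word cannot start until one of the first two has finished, so the whole run takes about two delays rather than one.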

【Comments】: