[Question Title]: Convert postgres text log file to csv file
[Posted]: 2020-12-17 08:46:39
[Question]:

I am trying to reformat a text log into a CSV file. Text log file format: each entry that starts with the prefix ("t=%m p=%p h=%h db=%d u=%u x=%x") is treated as one row. A row may contain \n and \r escape sequences.

t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)
t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 DETAIL:  parameters: $1 = '9187372'
t=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.005 ms  bind S_1: COMMIT
t=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.004 ms  execute S_1: COMMIT
t=2020-08-25 15:00:00.001 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)
t=2019-12-19 17:00:00.102 +03 p=58042 h= db= u= x=0 LOG:  automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0
    pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983
    buffer usage: 90 hits, 0 misses, 0 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The SQL statements that follow the prefix are, as usual, not fixed; they vary from line to line.

It would be perfect if the prefixes could be dropped as well; each row should be formatted as follows:

"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","LOG:"," duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)"
"2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","DETAIL:"," parameters: $1 = '9187372'"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG:"," duration: 0.005 ms  bind S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG:"," duration: 0.004 ms  execute S_1: COMMIT"
"2020-08-25 15:00:00.001 +03","16205","127.0.0.1","test","test_app","0","LOG:"," duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)"
"2019-12-19 17:00:00.102 +03","58042","","","","0","LOG:"," automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"

Regex101: https://regex101.com/r/R3vADD/4

But I am not sure about the last part of the expected row: when the CSV file is copied into the DB there may be a problem, because the table name carries embedded double quotes:

" automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"
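RFC 4180 CSV escapes an embedded double quote by doubling it, so the message field can be made safe before it is wrapped in quotes. A minimal sketch with sed (the input line is taken from the log above; this is one possible pre-processing step, not part of the accepted script):

```shell
# Double every embedded quote so the field stays valid CSV (RFC 4180).
printf '%s\n' 'automatic vacuum of table "postgres.pgagent.pga_job": index scans: 0' \
  | sed 's/"/""/g'
# → automatic vacuum of table ""postgres.pgagent.pga_job"": index scans: 0
```

PostgreSQL's COPY ... WITH (FORMAT csv) accepts this doubled-quote form directly.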

Thanks, everyone.

[Comments]:

    Tags: regex linux csv awk sed


    [Solution 1]:

    Using GNU awk for FPAT, the 3rd arg to match(), and \s/\S as shorthand for [[:space:]]/[^[:space:]]:

    $ cat tst.awk
    BEGIN {
        FPAT = "[[:alnum:]]+=[^=]* "   # a field is name=value up to the next space
        OFS = ","
    }
    /^\S/ { if (NR>1) prt() }          # a line starting at the margin begins a new record
    { prev = prev $0 }                 # accumulate continuation lines
    END { prt() }
    
    function prt(   orig, i, a) {
        orig = $0
        $0 = prev
    
        # split the record into prefix, LOG/DETAIL keyword, and message
        match($0,/(.* )(LOG|DETAIL): +(.*)/,a)
    
        $0 = a[1]                      # assigning to $0 re-splits it using FPAT
        $(NF+1) = a[2]
        $(NF+1) = a[3]
    
        for (i=1; i<=NF; i++) {
            gsub(/^\s+|\s+$/,"",$i)    # trim leading/trailing whitespace
            sub(/^\S+=/,"",$i)         # drop the name= part of each prefix field
            gsub(/"/,"\"\"",$i)        # escape embedded quotes CSV-style ("")
            printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)
        }
    
        $0 = orig
        prev = ""
    }
    


    $ awk -f tst.awk file
    "2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","LOG","duration: 0.011 ms  execute S_40: SELECT ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID FROM DB_LOG WHERE (ID = $1)"
    "2020-08-25 15:00:00.000 +03","16205","127.0.0.1","test","test_app","0","DETAIL","parameters: $1 = '9187372'"
    "2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG","duration: 0.005 ms  bind S_1: COMMIT"
    "2020-08-25 15:00:00.001 +03","36001","127.0.0.1","test","test_app","0","LOG","duration: 0.004 ms  execute S_1: COMMIT"
    "2020-08-25 15:00:00.001 +03","16205","127.0.0.1","test","test_app","0","LOG","duration: 0.018 ms  bind S_41: INSERT INTO DB_LOG (ID, EKLEME_ZAMANI, EKLEYEN_KULLANICI_ID, GORULME_DURUMU, GUNCELLEME_ZAMANI, GUNCELLEYEN_KULLANICI_ID, IP_ADRESI, ISLEM_ZAMANI, ISLEMI_YAPAN_KULLANICI_ID, METOD, PARAMETRE_JSON, UYGULAMA_ID, VERSIYON, DURUM_ID) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)"
    "2019-12-19 17:00:00.102 +03","58042","","","","0","LOG","automatic vacuum of table ""postgres.pgagent.pga_job"": index scans: 0    pages: 0 removed, 9 remain, 0 skipped due to pins, 0 skipped frozen    tuples: 0 removed, 493 remain, 472 are dead but not yet removable, oldest xmin: 20569983    buffer usage: 90 hits, 0 misses, 0 dirtied    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s"
    

    The last line of the expected output in your question contains " automatic vacuum of table "postgres.pgagent.pga_job": index ..." but that is not valid CSV, since you cannot have unescaped double quotes inside a double-quoted string. To be valid CSV it would have to be either " automatic vacuum of table ""postgres.pgagent.pga_job"": index ..." or " automatic vacuum of table \"postgres.pgagent.pga_job\": index ..." (which escaping construct is used depends on which "standard" is followed by whatever tool you are going to read it with; see What's the most robust way to efficiently parse CSV using awk?). I decided to use "" for that case in the script above since that is what MS-Excel expects, but using \" instead would be a trivial tweak if that's what you need: just change the gsub(/"/,"\"\"",$i) so its replacement produces \" instead of "".
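    The two escaping styles can be compared side by side. A small sketch using plain awk's gsub() on a shortened message (the "" form first, then the \" form):

```shell
# "" style: double the embedded quote (what MS-Excel expects)
echo 'automatic vacuum of table "x"' | awk '{ gsub(/"/,"\"\""); print "\"" $0 "\"" }'
# \" style: backslash-escape the embedded quote
echo 'automatic vacuum of table "x"' | awk '{ gsub(/"/,"\\\\\""); print "\"" $0 "\"" }'
```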

    [Discussion]:

    • Nice use of the gawk features! One question: if you use match() on $0 and then overwrite $0, why do you still need FPAT? If I comment out FPAT the results differ, but I just can't see why. Care to explain?
    • @vgersh99 I'm using match() to separate $0 into 3 parts: the part before LOG|DETAIL, which is where FPAT will be used, then LOG or DETAIL, then the part after it. Does that make sense or is it still unclear?
    • I understand that completely. The Q is: why do you need FPAT defined if you only call match() on $0, which does no field splitting? And then you overwrite $0...
    • When I do $0 = a[1] after calling match(), $0 has the value t=2020-08-25 15:00:00.000 +03 p=16205 h=127.0.0.1 db=test u=test_app x=0, and field splitting using FPAT happens on that assignment, setting $1 to t=2020-08-25 15:00:00.000 +03 and $2 to p=16205.
    • Thank you so much @EdMorton, could you explain it line by line, or point me to the documentation relevant to each line?
    [Solution 2]:

    Here you go: https://regex101.com/r/R3vADD/1

    ^t=(.* .*) p=(\d+)? h=(.*)? db=(\w+)? u=(\w+)? x=(\d+)? (\w+:) (.*)
    

    This will capture the groups, which you can substitute like this:

    "\1","\2","\3","\4","\5","\6","\7","\8"
    

    An example using Perl on the command line:

    perl -pe 's/^t=(.* .*) p=(\d+)? h=(.*)? db=(\w+)? u=(\w+)? x=(\d+)? (\w+:) (.*)/"\1","\2","\3","\4","\5","\6","\7","\8"/' file.log > file.csv
    

    [Discussion]:

    • Thanks, but how would I use this regex? I need something like 'sed regex input>output'
    • I've added an example to my answer.
    • Could you add those lines to the sample log in your original question, so we can improve the regex?
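    For completeness, a sed equivalent of the Perl one-liner (what the first comment asked for) might look like this sketch. sed -E has no \d or \w, so the classes are spelled out; the sample line below is taken from the question:

```shell
# Rewrite one log line into a quoted CSV row with GNU/BSD sed -E.
echo 't=2020-08-25 15:00:00.001 +03 p=36001 h=127.0.0.1 db=test u=test_app x=0 LOG:  duration: 0.005 ms  bind S_1: COMMIT' \
  | sed -E 's/^t=(.* .*) p=([0-9]+) h=([^ ]*) db=([^ ]*) u=([^ ]*) x=([0-9]+) ([A-Z]+:) (.*)/"\1","\2","\3","\4","\5","\6","\7","\8"/'
```

In real use it would be run as sed -E '...' input.log > output.csv. Note that, like the Perl version, this handles only single-line entries; the multi-line vacuum entries need the awk approach from Solution 1.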