Answers to: Problems converting text from fixed length fields to json

【Question title】: Problems converting text from fixed length fields to json
【Posted】: 2018-02-19 23:19:23
【Question description】:

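I have a working solution that downloads a zipped fixed-length-field text file from a client's sftp server, unzips it with a password, runs GNU awk on the file to convert it to a pipe-delimited text file, and then cleans up after itself.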

Here is the Bash script:

#!/bin/bash
export ZipPassword=********
export SSHPASS=********
export WorkPath=/Users/administrator/Documents/Work/
export ArcPath=/Users/administrator/Documents/Work/archive/
export DownPath=/Users/administrator/Documents/Work/down/
export InPath=/Users/administrator/Documents/Work/input/
export ReadyPath=/Users/administrator/Documents/Work/preproc/
export OutPath=/Users/administrator/Documents/Work/Output/
export AwkPath=/Users/administrator/Documents/Work/scpost.awk


cd $DownPath

sshpass -e sftp -oBatchMode=no -b - ****@*****.*******.*** << !
    cd /frommbi
    get *.zip
    rm *.zip
    exit
!


for f in *.zip
do 
    cp -v "$f" "$InPath"
    cp -v "$f" "$ArcPath"
    rm *.zip
done    

shopt -s nullglob dotglob     # To include hidden files
files=($InPath*)
if [ ${#files[@]} -gt 0 ]; then


unzip -P $ZipPassword $InPath*.zip -d $ReadyPath


for f in $ReadyPath
do
    export PathName=/Users/administrator/Documents/Work/PreProc/*.TXT
    echo $PathName
    export FileName=`basename $PathName`
    echo $FileName
    echo $OutPath$FileName

awk -f $AwkPath $PathName > $OutPath$FileName

done



rm -f $InPath*
rm -f $ReadyPath*

fi

Here is the content of the awk file:

BEGIN{FIELDWIDTHS=" 3 2 2 18 5 9 10 10 10 14 16 30 30 30 30 30 30 30 30 45 45 45 45 45 45 45 45 16 28 6 1 1 3 2 6 2 4 3 2 30 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 3 3 3 40 6 5 6 20 7 20 2 6 13 6 6 6 32 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 40 2 6 20 30 11 12 3 1 14 14 1 4 4 4 4 4 4 4 12 28 30 8 2 1 8 8 8 8 8 10 12 8 130 1 7 65 3 82 512 528 1 "; 
OFS="|";
}
{
for (i=1;i<=NF;i++) gsub (/^ */,"",$i);for(i=1;i<=NF;i++) gsub("^[ \t]*|[ \t]*$","",$i);
}
{
print$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141,$142,$143,$144,$145,$146,$147,$148,$149,$150,$151,$152,$153,$154,$155,$156,$157,$158,$159,$160,$161,$162,$163,$164,$165,$166,$167,$168,$169,$170,$171
}

Note that the field names here are numbers, so that they can be mapped in the database later.

I have installed jq to handle the conversion from the pipe-delimited data to json, but I cannot get the syntax right.

Here is the modified bash script; the changes are at lines 52 - 56:

#!/bin/bash
export ZipPassword=********
export SSHPASS=********
export WorkPath=/Users/administrator/Documents/Work/
export ArcPath=/Users/administrator/Documents/Work/archive/
export DownPath=/Users/administrator/Documents/Work/down/
export InPath=/Users/administrator/Documents/Work/input/
export ReadyPath=/Users/administrator/Documents/Work/preproc/
export OutPath=/Users/administrator/Documents/Work/Output/
export AwkPath=/Users/administrator/Documents/Work/scpost.awk
export JsonPath=/Users/administrator/Documents/Work/JSON/


cd $DownPath

sshpass -e sftp -oBatchMode=no -b - ****@*****.*******.*** << !
    cd /frommbi
    get *.zip
    rm *.zip
    exit
!


for f in *.zip
do 
    cp -v "$f" "$InPath"
    cp -v "$f" "$ArcPath"
    rm *.zip
done    

shopt -s nullglob dotglob     # To include hidden files
files=($InPath*)
if [ ${#files[@]} -gt 0 ]; then


unzip -P $ZipPassword $InPath*.zip -d $ReadyPath


for f in $ReadyPath
do
    export PathName=/Users/administrator/Documents/Work/PreProc/*.TXT
    echo $PathName
    export FileName=`basename $PathName`
    echo $FileName
    echo $OutPath$FileName

awk -f $AwkPath $PathName > $OutPath$FileName

done
chmod 776 $OutPath$FileName

jq -Rn  --slurp --raw-input --raw-output \'
( input  | split("|") ) as $keys |
( inputs | split("|") ) as $vals |
[[$keys, $vals] | transpose[] | {key:.[0],value:.[1]}] | from_entries
' $OutPath$FileName > $JsonPath$FileName



rm -f $InPath*
rm -f $ReadyPath*
rm -f $JsonPath*


fi

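Can anyone help? Before you ask: I am using this approach for sheer conversion speed. My Mac Pro converts 100,000 records of 2850 characters each in about 20 seconds, and it does so every day. Converting to json would greatly speed up the next step of the process.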

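【Comments on the question】:

If the problem is with jq, it would be better to remove the rest of the script and show only the input given to jq and the expected output.
The problem may not be jq at all. The best solution might be to code the awk file differently and convert from fixed-length fields to json without the intermediate format. Perhaps I should have mentioned that I am open to alternative solutions.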

Tags: json bash awk jq

【Solution 1】:

You're almost there. Since you are using input and inputs (which is definitely the right way to go), you don't want to "slurp" the file.

jq  -nrR '
 ( input  | split("|") ) as $keys
 | ( inputs | split("|") ) as $vals
 | [[$keys, $vals] | transpose[] | {key:.[0], value:.[1]|tonumber}]
 | from_entries
'

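By the way, you could easily combine the awk+jq steps into a single awk step or a single jq step. Doing so would save a lot of unnecessary reworking. If you choose to stick with awk, I would focus on shortening that ridiculously long "print $1, $2, ..." statement. (Wouldn't "print $0" suffice?)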
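As a rough illustration of the single-awk-step idea, the sketch below goes straight from fixed-width records to one JSON object per line, with no pipe-delimited intermediate file. It uses only POSIX awk (substr instead of gawk's FIELDWIDTHS). The function name fixed2json, the widths "3 5", and the sample records are made up for illustration, and it assumes the first record holds the field names, as the jq filter above does.

```shell
#!/bin/sh
# fixed2json WIDTHS: read fixed-width records on stdin, write one JSON object per line.
# Assumption: the first record holds the field names (hypothetical layout).
fixed2json() {
  awk -v widths="$1" '
    # Strip leading and trailing blanks from a field.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Escape backslashes and double quotes for a JSON string.
    # Control characters are not handled in this sketch.
    function esc(s,   out, i, c) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "\"") out = out "\\"
        out = out c
      }
      return out
    }
    BEGIN { n = split(widths, w, " ") }
    # Cut every record into fields according to the widths.
    {
      pos = 1
      for (i = 1; i <= n; i++) { f[i] = trim(substr($0, pos, w[i])); pos += w[i] }
    }
    # The first record supplies the keys.
    NR == 1 { for (i = 1; i <= n; i++) key[i] = f[i]; next }
    # Every later record becomes one JSON object with string values.
    {
      line = "{"
      for (i = 1; i <= n; i++)
        line = line (i > 1 ? "," : "") "\"" esc(key[i]) "\":\"" esc(f[i]) "\""
      print line "}"
    }
  '
}

# Example call with made-up data:
#   printf '%s\n' '1  2    ' 'C$ 00042' | fixed2json "3 5"
```

All values are emitted as strings, which keeps leading zeros intact. The character-by-character escaping is written for portability rather than speed, so it would need measuring against the 20-second budget mentioned in the question.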

awk's FIELDWIDTHS is certainly handy, so the next section presents a jq filter that splits an input string into fields, given an array of field widths.

Parsing fixed-length fields with jq

# Given a string, emit a stream of the fields defined by the array of widths
def fixedfields(widths):
  foreach widths[] as $w ({s:.}; (.field = .s[:$w]) | (.s |= .[$w:]); .field);

If your jq does not have foreach, here is an alternative implementation:

def fixedfields(widths):
  def do_while(cond; f; g): def r: select(cond) | f | (g, r); r;
  {s:., w: widths}
  | do_while(.w|length > 0;
             .w[0] as $w | {s: .s[$w:], w: .w[1:], field: .s[:$w] };
             .field);

【Comments】:

The code at the top is generating an error I can't seem to resolve: jq: error (at /Users/administrator/Documents/Work/Output/pCycle1219103821.TXT:2): Invalid numeric literal at EOF at line 1, column 2 (while parsing 'C$')
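The error reported above is consistent with tonumber being handed a non-numeric string such as "C$". One possible workaround, sketched here under that assumption, is to convert a value only when it parses as a number and keep it as a string otherwise. The wrapper name pipe_to_json and the sample data are made up for illustration.

```shell
#!/bin/sh
# pipe_to_json: the first line of stdin holds the pipe-delimited keys;
# each later line becomes one compact JSON object.
pipe_to_json() {
  jq -cnR '
    ( input | split("|") ) as $keys
    | ( inputs | split("|") ) as $vals
    | [ [$keys, $vals] | transpose[]
        # Convert to a number when possible; otherwise keep the string as-is.
        | {key: .[0], value: (.[1] | (tonumber? // .))} ]
    | from_entries
  '
}

# Example call with made-up data:
#   printf '%s\n' '1|2|3' 'C$|42|7.5' | pipe_to_json
```

Note that numeric conversion drops leading zeros, so fields such as codes or account numbers are probably better left as strings by omitting the conversion for them.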