Reduce processing time for 'while read' loop: answers

【Question title】: Reduce processing time for 'While read' loop
【Posted】: 2021-12-11 12:09:41
【Question】:

I'm new to shell scripting.

I have a huge CSV file whose 11th field (f11) varies in length, for example:

"000000aaad000000bhb200000uwwed..."
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
. .

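After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk. Then I have to join them with the delimiter '|', like this: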

0aaa|0bhb|uwwe...
0aba|bbrb|0wwq|caba|0bhb|0qwe...

The processed f11 is then joined with the other fields.

This is the time taken to process 10K records ->

real 4m43.506s
user 0m12.366s
sys 0m12.131s

20K records ->
real 5m20.244s
user 2m21.591s
sys 3m20.042s

80K records (about 3.7 million f11 splits and joins with '|') ->

real 21m18.854s
user 9m41.944s
sys 13m29.019s

I expect the time for processing 650K records to be 30 minutes (about 56 million f11 splits and joins). Is there any way to optimize this?

while read -r line1; do
    f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
    echo $f10 >> $path/other_fields
    
    f11=$( echo $line1 | cut -d',' -f11 )
    f11_trim=$(echo "$f11" | tr -d '"')
    echo $f11_trim | fold -w10 > $path/f11_extract 

    cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
    
    arr=($(cat $path/str_list_trim))
    printf "%s|" ${arr[@]} >> $path/str_list_serialized
    printf '\n' >> $path/str_list_serialized
    arr=()
    
    rm $path/f11_extract
    rm $path/str_list_trim

done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized

paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

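【Comments】:

I suggest using only sed or awk.
Are the quotes part of the file? The example you show does not contain a ",". Please add to your question sample input with field separators (no descriptions, no images, no links) and the desired output with field separators (not as a comment).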
"xx","x",x,x,x,xx,xx,"x",x,11,"00000aaaaD00000bbbbD00000abcdD00000dwasD00000dedsD00000ddfgD00000dsdfD00000snfjD00000dj0,0000000wedfD00,0000000wedfD00,0000000wedfDx0 ,xx,xx,"x",x,5,"00000aaaaD00000bbbbD00000abcdD00000dwasD00000deds"
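Please don't post it as a comment.
@GD, thanks for your effort. Please state your sample input and expected output more clearly in your question. Also describe, in the question, the logic that produces the expected output, to make it clearer. Thank you.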

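Tags: arrays shell awk while-loop

【Solution 1】:

Your code is inefficient for the following reasons:

It invokes multiple commands, including awk, inside the loop, so several processes are started for every record.
It generates many intermediate temporary files.

You can do the whole job with awk alone: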

awk -F, -v OFS="," '                                    # assign input/output field separator to a comma
{
    len = length($11)                                   # length of the 11th field
    s = ""; d = ""                                      # clear output string and the delimiter
    for (i = 1; i <= len / 10; i++) {                   # iterate over the 11th field
        s = s d substr($11, (i - 1) * 10 + 6, 4)        # concatenate 6-9th substring of 10 characters long chunks
        d = "|"                                         # set the delimiter to a pipe character
    }
    $11 = "\"" s "\""                                   # assign the 11th field to the generated string
} 1' "$input"                                           # the final "1" tells awk to print all fields

Sample input:

1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:

1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"
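
The question's own pipeline differs from the awk command above in two details: it strips the double quotes from f11 before splitting, and it writes only fields 1-5, 7, 9 and 10 next to the result. Below is a minimal sketch of how the same approach might be adapted. It assumes that no field contains an embedded comma (plain `-F,` splitting would miscount the fields otherwise), and the function name `extract_f11` is hypothetical, not part of the original post. Like the command above, it uses only complete 10-character chunks, whereas `fold -w10` in the original loop can also emit a trailing partial chunk.

```shell
# Hypothetical helper: reads CSV from the files given as arguments (or from
# stdin when none are given) and prints fields 1-5, 7, 9, 10 followed by the
# processed 11th field. Assumes no commas occur inside quoted fields.
extract_f11() {
    awk -F, -v OFS="," '
    {
        f11 = $11
        gsub(/"/, "", f11)                            # drop double quotes, as tr -d does in the question
        s = ""; d = ""                                # output string and delimiter start empty
        n = int(length(f11) / 10)                     # number of complete 10-character chunks
        for (i = 1; i <= n; i++) {
            s = s d substr(f11, (i - 1) * 10 + 6, 4)  # characters 6-9 of chunk i
            d = "|"                                   # delimiter is used from the second chunk on
        }
        print $1, $2, $3, $4, $5, $7, $9, $10, "\"" s "\""
    }' "$@"
}
```

It could be called as `extract_f11 "$input" > "$path/final_out"`, which would replace the loop, the two `sed -i` calls and the final `paste` with a single pass over the file.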

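【Discussion】:

Super helpful, thank you. A significant improvement: processing 80K records now takes only about 5 seconds, where it took 21 minutes before. real 0m5.485s user 0m4.971s sys 0m0.080s
To clarify, are the characters 's' and 'd' awk keywords?
No, they are just variable names; s stands for string and d for delimiter.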
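
To illustrate that point, here is the same "empty delimiter first, real delimiter afterwards" idiom written with longer variable names. This is only a sketch: the function name `join_with_pipe` and the variables `joined` and `sep` are made up for the example, and it joins whitespace-separated words rather than 10-character chunks.

```shell
# Hypothetical helper: joins the whitespace-separated words of each input
# line with "|". The variables joined and sep play the roles of s and d.
join_with_pipe() {
    awk '{
        joined = ""; sep = ""            # sep is empty before the first word
        for (i = 1; i <= NF; i++) {
            joined = joined sep $i       # adjacent values are concatenated in awk
            sep = "|"                    # every later word is preceded by a pipe
        }
        print joined
    }'
}
```

Because `sep` is still empty when the first word is appended, no leading or trailing `|` is produced, so no clean-up step such as `sed 's/.$//'` is needed afterwards.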