【Question title】: Extracting text data using bash utilities
【Posted】: 2015-05-18 12:07:30
【Question】:

I have a very important task: extracting some relevant data from a large CSV log that looks like this:

Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
1,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
1,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
1,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
1,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472
...
2,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
2,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
2,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
2,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
2,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472
...
n,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
n,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
n,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
n,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
n,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472

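Here I would like to pick a specified value in the 2nd column (residue) and write out how its last column (total energy) evolves as a function of the 1st column (frame number, i.e. the snapshot number). In other words, I need to 1) first sort all the data by the second column, that is, select every line whose second column equals the specified value (e.g. n=27):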

#Frame, #Residue

1,27, ... , # last-column value, the one I am interested in
2,27, ... , # last-column value, the one I am interested in
3,27, ... , # last-column value, the one I am interested in
3,27, ... , # last-column value, the one I am interested in

and then 2) extract the corresponding value of the last column, so that the resulting log has only 3 columns:

#Frame, #Residue, # Total energy

1,27, # last-column value, the one I am interested in
2,27, # last-column value, the one I am interested in
3,27, # last-column value, the one I am interested in
3,27, # last-column value, the one I am interested in

I would appreciate any implementation using awk and sed!

Thanks!

Gleb

【Comments】:

    Tags: bash text multiple-columns


    【Solution 1】:

    To extract the lines that have 27 in the second column, you can use grep:

      grep '^[^,]\+,27,' input.csv
            | |   |
    beginning |   |
        not comma |
                  repeated
    

    To output only columns 1, 2 and 8, use cut:

    grep '^[^,]\+,27,' input.csv | cut -d, -f1,2,8
                                        |   |
                                  delimiter |
                                           fields
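
    As a quick check, the grep/cut pipeline above can be tried on a few rows. This is only a sketch: the helper name `extract_residue` and the sample rows are made up for illustration, and the pattern uses `[^,][^,]*` as a POSIX-portable way of writing "one or more non-comma characters" instead of the GNU-specific `\+`.

```shell
# Hypothetical helper (name is illustrative): print frame, residue and TOTAL
# for one residue. $1 = residue number, $2 = CSV file.
extract_residue() {
    # The comma after the residue number keeps 27 from also matching 270, 271, ...
    grep "^[^,][^,]*,$1," "$2" | cut -d, -f1,2,8
}

# Made-up sample rows in the same 8-column layout as the log in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,27,1.0,2.0,3.0,4.0,5.0,15.0
1,270,1.0,2.0,3.0,4.0,5.0,99.0
2,27,1.5,2.0,3.0,4.0,5.0,15.5
EOF

extract_residue 27 "$sample"
```

    With this sample, only the two rows for residue 27 are printed; the header and the row for residue 270 are skipped.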
    

    To sort the file by the second column, you can use sort:

    sort -t, -nk2,2 input.csv
          |   | |
    delimiter | |
        numeric |
        sort    by only the second field
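
    One thing to watch for: the log starts with a header line, and a plain sort treats it like any other row. A minimal sketch, assuming the first line is always the header, that keeps it on top and sorts the remaining rows by residue and then by frame. The helper name `sort_by_residue` and the shortened three-column sample are hypothetical.

```shell
# Hypothetical helper: sort a CSV by column 2, then column 1 (both numeric),
# while leaving line 1 (assumed to be the header) in place.
sort_by_residue() {
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t, -k2,2n -k1,1n
}

# Made-up, shortened sample (only three columns) to keep the example small.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,TOTAL
1,2,-38.9
1,1,-105.7
2,2,-38.9
2,1,-105.7
EOF

sort_by_residue "$sample"
```

    The rows come out grouped by residue (1,1 and 2,1 first, then 1,2 and 2,2), with the header still on the first line.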
    

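    【Comments】:

    • You could add a comma after "27"; otherwise it can also match larger numbers such as 270, 271, 271337...: grep '^[^,]\+,27,' input.csv | cut -d, -f1,2,8
    • \+ is not defined in POSIX basic regular expressions, so you are relying on a grep that happens to treat \+ as "1 or more". That is, it should really be *
    • Thanks! One question: what should I add to the script to stop extracting these lines after the i-th match from the initial data.csv? For example, to extract only n lines with such a command.
    • @user3470313: You can pipe the output to head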
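
    The head suggestion from the last comment could look like the sketch below. The helper name `first_rows` and the sample rows are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: like the grep | cut pipeline, but keep only the
# first $3 matching rows. $1 = residue, $2 = CSV file, $3 = row limit.
first_rows() {
    grep "^[^,][^,]*,$1," "$2" | cut -d, -f1,2,8 | head -n "$3"
}

# Made-up sample: four frames for residue 27, eight columns each.
sample=$(mktemp)
printf '%s\n' \
    '1,27,0,0,0,0,0,10.0' \
    '2,27,0,0,0,0,0,20.0' \
    '3,27,0,0,0,0,0,30.0' \
    '4,27,0,0,0,0,0,40.0' > "$sample"

first_rows 27 "$sample" 2
```

    Here only frames 1 and 2 are printed, although four rows match.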
    【Solution 2】:

    Here is an awk solution:

    awk -v n=27 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF }' input.csv
    
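    • -v n=27 - first assigns the value 27 to an awk variable n
    • BEGIN { OFS = FS = "," } - the BEGIN section runs before awk starts parsing any data. Here we set both FS (the field separator) and OFS (the output field separator) to ",", so that input lines are split on commas and output lines are joined with commas.
    • $2 == n { print $1, $2, $NF } - for any record (line) whose second field ($2) equals n, output the first, second and last fields.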
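
    To see the pieces described above working together, here is a small sketch that wraps the same awk program in a shell function. The name `residue_energy` and the sample rows are hypothetical. Because `$2 == n` compares whole fields, a row for residue 270 is not matched when n is 27, and `$NF` picks the last column regardless of how many columns a row has.

```shell
# Hypothetical wrapper around the awk one-liner.
# $1 = residue number, $2 = CSV file.
residue_energy() {
    awk -v n="$1" 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF }' "$2"
}

# Made-up sample rows in the layout of the log from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,27,1.0,2.0,3.0,4.0,5.0,15.0
1,270,1.0,2.0,3.0,4.0,5.0,99.0
2,27,1.5,2.0,3.0,4.0,5.0,15.5
EOF

residue_energy 27 "$sample"
```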

    To stop after m matches:

    awk -v n=27 -v m=3 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF; if (++count == m) exit}' input.csv
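
    A small sketch of how the early exit behaves, with the residue and the limit passed as shell arguments. The helper name `first_matches` and the sample rows are made up for illustration. Since awk exits as soon as the counter reaches m, the rest of a large log is not read.

```shell
# Hypothetical wrapper: print at most $2 rows for residue $1 from file $3.
first_matches() {
    awk -v n="$1" -v m="$2" 'BEGIN { OFS = FS = "," }
        # count starts out empty (treated as 0) and is incremented on each match
        $2 == n { print $1, $2, $NF; if (++count == m) exit }' "$3"
}

# Made-up sample: four frames for residue 27, eight columns each.
sample=$(mktemp)
printf '%s\n' \
    '1,27,0,0,0,0,0,10.0' \
    '2,27,0,0,0,0,0,20.0' \
    '3,27,0,0,0,0,0,30.0' \
    '4,27,0,0,0,0,0,40.0' > "$sample"

first_matches 27 2 "$sample"
```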
    

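    【Comments】:

    • Thanks! The same question again: what should I add to the script to stop extracting these lines after the i-th match from the initial data.csv? For example, to extract only n lines with such a command.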