【Question title】: Split CSV into two files based on column matching values in an array in bash / posh
【Posted】: 2020-03-09 20:08:05
【Question description】:

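I have an input CSV that I want to split into two CSV files. If the value in column 4 matches any value in WLTarray, the row should go to output file 1; otherwise it should go to output file 2.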

WLT array:

"22532" "79994" "18809" "21032"

Input CSV file:

header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"

Output CSV file 1:

header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"

Output CSV file 2:

header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"

I have been looking at awk to filter this (python and perl are not an option in my environment), but I suspect there is a smarter way:

  declare -a WLTarray=("22532" "79994" "18809" "21032")
  for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
  do
        awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
        # now filter to remove non matching values? why not just move the rows entirely?        
  done

    Tags: bash csv awk sed


    【Solution 1】:

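    With regular awk, you can use split and substr (the latter strips the double quotes for the comparison) and split the csv file as you describe. For example, you could use: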

    awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
            split (s,a," ")     # split s into array a
            for (i in a)        # loop over each index in a
                b[a[i]]=1       # use value in a as index for b
        }
        FNR == 1 {      # first record, write header to both output files
            print $0 > "output1.csv"
            print $0 > "output2.csv"
            next
        }
        substr($4,2,length($4)-2) in b {    # 4th field w/o quotes in b?
            print $0 > "output1.csv"        # write to output1.csv
            next
        }
        { print $0 > "output2.csv" }        # otherwise write to output2.csv
    ' input.csv
    

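    Where:

    • in the BEGIN {...} rule, you set the field separator (FS) to a comma and split the string holding the field-4 values wanted in output1.csv into array a, then loop over the values in a, using them as indices in array b (which allows a simple in b membership check);
    • the first rule applies to the first record in the file (the header line), which is simply written to both output files;
    • the next rule strips the double quotes around field 4 and then checks whether the number in field 4 matches an index in array b. If it does, the record is written to output1.csv; otherwise it is written to output2.csv.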
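The quote stripping in the last rule can be checked on its own. A minimal sketch, using one row from the sample data; the variable name field4 is illustrative only:

```shell
# Take one sample row, split on commas, and drop the first and last
# character of field 4 (the surrounding double quotes).
field4=$(printf '%s\n' '"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"' |
    awk -F, '{ print substr($4, 2, length($4) - 2) }')

echo "$field4"    # prints 22532
```

Note that this only works while no quoted field before column 4 contains a comma, since FS="," splits on every comma.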

    Example input file

    $ cat input.csv
    header1,header2,header3,header4,header5,header6,header7,header8
    "83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
    "83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
    "83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
    "83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
    

    Resulting output files

    $ cat output1.csv
    header1,header2,header3,header4,header5,header6,header7,header8
    "83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
    "83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
    "83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
    
    $ cat output2.csv
    header1,header2,header3,header4,header5,header6,header7,header8
    "83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
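If the values live in a bash array, as in the question, one possible variant passes them to awk with -v instead of hard-coding the string. This is a sketch, assuming the values contain no spaces or backslashes; the names list and want are illustrative, and the array keys keep their quotes here so no substr call is needed:

```shell
#!/usr/bin/env bash
# Recreate the sample input so the example is self-contained.
cat > input.csv <<'EOF'
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
EOF

declare -a WLTarray=("22532" "79994" "18809" "21032")

# "${WLTarray[*]}" joins the elements with a space (default IFS).
awk -F, -v list="${WLTarray[*]}" '
    BEGIN {
        n = split(list, a, " ")
        for (i = 1; i <= n; i++)
            want["\"" a[i] "\""] = 1    # key includes the double quotes
    }
    FNR == 1 {                          # header goes to both files
        print > "output1.csv"
        print > "output2.csv"
        next
    }
    { print > (($4 in want) ? "output1.csv" : "output2.csv") }
' input.csv
```

The file is read once, so rows keep their input order in both output files.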
    


      【Solution 2】:

      You can use gawk like this:

      test.awk

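      #!/usr/bin/gawk -f
      BEGIN {
          split("22532 79994 18809 21032", a)
          for(i in a) {
              WLTarray[a[i]]    # referencing the element creates the key
          }
          FPAT="[^\",]+"        # a field is a run of characters other than quotes and commas
      }
      
      NR == 1 {                 # header line goes to both output files
          print > "output1.csv"
          print > "output2.csv"
          next
      }
      
      {
          if ($4 in WLTarray) {
              print > "output1.csv"
          } else {
              print > "output2.csv"
          }
      }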
      

      Make it executable and run it like this:

      chmod +x test.awk
      ./test.awk input.csv
      


        【Solution 3】:

        Using grep with a filter file as input is the simplest answer.

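        declare -a WLTarray=("22532" "79994" "18809" "21032")
        : > output.WLT.csv    # start from an empty file so reruns do not accumulate rows
        for WLTvalue in "${WLTarray[@]}"
        do
            # pass the quoted value in with -v instead of building the program with eval
            awk -F, -v v="\"$WLTvalue\"" '$4 == v' input.csv >> output.WLT.csv
        done
        # -x matches whole lines, -F treats each matched row as a fixed string, not a regex
        # note: output.WLT.csv gets no header row; output.NonWLT.csv keeps it
        grep -v -x -F -f output.WLT.csv input.csv > output.NonWLT.csv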
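A grep-only variant is also possible if both files should carry the header. The following is a sketch, not part of the original answer: it assumes every field is double-quoted and that the array values contain no regex metacharacters. The file name patterns.txt is hypothetical.

```shell
#!/usr/bin/env bash
# Recreate the sample input so the example is self-contained.
cat > input.csv <<'EOF'
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
EOF

declare -a WLTarray=("22532" "79994" "18809" "21032")

# One anchored pattern per value: three quoted fields, then the value as
# field 4. printf repeats the format once for each array element.
printf '^"[^"]*","[^"]*","[^"]*","%s",\n' "${WLTarray[@]}" > patterns.txt

# Header goes to both files; data rows are split by the pattern file.
head -n 1 input.csv | tee output.WLT.csv > output.NonWLT.csv
tail -n +2 input.csv | grep    -f patterns.txt >> output.WLT.csv
tail -n +2 input.csv | grep -v -f patterns.txt >> output.NonWLT.csv
```

Anchoring each pattern at the start of the line keeps a value such as 22532 in column 7 ("22532:718") from being treated as a column 4 match.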
        
