【Question title】: CSV to JSON using BASH
【Posted】: 2014-08-09 15:09:59
【Question】:

I am trying to convert the CSV below to JSON format.

Africa,Kenya,NAI,281
Africa,Kenya,NAI,281
Asia,India,NSI,100
Asia,India,BSE,160
Asia,Pakistan,ISE,100
Asia,Pakistan,ANO,100
European Union,United Kingdom,LSE,100

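This is the desired JSON format, which I have been unable to produce. I'll post my work in progress below it. Any help or guidance would be appreciated...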

  {"name":"Africa",
      "children":[
      {"name":"Kenya",
          "children":[
          {"name":"NAI","size":"109"},
          {"name":"NAA","size":"160"}]}]},
  {"name":"Asia",
      "children":[
      {"name":"India",
          "children":[
          {"name":"NSI","size":"100"},
          {"name":"BSE","size":"60"}]},
  {"name":"Pakistan",
      "children":[
      {"name":"ISE","size":"120"},
      {"name":"ANO","size":"433"}]}]},
  {"name":"European Union",
        "children":[
        {"name":"United Kingdom",
            "children":[
            {"name":"LSE","size":"550"},
            {"name":"PLU","size":"123"}]}]}

Work in progress.

$1 is the file containing the CSV values pasted above.

#!/bin/bash

pcountry=$(head -1 $1 | cut -d, -f2)

cat $1 | while read line ; do 

region=$(echo $line|cut -d, -f1)
country=$(echo $line|cut -d, -f2)
code=$(echo $line|cut -d, -f3-)
size=$(echo $line|cut -d, -f4)

if test "$pcountry" == "$country" ;
  then 
  echo -e {\"name\":\"$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\}
  else
      if test "$pregion" == "$region"
      then :
      else 
          echo -e ,'\n'{\"name\":\""$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\},


pcountry=$country
pregion=$region

fi ; done

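The problem is that I can't seem to find a way to tell when the values for a country end.

【Comments】:

  • Why bash? Python, which can read and write both CSV and JSON, would be a better choice for this task.
  • I would suggest an awk script for this kind of thing rather than bash+cut. Or, if you don't have to stick to the classic shell tools, use something like Perl or Python.
  • You can assume that a country's values end when you see a new country (risky) or reach EOF (safe). If the countries are always grouped under the correct region, pre-sorting removes the risk. There is an ambiguity in the data format provided.
  • Python, nodeJS, and Perl have better support for converting data between CSV and JSON because of the libraries available.
  • Thank you all for the comments. The reason I'm using BASH is that I don't know any other language. I just picked up BASH to get my work done.. I think I know what to "pick up" next. Python :) Special thanks to @David Atchley for the script... you're a champ!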
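
The suggestion in the comments above (pre-sort, then treat a change of region or country, or the end of input, as the end of a group) could be sketched with sort and awk roughly like this. It is only an illustration and not code from the thread: the function name csv_to_tree is made up, and it assumes that fields contain no commas, double quotes, or backslashes, because it does no CSV unquoting or JSON escaping.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): print nested JSON for a
# region,country,code,size CSV file given as $1.
# Assumption: no commas, double quotes, or backslashes inside fields.
csv_to_tree() {
    # Sorting by region, then country, makes every group contiguous, so a
    # group ends exactly when its key changes or the input runs out.
    # -s (stable) keeps the original order of the codes within a country.
    LC_ALL=C sort -s -t, -k1,1 -k2,2 "$1" | awk -F, '
        BEGIN { printf "[" }
        NF < 4 { next }                          # skip blank or short lines
        {
            if (seen && $1 != region)            # region changed: close country and region
                printf "]}]},"
            else if (seen && $2 != country)      # same region, new country: close country
                printf "]},"
            else if (seen)                       # same country: separate the leaves
                printf ","
            if (!seen || $1 != region)
                printf "\n{\"name\":\"%s\",\"children\":[", $1
            if (!seen || $1 != region || $2 != country)
                printf "\n  {\"name\":\"%s\",\"children\":[", $2
            printf "\n    {\"name\":\"%s\",\"size\":\"%s\"}", $3, $4
            region = $1; country = $2; seen = 1
        }
        END {
            if (seen) printf "]}]}"              # end of input closes the last groups
            print "\n]"
        }'
}
```

Called as `csv_to_tree INPUTFILE.csv`, this writes the JSON array to standard output. Because of the sort, regions and countries come out in byte order rather than in the order they appear in the file.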

Tags: json bash shell csv


【Solution 1】:

You would be better off using a tool like xidel, which can manipulate CSV / raw text and understands JSON.

I'm assuming so_24300508.csv contains:

Africa,Kenya,NAI,109
Africa,Kenya,NAA,160
Asia,India,NSI,100
Asia,India,BSE,60
Asia,Pakistan,ISE,120
Asia,Pakistan,ANO,433
European Union,United Kingdom,LSE,550
European Union,United Kingdom,PLU,123

(This is taken from your JSON sample rather than from the CSV sample you provided.)

xidel -s so_24300508.csv --json-mode=deprecated --xquery '
  [
    let $csv:=x:lines($raw)
    for $region in distinct-values($csv ! tokenize(.,",")[1])
    return {
      "name":$region,
      "children":[
        for $country in distinct-values($csv[starts-with(.,$region)] ! tokenize(.,",")[2]) return {
          "name":$country,
          "children":for $data in $csv[starts-with(.,$region) and contains(.,$country)]
          let $value:=tokenize($data,",")
          return {
            "name":$value[3],
            "size":$value[4]
          }
        }
      ]
    }
  ]
'

(Without --json-mode=deprecated, replace [ ] with array{ }.)

For the intermediate steps that led to this query, see this code snippet.
See also this online xidelcgi demo.

Output:

[
  {
    "name": "Africa",
    "children": [
      {
        "name": "Kenya",
        "children": [
          {
            "name": "NAI",
            "size": "109"
          },
          {
            "name": "NAA",
            "size": "160"
          }
        ]
      }
    ]
  },
  {
    "name": "Asia",
    "children": [
      {
        "name": "India",
        "children": [
          {
            "name": "NSI",
            "size": "100"
          },
          {
            "name": "BSE",
            "size": "60"
          }
        ]
      },
      {
        "name": "Pakistan",
        "children": [
          {
            "name": "ISE",
            "size": "120"
          },
          {
            "name": "ANO",
            "size": "433"
          }
        ]
      }
    ]
  },
  {
    "name": "European Union",
    "children": [
      {
        "name": "United Kingdom",
        "children": [
          {
            "name": "LSE",
            "size": "550"
          },
          {
            "name": "PLU",
            "size": "123"
          }
        ]
      }
    ]
  }
]

【Comments】:

    【Solution 2】:

    Here is a solution using jq.

    If filter.jq contains the following filter

     reduce (
         split("\n")[]                  # split string into lines
       | split(",")                     # split data
       | select(length>0)               # eliminate blanks
     )  as [$c1,$c2,$c3,$c4] (          # convert to object 
         {}                             #   e.g. "Africa": { "Kenya": {  
       ; setpath([$c1,$c2,"name"];$c3)  #           "name": "NAI",
       | setpath([$c1,$c2,"size"];$c4)  #           "size": "281"        
    )                                   #        }, }
    | [                                 # then build final array of objects format:
        keys[] as $k1                   # [ {                                               
      | {name: $k1, children: (         #   "name": "Africa",                                  
           .[$k1]                       #   "children": {                                   
         | keys[] as $k2                #     "name": "Kenya",                                 
         | {name: $k2, children:.[$k2]} #     "children": { "name": "NAI", "size": "281" }
        )}                              #   ...
      ]
    

    and data contains the sample data, then the command

    $ jq -M -Rsr -f filter.jq data
    

    produces

    [
      {
        "name": "Africa",
        "children": {
          "name": "Kenya",
          "children": {
            "name": "NAI",
            "size": "281"
          }
        }
      },
      {
        "name": "Asia",
        "children": {
          "name": "India",
          "children": {
            "name": "BSE",
            "size": "160"
          }
        }
      },
      {
        "name": "Asia",
        "children": {
          "name": "Pakistan",
          "children": {
            "name": "ANO",
            "size": "100"
          }
        }
      },
      {
        "name": "European Union",
        "children": {
          "name": "United Kingdom",
          "children": {
            "name": "LSE",
            "size": "100"
          }
        }
      }
    ]
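
Note that the output above nests a single object under each "children" key and keeps only the last code seen for each country, which differs from the arrays of children asked for in the question. One possible variant, offered only as a sketch and not as the answer author's filter, groups the rows with jq's group_by instead. It assumes jq 1.5 or later (for inputs) and fields without embedded commas; the wrapper name csv_to_tree_jq is made up for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper (not from the answer): read the CSV file $1 as raw
# lines and group the rows by region, then by country, into arrays.
# Assumption: fields contain no embedded commas.
csv_to_tree_jq() {
    jq -Rn '
      [ inputs | select(length > 0) | split(",") ]   # one array per non-empty row
      | group_by(.[0])                               # rows that share a region
      | map({ name: .[0][0],
              children: ( group_by(.[1])             # rows that share a country
                          | map({ name: .[0][1],
                                  children: map({ name: .[2], size: .[3] }) }) ) })
    ' "$1"
}
```

Since group_by sorts its groups by key, regions and countries appear in sorted order rather than file order. Called as `csv_to_tree_jq data`, it prints the nested array to standard output.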
    

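    【Comments】:

      【Solution 3】:

      As many commenters have said, using the shell for this kind of conversion is a terrible idea. It is also nearly impossible with bash builtins alone; shell scripts are meant for combining standard Unix commands such as sed, awk, cut, etc. You should pick a better language, one built for this kind of iterative parsing/processing, to solve your problem.

      However, since it's late and I've had too much coffee, I threw together a bash script (with a bit of sed to help with the parsing) that takes the example .csv data you have and outputs JSON in the format you noted. Here is the script: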

      #! /bin/bash 
      # Initial input file format:
      #
      #         Africa,Kenya,NAI,281
      #         Africa,Kenya,NAA,281
      #         Asia,India,NSI,100
      #         Asia,India,BSE,160
      #         Asia,Pakistan,ISE,100
      #         Asia,Pakistan,ANO,100
      #         European Union,United Kingdom,LSE,100
      #
      # Intermediate file format for parsing to JSON:
      #
      #         Africa|Kenya:NAI=281&NAA=281
      #         Asia|India:NSI=100&BSE=160|Pakistan:ISE=100&ANO=100
      #         European Union|United Kingdom:LSE=100
      #
      # Call as:
      #
      #   $ ./script INPUTFILE.csv >OUTPUTFILE.json
      #
      
      
      # temporary files for output/parsing
      TMP="./tmp.dat"
      TMP2="./tmp2.dat"
      >$TMP
      >$TMP2
      
      # read through initial file and output intermediate format
      while read line
      do
          region=$(echo $line | cut -d, -f1)
          country=$(echo $line | cut -d, -f2)
          code=$(echo $line | cut -d, -f3)
          size=$(echo $line | cut -d, -f4)
      
          # region record already started
          if grep -q "^$region|" "$TMP"; then
              >$TMP2 
              while read -r rec
              do
                  if echo "$rec" | grep -q "^$region|"
                  then
                      # -F matches "|country:" literally; in GNU grep's basic regex syntax "\|" is alternation
                      if echo "$rec" | grep -qF "|$country:"
                      then
                          echo "$rec" | sed -e 's/\(|'"$country"':[^|]*\)/\1\&'"$code"'='"$size"'/' >>$TMP2
                      else
                          echo "$rec|$country:$code=$size" >>$TMP2
                      fi
                  else
                      echo "$rec" >>$TMP2
                  fi
              done < $TMP
              mv $TMP2 $TMP
          else
          # new region
              echo "$region|$country:$code=$size" >>$TMP
          fi
      
      done < $1
      
      # Parse through our intermediary format and output JSON to standard out
      echo "["
      country_count=$(cat $TMP | wc -l)
      while read line
      do
          country=$(echo $line | cut -d\| -f1)
          echo "{ \"name\": \"$country\", "
          echo "  \"children\": ["
          region_count=$(echo $line | cut -d\| -f2- | sed -e 's/|/\n/g' | wc -l)
          echo $line | cut -d\| -f2- | sed -e 's/|/\n/g' | 
          while read region
          do
              name=$(echo $region | cut -d: -f1)
              echo "    { \"name\": \"$name\", "
              echo "      \"children\": ["
                  code_count=$(echo $region | sed -e 's/^'"$name"'://' -e 's/&/\n/g'  | wc -l)
                  echo $region | sed -e 's/^'"$name"'://' -e 's/&/\n/g'  |
                  while read code_size
                  do
                      code=$(echo $code_size | cut -d= -f1)
                      size=$(echo $code_size | cut -d= -f2)
                      code_count=$((code_count - 1))
                      COMMA=""
                      if [ $code_count -gt 0 ]; then
                        COMMA=","
                      fi
                      echo "        { \"name\": \"$code\", \"size\": \"$size\" }$COMMA " 
                  done
              echo "      ]"
              region_count=$((region_count - 1))
              if [ $region_count -gt 0 ]; then
                  echo "    },"
              else
                  echo "    }"
              fi
          done 
          echo "  ]"
          country_count=$((country_count - 1))
          COMMA=""
          if [ $country_count -gt 0 ]; then
              COMMA=","
          fi    
          echo "}$COMMA"
      
      done < $TMP
      echo "]"
      
      exit 0
      

      And here is the resulting output from the above script:

      [
      { "name": "Africa",
        "children": [
          { "name": "Kenya",
            "children": [
              { "name": "NAI", "size": "281" },
              { "name": "NAA", "size": "281" }
            ]
          }
        ]
      },
      { "name": "Asia",
        "children": [
          { "name": "India",
            "children": [
              { "name": "NSI", "size": "100" },
              { "name": "BSE", "size": "160" }
            ]
          },
          { "name": "Pakistan",
            "children": [
              { "name": "ISE", "size": "100" },
              { "name": "ANO", "size": "100" }
            ]
          }
        ]
      },
      { "name": "European Union",
        "children": [
          { "name": "United Kingdom",
            "children": [
              { "name": "LSE", "size": "100" }
            ]
          }
        ]
      }
      ]
      

      Please don't use the above code in any production environment.
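
For comparison, the same idea of collecting everything per region and per country before printing can be kept in memory instead of in temporary files. The following is only a rough sketch and not the script above: it needs bash 4 or later for associative arrays, the function name csv_to_tree_bash is made up, and it assumes that fields contain no commas, double quotes, backslashes, or "|" characters, since it does no escaping.

```shell
#!/bin/bash
# Hypothetical sketch (not the answer's script); requires bash 4+ (declare -A).
# Assumption: no commas, double quotes, backslashes, or "|" inside fields.
csv_to_tree_bash() {
    local -A countries_of leaves_of   # region -> "c1|c2|" ; "region|country" -> JSON leaves
    local -a regions=()               # regions in first-seen order
    local region country code size key
    # IFS=, lets read split each row itself, replacing four cut calls per line.
    while IFS=, read -r region country code size || [ -n "$region" ]; do
        [ -n "$region" ] || continue  # skip blank lines
        key="$region|$country"
        if [ -z "${countries_of[$region]+x}" ]; then   # first row for this region
            regions+=("$region")
            countries_of[$region]=""
        fi
        if [ -z "${leaves_of[$key]+x}" ]; then         # first row for this country
            countries_of[$region]+="$country|"
            leaves_of[$key]=""
        else
            leaves_of[$key]+=","
        fi
        leaves_of[$key]+="{\"name\":\"$code\",\"size\":\"$size\"}"
    done < "$1"

    # Second pass: walk the collected groups and emit compact JSON.
    local out="" rsep="" csep list
    for region in "${regions[@]}"; do
        out+="$rsep{\"name\":\"$region\",\"children\":["
        csep=""
        list=${countries_of[$region]}
        while [ -n "$list" ]; do
            country=${list%%|*}       # next country name
            list=${list#*|}           # rest of the list
            key="$region|$country"
            out+="$csep{\"name\":\"$country\",\"children\":[${leaves_of[$key]}]}"
            csep=","
        done
        out+="]}"
        rsep=","
    done
    printf '[%s]\n' "$out"
}
```

Called as `csv_to_tree_bash INPUTFILE.csv`, this prints the whole array on one line and keeps regions and countries in the order they first appear, so the input does not need to be sorted. The same caveat about production use applies.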

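      【Comments】:

      • I feel like there should be a badge for encouraging bad behavior?
      • Embedded systems (for example, those built with Yocto or DD-WRT) often have only BusyBox available, which includes a surprisingly functional Bash-like implementation but invariably lacks a package manager or native compiler. 100% Bash FTW!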