【Question title】: Convert csv to dictionary
【Posted】: 2017-11-08 08:42:15
【Question description】:

I have the following csv file (20000 rows in total):

ozone,paricullate_matter,carbon_monoxide,sulfure_dioxide,nitrogen_dioxide,longitude,latitude,timestamp,avg_measured_time,avg_speed,median_measured_time,timestamp:1,vehicle_count,lat1,long1,lat2,long2,distance_between_2_points,duration_of_measurement,ndt_in_kmh
99,99,98,116,118,10.09351660921,56.1671665604395,1407575099.99998,0,0,0,1407575099.99998,0,56.1089513576227,10.1823955595246,56.1048021343541,10.1988040846558,1124,65,62
99,99,98,116,118,10.09351660921,56.1671665604395,1407575099.99998,0,0,0,1407575099.99998,0,56.10986429895,10.1627288048935,56.1089513576227,10.1823955595246,1254,71,64
99,99,98,116,118,10.09351660921,56.1671665604395,1407575099.99998,0,0,0,1407575099.99998,0,56.1425188527673,10.1868802625656,56.1417522836526,10.1927236478157,521,62,30
99,99,98,116,118,10.09351660921,56.1671665604395,1407575099.99998,18,84,18,1407575099.99998,1,56.1395320665735,10.1772034087371,56.1384485157567,10.1791506011887,422,50,30

I want to convert it into a dictionary like this:
{'ozone': [99,99,99,99], 'paricullate_matter': [99,99,99,99],'carbon_monoxide': [98,98,98,98],etc....}

What I have tried:

import csv

reader = csv.DictReader(open('resulttable.csv'))

output = open("finalprojdata.py","w")

result = {}
for row in reader:
    for column, value in row.iteritems():
        result.setdefault(column, []).append(float(value))
output.write(str(result))

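The output I get contains only a few of the dictionary entries. It starts like this:

{'vehicle_count': [0,0,0,1], 'lat1': etc}

The whole csv file is not converted into the dictionary.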

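【Comments】:

  • Do you need to save it as JSON or something?
  • I want to save it in finalprojdata.py
  • What for? Why not just print it to the screen or pass it to another function?
  • If you want to output Python values, you are better off calling repr rather than str on the result.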
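
Following up on the comments, here is a minimal Python 3 sketch of the script from the question. It replaces iteritems() with items(), writes repr(result), and opens both files in with blocks so that the output is flushed and closed. An unflushed output file is one possible reason for truncated output, but the question does not confirm that this was the cause. The function names csv_to_columns and save_as_python_literal are hypothetical, chosen only for this illustration.

```python
import csv

def csv_to_columns(path):
    """Read a CSV file and return {column name: [float values]}."""
    result = {}
    # newline='' is what the csv module documentation recommends for reader input
    with open(path, newline='') as in_file:
        for row in csv.DictReader(in_file):
            for column, value in row.items():  # Python 3 spelling of iteritems()
                result.setdefault(column, []).append(float(value))
    return result

def save_as_python_literal(result, path):
    """Write the dictionary as a Python literal; the with block closes the file."""
    with open(path, 'w') as out_file:
        out_file.write(repr(result))
```

With the file names from the question, the call would be save_as_python_literal(csv_to_columns('resulttable.csv'), 'finalprojdata.py'). The written file holds a bare dictionary literal, so it can be read back with ast.literal_eval rather than imported.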

Tags: python csv dictionary awk


【Solution 1】:

If you have pandas, this is very simple:

import pandas as pd
data = pd.read_csv("data.csv")
data_dict = {col: list(data[col]) for col in data.columns}

【Comments】:

  • Or even data_dict = data.to_dict(orient='list')

【Solution 2】:

This should do what you want:

import csv

def int_or_float(strg):
    val = float(strg)
    return int(val) if val.is_integer() else val

with open('test.csv') as in_file:
    it = zip(*csv.reader(in_file))
    dct = {el[0]: [int_or_float(val) for val in el[1:]] for el in it}

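zip(*csv.reader(in_file)) simply transposes your data, turning the rows into columns; the dictionary comprehension then builds the new dictionary, using the first element of each column (the header) as the key and the remaining elements as the values.

dct now contains the dictionary you want.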
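
To make the transposition step concrete, here is a small sketch on in-memory data. The three-column sample is a hypothetical stand-in for the real file, and it converts every value with float for brevity instead of using int_or_float.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real file.
sample = io.StringIO("ozone,avg_speed,lat1\n99,0,56.10\n99,84,56.13\n")

rows = list(csv.reader(sample))   # one list per CSV line, header first
columns = list(zip(*rows))        # transpose: one tuple per CSV column

# Each tuple starts with the header, followed by that column's raw strings.
dct = {col[0]: [float(v) for v in col[1:]] for col in columns}
```

Note that zip stops at the shortest row, so a file with ragged rows would silently lose the trailing columns.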

【Comments】:

【Solution 3】:

An awk version:

awk -F',' '
   NR==1 {for (i=1; i<=NF; i++) D[i] = sprintf("\"%s\" : [", $i); n = NF; next}
         {for (i=1; i<=NF; i++) D[i] = D[i] sprintf("%s %s", (NR>2 ? "," : ""), $i)}
   END   {
      printf("{ ")
      for (i=1; i<=n; i++) printf("%s%s ]", (i>1 ? ", " : ""), D[i])
      printf(" }\n")
      }
   ' YourFile > final.py

Quick and dirty, with no memory optimization (20000 rows is not much for the memory of a modern machine).

【Comments】:

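【Solution 4】:

from collections import defaultdict
import csv

columns = defaultdict(list)
with open('test.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for k, v in row.items():
            columns[k].append(v)  # values stay strings; csv does no type conversion

print(columns)

# Output (as printed under Python 2, where dictionary key order is arbitrary)
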
      defaultdict(<type 'list'>, {'vehicle_count': ['0', '0', '0', '1'], 'lat1': ['56.1089513576227', '56.10986429895', '56.1425188527673', '56.1395320665735'], 'lat2': ['56.1048021343541', '56.1089513576227', '56.1417522836526', '56.1384485157567'], 'paricullate_matter': ['99', '99', '99', '99'], 'timestamp': ['1407575099.99998', '1407575099.99998', '1407575099.99998', '1407575099.99998'], 'long1': ['10.1823955595246', '10.1627288048935', '10.1868802625656', '10.1772034087371'], 'longitude': ['10.09351660921', '10.09351660921', '10.09351660921', '10.09351660921'], 'nitrogen_dioxide': ['118', '118', '118', '118'], 'ozone': ['99', '99', '99', '99'], 'latitude': ['56.1671665604395', '56.1671665604395', '56.1671665604395', '56.1671665604395'], 'timestamp:1': ['1407575099.99998', '1407575099.99998', '1407575099.99998', '1407575099.99998'], 'distance_between_2_points': ['1124', '1254', '521', '422'], 'long2': ['10.1988040846558', '10.1823955595246', '10.1927236478157', '10.1791506011887'], 'avg_measured_time': ['0', '0', '0', '18'], 'carbon_monoxide': ['98', '98', '98', '98'], 'ndt_in_kmh': ['62', '64', '30', '30'], 'avg_speed': ['0', '0', '0', '84'], 'sulfure_dioxide': ['116', '116', '116', '116'], 'duration_of_measurement': ['65', '71', '62', '50'], 'median_measured_time': ['0', '0', '0', '18']})
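
The values in the output above are still strings, because csv.DictReader does not convert types. Below is a sketch of one way to convert them while reading. The helper names to_number and read_columns and the two-column sample are hypothetical, and the code assumes every row is complete (a short row would yield None for the missing fields, which this sketch does not handle).

```python
import csv
import io
from collections import defaultdict

def to_number(text):
    """Return an int or float when text parses as one, else the text unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def read_columns(csvfile):
    """Collect each CSV column into a list, converting numeric strings."""
    columns = defaultdict(list)
    for row in csv.DictReader(csvfile):
        for key, value in row.items():
            columns[key].append(to_number(value))
    return dict(columns)  # plain dict, so printing it shows no defaultdict wrapper

# Hypothetical two-column sample standing in for test.csv.
sample = io.StringIO("vehicle_count,lat1\n0,56.1089513576227\n1,56.1395320665735\n")
columns = read_columns(sample)
```

Trying int before float keeps counts such as vehicle_count as integers while coordinates and timestamps become floats.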
      

【Comments】:

【Solution 5】:

A pyexcel version:

import pyexcel as p

p.get_dict(file_name='test.csv')

【Comments】:

【Solution 6】:
          $ cat tst.awk
          BEGIN { FS=OFS=","; ORS="}\n" }
          NR==1 {split($0,hdr); next }
          {
              for (i=1; i<=NF; i++) {
                  vals[i] = (i in vals ? vals[i] "," : "") $i
              }
          }
          END {
              printf "{"
              for (i=1; i<=NF; i++) {
                  printf "\047%s\047: [%s]%s", hdr[i], vals[i], (i<NF?OFS:ORS)
              }
          }
          
          $ awk -f tst.awk file
          {'ozone': [99,99,99,99],'paricullate_matter': [99,99,99,99],'carbon_monoxide': [98,98,98,98],'sulfure_dioxide': [116,116,116,116],'nitrogen_dioxide': [118,118,118,118],'longitude': [10.09351660921,10.09351660921,10.09351660921,10.09351660921],'latitude': [56.1671665604395,56.1671665604395,56.1671665604395,56.1671665604395],'timestamp': [1407575099.99998,1407575099.99998,1407575099.99998,1407575099.99998],'avg_measured_time': [0,0,0,18],'avg_speed': [0,0,0,84],'median_measured_time': [0,0,0,18],'timestamp:1': [1407575099.99998,1407575099.99998,1407575099.99998,1407575099.99998],'vehicle_count': [0,0,0,1],'lat1': [56.1089513576227,56.10986429895,56.1425188527673,56.1395320665735],'long1': [10.1823955595246,10.1627288048935,10.1868802625656,10.1772034087371],'lat2': [56.1048021343541,56.1089513576227,56.1417522836526,56.1384485157567],'long2': [10.1988040846558,10.1823955595246,10.1927236478157,10.1791506011887],'distance_between_2_points': [1124,1254,521,422],'duration_of_measurement': [65,71,62,50],'ndt_in_kmh': [62,64,30,30]}
          

【Comments】:

• @Mukund He tagged the question with awk, which means he is asking for an awk program. Just because his failed attempt is a Python script does not mean that is all he can use.