【Question title】: Rearranging unique groups in csv
【Posted】: 2020-09-27 12:45:21
【Question】:

I have a huge dataset arranged like this. Each ID is followed by its own group of data lines.

0
0 0
NUMBER        22 ADD_FLD    5  15 &11111
ID  382 START_TIME 2001052306
POINT  63
2010052306 119.464119 15.870264 1.682708e+00 & 1.213053
2010052312 119.910667 15.874892 1.934127e+00 & 1.221175 
2010052318 120.368523 16.022879 2.260490e+00 & 1.227459
2010052400 120.611115 15.788021 2.787007e+00 & 1.229084
2010052406 121.286072 15.984570 3.253321e+00 & 1.230381

ID  413 START_TIME 2010061006
POINT  40
2010061006 156.424057 5.559299 1.059667e+00 & 1.578506 
2010061012 153.899506 6.450210 1.150635e+00 & 1.516614 
2010061018 152.346802 7.281753 1.187466e+00 & 1.501871

What I want to do is rearrange it like this.

ID   YR     MONTH   DAY   HR  LON         LAT        RESULT1        RESULT2
382  2010   05      23    06  119.464119  15.870264  1.682708e+00   1.213053
382  2010   05      23    12  119.910667  15.874892  1.934127e+00   1.221175 
382  2010   05      23    18  120.368523  16.022879  2.260490e+00   1.227459
382  2010   05      24    00  120.611115  15.788021  2.787007e+00   1.229084
382  2010   05      24    06  121.286072  15.984570  3.253321e+00   1.230381
413  2010   06      10    06  156.424057  5.559299   1.059667e+00   1.578506 
413  2010   06      10    12  153.899506  6.450210   1.150635e+00   1.516614 
413  2010   06      10    18  152.346802  7.281753   1.187466e+00   1.501871

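The ID column holds the unique ID assigned to each group. YR, MONTH, DAY, and HR are taken from the first column of the input.

Any help would be greatly appreciated. Thanks.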
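As a minimal sketch of that mapping (the helper name `split_timestamp` is only illustrative, and it assumes the first column is always a 10-digit `YYYYMMDDHH` string), the first field of a data line can be sliced by position and prefixed with the group's ID:

```python
def split_timestamp(stamp):
    """Split a 10-digit YYYYMMDDHH string into (YR, MONTH, DAY, HR).

    Hypothetical helper for illustration; assumes the fixed-width layout
    seen in the sample data.
    """
    return stamp[:4], stamp[4:6], stamp[6:8], stamp[8:10]


# One data line from the group that follows "ID  382"
line = "2010052306 119.464119 15.870264 1.682708e+00 & 1.213053"

# Split on whitespace and drop the '&' separator
fields = [p for p in line.split() if p != "&"]

# Prefix the group ID and expand the timestamp into its four parts
row = ["382", *split_timestamp(fields[0]), *fields[1:]]
print(row)
```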

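【Comments】:

  • Could you elaborate? It is not obvious how you arrive at the columns in the output. If possible, a simpler minimal reproducible example would help.

Tags: python python-3.x pandas csv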


【Solution 1】:

This took me a while; I hope it helps :)

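import pandas as pd

# Read the whole file (works the same for 'your_file.csv')
with open('your_file.txt') as f:
    lines = f.read().split('\n')

# Skip the three header lines, then drop blank lines and 'POINT' lines
lines = [line for line in lines[3:] if line.strip() and not line.startswith('POINT')]

# Group the data lines under the ID line that precedes them
groups = {}
current_key = None
for line in lines:
    if line.startswith('ID'):
        current_key = line.split()[1]
        groups[current_key] = []
    else:
        groups[current_key].append(line)

# Split each data line into fields and break the timestamp into YR/MONTH/DAY/HR
rows = []
for key, data_lines in groups.items():
    for line in data_lines:
        s = [p for p in line.split() if p != '&']
        rows.append([key, s[0][:4], s[0][4:6], s[0][6:8], s[0][8:]] + s[1:])

columns=['ID','YR','MONTH','DAY','HR','LON','LAT','RESULT1','RESULT2']

result=pd.DataFrame(rows, columns=columns)

print(result)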

Output:

    ID    YR MONTH DAY  HR         LON        LAT       RESULT1   RESULT2
0  382  2010    05  23  06  119.464119  15.870264  1.682708e+00  1.213053
1  382  2010    05  23  12  119.910667  15.874892  1.934127e+00  1.221175
2  382  2010    05  23  18  120.368523  16.022879  2.260490e+00  1.227459
3  382  2010    05  24  00  120.611115  15.788021  2.787007e+00  1.229084
4  382  2010    05  24  06  121.286072  15.984570  3.253321e+00  1.230381
5  413  2010    06  10  06  156.424057   5.559299  1.059667e+00  1.578506
6  413  2010    06  10  12  153.899506   6.450210  1.150635e+00  1.516614
7  413  2010    06  10  18  152.346802   7.281753  1.187466e+00  1.501871
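
Note that every column in this frame holds strings, which is why `05` keeps its leading zero. If numeric values are needed later, one possible follow-up is sketched below. The small `result` frame is rebuilt by hand from two of the rows above only to keep the snippet self-contained, and the choice of which columns to convert is an assumption about how the data will be used.

```python
import pandas as pd

columns = ['ID', 'YR', 'MONTH', 'DAY', 'HR', 'LON', 'LAT', 'RESULT1', 'RESULT2']
result = pd.DataFrame(
    [['382', '2010', '05', '23', '06', '119.464119', '15.870264', '1.682708e+00', '1.213053'],
     ['413', '2010', '06', '10', '06', '156.424057', '5.559299', '1.059667e+00', '1.578506']],
    columns=columns,
)

# Convert the ID and measurement columns; the date parts stay as strings
result = result.astype({
    'ID': int,
    'LON': float,
    'LAT': float,
    'RESULT1': float,
    'RESULT2': float,
})

print(result.dtypes)
```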

【Comments】:

    【Solution 2】:

    Not pretty, but it works as long as your file keeps the same structure:

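    import pandas as pd

    with open('yourfile.txt') as f:
        current_id = None  # avoids shadowing the built-in id()
        d = []
        for line in f:

            if line.startswith('ID'):
                # Take the ID by field rather than by fixed character positions
                current_id = line.split()[1]
                next(f, None)  # skip the 'POINT' line that follows each 'ID' line
            elif current_id and line.strip():
                line = line.strip().replace(' & ',' ')
                # Slice the 10-digit timestamp into YR, MONTH, DAY, HR
                d.append(f'{current_id} {line[:4]} {line[4:6]} {line[6:8]} {line[8:10]} {line[10:]}'.split())

    df = pd.DataFrame(d)
    df.columns = ['ID','YR','MONTH','DAY','HR','LON','LAT','RESULT1','RESULT2']
    print(df)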
    

    Output:

        ID    YR MONTH DAY  HR         LON        LAT       RESULT1   RESULT2
    0  382  2010    05  23  06  119.464119  15.870264  1.682708e+00  1.213053
    1  382  2010    05  23  12  119.910667  15.874892  1.934127e+00  1.221175
    2  382  2010    05  23  18  120.368523  16.022879  2.260490e+00  1.227459
    3  382  2010    05  24  00  120.611115  15.788021  2.787007e+00  1.229084
    4  382  2010    05  24  06  121.286072  15.984570  3.253321e+00  1.230381
    5  413  2010    06  10  06  156.424057   5.559299  1.059667e+00  1.578506
    6  413  2010    06  10  12  153.899506   6.450210  1.150635e+00  1.516614
    7  413  2010    06  10  18  152.346802   7.281753  1.187466e+00  1.501871
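
If a single timestamp column is more convenient than four separate date parts, the 10-digit first field can also be parsed directly. This is only a sketch: the column names `STAMP` and `TIME` and the tiny hand-built frame are assumptions for illustration, and it relies on the stamps being in `YYYYMMDDHH` form.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['382', '382', '413'],
    'STAMP': ['2010052306', '2010052312', '2010061006'],
})

# Parse the raw stamp with an explicit format, then derive parts as needed
df['TIME'] = pd.to_datetime(df['STAMP'], format='%Y%m%d%H')
df['YR'] = df['TIME'].dt.year
df['HR'] = df['TIME'].dt.hour

print(df)
```

The derived parts are integers here, so the leading zeros shown in the requested output would need string formatting if they matter.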
    

    【Comments】:

      【Solution 3】:
      import csv

      output_rows = []
      recent_id = ''
      with open("input.csv") as input_file:
          for line in input_file:
              csv_row = line.split() # returns row as list

              # Skip blank lines and the 'POINT' / 'NUMBER' lines
              if not csv_row or 'POINT' in csv_row or 'NUMBER' in csv_row:
                  continue
              elif csv_row[0] == 'ID':
                  # Remember the ID for the data rows that follow
                  recent_id = csv_row[1]
              elif recent_id and len(csv_row) == 6:
                  # Data row: timestamp, LON, LAT, RESULT1, '&', RESULT2
                  timestamp = csv_row[0]
                  row = {}
                  row['ID'] = recent_id
                  row['YR'] = timestamp[0:4]
                  row['MONTH'] = timestamp[4:6]
                  row['DAY'] = timestamp[6:8]
                  row['HR'] = timestamp[8:10]
                  row['LON'] = csv_row[1]
                  row['LAT'] = csv_row[2]
                  row['RESULT1'] = csv_row[3]
                  row['RESULT2'] = csv_row[5]
                  output_rows.append(row)

      if len(output_rows) > 0:
          keys = list(output_rows[0].keys())
          with open('output.csv', 'w', newline='') as output_file:
              dict_writer = csv.DictWriter(output_file, keys)
              dict_writer.writeheader()
              dict_writer.writerows(output_rows)
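
One thing to keep in mind when checking a file written this way: if it is read back with pandas, zero-padded fields such as `05` are parsed as integers by default. A hedged sketch of reading it back as text follows; a temporary file with two hand-copied rows stands in for `output.csv`, and only the date columns are included to keep it short.

```python
import csv
import os
import tempfile

import pandas as pd

rows = [
    {'ID': '382', 'YR': '2010', 'MONTH': '05', 'DAY': '23', 'HR': '06'},
    {'ID': '413', 'YR': '2010', 'MONTH': '06', 'DAY': '10', 'HR': '06'},
]

# Write a small stand-in for output.csv into a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
with open(path, 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

as_numbers = pd.read_csv(path)          # default parsing: '05' becomes 5
as_text = pd.read_csv(path, dtype=str)  # keeps '05' exactly as written

print(as_numbers['MONTH'].tolist())
print(as_text['MONTH'].tolist())
```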
      

      【Comments】:
