Rearranging unique groups in csv: answers

【Question title】: Rearranging unique groups in csv
【Posted】: 2020-09-27 12:45:21
【Question description】:

I have a huge dataset arranged like this. Each ID corresponds to one unique group.

0
0 0
NUMBER        22 ADD_FLD    5  15 &11111
ID  382 START_TIME 2001052306
POINT  63
2010052306 119.464119 15.870264 1.682708e+00 & 1.213053
2010052312 119.910667 15.874892 1.934127e+00 & 1.221175 
2010052318 120.368523 16.022879 2.260490e+00 & 1.227459
2010052400 120.611115 15.788021 2.787007e+00 & 1.229084
2010052406 121.286072 15.984570 3.253321e+00 & 1.230381

ID  413 START_TIME 2010061006
POINT  40
2010061006 156.424057 5.559299 1.059667e+00 & 1.578506 
2010061012 153.899506 6.450210 1.150635e+00 & 1.516614 
2010061018 152.346802 7.281753 1.187466e+00 & 1.501871

What I want to do is rearrange it like this:

ID   YR     MONTH   DAY   HR  LON         LAT        RESULT1        RESULT2
382  2010   05      23    06  119.464119  15.870264  1.682708e+00   1.213053
382  2010   05      23    12  119.910667  15.874892  1.934127e+00   1.221175 
382  2010   05      23    18  120.368523  16.022879  2.260490e+00   1.227459
382  2010   05      24    00  120.611115  15.788021  2.787007e+00   1.229084
382  2010   05      24    06  121.286072  15.984570  3.253321e+00   1.230381
413  2010   06      10    06  156.424057  5.559299   1.059667e+00   1.578506 
413  2010   06      10    12  153.899506  6.450210   1.150635e+00   1.516614 
413  2010   06      10    18  152.346802  7.281753   1.187466e+00   1.501871

The ID column comes from the unique ID assigned to each group. YR, MONTH, DAY, and HR come from the first column of the input.
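
The mapping described above can be sketched as follows. This is only an illustration, assuming the first field is always a 10-digit YYYYMMDDHH string and that `&` always appears as its own token; `split_timestamp` is a hypothetical helper name, not something from the question.

```python
def split_timestamp(stamp):
    """Split a 10-digit YYYYMMDDHH string into (YR, MONTH, DAY, HR) strings."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"expected 10 digits, got {stamp!r}")
    return stamp[:4], stamp[4:6], stamp[6:8], stamp[8:10]


# First data row of the ID 382 group from the sample above
fields = "2010052306 119.464119 15.870264 1.682708e+00 & 1.213053".split()
fields.remove("&")  # drop the separator token

# Prepend the group ID, then the four date parts, then the remaining values
row = ["382", *split_timestamp(fields[0]), *fields[1:]]
print(row)
```

The date parts stay as strings here so that leading zeros such as `05` are preserved, matching the desired output.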

Any help would be greatly appreciated. Thanks.

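【Question comments】:

Could you elaborate? It is not obvious how you arrive at the output columns. If possible, a simpler minimal reproducible example would help.

Tags: python python-3.x pandas csv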

【Solution 1】:

This took me a while; I hope it helps :)

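import pandas as pd

# Read the whole file (a .csv path works the same way)
with open('your_file.txt') as f:
    lines = f.read().split('\n')

lines = lines[3:]  # skip the three file-header lines
# Drop blank lines and the POINT lines
lines = [ln for ln in lines if ln.strip() and not ln.startswith('POINT')]

# Collect the data lines under the ID of the group they belong to
groups = {}
current_key = None
for ln in lines:
    if ln.startswith('ID'):
        current_key = ln.split()[1]  # e.g. '382'
        groups[current_key] = []
    else:
        groups[current_key].append(ln)

rows = []
for key, data_lines in groups.items():
    for ln in data_lines:
        # Split on whitespace and drop the '&' separator
        fields = [p for p in ln.split() if p != '&']
        stamp = fields[0]  # YYYYMMDDHH
        rows.append([key, stamp[:4], stamp[4:6], stamp[6:8], stamp[8:]] + fields[1:])

columns = ['ID', 'YR', 'MONTH', 'DAY', 'HR', 'LON', 'LAT', 'RESULT1', 'RESULT2']

result = pd.DataFrame(rows, columns=columns)

print(result)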

Output:

    ID    YR MONTH DAY  HR         LON        LAT       RESULT1   RESULT2
0  382  2010    05  23  06  119.464119  15.870264  1.682708e+00  1.213053
1  382  2010    05  23  12  119.910667  15.874892  1.934127e+00  1.221175
2  382  2010    05  23  18  120.368523  16.022879  2.260490e+00  1.227459
3  382  2010    05  24  00  120.611115  15.788021  2.787007e+00  1.229084
4  382  2010    05  24  06  121.286072  15.984570  3.253321e+00  1.230381
5  413  2010    06  10  06  156.424057   5.559299  1.059667e+00  1.578506
6  413  2010    06  10  12  153.899506   6.450210  1.150635e+00  1.516614
7  413  2010    06  10  18  152.346802   7.281753  1.187466e+00  1.501871

【Discussion】:

【Solution 2】:

Not pretty, but it works as long as your file keeps the same structure:

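import pandas as pd

with open('yourfile.txt') as f:
    current_id = None  # ID of the group being read; None until the first ID line
    d = []
    for line in f:

        if line.startswith('ID'):
            current_id = line.split()[1]
            next(f, None)  # skip the POINT line that follows each ID line
        elif current_id and line.strip():
            # Drop the '&' separator, then cut the leading YYYYMMDDHH stamp into parts
            line = line.strip().replace(' & ', ' ')
            d.append(f'{current_id} {line[:4]} {line[4:6]} {line[6:8]} {line[8:10]} {line[10:]}'.split())

df = pd.DataFrame(d)
df.columns = ['ID', 'YR', 'MONTH', 'DAY', 'HR', 'LON', 'LAT', 'RESULT1', 'RESULT2']
print(df)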

Output:

    ID    YR MONTH DAY  HR         LON        LAT       RESULT1   RESULT2
0  382  2010    05  23  06  119.464119  15.870264  1.682708e+00  1.213053
1  382  2010    05  23  12  119.910667  15.874892  1.934127e+00  1.221175
2  382  2010    05  23  18  120.368523  16.022879  2.260490e+00  1.227459
3  382  2010    05  24  00  120.611115  15.788021  2.787007e+00  1.229084
4  382  2010    05  24  06  121.286072  15.984570  3.253321e+00  1.230381
5  413  2010    06  10  06  156.424057   5.559299  1.059667e+00  1.578506
6  413  2010    06  10  12  153.899506   6.450210  1.150635e+00  1.516614
7  413  2010    06  10  18  152.346802   7.281753  1.187466e+00  1.501871
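
For trying the same line-by-line idea without a file or pandas, a minimal sketch could wrap it in a generator that accepts any iterable of lines. `parse_groups` is a hypothetical name, and the sketch assumes every data line has exactly six whitespace-separated tokens with `&` as the fifth.

```python
def parse_groups(lines):
    """Yield [ID, YR, MONTH, DAY, HR, LON, LAT, RESULT1, RESULT2] per data line."""
    current_id = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank separator between groups
        if tokens[0] == "ID":
            current_id = tokens[1]
        elif current_id is not None and len(tokens) == 6 and tokens[4] == "&":
            stamp = tokens[0]  # YYYYMMDDHH
            yield [current_id, stamp[:4], stamp[4:6], stamp[6:8], stamp[8:10],
                   tokens[1], tokens[2], tokens[3], tokens[5]]
        # anything else (file header, POINT lines) is ignored


# Shortened copy of the sample from the question
sample = """\
0
0 0
NUMBER        22 ADD_FLD    5  15 &11111
ID  382 START_TIME 2001052306
POINT  63
2010052306 119.464119 15.870264 1.682708e+00 & 1.213053

ID  413 START_TIME 2010061006
POINT  40
2010061006 156.424057 5.559299 1.059667e+00 & 1.578506
"""

rows = list(parse_groups(sample.splitlines()))
for row in rows:
    print(" ".join(row))
```

The resulting list of rows can be passed to `pd.DataFrame` with the same column names as above. Checking the token count and the `&` position, rather than relying on line order, is one way to keep stray header lines out of the result.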

【Discussion】:

【Solution 3】:

import csv

output_rows = []
recent_id = ''
with open("input.csv") as input_file:
    for line in input_file:
        csv_row = line.split()  # split the line on whitespace into a list

        # Skip blank lines and the POINT / NUMBER lines
        if not csv_row or csv_row[0] in ('POINT', 'NUMBER'):
            continue

        if csv_row[0] == 'ID':
            # Remember the ID for the data lines that follow
            recent_id = csv_row[1]
        elif recent_id and len(csv_row) == 6:
            # Data line: timestamp, LON, LAT, RESULT1, '&', RESULT2
            timestamp = csv_row[0]  # YYYYMMDDHH
            row = {}
            row['ID'] = recent_id
            row['YR'] = timestamp[0:4]
            row['MONTH'] = timestamp[4:6]
            row['DAY'] = timestamp[6:8]
            row['HR'] = timestamp[8:10]
            row['LON'] = csv_row[1]
            row['LAT'] = csv_row[2]
            row['RESULT1'] = csv_row[3]
            row['RESULT2'] = csv_row[5]
            output_rows.append(row)

if len(output_rows) > 0:
    keys = output_rows[0].keys()
    with open('output.csv', 'w', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(output_rows)
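
To see what the `csv.DictWriter` step produces without touching the disk, a small sketch can write to an in-memory buffer instead. The single row below is copied from the question's sample, and `buffer` and `text` are illustrative names only.

```python
import csv
import io

rows = [
    {"ID": "382", "YR": "2010", "MONTH": "05", "DAY": "23", "HR": "06",
     "LON": "119.464119", "LAT": "15.870264",
     "RESULT1": "1.682708e+00", "RESULT2": "1.213053"},
]

buffer = io.StringIO()  # stands in for the output file
# The header order follows the key order of the first row
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

text = buffer.getvalue()
print(text)
```

Because the values are written as strings, leading zeros such as `05` survive in the CSV output.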

【Discussion】: