【Question title】: How do I add header attributes to data rows?
【Posted】: 2018-09-01 21:12:15
【Question】:

I have weather data obtained from the igra2 weather dataset.

A typical section of the file looks like this:

#ICXUAE05424 1909 04 22 99 0935    6          erac-hud  657000      -180000
30 -9999  -9999   100 -9999 -9999 -9999   120    49
30 -9999  -9999   350 -9999 -9999 -9999   119   110
30 -9999  -9999   750-9999 -9999 -9999   149    97
30 -9999  -9999  1250-9999 -9999 -9999   136   123
30 -9999  -9999  1750-9999 -9999 -9999   104   121
30 -9999  -9999  2250-9999 -9999 -9999   117   171
#ICXUAE05424 1909 04 22 99 1820    3          erac-hud  657000      -180000
30 -9999  -9999   100 -9999 -9999 -9999   120    53
30 -9999  -9999   350A -9999 -9999 -9999   111    69
30 -9999  -9999   750B-9999 -9999 -9999   102    55
#ICXUAE05424 1909 04 23 99 0845    5          erac-hud  657000      -180000
30 -9999  -9999   100 -9999 -9999 -9999    31     9
30 -9999  -9999   350 -9999 -9999 -9999   102    62
30 -9999  -9999   750 -9999 -9999 -9999   103   132
30 -9999  -9999  1250 -9999 -9999 -9999    98   120
30 -9999  -9999  1750 -9999 -9999 -9999   101   100

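I need to preprocess the data by appending some (or all) of the header attributes to the data rows that follow each header, and then convert the result to a csv file. If not with Python Pandas, how could I achieve this with sed in linux bash? The output csv file should look like this: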

lvl12,etime,press,gph,temp,rh,dpdp,wdir,wspd,hour,lattitude,longitude
21,-9999,96900A,234,270A,742,-9999,-9999,-9999,12,316333,748667
20,-9999,95000,-9999,290A,484,-9999,-9999,-9999,12,316333,748667
20,-9999,88700,-9999,290A,454,-9999,-9999,-9999,12,316333,748667
10,-9999,85000,1384A,260A,446,-9999,-9999,-9999,12,316333,748667
10,-9999,70000,3055A,130A,506,-9999,-9999,-9999,12,316333,748667
20,-9999,58400,-9999,0A,690,-9999,-9999,-9999,12,316333,748667
20,-9999,55900,-9999,0A,312,-9999,-9999,-9999,12,316333,748667
10,-9999,50000,5772A,-65A-9999,-9999,320,850,,12,316333,748667

Other information about the dataset:

Lines prefixed with # are headers; the lines that follow them are data.

The header attributes are:

station code, year, month, day, etc

The seventh space-separated attribute is npv, which gives the number of data rows that follow.

The data columns are:

lvl12, etime, press, gph, temp, rh, dpdp, wdir, wspd

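【Comments】:

  • It would be better if your sample input and sample output actually matched, so that we could see how the data maps from one to the other.
  • In awk, check whether the first character is # and save the field values in variables. On the other lines, append those variables to the line by assigning to $10, $11 and $12, then print the line.
  • StackOverflow expects you to try to solve your own problem first. Please update your question to show what you have already tried, in a minimal reproducible example. For more information, see How to Ask, and take the tour :)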

Tags: python pandas csv sed dataset


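【Solution 1】:

You need to parse the file manually, line by line, keeping track of which header each data row belongs to.

I am assuming that data such as 750-9999 actually contains a space, i.e. 750 -9999? If that is not the case, you would need to use a fixed-width approach instead.

With space-separated fields, the parsing can be done with Python's CSV library as follows: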

import csv

columns = ["lvl12", "etime", "press", "gph", "temp", "rh", "dpdp", "wdir", "wspd", "hour", "lattitude", "longitude"]

with open('weather.txt', newline='') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
    csv_output = csv.writer(f_output)
    csv_output.writerow(columns)

    for row in csv_input:
        if not row:
            continue  # skip blank lines
        if row[0].startswith('#'):
            header = row  # remember the most recent header line
        else:
            # append the hour and the last two header fields to the data row
            csv_output.writerow(row + [header[5]] + header[-2:])
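
The question mentions appending some or all of the header attributes, and describes the seventh header field as the number of data rows that follow. A minimal sketch of how the loop above might be generalised is shown here. `attach_header` and its `keep` parameter are hypothetical names, not part of any library, and the sketch assumes that header tokens are separated by whitespace in the order shown in the sample.

```python
from typing import Iterable, Iterator, List


def attach_header(lines: Iterable[str], keep: List[int]) -> Iterator[List[str]]:
    """Yield each data row followed by the selected header fields.

    `keep` lists positions of header tokens to append, counted after
    splitting the header line on whitespace (0 is the station code).
    """
    header: List[str] = []
    expected = seen = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0].startswith('#'):
            # before switching headers, check the previous block was complete
            if seen != expected:
                raise ValueError(f"expected {expected} rows, got {seen}")
            header = fields
            # assumption: the seventh token is the row count (npv)
            expected, seen = int(fields[6]), 0
        else:
            if not header:
                raise ValueError("data row found before any header line")
            seen += 1
            yield fields + [header[i] for i in keep]
    if seen != expected:
        raise ValueError(f"expected {expected} rows, got {seen}")


sample = [
    "#ICXUAE05424 1909 04 23 99 0845    5          erac-hud  657000      -180000",
    "30 -9999  -9999   100 -9999 -9999 -9999    31     9",
    "30 -9999  -9999   350 -9999 -9999 -9999   102    62",
    "30 -9999  -9999   750 -9999 -9999 -9999   103   132",
    "30 -9999  -9999  1250 -9999 -9999 -9999    98   120",
    "30 -9999  -9999  1750 -9999 -9999 -9999   101   100",
]
# keep year, month, day, hour and the last two header fields
rows = list(attach_header(sample, keep=[1, 2, 3, 5, -2, -1]))
print(rows[0])
```

Each yielded list can be passed to `csv_output.writerow`, provided the column header row is extended to match the fields selected in `keep`.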

Alternatively, if you also want to use Pandas:

import pandas as pd
import csv

data = []

with open('weather.txt', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)

    for row in csv_input:
        if not row:
            continue  # skip blank lines
        if row[0].startswith('#'):
            header = row  # remember the most recent header line
        else:
            # append the hour and the last two header fields to the data row
            data.append(row + [header[5]] + header[-2:])

columns = ["lvl12", "etime", "press", "gph", "temp", "rh", "dpdp", "wdir", "wspd", "hour", "lattitude", "longitude"]
df = pd.DataFrame(data, columns=columns)
print(df)

This gives you:

   lvl12  etime  press   gph   temp     rh   dpdp wdir wspd  hour lattitude longitude
0     30  -9999  -9999   100  -9999  -9999  -9999  120   49  0935    657000   -180000
1     30  -9999  -9999   350  -9999  -9999  -9999  119  110  0935    657000   -180000
2     30  -9999  -9999   750  -9999  -9999  -9999  149   97  0935    657000   -180000
3     30  -9999  -9999  1250  -9999  -9999  -9999  136  123  0935    657000   -180000
4     30  -9999  -9999  1750  -9999  -9999  -9999  104  121  0935    657000   -180000
..   ...    ...    ...   ...    ...    ...    ...  ...  ...   ...       ...       ...
9     30  -9999  -9999   100  -9999  -9999  -9999   31    9  0845    657000   -180000
10    30  -9999  -9999   350  -9999  -9999  -9999  102   62  0845    657000   -180000
11    30  -9999  -9999   750  -9999  -9999  -9999  103  132  0845    657000   -180000
12    30  -9999  -9999  1250  -9999  -9999  -9999   98  120  0845    657000   -180000
13    30  -9999  -9999  1750  -9999  -9999  -9999  101  100  0845    657000   -180000

[14 rows x 12 columns]

Tested with Python 3.x. If you are using Python 2.x, change the open() line as follows:

with open('weather.txt', 'rb') as f_input, open('output.csv', 'wb') as f_output:
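
If the assumption about spaces does not hold (for example 750B-9999 in the sample, where a flag character sits directly against the next value), the data lines could be cut at fixed character positions instead. The following is only a sketch: `DATA_SPANS` and `split_fixed_width` are hypothetical names, and the offsets are an assumption that happens to fit the sample lines shown in the question, so they should be checked against the dataset's format documentation.

```python
from typing import List, Tuple

# Assumed layout: (name, start, end) with 0-based, end-exclusive offsets.
# The flag character that may follow press, gph and temp is kept attached
# to its value, as in the question's desired output (e.g. 1384A).
# These offsets are an assumption; verify them before relying on them.
DATA_SPANS: List[Tuple[str, int, int]] = [
    ("lvl12", 0, 2),
    ("etime", 3, 8),
    ("press", 9, 16),
    ("gph", 16, 22),
    ("temp", 22, 28),
    ("rh", 28, 33),
    ("dpdp", 34, 39),
    ("wdir", 40, 45),
    ("wspd", 46, 51),
]


def split_fixed_width(line: str) -> List[str]:
    """Cut a data line at fixed character offsets and strip the padding."""
    return [line[start:end].strip() for _, start, end in DATA_SPANS]


# a data line from the question with no space between gph and temp
line = "30 -9999  -9999   750B-9999 -9999 -9999   102    55"
fields = split_fixed_width(line)
print(fields)
```

In the loops above, the data branch would then read raw lines from the file and call `split_fixed_width` on each one rather than relying on the csv reader's space splitting; header lines can still be split on whitespace.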

【Comments】:
