Pandas column mathematical operations: no error, no answer

【Question Title】: Pandas Column mathematical operations No error no answer
【Posted】: 2015-09-19 01:13:52
【Question】:

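I am trying to perform some simple mathematical operations on CSV files.

The columns in file_1.csv (master_ids.csv below) are dynamic: the number of columns grows over time, so we cannot rely on a fixed last column.

master_ids.csv: before any preprocessing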

Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv: before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_Ids.csv: after one preprocessing run

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv: expected output (append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750

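For example: Ids 1234 appears 2 times, so its value at the current time (00:30:00), which is 500, is divided by the number of occurrences of that Ids and then added to the corresponding ref1 value. The result goes into a new column named after the current time.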

master_Ids.csv: after another preprocessing run

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv: expected output after another run (merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600

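So here the current time is 00:45:00. We divide the current time value by the count of occurrences of each Ids, then add the quotient to the corresponding ref1 value, creating a new column named after the new current time.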
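The rule described above can be checked by hand for Ids 1234. The following is a minimal sketch that uses only the sample values from the question; the variable names are illustrative and are not part of the original program.

```python
# Worked arithmetic for Ids 1234, using the sample values from the question.
occurrences = 2  # Ids 1234 appears twice in master_count.csv

# ref1 values of the two rows that share Ids 1234
ref1 = {"London": 500, "Prague": 100}

# values for Ids 1234 in master_ids.csv, keyed by time column
master_ids_values = {"00:30:00": 500, "00:45:00": 100}

# new column value = ref1 + (master_ids value / number of occurrences)
new_columns = {
    time: {name: base + value / occurrences for name, base in ref1.items()}
    for time, value in master_ids_values.items()
}

for time, values in new_columns.items():
    print(time, values)
```

For these two rows the computed values agree with the expected output listed in the question (750 and 350 at 00:30:00, 550 and 150 at 00:45:00).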

Program: by Jianxun Li

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by number of occurence of each Ids 
# and add column any time series
def my_func(group):
    num_obs = len(group)
    # process with column name after next timeseries (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program runs with no error and no output. I need some suggestions for a fix.

【Question Comments】:

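I would say you should consider restructuring your data. Instead of adding a new column for each "preprocessing" step, give your data a fixed number of columns, one of which holds the time information you are currently using as new column headers. That is, a current_time column, with a bunch of rows that have "00:30:00" in that column, then a bunch of rows that have "00:45:00" in that column, and so on.
@BrenBarn I cannot restructure it, because I also need the old time series counts for future plotting purposes.
Not sure what you mean. The change I described would not lose any information; it is just a different format.
@BrenBarn Could you show it as a program with the output format, to clear up the confusion?
I added an update to my answer, please take a look. Also, can you check the expected output for the China row? I think I get the expected result for every row except that one.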

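Tags: python csv datetime pandas multiple-columns

【Solution 1】:

This program assumes that both master_counts.csv and master_ids.csv are updated over time, and it should be robust to the timing of those updates. That is, it should produce correct results if it is run multiple times on the same update or if an update is missed.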

import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:, :5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

# columns 0 and 1 are Ids and ref; every later column is a time column
for col in master_ids.columns[2:]:
    master_counts = master_counts.merge(master_ids[['Ids', col]], on='Ids')
    # number of rows that share each Ids value
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    # assign by column name so the result is stored as a float column
    master_counts[col] = master_counts['ref1'] + master_counts[col] / count

master_counts.to_csv('master_counts.csv', index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
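For readers who want to reproduce these numbers without files on disk, here is a minimal self-contained sketch of the same idea. It is a variant rather than the answer's exact code: it reads the question's sample data from in-memory strings (an assumption made for illustration), merges all time columns in one step, and assigns each result by column name.

```python
import io

import pandas as pd

# Sample data copied from the question; in-memory strings stand in for the CSV files.
counts_csv = """Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""
ids_csv = """Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""

master_counts = pd.read_csv(io.StringIO(counts_csv))
master_ids = pd.read_csv(io.StringIO(ids_csv))

# Everything after the first two columns (Ids, ref) is treated as a time column,
# so the code keeps working when more time columns are appended.
time_columns = list(master_ids.columns[2:])

# Number of rows sharing each Ids value, aligned row by row with master_counts.
occurrences = master_counts.groupby("Ids")["ref1"].transform("count")

# Ids is unique in master_ids, so a left merge keeps one output row per input row.
merged = master_counts.merge(master_ids[["Ids"] + time_columns], on="Ids", how="left")

for col in time_columns:
    merged[col] = merged["ref1"] + merged[col] / occurrences

print(merged.to_csv(index=False))
```

With the sample data, the China row comes out as 550.0 at 00:30:00 (300 + 500 / 2), which is the value shown in the output above rather than the 750 listed in the question's expected output.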

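【Comments】:

Could you provide clearer code so that it is easier to understand?
Please check the description in the edited part of the question.
The first program cannot be used, because master_ids.csv is not replaced; it just keeps being appended to. The updated version simply runs with no error and no output.
I am checking master_count.csv, but I do not see any updated output. The output needs to be appended to master_count.csv.
@SitzBlogz OK, I explicitly added the csv output. That was the easy part, which I assumed you already knew how to do.

【Solution 2】:

My suggestion would be to reformat your data so that it looks like this: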

Ids,ref0,current_time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None

Then, after your "first preprocessing", it would become this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500

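...and so on. The idea is that you should make a single column to hold the time information. Then, for each preprocessing run, insert the new data as new rows, and give those rows a value in the time column indicating which time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.

I haven't exactly followed how you are computing the new ref1 values, but the point is that doing it this way is likely to simplify your life considerably. In general, instead of adding an unbounded number of new columns, it is often nicer to add a single new column whose values are the ones you would otherwise have used as headers for the open-ended new columns.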
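As an illustration of the long layout described here, the following sketch converts the question's wide master_ids sample into one row per (Ids, time) pair and then pivots it back, to show that no information is lost. It is only a sketch under stated assumptions: the column names `time` and `value` and the in-memory CSV string are chosen for this example and do not come from the answer.

```python
import io

import pandas as pd

# Wide layout from the question: one new column per preprocessing run.
wide_csv = """Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""
wide = pd.read_csv(io.StringIO(wide_csv))

# Long layout: a fixed set of columns, with the old headers stored in "time".
long_format = wide.melt(id_vars=["Ids", "ref"], var_name="time", value_name="value")
print(long_format)

# The wide view can be rebuilt on demand, for example for plotting.
wide_again = long_format.pivot(index="Ids", columns="time", values="value")
print(wide_again)
```

In this layout each preprocessing run appends rows rather than columns, so code that reads the table does not need to know how many time periods exist.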

【Comments】:

【Solution 3】:

import pandas as pd
import numpy as np

csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

Out[53]: 
      00:00:00  00:30:00  00:45:00
Ids                               
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

Out[81]: 
         Name   lat   lon  00:00:00
Ids                                
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400

# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])



Out[55]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200

# do the division by number of occurence of each Ids 
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group



result = temp.groupby(level='Ids').apply(my_func)

Out[104]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
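The Out[...] blocks above come from an interactive session, which displays each value automatically. When the same statements run as a plain script, nothing is shown unless the result is printed or written out, which is one plausible reason for seeing "no error and no output". Below is a minimal sketch of the index-and-join approach with explicit output. It assumes the question's sample data held in in-memory strings, and it replaces the `apply` call with a grouped `transform`; both choices are made for this example and are not the answer's code.

```python
import io

import pandas as pd

ids_csv = """Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
"""
count_csv = """Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
"""

df1 = pd.read_csv(io.StringIO(ids_csv)).set_index("Ids")
df2 = pd.read_csv(io.StringIO(count_csv)).set_index("Ids")

# Every column of df1 after the first (ref) is a time column, however many exist.
time_cols = list(df1.columns[1:])
temp = df2.join(df1[time_cols])

# Number of rows per Ids, repeated on every row of that Ids.
group_size = temp.groupby(level="Ids")["ref1"].transform("count")

result = temp.copy()
for col in time_cols:
    result[col] = temp["ref1"] + temp[col] / group_size

# Explicit output: print to the console and write CSV text to a buffer.
print(result)
out_buffer = io.StringIO()  # stands in for a real output file
result.to_csv(out_buffer)
```

Selecting the time columns from `df1.columns` rather than by fixed position is one way to cope with the dynamic columns mentioned in the question.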

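【Comments】:

I see there is some confusion; please read my edited question, which should make things clearer.
@SitzBlogz I have revised the code. Let me know if this is what you want.
The answer looks like what I need, but the columns 00:30:00 and 00:45:00 are dynamic. File_1 may contain any timeseries. Can you change that part?
@SitzBlogz I have modified the my_func part to use .iloc instead of .loc.
Could you also try this one? stackoverflow.com/questions/31201986/…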