【Posted】: 2017-01-19 02:46:20
【Question】:
I would like to create a new DataFrame that contains only the rows of the symbols that occur most often.
My code is as follows:
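import pandas as pd

# Load the weights (FILE1) and the daily closing prices (FILE2)
f1 = pd.read_csv('FILE1.csv')
f2 = pd.read_csv('FILE2.csv')
# Left merge keeps every price row; symbols without a weight get NaN
df_all = f2.merge(f1, how='left', on='Symbol')
# Sort by symbol and date, then drop the rows that contain NaN
df_sort = df_all.sort_values(by=['Symbol', 'Date'], ascending=[True, True])
df_sort = df_sort.dropna()
# Count the rows per symbol
df_cnt = df_sort['Symbol'].value_counts()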
The raw data from the two files is merged into df_all:
In[1]: f1
Out[2]:
   Symbol  Weight
0     IBM     0.2
1      GE     0.3
2    AAPL     0.4
3     XOM     0.1
In[2]: f2
Out[3]:
        Date  Symbol  ClosingPrice
0   3/1/2010     IBM        116.51
1   3/2/2010     IBM        117.32
2   3/3/2010     IBM        116.40
3   3/4/2010     IBM        116.58
4   3/5/2010     IBM        117.61
5   3/1/2010      GE         45.00
6   3/2/2010      GE         43.50
7   3/3/2010      GE         46.00
8   3/1/2010    AAPL         85.07
9   3/2/2010    AAPL         85.10
10  3/3/2010    AAPL         86.20
11  3/4/2010    AAPL         84.93
12  3/5/2010    AAPL         84.80
13  3/1/2010     XOM         98.15
14  3/2/2010     XOM         99.00
15  3/3/2010     XOM         98.23
16  3/4/2010     XOM         97.56
17  3/1/2010    MSFT         99.00
18  3/2/2010    MSFT         98.00
19  3/3/2010    MSFT         97.00
20  3/4/2010    MSFT         98.00
21  3/5/2010    MSFT         97.00
In[4]: df_all
Out[4]:
        Date  Symbol  ClosingPrice  Weight
0   3/1/2010     IBM        116.51     0.2
1   3/2/2010     IBM        117.32     0.2
2   3/3/2010     IBM        116.40     0.2
3   3/4/2010     IBM        116.58     0.2
4   3/5/2010     IBM        117.61     0.2
5   3/1/2010      GE         45.00     0.3
6   3/2/2010      GE         43.50     0.3
7   3/3/2010      GE         46.00     0.3
8   3/1/2010    AAPL         85.07     0.4
9   3/2/2010    AAPL         85.10     0.4
10  3/3/2010    AAPL         86.20     0.4
11  3/4/2010    AAPL         84.93     0.4
12  3/5/2010    AAPL         84.80     0.4
13  3/1/2010     XOM         98.15     0.1
14  3/2/2010     XOM         99.00     0.1
15  3/3/2010     XOM         98.23     0.1
16  3/4/2010     XOM         97.56     0.1
17  3/1/2010    MSFT         99.00     NaN
18  3/2/2010    MSFT         98.00     NaN
19  3/3/2010    MSFT         97.00     NaN
20  3/4/2010    MSFT         98.00     NaN
21  3/5/2010    MSFT         97.00     NaN
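The merge above can be reproduced without the CSV files. The following is a minimal sketch that builds both frames inline from the values printed above; typing the data in by hand instead of calling `pd.read_csv` is an assumption made only so the example is self-contained, and `prices` and `rows` are illustrative helper names that do not appear in the original code.

```python
import pandas as pd

# Weights per symbol, as printed for f1 above (typed in by hand, not read from FILE1.csv).
f1 = pd.DataFrame({
    "Symbol": ["IBM", "GE", "AAPL", "XOM"],
    "Weight": [0.2, 0.3, 0.4, 0.1],
})

# Closing prices per symbol, as printed for f2 above (hypothetical helper dict).
prices = {
    "IBM": [116.51, 117.32, 116.40, 116.58, 117.61],
    "GE": [45.00, 43.50, 46.00],
    "AAPL": [85.07, 85.10, 86.20, 84.93, 84.80],
    "XOM": [98.15, 99.00, 98.23, 97.56],
    "MSFT": [99.00, 98.00, 97.00, 98.00, 97.00],
}

# One (Date, Symbol, ClosingPrice) row per price; dates run 3/1/2010, 3/2/2010, ...
rows = [
    (f"3/{day}/2010", symbol, price)
    for symbol, closes in prices.items()
    for day, price in enumerate(closes, start=1)
]
f2 = pd.DataFrame(rows, columns=["Date", "Symbol", "ClosingPrice"])

# Left merge keeps every f2 row; symbols missing from f1 (MSFT) get NaN in Weight.
df_all = f2.merge(f1, how="left", on="Symbol")
```

Because MSFT has no entry in f1, its five rows are the ones that the later `dropna()` call removes.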
I then sort the data and drop the rows that contain NaN values:
In[5]: df_sort
Out[5]:
        Date  Symbol  ClosingPrice  Weight
8   3/1/2010    AAPL         85.07     0.4
9   3/2/2010    AAPL         85.10     0.4
10  3/3/2010    AAPL         86.20     0.4
11  3/4/2010    AAPL         84.93     0.4
12  3/5/2010    AAPL         84.80     0.4
5   3/1/2010      GE         45.00     0.3
6   3/2/2010      GE         43.50     0.3
7   3/3/2010      GE         46.00     0.3
0   3/1/2010     IBM        116.51     0.2
1   3/2/2010     IBM        117.32     0.2
2   3/3/2010     IBM        116.40     0.2
3   3/4/2010     IBM        116.58     0.2
4   3/5/2010     IBM        117.61     0.2
13  3/1/2010     XOM         98.15     0.1
14  3/2/2010     XOM         99.00     0.1
15  3/3/2010     XOM         98.23     0.1
16  3/4/2010     XOM         97.56     0.1
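A caveat about this sort: unless the column is parsed, `read_csv` leaves `Date` as strings, and strings sort lexicographically, so a value such as `3/10/2010` would land before `3/2/2010`. The sample data happens to avoid this. Below is a minimal sketch of sorting on parsed dates instead, assuming the dates are month/day/year (the post does not say) and using a small hypothetical frame `demo` whose rows are not part of the original data.

```python
import pandas as pd

# Hypothetical three-row frame; 3/10/2010 does not occur in the original data.
demo = pd.DataFrame({
    "Date": ["3/10/2010", "3/2/2010", "3/1/2010"],
    "Symbol": ["IBM", "IBM", "IBM"],
})

# Sorting the raw strings compares character by character.
as_text = demo.sort_values(by=["Symbol", "Date"])["Date"].tolist()

# Parse first (month/day/year is an assumption), then sort chronologically.
demo["Date"] = pd.to_datetime(demo["Date"], format="%m/%d/%Y")
as_dates = demo.sort_values(by=["Symbol", "Date"])["Date"].dt.day.tolist()
```

Here `as_text` keeps `3/10/2010` in the middle, while `as_dates` comes out in calendar order.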
Then I determine the total number of occurrences of each symbol:
In[6]: df_cnt
Out[6]:
AAPL    5
IBM     5
XOM     4
GE      3
Name: Symbol, dtype: int64
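At this point I don't know how to create a new DataFrame, df_final, that contains only the rows of the symbols whose count equals the maximum count, which is 5 in this case.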
My final DataFrame should look like this:
    Date  Symbol  ClosingPrice  Weight
3/1/2010    AAPL         85.07     0.4
3/2/2010    AAPL         85.10     0.4
3/3/2010    AAPL         86.20     0.4
3/4/2010    AAPL         84.93     0.4
3/5/2010    AAPL         84.80     0.4
3/1/2010     IBM        116.51     0.2
3/2/2010     IBM        117.32     0.2
3/3/2010     IBM        116.40     0.2
3/4/2010     IBM        116.58     0.2
3/5/2010     IBM        117.61     0.2
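One possible way to get from the counts to df_final, offered as a sketch rather than the definitive approach: keep the symbols whose count equals the maximum, then select their rows with `isin`. The frame below is a hypothetical stand-in for df_sort that carries only the Symbol column with the same counts as above, so the block runs on its own; `top_symbols` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for df_sort: same per-symbol counts as above (5, 3, 5, 4).
df_sort = pd.DataFrame({
    "Symbol": ["AAPL"] * 5 + ["GE"] * 3 + ["IBM"] * 5 + ["XOM"] * 4,
})

# Rows per symbol, as in the original code.
df_cnt = df_sort["Symbol"].value_counts()

# Symbols whose count ties for the maximum.
top_symbols = df_cnt[df_cnt == df_cnt.max()].index

# Keep only the rows of those symbols; the row order of df_sort is preserved.
df_final = df_sort[df_sort["Symbol"].isin(top_symbols)]
```

Applied to the real df_sort, the same two lines would carry the Date, ClosingPrice and Weight columns through unchanged, and ties at the maximum (AAPL and IBM here) are all kept.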
【Discussion】:
Tags: python pandas dataframe slice