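【Posted】: 2018-03-25 20:01:35
【Question】:
I'm having trouble filtering out duplicate rows for the key ticker, based on which rows have the lowest values in certain columns (an int column and date columns).
The initial dataset looks like this: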
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
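Note that AA appears 4 times and ABMT appears 3 times. I want to filter out some rows based on two conditions. The first keeps, for each ticker, only the rows with the earliest date0, so the dataset would now look like this: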
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
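The first condition can be sketched as follows. This is a minimal illustration, not the asker's code: it uses a subset of the sample rows, and the names `df`, `earliest`, and `step1` are made up for the example. It assumes the dates are in month/day/two-digit-year form, as in the sample.

```python
import pandas as pd

# Subset of the sample data: one unique ticker plus the duplicated AA and ABMT rows.
df = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17",
                  "2/8/17", "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-81, -62, -131, -62, -131, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 36, 37, 38],
)

# Parse the strings as dates so "earliest" is a date comparison, not a string comparison.
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

# For every row, look up the earliest date0 within its ticker group.
earliest = df.groupby("ticker")["date0"].transform("min")

# Keep only the rows that fall on that earliest date.
step1 = df[df["date0"] == earliest]
print(step1)
```

Because `transform("min")` returns a Series aligned with the original rows, ties on the earliest date (rows 1 and 2 for AA, rows 36 and 37 for ABMT) all survive this step, which matches the table above.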
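The second condition drops the rows with the lowest diff value for each ticker (for AA, -131 is dropped and -62 is kept), which gives the final result. The fully filtered dataset would then look like this: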
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
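The second condition can be sketched in the same way. The frame below is a hand-written stand-in for the rows that survive the first condition, and `step1`, `keep`, and `final` are illustrative names. It assumes that "dropping the lowest diff" means keeping the single highest diff per ticker.

```python
import pandas as pd

# Stand-in for the rows left after the first condition (subset of the sample data).
step1 = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/14/17", "2/14/17"],
        "diff": [-81, -62, -131, -137, -139],
    },
    index=[0, 1, 2, 36, 37],
)

# For each ticker, find the index label of the row with the highest diff.
keep = step1.groupby("ticker")["diff"].idxmax()

# Select those rows and restore the original row order.
final = step1.loc[keep].sort_index()
print(final)
```

If two rows of one ticker tie on the highest diff, `idxmax` picks the first of them, so each ticker still ends up with exactly one row.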
Thanks for your help.
Edit:
After Wen's answer, I updated my code to the following:
import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)
Which prints:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 12/20/16 12/20/17 -81
1 1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 9 AAME ART 9/30/16 11/14/16 11/14/17 -45
10 36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
11 37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
12 38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
Then I added:
# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)
data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)
Which prints:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 2016-12-20 12/20/17 -81.0
1 1 AA ART 9/30/16 2016-12-01 12/1/17 -62.0
2 2 AA ART 9/30/16 2016-12-01 2/8/18 -131.0
3 3 AA ART 9/30/16 2017-02-08 12/1/17 -62.0
4 4 AA ART 9/30/16 2017-02-08 2/8/18 -131.0
5 5 AABA ART 9/30/16 2016-11-09 11/9/17 -40.0
6 6 AAC ART 9/30/16 2016-11-08 11/8/17 -39.0
7 7 AAL ART 9/30/16 2016-10-20 10/20/17 -20.0
8 8 AAMC ART 9/30/16 2016-11-07 11/7/17 -38.0
9 9 AAME ART 9/30/16 2016-11-14 11/14/17 -45.0
10 36 ABMT ART 9/30/16 2017-02-14 2/14/18 -137.0
11 37 ABMT ART 9/30/16 2017-02-14 2/16/18 -139.0
12 38 ABMT ART 9/30/16 2017-02-16 2/14/18 -137.0
Unfortunately, no luck so far.
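One likely reason the printed frame is unchanged: `sort_values`, `drop_duplicates`, and `sort_index` each return a new DataFrame rather than modifying `data` in place, so the chained result is discarded unless it is assigned. A minimal sketch of the same chain with the result kept, using a hand-built subset of the sample rows in place of the CSV file (the name `filtered` is illustrative):

```python
import pandas as pd

# Hand-built subset of the sample rows, standing in for input_transform.csv.
data = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17",
                  "2/8/17", "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-81, -62, -131, -62, -131, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 36, 37, 38],
)
data["date0"] = pd.to_datetime(data["date0"], format="%m/%d/%y")

# Sort so that, within each ticker, the row with the earliest date0 and the
# highest diff comes last, then keep that last row per ticker.
filtered = (
    data.sort_values(["date0", "diff"], ascending=[False, True])
        .drop_duplicates("ticker", keep="last")
        .sort_index()
)

print(filtered)   # one row per ticker
print(len(data))  # the original frame still has all of its rows
```

Separately, `Series.replace("'", "")` only replaces cells whose entire value is a single quote character; stripping a quote from inside a longer string would need `.str.replace` instead.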
【Comments】:
-
Should AA -131 be removed?
-
Yes, AA -131 (row 2); I'll edit it.
Tags: python pandas dataframe filter