【Question title】: How can I filter pandas columns or rows based on the values of another column?
【Posted】: 2018-03-25 20:01:35
【Question】:

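I'm having trouble filtering out duplicate rows, keyed on ticker, based on which row has the lowest value in a given column (intdates). The initial dataset looks like this: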

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16  12/20/16    12/20/17    -81
1   AA        ART      9/30/16   12/1/16     12/1/17    -62
2   AA        ART      9/30/16   12/1/16      2/8/18   -131
3   AA        ART      9/30/16    2/8/17     12/1/17    -62
4   AA        ART      9/30/16    2/8/17      2/8/18   -131
5   AABA      ART      9/30/16   11/9/16     11/9/17    -40
6   AAC       ART      9/30/16   11/8/16     11/8/17    -39
7   AAL       ART      9/30/16  10/20/16    10/20/17    -20
8   AAMC      ART      9/30/16   11/7/16     11/7/17    -38
9   AAME      ART      9/30/16  11/14/16    11/14/17    -45
36  ABMT      ART      9/30/16   2/14/17     2/14/18    -137
37  ABMT      ART      9/30/16   2/14/17     2/16/18    -139
38  ABMT      ART      9/30/16   2/16/17     2/14/18    -137

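Note that the AA value is repeated 4 times and the ABMT value 3 times. I want to filter out some rows based on two conditions. The first keeps, for each ticker, only the rows with the earliest date0, so the dataset would now look like this: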

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
2   AA        ART      9/30/16    12/1/16     2/8/18   -131
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137
37  ABMT      ART      9/30/16    2/14/17    2/16/18    -139

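The second condition drops the rows with the lowest diff value, so that each ticker keeps only the row with the highest diff. The fully filtered dataset would then look like this: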

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137
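
The two conditions described above can also be written out one step at a time. The following is only a sketch, not the approach from the answer below: it assumes date0 parses with the format `%m/%d/%y`, uses just the AA and ABMT rows from the sample, and the names `earliest` and `final` are illustrative.

```python
import pandas as pd

# AA and ABMT rows copied from the sample data, keeping the original index labels.
df = pd.DataFrame(
    {
        "ticker": ["AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/1/16", "12/1/16", "2/8/17", "2/8/17",
                  "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-62, -131, -62, -131, -137, -139, -137],
    },
    index=[1, 2, 3, 4, 36, 37, 38],
)

# Parse the dates so that comparisons are chronological (assumed format).
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

# Condition 1: keep rows whose date0 is the earliest date0 for their ticker.
earliest = df[df["date0"] == df.groupby("ticker")["date0"].transform("min")]

# Condition 2: within each ticker, keep the row with the highest diff,
# which drops the lowest (most negative) diff values.
final = earliest.loc[earliest.groupby("ticker")["diff"].idxmax()].sort_index()

print(earliest)
print(final)
```

Here `earliest` corresponds to the intermediate table and `final` to the last table, restricted to the two tickers that have duplicates.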

Thanks for your help.


Edit:

Following Wen's answer, I have updated my code to the following:

import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)

Which returns:

    Unnamed: 0 ticker  dim cal_date     date0     date1  diff
 0           0      A  ART  9/30/16  12/20/16  12/20/17   -81
 1           1     AA  ART  9/30/16   12/1/16   12/1/17   -62
 2           2     AA  ART  9/30/16   12/1/16    2/8/18  -131
 3           3     AA  ART  9/30/16    2/8/17   12/1/17   -62
 4           4     AA  ART  9/30/16    2/8/17    2/8/18  -131
 5           5   AABA  ART  9/30/16   11/9/16   11/9/17   -40
 6           6    AAC  ART  9/30/16   11/8/16   11/8/17   -39
 7           7    AAL  ART  9/30/16  10/20/16  10/20/17   -20
 8           8   AAMC  ART  9/30/16   11/7/16   11/7/17   -38
 9           9   AAME  ART  9/30/16  11/14/16  11/14/17   -45
10          36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
11          37   ABMT  ART  9/30/16   2/14/17   2/16/18  -139
12          38   ABMT  ART  9/30/16   2/16/17   2/14/18  -137

Then I added:

# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)

data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)

Which returns:

    Unnamed: 0 ticker  dim cal_date      date0     date1   diff
 0           0      A  ART  9/30/16 2016-12-20  12/20/17  -81.0
 1           1     AA  ART  9/30/16 2016-12-01   12/1/17  -62.0
 2           2     AA  ART  9/30/16 2016-12-01    2/8/18 -131.0
 3           3     AA  ART  9/30/16 2017-02-08   12/1/17  -62.0
 4           4     AA  ART  9/30/16 2017-02-08    2/8/18 -131.0
 5           5   AABA  ART  9/30/16 2016-11-09   11/9/17  -40.0
 6           6    AAC  ART  9/30/16 2016-11-08   11/8/17  -39.0
 7           7    AAL  ART  9/30/16 2016-10-20  10/20/17  -20.0
 8           8   AAMC  ART  9/30/16 2016-11-07   11/7/17  -38.0
 9           9   AAME  ART  9/30/16 2016-11-14  11/14/17  -45.0
10          36   ABMT  ART  9/30/16 2017-02-14   2/14/18 -137.0
11          37   ABMT  ART  9/30/16 2017-02-14   2/16/18 -139.0
12          38   ABMT  ART  9/30/16 2017-02-16   2/14/18 -137.0

Unfortunately, no luck so far.

【Comments】:

  • Should AA -131 be removed?
  • Yes, AA -131 (row 2); I'll edit it.

Tags: python pandas dataframe filter


【Solution 1】:

Use sort_values followed by drop_duplicates:

df.sort_values(['date0','diff'],ascending=[False,True]).drop_duplicates('ticker',keep='last').sort_index()
Out[1071]: 
   ticker  dim cal_date     date0     date1  diff
0       A  ART  9/30/16  12/20/16  12/20/17   -81
1      AA  ART  9/30/16   12/1/16   12/1/17   -62
5    AABA  ART  9/30/16   11/9/16   11/9/17   -40
6     AAC  ART  9/30/16   11/8/16   11/8/17   -39
7     AAL  ART  9/30/16  10/20/16  10/20/17   -20
8    AAMC  ART  9/30/16   11/7/16   11/7/17   -38
9    AAME  ART  9/30/16  11/14/16  11/14/17   -45
36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
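
For anyone who wants to run the one-liner above end to end, here is a minimal sketch. It assumes a subset of the sample rows from the question and parses date0 with the format `%m/%d/%y`, so that the sort is chronological rather than alphabetical; the variable name `result` is illustrative. Sorting date0 in descending order and diff in ascending order places the earliest date0 and the highest diff last within each ticker, which is why `keep='last'` selects the wanted row.

```python
import pandas as pd

# A subset of the question's sample rows, keeping the original index labels.
df = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17", "2/8/17",
                  "2/14/17", "2/14/17", "2/16/17"],
        "date1": ["12/20/17", "12/1/17", "2/8/18", "12/1/17", "2/8/18",
                  "2/14/18", "2/16/18", "2/14/18"],
        "diff": [-81, -62, -131, -62, -131, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 36, 37, 38],
)

# Parse date0 so the sort compares dates, not strings (assumed format).
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

result = (
    df.sort_values(["date0", "diff"], ascending=[False, True])  # latest date first, lowest diff first
    .drop_duplicates("ticker", keep="last")                     # last row per ticker: earliest date, highest diff
    .sort_index()                                               # restore the original row order
)
print(result)
```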

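【Discussion】:

  • Thanks for the reply; unfortunately it doesn't seem to work. I'll edit my question to include my code sample so you can see what it returns.
  • @michael0196 You forgot to assign the result back: data = data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
  • Oh.. sorry, I'm an idiot haha. Still very new to Python. Thank you very much.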
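
The point made in the comments can be checked directly: `sort_values` and `drop_duplicates` return new DataFrames by default and leave the original untouched. A small sketch, assuming only the four AA rows from the question and the date format `%m/%d/%y`; `rows_before` and `rows_after` are illustrative names.

```python
import pandas as pd

# The four AA rows from the question.
data = pd.DataFrame(
    {
        "ticker": ["AA", "AA", "AA", "AA"],
        "date0": ["12/1/16", "12/1/16", "2/8/17", "2/8/17"],
        "diff": [-62, -131, -62, -131],
    },
    index=[1, 2, 3, 4],
)
data["date0"] = pd.to_datetime(data["date0"], format="%m/%d/%y")

# Without assignment the filtered frame is discarded and `data` is unchanged.
data.sort_values(["date0", "diff"], ascending=[False, True]).drop_duplicates("ticker", keep="last")
rows_before = len(data)

# Assigning the result back to `data` keeps the filtered rows.
data = (
    data.sort_values(["date0", "diff"], ascending=[False, True])
    .drop_duplicates("ticker", keep="last")
    .sort_index()
)
rows_after = len(data)

print(rows_before, rows_after)
```

A related detail in the question's code: `Series.replace("'", "")` only replaces cells whose entire value is `'`. If the intent was to strip quote characters from inside each string, `data['date0'].str.replace("'", "", regex=False)` would be the substring version.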