【Posted】: 2018-06-15 18:12:33
【Question】:
I have the following DataFrame ds, which I arrived at via .merge:
Date_x Invoice_x Name Coupon_x Location_x Date_y \
1 2017-12-24 700349.0 John Doe NONE VAGG1 2017-12-24
2 2017-12-24 700349.0 John Doe NONE VAGG1 2017-12-24
4 NaN NaN Sue Simpson NaN NaN 2017-12-23
Invoice_y Price Coupon_y Location_y
1 800345 17.95 CHANGE VAGG1
2 800342 9.95 GADSLR VAGG1
4 800329 34.95 GADSLR GG2
The output I'm looking for is:
Date Invoice Name Coupon Location Price
1 2017-12-24 700349 John Doe NONE VAGG1 17.95
2 2017-12-24 700349 John Doe NONE VAGG1 9.95
By using the following code:
ds = ds.query('Price_x != Price_y')
I get:
Date_x Invoice_x Name Price_x Coupon_x Location_x \
1 2017-12-24 700349.0 John Doe 59.95 NONE VAGG1
2 2017-12-24 700349.0 John Doe 59.95 NONE VAGG1
4 NaN NaN Sue Simpson NaN NaN NaN
Date_y Invoice_y Price_y Coupon_y Location_y
1 2017-12-24 800345 17.95 CHANGE VAGG1
2 2017-12-24 800342 9.95 GADSLR VAGG1
4 2017-12-23 800329 34.95 GADSLR GG2
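This is close to what I want. .drop and .rename can take care of the extra columns. What's really missing is a way to get rid of rows whose Name appears only once.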
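For reference, the .drop/.rename cleanup could look roughly like the sketch below. The frame here is a hypothetical stand-in that carries only a few of the merged columns, and the choice of which suffixed column to keep for each field is an assumption based on the desired output above.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame; only a subset of columns is shown.
ds = pd.DataFrame({
    "Date_x": ["2017-12-24", "2017-12-24"],
    "Invoice_x": [700349.0, 700349.0],
    "Name": ["John Doe", "John Doe"],
    "Price_x": [59.95, 59.95],
    "Date_y": ["2017-12-24", "2017-12-24"],
    "Invoice_y": [800345, 800342],
    "Price_y": [17.95, 9.95],
})

# Drop the columns that should not appear in the final output,
# then strip the merge suffixes from the columns that remain.
cleaned = (
    ds.drop(columns=["Price_x", "Date_y", "Invoice_y"])
      .rename(columns={"Date_x": "Date", "Invoice_x": "Invoice", "Price_y": "Price"})
)
print(cleaned)
```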
I've been trying logic along the following lines in the query statement:
ds = ds.query('Price_x != Price_y & Name > 1')
This results in the following error:
TypeError: '>' not supported between instances of 'str' and 'int'
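The TypeError arises because, inside query, `Name > 1` compares the strings in the Name column with the integer 1 instead of counting how often each name occurs. One possible workaround, sketched here on a small stand-in frame, is to compute a per-row occurrence count first and filter on that. The sample data and the helper column name_count are illustrative assumptions, not part of the original frame.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame, reduced to the relevant columns.
ds = pd.DataFrame(
    {
        "Name": ["John Doe", "John Doe", "Sue Simpson"],
        "Price_x": [59.95, 59.95, float("nan")],
        "Price_y": [17.95, 9.95, 34.95],
    },
    index=[1, 2, 4],
)

# For every row, store how many rows share its Name.
# transform keeps the result aligned with ds's own index.
ds["name_count"] = ds.groupby("Name")["Name"].transform("size")

# Now the count can be compared with an integer inside query.
result = (
    ds.query("Price_x != Price_y and name_count > 1")
      .drop(columns="name_count")
)
print(result)
```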
Edit:
ds = ds[(ds[Price_x] != ds[Price_y]) & (ds['Name'].value_counts() > 1)]
results in:
NameError: name 'Price_x' is not defined
Alternatively, trying:
ds = ds[(ds.Price_x != ds.Price_y) & (ds['Name'].value_counts() > 1)]
results in:
c:\users\...\python\python36\lib\site-packages\pandas\core\indexes\base.py:3140: RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
C:\Users\...\Python\Python36\Scripts\ipython:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
and ds ends up empty:
Empty DataFrame
Columns: [Date_x, Invoice_x, Name, Price_x, Coupon_x, Location_x, Date_y, Invoice_y, Price_y, Coupon_y, Location_y]
Index: []
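The empty result appears to come from index alignment: value_counts() returns a Series indexed by the distinct names, while the price comparison is indexed by row label, so combining the two with & leaves no row where both conditions are True. A minimal sketch of one way around this, assuming a small stand-in frame, maps the counts back onto the rows before building the mask. The variable names counts, per_row and mask are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame, reduced to the relevant columns.
ds = pd.DataFrame(
    {
        "Name": ["John Doe", "John Doe", "Sue Simpson"],
        "Price_x": [59.95, 59.95, float("nan")],
        "Price_y": [17.95, 9.95, 34.95],
    },
    index=[1, 2, 4],
)

counts = ds["Name"].value_counts()   # indexed by Name, not by row label
per_row = ds["Name"].map(counts)     # one count per row, indexed like ds

# Both operands now share ds's index, so no reindexing is needed.
mask = (ds["Price_x"] != ds["Price_y"]) & (per_row > 1)
result = ds[mask]
print(result)
```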
【Comments】:
-
What is your desired output? What do you mean when you say "get rid of rows whose Name appears only once"?
-
df.query("...").groupby(by=[...]).filter(lambda g: g.shape[0] > 1)
-
@jakevdp I showed what I'm looking for in the second code block, the one introduced as the output I'm looking for.
-
@PaulH Any chance you could post that whole line as an answer?
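The groupby/filter one-liner suggested in the comments leaves its query string and grouping keys elided. A sketch of how it might be filled in is shown here, assuming the price comparison from the question as the query and Name as the only grouping key. Both choices, along with the sample data, are assumptions rather than something the commenter stated.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame, reduced to the relevant columns.
ds = pd.DataFrame(
    {
        "Name": ["John Doe", "John Doe", "Sue Simpson"],
        "Price_x": [59.95, 59.95, float("nan")],
        "Price_y": [17.95, 9.95, 34.95],
    },
    index=[1, 2, 4],
)

result = (
    ds.query("Price_x != Price_y")          # keep rows whose prices differ
      .groupby("Name")                      # assumed grouping key
      .filter(lambda g: g.shape[0] > 1)     # keep only groups with more than one row
)
print(result)
```

GroupBy.filter returns the surviving rows with their original index labels, so a .drop/.rename cleanup can still be applied afterwards.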
Tags: python python-3.x pandas numpy scikit-learn