【Title】: find specific value that meets conditions - python
【Posted】: 2021-05-15 19:45:48
【Question】:

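I am trying to create a new column containing values that meet specific conditions. The code below explains the logic to some extent, but it does not produce the correct output: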

import pandas as pd
import numpy as np


df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'], 
                'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0], 
                'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
                'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
                'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
                'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })

max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
                  (df['type'] == 1) & (df['colour'] == 'red')]


max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
                np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]


df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)

Incorrect output:

                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        12.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      6000.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN

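Why the output is incorrect:

The value 12.0 in column df['pixelLimit'] at index row 1 is incorrect because it comes from df['minPixel'] at index row 6, whose df['date'] datetime is 2019-08-08 17:00:00, which is later than the 2019-08-08 16:00:00 df['colourDate'] datetime in index row 1.

The value 6000.0 in column df['pixelLimit'] at index row 3 is incorrect because it comes from df['maxPixel'] at index row 7, whose df['date'] datetime is 2019-08-09 16:00:00, which is later than the 2019-08-06 22:00:00 df['colourDate'] datetime in index row 3.

Correct output: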

                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN

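Why this output is correct:

The value 14.0 in column df['pixelLimit'] at index row 1 is correct because we are looking for the minimum value in column df['minPixel'] among rows whose df['date'] datetime is less than the df['colourDate'] datetime in index row 1 and greater than or equal to the df['date'] datetime in index row 1.

The value 5184.0 in column df['pixelLimit'] at index row 3 is correct because we are looking for the maximum value in column df['maxPixel'] among rows whose df['date'] datetime is less than the df['colourDate'] datetime in index row 3 and greater than or equal to the df['date'] datetime in index row 3.

Notes:

Perhaps np.select is not the best fit for this task, and some kind of function would serve it better?

Also, perhaps I need to create some kind of dynamic len as a starting point for each row?

Request

Could anyone please help me amend my code to achieve the correct output?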

【Comments】:

  • Sorry for the typo, @sammywemmy and Allolz

Tags: python pandas dataframe indexing


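【Solution 1】:

For a matching problem like this, one possibility is to do a full (cross) merge, then use a Boolean Series to subset to the rows that satisfy your condition (for that row), and find the max or min among all possible matches. Since each case needs slightly different columns and a different function, I split the operation into two very similar pieces of code, one dealing with 1/blue and the other with 1/red.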
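As a minimal sketch of that pattern on hypothetical toy data (the frames `left` and `right` and their column names are made up for illustration and are not from the question), a cross merge pairs every row with every candidate row, a Boolean mask keeps the pairs inside each window, and a groupby reduces each group. It assumes pandas 1.2 or later for `how='cross'`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question
left = pd.DataFrame({'key': ['a', 'b'], 'start': [1, 3], 'end': [3, 4]})
right = pd.DataFrame({'t': [1, 2, 3, 4], 'value': [10, 5, 7, 9]})

# Pair every row of `left` with every row of `right` (2 x 4 = 8 rows)
pairs = left.merge(right, how='cross')

# Keep only pairs whose t lies inside [start, end] (inclusive on both ends)
in_window = pairs[pairs['t'].between(pairs['start'], pairs['end'])]

# Reduce each group of surviving candidates to a single value
result = in_window.groupby('key')['value'].min()
print(result)
```

The intermediate frame has len(left) * len(right) rows, so subsetting `left` to only the rows that need a match keeps the merge small.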

First some housekeeping: convert the date columns to datetime

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])

Calculate the min pixel for 1/red between the times on each row

# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use: 
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')

# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
            .groupby('index')['minPixel_y'].min()
            .rename('pixel_limit'))
#index
#1    14
#Name: pixel_limit, dtype: int64

# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()

dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')

smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
           .groupby('index')['maxPixel_y'].max()
           .rename('pixel_limit'))

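Finally, because the groupby above was done over the original index (i.e. 'index'), we can simply assign the result back and it aligns with the original DataFrame.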

df['pixel_limit'] = pd.concat([smin, smax])

                 date  type colour  maxPixel  minPixel          colourDate  pixel_limit
0 2019-08-06 09:00:00   0.0   blue       255        86 2019-08-06 12:00:00          NaN
1 2019-08-06 12:00:00   1.0    red      7346        96 2019-08-08 16:00:00         14.0
2 2019-08-06 18:00:00   NaN    NaN        32        14 2019-08-06 23:00:00          NaN
3 2019-08-06 21:00:00   1.0   blue      5184      3540 2019-08-06 22:00:00       5184.0
4 2019-08-07 09:00:00   NaN    NaN       600       528 2019-08-08 09:00:00          NaN
5 2019-08-07 16:00:00   NaN    NaN       322       300 2019-08-09 16:00:00          NaN
6 2019-08-08 17:00:00   0.0   blue        72        12 2019-08-08 23:00:00          NaN
7 2019-08-09 16:00:00   0.0    red      6000      4009 2019-08-11 16:00:00          NaN
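The assignment works because pandas aligns a Series on its index labels when it is assigned as a column. A small sketch with made-up values (the frame and the Series below are hypothetical) illustrates this: labels missing from the Series become NaN, and the order of the Series does not matter.

```python
import pandas as pd

# Hypothetical frame with the default index 0..3
df = pd.DataFrame({'x': [10, 20, 30, 40]})

# A Series that covers only some index labels, in arbitrary order
s = pd.Series([1.5, 2.5], index=[3, 1], name='pixel_limit')

# Assignment aligns on index labels, not on position
df['pixel_limit'] = s
print(df)
```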

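If you need to bring along a lot of other information from the row with the min/max pixel, then instead of a groupby min or max we would use sort_values and groupby + head/tail to get the row with the min or max pixel. For the min this would look like the following (with a slight renaming of the suffixes):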

# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
#              .merge(df[['date', 'minPixel']].reset_index().assign(t=1), 
#                     on='t', suffixes=['', '_match'])) 

# Only keep rows between the dates, then among those find the min minPixel row. 
# A bunch of renaming. 
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
            .sort_values('minPixel_match', ascending=True)
            .groupby('index').head(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'minPixel_match': 'pixel_limit'}))

Max would then be similar, using .tail instead

dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])

smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
            .sort_values('maxPixel_match', ascending=True)
            .groupby('index').tail(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'maxPixel_match': 'pixel_limit'}))

Finally we concat along axis=1, now that we need to join multiple columns to the original:

result = pd.concat([df, pd.concat([smin, smax])], axis=1)

                  date  type colour  maxPixel  minPixel           colourDate  index_match           date_match  pixel_limit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00          NaN                  NaN          NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00          2.0  2019-08-06 18:00:00         14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00          NaN                  NaN          NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00          3.0  2019-08-06 21:00:00       5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00          NaN                  NaN          NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00          NaN                  NaN          NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00          NaN                  NaN          NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00          NaN                  NaN          NaN
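As a way to cross-check these results on a small frame, the same rule can be sketched row by row with apply. This is not the approach used above, and the helper name `pixel_limit` is hypothetical; it rescans the whole frame for every row, so it is likely to be much slower than the merge on large data. Like the merge version, it uses `between`, which is inclusive on both ends.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00',
             '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00',
             '2019-08-08 17:00:00', '2019-08-09 16:00:00'],
    'type': [0, 1, np.nan, 1, np.nan, np.nan, 0, 0],
    'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
    'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
    'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
    'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00',
                   '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00',
                   '2019-08-08 23:00:00', '2019-08-11 16:00:00'],
})
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])


def pixel_limit(row):
    """Hypothetical helper: reduce the rows inside this row's date window."""
    if row['type'] != 1:  # also True for NaN, so those rows get NaN
        return np.nan
    # Rows whose date lies in [row.date, row.colourDate], inclusive
    window = df[df['date'].between(row['date'], row['colourDate'])]
    if row['colour'] == 'blue':
        return window['maxPixel'].max()
    if row['colour'] == 'red':
        return window['minPixel'].min()
    return np.nan


df['pixel_limit'] = df.apply(pixel_limit, axis=1)
print(df[['date', 'type', 'colour', 'pixel_limit']])
```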

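【Discussion】:

  • This works really well. Thank you. I am increasingly seeing the broad applicability of groupby.
  • I think a cross join could replace the assign here: dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
  • Using the cross join causes a problem in this part... smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)] .groupby('index_x')['minPixel_y'].min() .rename('pixel_limit'))
  • Ah.. yes, I was a bit hasty
  • Guys, the code runs very efficiently on a df of over 100k rows! However, I am having a hard time manually checking which row index the pixel_limit was pulled from. How can I add another column that references the row index or date from which the pixel_limit value was pulled, @sammywemmy and ALollz?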