[Question title]: How to add to numpy arrays based on conditional output from a function?
[Posted]: 2023-03-13 03:36:01
[Question description]:

Given these dataframes (in the real data, each one may have millions of rows):

df1 =

   Start  End
0     10   20
1     25   35

df2 =

   Start  End
0     12   18
1      2    8
2     22   28

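df1 can be thought of as the master ranges and df2 as the sample ranges. I need to store the offsets of each range in df2 as a set of columns. With help from sammywemmy, I was able to get output with the offsets: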

# Import required modules
import numpy as np
import pandas as pd

# Define dataframes
df1 = pd.DataFrame([[10, 20], [25, 35]], columns=['Start', 'End'])
df2 = pd.DataFrame([[12, 18], [2, 8], [22, 28]], columns=['Start', 'End'])

# Extract the columns as 1d numpy arrays
np_start1 = df1['Start'].to_numpy()
np_end1 = df1['End'].to_numpy()
np_start2 = df2['Start'].to_numpy()
np_end2 = df2['End'].to_numpy()

# Use numpy tiles to create shapes that allow elementwise math
tile_start1 = np.tile(np_start1, (len(df2), 1)).T
tile_end1 = np.tile(np_end1, (len(df2), 1)).T
tile_start2 = np.tile(np_start2, (len(df1), 1))
tile_end2 = np.tile(np_end2, (len(df1), 1))

# Do some math
np_start1_end2_diff = np.subtract(tile_start1, tile_end2)
np_start2_end1_diff = np.subtract(tile_start2, tile_end1)
np_start2_start1_diff = np.subtract(tile_start2, tile_start1)
np_end2_end1_diff = np.subtract(tile_end2, tile_end1)

# Create columns
col_start1_end2_diff = [f'S1-E2_{i}' for i in range(len(df2))]
col_start2_end1_diff = [f'S2-E1_{i}' for i in range(len(df2))]
col_start2_start1_diff = [f'S2-S1_{i}' for i in range(len(df2))]
col_end2_end1_diff = [f'E2-E1_{i}' for i in range(len(df2))]

# Create dataframes of calculated numpy arrays
df_start1_end2_diff = pd.DataFrame(np_start1_end2_diff, columns=col_start1_end2_diff)
df_start2_end1_diff = pd.DataFrame(np_start2_end1_diff, columns=col_start2_end1_diff)
df_start2_start1_diff = pd.DataFrame(np_start2_start1_diff, columns=col_start2_start1_diff)
df_end2_end1_diff = pd.DataFrame(np_end2_end1_diff, columns=col_end2_end1_diff)

# Lump calculated numpy arrays into output dataframe
df_output = pd.concat([
    df_start1_end2_diff,
    df_start2_end1_diff,
    df_start2_start1_diff,
    df_end2_end1_diff
], axis=1)

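# Sort the columns by the numeric suffix after the last underscore
# (sorted() is stable, so the S1-E2, S2-E1, ... order is kept within each suffix)
filtered = df_output.columns[df_output.columns.str.contains(r'\d')]
cols = sorted(filtered, key=lambda x: int(x.rsplit('_', 1)[1]))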
df_output = df_output.reindex(cols, axis='columns')

print(df_output)

Output:

   S1-E2_0  S2-E1_0  S2-S1_0  E2-E1_0  S1-E2_1  S2-E1_1  S2-S1_1  E2-E1_1  S1-E2_2  S2-E1_2  S2-S1_2  E2-E1_2
0       -8       -8        2       -2        2      -18       -8      -12      -18        2       12        8
1        7      -23      -13      -17       17      -33      -23      -27       -3      -13       -3       -7
  • S1 = df1.Start
  • E1 = df1.End
  • S2 = df2.Start
  • E2 = df2.End
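
The same offset grid can also be sketched with NumPy broadcasting instead of np.tile. This is only an illustration under the sample data above; the names `offsets` and `df_offsets` are made up for the example and are not part of the original code.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[10, 20], [25, 35]], columns=['Start', 'End'])
df2 = pd.DataFrame([[12, 18], [2, 8], [22, 28]], columns=['Start', 'End'])

# df1 values become column vectors of shape (len(df1), 1); df2 values stay
# 1d with shape (len(df2),). Arithmetic between them broadcasts to a
# (len(df1), len(df2)) grid without building the tiled copies explicitly.
s1 = df1['Start'].to_numpy()[:, None]
e1 = df1['End'].to_numpy()[:, None]
s2 = df2['Start'].to_numpy()
e2 = df2['End'].to_numpy()

# Hypothetical helper dict: offset name -> 2d array of differences
offsets = {
    'S1-E2': s1 - e2,
    'S2-E1': s2 - e1,
    'S2-S1': s2 - s1,
    'E2-E1': e2 - e1,
}

# Build the columns already grouped by df2 row index i, so no sort is needed
df_offsets = pd.DataFrame({
    f'{name}_{i}': values[:, i]
    for i in range(len(df2))
    for name, values in offsets.items()
})

print(df_offsets)
```

Either way the result has len(df1) × len(df2) cells per offset, so with millions of rows on both sides the full grid may not fit in memory and would likely need to be processed in chunks.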

The part I am struggling with is that I also need to add an extra column for each row in df2, based on the output of the following function:

def get_position(start1, end1, start2, end2):
    if start1 >= start2 and end1 <= end2:
        return 'A'
    elif start1 > end2:
        return 'B'
    elif start1 == end2:
        return 'C'
    elif start1 < end2 and end1 > end2:
        return 'D'
    elif start1 < start2 and end1 > start2:
        return 'E'
    elif end1 == start2:
        return 'F'
    elif end1 < start2:
        return 'G'

The target output should look like this:

   S1-E2_0  S2-E1_0  S2-S1_0  E2-E1_0  Pos_0  S1-E2_1  S2-E1_1  S2-S1_1  E2-E1_1  Pos_1  S1-E2_2  S2-E1_2  S2-S1_2  E2-E1_2  Pos_2
0       -8       -8        2       -2      A        2      -18       -8      -12      B      -18        2       12        8      G
1        7      -23      -13      -17      B       17      -33      -23      -27      B       -3      -13       -3       -7      A

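How can I append a Pos_{i} column for each row in df2 that holds the output of get_position()?

When we are dealing with millions of rows, is a function made of a chain of if/else conditions a good idea? I read that we can use the vectorize function to improve performance, but I could not figure out how to do that for get_position() in my scenario.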
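
For reference, a minimal sketch of what wrapping the scalar function with np.vectorize might look like on the sample data. The wrapper name `get_position_vec` and the small input arrays are assumptions for illustration. NumPy's documentation describes np.vectorize as a convenience that is essentially a for loop, so it should not be expected to speed things up by itself.

```python
import numpy as np

def get_position(start1, end1, start2, end2):
    # Scalar function from the question, unchanged
    if start1 >= start2 and end1 <= end2:
        return 'A'
    elif start1 > end2:
        return 'B'
    elif start1 == end2:
        return 'C'
    elif start1 < end2 and end1 > end2:
        return 'D'
    elif start1 < start2 and end1 > start2:
        return 'E'
    elif end1 == start2:
        return 'F'
    elif end1 < start2:
        return 'G'

# Hypothetical wrapper: accepts arrays and broadcasts them, but still calls
# the Python function once per element. otypes=[object] lets it also hold
# None when no branch matches.
get_position_vec = np.vectorize(get_position, otypes=[object])

# Sample ranges: df1 as column vectors, df2 as 1d arrays
s1 = np.array([10, 25])[:, None]
e1 = np.array([20, 35])[:, None]
s2 = np.array([12, 2, 22])
e2 = np.array([18, 8, 28])

pos = get_position_vec(s1, e1, s2, e2)  # shape (2, 3)
print(pos)
```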

[Comments]:

    Tags: python pandas numpy dataframe


    [Solution 1]:

    You can vectorize get_position() with np.select():

    import numpy as np

    def get_position(start1, end1, start2, end2):
        # np.select returns the choice for the first condition that is True
        return np.select([
            (start1 >= start2) & (end1 <= end2),
            start1 > end2,
            start1 == end2,
            (start1 < end2) & (end1 > end2),
            # etc...
            ], ['A', 'B', 'C', 'D'], '?')
    

    Now just call it with the whole arrays for start1, end1, etc., rather than with individual cells.
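
Putting this together, a sketch that fills in all seven conditions from the question's if/elif chain and calls the function on broadcast arrays might look like the following. The remaining conditions are copied from the question rather than from this answer, and `df_pos` is a name invented for the example.

```python
import numpy as np
import pandas as pd

def get_position(start1, end1, start2, end2):
    # Conditions in the same order as the if/elif chain in the question;
    # np.select picks the choice for the first condition that is True.
    conditions = [
        (start1 >= start2) & (end1 <= end2),
        start1 > end2,
        start1 == end2,
        (start1 < end2) & (end1 > end2),
        (start1 < start2) & (end1 > start2),
        end1 == start2,
        end1 < start2,
    ]
    choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
    # '?' marks cells where no condition matched
    return np.select(conditions, choices, default='?')

df1 = pd.DataFrame([[10, 20], [25, 35]], columns=['Start', 'End'])
df2 = pd.DataFrame([[12, 18], [2, 8], [22, 28]], columns=['Start', 'End'])

# df1 as column vectors, df2 as 1d arrays: every comparison broadcasts to
# a (len(df1), len(df2)) grid, one cell per (df1 row, df2 row) pair.
s1 = df1['Start'].to_numpy()[:, None]
e1 = df1['End'].to_numpy()[:, None]
s2 = df2['Start'].to_numpy()
e2 = df2['End'].to_numpy()

pos = get_position(s1, e1, s2, e2)
df_pos = pd.DataFrame(pos, columns=[f'Pos_{i}' for i in range(len(df2))])
print(df_pos)
```

Adding df_pos as the last frame in the question's pd.concat call should place each Pos_i right after E2-E1_i, because the column sort by suffix is stable. Note that with the conditions exactly as written, the sample pairs shown as A in the target table come out as D, so the condition order or the argument order may need adjusting to match the intended labels.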
