【Posted】:2021-11-02 00:21:05
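【Problem description】:
I need to reduce the computation time of the following Python code, which contains multiple if/else statements. The code runs on Databricks, so I am also open to PySpark solutions. Currently it takes more than an hour to run, so any help would be appreciated.
unique_list_code: the list of unique codes from the concat_df['C_Code'] column, used to filter the DataFrame rows that contain each code.
concat_df: a Pandas DataFrame with 4 million records.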
unique_list_code = list(concat_df['C_Code'].unique())
MC_list = []
SN_list = []
AN_list = []
Nothing_list = []
for i in range(0, len(unique_list_code)):
    print(unique_list_code[i])
    code_filtered_df = concat_df[concat_df['C_Code'] == unique_list_code[i]]
    # SN_Filter:
    SN_filter = code_filtered_df[(code_filtered_df['D_Type'] == 'SN') & (code_filtered_df['Comm_P'] == 'P-mail')]
    if len(SN_filter) > 0:
        print("Found SN")
        SN_list.append(unique_list_code[i])
        clean_up(SN_filter)
    else:
        # AN_Filter
        AN_filter = code_filtered_df[(code_filtered_df['D_Type'] == 'AN') & (code_filtered_df['Comm_P'] == 'P-mail')]
        if len(AN_filter) > 0:
            print("Found AN")
            AN_list.append(unique_list_code[i])
            clean_up(AN_filter)
        else:
            # MC_Check
            MF_filter = code_filtered_df[code_filtered_df['MC_Flag'] == 'Y']
            MF_DNS_filter = MF_filter[~(((MF_filter['D_Type'] == 'AN') | (MF_filter['D_Type'] == 'SN')) & (MF_filter['Comm_P'] == 'DNS'))]
            if len(MF_DNS_filter) > 0:
                print("Found MC")
                MC_list.append(unique_list_code[i])
                clean_up(MF_DNS_filter)
            else:
                print("Nothing Found")
                Nothing_list.append(unique_list_code[i])
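The loop above rescans the whole DataFrame once per unique code. One way to express the same SN > AN > MC > Nothing priority is to compute the three row conditions once and aggregate them per C_Code. The following is only a sketch under assumptions: the function name classify_codes is hypothetical, the column names are taken from the question, and clean_up is not reproduced because its body is not shown. Note that groupby drops rows whose C_Code is missing by default.

```python
import numpy as np
import pandas as pd


def classify_codes(concat_df: pd.DataFrame) -> pd.Series:
    """Return one label per C_Code: 'SN', 'AN', 'MC' or 'Nothing'.

    Hypothetical helper; mirrors the if/else priority of the loop.
    """
    is_pmail = concat_df["Comm_P"] == "P-mail"
    is_sn = (concat_df["D_Type"] == "SN") & is_pmail
    is_an = (concat_df["D_Type"] == "AN") & is_pmail
    # Rows excluded from the MC check: AN/SN rows whose Comm_P is 'DNS'.
    is_excluded = concat_df["D_Type"].isin(["AN", "SN"]) & (concat_df["Comm_P"] == "DNS")
    is_mc = (concat_df["MC_Flag"] == "Y") & ~is_excluded

    flags = pd.DataFrame(
        {"C_Code": concat_df["C_Code"], "SN": is_sn, "AN": is_an, "MC": is_mc}
    )
    # A code matches a category if at least one of its rows matches.
    per_code = flags.groupby("C_Code", sort=False)[["SN", "AN", "MC"]].any()

    # np.select takes the first condition that is True, which gives the
    # same priority order as the nested if/else.
    labels = np.select(
        [per_code["SN"].to_numpy(), per_code["AN"].to_numpy(), per_code["MC"].to_numpy()],
        ["SN", "AN", "MC"],
        default="Nothing",
    )
    return pd.Series(labels, index=per_code.index, name="category")
```

The four lists can then be read off the result, for example SN_list = labels.index[labels == "SN"].tolist() after labels = classify_codes(concat_df). If clean_up needs the matching rows, the same row masks could be reused to select them, but that depends on what clean_up does.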
Update: I switched to a PySpark DataFrame with the code below, but it still does not solve the problem.
from pyspark.sql.functions import col
from pyspark.sql.functions import when
MC_list = []
SN_list = []
AN_list = []
Nothing_list = []
for i in range(0, len(unique_list_code)):
    code_filtered_df = df.filter(col("C_code") == unique_list_code[i])
    SN_filter = code_filtered_df.filter((col('D_Type') == 'SN') & (col('Comm_P') == 'P-mail'))
    if SN_filter.count() > 0:
        SN_list.append(unique_list_code[i])
    else:
        AN_filter = code_filtered_df.filter((col('D_Type') == 'AN') & (col('Comm_P') == 'P-mail'))
        if AN_filter.count() > 0:
            AN_list.append(unique_list_code[i])
        else:
            MF_filter = code_filtered_df.filter(col('MC_Flag') == 'Y')
            MF_DNS_filter = MF_filter[~(((col('D_Type') == 'AN') | (col('D_Type') == 'SN')) & (col('Comm_P') == 'DNS'))]
            if MF_DNS_filter.count() > 0:
                print("Found MC")
                MC_list.append(unique_list_code[i])
            else:
                print("Nothing Found")
                Nothing_list.append(unique_list_code[i])
【Discussion】:
- Use line_profiler to check what the bottleneck is here.
Tags: python performance for-loop if-statement pyspark