【Question title】: How can I update my DataFrame in Pandas and export it to Excel?
【Posted】: 2018-06-02 03:39:32
【Question description】:

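I'm new to Python and to programming. Please forgive me if my question seems silly or unclear. I have done some research, but frankly I have trouble understanding some of the explanations I've read.

I have a DataFrame containing a large amount of appointment scheduling data for a hospital. It needs to be evaluated and modified so that it can be imported into their new scheduling application. Unfortunately, the vendor's import tool is garbage and does zero checking, so I have to write something that checks the old data and converts it into upload data for the new system. Here is a sample of the format: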

start appointment   department  procedure   resource
20171020131500      MAM         BDXMAMUNI   BDIAG2    
20171020133000      MAM         BDXMAMUNI   BDIAG1    
20171020141500      MAM         BDXMAMUNI   BDIAG2    
20171020143000      MAM         BDXMAMUNI   BDIAG1    
20171020144500      MAM         BDXMAMBIL   BDIAG2    
20171020150000      MAM         BDXMAMBIL   BDIAG1    
20171020151500      MAM         BDXMAMUNI   BDIAG2    
20171023080000      MAM         BDXMAMBIL   BDIAG1    
20171023081500      MAM         BDXMAMBIL   BDIAG2       

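I am trying to update the resource based on criteria. This is what I came up with, but I can't get it to update the field. Below are my criteria.

If at index X minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and resource = BDIAG1, BDIAG2 or BDIAG3, then the start appointment at index X will be in resource ZBMDX3 at index X.

If at index X the start appointment has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then the start appointment at index X will be in resource ZBMDX2 at index X.

If at index X minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX1 at index X.

If at index X the start appointment has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX4 at index X.

When the output file is created, it doesn't contain any of the updated changes. I did some research on StackOverflow, but none of the threads I read seemed to work. Some people suggested using loc and ix, or doing something with df.update.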

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')

dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']


def diagnostic():  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (
                hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (
                hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'

diagnostic()

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

I made the suggested changes.

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

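Now I'm getting an error.

Traceback (most recent call last):
  File "Excel Parse.py", line 55, in
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in: >
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.

Seiji, I completely updated my code to reflect your changes. Let's look at solution 2, since it processes faster.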

import pandas as pd

my_file = 'C:\\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python     3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

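After creating the output file, I still don't see the resource field updated. I manually evaluated the first 10 rows to rule out the possibility that the code works but the conditions are simply never met. The conditions are met: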

start appointment dept      procedure   resource
20171020131500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020133000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020141500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020143000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020144500    MAM       BDXMAMBIL   BDIAG2    should change to ZBMDX1

Seiji's solution 1

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']

def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python     3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

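Same problem. The output file is not updated.

Modified the hour and minute slices

The output still shows no updates. At this point I'm wondering whether I should save the xlsx file as CSV and not use any library, or whether I should build the DataFrame from scratch by iterating each column (start appointment, resource) into its own list. What do you think?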

import pandas as pd

my_file = 'C:\\Users\cboutsikos\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14', '15']) \
         and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and  (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

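【Question comments】:

  • Maybe this is a scoping issue? You are doing things to df inside the function, but you don't return anything, so df stays unchanged outside the scope of the function call. Try defining your function with def diagnostic(df): and then calling it with df = diagnostic(df)
  • I'll try df2 = diagnostic(df)
  • Hey, if you're still stuck on this, send me a snippet of a small Excel file and I'll see where the problem is. No sensitive information, though. seiji dot armstrong at g mail
  • Yes, I am. Will do! Thanks!

Tags: python excel pandas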


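【Solution 1】:

OK, I have now looked at the sample data and found the problem. The resource column contains trailing whitespace, which causes the logic to fail. It can be removed simply with str.strip(). In addition, the start appointment field is parsed into pandas.tslib.Timestamp objects, which simplifies our logic because the minute and hour attributes can be extracted as ints. The following should work: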

def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row      

df2 = df.apply(update_val, axis='columns')
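For larger exports, the same rules could also be written without a row-by-row apply. The following is only a sketch under stated assumptions, not part of the original answer: it assumes 'start appointment' is already a datetime column, it swaps in numpy.select for the if/elif chain, and the sample rows are copied from the question (with the trailing spaces kept on purpose).

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the trailing spaces in 'resource' are deliberate.
df = pd.DataFrame({
    'start appointment': pd.to_datetime(
        ['20171020131500', '20171020133000', '20171020141500',
         '20171020144500', '20171023080000'],
        format='%Y%m%d%H%M%S'),
    'resource': ['BDIAG2    ', 'BDIAG1    ', 'BDIAG2    ',
                 'BDIAG2    ', 'BDIAG1    '],
})

ts = df['start appointment']
resource = df['resource'].str.strip()          # remove trailing whitespace
is_diag = resource.isin(['BDIAG1', 'BDIAG2', 'BDIAG3'])

# One boolean mask per rule, in the same order as the answer's if/elif chain.
conditions = [
    is_diag & (ts.dt.minute == 0) & ts.dt.hour.isin([8, 9, 10, 11, 13, 14, 15]),
    is_diag & (ts.dt.minute == 15) & ts.dt.hour.isin([8, 9, 10, 11, 13, 14, 15]),
    is_diag & (ts.dt.minute == 45) & ts.dt.hour.isin([7, 8, 9, 10, 12, 13, 14]),
    is_diag & (ts.dt.minute == 30) & ts.dt.hour.isin([8, 9, 10, 13, 14]),
]
choices = ['ZBMDX2', 'ZBMDX3', 'ZBMDX1', 'ZBMDX4']

df2 = df.copy()
# Rows matching no rule keep their (stripped) original resource.
df2['resource'] = np.select(conditions, choices,
                            default=resource.to_numpy(dtype=object))
print(df2)
```

The resulting df2 can then be written out with to_excel in the same way as in the question.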

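【Comments】:

  • Seiji, this worked perfectly for me. Thank you so much for your continued effort and support in helping me solve this! Much appreciated, and I learned a lot! My last question: how did you know that Pandas parsed the date/time field as a Timestamp?
  • No worries. You can check the type of any object with type(el). In this case, type(df.iloc[0]['start appointment']) returns pandas.tslib.Timestamp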
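The type check described in the comment above can be tried on a small frame. This is a minimal sketch with made-up rows modelled on the question's data; the exact module path printed by type() depends on the pandas version (newer releases no longer report pandas.tslib), so the sketch relies on isinstance with pd.Timestamp instead.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real Excel data.
df = pd.DataFrame({
    'start appointment': pd.to_datetime(['2017-10-20 13:15:00',
                                         '2017-10-20 13:30:00']),
    'resource': ['BDIAG2    ', 'BDIAG1    '],
})

first = df.iloc[0]['start appointment']
print(type(first))                  # module path varies by pandas version
print(df.dtypes)                    # per-column dtypes, a quick overview
is_timestamp = isinstance(first, pd.Timestamp)

# A Timestamp exposes hour and minute directly as ints.
hour, minute = first.hour, first.minute

# Its string form is 'YYYY-MM-DD HH:MM:SS', so slicing it yields
# zero-padded strings such as '08', never '8'.
as_text = str(first)
print(as_text[11:13], as_text[14:16])
```

Checking df.dtypes right after read_excel is a quick way to see which columns were parsed as dates before writing any string slicing.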
【Solution 2】:

I don't have enough reputation to comment on your question, so I'll post a modified version of your code that should work:

import pandas as pd

my_file = 'C:\\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    def time_range(start,stop):
        return [str(el).zfill(2) for el in range(start,stop+1)]

    minutes = str(row['start appointment'])[14:16] # [10:12] in sample data
    hour = str(row['start appointment'])[11:13] # [8:10] in sample data
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in time_range(8,15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in time_range(9,15)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in time_range(7,14)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in time_range(8,14)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

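I made two changes.

1) Put the sub-conditions in parentheses. I believe they were not formatted correctly in your original formulation, so they never evaluated to True

2) Changed the slice indices used on the start appointment value. Based on your sample data, the original indices return an empty str, so none of the branches was ever taken.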
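Both points can be checked in isolation with plain Python. The sketch below is illustrative only; the variable names are made up. It shows that an unparenthesized `in ... == True` is a chained comparison, and what the two pairs of slice indices return for a value formatted like the question's sample data.

```python
resource = 'BDIAG2'
allowed = ['BDIAG1', 'BDIAG2', 'BDIAG3']

# Without parentheses Python chains the comparisons:
#   (resource in allowed) and (allowed == True)
# A list never equals True, so the whole expression is False.
unparenthesized = resource in allowed == True
parenthesized = (resource in allowed) == True
print(unparenthesized, parenthesized)  # False True

# Slice positions on a 14-character YYYYMMDDHHMMSS string.
raw = str(20171020131500)
hour_new, minutes_new = raw[8:10], raw[10:12]   # indices for the sample data
minutes_old = raw[14:16]                        # past the end of the string
print(repr(hour_new), repr(minutes_new), repr(minutes_old))  # '13' '15' ''
```

Dropping `== True` altogether, as in `if resource in allowed:`, avoids the chaining problem entirely.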

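P.S. You can print the first 5 rows to the console to check whether the values were updated, instead of writing to disk every time.

【Comments】:

    【Solution 3】: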

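    A few things.

    Your function diagnostic makes changes to the global df, but it does not accept a DataFrame and does not return anything. So when you call it with df2 = diagnostic(df), you are not passing df into it, and you do not get the modified DataFrame back; you get None (of type NoneType). That is why you get the error telling you that df2 is not a pd.DataFrame object and therefore has no attribute 'to_excel'.

    It would be better if your function accepted df as input, made its changes, and returned the modified df as output.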
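The difference between the two function shapes can be seen in a small sketch. The function names and the two-row frame here are hypothetical, chosen only to illustrate the point about the missing return.

```python
import pandas as pd

df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2']})

def diagnostic_no_return(frame):
    # Modifies the frame it receives but has no return statement,
    # so the call itself evaluates to None.
    frame.loc[0, 'resource'] = 'ZBMDX4'

def diagnostic_with_return(frame):
    # Works on a copy and hands the modified frame back to the caller.
    frame = frame.copy()
    frame.loc[0, 'resource'] = 'ZBMDX4'
    return frame

result = diagnostic_no_return(df.copy())
print(result is None)            # True: result.to_excel(...) would fail

df2 = diagnostic_with_return(df)
print(df2['resource'].tolist())  # ['ZBMDX4', 'BDIAG2']
```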

    You only need to make two changes:

    1) Include df as a parameter on the first line: def diagnostic(df):

    2) Include return df as the last line.

    Something like:

    def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
        for i in range(10):
          ...
          ...
                df.loc[i, 'resource'] = 'ZBMDX4' # see explanation below.
        return df
    

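    Another issue is that you should probably use df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or, according to the docs, objects coercible to a DataFrame), whereas you are updating one value at a time.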
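A small sketch of the two update styles, using a made-up three-row frame. It also shows why the question's df.update['resource'][i] form cannot work: update is a method, so subscripting it raises a TypeError. That the question's script never reached that line (because no condition matched) is an assumption on my part.

```python
import pandas as pd

df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2', 'BDIAG3']})

# The question's form: subscripting the method object itself fails.
try:
    df.update['resource'][1] = 'ZBMDX3'
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Single value: label-based assignment with .loc
df.loc[1, 'resource'] = 'ZBMDX3'

# Several values at once: DataFrame.update takes another DataFrame,
# aligns it on index and columns, and modifies df in place.
patch = pd.DataFrame({'resource': ['ZBMDX1']}, index=[2])
df.update(patch)

print(df['resource'].tolist())  # ['BDIAG1', 'ZBMDX3', 'ZBMDX1']
```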

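    Another point is that your conditions can be simplified. Instead of writing hour == x1 or hour == x2 or ..., you can put the possible values in a list and check membership, for example hour in [x1, x2, ...]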
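Taking the membership idea one step further, the four rules could be kept in a small lookup table. This is only a sketch: RULES, DIAG and new_resource are hypothetical names, the hour lists are copied from the question's criteria, and hours and minutes are assumed to be ints.

```python
DIAG = {'BDIAG1', 'BDIAG2', 'BDIAG3'}

# minute -> (allowed hours, replacement resource), from the question's criteria
RULES = {
    15: ({8, 9, 10, 11, 13, 14, 15}, 'ZBMDX3'),
    0: ({8, 9, 10, 11, 13, 14, 15}, 'ZBMDX2'),
    45: ({7, 8, 9, 10, 12, 13, 14}, 'ZBMDX1'),
    30: ({8, 9, 10, 13, 14}, 'ZBMDX4'),
}

def new_resource(hour, minute, resource):
    """Return the replacement resource, or the original if no rule applies."""
    hours, target = RULES.get(minute, (set(), None))
    if resource in DIAG and hour in hours:
        return target
    return resource

print(new_resource(13, 15, 'BDIAG2'))  # ZBMDX3
print(new_resource(12, 15, 'BDIAG2'))  # BDIAG2 (hour 12 is not in the 15-minute rule)
```

Keeping the hours in one table makes it easier to compare the code against the written criteria than four long or-chains.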

    Since there is a lot to unpack here, I have written a skeleton of what I am talking about:

    Solution 1

    def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
        for i in range(10):
            minutes = str(start_appointment[i])[10:12]
            hour = str(start_appointment[i])[8:10]
            if condition_1:
                df.loc[i, 'resource'] = 'ZBMDX3'
            elif condition_2:
                df.loc[i, 'resource'] = 'ZBMDX2'
            elif condition_3:
                df.loc[i, 'resource'] = 'ZBMDX1'
            elif condition_4:
                df.loc[i, 'resource'] = 'ZBMDX4'        
        return(df)
    
    df2 = diagnostic(df)
    

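    Each condition stands for your own logic (e.g. condition_1 = (minutes == '15') and hour in ['09', '10', '11'])

    Solution 2

    Another approach is to create a function that changes each row according to some logic, and then apply it to your DataFrame. Something like the following: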

    def update_val(row):
        minutes = str(row['start appointment'])[10:12]
        hour = str(row['start appointment'])[8:10]
        resource = row['resource']
        cond1, cond2, cond3, cond4 = True, False, False, False
        if cond1:
            row['resource'] = 'ZBMDX3'
        elif cond2:
            row['resource'] = 'ZBMDX2'
        elif cond3:
            row['resource'] = 'ZBMDX1'
        elif cond4:
            row['resource'] = 'ZBMDX4'
        return row
    
    df2 = df.apply(update_val, axis='columns')
    

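    Obviously you would put your own conditional logic where I used the dummy conditions cond1 etc.

    I prefer solution 2 because it is cleaner and makes the changes easier to track. It is also usually more performant (though I have not verified that in this particular case).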

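    【Comments】:

    • For the row argument, can I pass a range from 1 to 100? def update_val(1,100): Also, can you explain why you have cond1, cond2, cond3, cond4 = True, False, False, False? Do I set the IF and ELIFs equal to True, False, False, False in order?
    • The function update_val updates every row of the DataFrame it is applied to. If you want to apply it to the first 100 rows, use df100 = df.iloc[:100].apply(update_val, axis='columns'). This returns a DataFrame with only the first 100 rows. If you want the original DataFrame with only the first 100 rows modified, use pd.concat([df100, df.iloc[100:]], ignore_index=True)
    • I just used dummy values for the conditions (True, False, etc.) because I didn't want to type out all the conditions :) The aim was to show the structure of the solution without cluttering it with the detailed logic.
    • As I suspected. Just wanted to be sure. Thanks
    • OK, I think I know why your DataFrame isn't changing. You are indexing the 'start appointment' value incorrectly. If I'm not mistaken, from the sample data you provided it should be minutes = str(row['start appointment'])[10:12] and hour = str(row['start appointment'])[8:10]. Also, the hour is always 2 characters, so 9 should be 09, etc. Try the following: for one of the cases, just use if True:. Then see whether the field is updated correctly in the output. If it is, you will know the problem lies in your conditional logic; try using parentheses to separate the sub-conditions.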
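The first-N-rows approach from the comments above can be sketched on a tiny frame. The five-row frame, the dummy rule inside update_val and n = 2 are placeholders chosen for illustration; the last lines show the zero-padding point (9 becomes '09') using str.zfill, as in Solution 2's time_range helper.

```python
import pandas as pd

df = pd.DataFrame({'resource': ['BDIAG1'] * 5})

def update_val(row):
    row = row.copy()              # avoid mutating the row object passed in
    row['resource'] = 'ZBMDX4'    # dummy rule standing in for the real logic
    return row

n = 2  # the comment uses 100; 2 keeps the example small
head = df.iloc[:n].apply(update_val, axis='columns')

# Stitch the modified head back onto the untouched remainder.
combined = pd.concat([head, df.iloc[n:]], ignore_index=True)
print(combined['resource'].tolist())

# Hours compared as strings need zero padding to match sliced text.
padded_hours = [str(h).zfill(2) for h in (8, 9, 10)]
print(padded_hours)  # ['08', '09', '10']
```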