Recursively update the dataframe: answers

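【Question title】: Recursively update the dataframe
【Posted】: 2021-07-23 10:56:19
【Question】:

I have a dataframe named datafe, and I want to join the words in it that were split by a trailing hyphen.

For example, the input dataframe looks like this: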

,author_ex
0,Marios
1,Christodoulou
2,Intro-
3,duction
4,Simone
5,Speziale
6,Exper-
7,iment

The output dataframe should look like this:

,author_ex
0,Marios
1,Christodoulou
2,Introduction
3,Simone
4,Speziale
5,Experiment

I have written some sample code to do this, but I cannot exit the recursion safely.

def rm_actual(datafe, index):
    stem1 = datafe.iloc[index]['author_ex']
    stem2 = datafe.iloc[index + 1]['author_ex']
    fixed_token = stem1[:-1] + stem2
    datafe.drop(index=index + 1, inplace=True, axis=0)
    newdf=datafe.reset_index(drop=True)
    newdf.iloc[index]['author_ex'] = fixed_token
    return newdf

def remove_hyphens(datafe):
    for index, row in datafe.iterrows():
        flag = False
        token=row['author_ex']
        if token[-1:] == '-':
            datafe=rm_actual(datafe, index)
            flag=True
            break
    if flag==True:
        datafe=remove_hyphens(datafe)
    if flag==False:
        return datafe

datafe=remove_hyphens(datafe)
print(datafe)

Is there a way to get the expected output from this recursion?

【Question discussion】:

Can two or more hyphenated entries appear in consecutive rows?

Tags: pandas dataframe recursion data-processing
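
On the recursion itself: as posted, remove_hyphens assigns the result of the recursive call to a local variable and then falls off the end of the function, so every call that found a hyphen returns None. The chained assignment newdf.iloc[index]['author_ex'] = ... in rm_actual may also write to a temporary copy rather than to newdf. Below is a minimal sketch of one way to repair both, keeping the function names from the question. It assumes every value in 'author_ex' is a string, and it leaves a hyphen on the very last row untouched because there is nothing to join it with.

```python
import pandas as pd


def rm_actual(datafe, index):
    # reset_index returns a new frame, so the caller's frame is not mutated.
    newdf = datafe.reset_index(drop=True)
    stem1 = newdf.at[index, 'author_ex']
    stem2 = newdf.at[index + 1, 'author_ex']
    # .at writes into newdf directly (no chained indexing).
    newdf.at[index, 'author_ex'] = stem1[:-1] + stem2
    # Drop the row that held the second half and renumber.
    return newdf.drop(index=index + 1).reset_index(drop=True)


def remove_hyphens(datafe):
    datafe = datafe.reset_index(drop=True)
    for index, token in enumerate(datafe['author_ex']):
        if token.endswith('-') and index + 1 < len(datafe):
            # Recursive case: merge one pair and RETURN the recursive result.
            return remove_hyphens(rm_actual(datafe, index))
    # Base case: no joinable hyphen is left.
    return datafe


datafe = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-', 'duction',
                                     'Simone', 'Speziale', 'Exper-', 'iment']})
result = remove_hyphens(datafe)
print(result)
```

This recurses once per merged pair, so a frame with many hyphenated rows could reach Python's recursion limit; a plain loop or a vectorized approach avoids that.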

【Solution 1】:

Another option:

Given/input:

       author_ex
0         Marios
1  Christodoulou
2         Intro-
3        duction
4         Simone
5       Speziale
6         Exper-
7          iment

Code:

import pandas as pd

# read/open file or create dataframe
df = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-', 'duction',
                                 'Simone', 'Speziale', 'Exper-', 'iment']})

# check input format
print(df)

# create new column 'Ending': True if the PREVIOUS row's 'author_ex' ends with '-',
# i.e. this row holds the second half of a hyphenated word
df['Ending'] = df['author_ex'].shift(1).str.contains('-$', na=False, regex=True)

# remove the trailing '-' from the 'author_ex' column
df['author_ex'] = df['author_ex'].str.replace('-$', '', regex=True)

# create new column with values of 'author_ex' and shifted 'author_ex' concatenated together
df['author_ex_combined'] = df['author_ex'] + df.shift(-1)['author_ex']

# create a boolean mask shifted up one row (True on rows holding the first half);
# fill_value=False covers the last row, which has no successor
index = df['Ending'].shift(-1, fill_value=False)

# replace 'author_ex' with 'author_ex_combined' based on true/false of index series
df.loc[index,'author_ex'] = df['author_ex_combined']

# remove rows that hold the 2nd part of a hyphenated string and are no longer required,
# then remove the extra columns (drop() returns a new frame, so no SettingWithCopyWarning)
df = df[~df['Ending']].drop(columns=['Ending', 'author_ex_combined'])

# output final dataframe
print('\n\n')
print(df)

# notice index 3 and 7 are missing (use df.reset_index(drop=True) to renumber)

Output:

       author_ex
0         Marios
1  Christodoulou
2   Introduction
4         Simone
5       Speziale
6     Experiment
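
If hyphenated fragments can span more than two consecutive rows, or if a renumbered index is wanted as in the question's expected output, a grouping variant may be easier to reason about. The following is only a sketch under stated assumptions: the helper name join_hyphenated is hypothetical, every value is assumed to be a string, and a trailing hyphen on the very last row is stripped even though nothing follows it.

```python
import pandas as pd


def join_hyphenated(df, col='author_ex'):
    """Hypothetical helper: merge rows split by a trailing '-' into one row."""
    # True where the previous row ends with '-', i.e. this row continues a word.
    continues = df[col].str.endswith('-').shift(1, fill_value=False)
    # Every row that does not continue a word starts a new group.
    group_id = (~continues).cumsum()
    merged = (
        df[col]
        .str.replace('-$', '', regex=True)  # strip the trailing hyphen
        .groupby(group_id)
        .agg(''.join)                       # concatenate the fragments of each group
        .reset_index(drop=True)             # renumber rows from 0
    )
    return merged.to_frame(col)


df = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-', 'duction',
                                 'Simone', 'Speziale', 'Exper-', 'iment']})
merged = join_hyphenated(df)
print(merged)
```

Because fragments are grouped rather than merged pairwise, a run such as 'a-', 'b-', 'c' collapses to a single row in one pass, with no recursion or explicit loop.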

【Discussion】: