【Question title】: Splitting up substrings by comma and separate rows by each substring
【Posted】: 2020-12-15 08:59:04
【Question】:

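I'd like to ask for your help here. I have a dataframe with a "Tag" column that holds several comma-separated substrings. I want to split that column on the commas and duplicate each row once per substring. A sample of the operation is below.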

Sample df

   A          B          C          D          E                   Tag
A mug      computer    stack      code       phone        labor relation, m&a, h&s
google     virjoy      plant      ivan       wrong          business, environment
gazette   nowhere     conquer    jermo       chris             business ethics
spray      hilda      square     walk      nonsense        m&a, hiring and expansion
Florence    plug     nihilist    font       hello     h&s, wages and hours, product recall

Output df

   A        B         C       D       E                   Tag                              New Tag
A mug    computer   stack    code   phone       labor relation, m&a, h&s               labor relation
A mug    computer   stack    code   phone       labor relation, m&a, h&s                     m&a
A mug    computer   stack    code   phone       labor relation, m&a, h&s                     h&s
google    virjoy    plant    ivan   wrong        business, environment                     business
google    virjoy    plant    ivan   wrong        business, environment                    environment
gazette  nowhere   conquer   jermo  chris          business ethics                     business ethics             
spray     hilda    square    walk  nonsense     m&a, hiring and expansion                    m&a
spray     hilda    square    walk  nonsense     m&a, hiring and expansion            hiring and expansion
Florence  plug    nihilist   font   hello    h&s, wages and hours, product recall            h&s
Florence  plug    nihilist   font   hello    h&s, wages and hours, product recall     wages and hours
Florence  plug    nihilist   font   hello    h&s, wages and hours, product recall    product recall

I'm thinking of splitting on ',' and then maybe doing something like melt? Any help would be greatly appreciated. Many thanks in advance!

【Question comments】:

    Tags: python pandas split


    【Solution 1】:

    A function that produces the desired data format

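    import numpy as np
    import pandas as pd

    def transform(df, col_list, fill_value='', preserve_index=False):
        # Accept a single column name as well as a list of names
        if (col_list is not None and len(col_list) > 0 and not isinstance(col_list, (list, tuple, np.ndarray, pd.Series))):
            col_list = [col_list]

        # Columns that are repeated rather than unpacked
        v2_cols = df.columns.difference(col_list)
        # Number of list elements in each row of the first list column
        lens = df[col_list[0]].str.len()
        # Repeat each index label once per list element
        v2 = np.repeat(df.index.values, lens)

        final = (pd.DataFrame({
                    col: np.repeat(df[col].values, lens)
                    for col in v2_cols},
                    index=v2)
                 .assign(**{col: np.concatenate(df.loc[lens>0, col].values)
                            for col in col_list}))

        # Rows whose list is empty are dropped above, so add them back
        if (lens == 0).any():
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            final = (pd.concat([final, df.loc[lens==0, v2_cols]], sort=False).fillna(fill_value))

        final = final.sort_index()

        if not preserve_index:
            final = final.reset_index(drop=True)
        return final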
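The core of the helper is the pairing of `np.repeat` (for the ordinary columns) with `np.concatenate` (for the list column). A minimal sketch of just that step follows, on a made-up two-row frame; the `toy` frame and the variable names are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration): one scalar column, one list column
toy = pd.DataFrame({"A": ["x", "y"], "Tag": [["a", "b", "c"], ["d"]]})

lens = toy["Tag"].str.len()                   # list length per row
repeated = np.repeat(toy["A"].values, lens)   # each scalar once per list element
flat = np.concatenate(toy["Tag"].values)      # all list elements in one flat array

# Both arrays now have one entry per list element, so they line up row by row
long_df = pd.DataFrame({"A": repeated, "Tag": flat})
print(long_df)
```

Because both arrays are built from the same `lens`, row alignment is preserved without any join.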
    

    Call the function like this

    # Split on a comma plus any following spaces so the tags carry no leading blanks
    transform(df.assign(Tag=df.Tag.str.split(r',\s*')), 'Tag')
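As an alternative sketch, assuming pandas 0.25 or newer, the built-in `DataFrame.explode` can do the unpacking without a custom helper. The example below keeps the original `Tag` column and writes the pieces to `New Tag`, as in the requested output. It uses only the first two rows and three of the columns from the sample df to stay short.

```python
import pandas as pd

# First two rows of the sample df from the question (columns trimmed for brevity)
df = pd.DataFrame({
    "A": ["A mug", "google"],
    "B": ["computer", "virjoy"],
    "Tag": ["labor relation, m&a, h&s", "business, environment"],
})

out = (
    df.assign(**{"New Tag": df["Tag"].str.split(",")})  # list of substrings per row
      .explode("New Tag")                               # one row per substring
      .reset_index(drop=True)
)
out["New Tag"] = out["New Tag"].str.strip()             # drop spaces left after commas
print(out)
```

Stripping after the explode keeps the split pattern a plain comma; splitting on a comma plus whitespace up front would work as well.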
    

    【Comments】:
