【Question Title】: Python: Merge several columns of a dataframe without duplicating data
【Posted】: 2021-02-23 20:48:28
【Question】:

Suppose I have this dataframe:

Name = ['Lolo', 'Mike', 'Tobias','Luke','Sam']
Age = [19, 34, 13, 45, 52]
Info_1 = ['Tall', 'Large', 'Small', 'Small','']
Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
Data = [123,268,76,909,87]
Sex = ['F', 'M', 'M','M','M']

df = pd.DataFrame({'Name' : Name, 'Age' : Age, 'Info_1' : Info_1, 'Info_2' : Info_2, 'Info_3' : Info_3, 'Data' : Data, 'Sex' : Sex})

print(df)

     Name  Age Info_1    Info_2  Info_3  Data Sex
0    Lolo   19   Tall  New York    Tall   123   F
1    Mike   34  Large     Paris   Paris   268   M
2  Tobias   13  Small    Lisbon      Hi    76   M
3    Luke   45  Small             Small   909   M
4     Sam   52           Berlin  Thanks    87   M

I would like to merge the data of four columns of this dataframe: Info_1, Info_2, Info_3 and Data. I want to merge them without duplicating values within a row. That means that for row "0" I don't want "Tall" to appear twice. So in the end I would like to get something like this:

     Name  Age                Info Sex
0    Lolo   19   Tall New York 123   F
1    Mike   34     Large Paris 268   M
2  Tobias   13  Small Lisbon Hi 76   M
3    Luke   45           Small 909   M
4     Sam   52    Berlin Thanks 87   M

I tried this to merge the data:

di['period'] = df[['Info_1', 'Info_2', 'Info_3' 'Data']].agg('-'.join, axis=1)

But I get an error because it expects strings. How can I merge in the values of the "Data" column? And how can I check that I am not creating duplicates?

Thanks

【Question Comments】:
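A note on the attempt in the question: there are actually two separate problems there. The list `['Info_1', 'Info_2', 'Info_3' 'Data']` is missing a comma, so Python's adjacent-string-literal concatenation turns the last entry into `'Info_3Data'` (a KeyError on column selection), and even with the comma fixed, `str.join` raises a TypeError on the int-typed Data column. A quick illustration of both:

```python
# Adjacent string literals concatenate: 'Info_3' 'Data' selects a
# column named 'Info_3Data', which does not exist (KeyError).
assert 'Info_3' 'Data' == 'Info_3Data'

# And str.join only accepts strings, so an int value fails:
try:
    '-'.join(['Hi', 76])
except TypeError as e:
    print(e)  # sequence item 1: expected str instance, int found
```

Both answers below address the int problem by casting Data to str first.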

    Tags: python pandas dataframe merge duplicates


    【Solution 1】:

    Your Data column appears to be of type int. Convert it to a string first:

    df['Data'] = df['Data'].astype(str)
    df['period'] = (df[['Info_1','Info_2','Info_3','Data']]
                       .apply(lambda x: ' '.join(x[x!=''].unique()), axis=1)
                   )
    

    Output:

         Name  Age Info_1    Info_2  Info_3 Data Sex              period
    0    Lolo   19   Tall  New York    Tall  123   F   Tall New York 123
    1    Mike   34  Large     Paris   Paris  268   M     Large Paris 268
    2  Tobias   13  Small    Lisbon      Hi   76   M  Small Lisbon Hi 76
    3    Luke   45  Small             Small  909   M           Small 909
    4     Sam   52           Berlin  Thanks   87   M    Berlin Thanks 87
    
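The question's desired output keeps only the Name, Age, Info and Sex columns. A self-contained sketch of the same approach that also drops the merged source columns at the end (the final `result` selection is an addition here, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Lolo', 'Mike', 'Tobias', 'Luke', 'Sam'],
    'Age': [19, 34, 13, 45, 52],
    'Info_1': ['Tall', 'Large', 'Small', 'Small', ''],
    'Info_2': ['New York', 'Paris', 'Lisbon', '', 'Berlin'],
    'Info_3': ['Tall', 'Paris', 'Hi', 'Small', 'Thanks'],
    'Data': [123, 268, 76, 909, 87],
    'Sex': ['F', 'M', 'M', 'M', 'M'],
})

# Cast Data to str so it can be joined with the text columns, then
# per row: drop empty strings, keep the first occurrence of each value
df['Data'] = df['Data'].astype(str)
df['Info'] = (df[['Info_1', 'Info_2', 'Info_3', 'Data']]
                .apply(lambda x: ' '.join(x[x != ''].unique()), axis=1))

# Keep only the columns from the desired output
result = df[['Name', 'Age', 'Info', 'Sex']]
print(result)
```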

    【Comments】:

      【Solution 2】:

      I think the easiest way is probably to first concatenate all the required fields with spaces in between:

      df['Info'] = df.Info_1 + ' ' +  df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
      

      Then you can write a function that removes duplicate words from the string, like this:

      def remove_dup_words(s):
          words = s.split(' ')
          unique_words = pd.Series(words).drop_duplicates().tolist()
          return ' '.join(unique_words)
      

      And apply that function to the Info field:

      df['Info'] = df.Info.apply(remove_dup_words)
      

      Putting all the code together:

      import pandas as pd
      
      def remove_dup_words(s):
          words = s.split(' ')
          unique_words = pd.Series(words).drop_duplicates().tolist()
          return ' '.join(unique_words)
      
      Name = ['Lolo', 'Mike', 'Tobias','Luke','Sam']
      Age = [19, 34, 13, 45, 52]
      Info_1 = ['Tall', 'Large', 'Small', 'Small','']
      Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
      Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
      Data = [123,268,76,909,87]
      Sex = ['F', 'M', 'M','M','M']
      
      df = pd.DataFrame({'Name' : Name, 'Age' : Age, 'Info_1' : Info_1, 'Info_2' : Info_2, 'Info_3' : Info_3, 'Data' : Data, 'Sex' : Sex})
      
      df['Info'] = df.Info_1 + ' ' +  df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
      df['Info'] = df.Info.apply(remove_dup_words)
      
      print(df)
      
           Name  Age Info_1    Info_2  Info_3  Data Sex                Info
      0    Lolo   19   Tall  New York    Tall   123   F   Tall New York 123
      1    Mike   34  Large     Paris   Paris   268   M     Large Paris 268
      2  Tobias   13  Small    Lisbon      Hi    76   M  Small Lisbon Hi 76
      3    Luke   45  Small             Small   909   M          Small  909
      4     Sam   52           Berlin  Thanks    87   M    Berlin Thanks 87
      
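One caveat: because Luke's Info_2 is empty, the concatenated string contains two consecutive spaces, and `split(' ')` keeps the resulting empty token, which is why row 3 above prints as "Small  909" with a double space. A small variant that avoids this (a sketch using the stdlib's order-preserving `dict.fromkeys` in place of `pd.Series.drop_duplicates`):

```python
def remove_dup_words(s):
    # split() with no argument discards empty tokens, so blank fields
    # don't leave double spaces; dict.fromkeys keeps first-seen order
    return ' '.join(dict.fromkeys(s.split()))

print(remove_dup_words('Small  Small 909'))  # Small 909
```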

      【Comments】:
