【Question title】: Keep unique words in a pandas DataFrame row
【Posted】: 2017-09-12 17:54:25
【Question】:

DataFrame:

> df
>type(df)
pandas.core.frame.DataFrame

ID      Property Type                                Amenities
1952043 Apartment, Villa, Apartment                  Park, Jogging Track, Park
1918916 Bungalow, Cottage House, Cottage, Bungalow   Garden, Play Ground

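How can I keep only the unique comma-separated words in each row of the DataFrame? "Cottage House" and "Cottage" must not be treated as the same item, and the check has to cover every column of the DataFrame. Desired output: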

    ID      Property Type                      Amenities
    1952043 Apartment, Villa                   Park, Jogging Track
    1918916 Bungalow, Cottage House, Cottage   Garden, Play Ground

【Comments】:

    Tags: python-3.x pandas set pattern-matching unique


    【Solution 1】:

    First, I create a function that does what you want for a single string. Second, I apply this function to every string in a column.

    import pandas as pd
    
    df = pd.DataFrame([['Apartment, Villa, Apartment',
                        'Park, Jogging Track, Park'],
                       ['Bungalow, Cottage House, Cottage, Bungalow',
                        'Garden, Play Ground']],
                      columns=['Property Type', 'Amenities'])
    
    def drop_duplicates(cell):
        # Split the string on ', ', drop repeated items, and join back.
        # dict.fromkeys keeps the first occurrence of each item in its original order.
        words = cell.split(', ')
        return ', '.join(dict.fromkeys(words))
    
    # drop_duplicates is applied to every cell of each listed column.
    df['Property Type'] = df['Property Type'].apply(drop_duplicates)
    df['Amenities'] = df['Amenities'].apply(drop_duplicates)
    print(df)
    

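    【Comments】:

    • Thanks for the solution, but I want this code to remove duplicates from every column of the DataFrame. I have edited my question; could you take a look? :)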
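
The request in the comment above, covering every column rather than two named ones, could be sketched roughly as follows. The helper name `dedupe_cell` and the sample frame are assumptions made for illustration (the frame mirrors the data in the question); non-string cells such as the numeric ID are passed through unchanged.

```python
import pandas as pd

def dedupe_cell(value):
    """Drop repeated comma-separated items, keeping first-seen order."""
    if not isinstance(value, str):
        return value  # leave numbers, NaN, etc. untouched
    items = [item.strip() for item in value.split(',')]
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return ', '.join(dict.fromkeys(items))

# Sample data mirroring the question (hypothetical reconstruction).
df = pd.DataFrame({
    'ID': [1952043, 1918916],
    'Property Type': ['Apartment, Villa, Apartment',
                      'Bungalow, Cottage House, Cottage, Bungalow'],
    'Amenities': ['Park, Jogging Track, Park', 'Garden, Play Ground'],
})

# DataFrame.apply hands over each column as a Series;
# Series.map then runs the helper on every cell of that column.
result = df.apply(lambda column: column.map(dedupe_cell))
print(result)
```

Because items are compared as whole strings after splitting on the comma, "Cottage House" and "Cottage" remain distinct entries.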
    【Solution 2】:

    Read the file into a pandas DataFrame:

    >>> import pandas as pd
    >>> df = pd.read_csv('test.txt', sep='\t')
    >>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
    0                    {Apartment, Villa}
    1    {Cottage, Bungalow, Cottage House}
    Name: Property Type, dtype: object
    

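    The main idea is:

    1. Iterate over each row.
    2. Split the string in the target column on ", ".
    3. Build a set() from the list returned by step 2 to get the unique items.

    Code: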

    >>> proptype_column = df['Property Type']
    >>> for row in proptype_column: # Step 1.
    ...     items_in_row = row.split(', ') # Step 2.
    ...     uniq_items_in_row = set(items_in_row) # Step 3.
    ...     print(uniq_items_in_row)
    ... 
    {'Apartment', 'Villa'}
    {'Cottage', 'Bungalow', 'Cottage House'}
    

    Now you can achieve the same thing with the column's apply() method:

    >>> import pandas as pd
    >>> df = pd.read_csv('test.txt', sep='\t')
    >>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
    0                    {Apartment, Villa}
    1    {Cottage, Bungalow, Cottage House}
    Name: Property Type, dtype: object
    >>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
    >>> df['Property Type (Unique)'] = proptype_uniq
    >>> df
          ID                               Property Type  \
    0  12345                 Apartment, Villa, Apartment   
    1  67890  Bungalow, Cottage House, Cottage, Bungalow   
    
                       Amenities              Property Type (Unique)  
    0  Park, Jogging Track, Park                  {Apartment, Villa}  
    1        Garden, Play Ground  {Cottage, Bungalow, Cottage House}
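
The new column above holds Python sets, which are unordered, while the question asks for comma-separated text. One possible way to turn the sets back into strings is sketched below; the sample frame is an assumption that mirrors the data shown above. `sorted()` makes the result deterministic but alphabetical, so if first-seen order matters, an order-preserving structure would be needed instead of `set`.

```python
import pandas as pd

# Sample data mirroring the column shown above (hypothetical reconstruction).
df = pd.DataFrame({
    'Property Type': ['Apartment, Villa, Apartment',
                      'Bungalow, Cottage House, Cottage, Bungalow'],
})

# Same step as above: one set of unique items per cell.
proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))

# sorted() fixes an alphabetical order before joining back into text.
df['Property Type (Unique)'] = proptype_uniq.apply(
    lambda items: ', '.join(sorted(items)))

print(df['Property Type (Unique)'].tolist())
# ['Apartment, Villa', 'Bungalow, Cottage, Cottage House']
```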
    
