【Question title】: Pandas: Wide to first, second, third, identified categories
【Posted】: 2020-11-30 21:08:50
【Question description】:

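I'm wondering if anyone knows a quick way to pivot a dataframe in pandas to achieve the transformation shown below. It is a kind of wide-to-long pivot, but not quite.

Input dataframe structure (needs to support N categories, not just the 3 shown below):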

+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| id   | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0001 | 1            | 1        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0002 | 0            | 0        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0003 | 1            | 0        | 0        | 1         | 1            | 0        | 0        | 1         | 0            | 0        | 0        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0004 | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 0        | 0        | 1         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0005 | 0            | 0        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+

Output transformed dataframe structure (needs to support N categories, not just the 3 shown in the example):

+------+------+-------+------+-------+------+-------+
| id   | cat1 | sent1 | cat2 | sent2 | cat3 | sent3 |
+------+------+-------+------+-------+------+-------+
| 0001 | catA | pos   | catC | neg   | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0002 | catB | pos   | catC | pos   | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0003 | catA | ntrl  | catB | ntrl  | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0004 | catA | pos   | catB | pos   | catC | ntrl  |
+------+------+-------+------+-------+------+-------+
| 0005 | catC | neg   | NULL | NULL  | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
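Read row by row, the intended rule appears to be: walk the categories in column order, keep those whose `_present` flag is 1, pair each with whichever sentiment flag is set, and pad the remainder with nulls. A minimal plain-Python sketch of that reading, where `rank_categories` and `SENTIMENTS` are hypothetical names introduced only for illustration:

```python
from typing import Dict, List, Optional

SENTIMENTS = ("pos", "neg", "ntrl")  # suffixes taken from the column names above


def rank_categories(row: Dict[str, int], categories: List[str]) -> List[Optional[str]]:
    """Return [cat1, sent1, cat2, sent2, ...] for one input row, padded with None."""
    pairs: List[Optional[str]] = []
    for cat in categories:
        if row.get(f"{cat}_present") != 1:
            continue  # category absent: it contributes nothing to the output
        # First sentiment flag that is set, or None if no flag is set
        sent = next((s for s in SENTIMENTS if row.get(f"{cat}_{s}") == 1), None)
        pairs += [cat, sent]
    # Pad so that every row yields the same number of output columns
    return pairs + [None] * (2 * len(categories) - len(pairs))


# Row 0001 from the input table
row_0001 = {
    "catA_present": 1, "catA_pos": 1, "catA_neg": 0, "catA_ntrl": 0,
    "catB_present": 0, "catB_pos": 0, "catB_neg": 0, "catB_ntrl": 0,
    "catC_present": 1, "catC_pos": 0, "catC_neg": 1, "catC_ntrl": 0,
}
print(rank_categories(row_0001, ["catA", "catB", "catC"]))
# ['catA', 'pos', 'catC', 'neg', None, None]
```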

【Question discussion】:

    Tags: python pandas pivot transform melt


    【Solution 1】:

    I don't think this is really a pivot at all. But anything is possible, so here we go:

    import io
    import itertools
    from typing import Optional
    
    import pandas
    
    # Your data
    data = io.StringIO(
    """
    id   | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl
    0001 | 1            | 1        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0
    0002 | 0            | 0        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0
    0003 | 1            | 0        | 0        | 1         | 1            | 0        | 0        | 1         | 0            | 0        | 0        | 0
    0004 | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 0        | 0        | 1
    0005 | 0            | 0        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0
    """
    )
    # The separator is a regular expression, so pass a raw string and use the Python parsing engine
    df = pandas.read_table(data, sep=r"\s*\|\s*", engine="python")
    
    
    def get_sentiment(row: pandas.Series) -> Optional[str]:
        if row["cat_pos"] == 1:
            return "pos"
        elif row["cat_neg"] == 1:
            return "neg"
        elif row["cat_ntrl"] == 1:
            return "ntrl"
        else:
            return None
    
    
    # Initialize a dict that will hold an entry for every index in the dataframe, with a list of categories and sentiments
    categories_per_index = {index: [] for index in df.index}
    
    # Extract a sorted list of the unique category names (sorting keeps the output order deterministic)
    categories = sorted(column[: -len("_present")] for column in df.columns if column.endswith("_present"))
    
    # Loop over the unique categories
    for category in categories:
    
        # Select only the columns for a particular category, and where that category is present
        group = df.loc[df[f"{category}_present"] == 1, [f"{category}_present", f"{category}_pos", f"{category}_neg", f"{category}_ntrl"]]
    
        # Change the column names for generic processing
        group.columns = ["cat_present", "cat_pos", "cat_neg", "cat_ntrl"]
    
        # Loop the rows in the group and add the sentiment for this category to the indices
        for index, row in group.iterrows():
    
            # Add the name of the category and the sentiment to the index
            categories_per_index[index].append(category)
            categories_per_index[index].append(get_sentiment(row))
    
    
    # Pad every list to the full width so rows with fewer categories line up with the column names
    width = 2 * len(categories)
    padded = {index: values + [None] * (width - len(values)) for index, values in categories_per_index.items()}
    
    # Reconstruct the dataframe from the dictionary, numbering the column pairs from 1
    columns = list(itertools.chain.from_iterable([f"cat{i}", f"sent{i}"] for i in range(1, len(categories) + 1)))
    df = pandas.DataFrame.from_dict(padded, orient="index", columns=columns)
    

    Output:

    print(df)
       cat1 sent1  cat2 sent2  cat3 sent3
    0  catA   pos  catC   neg  None  None
    1  catB   pos  catC   pos  None  None
    2  catA  ntrl  catB  ntrl  None  None
    3  catA   pos  catB   pos  catC  ntrl
    4  catC   neg  None  None  None  None
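
For larger frames, the per-row loop can probably be replaced by a `melt`, a per-id counter, and a `pivot`. The following is only a sketch under stated assumptions, not a drop-in replacement: `wide_to_ranked` is a hypothetical helper name, ids are assumed unique, every non-id column is assumed to follow the `<category>_<suffix>` pattern, and each present category is assumed to have exactly one sentiment flag set (as in the sample data), so the `_present` columns are used only to count the categories.

```python
import pandas as pd


def wide_to_ranked(df: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Hypothetical helper: one row per id with cat1/sent1, cat2/sent2, ... columns."""
    # Long format: one row per (id, "<category>_<suffix>") pair, in column order
    long = df.melt(id_vars=id_col, var_name="column", value_name="flag")
    # Split "catA_pos" into "catA" and "pos" at the last underscore
    parts = long["column"].str.rsplit("_", n=1, expand=True)
    long = long.assign(cat=parts[0], sent=parts[1])
    # Keep the sentiment flags that are set (assumption: exactly one per present category)
    hits = long[(long["flag"] == 1) & (long["sent"] != "present")]
    # Number the categories within each id in order of appearance: 1, 2, 3, ...
    hits = hits.assign(position=hits.groupby(id_col).cumcount() + 1)
    # Spread categories and sentiments into position-numbered columns
    cats = hits.pivot(index=id_col, columns="position", values="cat").add_prefix("cat")
    sents = hits.pivot(index=id_col, columns="position", values="sent").add_prefix("sent")
    # Interleave cat1, sent1, cat2, sent2, ... for all N categories
    n_categories = sum(col.endswith("_present") for col in df.columns)
    order = [f"{name}{i}" for i in range(1, n_categories + 1) for name in ("cat", "sent")]
    wide = pd.concat([cats, sents], axis=1).reindex(columns=order)
    # Attach to the original ids so that rows without any category are kept
    return df[[id_col]].join(wide, on=id_col)


# Sample data from the question, with ids kept as strings
sample = pd.DataFrame({
    "id": ["0001", "0002", "0003", "0004", "0005"],
    "catA_present": [1, 0, 1, 1, 0], "catA_pos": [1, 0, 0, 1, 0],
    "catA_neg": [0, 0, 0, 0, 0], "catA_ntrl": [0, 0, 1, 0, 0],
    "catB_present": [0, 1, 1, 1, 0], "catB_pos": [0, 1, 0, 1, 0],
    "catB_neg": [0, 0, 0, 0, 0], "catB_ntrl": [0, 0, 1, 0, 0],
    "catC_present": [1, 1, 0, 1, 1], "catC_pos": [0, 1, 0, 0, 0],
    "catC_neg": [1, 0, 0, 0, 1], "catC_ntrl": [0, 0, 0, 1, 0],
})
result = wide_to_ranked(sample)
print(result)
```

Unlike the loop, this version keeps the `id` column and fills unused slots with missing values rather than `None`.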
    
