【Title】: Count the number of likert scale results from multiple column questions in pandas
【Posted】: 2017-11-08 02:22:59
【Question】:

I have the following dataframe:

       Question1        Question2         Question3          Question4
User1  Agree            Agree          Disagree         Strongly Disagree
User2  Disagree         Agree          Agree            Disagree
User3  Agree            Agree          Agree            Agree

Is there a way to transform the dataframe above into the one below?

              Agree         Disagree         Strongly Disagree
 Question1    2               1                  0
 Question2    3               0                  0
 Question3    2               1                  0
 Question4    1               1                  1

This is similar to my earlier question: Make a dataframe with grouped questions from three columns

I looked at previous questions using stack/pivot but couldn't figure it out. The actual dataframe has 20+ questions and a five-point Likert scale: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree.
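For reference, a minimal reconstruction of the sample frame above (the `User1`..`User3` index labels are taken from the table as shown):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
        "Question3": ["Disagree", "Agree", "Agree"],
        "Question4": ["Strongly Disagree", "Disagree", "Agree"],
    },
    index=["User1", "User2", "User3"],
)
print(df)
```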

【Comments】:

    Tags: python pandas numpy group-by pandas-groupby


    【Solution 1】:

    You can run pd.Series.value_counts over the columns. If you do it with apply, the indices are aligned automatically:

    df.apply(pd.Series.value_counts)
    Out: 
                       Question1  Question2  Question3  Question4
    Agree                    2.0        3.0        2.0          1
    Disagree                 1.0        NaN        1.0          1
    Strongly Disagree        NaN        NaN        NaN          1
    

    A bit of post-processing:

    df.apply(pd.Series.value_counts).fillna(0).astype('int')
    Out: 
                       Question1  Question2  Question3  Question4
    Agree                      2          3          2          1
    Disagree                   1          0          1          1
    Strongly Disagree          0          0          0          1
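Since the real data uses the full five-point scale, you may also want the result rows in Likert order, with scale points nobody chose shown as zero. A sketch using `reindex` (the `order` list is an assumption based on the scale named in the question):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
    }
)

# Assumed full scale, in display order
order = ["Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree"]
counts = (
    df.apply(pd.Series.value_counts)
      .reindex(order, fill_value=0)  # force Likert order; unused scale points become rows of 0
      .fillna(0)                     # answers missing from some columns
      .astype(int)
)
print(counts)
```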
    

    【Discussion】:

      【Solution 2】:
      df.apply(lambda x:x.value_counts()).fillna(0).astype(int)
      #                   Question1  Question2  Question3  Question4
      #Agree                      2          3          2          1
      #Disagree                   1          0          1          1
      #Strongly Disagree          0          0          0          1
      

      【Discussion】:

        【Solution 3】:

        pd.get_dummies

        pd.get_dummies(df.stack()).groupby(level=1).sum()
        
                   Agree  Disagree  Strongly Disagree
        Question1      2         1                  0
        Question2      3         0                  0
        Question3      2         1                  0
        Question4      1         1                  1
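The same stacked shape also feeds pd.crosstab directly; as a sketch, tabulating the question level of the stacked index against the answers gives the identical table:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
        "Question3": ["Disagree", "Agree", "Agree"],
        "Question4": ["Strongly Disagree", "Disagree", "Agree"],
    },
    index=["User1", "User2", "User3"],
)

s = df.stack()  # MultiIndex (user, question) -> answer
out = pd.crosstab(s.index.get_level_values(1), s)
print(out)
```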
        

        Taking it up a level:
        We can use numpy.bincount to speed things up, but we have to be careful with the dimensions.

        v = df.values
        f, u = pd.factorize(v.ravel())
        n, m = u.size, v.shape[1]
        r = np.tile(np.arange(m), v.shape[0])  # column index of every raveled element
        b0 = np.bincount(r * n + f)
        pad = np.zeros(n * m - b0.size, dtype=int)
        b = np.append(b0, pad)
        
        pd.DataFrame(b.reshape(m, n), df.columns, u)
        
                   Agree  Disagree  Strongly Disagree
        Question1      2         1                  0
        Question2      3         0                  0
        Question3      2         1                  0
        Question4      1         1                  1
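As a side note, the manual zero-padding above can be folded into bincount itself via its minlength argument; a sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
        "Question3": ["Disagree", "Agree", "Agree"],
        "Question4": ["Strongly Disagree", "Disagree", "Agree"],
    },
    index=["User1", "User2", "User3"],
)

v = df.values
f, u = pd.factorize(v.ravel())          # integer codes + unique answers
n, m = u.size, v.shape[1]
r = np.tile(np.arange(m), v.shape[0])   # column index of every raveled element
b = np.bincount(r * n + f, minlength=n * m)  # one bucket per (column, answer), no padding needed
out = pd.DataFrame(b.reshape(m, n), df.columns, u)
print(out)
```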
        

        Another numpy option

        v = df.values
        n, m = v.shape
        f, u = pd.factorize(v.ravel())
        
        pd.DataFrame(
            np.eye(u.size, dtype=int)[f].reshape(n, m, -1).sum(0),
            df.columns, u
        )
        
                   Agree  Disagree  Strongly Disagree
        Question1      2         1                  0
        Question2      3         0                  0
        Question3      2         1                  0
        Question4      1         1                  1
        

        Timing comparison

        %%timeit
        v = df.values
        f, u = pd.factorize(v.ravel())
        n, m = u.size, v.shape[1]
        r = np.tile(np.arange(m), v.shape[0])
        b0 = np.bincount(r * n + f)
        pad = np.zeros(n * m - b0.size, dtype=int)
        b = np.append(b0, pad)
        pd.DataFrame(b.reshape(m, n), df.columns, u)
        1000 loops, best of 3: 194 µs per loop
        
        %%timeit
        v = df.values
        n, m = v.shape
        f, u = pd.factorize(v.ravel())
        
        pd.DataFrame(
            np.eye(u.size, dtype=int)[f].reshape(n, m, -1).sum(0),
            df.columns, u
        )
        1000 loops, best of 3: 195 µs per loop
        
        %timeit pd.get_dummies(df.stack()).groupby(level=1).sum()
        1000 loops, best of 3: 1.2 ms per loop
        

        【Discussion】:

        • Thanks! This worked perfectly, and the "extra credit" part of my previous question helped me sort the columns.