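Merging crosstabs in Python

【Question title】: Merging crosstabs in Python
【Posted】: 2016-12-14 19:01:44
【Question description】:

I am trying to merge several crosstabs into a single one. Note that the data provided is obviously for testing purposes only. The actual data is much larger, so efficiency matters a lot to me.

The crosstabs are generated, collected in a list, and then merged on the word column with a lambda function. However, the result of this merge is not what I expect. I think the problem is that, even with dropna = False, crosstab columns that contain only NA values are dropped, which causes the merge function to fail. I will first show the code, then the intermediate data and the error.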

import pandas as pd
import numpy as np
import functools as ft

def main():
    # Create dataframe
    df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','det','source'])
    df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
    df["det"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
    df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')

    create_frequency_list(df)

def create_frequency_list(df):
    # Create a crosstab of ALL values
    # NOTE that dropna = False does not seem to work as expected
    total = pd.crosstab(df.word, df.det, dropna = False)
    total.fillna(0)
    total.reset_index(inplace=True)
    total.columns = ['word', 'a', 'the']

    crosstabs = [total]

    # For the column headers, multi-level
    first_index = [('total','total')]
    second_index = [('a','the')]

    # Create crosstabs per source (one for BE, one for NL, one for FR)
    # NOTE that dropna = False does not seem to work as expected
    for source, tempDf in df.groupby('source'):
        crosstab = pd.crosstab(tempDf.word, tempDf.det, dropna = False)
        crosstab.fillna(0)
        crosstab.reset_index(inplace=True)
        crosstab.columns = ['word', 'a', 'the']
        crosstabs.append(crosstab)

        first_index.extend((source,source))
        second_index.extend(('a','the'))

    # Just for debugging: result as expected
    for tab in crosstabs:
        print(tab)

    merged = ft.reduce(lambda left,right: pd.merge(left,right, on='word'), crosstabs).set_index('word')

    # UNEXPECTED RESULT
    print(merged)    

    arrays = [first_index, second_index]

    # Throws error: NotImplementedError: > 1 ndim Categorical are not supported at this time
    columns = pd.MultiIndex.from_arrays(arrays)

    df_freq = pd.DataFrame(data=merged.as_matrix(),
                      columns=columns,
                      index = crosstabs[0]['word'])
    print(df_freq)

main()

Individual crosstabs: not as expected. The NA columns are dropped.

       word  a  the
0    banana  2    1
1  elephant  1    2
2     mouse  2    2
3   ostrich  1    1

       word  a  the
0    banana  1    1
1  elephant  0    1

       word  a  the
0    banana  1    0
1  elephant  1    0
2     mouse  1    1
3   ostrich  1    1

       word  a  the
0  elephant  0    1
1     mouse  1    1

This means the dataframes do not all share the same values, which in turn probably breaks the merge.
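One way to read the tables above, offered as a sketch rather than a confirmed diagnosis: a crosstab built from a per-source subset can only contain the words that occur in that subset, so dropna = False has nothing to keep for the absent words. Assuming that is the cause, each crosstab can be reindexed against the full word list. The names all_words, all_dets and tabs are illustrative and not part of the original code, and the data is a shortened version of the test data.

```python
import pandas as pd

# Shortened version of the question's test data (two sources only).
df = pd.DataFrame({
    'word': ['banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant'],
    'det': ['a', 'the', 'the', 'a', 'the', 'the'],
    'source': ['BE', 'BE', 'BE', 'NL', 'NL', 'NL'],
})

# Every word and determiner that occurs anywhere in the data.
all_words = sorted(df['word'].unique())
all_dets = sorted(df['det'].unique())

tabs = {}
for source, sub in df.groupby('source'):
    # The crosstab only knows the words present in this subset;
    # reindex adds the missing rows (and columns) with a count of 0.
    tabs[source] = (pd.crosstab(sub['word'], sub['det'])
                      .reindex(index=all_words, columns=all_dets, fill_value=0))

for source, tab in tabs.items():
    print(source)
    print(tab)
```

With every table sharing the same rows and columns, the tables line up on word no matter how they are combined afterwards.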

Merged: clearly not as expected

          a_x  the_x  a_y  the_y  a_x  the_x  a_y  the_y
word                                                    
elephant    1      2    0      1    1      0    0      1
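The single elephant row is consistent with pd.merge defaulting to how='inner', which keeps only the keys present in every frame. The sketch below contrasts the default with how='outer' on two toy tables. The frames be and nl and the suffixes are made up for illustration.

```python
import pandas as pd

# Two toy frequency tables that share only the key 'elephant'.
be = pd.DataFrame({'word': ['banana', 'elephant'], 'a': [1, 0], 'the': [1, 1]})
nl = pd.DataFrame({'word': ['elephant', 'mouse'], 'a': [0, 1], 'the': [1, 1]})

# Default how='inner': only keys present in both frames survive.
inner = pd.merge(be, nl, on='word', suffixes=('_BE', '_NL'))

# how='outer': every key is kept; counts missing on one side become NaN,
# which are then replaced by 0 and cast back to integers.
outer = (pd.merge(be, nl, on='word', how='outer', suffixes=('_BE', '_NL'))
           .set_index('word')
           .fillna(0)
           .astype(int))

print(inner)
print(outer)
```

Explicit suffixes also avoid the repeated a_x / the_x column names seen in the merged output above.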

However, the error is only raised when the columns are assigned:

# NotImplementedError: > 1 ndim Categorical are not supported at this time
columns = pd.MultiIndex.from_arrays(arrays)

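As far as I can tell, the problem starts much earlier and is related to the NA values, which makes the whole thing fail. However, given my lack of experience with Python, I cannot be sure.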
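A separate, likely cause of the NotImplementedError (an assumption, not something confirmed in the thread): first_index and second_index each start as a list holding one tuple and are then extended with plain strings, so they mix a tuple with strings instead of holding one flat label per column. A minimal sketch of building the same two-level header from flat lists follows. The names sources, dets, first_level and second_level are illustrative.

```python
import pandas as pd

sources = ['total', 'BE', 'FR', 'NL']  # assumed order of the column blocks
dets = ['a', 'the']

# Flat, equal-length label lists: one entry per column.
first_level = [s for s in sources for _ in dets]
second_level = dets * len(sources)
columns = pd.MultiIndex.from_arrays([first_level, second_level],
                                    names=['source', 'det'])

# Equivalent shortcut: the Cartesian product of the two label lists.
same = pd.MultiIndex.from_product([sources, dets], names=['source', 'det'])

print(columns)
```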

What I expect is a multi-index output:

    source       total        BE          FR          NL
    det         a   the     a   the     a   the     a   the
    word
0   banana      2   1       1   1       1   0       0   0
1   elephant    1   2       0   1       1   0       0   1
2   mouse       2   2       0   0       1   1       1   1
3   ostrich     1   1       0   0       1   1       0   0

【Question comments】:

Tags: python pandas merge multi-index

【Solution 1】:

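I decided to simply give you a better way to get what you want:

I usually use df.groupby([col1, col2]).size().unstack() as a stand-in for pd.crosstab. You were trying to build one crosstab per source group. I can fit that neatly into my existing groupby with df.groupby([col1, col2, col3]).size().unstack([2, 1]).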
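To make the stand-in concrete, here is a small sketch that builds the same table both ways. The column names follow this answer's code (gender rather than det), the data is a shortened version of the test data, and unstack's fill_value=0 is used here in place of a later fillna(0).

```python
import pandas as pd

df = pd.DataFrame({
    'word': ['banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant'],
    'gender': ['a', 'the', 'the', 'a', 'the', 'the'],
})

# Frequency table via pd.crosstab.
via_crosstab = pd.crosstab(df['word'], df['gender'])

# Same counts via groupby: size() counts each (word, gender) pair,
# unstack() moves the last index level (gender) into the columns.
via_groupby = df.groupby(['word', 'gender']).size().unstack(fill_value=0)

print(via_crosstab)
print(via_groupby)
```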

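The trailing sort_index, fillna(0) and astype(int) calls are only there to tidy up the output.

If you want to understand this better, try the following and see what you get:

df.groupby(['word', 'gender']).size()
df.groupby(['word', 'gender', 'source']).size()

unstack and stack are convenient ways to move index levels into the columns and vice versa. unstack([2, 1]) specifies the order in which the index levels are unstacked.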
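The effect of the level order can be seen on a tiny example. This is a sketch on made-up data with the same three column names; wide and other are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'word': ['banana', 'banana', 'mouse', 'mouse'],
    'gender': ['a', 'the', 'a', 'the'],
    'source': ['BE', 'BE', 'NL', 'NL'],
})

# Index levels of the counts: 0 = word, 1 = gender, 2 = source.
counts = df.groupby(['word', 'gender', 'source']).size()

# unstack([2, 1]): source becomes the outer column level, gender the inner.
wide = counts.unstack([2, 1]).fillna(0).astype(int)

# unstack([1, 2]): the same counts, with gender outer and source inner.
other = counts.unstack([1, 2]).fillna(0).astype(int)

print(wide)
print(other)
```

Either way, word stays in the row index. Only the nesting of the column header changes.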

Finally, I take my xtabs again, stack it, sum across the rows, and unstack the result to get it ready for pd.concat. Voilà!

xtabs = df.groupby(df.columns.tolist()).size() \
          .unstack([2, 1]).sort_index(axis=1).fillna(0).astype(int)

pd.concat([xtabs.stack().sum(axis=1).rename('total').to_frame().unstack(), xtabs], axis=1)
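As a cross-check, the total block can also be computed without stack, by grouping the transposed table on the gender level. This is a swapped-in alternative, not the method used in this answer, and the sketch assumes the test data from the question with the det column named gender.

```python
import pandas as pd

df = pd.DataFrame({
    'word': ['banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant',
             'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich'],
    'gender': ['a', 'the', 'the', 'a', 'the', 'the',
               'a', 'the', 'a', 'a', 'a', 'the'],
    'source': ['BE', 'BE', 'BE', 'NL', 'NL', 'NL',
               'FR', 'FR', 'FR', 'FR', 'FR', 'FR'],
})

# Per-source counts: rows = word, columns = (source, gender).
xtabs = (df.groupby(['word', 'gender', 'source']).size()
           .unstack([2, 1]).sort_index(axis=1).fillna(0).astype(int))

# Totals per (word, gender): transpose, sum rows that share a gender,
# then transpose back so that word is the row index again.
total = xtabs.T.groupby(level='gender').sum().T

# Wrap the totals in a dict so concat adds 'total' as an outer column level,
# then put that block in front of the per-source blocks.
result = pd.concat([pd.concat({'total': total}, axis=1), xtabs], axis=1)

print(result)
```

The counts can be compared cell by cell with the expected table in the question.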

Your code should now look like this:

import pandas as pd
import numpy as np

def main():
    # Create dataframe
    df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','gender','source'])
    df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
    df["gender"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
    df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')

    return create_frequency_list(df)

def create_frequency_list(df):
    # Count each (word, gender, source) combination, then move source (level 2)
    # and gender (level 1) into the columns, with source as the outer level
    xtabs = df.groupby(df.columns.tolist()).size() \
              .unstack([2, 1]).sort_index(axis=1).fillna(0).astype(int)

    # Move gender back into the rows and sum over the sources
    # to get the totals per (word, gender)
    total = xtabs.stack().sum(axis=1)
    total.name = 'total'
    # Label the block 'total' and move gender into the columns again
    total = total.to_frame().unstack()

    # Put the total block in front of the per-source blocks
    return pd.concat([total, xtabs], axis=1)

main()

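【Comments】:

Thanks for your effort so far. Could you explain where I should put this? I tried deleting everything in create_frequency_list and replacing it with your code, but I get the error 'str' object is not callable on the last line of the code. Also, could you explain more thoroughly what is happening in your code? As I said, I am a beginner, but I really want to learn.
I copy-pasted your code and I still get the same error. Here is a break-down. Running Python 3.4.3.
@BramVanroy This approach is much faster than pd.crosstab(). Look at the metrics in this Q... yes, I know it is not a one-to-one comparison... stackoverflow.com/questions/38821985/…
That is indeed very fast. It works now! Could you add comments on what is actually happening, though? It is not urgent; I just want to know what I am doing instead of only copy-pasting.
@BramVanroy I added some comments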