How to count the number of words for each column that is in array structure in Pandas: answers

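【Question title】: How to count number of words for each column which is in array structure in Pandas
【Posted】: 2018-12-07 10:49:42
【Question description】:

My DataFrame has a string column whose sentences I have already split into words. Now I need to count the occurrences of each word and turn the words into columns. Basically, I want to build a document-term matrix.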

0                   [kubernetes, client, bootstrapping, ponda]
1                                                [micro, insu]
2                                                 [motor, upi]
3                                  [secure, app, installation]
4                    [health, insu, express, credit, customer]
5                                  [secure, app, installation]
6                                                 [aap, insta]
7                               [loan, house, loan, customers]

Output:

    kubernetes  client  bootstrapping   ponda   loan    customers   installation
0        1       1      1               1       0           0        0
1        0       0      0               0       1           0        1
2        0       2      0               0       0           0        0
3        1       1      1               1       0           0        0

Code so far:

 from sklearn.feature_extraction.text import CountVectorizer

 countvec = CountVectorizer()

 countvec.fit_transform(df.new)

Error:

AttributeError: 'list' object has no attribute 'lower'

【Question discussion】:

Tags: python-3.x pandas dataframe word-count

【Solution 1】:

If the values are lists, first join each list into a single string, then use CountVectorizer:

print (type(df.loc[0, 'new']))
<class 'list'>

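import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
# Join each list into one space-separated string, the input CountVectorizer expects
counts = countvec.fit_transform(df['new'].str.join(' '))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(counts.toarray(), columns=countvec.get_feature_names_out())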

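Another pandas-only solution, using get_dummies and a sum over the duplicated column names:

df1 = pd.DataFrame(df['new'].values.tolist())
# sum(axis=1, level=0) was removed in pandas 2.0; group the duplicated column labels instead
df = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).sum().T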

print (df)

   aap  app  bootstrapping  client  credit  customer  customers  express  \
0    0    0              1       1       0         0          0        0   
1    0    0              0       0       0         0          0        0   
2    0    0              0       0       0         0          0        0   
3    0    1              0       0       0         0          0        0   
4    0    0              0       0       1         1          0        1   
5    0    1              0       0       0         0          0        0   
6    1    0              0       0       0         0          0        0   
7    0    0              0       0       0         0          1        0   

   health  house  insta  installation  insu  kubernetes  loan  micro  motor  \
0       0      0      0             0     0           1     0      0      0   
1       0      0      0             0     1           0     0      1      0   
2       0      0      0             0     0           0     0      0      1   
3       0      0      0             1     0           0     0      0      0   
4       1      0      0             0     1           0     0      0      0   
5       0      0      0             1     0           0     0      0      0   
6       0      0      1             0     0           0     0      0      0   
7       0      1      0             0     0           0     2      0      0   

   ponda  secure  upi  
0      1       0    0  
1      0       0    0  
2      0       0    1  
3      0       1    0  
4      0       0    0  
5      0       1    0  
6      0       0    0  
7      0       0    0
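For comparison, the same matrix can probably be built without get_dummies by counting each row with `collections.Counter`. This is a minimal sketch, not part of the original answer; the name `dtm` is illustrative and the data is copied from the question.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'new': [
    ['kubernetes', 'client', 'bootstrapping', 'ponda'],
    ['micro', 'insu'],
    ['motor', 'upi'],
    ['secure', 'app', 'installation'],
    ['health', 'insu', 'express', 'credit', 'customer'],
    ['secure', 'app', 'installation'],
    ['aap', 'insta'],
    ['loan', 'house', 'loan', 'customers'],
]})

# One Counter per row; the DataFrame constructor aligns the keys into columns
dtm = (pd.DataFrame([Counter(words) for words in df['new']], index=df.index)
         .fillna(0)            # words absent from a row come out as NaN
         .astype(int)
         .sort_index(axis=1))  # alphabetical columns, as in the output above
print(dtm)
```

Counting per row keeps repeated words (such as loan in row 7) as a count of 2 without any column regrouping step.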

【Comments】:

Yes, then use the first solution.
There are 2 solutions in my answer :)

【Solution 2】:

To use CountVectorizer the way you are using it, your DataFrame needs to look like this:

                                  string
0  kubernetes client bootstrapping ponda
1                             micro insu
2                              motor upi
3                secure app installation
4    health insu express credit customer
5                secure app installation
6                              aap insta
7              loan house loan customers

At the moment, you have this:

                                   stringList
0  [kubernetes, client, bootstrapping, ponda]
1                               [micro, insu]
2                                [motor, upi]
3                 [secure, app, installation]
4   [health, insu, express, credit, customer]
5                 [secure, app, installation]
6                                [aap, insta]
7              [loan, house, loan, customers]

Here is how to convert it into the form that CountVectorizer needs.

Here is a reproducible example:

import pandas as pd

df = pd.DataFrame([[['kubernetes', 'client', 'bootstrapping', 'ponda']], [['micro', 'insu']], [['motor', 'upi']],[['secure', 'app', 'installation']],[['health', 'insu', 'express', 'credit', 'customer']],[['secure', 'app', 'installation']],[['aap', 'insta']],[['loan', 'house', 'loan', 'customers']]])

df.columns = ['new']

I am naming the column that holds the word lists new, as it was in your original DataFrame.

df['string'] = ""

I am creating an empty column in which I will store the joined words of each list.

for i in df.index:
    df.at[i, 'string'] = " ".join(df.at[i, 'new'])

I loop over the rows, join the items of each word list with " ", and store the result in the string column.
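The row-by-row loop can likely be replaced with a single vectorized call, `Series.str.join`, which is the approach Solution 1 uses. The sketch below runs on a two-row sample and compares both approaches; the name `looped` is illustrative.

```python
import pandas as pd

# Two rows taken from the question's data
df = pd.DataFrame({'new': [['micro', 'insu'],
                           ['loan', 'house', 'loan', 'customers']]})

# Row-by-row version, equivalent to the loop in this answer
looped = [" ".join(df.at[i, 'new']) for i in df.index]

# Vectorized version: str.join concatenates the items of each list
df['string'] = df['new'].str.join(' ')

print(df['string'].tolist())
```

Both versions produce the same strings. The vectorized form also removes the need to create the empty string column first.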

df.drop(['new'], axis = 1, inplace = True)

Now the column containing the lists of strings is no longer needed, so I dropped it.

The DataFrame is now in the shape you need, and you can use CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

counts = countvec.fit_transform(df['string'])

vocab = pd.DataFrame(counts.toarray())
vocab.columns = countvec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

print(vocab)

which gives:

   aap  app  bootstrapping  client  credit  customer  customers  express  \
0    0    0              1       1       0         0          0        0   
1    0    0              0       0       0         0          0        0   
2    0    0              0       0       0         0          0        0   
3    0    1              0       0       0         0          0        0   
4    0    0              0       0       1         1          0        1   
5    0    1              0       0       0         0          0        0   
6    1    0              0       0       0         0          0        0   
7    0    0              0       0       0         0          1        0   

   health  house  insta  installation  insu  kubernetes  loan  micro  motor  \
0       0      0      0             0     0           1     0      0      0   
1       0      0      0             0     1           0     0      1      0   
2       0      0      0             0     0           0     0      0      1   
3       0      0      0             1     0           0     0      0      0   
4       1      0      0             0     1           0     0      0      0   
5       0      0      0             1     0           0     0      0      0   
6       0      0      1             0     0           0     0      0      0   
7       0      1      0             0     0           0     2      0      0   

   ponda  secure  upi  
0      1       0    0  
1      0       0    0  
2      0       0    1  
3      0       1    0  
4      0       0    0  
5      0       1    0  
6      0       0    0  
7      0       0    0

【Comments】:

@Anagha Glad I could help!