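【Question title】: How to count word occurrences for a column that holds lists of words in Pandas
【Posted】: 2018-12-07 10:49:42
【Question】:

My DataFrame has a string column in which I have already split each sentence into words. Now I need to count the occurrences of each word and turn the words into columns. In other words, I want to build a document-term matrix.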

0                   [kubernetes, client, bootstrapping, ponda]
1                                                [micro, insu]
2                                                 [motor, upi]
3                                  [secure, app, installation]
4                    [health, insu, express, credit, customer]
5                                  [secure, app, installation]
6                                                 [aap, insta]
7                               [loan, house, loan, customers]

Expected output (illustrative):

    kubernetes  client  bootstrapping   ponda   loan    customers   installation
0        1       1      1               1       0           0        0
1        0       0      0               0       1           0        1
2        0       2      0               0       0           0        0
3        1       1      1               1       0           0        0

Code so far:

 from sklearn.feature_extraction.text import CountVectorizer

 countvec = CountVectorizer()

 countvec.fit_transform(df.new)

Error:

AttributeError: 'list' object has no attribute 'lower'

【Comments】:

    Tags: python-3.x pandas dataframe word-count


    【Solution 1】:

    If the values are lists, first join each list into a single string, then use CountVectorizer:

    print (type(df.loc[0, 'new']))
    <class 'list'>
    
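    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    countvec = CountVectorizer()
    # Join each list of words into one space-separated string per row
    counts = countvec.fit_transform(df['new'].str.join(' '))
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame(counts.toarray(), columns=countvec.get_feature_names_out())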
    

    Another solution, using only pandas, with get_dummies and sum:

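    df1 = pd.DataFrame(df['new'].values.tolist())
    # sum(axis=1, level=0) was removed in pandas 2.0; group the duplicate column names instead
    df = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).sum().T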
    

    print (df)
    
       aap  app  bootstrapping  client  credit  customer  customers  express  \
    0    0    0              1       1       0         0          0        0   
    1    0    0              0       0       0         0          0        0   
    2    0    0              0       0       0         0          0        0   
    3    0    1              0       0       0         0          0        0   
    4    0    0              0       0       1         1          0        1   
    5    0    1              0       0       0         0          0        0   
    6    1    0              0       0       0         0          0        0   
    7    0    0              0       0       0         0          1        0   
    
       health  house  insta  installation  insu  kubernetes  loan  micro  motor  \
    0       0      0      0             0     0           1     0      0      0   
    1       0      0      0             0     1           0     0      1      0   
    2       0      0      0             0     0           0     0      0      1   
    3       0      0      0             1     0           0     0      0      0   
    4       1      0      0             0     1           0     0      0      0   
    5       0      0      0             1     0           0     0      0      0   
    6       0      0      1             0     0           0     0      0      0   
    7       0      1      0             0     0           0     2      0      0   
    
       ponda  secure  upi  
    0      1       0    0  
    1      0       0    0  
    2      0       0    1  
    3      0       1    0  
    4      0       0    0  
    5      0       1    0  
    6      0       0    0  
    7      0       0    0  
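
To see why the get_dummies approach needs a final sum over column names, here is a small sketch on a hypothetical two-row frame. The names positions, dummies and dtm are illustrative assumptions, and the grouping is written as a transpose plus groupby so that it also runs on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample: a column named 'new' holding lists of words.
df = pd.DataFrame({'new': [['loan', 'house', 'loan', 'customers'],
                           ['micro', 'insu']]})

# One column per word position; shorter rows are padded with missing values.
positions = pd.DataFrame(df['new'].tolist(), index=df.index)

# One indicator column per (position, word). The empty prefix leaves the bare
# word as the column name, so a word seen at two positions appears twice.
dummies = pd.get_dummies(positions, prefix='', prefix_sep='', dtype=int)
print(dummies.columns.tolist())

# Summing the columns that share a name turns the indicators into counts.
dtm = dummies.T.groupby(level=0).sum().T
print(dtm)
```

Unlike CountVectorizer, this variant keeps the words exactly as they are stored in the lists (no lowercasing or token filtering).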
    

    【Comments】:

    • Yes, then use the first solution.
    • There are 2 solutions in my answer :)
    【Solution 2】:

    To use CountVectorizer the way you are calling it, your DataFrame needs to look like this:

                                      string
    0  kubernetes client bootstrapping ponda
    1                             micro insu
    2                              motor upi
    3                secure app installation
    4    health insu express credit customer
    5                secure app installation
    6                              aap insta
    7              loan house loan customers
    

    At the moment, you have this:

                                       stringList
    0  [kubernetes, client, bootstrapping, ponda]
    1                               [micro, insu]
    2                                [motor, upi]
    3                 [secure, app, installation]
    4   [health, insu, express, credit, customer]
    5                 [secure, app, installation]
    6                                [aap, insta]
    7              [loan, house, loan, customers]
    

    Here is how to convert it into the form that CountVectorizer needs.

    A reproducible example:

    import pandas as pd
    
    df = pd.DataFrame([[['kubernetes', 'client', 'bootstrapping', 'ponda']], [['micro', 'insu']], [['motor', 'upi']],[['secure', 'app', 'installation']],[['health', 'insu', 'express', 'credit', 'customer']],[['secure', 'app', 'installation']],[['aap', 'insta']],[['loan', 'house', 'loan', 'customers']]])
    
    df.columns = ['new']
    

    I am naming the column that holds the word lists new, as it was in your original DataFrame.

    df['string'] = ""
    

    This creates an empty column that will hold the words of each list joined into a single string.

    for i in df.index:
    
        df.at[i, 'string'] = " ".join(item for item in df.at[i, 'new'])
    

    This walks through the rows, joins the items of each list with " ", and stores the result in the string column.
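
As a hedged aside, the row-by-row loop can also be written without an explicit loop. The sketch below uses a hypothetical two-row frame and compares the loop with Series.str.join (the call used in Solution 1) and with Series.map; the names joined_str and joined_map are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample: a column named 'new' holding lists of words.
df = pd.DataFrame({'new': [['motor', 'upi'],
                           ['secure', 'app', 'installation']]})

# Row-by-row version, equivalent to the loop in the answer.
df['string'] = ""
for i in df.index:
    df.at[i, 'string'] = " ".join(df.at[i, 'new'])

# Vectorized alternatives that build the same strings in one call each.
joined_str = df['new'].str.join(' ')
joined_map = df['new'].map(' '.join)

print(df['string'].tolist())
print(joined_str.tolist())
print(joined_map.tolist())
```

All three produce the same strings; the one-call versions are usually shorter to write and easier to read.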

    df.drop(['new'], axis = 1, inplace = True)
    

    The column containing the lists of strings is no longer needed, so I drop it.

    Your DataFrame is now in the required shape, and you can use CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer
    
    countvec = CountVectorizer()
    
    counts = countvec.fit_transform(df['string'])
    
    vocab = pd.DataFrame(counts.toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    vocab.columns = countvec.get_feature_names_out()
    
    print(vocab)
    

    which gives

       aap  app  bootstrapping  client  credit  customer  customers  express  \
    0    0    0              1       1       0         0          0        0   
    1    0    0              0       0       0         0          0        0   
    2    0    0              0       0       0         0          0        0   
    3    0    1              0       0       0         0          0        0   
    4    0    0              0       0       1         1          0        1   
    5    0    1              0       0       0         0          0        0   
    6    1    0              0       0       0         0          0        0   
    7    0    0              0       0       0         0          1        0   
    
       health  house  insta  installation  insu  kubernetes  loan  micro  motor  \
    0       0      0      0             0     0           1     0      0      0   
    1       0      0      0             0     1           0     0      1      0   
    2       0      0      0             0     0           0     0      0      1   
    3       0      0      0             1     0           0     0      0      0   
    4       1      0      0             0     1           0     0      0      0   
    5       0      0      0             1     0           0     0      0      0   
    6       0      0      1             0     0           0     0      0      0   
    7       0      1      0             0     0           0     2      0      0   
    
       ponda  secure  upi  
    0      1       0    0  
    1      0       0    0  
    2      0       0    1  
    3      0       1    0  
    4      0       0    0  
    5      0       1    0  
    6      0       0    0  
    7      0       0    0  
    

    【Comments】:

    • @Anagha Glad I could help!