【Question Title】: Counting tokenized words in a data frame with pandas (Python)
【Posted】: 2021-07-22 15:17:52
【Question】:

I have created tokenized data (text) in a pandas data frame in Python.

I just want to count the tokenized data and get an output showing how often each element in the tokenized data occurs.

Here is the code I used to create the tokenized data:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import re

def tokenize(txt):
    tokens = re.split(r'\W+', txt)  # raw string avoids an invalid-escape warning
    return tokens

Complains['clean_text_tokenized'] = Complains['clean text'].apply(lambda x: tokenize(x.lower()))

# Complains['clean text'] is the original file of the data


Complains['clean_text_tokenized'].head(10)

Here is the output of the tokenized data:


0                   [comcast, cable, internet, speeds]
1     [payment, disappear, service, got, disconnected]
2                                [speed, and, service]
3    [comcast, imposed, a, new, usage, cap, of, 300...
4    [comcast, not, working, and, no, service, to, ...
5    [isp, charging, for, arbitrary, data, limits, ...
6    [throttling, service, and, unreasonable, data,...
7    [comcast, refuses, to, help, troubleshoot, and...
8                         [comcast, extended, outages]
9    [comcast, raising, prices, and, not, being, av...
Name: clean_text_tokenized, dtype: object

Any suggestions would be helpful.

【Question Comments】:

Tags: python tokenize


    【Solution 1】:

    You can use Counter:

    from collections import Counter
    # ... and then
    def tokenize(txt):
        return Counter(re.split(r'\W+', txt))  # raw string avoids an invalid-escape warning
    

    See this Python test:

    from collections import Counter
    import pandas as pd
    import re
    
    Complains = pd.DataFrame({'clean text':['comcast, cable, internet, speeds', 'payment, disappear, service, got, disconnected']})
    
    Complains['clean_text_tokenized'] = Complains['clean text'].str.findall(r'\w+')
    
    freq = Counter([item for sublist in Complains['clean_text_tokenized'].to_list() for item in sublist])
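    A pandas-native alternative to flattening the lists by hand is `Series.explode` followed by `value_counts`; a minimal sketch, using hypothetical sample rows shaped like the question's data:

    ```python
    import pandas as pd

    # Hypothetical sample mirroring the question's data
    Complains = pd.DataFrame({'clean text': [
        'comcast cable internet speeds',
        'payment disappear service got disconnected',
        'speed and service',
    ]})
    Complains['clean_text_tokenized'] = Complains['clean text'].str.findall(r'\w+')

    # explode() emits one row per token; value_counts() tallies the repeats
    freq = Complains['clean_text_tokenized'].explode().value_counts()
    ```

    `freq` is a Series indexed by token, sorted by frequency, so e.g. `freq['service']` gives the count for that word.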
    

    【Discussion】:

    • Thanks for your help; I tried the test and it worked fine. However, when I try to do the same for all the tokens, I get an error saying (AttributeError: 'list' object has no attribute 'split'). This is the code I used: ``` from collections import Counter import pandas as pd import re def tokenized(txt): freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()]) return freq Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x)) ```
    • @Hattan That just means the column contains lists rather than strings. Make sure the column holds only strings, which means you need to check all the code that runs before this point.
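    • Since the column in the comment above already holds lists, no `.split()` call is needed at all; one way to count the tokens directly is to flatten the lists with `itertools.chain` (a sketch, assuming the hypothetical column name from that comment):

    ```python
    from collections import Counter
    from itertools import chain
    import pandas as pd

    # Hypothetical frame whose column already contains token lists,
    # matching the situation described in the comment above
    Complains = pd.DataFrame({'clean_text_tokenized_without_stopwords': [
        ['comcast', 'cable', 'speeds'],
        ['comcast', 'service', 'disconnected'],
    ]})

    # chain.from_iterable flattens the lists; no .split() on list elements
    freq = Counter(chain.from_iterable(
        Complains['clean_text_tokenized_without_stopwords']))
    ```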