Using NLTK to perform document classification on website content: issues with BeautifulSoup and NaiveBayes (answers)

【Question title】: Using NLTK to perform document classification on website content issues with BeautifulSoup and NaiveBayes
【Posted】: 2015-02-03 19:31:45
【Question】:

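I have a Python 2.7 project in which I want to classify websites based on their content. I have a database containing many website URLs and their associated categories. There are many categories (= labels), and I want to classify new sites into the appropriate category based on their content. I have been following the NLTK classification tutorial/example listed here, but I have run into some problems that I cannot explain.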

Here is an outline of the process I use:

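1. Use MySQLdb to retrieve the category associated with a given website URL. The category is used when extracting the data (content) from the URL, so that the content can be paired with the site's category (= label).
2. Use a getSiteContent(site) function to extract the content from the website.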

The function above looks like this:

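import re
import urllib2
from collections import defaultdict

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def getSiteContent(site):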
    try:
        response = urllib2.urlopen(site, timeout = 1)
        htmlSource = response.read()
    except Exception as e: # <=== some websites may be inaccessible as list isn't up-to-date
        global errors
        errors += 1
        return ''

    soup = BeautifulSoup(htmlSource)
    for script in soup.find_all('script'):
        script.extract()

    commonWords = set(stopwords.words('english'))
    commonWords.update(['function', 'document', 'window', 'functions', 'getElementsByTagName', 'parentNode', 'getDocumentById', 'javascript', 'createElement', 'Copyright', 'Copyrights', 'respective', 'owners', 'Contact Us', 'Mobile Version', 'FAQ', 'Privacy Policy', 'Terms of Service', 'Legal Disclaimer'])

    text = soup.get_text()

    # Remove ',', '/', '%', ':'
    re.sub(r'(\\d+[,/%:]?\\d*)', '', text)
    # Remove digits
    re.sub(r'\d+', '', text)
    # Remove non-ASCII
    re.sub(r'[^\x00-\x7F]',' ', text)
    # Remove stopwords
    for word in commonWords :
        text = text.replace(' '+word+' ', ' ')

    # Tokenize the site content using NLTK
    tokens = word_tokenize(text)

    # We collect some word statistics, i.e. how many times a given word appears in the text
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1

    features = {}
    # Get rid of words that appear less than 3 times
    for word in tokens:
        if counts[word] >= 3 :
            features['count(%s)' % word] = counts[word]

    return features
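
A note on the cleanup steps in the function above: re.sub returns a new string and leaves its argument unchanged, so the three calls whose results are not assigned have no effect on text. The following is a minimal sketch of the same cleanup with the results assigned back, and with stopwords filtered per token rather than by string replacement. The helper name clean_tokens, the whitespace split (standing in for word_tokenize), and the tiny stopword set are assumptions made for illustration; they are not part of the original code.

```python
import re

def clean_tokens(text, stopwords):
    """Strip numbers and non-ASCII characters, then drop stopwords per token."""
    # re.sub returns a new string, so the result must be assigned back.
    text = re.sub(r'\d+[,/%:]?\d*', ' ', text)   # numbers such as 12, 3/4, 50%
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)   # runs of non-ASCII characters
    # Lowercase each token before the stopword test so 'The' matches 'the'.
    return [tok for tok in text.split() if tok.lower() not in stopwords]

sample = u'The price rose 50% in 2015 \u05e2\u05d1\u05e8\u05d9 and the pizza sold out'
print(clean_tokens(sample, {'the', 'in', 'and'}))
```

Filtering on tokens also avoids the case where a stopword at the very start or end of the text, or next to punctuation, is missed by the ' word ' replacement.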

After all of the above, I do the following:

train = getTrainingSet(n)
random.shuffle(train)

where n is the number of sites I want to train the model on.

After that, I do this:

feature_set = []
count = 0
for (site, category) in train:
    result = getSiteContent(site)
    count += 1
    if result != '':
        print "%d. Got content for %s" % (count, site)
        feature_set.append((result, category))
    else  :
        print "%d. Failed to get content for %s" % (count, site)

The print statements at this point are mainly for debugging. After the above, feature_set contains something like the following:

print feature_set
[({u'count(import)': 22, u'count(maxim)': 22, u'count(Maxim)': 5, u'count(css)': 22, u'count(//www)': 22, u'count(;)': 22, u'count(url)': 22, u'count(Gift)': 3, u"count('')": 44, u'count(http)': 22, u'count(&)': 3, u'count(ng16ub)': 22, u'count(STYLEThe)': 3, u'count(com/modules/system/system)': 4, u'count(@)': 22, u'count(?)': 22}, 'Arts & Entertainment'), ({u'count(import)': 3, u'count(css)': 3, u'count(\u05d4\u05d9\u05d5\u05dd)': 4, u'count(\u05de\u05d9\u05dc\u05d5\u05df)': 6, u'count(;)': 3, u'count(\u05e2\u05d1\u05e8\u05d9)': 4, u'count(\u05d0\u05ea)': 3, u'count(\u05de\u05d5\u05e8\u05e4\u05d9\u05e7\u05e1)': 6, u"count('')": 6, u'count(\u05d4\u05d5\u05d0)': 3, u'count(\u05e8\u05d1\u05de\u05d9\u05dc\u05d9\u05dd)': 3, u'count(ver=01122014_4)': 3, u'count(|)': 4, u'count(``)': 4, u'count(@)': 3, u'count(?)': 7}, 'Miscellaneous')]

After that, I try to train my classifier and then run it against test data that I took from feature_set:

train_set, test_set = feature_set[len(train)/2:], feature_set[:len(train)/2]
print "Num in train_set: %d" % len(train_set)
print "Num in test_set: %d" % len(test_set)
classifier = nltk.NaiveBayesClassifier.train(train_set) # <=== classifier trained on train_set
print classifier.show_most_informative_features(5)
print "=== Classifying a site ==="
print classifier.classify(getSiteContent("http://www.mangaspoiler.com"))
print "Non-working sites: %d" % errors
print "Classifier accuracy: %d" % nltk.classify.accuracy(classifier, test_set)

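The code above is almost exactly the same as the tutorial on the NLTK documentation site. However, the results are as follows (given a set of 100 websites):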

$ python classify.py
Num in train_set: 23
Num in test_set: 50
Most Informative Features
            count(Pizza) = None           Arts & : Techno =      1.0 : 1.0
None
=== Classifying a site ===
Technology & Computing
Non-working sites: 27
Classifier accuracy: 0

Now, there are clearly a few problems with these results:

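1. The word tokens contain unicode characters such as \u05e2\u05d1\u05e8\u05d9, because the regex meant to remove them seems to work only when they stand alone. This is a minor problem.
2. A bigger problem is that when I print feature_set, the word tokens are displayed as u'count(...)' = # rather than 'count(...)' = #. I think this may be part of the reason my classifier fails.
3. The classifier is clearly failing catastrophically in some way. The accuracy is listed as 0 even when I feed the entire dataset into the classifier, which seems extremely unlikely.
4. The Most Informative Features function reports count(Pizza) = None. However, my code declares defaultdict(int), which should associate every entry with the number of times it appears in the text.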
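
On the second and fourth points: the u prefix is only how Python 2 displays unicode strings, and NLTK's NaiveBayesClassifier treats a feature that is missing from a feature set as having the value None, which appears to be what count(Pizza) = None reports. Because every site produces a different set of count(...) keys, most features are absent from most sites. The document-classification example in the NLTK book instead emits the same keys for every document: a boolean contains(word) feature for each word in a fixed vocabulary. A minimal sketch of that idea follows. The helper names build_vocabulary and document_features, the vocabulary size, and the toy documents are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, size):
    """Return the `size` most frequent tokens across all documents."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(size)]

def document_features(tokens, vocabulary):
    """Emit one boolean feature per vocabulary word, so every document has the same keys."""
    present = set(tokens)
    return {'contains(%s)' % word: (word in present) for word in vocabulary}

# Toy tokenized documents standing in for two fetched sites.
docs = [['pizza', 'menu', 'pizza', 'order'], ['code', 'python', 'code', 'order']]
vocabulary = build_vocabulary(docs, 3)
for tokens in docs:
    print(sorted(document_features(tokens, vocabulary).items()))
```

With a shared vocabulary, a value of False carries information ("this word does not occur"), whereas a missing count(...) key only says the feature was never emitted for that site.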

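I am at a loss as to why this is happening. As far as I can tell, my data is structured identically to the data used in the tutorial on the NLTK documentation site that I linked at the top of this question. If anyone who has worked with NLTK has seen this behavior before, I would greatly appreciate any hints about what I might be doing wrong.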

【Question comments】:

FYI, I removed the comment about close votes from the question, because the Close Votes review ended with a unanimous vote.

Tags: python nlp classification nltk document-classification

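【Solution 1】:

There are probably several things wrong here, but the first and most obvious one stands out:

"The accuracy is listed as 0 even when I feed the entire dataset into the classifier"

It is not listed as 0.0? That sounds like something that should be a float is an int. I suspect you are dividing at some point in order to normalize, and the int/int is not being converted to a float.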
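
One place where a float can end up displayed as an int is the format string itself: the %d conversion truncates a float to its integer part, so any accuracy below 1.0 prints as 0. A minimal sketch, using a made-up accuracy value rather than a real result:

```python
accuracy = 0.37  # made-up value for illustration, not a measured result

as_int = "Classifier accuracy: %d" % accuracy      # %d keeps only the integer part
as_float = "Classifier accuracy: %.2f" % accuracy  # %.2f keeps two decimal places

print(as_int)
print(as_float)
```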

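When building your count table, add 1.0 for each count rather than 1. That fixes the problem at its source, and the correction will trickle down.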
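
As a rough sketch of that suggestion, the count table can be built with float values so that any later division yields a float even under Python 2, where int/int truncates. The helper name count_tokens and the sample tokens are assumptions for illustration.

```python
from collections import defaultdict

def count_tokens(tokens):
    """Count token occurrences as floats rather than ints."""
    counts = defaultdict(float)
    for token in tokens:
        counts[token] += 1.0   # float increment instead of the integer 1
    return counts

counts = count_tokens(['pizza', 'menu', 'pizza', 'order'])
total = sum(counts.values())
# Relative frequency of each token; float / float never truncates.
frequencies = {token: value / total for token, value in counts.items()}
print(frequencies['pizza'])
```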

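If counting documents with floats seems odd, think of each count as a measurement in the scientific sense of the word, rather than as a representation of discrete documents.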

【Comments】: