Extracting name entities from a web domain address

【Question title】: Extract name entities from web domain address
【Posted】: 2016-08-15 13:59:29
【Question】:

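I am working on an NLP problem (in Python 2.7): extracting the location of a news report from the text of the report. For this task I am using the Clavin API, which works well.

However, I have noticed that the URL of the report itself often mentions the name of the location. I would like to find a way to extract this entity from the domain name, so that I can improve the accuracy of Clavin's results by supplying an additional named entity in the request.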

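In an ideal world, I would like to be able to provide this input: www.britainnews.net

and get back this, or something similar, as output: [www,britain,news,net]

Of course, I can use .split() to separate out the unimportant www and net tokens, but I do not know how to split the middle phrase without an intensive dictionary lookup.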
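For reference, the easy part described above can be sketched as follows. This is only an illustration: the helper names `domain_labels` and `middle_labels` and the `ignore` tuple are assumptions made up for this example, not part of any library.

```python
def domain_labels(host):
    """Split a host name on dots and drop empty labels."""
    return [label for label in host.lower().split('.') if label]


def middle_labels(host, ignore=('www', 'net', 'com', 'org')):
    """Keep only the labels that are not in the (hypothetical) ignore list."""
    return [label for label in domain_labels(host) if label not in ignore]


labels = domain_labels('www.britainnews.net')
candidates = middle_labels('www.britainnews.net')
print(labels)      # ['www', 'britainnews', 'net']
print(candidates)  # ['britainnews']
```

What remains is the hard part: breaking a label such as britainnews into words.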

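I am not asking anyone to solve this problem or write any code for me. This is an open call for suggestions on the ideal NLP library (if one exists), or for any ideas on how to approach the problem.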

【Question comments】:

Tags: python string machine-learning nlp

【Answer 1】:

Check out the Word Segmentation Task from Norvig's work.

from __future__ import division
from collections import Counter
import nltk

# Requires the Reuters corpus: run nltk.download('reuters') once beforehand.
WORDS = nltk.corpus.reuters.words()

COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print segment('britainnews')     # ['britain', 'news']

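More examples: Word Segmentation Task

【Comments】:

Perfect. Thanks.
Please accept the answer (click the check mark) if it solved your problem, since that makes it useful to others as well.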
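The recursive `segment` above re-segments the same suffix many times and multiplies raw probabilities, which can underflow on long inputs. Below is a minimal sketch of one way to address both, using a memo dictionary and log probabilities, and applying the result to a full host name. It is not Norvig's code: the tiny `COUNTS` table is made-up toy data so that the example runs without a corpus, and `log_p`, `segment_domain`, the `ignore` tuple and the unknown-word penalty are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy counts, for illustration only. In practice these would come
# from a real corpus, lowercased so that they match lowercase domain names.
COUNTS = {'britain': 50, 'news': 120, 'brit': 2, 'a': 900,
          'in': 800, 'new': 150, 's': 30}
TOTAL = sum(COUNTS.values())


def log_p(word):
    """Log unigram probability; unseen words get a penalty that grows with length."""
    if word in COUNTS:
        return math.log(COUNTS[word]) - math.log(TOTAL)
    return -math.log(TOTAL) - len(word) * math.log(10)


def segment(text, max_len=20, memo=None):
    """Return (log probability, words) for the most probable segmentation of text."""
    if memo is None:
        memo = {}
    if not text:
        return 0.0, []
    if text not in memo:
        best = None
        for i in range(1, min(len(text), max_len) + 1):
            first, rest = text[:i], text[i:]
            rest_score, rest_words = segment(rest, max_len, memo)
            candidate = (log_p(first) + rest_score, [first] + rest_words)
            if best is None or candidate[0] > best[0]:
                best = candidate
        memo[text] = best  # each distinct suffix is solved only once
    return memo[text]


def segment_domain(host, ignore=('www', 'net', 'com', 'org')):
    """Split a host on dots and segment every label not in the ignore list."""
    tokens = []
    for label in host.lower().split('.'):
        if label in ignore:
            tokens.append(label)
        else:
            tokens.extend(segment(label)[1])
    return tokens


print(segment_domain('www.britainnews.net'))  # ['www', 'britain', 'news', 'net']
```

With a real corpus, only `COUNTS` would need to change; the quality of the split then depends on how well the corpus vocabulary covers the words that appear in domain names.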