【Question title】:How can I access the raw documents from the Brown corpus?
【Posted】:2017-11-15 06:55:02
【Question description】:

For every other NLTK corpus, calling corpus.raw() yields the raw text from the files. For example:

>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'

However, calling brown.raw() returns tagged text.

>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '

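I have read all the documentation I could find, but I can't find an obvious explanation or a way to get the untagged version. Is there a reason this corpus is tagged while the others are not?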

【Comments】:

    Tags: python nlp nltk corpus tagged-corpus


    【Solution 1】:

    TL;DR

    import nltk
    nltk.download('brown')
    nltk.download('nonbreaking_prefixes')
    nltk.download('perluniprops')
    
    from nltk.corpus import brown
    from nltk.tokenize.moses import MosesDetokenizer
    
    mdetok = MosesDetokenizer()
    
    brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]
    
    for sent in brown_natural:
        print(sent)
    

    In long

    That's because the "raw" form of the Brown corpus is already tokenized and tagged, i.e. the tagged text is the raw form of the corpus =)

    You can look at the individual files in the nltk_data directory:

    $ head -n10 nltk_data/corpora/brown/ca01
    
    
        The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
    
    
        The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.
    
    
        The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
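To make the file format above concrete, here is a minimal sketch of how a line of `word/tag` tokens can be split into pairs using only the standard library. The names `parse_tagged_line` and `sample` are made up for illustration; this is not how NLTK's corpus reader is implemented, although NLTK does ship a comparable helper, `nltk.tag.str2tuple`.

```python
def parse_tagged_line(line):
    """Split one line of word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # The tag follows the LAST slash, so a word that itself
        # contains a slash is kept intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the whole token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# A shortened sample in the same format as the ca01 file.
sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr ./."
pairs = parse_tagged_line(sample)
words = [word for word, _ in pairs]

print(pairs[:2])
print(" ".join(words))
```

Dropping the tags this way gives the same kind of token list that brown.words() returns, still with the spacing around punctuation.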
    

    If you want the words in the corpus, you can use brown.words(), e.g.

    >>> from nltk.corpus import brown
    
    >>> brown.words()
    [u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
    
    >>> ' '.join(brown.words()[:30])
    u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
    

    If you want the words from a specific file:

    >>> brown.fileids()[:10] # The first 10 fileids from brown.
    [u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']
    
    >>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
    u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
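Since the question asks for whole documents, the per-file lookup above can be wrapped in a small helper that builds one untagged string per file id. This is only a sketch: `document_texts` is a hypothetical name, and it assumes nothing about the corpus object beyond the `fileids()` and `words(fileid)` methods shown above.

```python
def document_texts(corpus, fileids=None):
    """Map each file id to its words joined by single spaces."""
    if fileids is None:
        fileids = corpus.fileids()
    return {fid: " ".join(corpus.words(fid)) for fid in fileids}


# Usage with NLTK (needs nltk installed and the 'brown' data downloaded):
# from nltk.corpus import brown
# texts = document_texts(brown, ["ca01", "ca02"])
# print(texts["ca01"][:60])
```

The resulting strings are still tokenized (spaces around punctuation), so they are untagged rather than truly "raw".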
    

    And the sentences from a specific file:

    >>> brown.sents('ca01')
    [[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
    

    To print the individual sentences:

    >>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
    ...     print(' '.join(sent))
    ... 
    The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
    The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
    The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
    `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
    The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
    

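    Detokenizing a tokenized corpus is rather messy and may or may not work well, but you can try the MosesDetokenizer. Note that nltk.tokenize.moses was removed in NLTK 3.3; on newer versions the same class is available from the separate sacremoses package (from sacremoses import MosesDetokenizer).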

    First download the data that MosesDetokenizer needs:

    >>> import nltk
    >>> nltk.download('perluniprops')
    [nltk_data] Downloading package perluniprops to
    [nltk_data]     /home/ltan/nltk_data...
    [nltk_data]   Unzipping misc/perluniprops.zip.
    True
    >>> nltk.download('nonbreaking_prefixes')
    [nltk_data] Downloading package nonbreaking_prefixes to
    [nltk_data]     /home/ltan/nltk_data...
    [nltk_data]   Package nonbreaking_prefixes is already up-to-date!
    True
    

    Then initialize the MosesDetokenizer:

    >>> from nltk.tokenize.moses import MosesDetokenizer
    >>> mdetok = MosesDetokenizer()
    

    And use MosesDetokenizer.detokenize():

    >>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
    ...     # Join the words in sentences and convert the `` -> "
    ...     # also convert '' -> " and ` -> '
    ...     munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
    ...     print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
    ... 
    The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
    The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
    The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
    "Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of voters and the size of this city".
    The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".
    

    To convert every sentence in brown into naturally readable text:

    from nltk.tokenize.moses import MosesDetokenizer
    mdetok = MosesDetokenizer()
    brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True)  for sent in brown.sents()]
    

    [out]:

    >>> for sent in brown_natural:
    ...     print(sent)
    ...     break
    ... 
    The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
    

    【Comments】:

      【Solution 2】:

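      The tagged text is the raw document, i.e. the actual contents of the Brown corpus files. The raw() method shows you exactly what is stored in the file; it only returns clean text for "plaintext" corpora, not for "all other corpora" as you assumed. For example, try nltk.corpus.treebank.raw('wsj_0001.mrg') or nltk.corpus.conll2000.raw("train.txt"), and you will see trees and "IOB format" text, respectively.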

      Now, if your goal is to reconstruct readable text, joining with spaces is usually good enough for me:

      for sent in brown.sents():
          print(" ".join(sent))
      

      You'll get output like this:

      `` Only a relative handful of such reports was received '' , the jury said , `` considering
      the widespread interest in the election , the number of voters and the size of this 
      city '' .
      

      If you don't like how that looks, see alvas's answer for a more ambitious reconstruction.
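As a middle ground between plain joining and a full detokenizer, a few regular-expression substitutions can tidy up the quotes and the spacing before punctuation. This is a rough sketch using only the standard library, not the Moses algorithm; `rough_detokenize` is a hypothetical name, and it deliberately ignores brackets, single backticks and other special cases.

```python
import re


def rough_detokenize(tokens):
    """Join tokens and clean up Brown-style quotes and punctuation spacing."""
    text = " ".join(tokens)
    # Opening quote: `` becomes " and attaches to the following word.
    text = re.sub(r"``\s*", '"', text)
    # Closing quote: '' becomes " and attaches to the preceding word.
    text = re.sub(r"\s*''", '"', text)
    # Remove the space before common punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


tokens = ["``", "Only", "a", "handful", "''", ",", "the", "jury", "said", "."]
print(rough_detokenize(tokens))
# prints: "Only a handful", the jury said.
```

To apply it to the corpus, call it on each sentence from brown.sents() instead of using " ".join(sent).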

      【Comments】:
