【Question title】: How to find the most frequent words before and after a given word in a given text in Python?
【Posted】: 2013-11-29 23:49:48
【Question description】:

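I have a large text, and I am trying to get the words that occur most frequently before and after a given word in it.

For example:

I would like to know which words occur most frequently after "lake". Ideally I would get something like: (word 1, # of occurrences), (word 2, # of occurrences), ...

The same goes for the preceding words...

I tried NLTK bigrams, but they only seem to find the most common n-grams overall... Is it possible to somehow fix one of the words and find the most frequent n-grams that contain that fixed word?

Thanks for your help!!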

【Question comments】:

    Tags: python-2.7 nlp nltk n-gram


    【Solution 1】:

    Are you looking for something like this?

    text = """
    A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
    Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
    Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
    Etymology, meaning, and usage of "lake"[edit]
    Oeschinen Lake in the Swiss Alps
    Lake Tahoe on the border of California and Nevada
    The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
    The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
    There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
    One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
    it partially or totally fills one or several basins connected by straits[4]
    has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
    it does not have regular intrusion of sea water[4]
    a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
    the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
    With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
    """.split()
    
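    from collections import Counter
    from nltk import bigrams
    
    # Keep only the bigrams whose first word is 'lake'.
    bgs = bigrams(text)
    lake_bgs = filter(lambda item: item[0] == 'lake', bgs)
    
    # Count the word that follows 'lake' in each of those bigrams.
    c = Counter(map(lambda item: item[1], lake_bgs))
    print(c.most_common())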
    

    Which outputs (the order of the tied entries may vary):

    [('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]
    

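    Note that if your text is very long, you may want to use ifilter, imap, etc. from itertools so that the intermediate lists are not built in memory. In Python 3, filter and map are already lazy.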
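
As a rough illustration of that lazy approach, here is a minimal sketch that streams tokens line by line and never materialises the full bigram list. It is written for Python 3 and uses only the standard library instead of NLTK; the helper names `iter_tokens`, `iter_bigrams` and `count_following`, and the two sample lines, are made up for this example.

```python
from collections import Counter
from itertools import tee


def iter_tokens(lines):
    """Yield whitespace-separated tokens one at a time from an iterable of lines."""
    for line in lines:
        for token in line.split():
            yield token


def iter_bigrams(tokens):
    """Lazily yield consecutive (previous, current) token pairs."""
    first, second = tee(tokens)
    next(second, None)  # advance the second iterator by one token
    return zip(first, second)


def count_following(lines, word):
    """Count the tokens that immediately follow `word`."""
    pairs = iter_bigrams(iter_tokens(lines))
    return Counter(nxt for prev, nxt in pairs if prev == word)


# Hypothetical input; any iterable of lines works, e.g. an open file object.
sample_lines = ["the lake is calm", "a lake is deep and the lake was cold"]
following = count_following(sample_lines, "lake")
print(following.most_common())  # [('is', 2), ('was', 1)]
```

Because `lines` can be any iterable, an open file can be passed in directly. Like the code above, this lets pairs span line boundaries.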

    Edit: here is the code for the words both before and after 'lake'.

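    from collections import Counter
    from nltk import trigrams
    
    # Keep only the trigrams whose middle word is 'lake'. A list is built here
    # because the result is iterated twice below.
    tgs = trigrams(text)
    lake_tgs = [item for item in tgs if item[1] == 'lake']
    
    before_lake = [item[0] for item in lake_tgs]
    after_lake = [item[2] for item in lake_tgs]
    
    # Combined counts of the words directly before and directly after 'lake'.
    c = Counter(before_lake + after_lake)
    print(c.most_common())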
    

    Note that this can also be done using bigrams :)
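
One possible bigram-based version is sketched below. It keeps the "before" and "after" counts in separate Counters, which gives the (word, # of occurrences) lists the question asks for, and it also catches the target word when it is the first or last token, which a filter on the middle of a trigram cannot. The function name `before_and_after` and the sample sentence are invented for illustration, and `zip(tokens, tokens[1:])` stands in for `nltk.bigrams(tokens)`.

```python
from collections import Counter


def before_and_after(tokens, word):
    """Return (before, after) Counters of the tokens adjacent to `word`."""
    before, after = Counter(), Counter()
    # zip(tokens, tokens[1:]) yields the same consecutive pairs as nltk.bigrams(tokens).
    for left, right in zip(tokens, tokens[1:]):
        if right == word:
            before[left] += 1   # `left` directly precedes the target word
        if left == word:
            after[right] += 1   # `right` directly follows the target word
    return before, after


# Hypothetical sample text, already split on whitespace.
tokens = "lake is a lake or the lake is big lake".split()
before, after = before_and_after(tokens, "lake")
print(before.most_common())
print(after.most_common())
```

To answer only the "what comes before" question, use just the first Counter.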

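    【Comments】:

    • Thanks! For the n+1 word, that is exactly it. I'm new to programming and don't know how to adapt the code for n-1...
    • Sorry, I hit enter before finishing the comment above... Is there a way to adapt it to count the preceding words instead of the following ones? Do you think it would also work with trigrams?
    【Solution 2】:

    Just to add to @Ohad's answer, here is an ngram implementation with NLTK that offers some extensibility.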

    #-*- coding: utf8 -*-
    
    import string
    from nltk import ngrams
    from itertools import chain
    from collections import Counter
    
    text = """
    A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
    Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
    Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
    Etymology, meaning, and usage of "lake"[edit]
    Oeschinen Lake in the Swiss Alps
    Lake Tahoe on the border of California and Nevada
    The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
    The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
    There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
    One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
    it partially or totally fills one or several basins connected by straits[4]
    has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
    it does not have regular intrusion of sea water[4]
    a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
    the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
    With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
    """
    
    def ngrammer(txt, n):
        # Removes punctuations and numbers.
        sentences = "".join([i for i in txt if i not in string.punctuation and not i.isdigit()]).split('\n')
        return list(chain(*[ngrams(i.split(), n) for i in sentences]))
    
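    def before_after(ngs, word):
        # Needs n >= 3: keep only the n-grams whose second item is the focus word.
        word_grams = [item for item in ngs if item[1] == word]
        # Collect the neighbours from the filtered n-grams, not from all of them.
        before = [item[0] for item in word_grams]
        after = [item[2] for item in word_grams]
        return before, after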
    
    bgs = ngrammer(text,2) # bigrams
    tgs = ngrammer(text,3) # trigrams
    xgs = ngrammer(text,10) # 10grams
    
    focus = 'lake'
    bf, af = before_after(xgs, focus)
    c = Counter(bf+af)
    
    # Most common word directly before or after 'lake', taken from the 10-grams.
    print(c.most_common()[0])
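
Note that before_after only inspects positions 0 and 2 of each n-gram, so even with 10-grams it still looks exactly one word to either side. If the intent of a larger n is to look further away, one option is a window-based count. The sketch below is only an assumption about that intent, not part of the answer above: `normalize`, `window_counts`, the `window` parameter and the sample sentence are all hypothetical, and it ignores line boundaries, unlike ngrammer.

```python
import string
from collections import Counter


def normalize(token):
    """Lowercase a token and strip leading/trailing punctuation."""
    return token.strip(string.punctuation).lower()


def window_counts(text, word, window=2):
    """Count tokens found up to `window` positions before and after `word`."""
    tokens = [t for t in (normalize(raw) for raw in text.split()) if t]
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        before.update(tokens[max(0, i - window):i])   # up to `window` tokens to the left
        after.update(tokens[i + 1:i + 1 + window])    # up to `window` tokens to the right
    return before, after


# Hypothetical sample sentence.
sample = "Lake Tahoe is a lake. The lake, a deep lake, is cold."
before, after = window_counts(sample, "lake", window=2)
print(before.most_common(3))
print(after.most_common(3))
```

With window=1 this reduces to the directly adjacent words. Because tokens are lowercased, "Lake" and "lake" are counted as the same word here, which the answers above do not do.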
    

    【Comments】:
