【Question title】: merge and split synonym word list
【Posted】: 2015-08-05 21:41:12
【Question】:

(I am trying to update a hunspell spelling dictionary.) My synonym file looks like this:

mylist="""
specimen|3 
sample
prototype
example
sample|3
prototype
example
specimen
prototype|3
example
specimen
sample
example|3 
specimen
sample
prototype
prototype|1
illustration
"""

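The first step is to merge duplicated words. In the example above, the headword "prototype" appears twice, so its two entries need to be combined. Because the synonym "illustration" is added, its count changes from 3 to 4.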

specimen|3 
sample
prototype
example
sample|3
prototype
example
specimen
prototype|4
example
specimen
sample
illustration
example|3 
specimen
sample
prototype
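
Step 1 as described above can be sketched roughly as follows. This is only a minimal illustration, not a finished tool: the helper names `merge_duplicates` and `format_entries` are made up for this example, and it assumes that every entry starts with a `word|count` header line followed by one synonym per line. The counts in the input are ignored and recomputed from the merged lists.

```python
from collections import OrderedDict


def merge_duplicates(text):
    """Merge entries that share a headword, keeping first-seen order."""
    entries = OrderedDict()  # headword -> list of unique synonyms
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            # Header line "word|count": the count is recomputed on output.
            current = line.split("|")[0]
            entries.setdefault(current, [])
        elif current is not None and line not in entries[current]:
            entries[current].append(line)
    return entries


def format_entries(entries):
    """Render the entries back into the "word|count" layout."""
    lines = []
    for word, synonyms in entries.items():
        lines.append("{}|{}".format(word, len(synonyms)))
        lines.extend(synonyms)
    return "\n".join(lines)


# Shortened, hypothetical input with a repeated "prototype" header.
sample = """
prototype|3
example
specimen
sample
example|1
specimen
prototype|1
illustration
"""

merged = merge_duplicates(sample)
print(format_entries(merged))
```

Because the second `prototype` header reuses the existing list, "illustration" is appended to the first entry and the printed count for "prototype" becomes 4.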

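The second step is more complicated. Merging duplicates is not enough: the added word must also be propagated to the linked words. In this case, I need to search the synonym lists for "prototype" and, wherever it is found, add the word "illustration". The final word list would look like this: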

specimen|4
sample
prototype
example
illustration
sample|4
prototype
example
specimen
illustration
prototype|4
example
specimen
sample
illustration
example|4 
specimen
sample
prototype
illustration

A new entry for "illustration" should also be added to the original list, containing all 4 linked words.

illustration|4
example
specimen
sample
prototype
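
Step 2 amounts to giving every word all the words it is linked to, directly or indirectly. One possible way to sketch that in plain Python is a breadth-first search over the synonym links. The function name `expand_synonyms` and the shortened input are assumptions made for this illustration, and the input is expected to be already merged (one list of synonyms per headword).

```python
from collections import OrderedDict, deque


def expand_synonyms(entries):
    """Give every word all words reachable through synonym links."""
    # Undirected adjacency map; OrderedDict keeps first-seen order.
    adjacency = OrderedDict()
    for word, synonyms in entries.items():
        adjacency.setdefault(word, [])
        for syn in synonyms:
            adjacency.setdefault(syn, [])
            if syn not in adjacency[word]:
                adjacency[word].append(syn)
            if word not in adjacency[syn]:
                adjacency[syn].append(word)

    expanded = OrderedDict()
    for start in adjacency:
        # Breadth-first search collects the whole linked group.
        # A list is used for "seen" to keep a stable order (fine for small files).
        seen = [start]
        queue = deque([start])
        while queue:
            for nxt in adjacency[queue.popleft()]:
                if nxt not in seen:
                    seen.append(nxt)
                    queue.append(nxt)
        expanded[start] = seen[1:]  # drop the word itself
    return expanded


# Hypothetical, already merged input (step 1 done).
entries = OrderedDict([
    ("specimen", ["sample", "prototype", "example"]),
    ("prototype", ["example", "specimen", "sample", "illustration"]),
])

result = expand_synonyms(entries)
for word, synonyms in result.items():
    print("{}|{}".format(word, len(synonyms)))
    print("\n".join(synonyms))
```

Words that only ever appear as synonyms, such as "illustration" here, also get their own entry, which matches the last requirement above.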

What I have tried:

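import io

myfile = io.StringIO(mylist)
for lineno, line in enumerate(myfile):
    parts = line.split("|")
    try:
        # header lines look like "word|count"; all other lines raise and are skipped
        if int(parts[1]) > 0:
            print(lineno, parts[0], int(parts[1]))
    except (IndexError, ValueError):
        pass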

The code above prints each headword together with its line number and count.

1 specimen 3
5 sample 3
9 prototype 3
13 example 3
17 prototype 1

This means I need to merge the 1 word on line 18 with the entry found on line 9 ("prototype"), as its 4th word. If I can do that, I will have completed step 1 of the task.

【Comments】:

    Tags: python


    【Solution 1】:

    Use a graph for this:

    mylist="""
    specimen|3 
    sample
    prototype
    example
    sample|3
    prototype
    example
    specimen
    prototype|3
    example
    specimen
    sample
    example|3 
    specimen
    sample
    prototype
    prototype|1
    illustration
    specimen|1
    cat
    happy|2
    glad
    cheerful 
    """
    
    
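    import networkx as nx

    G = nx.Graph()
    nodes = []  # every word, in order of first appearance

    for line in mylist.strip().splitlines():
        line = line.strip()  # drop indentation and trailing spaces
        if '|' in line:
            # header line "word|count": start a new entry
            node, _ = line.split('|')
            if node not in nodes:
                nodes.append(node)
            G.add_node(node)
        else:
            # synonym line: link it to the current headword
            G.add_edge(node, line)
            if line not in nodes:
                nodes.append(line)

    for node in nodes:
        neighbors = list(G.neighbors(node))
        non_neighbors = []
        for non_nb in nx.non_neighbors(G, node):
            try:
                # a path means the word is an indirect synonym
                if nx.bidirectional_dijkstra(G, node, non_nb):
                    non_neighbors.append(non_nb)
            except nx.NetworkXNoPath:
                # the word belongs to a separate group
                pass

        syns = neighbors + non_neighbors

        print('{}|{}'.format(node, len(syns)))
        print('\n'.join(syns))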
    

    Output (the order of the synonyms within an entry may differ):

    specimen|5
    sample
    prototype
    example
    cat
    illustration
    sample|5
    specimen
    prototype
    example
    illustration
    cat
    prototype|5
    sample
    specimen
    example
    illustration
    cat
    example|5
    sample
    specimen
    prototype
    illustration
    cat
    illustration|5
    prototype
    specimen
    cat
    sample
    example
    cat|5
    specimen
    illustration
    sample
    prototype
    example
    happy|2
    cheerful
    glad
    glad|2
    happy
    cheerful
    cheerful|2
    happy
    glad
    

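    The graph will look like this (the image is not reproduced here).

    【Comments】:

    • I get an error, "NetworkXNoPath: No path between sample and happy.", when I add a word that is not related to any other word, e.g. happy|2 glad cheerful
    • @shantanuo Ah! That will produce 2 separate graphs; the answer has been updated.
    【Solution 2】:

    The problem you describe is a classic Union-Find problem, which can be solved with a disjoint-set algorithm. Don't reinvent the wheel.

    Read up on union-find / disjoint sets: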

    http://en.wikipedia.org/wiki/Disjoint-set_data_structure

    Or see these questions:

    A set union find algorithm

    Union find implementation using Python

    class DisjointSet(object):
        def __init__(self):
            self.leader = {}  # maps a member to the group's leader
            self.group = {}  # maps a group leader to the group (which is a set)

        def add(self, a, b):
            leadera = self.leader.get(a)
            leaderb = self.leader.get(b)
            if leadera is not None:
                if leaderb is not None:
                    if leadera == leaderb:
                        return  # nothing to do
                    groupa = self.group[leadera]
                    groupb = self.group[leaderb]
                    if len(groupa) < len(groupb):
                        a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                    groupa |= groupb
                    del self.group[leaderb]
                    for k in groupb:
                        self.leader[k] = leadera
                else:
                    self.group[leadera].add(b)
                    self.leader[b] = leadera
            else:
                if leaderb is not None:
                    self.group[leaderb].add(a)
                    self.leader[a] = leaderb
                else:
                    self.leader[a] = self.leader[b] = a
                    self.group[a] = set([a, b])
    
    mylist="""
    specimen|3 
    sample
    prototype
    example
    sample|3
    prototype
    example
    specimen
    prototype|3
    example
    specimen
    sample
    example|3 
    specimen
    sample
    prototype
    prototype|1
    illustration
    specimen|1
    cat
    happy|2
    glad
    cheerful 
    """
    ds = DisjointSet()
    for line in mylist.strip().splitlines():
        line = line.strip()  # drop indentation and trailing spaces
        if '|' in line:
            node, _ = line.split('|')
        else:
            ds.add(node, line)

    for group in ds.group.values():
        print(sorted(group))

    Output:

    ['cat', 'example', 'illustration', 'prototype', 'sample', 'specimen']
    ['cheerful', 'glad', 'happy']
    

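    Dijkstra's algorithm can solve the problem, but I think it is overkill: you don't actually need the shortest distance between nodes, only the connected components of the graph.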
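
Once the connected components are known, they still have to be written back in the `word|count` layout from the question. A rough sketch of that last step is shown below. The function name `groups_to_entries` is made up for this example, and the members are sorted only to make the output stable, which the original file does not require.

```python
def groups_to_entries(groups):
    """Turn each synonym group into one "word|count" entry per member."""
    lines = []
    for group in groups:
        members = sorted(group)  # sorted only to get a stable order
        for word in members:
            # every other member of the group is a synonym of this word
            others = [w for w in members if w != word]
            lines.append("{}|{}".format(word, len(others)))
            lines.extend(others)
    return "\n".join(lines)


# Hypothetical input: one group, e.g. one of the sets produced above.
groups = [{"happy", "glad", "cheerful"}]
text = groups_to_entries(groups)
print(text)
```

Any iterable of sets can be passed in, for example `ds.group.values()` from the disjoint-set code above.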
