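【Posted】: 2014-09-26 08:31:53
【Question】:
I have written the following code to implement the phrase extraction algorithm for SMT (statistical machine translation).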
# -*- coding: utf-8 -*-

def phrase_extraction(srctext, trgtext, alignment):
    """
    Phrase extraction algorithm.
    """
    def extract(f_start, f_end, e_start, e_end):
        phrases = set()
        # return { } if f end == 0
        if f_end == 0:
            return
        # for all (e,f) ∈ A do
        for e,f in alignment:
            # return { } if e < e start or e > e end
            if e < e_start or e > e_end:
                return

        fs = f_start
        # repeat-
        while True:
            fe = f_end
            # repeat-
            while True:
                # add phrase pair ( e start .. e end , f s .. f e ) to set E
                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe))
                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end))
                phrases.add("\t".join([src_phrase, trg_phrase]))
                fe+=1 # fe++
                # -until fe aligned
                if fe in f_aligned or fe > trglen:
                    break
            fs-=1 # fs--
            # -until fs aligned
            if fs in f_aligned or fs < 0:
                break
        return phrases

    # Calculate no. of tokens in source and target texts.
    srctext = srctext.split()
    trgtext = trgtext.split()
    srclen = len(srctext)
    trglen = len(trgtext)
    # Keeps an index of which source/target words are aligned.
    e_aligned = [i for i,_ in alignment]
    f_aligned = [j for _,j in alignment]

    bp = set() # set of phrase pairs BP
    # for e start = 1 ... length(e) do
    for e_start in range(srclen):
        # for e end = e start ... length(e) do
        for e_end in range(e_start, srclen):
            # // find the minimally matching foreign phrase
            # (f start , f end ) = ( length(f), 0 )
            f_start, f_end = trglen, 0
            # for all (e,f) ∈ A do
            for e,f in alignment:
                # if e start ≤ e ≤ e end then
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            # add extract (f start , f end , e start , e end ) to set BP
            phrases = extract(f_start, f_end, e_start, e_end)
            if phrases:
                bp.update(phrases)
    return bp

srctext = "michael assumes that he will stay in the house"
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]

phrases = phrase_extraction(srctext, trgtext, alignment)

for i in phrases:
    print(i)
The phrase extraction algorithm on page 133 of Philipp Koehn's book Statistical Machine Translation looks like this:
The desired output should be:
But with my code, I only get these outputs:
michael assumes that he will stay in the - michael geht davon aus , dass er im haus
michael assumes that he will stay in the - michael geht davon aus , dass er im haus bleibt
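Does anyone see what is wrong with my implementation?
The code does extract phrases, but not the complete desired output shown in the translation table above.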
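To see where the candidate source spans end up, the two early exits in extract can be tallied without running the phrase-building loops. This is only an illustrative sketch: the helper name count_exit_paths and its labels are made up here and are not part of the posted code. It just mirrors the two return checks as they are written above.

```python
def count_exit_paths(src_len, trg_len, alignment):
    """Tally how each (e_start, e_end) span would leave the posted extract().

    Hypothetical helper for illustration only: it mirrors the two
    early-return checks of the posted code and builds no phrases.
    """
    counts = {"f_end == 0": 0, "alignment point outside span": 0, "reaches loop": 0}
    for e_start in range(src_len):
        for e_end in range(e_start, src_len):
            # Same minimal target span search as in the posted main loop.
            f_start, f_end = trg_len, 0
            for e, f in alignment:
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            if f_end == 0:
                # First early return in extract().
                counts["f_end == 0"] += 1
            elif any(e < e_start or e > e_end for e, _ in alignment):
                # Second early return: checked against ALL alignment points.
                counts["alignment point outside span"] += 1
            else:
                counts["reaches loop"] += 1
    return counts


alignment = [(0, 0), (1, 1), (1, 2), (1, 3), (2, 5), (3, 6),
             (4, 9), (5, 9), (6, 7), (7, 7), (8, 8)]
counts = count_exit_paths(9, 10, alignment)  # 9 source tokens, 10 target tokens
for label in ("f_end == 0", "alignment point outside span", "reaches loop"):
    print("%s: %d" % (label, counts[label]))
```

On the example this gives 1, 43 and 1. Only the span covering the whole source sentence gets past both checks, which matches the two pairs listed above. The single "f_end == 0" case is the span (0, 0): with 0-based indices, 0 is a valid target position, so that check also rejects a legitimately aligned word.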
【Comments】:
-
Why the downvotes? It was on codereview, and I have no idea why it ended up here...
-
Then why are you asking "what is wrong with my implementation"? Running is not the same as working: if the output is wrong, this is non-working code.
-
@alvas If it does not produce the correct output, it is obviously not working.
-
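It may look right, but if the output is wrong, then either the input is wrong or the algorithm is wrong, and I really hope you double-checked the input!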
-
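A trace of a run on the example shows that extract is called 45 times and finds two phrases. Most of those calls never reach the point where phrases are generated. That points to either wrong arguments (too few pass the checks) or wrong logic in the early returns. I also notice that the algorithm uses 1-based indexing while the code is 0-based. Suggestion: write tests to prove that the different parts work as expected...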
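The two suspicions in the last comment (the early-return logic and the 1-based vs 0-based indexing) can be explored with a sketch like the one below. It rests on assumptions that are not stated in the post: that the consistency check should only reject alignment points whose target index falls inside the candidate target span, and that both spans are inclusive, so slices need end + 1. The name extract_phrases and the tuple output format are illustrative choices, not the asker's API and not necessarily the book's exact algorithm.

```python
def extract_phrases(src_sentence, trg_sentence, alignment):
    """Sketch of consistent phrase-pair extraction with 0-based, inclusive spans.

    Assumption (not from the post): a target span is rejected only if one of
    its own words is aligned to a source word outside the source span.
    """
    src = src_sentence.split()
    trg = trg_sentence.split()
    f_aligned = {f for _, f in alignment}  # target positions with any alignment
    pairs = set()
    for e_start in range(len(src)):
        for e_end in range(e_start, len(src)):
            # Minimal target span covering everything aligned to the source span.
            fs_in_span = [f for e, f in alignment if e_start <= e <= e_end]
            if not fs_in_span:
                continue  # no alignment point at all, nothing to extract
            f_start, f_end = min(fs_in_span), max(fs_in_span)
            # Consistency: only alignment points INSIDE the target span matter.
            if any(f_start <= f <= f_end and not (e_start <= e <= e_end)
                   for e, f in alignment):
                continue
            # Extend over unaligned neighbouring target words.
            fs = f_start
            while True:
                fe = f_end
                while True:
                    # Inclusive spans, hence the +1 in both slices.
                    pairs.add((" ".join(src[e_start:e_end + 1]),
                               " ".join(trg[fs:fe + 1])))
                    fe += 1
                    if fe >= len(trg) or fe in f_aligned:
                        break
                fs -= 1
                if fs < 0 or fs in f_aligned:
                    break
    return pairs


srctext = "michael assumes that he will stay in the house"
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0, 0), (1, 1), (1, 2), (1, 3), (2, 5), (3, 6),
             (4, 9), (5, 9), (6, 7), (7, 7), (8, 8)]
pairs = extract_phrases(srctext, trgtext, alignment)
for src_phrase, trg_phrase in sorted(pairs):
    print("%s - %s" % (src_phrase, trg_phrase))
```

On the example sentence pair this yields, among others, "assumes - geht davon aus", "assumes - geht davon aus ," and "will stay - bleibt". "will" on its own is skipped, because "bleibt" is also aligned to "stay".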
Tags: python algorithm machine-learning nlp machine-translation