【Question title】: How to create paragraphs from Markov chain output?
【Posted】: 2012-10-20 21:13:00
【Question】:

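I want to modify the script below so that it builds paragraphs out of a random number of the sentences it generates. In other words, join a random number of sentences (say 1-5) before adding a newline.

The script works fine as it is, but the output is short sentences separated by newlines. I would like to gather some of those sentences into paragraphs.

Any thoughts on best practice? Thanks.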

"""
    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
"""

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" # String used to separate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s%s" % (" ".join(sentence), newword, sentencesep))
        sentence = []
        sentencecount += 1
    else:
        sentence.append(newword)
    w1, w2 = w2, newword

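Edit 01:

OK, I cobbled together a simple "paragraph wrapper" that does a good job of gathering sentences into paragraphs, but it interferes with the output of the sentence generator: for example, the first word is repeated far too often, among other problems.

The premise is sound, though; I just need to work out why adding the paragraph loop affects the behavior of the sentence loop. Please let me know if you can spot the problem: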

###
#    usage: $ python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
paragraphsep  = "\n\n" # String used to separate paragraphs


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE PARAGRAPH OUTPUT
maxparagraphs = 10
paragraphs = 0 # reset the outer 'while' loop counter to zero

while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0 # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1,5) # random sentences per paragraph

    while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)]) # random word from word table
        if newword == stopword: sys.exit()
        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentencecount += 1 # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print (paragraphsep) # newline space
    paragraphs = paragraphs + 1 # increment the paragraph counter


# EOF

Edit 02:

Per the answer below, I added sentence = [] to the elif clause. To wit:

        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = [] # I have to be here to make the new sentence start as an empty list!!!
            sentencecount += 1 # increment the sentence counter
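
To see why that one-line reset matters, here is a minimal sketch that isolates the sentence buffer from the Markov table. The function name `emit_sentences` and its `reset` flag are hypothetical, introduced only for this illustration; the token list stands in for the words the chain would produce.

```python
def emit_sentences(tokens, reset=True, stops=(".", "!", "?")):
    """Join tokens into sentences; optionally skip the buffer reset."""
    out = []
    sentence = []
    for tok in tokens:
        if tok in stops:
            out.append(" ".join(sentence) + tok)
            if reset:
                sentence = []  # start the next sentence from an empty list
        else:
            sentence.append(tok)
    return out


tokens = ["I", "ran", ".", "You", "sat", "."]
print(emit_sentences(tokens, reset=True))   # ['I ran.', 'You sat.']
print(emit_sentences(tokens, reset=False))  # ['I ran.', 'I ran You sat.']
```

Without the reset, every sentence is printed with all the earlier words still in front of it, which matches the repeated-opening symptom described in Edit 01.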

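Edit 03:

This is the final iteration of the script. Thanks to grieve for helping sort this out. I hope others can have some fun with it; I know I will. ;)

FYI: there is one small artifact. If you use this script, you may want to clean up an extra space at the end of each paragraph. Other than that, it works well as a Markov chain text generator.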

###
#    usage: python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" # String used to separate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []
paragraphsep = "\n"
count = random.randrange(1,5) # sentences left in this paragraph (1 to 4; randrange excludes the upper bound)

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)]) # random word from word table
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1 # increment the sentence counter
        count -= 1
        if count == 0:
            count = random.randrange(1,5)
            print (paragraphsep) # newline space
    else:
        sentence.append(newword)
    w1, w2 = w2, newword


# EOF
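
One way to avoid the extra trailing space mentioned above is to collect the finished sentences in a list and join them, instead of printing each one with `end=" "`. The sketch below assumes the sentences have already been generated; `group_into_paragraphs` and its parameters are hypothetical names, not part of the original recipe. It uses `random.randint`, which includes both endpoints, so `randint(1, 5)` yields 1 to 5 sentences, whereas `random.randrange(1, 5)` stops at 4.

```python
import random


def group_into_paragraphs(sentences, min_len=1, max_len=5, rng=random):
    """Split a list of sentences into paragraphs of random length."""
    paragraphs = []
    i = 0
    while i < len(sentences):
        n = rng.randint(min_len, max_len)  # inclusive on both ends
        # join() puts a space only between sentences, never after the last one
        paragraphs.append(" ".join(sentences[i:i + n]))
        i += n
    return paragraphs


sentences = ["One.", "Two.", "Three.", "Four.", "Five.", "Six.", "Seven."]
print("\n\n".join(group_into_paragraphs(sentences)))
```

Passing a seeded `random.Random` instance as `rng` makes the grouping reproducible, which can help when comparing output between runs.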

【Comments】:

    Tags: python markov-chains


    【Answer 1】:

    You need to copy

    sentence = [] 
    

    back into the

    elif newword in stopsentence:
    

    clause.

    So

    while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
        w1 = stopword
        w2 = stopword
        stopsentence = (".", "!", "?",)
        sentence = []
        sentencecount = 0 # reset the inner 'while' loop counter to zero
        maxsentences = random.randrange(1,5) # random sentences per paragraph
    
        while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
            newword = random.choice(table[(w1, w2)]) # random word from word table
            if newword == stopword: sys.exit()
            elif newword in stopsentence:
                print ("%s%s" % (" ".join(sentence), newword), end=" ")
                sentence = [] # I have to be here to make the new sentence start as an empty list!!!
                sentencecount += 1 # increment the sentence counter
            else:
                sentence.append(newword)
            w1, w2 = w2, newword
        print (paragraphsep) # newline space
        paragraphs = paragraphs + 1 # increment the paragraph counter
    

    Edit

    Here is a solution that does not use an outer loop.

    """
        from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
    """
    
    import random
    import sys
    
    stopword = "\n" # Since we split on whitespace, this can never be a word
    stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
    sentencesep  = "\n" # String used to separate sentences
    
    
    # GENERATE TABLE
    w1 = stopword
    w2 = stopword
    table = {}
    
    for line in sys.stdin:
        for word in line.split():
            if word[-1] in stopsentence:
                table.setdefault( (w1, w2), [] ).append(word[0:-1])
                w1, w2 = w2, word[0:-1]
                word = word[-1]
            table.setdefault( (w1, w2), [] ).append(word)
            w1, w2 = w2, word
    # Mark the end of the file
    table.setdefault( (w1, w2), [] ).append(stopword)
    
    # GENERATE SENTENCE OUTPUT
    maxsentences  = 20
    
    w1 = stopword
    w2 = stopword
    sentencecount = 0
    sentence = []
    paragraphsep = "\n\n" # assignment, not comparison; "==" here raises NameError
    count = random.randrange(1,5)
    
    while sentencecount < maxsentences:
        newword = random.choice(table[(w1, w2)])
        if newword == stopword: sys.exit()
        if newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = []
            sentencecount += 1
            count -= 1
            if count == 0:
                count = random.randrange(1,5)
                print (paragraphsep)
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
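
A likely reason the single-loop version also cures the repeated opening word: the key `(stopword, stopword)` is only ever used once while the table is built, at the very start of the input, so it has exactly one successor. Resetting `w1` and `w2` to `stopword` for every paragraph therefore restarts the chain from the first word of the corpus each time. The sketch below illustrates this on a tiny input; `build_table`, `STOP`, and `STOPSENTENCE` are hypothetical names that mirror the table-building loop above.

```python
STOP = "\n"
STOPSENTENCE = (".", "!", "?")


def build_table(text):
    """Build the (w1, w2) -> [next words] table the same way the script does."""
    table = {}
    w1 = w2 = STOP
    for word in text.split():
        if word[-1] in STOPSENTENCE:
            # split "sat." into the word "sat" and the punctuation token "."
            table.setdefault((w1, w2), []).append(word[:-1])
            w1, w2 = w2, word[:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
    table.setdefault((w1, w2), []).append(STOP)  # mark the end of the input
    return table


table = build_table("The cat sat. A dog ran. The end.")
print(table[(STOP, STOP)])   # ['The']
print(table[("sat", ".")])   # ['A']
```

Carrying `w1, w2` over from the previous sentence, as the single-loop code does, lets each new sentence start from whatever followed a sentence-ending token in the input.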
    

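    【Comments】:

    • Oops! Yes, I must have pulled it out at some point and forgotten to put it back. Thanks for the insight! That worked, almost. It seems the sentence loop reuses the same starting word for every sentence. Any ideas on how to vary the first word it picks when generating a sentence?
    • I added a separate solution that does not need the outer loop.
    • I don't have Python 3 installed at the moment, so you may need to adjust the syntax of the second solution.
    • Sweet. Thanks, grieve! This works perfectly. It needed a few minor edits, but nothing major. See the original post for the final code. I really appreciate it; I was pulling my hair out. Very nice work.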
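    【Answer 2】:

    Do you understand this code? I bet you can find the part that prints a sentence and change it to print several sentences together, without a line break between them. You can then add another while loop around the sentence part to get multiple paragraphs.

    Syntax hint: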

    print 'hello'
    print 'there'
    hello
    there
    
    print 'hello',
    print 'there'
    hello there
    
    print 'hello',
    print 
    print 'there'
    

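    The key is that a trailing comma at the end of a print statement (Python 2 syntax) suppresses the newline at the end of the line, and a bare print statement prints just a newline.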
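
Since the scripts in the question use the Python 3 `print()` function, the same hint can be sketched there with the `end` keyword argument. The wrapper function `demo` and the output capture are only for illustration.

```python
import io
from contextlib import redirect_stdout


def demo():
    print("hello", end=" ")  # end=" " replaces the newline with a space
    print("there")           # the default end="\n" finishes the line
    print()                  # a bare print() emits just a newline
    print("again")


buf = io.StringIO()
with redirect_stdout(buf):
    demo()
print(repr(buf.getvalue()))  # 'hello there\n\nagain\n'
```

The blank line produced by the bare `print()` is what separates one paragraph from the next.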

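    【Comments】:

    • Yes, I follow. The problem is that nothing I have tried with the print statement helps gather the sentences into paragraphs (unless you count stripping out *all* the newlines, which produces one giant paragraph). A while loop was my thought too, but I was not sure how to wrap the sentence part. Everything I tried led to various errors, so I figured I would ask the experts. What is the best way to tell it "generate x (e.g. 1-5) sentences, then insert a newline, then repeat until maxsentences is reached"?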