如何拆分多个连接词？答案

【问题标题】：How can I split multiple joined words?如何拆分多个连接词？
【发布时间】：2010-09-16 17:42:39
【问题描述】：

我有一个包含 1000 个左右条目的数组，示例如下：

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

我希望能够将这些拆分成各自的词，如：

wicked weather
liquid weather
drive our trucks
go compact
slim projector

我希望正则表达式可以解决问题。但是，由于没有边界可以停止，也没有任何我可以键入的大写字母，我在想，可能需要某种对字典的引用？

我想它可以手动完成，但为什么 - 什么时候可以用代码完成！ =) 但这让我很难过。有任何想法吗？

【问题讨论】：

请注意，天真的实现会返回“wick ed weather”
嘿最佳解决方案，我看到了您对 EMR 问题的回复，想知道是否可以就医疗保健 IT 方面的问题与您联系？
另请参阅stackoverflow.com/questions/11447859/…上的 Python 和 Ruby 实现

标签： string nlp

【解决方案1】：

Viterbi algorithm 更快。它计算与上述 Dmitry 答案中的递归搜索相同的分数，但在 O(n) 时间内。（Dmitry 的搜索需要指数级的时间；Viterbi 是通过动态规划来完成的。）

import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower()) 
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

测试它：

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

为了实用，您可能需要一些改进：

添加概率日志，不要乘以概率。这样可以避免浮点下溢。
您的输入通常会使用语料库中没有的词。必须为这些子字符串分配非零概率作为单词，否则您最终将没有解决方案或错误的解决方案。（对于上面的指数搜索算法也是如此。）这个概率必须从语料库单词的概率中提取出来，并合理地分布在所有其他候选词中：一般主题在统计语言模型中被称为平滑。（不过，你可以通过一些相当粗略的技巧来摆脱困境。）这就是 O(n) Viterbi 算法击败搜索算法的地方，因为考虑非语料库单词会破坏分支因子。

【讨论】：

@mj_，它在 Python 中，range(low, high) 表示 [low, low+1, ..., high-1]——即不包括上限。 j in range(0, 1) 的循环仅查看 j=0。
你是对的，它不会回溯。它确实产生了总分最高的分割（单个词概率的乘积）。我不清楚您遇到了什么具体问题-您的字典中有“思考”，而您却得到“它清楚地思考”？通过转到您正在查看条件概率的二阶模型：P(word|previous_word)，可以改进单个词频率的结果，但代价是显着更大的计算量。请参阅norvig.com/ngrams（我应该通过链接更新答案）。
为了获得完美的结果，您将需要更多改进：1）您需要为非单词返回一个非常小的概率而不是零，否则如果短语的某些部分永远不会出现在字典中它会破坏整个短语的结果。我使用 (log(1/total)-max_word_len-1)*(j-i) 作为非单词概率的对数（这也会惩罚较长的非单词以防止耗尽有效单词）。 2）你需要在probs中保留(-non_words_len, -non_words_count, prob)的元组，以最小化无法识别的序列并合并相邻的非单词块。
这个答案就像 StackOverflow 的“最佳”。
@dawg 只是任何一个充满英语单词的大文本文件，可以用来构建字典。比如说，你可能会使用古腾堡计划中的《战争与和平》。

【解决方案2】：

人类能做到吗？

远方袋远侧袋背袋远侧袋

您不仅必须使用字典，还可能必须使用统计方法来找出最有可能的内容（或者，上帝保佑，您选择的人类语言的实际 HMM...）

关于如何进行可能有用的统计，我请您咨询 Peter Norvig 博士，他在 21 行代码中解决了一个不同但相关的拼写检查问题： http://norvig.com/spell-correct.html

（他确实通过将每个 for 循环折叠成一行来作弊……但仍然如此）。

更新这件事卡在我的脑海里，所以我今天不得不生下它。此代码与 Robert Gamble 描述的代码进行了类似的拆分，但随后它根据提供的字典文件中的词频对结果进行排序（现在预计该文件通常是代表您的域或英语的一些文本。我使用了 big来自 Norvig 的 .txt，上面有链接，并为它添加了一本字典，以覆盖缺失的单词）。

除非频率差异很大，否则两个词的组合在大多数情况下会胜过三个词的组合。

我在我的博客上发布了这段代码并做了一些小的改动

http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ 并且还写了一些关于这段代码中的下溢错误。我很想悄悄地修复它，但认为这可能会帮助一些以前没有看过日志技巧的人： http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/

输出你的话，加上我自己的话——注意“orcore”会发生什么：

perl splitwords.pl big.txt 单词 answerveal：2种可能性 - 回答小牛肉 - 回答ve al 邪恶的天气：4种可能性 - 恶劣的天气 - 邪恶的我们在她 - 邪恶的天气 - 邪恶的我们在她液态天气：6种可能性 - 液态天气 - 我们对她的液体 - li quid 天气 - 我对她说我们 - li qu id 天气 - li qu id we at her driveourtrucks: 1 种可能性 - 驾驶我们的卡车 gocompact：1 种可能性 - 紧凑超薄投影仪：2 种可能性 - 超薄投影仪 - 苗条的项目或 orcore：3种可能性 - 或核心 - 或核心 - 兽人矿石

代码：

#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);


      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  {
          [ $_, find_word_seq_score(@$_) ]
        }
        @$arr;


      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}


sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return  @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
      push @$found_parses, $words_sofar;

      return @$found_parses if wantarray;
      return  $found_parses;
    }

    foreach my $i ( 2..$length ){
      my $prefix = substr($string, 0, $i);
      my $suffix = substr($string, $i, $length-$i);

      if( exists $DICT{$prefix} ){
        my @words = ( @$words_sofar, $prefix );
        find_matches_rec( $suffix, @words, @$found_parses );
      }
    }

    return @$found_parses if wantarray;
    return  $found_parses;
}


## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/= undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
      foreach ( split(/\b/, $_) ) {
        $dict{$_} += 1;
        $total++;
      }
    }

    close $DICT;

    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word,  @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
      print ' -  ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}

【讨论】：

这可以在 Windows XP 上运行吗？如何加载 Perl。我显然需要更多（就其他语言而言）！ :)
是的，您正在寻找一个叫做 ActivePerl 的东西，它是 Windows 发行版。我没有使用任何模块，因此您无需在标准构建中添加任何内容。找一本好的有代表性的词典就行了。
+1 - 我不知道 Perl，但我给了你 +1 以超越职责要求。不错！
我修改了代码以尝试使其更易于维护。虽然一开始还不错。
这段代码显然有一个缺陷，它只给出了第二个最可能的结果。

【解决方案3】：

pip install wordninja

>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']

【讨论】：

最好提一下产生上述wordninja包的答案：stackoverflow.com/a/11642687/8291949
你可能会提到这是一个稳定包形式的维特比。

【解决方案4】：

这里最好的工具是递归，而不是正则表达式。基本思想是从字符串的开头开始寻找一个单词，然后取出字符串的其余部分并寻找另一个单词，依此类推，直到到达字符串的末尾。递归解决方案是自然的，因为当字符串的给定剩余部分无法分解为一组单词时，需要进行回溯。下面的解决方案使用字典来确定什么是单词，并在找到它们时打印出解决方案（一些字符串可以分解为多个可能的单词集，例如 wickedweather 可以解析为“wicked we at her”）。如果您只想要一组单词，则需要确定选择最佳组的规则，可能通过选择单词数量最少的解决方案或设置最小单词长度。

#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));

    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}

【讨论】：

没有运行它，但它看起来比 BKB 的解决方案更好，因为它产生了所有可能性。
这就像魔术一样。正是我一直在寻找的东西，非常感谢。我正在尝试翻译成 PHP。如果有PHP版本，请在这里分享。

【解决方案5】：

我认为您认为这不是正则表达式的真正工作是正确的。我会使用字典的想法来解决这个问题——在字典中查找最长的前缀。当你找到它时，把它剪掉，然后对字符串的其余部分做同样的事情。

上述方法存在歧义，例如“drivereallyfast”会先找到“driver”，然后“eallyfast”会出现问题。因此，如果遇到这种情况，您还必须进行一些回溯。或者，由于您没有那么多要拆分的字符串，因此只需手动执行自动拆分失败的字符串。

【讨论】：

必须找到一个字典文件来匹配。
谢谢！我要把这个和那个 Perl 放在一起，看看会发生什么。

【解决方案6】：

这与称为标识符拆分或标识符名称标记化的问题有关。在 OP 的情况下，输入似乎是普通单词的串联；在标识符拆分中，输入是源代码中的类名、函数名或其他标识符，问题更难。我意识到这是一个老问题，OP 要么解决了他们的问题，要么继续前进，但如果其他人在寻找标识符拆分器时遇到这个问题（就像我不久前一样），我想提供 @987654321 @ ("标识符分割器：一个库")。它是用 Python 编写的，但带有一个命令行实用程序，可以读取标识符文件（每行一个）并拆分每个标识符。

拆分标识符看似困难。程序员在命名事物时通常使用缩写、首字母缩略词和单词片段，而且他们并不总是使用一致的约定。即使在标识符确实遵循某些约定（例如驼峰式大小写）时，也会出现歧义。

Spiral 实现了许多标识符拆分算法，包括一种称为 Ronin 的新算法。它使用从挖掘源代码存储库中获得的各种启发式规则、英语词典和令牌频率表。 Ronin 可以拆分不使用驼峰式或其他命名约定的标识符，包括将J2SEProjectTypeProfiler拆分为[J2SE、Project、Type、Profiler]等情况，这需要读者识别@987654329 @ 为单位。以下是 Ronin 可以拆分的更多示例：

# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']

使用 OP 问题中的示例：

# spiral wickedweather liquidweather  driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']

如您所见，它并不完美。值得注意的是，Ronin 有许多参数，调整它们也可以拆分driveourtrucks，但代价是程序标识符的性能恶化。

更多信息可以在GitHub repo for Spiral找到。

【讨论】：

【解决方案7】：

好吧，问题本身并不能仅用正则表达式来解决。一个解决方案（可能不是最好的）是获取字典并为字典中的每个工作与列表中的每个单词进行正则表达式匹配，只要成功就添加空格。当然，这不会非常快，但它会很容易编程并且比手工操作更快。

【讨论】：

【解决方案8】：

需要基于字典的解决方案。如果您的字典有限，这可能会有所简化，否则构成其他单词前缀的单词将是一个问题。

【讨论】：

【解决方案9】：

Santhosh thottingal 发布了一个名为 mlmorph 的 python 包，可用于形态分析。

https://pypi.org/project/mlmorph/

例子：

from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")

给予

[('കേരളം<np><genitive>', 179)]

他还写了一篇关于该主题的博客https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/

【讨论】：

【解决方案10】：

一个简单的 Python 解决方案：安装 wordsegment 包：pip install wordsegment。

$ echo thisisatest | python -m wordsegment
this is a test

【讨论】：

【解决方案11】：

其中一种解决方案可以是递归（同样可以转换为动态编程）：

static List<String> wordBreak(
    String input,
    Set<String> dictionary
) {

  List<List<String>> result = new ArrayList<>();
  List<String> r = new ArrayList<>();

  helper(input, dictionary, result, "", 0, new Stack<>());

  for (List<String> strings : result) {
    String s = String.join(" ", strings);
    r.add(s);
  }

  return r;
}

static void helper(
    final String input,
    final Set<String> dictionary,
    final List<List<String>> result,
    String state,
    int index,
    Stack<String> stack
) {

  if (index == input.length()) {

    // add the last word
    stack.push(state);

    for (String s : stack) {
      if (!dictionary.contains(s)) {
        return;
      }
    }

    result.add((List<String>) stack.clone());

    return;
  }

  if (dictionary.contains(state)) {
    // bifurcate
    stack.push(state);
    helper(input, dictionary, result, "" + input.charAt(index),
           index + 1, stack);

    String pop = stack.pop();
    String s = stack.pop();

    helper(input, dictionary, result, s + pop.charAt(0),
           index + 1, stack);

  }
  else {
    helper(input, dictionary, result, state + input.charAt(index),
           index + 1, stack);
  }

  return;
}

另一个可能的解决方案是使用Tries 数据结构。

【讨论】：

【解决方案12】：

output :-
['better', 'good'] ['coffee', 'shop']
['coffee', 'shop']

    pip install wordninja
import wordninja
n=wordninja.split('bettergood')
m=wordninja.split("coffeeshop")
print(n,m)

list=['hello','coffee','shop','better','good']
mat='coffeeshop'
expected=[]
for i in list:
    if i in mat:
        expected.append(i)
print(expected)

【讨论】：

【解决方案13】：

我可能会因此而被降级，但让秘书去做。

与手动处理相比，您在字典解决方案上花费的时间更多。此外，您不可能对解决方案有 100% 的信心，因此您仍然必须手动关注它。

【讨论】：

man.. 现在我真的很想给你投反对票！ :-) 我们曾经尝试过一种类似的方法来过滤淘气的搜索查询。与我在分类器上相比，我们花了更多的时间来构建一个秘书（在我的情况下是公关人员）会使用的漂亮界面。

【解决方案14】：

如果是驼峰式，这将起作用。 JavaScript！！！

function spinalCase(str) {
  let lowercase = str.trim()
  let regEx = /\W+|(?=[A-Z])|_/g
  let result = lowercase.split(regEx).join("-").toLowerCase()

  return result;
}

spinalCase("AllThe-small Things");

【讨论】：

好的，首先我们创建一个接受值的函数。 let lowercase = str.trim() 基本上 .trim() 不检查空格 in 或 between 一个字符串它只检查并从 开始、结束或两边删除空格（如果有） , let regEx = /\W+|(?=[A-Z])|_/g 是正则表达式，其中，\W+ 不匹配字母和数字，因此，让 _ 和空格 匹配，| 表示 OR，(?=[A-Z]) 表示正向前瞻，它在其中搜索所有 Capletters，然后 result 运行表达式并将其与连字符连接起来，并将其全部转换为小写。

【解决方案15】：

所以我在这个答案上花了大约 2 天的时间，因为我需要它来完成我自己的 NLP 工作。我的答案来自Darius Bacon's 答案，它本身来自Viterbi algorithm。我还将它抽象为获取消息中的每个单词，尝试拆分它，然后重新组合消息。我扩展了 Darius 的代码以使其可调试。我还换掉了对“big.txt”的需求，改用wordfreq 库。一些 cmets 强调需要对不存在的词使用非零词频。我发现使用任何高于零的频率都会导致“itseasyformetosplitlongruntogetherblocks”被拆分为“itseasyformetosplitlongruntogether blocks”。该算法通常倾向于过度拆分或拆分各种测试消息，具体取决于您如何组合词频以及如何处理丢失的词频。我进行了许多调整，直到它表现良好。我的解决方案对缺失词使用 0.0 频率。它还增加了对单词长度的奖励（否则它倾向于将单词拆分为字符）。我尝试了很多长度奖励，似乎最适合我的测试用例的是word_frequency * (e ** word_length)。还有 cmets 警告不要将词频相乘。我尝试使用调和平均值添加它们，并使用 1-freq 而不是 0.00001 形式。他们都倾向于过度拆分测试用例。简单地将词频相乘效果最好。我把我的调试打印语句留在了那里，以便其他人更容易继续调整。最后，有一种特殊情况，如果您的整个消息是一个不存在的单词，例如“Slagle's”，那么该函数会将单词拆分为单个字母。就我而言，我不希望这样，所以我在最后有一个特殊的 return 语句来返回这些情况下的原始消息。

import numpy as np
from wordfreq import get_frequency_dict

word_prob = get_frequency_dict(lang='en', wordlist='large')
max_word_len = max(map(len, word_prob))  # 34

def viterbi_segment(text, debug=False):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        new_probs = []
        for j in range(max(0, i - max_word_len), i):
            substring = text[j:i]
            length_reward = np.exp(len(substring))
            freq = word_prob.get(substring, 0) * length_reward
            compounded_prob = probs[j] * freq
            new_probs.append((compounded_prob, j))
            
            if debug:
                print(f'[{j}:{i}] = "{text[lasts[j]:j]} & {substring}" = ({probs[j]:.8f} & {freq:.8f}) = {compounded_prob:.8f}')

        prob_k, k = max(new_probs)  # max of a touple is the max across the first elements, which is the max of the compounded probabilities
        probs.append(prob_k)
        lasts.append(k)

        if debug:
            print(f'i = {i}, prob_k = {prob_k:.8f}, k = {k}, ({text[k:i]})\n')


    # when text is a word that doesn't exist, the algorithm breaks it into individual letters.
    # in that case, return the original word instead
    if len(set(lasts)) == len(text):
        return text

    words = []
    k = len(text)
    while 0 < k:
        word = text[lasts[k]:k]
        words.append(word)
        k = lasts[k]
    words.reverse()
    return ' '.join(words)

def split_message(message):
  new_message = ' '.join(viterbi_segment(wordmash, debug=False) for wordmash in message.split())
  return new_message

messages = [
    'tosplit',
    'split',
    'driveourtrucks',
    "Slagle's",
    "Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector",
    'itseasyformetosplitlongruntogetherblocks',
]

for message in messages:
    print(f'{message}')
    new_message = split_message(message)
    print(f'{new_message}\n')

tosplit
to split

split
split

driveourtrucks
drive our trucks

Slagle's
Slagle's

Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector
Slagle's wicked weather liquid weather drive our trucks go compact slim projector

itseasyformetosplitlongruntogetherblocks
its easy for me to split long run together blocks

【讨论】：