Split a text into single words

【Question title】: Split a text into single words
【Posted】: 2010-10-21 21:27:27
【Question】:

I want to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

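Is this a good approach? Do you have any ideas for improvement?

Thanks in advance!

【Comments】:

Tags: php split

【Solution 1】:

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.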

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);

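This splits on a group of one or more whitespace characters, but also absorbs any surrounding punctuation characters. It also matches punctuation at the beginning or end of the string. This distinguishes between cases such as "don't" and "he said 'ouch!'".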
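To make the behavior described above concrete, here is a minimal sketch that runs the same pattern over a sentence containing both an inner apostrophe and a quoted word. The u modifier is an assumption added here (the answer's pattern does not have it) so that PCRE treats the input as UTF-8 rather than as single bytes, and split_words() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: the answer's pattern plus the "u" modifier
// (an assumption, added so that UTF-8 input such as "Grüße" stays intact).
function split_words(string $text): array
{
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = split_words("He said 'ouch!' and don't forget: Grüße, Fri3nd.");
print_r($words);
// Words: He, said, ouch, and, don't, forget, Grüße, Fri3nd
```

The apostrophe in "don't" survives because it has no whitespace next to it, while the quotes around 'ouch!' are adjacent to whitespace and are absorbed into the delimiter.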

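【Discussion】:

+1, though I don't know how this would handle äöüß. Do regular expressions generally classify äöüß as word characters?
Thanks. That may be fine for English text, but I also want to extract German umlauts (ä, ö, ü), "ß", and digits inside strings. "\W" would not extract "Fri3nd", would it?
Apparently not, but I updated the answer with something similar.
The updated answer works in Perl (which PHP's regular expressions are based on): $ echo "äöüß, test" | perl -e 'while (<>) { if (/([\p{P}\s]+)/) { print "$1\n"; } }' ,
Shouldn't it split into don and t?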

【Solution 2】:

Tokenizing: strtok.

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$delim = " \n\t,.!?:;"; // double quotes, so \n and \t are a real newline and tab

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>

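【Discussion】:

Thanks, I think this function does the job well.
This won't work if you get a : or ; or any other punctuation mark you haven't accounted for.
@marcog, I added : and ;. Wouldn't \p{P} also catch apostrophes and hyphens?
What about quoted cases like that? My updated answer distinguishes those cases.
Good idea, +1. The only problem is that $delim = " \n\t,.!?:;"; needs double quotes around it. Single quotes don't work properly: the string is then also split on the letter n.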
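The quoting issue raised in the last comment can be checked with a short sketch. The helper tokens() and the sample text are hypothetical, chosen only to show how the two delimiter strings behave differently.

```php
<?php
// Hypothetical helper: collect every token strtok() yields for a delimiter set.
function tokens(string $text, string $delim): array
{
    $out = [];
    for ($tok = strtok($text, $delim); $tok !== false; $tok = strtok($delim)) {
        $out[] = $tok;
    }
    return $out;
}

$text = "one, two\tand then";

// Single quotes: \n and \t stay literal, so the backslash and the
// letters "n" and "t" become delimiters and words are cut apart.
$single = tokens($text, ' \n\t,');

// Double quotes: \n and \t are a real newline and a real tab.
$double = tokens($text, " \n\t,");

print_r($single); // starts with "o", because "one" is cut at the letter n
print_r($double); // one, two, and, then
```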

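【Solution 3】:

Before splitting, I would first convert the string to lowercase. That makes the i modifier and the array processing afterwards unnecessary. In addition, I would use the \W shorthand for non-word characters and add a + quantifier.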

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.
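A minimal sketch of the [\p{P}\p{Z}] idea follows. Two things here are assumptions that the answer does not state: the u modifier, so the pattern and subject are read as UTF-8, and mb_strtolower() (from the mbstring extension) in place of strtolower(), so that non-ASCII letters are lowercased safely.

```php
<?php
// Sketch: split on runs of punctuation (\p{P}) and separators (\p{Z}).
// The "u" modifier and mb_strtolower() are assumptions added for UTF-8 input.
$text = 'Grüße aus Köln! Fri3nd, full-stops.';
$words = preg_split(
    '/[\p{P}\p{Z}]+/u',
    mb_strtolower($text, 'UTF-8'),
    -1,
    PREG_SPLIT_NO_EMPTY
);
print_r($words);
// Words: grüße, aus, köln, fri3nd, full, stops
```

Note that the hyphen is punctuation, so "full-stops" becomes two words, and that tabs and newlines are control characters rather than \p{Z}, so \s would need to be added to the class for multi-line input.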

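【Discussion】:

Thanks, calling strtolower() beforehand is a very good idea. I'll use this.
If you're splitting on \W, what does strtolower() accomplish? Did you mean to add the u pattern modifier? Note to researchers: \W does not match the underscore.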

【Solution 4】:

You can also use PHP's strtok() function to get string tokens out of your larger string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // pass strtok() your string and a delimiter that specifies how tokens are separated; here, words are separated by a space.
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

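See the PHP documentation for more about strtok().

【Discussion】:

What is the difference between this and explode(' ', $text);?
The code sample in the question is a tokenizer, and my answer points out that PHP has a string tokenizer built in. Also, explode() returns all the words in the text at once, whereas with strtok() the caller can stop searching the text for words as soon as some desired condition is met. Other than that, I can't think of any other difference.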
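The early-stop difference mentioned above could look like the following sketch. The function first_long_word() and its length condition are hypothetical, picked only to give the loop a reason to stop before the end of the text.

```php
<?php
// Hypothetical helper: return the first token longer than $minLength,
// or null if there is none. The loop stops at the first match, so the
// rest of the text is never tokenized.
function first_long_word(string $text, int $minLength): ?string
{
    $delim = " \n\t,.!?:;";
    for ($tok = strtok($text, $delim); $tok !== false; $tok = strtok($delim)) {
        if (strlen($tok) > $minLength) {
            return $tok;
        }
    }
    return null;
}

echo first_long_word('This is an example text, it contains commas.', 6), "\n";
// prints "example"
```

With explode(), the whole array of words would be built first and searched afterwards.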

【Solution 5】:

Use:

str_word_count($text, 1);

Or, if you need Unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote($search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}

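【Discussion】:

Thanks, but that doesn't work. It doesn't extract "Fri3nd", but it should.
I don't understand how "Fri3nd" should be extracted. Removed from the array, or split into "Fri3" and "nd" (or similar)? O.o
If you want digits to be treated as part of words, just call str_word_count_Helper($string, 1, '0123456789');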
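As an alternative to passing the digits explicitly, the character class itself could include the Unicode number property \p{N}. This is a sketch under that assumption, not the answer's helper; extract_words() is a hypothetical name.

```php
<?php
// Hypothetical variant: match runs of letters (\p{L}), combining marks (\p{Mn}),
// numbers (\p{N}), dashes (\p{Pd}) and apostrophes (' and U+2019).
function extract_words(string $text): array
{
    preg_match_all('~[\p{L}\p{Mn}\p{N}\p{Pd}\'\x{2019}]+~u', $text, $matches);
    return $matches[0];
}

print_r(extract_words("Fri3nd, Grüße! It's a full-stop."));
// Words: Fri3nd, Grüße, It's, a, full-stop
```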

【Solution 6】:

You can also use the explode function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);

【Discussion】:

This doesn't work with two or more consecutive spaces. You would have to loop over explode(" ", $sentence) with foreach and use if ($word == "") continue; to skip the empty words.
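The consecutive-spaces problem from the comment above can be reproduced with a short sketch, together with two possible workarounds. The sample sentence is made up for illustration.

```php
<?php
$sentence = 'two  spaces   here';

// explode() yields an empty string for every extra space.
$raw = explode(' ', $sentence);

// Option 1: drop the empty strings afterwards and reindex the array.
$filtered = array_values(array_filter($raw, function ($word) {
    return $word !== '';
}));

// Option 2: split on runs of whitespace directly.
$split = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($raw);      // 6 elements, 3 of them empty
print_r($filtered); // two, spaces, here
print_r($split);    // two, spaces, here
```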