【Question title】: Mine phrases (up to 3 words) from a given text
【Posted】: 2011-11-11 17:12:52
【Question】:

I previously asked for a simple solution to my problem (using the Sphinx search service), but I got nowhere...

Someone gave me this code:

<?php
/**
 * $Project: GeoGraph $
 * $Id$
 * 
 * GeoGraph geographic photo archive project
 * This file copyright (C) 2005  Barry Hunter (geo@barryhunter.co.uk)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */



/**
* Provides the methods for updating the worknet tables
*
* @package Geograph
* @author Barry Hunter <geo@barryhunter.co.uk>
* @version $Revision$
*/

function addTwoLetterPhrase($phrase) {
    global $w2;
    $w2[$phrase] = (isset($w2[$phrase]))?($w2[$phrase]+1):1; 
}

function addThreeLetterPhrase($phrase) {
    global $w3;
    $w3[$phrase] = (isset($w3[$phrase]))?($w3[$phrase]+1):1; 
}

function updateWordnet(&$db,$text,$field,$id) {
    global $w1,$w2,$w3;

    $alltext = strtolower(preg_replace('/\W+/',' ',str_replace("'",'',$text)));


    if (strlen($text)< 1)
        return;


    $words = preg_split('/ /',$alltext);

    $w1 = array();
    $w2 = array();
    $w3 = array();

    //build a list of one word phrases
    foreach ($words as $word) {
        $w1[$word] = (isset($w1[$word]))?($w1[$word]+1):1; 
    }

    //build a list of two word phrases
        $text = $alltext;
    $text = preg_replace('/(\w+) (\w+)/e','addTwoLetterPhrase("$1 $2")',$text); 
        $text = $alltext;
        $text = preg_replace('/(\w+)/','',$text,1);
    $text = preg_replace('/(\w+) (\w+)/e','addTwoLetterPhrase("$1 $2")',$text);

    //build a list of three word phrases
        $text = $alltext;
    $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);  
        $text = $alltext;
        $text = preg_replace('/(\w+)/','',$text,1);
    $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);  
        $text = $alltext;
        $text = preg_replace('/(\w+) (\w+)/','',$text,1);
    $text = preg_replace('/(\w+) (\w+) (\w+)/e','addThreeLetterPhrase("$1 $2 $3")',$text);



    foreach ($w1 as $word=>$count) {
        $db->Execute("insert into wordnet1 set gid = $id,words = '$word',$field = $count");// ON DUPLICATE KEY UPDATE $field=$field+$count");
    }
    foreach ($w2 as $word=>$count) {
        $db->Execute("insert into wordnet2 set gid = $id,words = '$word',$field = $count");
    }   
    foreach ($w3 as $word=>$count) {
        $db->Execute("insert into wordnet3 set gid = $id,words = '$word',$field = $count");
    }   
}


?>

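It works fine and does almost exactly what I need, except that it is not UTF-8 friendly: it splits whole words into pieces at special characters, where it should not.

So my guess is that I should use multibyte functions instead of the regular preg_replace...

I tried replacing preg_replace with mb_ereg_replace, but it did not work as it should, at least not for the 2- and 3-word phrases.

Any ideas?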

【Question discussion】:

    Tags: php regex utf-8


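    【Solution 1】:

    PCRE can handle UTF-8. You just need to add the /u modifier to each regular expression.

    http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

    (You could also use \pL+ instead of \w+, but the flag alone is sufficient in recent PCRE versions.)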
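
As a rough illustration of the advice above, the normalisation step from updateWordnet() could be made UTF-8 aware along these lines. This is only a sketch: the sample text is made up, and swapping strtolower() for mb_strtolower() is an assumption on my part (it needs the mbstring extension); the answer itself only mentions the /u modifier.

```php
<?php
// Hypothetical sample text; any UTF-8 string with non-ASCII letters would do.
$text = "Café, Über-Straße!";

// Same normalisation as in updateWordnet(), with two changes:
//  - /u makes the pattern (and \W) treat the input as UTF-8 instead of raw bytes
//  - mb_strtolower() lowercases non-ASCII letters too (assumes ext-mbstring)
$alltext = mb_strtolower(
    preg_replace('/\W+/u', ' ', str_replace("'", '', $text)),
    'UTF-8'
);

// Split on whitespace and drop empty entries (e.g. from the trailing "!").
$words = preg_split('/\s+/u', $alltext, -1, PREG_SPLIT_NO_EMPTY);

// Alternative mentioned in the answer: name the letter class explicitly,
// e.g. preg_match_all('/\pL+/u', $alltext, $m) to collect runs of letters.

print_r($words);
```

The remaining patterns in the function would take the same modifier, for example '/(\w+) (\w+)/eu' rather than '/(\w+) (\w+)/e'.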

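    【Comments】:

    • I added /u, like: $alltext = strtolower(preg_replace('/\W+/u',' ',str_replace("'", '', $text))); and $text = preg_replace('/(\w+) (\w+)/e/u','addTwoLetterPhrase("$1 $2")',$text); but I am getting an error... Sorry, I'm not good with regular expressions.
    • Well, yes. Please read the manual. Adding a modifier means /.../u or /.../eu, not adding a second delimiter.
    • Just one more thing... how can I set a min_length for words? Say I only want to accept words longer than 2 letters... can that be done with a regex?
    • Use a repetition such as {2,} instead of +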
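
Putting the comments above together, here is one possible sketch of counting 1- to 3-word phrases with a minimum word length. The function name countPhrases and its parameters are hypothetical, not part of the original code. It uses a plain sliding window instead of the /e modifier, because /e was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; the sketch itself assumes PHP 7+ and the mbstring extension. Note that {2,} accepts words of two or more characters, so "longer than 2 letters" corresponds to {3,}, which is the default used here.

```php
<?php
/**
 * Hypothetical helper (not from the original code): counts phrases of
 * 1..$maxWords words in a UTF-8 string, ignoring words shorter than
 * $minLength characters.
 */
function countPhrases(string $text, int $maxWords = 3, int $minLength = 3): array
{
    $clean = mb_strtolower(str_replace("'", '', $text), 'UTF-8');

    // \w{N,} with /u matches runs of at least N Unicode word characters.
    preg_match_all('/\w{' . $minLength . ',}/u', $clean, $matches);
    $words = $matches[0];
    $total = count($words);

    $counts = [];
    for ($n = 1; $n <= $maxWords; $n++) {
        $counts[$n] = [];
        // Slide a window of $n consecutive words over the list.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$n][$phrase] = ($counts[$n][$phrase] ?? 0) + 1;
        }
    }
    return $counts;
}

// Made-up sample sentence; "im" and "er" are dropped by the length filter.
$result = countPhrases('Im Café saß ein Käfer, im Café saß er.');
print_r($result[2]);
```

One design consequence worth knowing: short words are removed before the phrases are built, so a phrase can join two words that were separated by a short word in the original text. If that is not wanted, the filter would have to be applied to the finished phrases instead.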