【Question title】: Compare similarity between two strings with PHP
【Posted】: 2016-07-01 02:25:58
【Question】:

Hi everyone :) I'm looking for a solution. I have a dictionary file, words.txt, with entries such as:

happy
laugh
sad

And I have a slang string:

hppy

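I want to look up that slang string in my dictionary and get back the matching entry. In this case it should return "happy", because "hppy" refers to the dictionary word "happy".

So far I've been using similar_text(), but I'm not confident about how well it works. Can you recommend a better solution for my problem? Thank you :)

Here is my code: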

function searchwords($tweet){
//echo $tweet;
$find       = false;
$handle     = @fopen("words.txt", "r");
if ($handle)
{
    while (!feof($handle))
    {
        $buffer         = fgets($handle);
        similar_text(trim($tweet),trim($buffer),$percent);
        if ($percent == 100){ // this exact match
            $find = true;
        }else if ($percent >= 90){ //there is the possibility of errors
            $find = true;
        }

    }
    fclose($handle);
}
  if ($find == true){
    unset($tweet);
  }else{
    return $tweet;
  }
}

【Comments】:

    Tags: php similarity


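    【Solution 1】:

    Answered here

    I have found that, for calculating a similarity percentage between strings, the Levenshtein and Jaro Winkler algorithms work well for spelling mistakes and small changes between strings, while the Smith Waterman Gotoh algorithm works well for strings where significant chunks of text are the same but with "noise" around the edges. This answer to a similar question shows more detail on this.

    I've included PHP examples of each of these three algorithms for comparing two strings. Note that the built-in levenshtein() returns an edit distance rather than a percentage, while the two classes return a score between 0 and 1: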

    Levenshtein

    echo levenshtein("LEGENDARY","BARNEY STINSON");
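
    Since levenshtein() gives a raw edit distance, one possible way to turn it into a percentage is to scale it by the length of the longer string. This is only a sketch: the helper name levenshteinPercent is made up for illustration and is not part of the original answer, and the scaling is one common convention among several.

```php
<?php
// Hypothetical helper (not from the original answer): converts the edit
// distance returned by levenshtein() into a 0-100 similarity score.
function levenshteinPercent(string $a, string $b): float
{
    // levenshtein() works on bytes, so strlen() is the matching length measure.
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    $distance = levenshtein($a, $b);
    return 100.0 * ($maxLen - $distance) / $maxLen;
}

echo levenshteinPercent("hppy", "happy"), "\n"; // one edit out of five characters: prints 80
```

    Because levenshtein() compares bytes, multibyte (e.g. UTF-8) input would need a different distance function for accurate results.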
    

    Jaro Winkler

    class StringCompareJaroWinkler 
    {
        public function compare($str1, $str2)
        {
            return $this->JaroWinkler($str1, $str2, 0.1); // 0.1 is the prefix scale
        }
    
        private function getCommonCharacters( $string1, $string2, $allowedDistance ){
    
          $str1_len = mb_strlen($string1);
          $str2_len = mb_strlen($string2);
          $temp_string2 = $string2;
    
          $commonCharacters='';
          for( $i=0; $i < $str1_len; $i++){
    
            $noMatch = True;
            // compare if char does match inside given allowedDistance
            // and if it does add it to commonCharacters
            for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++){
              if( $temp_string2[$j] == $string1[$i] ){
                $noMatch = False;
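                $commonCharacters .= $string1[$i];
                // PHP 7.1+ throws on assigning '' to a string offset, so mark
                // the matched character as used with a NUL byte instead.
                $temp_string2[$j] = "\0";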
              }
            }
          }
          return $commonCharacters;
        }
    
        private function Jaro( $string1, $string2 ){
    
          $str1_len = mb_strlen( $string1 );
          $str2_len = mb_strlen( $string2 );
    
          // theoretical distance
          $distance = (int) floor(min( $str1_len, $str2_len ) / 2.0); 
    
          // get common characters
          $commons1 = $this->getCommonCharacters( $string1, $string2, $distance );
          $commons2 = $this->getCommonCharacters( $string2, $string1, $distance );
    
          if( ($commons1_len = mb_strlen( $commons1 )) == 0) return 0;
          if( ($commons2_len = mb_strlen( $commons2 )) == 0) return 0;
          // calculate transpositions
          $transpositions = 0;
          $upperBound = min( $commons1_len, $commons2_len );
          for( $i = 0; $i < $upperBound; $i++){
            if( $commons1[$i] != $commons2[$i] ) $transpositions++;
          }
          $transpositions /= 2.0;
          // return the Jaro distance
          return ($commons1_len/($str1_len) + $commons2_len/($str2_len) + ($commons1_len - $transpositions)/($commons1_len)) / 3.0;
    
        }
    
        private function getPrefixLength( $string1, $string2, $MINPREFIXLENGTH = 4 ){
    
          $n = min( array( $MINPREFIXLENGTH, mb_strlen($string1), mb_strlen($string2) ) );
    
          for($i = 0; $i < $n; $i++){
            if( $string1[$i] != $string2[$i] ){
              // return index of first occurrence of different characters 
              return $i;
            }
          }
          // first n characters are the same   
          return $n;
        }
    
        private function JaroWinkler($string1, $string2, $PREFIXSCALE = 0.1 ){
    
          $JaroDistance = $this->Jaro( $string1, $string2 );
          $prefixLength = $this->getPrefixLength( $string1, $string2 );
          return $JaroDistance + $prefixLength * $PREFIXSCALE * (1.0 - $JaroDistance);
        }
    }
    
    $jw = new StringCompareJaroWinkler();
    echo $jw->compare("LEGENDARY","BARNEY STINSON");
    

    Smith Waterman Gotoh

    class SmithWatermanGotoh 
    {
        private $gapValue;
        private $substitution;
    
        /**
         * Constructs a new Smith Waterman metric.
         * 
         * @param gapValue
         *            a non-positive gap penalty
         * @param substitution
         *            a substitution function
         */
        public function __construct($gapValue=-0.5, 
                    $substitution=null) 
        {
            if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
            //if(empty($substitution)) throw new Exception("substitution is required");
            if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
            else $this->substitution = $substitution;
            $this->gapValue = $gapValue;
        }
    
        public function compare($a, $b) 
        {
            if (empty($a) && empty($b)) {
                return 1.0;
            }
    
            if (empty($a) || empty($b)) {
                return 0.0;
            }
    
            $maxDistance = min(mb_strlen($a), mb_strlen($b))
                    * max($this->substitution->max(), $this->gapValue);
            return $this->smithWatermanGotoh($a, $b) / $maxDistance;
        }
    
        private function smithWatermanGotoh($s, $t) 
        {   
            $v0 = [];
            $v1 = [];
            $t_len = mb_strlen($t);
            $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));
    
            for ($j = 1; $j < $t_len; $j++) {
                $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                        $this->substitution->compare($s, 0, $t, $j));
    
                $max = max($max, $v0[$j]);
            }
    
            // Find max
            for ($i = 1; $i < mb_strlen($s); $i++) {
                $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));
    
                $max = max($max, $v1[0]);
    
                for ($j = 1; $j < $t_len; $j++) {
                    $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                            $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));
    
                    $max = max($max, $v1[$j]);
                }
    
                for ($j = 0; $j < $t_len; $j++) {
                    $v0[$j] = $v1[$j];
                }
            }
    
            return $max;
        }
    }
    
    class SmithWatermanMatchMismatch
    {
        private $matchValue;
        private $mismatchValue;
    
        /**
         * Constructs a new match-mismatch substitution function. When two
         * characters are equal a score of <code>matchValue</code> is assigned. In
         * case of a mismatch a score of <code>mismatchValue</code>. The
         * <code>matchValue</code> must be strictly greater than
         * <code>mismatchValue</code>
         * 
         * @param matchValue
         *            value when characters are equal
         * @param mismatchValue
         *            value when characters are not equal
         */
        public function __construct($matchValue, $mismatchValue) {
            if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");
    
            $this->matchValue = $matchValue;
            $this->mismatchValue = $mismatchValue;
        }
    
        public function compare($a, $aIndex, $b, $bIndex) {
            return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
                    : $this->mismatchValue);
        }
    
        public function max() {
            return $this->matchValue;
        }
    
        public function min() {
            return $this->mismatchValue;
        }
    }
    
    $o = new SmithWatermanGotoh();
    echo $o->compare("LEGENDARY","BARNEY STINSON");
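
    To tie this back to the dictionary lookup in the question, here is a rough sketch that picks the dictionary entry with the smallest Levenshtein distance to the slang string. The function name closestWord and the $maxDistance cut-off of 2 are assumptions chosen for illustration; the threshold in particular would need tuning against real data.

```php
<?php
// Hypothetical helper: returns the dictionary word closest to $input,
// or null when no word is within $maxDistance edits.
function closestWord(string $input, array $dictionary, int $maxDistance = 2): ?string
{
    $needle = strtolower(trim($input));
    $best = null;
    $bestDistance = $maxDistance + 1; // anything above the cut-off is rejected

    foreach ($dictionary as $word) {
        $candidate = strtolower(trim($word));
        if ($candidate === '') {
            continue; // skip blank lines
        }
        $distance = levenshtein($needle, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// The sample entries from the question; a real run could load them with
// file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES).
$dictionary = ['happy', 'laugh', 'sad'];
echo closestWord('hppy', $dictionary) ?? 'no match', "\n"; // prints happy
```

    Returning the matched word rather than a boolean lets the caller replace the slang term directly, and the same loop would work with a Jaro Winkler score by keeping the highest score instead of the lowest distance.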
    

    【Comments】:

    • Thanks for your answer. PHP no longer supports assigning an empty string to a string offset. What is the intended behavior of: $temp_string2[$j] = '';
    【Solution 2】:

    I hope it serves your purpose. However, there are a few things to watch out for when using similar_text().

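    1. The match percentage can differ depending on the order of the arguments, i.e. similar_text($a, $b, $percent) and similar_text($b, $a, $percent) may not produce the same $percent.
    2. Note that this function is case-sensitive, so pass both $a and $b in the same case, either all upper or all lower.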
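
    Both points can be checked with a short script. The sample strings below are chosen for illustration only; the exact figures depend on the inputs, so the script prints them rather than assuming them.

```php
<?php
// 1. Argument order: the same two strings, swapped.
$simAB = similar_text("bafoobar", "barfoomaz", $percentAB);
$simBA = similar_text("barfoomaz", "bafoobar", $percentBA);
printf("a,b: %d (%.2f%%)\n", $simAB, $percentAB);
printf("b,a: %d (%.2f%%)\n", $simBA, $percentBA);

// 2. Case sensitivity: compare as-is, then after normalising to lower case.
similar_text("Happy", "HAPPY", $mixedCase);
similar_text(strtolower("Happy"), strtolower("HAPPY"), $sameCase);
printf("mixed case: %.2f%%\n", $mixedCase);
printf("lower-cased: %.2f%%\n", $sameCase);
```

    Lower-casing both inputs first, as in the second half, is a simple way to make the comparison case-insensitive; for the ordering issue, one option is to call the function both ways and keep the larger percentage.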

    For more information, see this page

    【Comments】:
