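In the end I decided to do everything in PHP, hence my question about
which characters are equal with utf8_general_ci.
Below is the example I came up with: a label is built from the text
$description, with every occurrence of the substring $term highlighted and
HTML special characters converted. The accent replacement table is incomplete,
but it is probably sufficient for practical use cases.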
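    mb_internal_encoding("UTF-8");

    // Map accented characters to unaccented ASCII. The input is converted to
    // ISO-8859-1 first so that strtr() can translate byte for byte; the result
    // is therefore ISO-8859-1 with exactly one byte per character.
    function withoutAccents($s) {
        // utf8_decode() is deprecated as of PHP 8.2, so use mb_convert_encoding().
        return strtr(mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8'),
            mb_convert_encoding('àáâãäçèéêëìíîïñòóôõöùúûüýÿß', 'ISO-8859-1', 'UTF-8'),
            'aaaaaceeeeiiiinooooouuuuyys');
    }

    // Lower-case, then strip accents. mb_strtolower() also lower-cases
    // accented capitals such as 'Ã', which strtolower() may leave untouched.
    function simplified($s) {
        return withoutAccents(mb_strtolower($s));
    }

    // Cut a substring by character offsets and escape it for HTML output.
    function encodedSubstr($s, $start, $length) {
        return htmlspecialchars(mb_substr($s, $start, $length));
    }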
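    // Build an HTML label from $description with every occurrence of $term
    // wrapped in <strong>. The simplified strings hold one byte per character,
    // so byte offsets found there are character offsets into the UTF-8 original.
    function labelFromDescription($description, $term) {
        $simpleTerm = simplified($term);
        $simpleDescription = simplified($description);
        $lastEndPos = $pos = 0;
        $termLen = strlen($simpleTerm);
        $label = ''; // HTML
        // Skip the loop for an empty term, which could never advance.
        while ($termLen > 0 && ($pos = strpos($simpleDescription,
                $simpleTerm, $lastEndPos)) !== false) {
            $label .=
                encodedSubstr($description, $lastEndPos, $pos - $lastEndPos).
                '<strong>'.
                encodedSubstr($description, $pos, $termLen).
                '</strong>';
            $lastEndPos = $pos + $termLen;
        }
        // mb_strlen() counts characters, matching the offsets used above.
        $label .= encodedSubstr($description, $lastEndPos,
            mb_strlen($description) - $lastEndPos);
        return $label;
    }

    echo labelFromDescription('São Paulo <SAO>', 'SAO')."\n";
    echo labelFromDescription('München <MUC>', 'ünc');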
Output:

    <strong>São</strong> Paulo &lt;<strong>SAO</strong>&gt;
    M<strong>ünc</strong>hen &lt;MUC&gt;
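
The same idea can probably be done without the detour through ISO-8859-1. The following is only a sketch under a few assumptions: the names foldAccents and highlightTerm are hypothetical (they are not part of the code above), the source file is saved as UTF-8, and the mapping table is just as incomplete as the original one. It uses the array form of strtr(), which replaces whole substrings and therefore works on UTF-8 directly, and it relies on every table entry mapping one character to exactly one character so that offsets in the folded string line up with the original.

```php
<?php
// Hypothetical helper: lower-case a UTF-8 string and replace accented
// characters one-for-one, so character offsets stay aligned with the input.
function foldAccents(string $s): string {
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ý' => 'y', 'ÿ' => 'y',
        'ß' => 's', // one character, not "ss", to keep offsets aligned
    ];
    return strtr(mb_strtolower($s, 'UTF-8'), $map);
}

// Hypothetical helper: wrap every accent-insensitive match of $term in
// <strong> and HTML-escape everything else.
function highlightTerm(string $description, string $term): string {
    $haystack = foldAccents($description);
    $needle = foldAccents($term);
    $len = mb_strlen($needle, 'UTF-8');
    if ($len === 0) {
        // Nothing to highlight; avoid looping on an empty needle.
        return htmlspecialchars($description);
    }
    $out = '';
    $last = 0;
    while (($pos = mb_strpos($haystack, $needle, $last, 'UTF-8')) !== false) {
        $out .= htmlspecialchars(mb_substr($description, $last, $pos - $last, 'UTF-8'))
            . '<strong>'
            . htmlspecialchars(mb_substr($description, $pos, $len, 'UTF-8'))
            . '</strong>';
        $last = $pos + $len;
    }
    // A null length means "to the end of the string".
    return $out . htmlspecialchars(mb_substr($description, $last, null, 'UTF-8'));
}
```

Calling highlightTerm('München <MUC>', 'ünc') should give the same label as labelFromDescription above. One caveat: the offset alignment assumes that lower-casing never changes the number of characters, which holds for the characters in the table but not for every Unicode character.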