How do I detect if I have to apply UTF-8 decode or encode on a string? Answers

【Question title】: How do I detect if I have to apply UTF-8 decode or encode on a string?
【Posted】: 2011-05-23 09:54:23
【Question】:

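I have a feed taken from a third-party site, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If I mistakenly apply the same function twice, or use the wrong one, I get something even uglier, and this is what I want to change.

How can I detect which function, if any, has to be applied to a given string?

The content is actually returned as UTF-8, but some parts inside it are not.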

【Question comments】:

Should we assume that the feed declares one character set but uses another?
Please provide an example feed.

Tags: php encoding utf-8

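【Solution 1】:

The feed (I guess you mean some kind of XML-based feed) should have an attribute in its header that tells you what the encoding is. If it doesn't, you are out of luck, because you have no reliable way to identify the encoding.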
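
As a rough illustration of reading that declared encoding, here is a minimal sketch that assumes the feed starts with a standard XML declaration such as `<?xml version="1.0" encoding="ISO-8859-1"?>`. The function name `declared_xml_encoding` is hypothetical and not part of the answer; the declaration can also be wrong, which is what the question seems to be running into.

```php
<?php
// Hypothetical helper (not from the original answer): returns the encoding
// named in the XML declaration, upper-cased, or null when none is declared.
function declared_xml_encoding(string $xml): ?string
{
    // Only the declaration at the very start of the document is considered;
    // an optional UTF-8 byte order mark and leading whitespace are tolerated.
    $pattern = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    if (preg_match($pattern, $xml, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

// Usage: the first call returns "ISO-8859-1", the second returns null.
$withDeclaration    = declared_xml_encoding('<?xml version="1.0" encoding="iso-8859-1"?><rss/>');
$withoutDeclaration = declared_xml_encoding('<rss/>');
```

If this returns null, the XML specification's default of UTF-8 (or UTF-16 with a byte order mark) applies, so a missing declaration is not by itself evidence of a legacy encoding.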

【Comments】:

【Solution 2】:

You can use

mb_detect_encoding: Detect character encoding

The charset may also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
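
To act on that output rather than just dump it, the charset can be pulled out of the `Content-Type` line. The following is a minimal sketch: `charset_from_headers` is a hypothetical name, and it assumes the header lines have the same shape as the `$http_response_header` array shown above. It keeps the last match, on the assumption that when redirects are followed the final response's header is the relevant one.

```php
<?php
// Hypothetical helper (not from the original answer): extracts the charset
// parameter from raw header lines such as those in $http_response_header.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header) {
        // Header names are case-insensitive; the charset value may be quoted.
        if (preg_match('/^Content-Type:.*?\bcharset\s*=\s*"?([A-Za-z0-9._-]+)/i', $header, $m) === 1) {
            $charset = strtoupper($m[1]); // keep the last Content-Type seen
        }
    }
    return $charset;
}

// Usage with a shortened copy of the headers from the output above.
$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Connection: close',
];
$charset = charset_from_headers($headers); // "UTF-8"
```

A declared charset is still only a claim by the server, so it is worth validating the body against it before trusting it.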

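【Comments】:

【Solution 3】:

I can't say that I can rely on mb_detect_encoding(). I had some odd false positives with it a while back.

The most universal approach I have found, which has worked in every case so far, is: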

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
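
One way to turn this check into a conversion step, along the lines of the `utf8_validate()` mentioned in the comments, is sketched below. The name `ensure_utf8` and the ISO-8859-1 fallback are assumptions: the check only tells you that a string is not UTF-8, not what it actually is, so the source encoding has to be guessed or known from elsewhere. The sketch also assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not from the original answer): returns the string
// unchanged when it is already valid UTF-8, otherwise converts it from an
// assumed fallback encoding.
function ensure_utf8(string $string, string $fallback = 'ISO-8859-1'): string
{
    // With the "u" modifier preg_match() returns false on invalid UTF-8 input,
    // and 1 for the empty pattern on any valid string (including '').
    if (preg_match('//u', $string) === 1) {
        return $string;
    }
    // ASSUMPTION: anything that is not valid UTF-8 is in $fallback.
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

// Usage: both calls return the UTF-8 bytes for "café" ("caf\xC3\xA9").
$alreadyUtf8 = ensure_utf8("caf\xC3\xA9"); // valid UTF-8, left alone
$fromLatin1  = ensure_utf8("caf\xE9");     // ISO-8859-1 byte, converted
```

Because valid UTF-8 is passed through untouched, calling the function twice on the same string does not double-encode it, which is the failure described in the question.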

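【Comments】:

+1 I implemented a utf8_validate() that uses your solution to convert a string to UTF-8 if it isn't already, and it works like a charm!
Thanks! That is a very clever trick ;-) Since I had no idea how it works, I dug into the PHP documentation and found this: u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5. Anyway, thanks a lot!
You don't even need that dot in the regex; preg_match('!!u', $str) works fine.
With that dot, it would even return 0 (which equals false) for an empty string. But an empty string is valid UTF-8 ;).
"It's just an empty regex. ! is the delimiter and u is the modifier." The solution is indeed clever, but it needs a more detailed explanation, so I asked about it - stackoverflow.com/questions/10855682/…

【Solution 4】: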

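function str_to_utf8 ($str) {
    // Undo one layer of UTF-8 encoding (UTF-8 -> ISO-8859-1 bytes).
    $decoded = utf8_decode($str);
    // If the result is no longer valid UTF-8, the input was not double-encoded: keep it.
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    // Otherwise the input was encoded twice and $decoded is the repaired string.
    return $decoded;
}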

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)

【Comments】:

【Solution 5】:

Automatic encoding detection is not foolproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
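
For the situation in the question, where the content is mostly UTF-8 but some parts are not, `mb_check_encoding()` can be applied piece by piece. The sketch below is one possible approach under stated assumptions: the function name `normalize_lines_to_utf8` is hypothetical, the feed is assumed to be line-oriented, and any line that fails validation is assumed to be ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the original answer): validates each line
// separately and converts only the lines that are not valid UTF-8.
function normalize_lines_to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Splitting on "\n" is safe here: byte 0x0A never occurs inside a
    // multi-byte UTF-8 sequence and means a line feed in ISO-8859-1 too.
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // ASSUMPTION: invalid lines are in $fallback.
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', $fallback);
        }
    }
    return implode("\n", $lines);
}

// Usage: the first line is already UTF-8, the second contains a raw 0xE9 byte.
$mixed = "ok \xC3\xA9\nbad \xE9";
$clean = normalize_lines_to_utf8($mixed); // "ok \xC3\xA9\nbad \xC3\xA9"
```

A line that mixes both encodings would still be converted as a whole and come out garbled, so the granularity of the split has to match how the feed was assembled.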

【Comments】: