file_get_contents() converts UTF-8 to ISO-8859-1

【Question title】: file_get_contents() converts UTF-8 to ISO-8859-1
【Posted】: 2011-08-01 19:18:20
【Question】:

I am trying to fetch search results from yahoo.com.

However, file_get_contents() converts the UTF-8 content (the charset Yahoo uses) to ISO-8859-1.

Try:

$filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";

echo file_get_contents($filename);

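Adding any of the following to the script

header('Content-Type: text/html; charset=UTF-8');

or

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

or

$er = mb_convert_encoding($filename , 'UTF-8');

or

$s2 = iconv("ISO-8859-1","UTF-8",$filename );

or

echo utf8_encode(file_get_contents($filename));

does not help, because by the time the page content has been fetched, special characters such as š, ť, and ž have already been replaced with question marks (???).

Any help would be appreciated.

【Question comments】:

file_get_contents() does not convert anything.

Tags: php utf-8 file-get-contents iso-8859-1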

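【Answer 1】:

This looks like a content negotiation problem: file_get_contents probably sends a request that accepts only ISO 8859-1 as the character encoding.

You can use stream_context_create to build a custom stream context for file_get_contents that explicitly declares that you accept UTF-8: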

$opts = array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context = stream_context_create($opts);

$filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
echo file_get_contents($filename, false, $context);
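
To see exactly which header the context will send before making any request, the options can be read back from the context. This is only a minimal sketch: make_utf8_context is a hypothetical helper name (not a PHP built-in), while stream_context_create and stream_context_get_options are standard PHP functions.

```php
<?php
// Hypothetical helper (not a PHP built-in): builds a stream context whose
// HTTP requests advertise that only UTF-8 responses are acceptable.
function make_utf8_context()
{
    $opts = array(
        'http' => array(
            'method' => 'GET',
            'header' => 'Accept-Charset: UTF-8, *;q=0',
        ),
    );
    return stream_context_create($opts);
}

// Read the stored options back to confirm what will be sent.
$context = make_utf8_context();
$options = stream_context_get_options($context);
echo $options['http']['header'], "\n";
```

After file_get_contents($filename, false, $context) returns, PHP fills the local variable $http_response_header with the response header lines, so the charset the server actually chose can be checked there. A server is free to ignore Accept-Charset, so this is worth verifying rather than assuming.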

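【Comments】:

Funny, I tried Accept-Charset=utf-8;q=0.7,*;q=0.7, but it did not work :)
@webarto: The value utf-8;q=0.7,*;q=0.7 is equivalent to utf-8,* and accepts every character encoding equally.
Nice one, Gumbo! I was struggling with umlauts in a URL (München), and this solved the problem. Thanks!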
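
The difference between the two header values discussed in these comments comes down to the quality values (q). The sketch below parses an Accept-Charset value into a map of charset to q so the two can be compared side by side. parse_accept_charset is a hypothetical helper written for illustration, and it handles only the simple name;q=value form used in this thread.

```php
<?php
// Hypothetical helper: parses an Accept-Charset value into a map of
// lowercase charset name => quality value (q defaults to 1.0 when absent).
function parse_accept_charset($value)
{
    $result = array();
    foreach (explode(',', $value) as $item) {
        $parts = array_map('trim', explode(';', $item));
        $name = strtolower(array_shift($parts));
        if ($name === '') {
            continue; // skip empty list entries
        }
        $q = 1.0;
        foreach ($parts as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $result[$name] = $q;
    }
    return $result;
}

// Equal weights: UTF-8 is not preferred over any other charset.
print_r(parse_accept_charset('utf-8;q=0.7,*;q=0.7'));

// UTF-8 at the default weight 1.0, every other charset refused (q=0).
print_r(parse_accept_charset('UTF-8, *;q=0'));
```

With equal weights the server may pick any charset it likes, which would explain why the first value made no difference, while q=0 on the wildcard marks everything except UTF-8 as unacceptable.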

【Answer 2】:

file_get_contents should not change the charset. The data is pulled in as a binary string.

Looking at the URL you provided, this is the header it serves:

Content-Type: text/html; charset=ISO-8859-1

And, in the body:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
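
One way to check the served charset from PHP itself is to look at the response header lines, which PHP places in $http_response_header after an HTTP request made through file_get_contents. The sketch below runs on a hard-coded sample rather than a live request. charset_from_headers is a hypothetical helper name, and its regular expression is a simplification that covers only the common charset=... form.

```php
<?php
// Hypothetical helper: returns the lowercase charset named in the last
// Content-Type header of a list of raw header lines, or null if none is given.
// The last match wins because redirects add several header sets to the list.
function charset_from_headers(array $headers)
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*;\s*charset\s*=\s*"?([^\s;"]+)/i', $line, $m)) {
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

// Sample header lines, shaped like the contents of $http_response_header.
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=ISO-8859-1',
);
var_dump(charset_from_headers($sample));
```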

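Also, you cannot convert UTF-8 to ISO-8859-1 losslessly and then recover the characters by converting back to UTF-8. UTF-8 / Unicode supports far more characters, so they are lost in the first step.

This does not happen in a browser, so perhaps you just need to send a proper Accept-Charset header to tell Yahoo's system that you can accept UTF-8.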
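
The lossy round trip described above can be reproduced locally. This is a minimal sketch that assumes the mbstring extension and PHP 7 or later (for the \u{...} escapes). It uses the search term from the question's URL, and it only imitates what the server presumably does on its side.

```php
<?php
// Requires the mbstring extension.
mb_substitute_character(0x3F); // use "?" for characters the target charset lacks

$original = "naj\u{0161}\u{0165}astnej\u{0161}\u{00ED}"; // "najšťastnejší" in UTF-8

// Step 1: UTF-8 -> ISO-8859-1. š and ť have no Latin-1 code point, í does.
$latin1 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');

// Step 2: ISO-8859-1 -> UTF-8. Each substituted "?" stays a "?".
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, "\n";
var_dump($roundTrip === $original);
```

This is why iconv, mb_convert_encoding, or utf8_encode applied after the download cannot bring š or ť back: the information is already gone by the time the bytes arrive.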

【Comments】:

How did you find Content-Type: text/html; charset=ISO-8859-1 and <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">? When I view the source of that page, I see <!doctype html><html lang="en"><head><meta http-equiv="content-type" content="text/html; charset=UTF-8">
It serves a different encoding depending on your location. You could try fetching the page through a Russian proxy server.

【Answer 3】:

// Convert the fetched content, not the URL string itself.
$s2 = iconv("ISO-8859-1", "UTF-8//TRANSLIT//IGNORE", file_get_contents($filename));

A better solution...

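function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_ENCODING, '');         // empty string: accept any compression cURL supports
    $result = curl_exec($ch);
    curl_close($ch); // close the handle before returning; the original closed it after return, so it never ran
    return $result;
}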

echo curl($filename);

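【Comments】:

The result is: The document has moved here.
@vladinko0, I think you need to set CURLOPT_FOLLOWLOCATION. I have updated the answer, so try again.
Now it loads the page, but the result is the same as with file_get_contents(), meaning question marks. The charset is also converted to ISO-8859-1.
It looks like yahoo.com serves different pages (charsets) depending on your IP (country). I changed your URL to http://ru.search.yahoo.com, but it did not work. Maybe you can achieve something with the Accept-Charset header by rejecting ISO-8859-1 ...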
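
The Accept-Charset idea from the last comment can be tried with cURL by adding a request header through CURLOPT_HTTPHEADER. This is only a sketch and requires the curl extension. utf8_curl_options and fetch_with_utf8_header are hypothetical helper names, and nothing in this thread confirms that Yahoo honors the header for cURL requests.

```php
<?php
// Hypothetical helper: cURL options for a GET request that advertises
// UTF-8 as the only acceptable charset. Requires the curl extension.
function utf8_curl_options($url)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow "document has moved" redirects
        CURLOPT_ENCODING       => '',   // accept any compression cURL supports
        CURLOPT_HTTPHEADER     => array('Accept-Charset: UTF-8, *;q=0'),
    );
}

// Hypothetical helper: fetches $url with the options above.
// Returns the response body, or false on failure.
function fetch_with_utf8_header($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, utf8_curl_options($url));
    return curl_exec($ch); // the handle is released when $ch goes out of scope
}
```

Usage would be echo fetch_with_utf8_header($filename); with the same $filename as in the question.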

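【Answer 4】:

For anyone investigating this:

The time I have spent on encoding problems has taught me that very few PHP functions "magically" change the encoding of a string. One of the rare exceptions is: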

exec($command, $output, $returnVal)

Also note that the header that worked was set as follows:

header('Content-Type: text/html; charset=utf-8');

rather than:

header('Content-Type: text/html; charset=UTF-8');

I ran into a problem similar to the one you describe, and in my case setting the header correctly was enough to fix it.

Hope this helps!
