How to get file content with a proper utf-8 encoding using file_get_contents? Answers

【Question title】: How to get file content with a proper utf-8 encoding using file_get_contents?
【Posted】: 2018-05-02 19:00:08
【Question】:

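I need to get the content of a remote file in utf-8 encoding. The file is supposed to be in utf-8 format. When I display that file on the screen, it has the proper encoding: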

http://www.parfumeriafox.sk/source_file.html

(note the ň and č characters, for example; those come out fine).

When I run this code:

<?php

$url = 'http://parfumeriafox.sk/source_file.html';

$csv = file_get_contents_utf8($url);
header('Content-type: text/html; charset=utf-8');
print $csv;

function file_get_contents_utf8($fn) {
  $content = file_get_contents($fn);
  return mb_convert_encoding($content, 'utf-8');
}

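(you can run it at http://www.parfumeriafox.sk/encoding.php), I get question marks instead of those special characters. I have researched this extensively: I tried the standard file_get_contents function, I tried some of the PHP stream context functions, and I also tried fopen and fread to read the file at the binary level. Nothing seems to work. I have tried both with and without sending the header. This should be perfectly simple, so what am I doing wrong? When I check the string with an encoding detection function, it returns UTF-8.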

【Comments】:

Tags: php utf-8 file-get-contents

【Solution 1】:

How about this?

For this I used header('Content-Type: text/plain; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

This code works for me:

<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

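The problem is not with file_get_contents().

I saved $data to a file and the characters were correct, but my text editor still did not decode them properly. See the image below.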

$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);
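
The same point can be checked without a text editor. The sketch below is only an illustration: the sample string is a hypothetical stand-in for the downloaded data (the word "levanduľa" as Windows-1250 bytes), and it assumes the mbstring and iconv extensions are loaded. It suggests that the bytes arrive intact but are not valid UTF-8, and that a UTF-8 to UTF-8 conversion, which is roughly what the question's helper does when mbstring's internal encoding is UTF-8, is where the question marks come from.

```php
<?php
// Hypothetical stand-in for the downloaded bytes: "levanduľa" encoded as
// Windows-1250, where ľ is the single byte 0xBE.
$data = "levandu\xBEa";

// The bytes are not valid UTF-8, whatever the sender claims.
$validBefore = mb_check_encoding($data, 'UTF-8');

// Treating the input as UTF-8 cannot recover the character: mbstring swaps
// the invalid byte for its substitute character ("?" by default).
$asUtf8 = mb_convert_encoding($data, 'UTF-8', 'UTF-8');

// Naming the real source charset recovers it.
$fixed = iconv('WINDOWS-1250', 'UTF-8', $data);

var_dump($validBefore);
echo $asUtf8, PHP_EOL;
echo $fixed, PHP_EOL;
```

If the first check reports false for the real download as well, the file is not UTF-8, and no amount of re-reading it with fopen or fread will change that.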

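Update

There appears to be one problematic character, as seen here. It can also be seen in the HTML image below, where it renders as ¾.

Its hex value is 0xBE (decimal 190).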
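
To make the role of that byte concrete, here is a small sketch that decodes 0xBE under three single-byte charsets with iconv. The list of candidate charsets is an assumption chosen for illustration, and the charset names are the ones iconv commonly accepts.

```php
<?php
// The single byte singled out above: 0xBE (decimal 190).
$byte = "\xBE";

// Candidate single-byte charsets to try (an illustrative choice).
$candidates = ['ISO-8859-1', 'ISO-8859-2', 'WINDOWS-1250'];

$decoded = [];
foreach ($candidates as $charset) {
    // Interpret the byte in the given charset and re-encode it as UTF-8.
    $decoded[$charset] = iconv($charset, 'UTF-8', $byte);
}

foreach ($decoded as $charset => $char) {
    echo $charset, ' => ', $char, PHP_EOL;
}
```

ISO-8859-1 gives ¾ (the character seen in the browser), ISO-8859-2 gives ž, and only Windows-1250 gives the ľ that "levanduľa" needs.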

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');

End of update

It works when the header is sent without charset=utf-8.

These two headers work:

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do not work:

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');

This code was tested and displays all the characters.

<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

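These are some of the problematic characters with their hex values.
This is the saved file viewed in Notepad++ with UTF-8 encoding.

Check the hex values against these character sets.

From the table above I can see that the character set is Latin2.

I went to the Wikipedia article on Windows code pages and found that Latin2 is Windows-1250.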

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

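【Comments】:

No, it doesn't. I can see "è" where it should read "č", "ò" instead of "ň", etc.
I would not have caught those. I was able to find one character. Well, it has improved. I found the one character after I posted. You need to know which character encoding is being used and then add that charset to header(). This link may help: docs.oracle.com/cd/B10501_01/server.920/a96529/ch2.htm
I don't think the problem is with file_get_contents(). I updated my post with more information.
Thanks, nice research. It now works with the Windows encoding, which is interesting, because the party sending the file kept saying it is UTF-8 encoded, but apparently it is not. Anyway, where did I say there was a problem with the file_get_contents() function?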

【Solution 2】:

You can see which character set your browser has determined for the document by opening the developer console and looking at document.characterSet:

> document.characterSet
"windows-1250"

With that knowledge, we can ask iconv to convert from "windows-1250" to utf-8 for us:

<?php
$text = file_get_contents("source_file.csv");
$text = iconv("windows-1250", "utf-8", $text);
print($text);

The output is valid utf-8, and levanduľa displays correctly too.
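
If it is unclear whether a given source really sends UTF-8, the conversion can be made conditional. The following is a minimal sketch, not a definitive implementation: to_utf8 is a hypothetical helper name, the Windows-1250 default is an assumption taken from this particular file, and the sample strings stand in for the result of file_get_contents.

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8, converting from
 * $fallbackCharset only when the input is not already valid UTF-8.
 */
function to_utf8(string $bytes, string $fallbackCharset = 'WINDOWS-1250'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave it untouched
    }
    $converted = iconv($fallbackCharset, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallbackCharset to UTF-8");
    }
    return $converted;
}

// Stand-ins for downloaded data: the same word as Windows-1250 and as UTF-8.
$fromLegacy = to_utf8("levandu\xBEa");
$fromUtf8   = to_utf8("levandu\u{013E}a");

echo $fromLegacy, PHP_EOL;
echo $fromUtf8, PHP_EOL;
```

The validity check is a heuristic: a short Windows-1250 string can happen to be valid UTF-8 as well, so knowing the real source charset, as in the iconv call above, remains the more reliable approach.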

【Comments】: