Remove Byte Order Mark from a File.ReadAllBytes (byte[]): answers

【Question title】: Remove Byte Order Mark from a File.ReadAllBytes (byte[])
【Posted】: 2010-09-22 05:36:14
【Question description】:

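I have an HTTPHandler that reads in a set of CSS files, combines them, and then GZips them. However, some of the CSS files contain a Byte Order Mark (due to a bug in TFS 2005 auto merge), and in FireFox the BOM is read as part of the actual content, so it breaks my class names and so on. How can I strip out the BOM characters? Is there an easy way to do this without manually going through the byte array looking for "ï»¿"?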

【Comments】:

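Does the BOM appear within the actual text itself, or only at the start? I would be surprised to see it anywhere other than the start of the data, in which case simply ignoring the first 3 bytes (assuming UTF-8) should be fine.
FWIW, you can open the files in Notepad++ and save them without the byte order mark. That is what I had to do in this question.
I wrote the following post after running into this problem. Essentially, instead of reading the raw bytes of the file's contents with the BinaryReader class, I use the StreamReader class with a specific constructor that automatically removes the byte order mark character from the text data I am trying to retrieve.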

Tags: c# byte-order-mark

【Solution 1】:

Expanding on Jon's comment with a sample.

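// Requires: using System.Linq; (for Skip)
// Drops the first three bytes unconditionally, so it assumes the file starts with a UTF-8 BOM.
var name = GetFileName();
var bytes = System.IO.File.ReadAllBytes(name);
System.IO.File.WriteAllBytes(name, bytes.Skip(3).ToArray());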

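【Discussion】:

Quoting the OP: "However, some of the CSS files contain a Byte Order Mark." Note the word **some**: the code above does not check whether a BOM is present before skipping it.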
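
To illustrate the point raised in this comment, here is a minimal sketch of a variant that removes the three bytes only when they really are a UTF-8 BOM. The names BomStripper and StripUtf8Bom are hypothetical and do not appear in the original answer; the sketch assumes the files are UTF-8, as the question suggests.

```csharp
using System;
using System.Linq;
using System.Text;

static class BomStripper
{
    // Returns the input without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the original array is returned unchanged.
    public static byte[] StripUtf8Bom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        bool hasBom = bytes.Length >= bom.Length
                      && bytes.Take(bom.Length).SequenceEqual(bom);
        return hasBom ? bytes.Skip(bom.Length).ToArray() : bytes;
    }
}
```

In the handler from the question, the result of File.ReadAllBytes could be passed through StripUtf8Bom for each CSS file before the contents are combined, so files without a BOM are left untouched.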

【Solution 2】:

Another way, assuming a UTF-8 to ASCII conversion is acceptable.

File.WriteAllText(filename, File.ReadAllText(filename, Encoding.UTF8), Encoding.ASCII);

【Discussion】:
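
One caveat with converting to ASCII is that any character outside the ASCII range is lost, which matters if a stylesheet contains, for example, a non-ASCII glyph in a content property. The sketch below shows the effect on a string; AsciiLossDemo and ThroughAscii are hypothetical names used only for illustration.

```csharp
using System.Text;

static class AsciiLossDemo
{
    // Encodes text as ASCII and decodes it again, mirroring what ends up
    // on disk when a file is written with Encoding.ASCII.
    public static string ThroughAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
}
```

With the default ASCII encoder, ThroughAscii("caf\u00e9") yields "caf?", while pure ASCII text passes through unchanged.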

【Solution 3】:

var text = File.ReadAllText(args.SourceFileName);
var streamWriter = new StreamWriter(args.DestFileName, args.Append, new UTF8Encoding(false));
streamWriter.Write(text);
streamWriter.Close();

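【Discussion】:

Looking at this code, ideally it should work. However, I was surprised to find that it saves the file in ANSI format.
The argument to new UTF8Encoding(false) indicates whether a BOM should be added.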
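
A likely explanation for the "ANSI" observation is that a UTF-8 file without a BOM that contains only ASCII characters is byte-for-byte identical to an ASCII file, so an editor has nothing to distinguish it by. The sketch below checks both facts; PreambleDemo and its methods are hypothetical names introduced for illustration.

```csharp
using System.Linq;
using System.Text;

static class PreambleDemo
{
    // Number of BOM bytes a UTF8Encoding would emit at the start of a file.
    public static int PreambleLength(bool emitBom) =>
        new UTF8Encoding(emitBom).GetPreamble().Length;

    // True when BOM-less UTF-8 and ASCII produce identical bytes for the text.
    public static bool SameBytesAsAscii(string text) =>
        new UTF8Encoding(false).GetBytes(text)
            .SequenceEqual(Encoding.ASCII.GetBytes(text));
}
```

PreambleLength(true) is 3 and PreambleLength(false) is 0. SameBytesAsAscii returns true for text such as "a { color: red; }" and false once the text contains a non-ASCII character.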

【Solution 4】:

Extending JaredPar's example to recurse into subdirectories:

using System.Linq;
using System.IO;
namespace BomRemover
{
    /// <summary>
    /// Remove the UTF-8 BOM (EF BB BF) from all *.php files in the current directory and its sub-directories.
    /// </summary>
    class Program
    {
        private static void removeBoms(string filePattern, string directory)
        {
            foreach (string filename in Directory.GetFiles(directory, filePattern))
            {
                var bytes = System.IO.File.ReadAllBytes(filename);
                if(bytes.Length > 2 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                {
                    System.IO.File.WriteAllBytes(filename, bytes.Skip(3).ToArray()); 
                }
            }
            foreach (string subDirectory in Directory.GetDirectories(directory))
            {
                removeBoms(filePattern, subDirectory);
            }
        }
        static void Main(string[] args)
        {
            string filePattern = "*.php";
            string startDirectory = Directory.GetCurrentDirectory();
            removeBoms(filePattern, startDirectory);            
        }       
    }
}

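I needed this piece of C# code after discovering that a UTF-8 BOM corrupts files when you try to do a basic PHP file download.

【Discussion】:

【Solution 5】:

For larger files, use the following code; it is memory efficient!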

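// Reading with BOM detection consumes the BOM, so it never reaches the output.
StreamReader sr = new StreamReader(path: @"<Input_file_full_path_with_byte_order_mark>", 
                    detectEncodingFromByteOrderMarks: true);

// Note: UnicodeEncoding writes UTF-16 (little-endian here) with no BOM.
// Pass new UTF8Encoding(false) instead to keep the output in UTF-8.
StreamWriter sw = new StreamWriter(path: @"<Output_file_without_byte_order_mark>", 
                    append: false, 
                    encoding: new UnicodeEncoding(bigEndian: false, byteOrderMark: false));

var lineNumber = 0;
while (!sr.EndOfStream)
{
    // Copy one line at a time so the whole file is never held in memory.
    sw.WriteLine(sr.ReadLine());
    lineNumber += 1;
    if (lineNumber % 100000 == 0)
        Console.Write("\rLine# " + lineNumber.ToString("000000000000"));
}

sw.Flush();
sw.Close();
sr.Close();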

【Discussion】:
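
If the goal is only to drop a UTF-8 BOM from a large file without re-encoding the text or touching line endings, a byte-level copy is another option. The following is a minimal sketch under that assumption; StreamingBomRemover and CopyWithoutUtf8Bom are hypothetical names, and the approach is an alternative to the line-based code above rather than part of the original answer.

```csharp
using System.IO;

static class StreamingBomRemover
{
    // Copies input to output, dropping a leading UTF-8 BOM (EF BB BF) if present.
    public static void CopyWithoutUtf8Bom(Stream input, Stream output)
    {
        var head = new byte[3];
        int read = 0;
        // Read may return fewer bytes than requested, so loop until
        // three bytes have arrived or the stream ends.
        while (read < head.Length)
        {
            int n = input.Read(head, read, head.Length - read);
            if (n == 0) break;
            read += n;
        }

        bool hasBom = read == 3
                      && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
        if (!hasBom)
            output.Write(head, 0, read); // not a BOM: keep these bytes

        input.CopyTo(output); // stream the remainder in buffered chunks
    }
}
```

For files, the two streams could come from File.OpenRead on the source path and File.Create on a separate destination path, each wrapped in a using statement so both are closed when the copy finishes.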