Python file I/O with binary data (answers)

【Question title】: Python file I/O with binary data
【Posted】: 2016-06-15 09:13:28
【Question】:

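I am extracting the JPEG bytes from MP3 data; in practice this will be the album art. I considered using a library called mutagen, but I want to work with the raw bits myself for practice.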

import os
import sys
import re

f = open(sys.argv[1], "rb")
#sys.argv[1] gets mp3 file name ex) test1.mp3

saver = ""
for value in f:
    for i in value:
        hexval = hex(ord(i))[2:]
        if (ord(i) == 0):
            saver += "00" #to match with hex form
        else:
            saver += hexval


header = "ffd8"
tail = "ffd9"

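This part of the code reads the MP3 as bits, converts it to hex, and looks for the JPEG data, which starts with the header "ffd8" and ends with the trailer "ffd9".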

frontmatch = re.search(header,saver)
endmatch = re.search(tail, saver)
startIndex = frontmatch.start()
endIndex = endmatch.end()

jpgcontents = saver[startIndex:endIndex]
scale = 16 # equals to hexadecimal
numbits = len(jpgcontents) * 4 #log2(scale)
bitcontents = bin(int(jpgcontents, scale))[2:].zfill(numbits)

Here I take the bits between the header and the trailer and convert them to binary form. This should be the JPEG part of the MP3 file.

txtfile = open(sys.argv[1] + "_tr.jpg", "w")
txtfile.write(bitcontents)

I write the binary string to a new file with a .jpg extension. Sorry, I misnamed the variable txtfile.

But the file this code produces gives the following error:

Error interpreting JPEG image file
(Not a JPEG file: starts with 0x31 0x31)

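I am not sure whether the bits I extracted are wrong or whether the step that writes the file is wrong. There may also be some other problem in the code.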

I am working with Python 2.6 on Linux. Is there a problem with simply writing binary data held in a str to a JPG file?

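【Question comments】:

Which line causes the exception/error?
Actually, the code itself raises no error. But when I open the generated jpg file I get the message "Not a JPEG file: starts with 0x31 0x31", so I cannot open the file.

Tags: python file-io jpeg bitstring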

【Answer 1】:

You need to open the output file in binary mode.

Try:

txtfile = open(sys.argv[1] + "_tr.jpg", "wb")

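【Comments】:

Thanks, I will try it. But I am not sure why the error says my file starts with 0x31. Where does this hex value come from?
If you open the jpg in a text editor, you will most likely see the textual representation of the bit string, because a file opened with plain "w" is written as text.
So if I use "wb" and write the data as binary, will that solve the problem? Or do you mean that opening the jpg in a text editor is itself the problem?
Using "wb" should work. The first character of your jpg is 0x31, which is "1" in ASCII.
Using "wb" alone is not enough. bitcontents does not hold the required data in proper binary form; it is a string of ASCII zeros and ones.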

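【Answer 2】:

You are creating a string of ASCII zeros and ones, i.e. \x30 and \x31, but a JPEG file needs to be proper binary data. So where your file should contain a single byte such as \xd8, it instead contains these eight bytes: 11011000, or \x31\x31\x30\x31\x31\x30\x30\x30.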
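
To make the difference concrete, here is a small sketch comparing the raw byte \xd8 with the text that bin() produces for it. It is written for Python 3 (unlike the Python 2 code in this thread), and the variable names are purely illustrative.

```python
# One byte of real JPEG data versus its textual "bits" representation.
raw = bytes([0xd8])                   # a single byte: b'\xd8'
as_text = bin(raw[0])[2:].zfill(8)    # the str '11011000'
encoded = as_text.encode("ascii")     # the bytes that land in the file if the text is written

print(len(raw))       # 1
print(len(encoded))   # 8
print(encoded.hex())  # 3131303131303030
```

The last line shows where the 0x31 values in the error message come from: they are the ASCII codes of the character "1".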

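You don't need to do all that tedious conversion work. You can search for the required byte patterns directly, writing them with \x hex escape sequences. You don't even need a regex: the plain string .index or .find methods do the job easily and quickly.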

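import sys

# Assumed here: the MP3 file name comes from the command line, as in the question
fname = sys.argv[1]

with open(fname, 'rb') as f:
    data = f.read()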

header = "\xff\xd8"
tail = "\xff\xd9"

try:
    start = data.index(header)
    end = data.index(tail, start) + 2
except ValueError:
    print "Can't find JPEG data!"
    exit()

print 'Start: %d End: %d Size: %d' % (start, end, end - start)

with open(fname + "_tr.jpg", 'wb') as f:
    f.write(data[start:end])

(Tested on Python 2.6.6)

However, extracting embedded JPEG data like this is not foolproof, because these header and trailer byte sequences can also occur in the MP3 audio data.
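
One way to reduce (not eliminate) false matches is to search for a slightly longer start pattern. A JPEG stream begins with the SOI marker \xff\xd8 followed immediately by another marker, which also starts with \xff, so \xff\xd8\xff is a somewhat stricter test. The sketch below is written for Python 3. The function name find_jpeg and the sample bytes are made up for illustration, and the approach remains a heuristic: a JPEG with an embedded thumbnail, for instance, contains an earlier \xff\xd9. Parsing the tag structure, as a library such as mutagen does, is the more reliable route.

```python
def find_jpeg(data):
    """Return (start, end) of the first JPEG-looking span in data, or None.

    Heuristic only: looks for SOI followed by another marker byte (ff d8 ff),
    then for the first EOI marker (ff d9) after it.
    """
    start = data.find(b"\xff\xd8\xff")
    if start == -1:
        return None
    end = data.find(b"\xff\xd9", start + 3)
    if end == -1:
        return None
    return start, end + 2  # include the two EOI bytes


# Fake "MP3" bytes with a stray ff d8 before the JPEG-like span.
sample = b"ID3\xff\xd8junk" + b"\xff\xd8\xff\xe0payload\xff\xd9" + b"audio"
span = find_jpeg(sample)
print(span)                     # (9, 22)
print(sample[span[0]:span[1]])
```

The stray \xff\xd8 at offset 3 is skipped because it is not followed by \xff, which a search for the two-byte header alone would not do.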

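FWIW, a simpler way to convert binary data to a hex string and back is to use hexlify and unhexlify from the binascii module.

Here are some examples of these conversions, both with and without the binascii functions.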

from binascii import hexlify, unhexlify

#Create a string of all possible byte values
allbytes = ''.join([chr(i) for i in xrange(256)])
print 'allbytes'
print repr(allbytes)

print '\nhex list'
print [hex(ord(v))[2:].zfill(2) for v in allbytes]
hexstr = hexlify(allbytes)

print '\nhex string'
print hexstr
newbytes = ''.join([chr(int(hexstr[i:i+2], 16)) for i in xrange(0, len(hexstr), 2)])

print '\nNew bytes'
print repr(newbytes)

print '\nUsing unhexlify'
print repr(unhexlify(hexstr))

Output

allbytes
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

hex list
['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '0a', '0b', '0c', '0d', '0e', '0f', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1a', '1b', '1c', '1d', '1e', '1f', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '2a', '2b', '2c', '2d', '2e', '2f', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '3a', '3b', '3c', '3d', '3e', '3f', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '4a', '4b', '4c', '4d', '4e', '4f', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '5a', '5b', '5c', '5d', '5e', '5f', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '6a', '6b', '6c', '6d', '6e', '6f', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '7a', '7b', '7c', '7d', '7e', '7f', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '8a', '8b', '8c', '8d', '8e', '8f', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '9a', '9b', '9c', '9d', '9e', '9f', 'a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'd0', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'da', 'db', 'dc', 'dd', 'de', 'df', 'e0', 'e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'fa', 'fb', 'fc', 'fd', 'fe', 'ff']

hex string
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff

New bytes
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Using unhexlify
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Note that this code needs some modifications to run on Python 3 (beyond converting the print statements to print function calls), because plain Python 3 strings are Unicode strings, not byte strings.
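
As a rough illustration of what those modifications might look like, here is a sketch of the same round trip for Python 3 (3.5 or later, for bytes.hex). It is not part of the tested code above; it uses bytes objects in place of Python 2 str.

```python
from binascii import hexlify, unhexlify

# All 256 possible byte values as a bytes object.
allbytes = bytes(range(256))

# bytes -> hex text: hexlify returns bytes, bytes.hex() returns str.
hexbytes = hexlify(allbytes)   # b'000102...ff'
hexstr = allbytes.hex()        # '000102...ff'

# hex text -> bytes.
newbytes = bytes.fromhex(hexstr)
also_bytes = unhexlify(hexbytes)

print(hexstr[:16])                                   # 0001020304050607
print(newbytes == allbytes, also_bytes == allbytes)  # True True
```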

【Comments】:

【Answer 3】:

Oops, your code is not doing what you expect. bin produces a string containing the value in binary form. Let's go through what you have at each step:

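- saver is a string of hex characters in text form, e.g. "313233414243" for the initial string "123ABC"
- jpgcontents has the same format; it starts with "ffd8" and ends with "ffd9"
- you then apply the magic formula bin(int(jpgcontents, scale))[2:].zfill(numbits), which:
  - converts the hex string to a long integer
  - converts the long integer to a string holding its binary representation: hex "ff" becomes the integer 255 and ends up as the string "0b11111111"
  - strips the leading "0b" prefix and, if needed, left-pads the result with zeros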

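bitcontents is a string that starts with "11111111....". Just rename the file to a .txt extension and open it in a text editor: you will see that it is a large file containing only the ASCII characters 0 and 1.

Since the header is "ffd8", the file starts with ten "1" characters. That explains the error about the file starting with 0x31 0x31, because 0x31 is the ASCII code of "1".

You need to convert the hex string jpgcontents into a binary byte string.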

# Turn each pair of hex digits into the single byte it represents (Python 2 str)
fileimage = ''.join([chr(int(jpgcontents[i:i+2], 16)) for i in range(0, len(jpgcontents), 2)])

Then you can safely write the fileimage buffer to a binary file:

file = open(sys.argv[1] + "_tr.jpg", "wb")
file.write(fileimage)
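
Note that splitting the hex string into two-character pairs only works if every byte contributed exactly two hex digits. The loop in the question pads only the zero byte, so a byte such as \x0a becomes the single character "a" and shifts everything after it. A minimal sketch of the difference, written for Python 3 with made-up sample bytes:

```python
from binascii import unhexlify

sample = b"\xff\xd8\x0a\x00\xff\xd9"

# The question's approach: only the zero byte is padded to two digits.
unpadded = "".join("00" if b == 0 else hex(b)[2:] for b in sample)

# Padding every byte to two digits keeps the pairs aligned.
padded = "".join("%02x" % b for b in sample)

print(unpadded)                     # ffd8a00ffd9
print(padded)                       # ffd80a00ffd9
print(unhexlify(padded) == sample)  # True
```

The unpadded string has an odd number of characters, so it can no longer be split cleanly into byte pairs.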

【Comments】:

【Answer 4】:

The simplest way is to use the binascii module: https://docs.python.org/2/library/binascii.html.

import binascii

# byte values written as two-digit hex strings, held in a list
code = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09']

# open in binary mode ('wb') so the bytes are written unchanged
bfile = open('bfile.bin', 'wb')

for c in code:
    # convert each hex string to its raw byte and write it to the file
    bfile.write(binascii.unhexlify(c))

bfile.close()

【Comments】: