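【Question title】: Converting a TFRecord file to text data
【Posted】: 2021-10-30 06:32:26
【Question】:

I converted a .txt file to TFRecords and made some changes to it. Now I want to convert or read that same file back so that I can understand the data I changed. I am doing this for my knowledge graph project.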

import numpy as np
import os
import tensorflow as tf
import tqdm
import pdb
import glob
import time
import sys
import re
import argparse
import fastBPE
import platform

use_py3 = platform.python_version()[0] == '3'

parser = argparse.ArgumentParser(description='TensorFlow code for creating TFRecords data')
parser.add_argument('--text_file', type=str, required=True,
                                        help='location of text file to convert to TFRecords')
parser.add_argument('--control_code', type=str, required=True,
                                        help='control code to use for this file. must be in the vocabulary, else it will error out.')
parser.add_argument('--sequence_len', type=int, required=True,
                                        help='sequence length of model being fine-tuned (256 or 512)')

args = parser.parse_args()


path_to_train_file = fname = args.text_file
domain = [args.control_code]

train_text = open(path_to_train_file, 'rb').read().decode(encoding='utf-8')
bpe = fastBPE.fastBPE('../codes', '../vocab')
tokenized_train_text = bpe.apply([train_text.encode('ascii', errors='ignore') if not use_py3 else train_text])[0] # will NOT work for non-English texts 
# if you want to run non-english text, please tokenize separately using ./fast applybpe and then run this script on the .bpe file with utf8 encoding

tokenized_train_text = re.findall(r'\S+|\n', tokenized_train_text)
tokenized_train_text = list(filter(lambda x: x != u'@@', tokenized_train_text))

# load the vocabulary from file
vocab = open('../vocab').read().decode(encoding='utf-8').split('\n') if not use_py3 else open('../vocab', encoding='utf-8').read().split('\n')
vocab = list(map(lambda x: x.split(' ')[0], vocab)) + ['<unk>'] + ['\n']
print ('{} unique words'.format(len(vocab)))

if args.control_code not in vocab:
    print('Provided control code is not in the vocabulary')
    print('Please provide a different one; refer to the vocab file for allowable tokens')
    sys.exit(1)
    
# Creating a mapping from unique characters to indices
word2idx = {u:i for i, u in enumerate(vocab)}
idx2word = np.array(vocab)

seq_length = args.sequence_len-1

def numericalize(x):
    count = 0
    for i in x:
        if i not in word2idx:
            print(i)
            count += 1
    return count>1, [word2idx.get(i, word2idx['<unk>'])  for i in x]

tfrecords_fname = fname.lower()+'.tfrecords'

total = 0
skipped = 0
with tf.io.TFRecordWriter(tfrecords_fname) as writer:
    for i in tqdm.tqdm(range(0, len(tokenized_train_text), seq_length)):
        flag_input, inputs = numericalize(domain+tokenized_train_text[i:i+seq_length])
        flag_output, outputs = numericalize(tokenized_train_text[i:i+seq_length+1])
        total += 1
        if flag_input or flag_output:
            skipped += 1
            continue

        if len(inputs)!=seq_length+1 or len(outputs)!=seq_length+1:
            break
        example_proto = tf.train.Example(features=tf.train.Features(feature={'input': tf.train.Feature(int64_list=tf.train.Int64List(value=inputs)),
                                                                             'output': tf.train.Feature(int64_list=tf.train.Int64List(value=outputs))}))
        writer.write(example_proto.SerializeToString())
print('Done')
print('Skipped', skipped, 'of', total)

This is my code. I want all the changes that are now in the TFRecords file converted back to text.

【Comments】:

    Tags: python-3.x tensorflow machine-learning deep-learning tensorflow2.0


    【Answer 1】:

    Use TFRecordDataset to read the TFRecord file.

    Then iterate over the TFRecordDataset and, for each element, either write the result to a new text file or print it.

    https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset
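
    A minimal sketch of the approach above, assuming the records were written by the script in the question: each record holds `input` and `output` int64 lists of length `--sequence_len`, and the vocabulary is rebuilt the same way that script builds it. The function names (`load_vocab`, `ids_to_text`, `tfrecords_to_text`) and the file paths in the usage note are hypothetical. Chunks the writer skipped, and the trailing partial chunk, never reached the file, so the recovered text may be shorter than the original.

```python
def load_vocab(vocab_path):
    """Rebuild the id -> token list the same way the writer script does."""
    with open(vocab_path, encoding="utf-8") as f:
        lines = f.read().split("\n")
    return [line.split(" ")[0] for line in lines] + ["<unk>", "\n"]


def ids_to_text(ids, idx2word):
    """Map ids to BPE tokens and merge pieces that end with the '@@' marker."""
    tokens = [idx2word[i] for i in ids]
    return " ".join(tokens).replace("@@ ", "")


def tfrecords_to_text(tfrecords_path, vocab_path, sequence_len, out_path):
    """Decode every record's 'input' feature and write the text to out_path."""
    # Imported here so the pure-Python helpers above work without TensorFlow.
    import tensorflow as tf

    idx2word = load_vocab(vocab_path)
    # Assumption: both features have exactly sequence_len ids, as in the writer.
    spec = {
        "input": tf.io.FixedLenFeature([sequence_len], tf.int64),
        "output": tf.io.FixedLenFeature([sequence_len], tf.int64),
    }

    all_ids = []
    for raw in tf.data.TFRecordDataset(tfrecords_path):
        parsed = tf.io.parse_single_example(raw, spec)
        ids = parsed["input"].numpy().tolist()
        # The first id is the control code the writer prepended; drop it.
        all_ids.extend(ids[1:])

    with open(out_path, "w", encoding="utf-8") as out:
        out.write(ids_to_text(all_ids, idx2word))
```

    With those assumptions, a call such as `tfrecords_to_text("myfile.txt.tfrecords", "../vocab", 256, "decoded.txt")` would write the decoded text to `decoded.txt`. Decoding all ids in one pass, rather than record by record, keeps a BPE piece that was split across two records joined to its continuation.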

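    【Comments】:

    • Thanks for replying, @Yaoshiang. I can find plenty of material about images, but nothing useful for handling my txt file.
    • I want to read it as a text file so that I can see what changes were made to my file!
    • Share the code you used to generate the TFRecord from text. You basically need to do the reverse.
    • You can see the code now. If possible, I just want to save all the changes in .txt format instead of TFRecord format!
    • Since it was serialized with tf.train.Example, just deserialize it with tf.train.Example. See this section of the guide: tensorflow.org/tutorials/load_data/…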
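
    The deserialization suggested in the last comment can be sketched as follows. This is only an illustration, assuming the `input` and `output` feature names from the script in the question; `example_to_ids` and `dump_records` are hypothetical helper names. Parsing the protobuf directly needs no feature spec, so it is handy for inspecting a file when the sequence length is not known.

```python
def example_to_ids(serialized):
    """Deserialize one tf.train.Example and return its (input, output) id lists."""
    import tensorflow as tf  # lazy import: only needed when records are parsed

    example = tf.train.Example()
    example.ParseFromString(serialized)
    features = example.features.feature
    inputs = list(features["input"].int64_list.value)
    outputs = list(features["output"].int64_list.value)
    return inputs, outputs


def dump_records(tfrecords_path, limit=3):
    """Print the raw ids of the first few records for a quick look."""
    import tensorflow as tf

    for raw in tf.data.TFRecordDataset(tfrecords_path).take(limit):
        # In eager mode each element is a scalar string tensor; .numpy() gives bytes.
        inputs, outputs = example_to_ids(raw.numpy())
        print("input :", inputs)
        print("output:", outputs)
```

    The ids printed here can then be mapped back to tokens with the same vocabulary list the writer script builds.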