Splitting a file randomly in Python: answers

【Question title】: splitting a file randomly in python
【Posted】: 2015-09-20 20:04:33
【Question】:

I have an input file, word.txt. I am trying to split the file randomly, 75%-25%, in Python.

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'
    traingdata = len(lines)* 75 // 100
    testdata = len(lines)-traingdata
    with open(outfilename1, 'w') as f:
        f.writelines(lines[:traingdata])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[:testdata])

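But this code writes the first 75% of the original file to the first output file, and then writes the first 25% of the original file again to the second output file. Could you suggest a way to fix this?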

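【Comments】:

The same question is answered here: stackoverflow.com/questions/17412439/…

Tags: python file input

【Solution 1】:

If you don't want to read the whole file into memory, I would use something like this. Note that it also supports splitting without shuffling: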

import random

def split_file(file,out1,out2,percentage=0.75,isShuffle=True,seed=123):
    """Splits a file in 2 given the `percentage` to go in the large file."""
    random.seed(seed)
    with open(file, 'r',encoding="utf-8") as fin, \
         open(out1, 'w',encoding="utf-8") as foutBig, \
         open(out2, 'w',encoding="utf-8") as foutSmall:

        nLines = sum(1 for line in fin) # if didn't count you could only approximate the percentage
        fin.seek(0)
        nTrain = int(nLines*percentage)
        nValid = nLines - nTrain

        nBig = 0   # lines written to the large file so far
        nSmall = 0 # lines written to the small file so far
        for line in fin:
            r = random.random() if isShuffle else 0 # so that always evaluated to true when not isShuffle
            # large file: picked at random while it still has room, or forced once the small file is full
            if (nBig < nTrain and r < percentage) or nSmall >= nValid:
                foutBig.write(line)
                nBig += 1
            else:
                foutSmall.write(line)
                nSmall += 1
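
For comparison, here is a minimal sketch of a different technique, selection sampling (often cited as Knuth's Algorithm S). It is not the code from this answer. It also needs the total count up front, but it makes every subset of the requested size equally likely. The name `split_exact` and its parameters are assumptions made for illustration, and it works on an in-memory list to stay short; the same loop could run line by line over a file once the line count is known.

```python
import random

def split_exact(lines, n_big, seed=123):
    """Route exactly n_big items of `lines` to the first list, the rest to the second.

    Each item is taken with probability
    (items still needed) / (items still remaining),
    so the first list always ends with exactly n_big items (if enough exist),
    and both lists keep the original relative order.
    """
    rng = random.Random(seed)      # private generator, leaves the global one untouched
    remaining = len(lines)
    needed = n_big
    big, small = [], []
    for line in lines:
        if rng.random() * remaining < needed:
            big.append(line)
            needed -= 1
        else:
            small.append(line)
        remaining -= 1
    return big, small

big, small = split_exact([str(i) for i in range(20)], 15)
print(len(big), len(small))        # 15 5
```

The condition is always true once `needed == remaining` and always false once `needed == 0`, which is what guarantees the exact count.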

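If your file is so large that you don't want to iterate over it twice (once just for counting), you can split it probabilistically instead. Because the file is so large, the resulting proportions will be close to the requested ones: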

import random

def split_huge_file(file,out1,out2,percentage=0.75,seed=123):
    """Splits a file in 2 given the approximate `percentage` to go in the large file."""
    random.seed(seed)
    with open(file, 'r',encoding="utf-8") as fin, \
         open(out1, 'w',encoding="utf-8") as foutBig, \
         open(out2, 'w',encoding="utf-8") as foutSmall:

        for line in fin:
            r = random.random() # independent draw per line, so the split only approximates `percentage`
            if r < percentage:
                foutBig.write(line)
            else:
                foutSmall.write(line)
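
To see how the file size affects the result, the sketch below simulates only the random decisions, without any file I/O, and reports the share of lines that would land in the large file. The helper `big_share` is hypothetical and exists only for this illustration.

```python
import random

def big_share(n_lines, percentage=0.75, seed=123):
    """Fraction of n_lines that independent draws would send to the large file."""
    rng = random.Random(seed)
    n_big = sum(1 for _ in range(n_lines) if rng.random() < percentage)
    return n_big / n_lines

for n in (20, 2_000, 200_000):
    print(n, big_share(n))
```

The number of lines sent to the large file follows a binomial distribution, so the standard deviation of the share is sqrt(p(1-p)/n): roughly 0.097 for 20 lines, but only about 0.001 for 200,000 lines when p = 0.75.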

【Discussion】:

【Solution 2】:

The problem is in this line:

 f.writelines(lines[:testdata])

That says "everything from index 0 up to, but not including, index testdata":

 f.writelines(lines[0:testdata])

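That is not what you want. Change it to

 f.writelines(lines[traingdata:])

which means "everything from index traingdata to the end of the list". The first file received lines[:traingdata], which stops just before index traingdata, so the two slices together cover every line exactly once. Do not start the second slice one position later:

 f.writelines(lines[traingdata + 1:])

That line means "everything from (traingdata + 1) to the end of the list", so the line at index traingdata would end up in neither file.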
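
A quick way to check a pair of slices is to try them on a small list. This sketch uses a made-up list of ten strings and the same 75% formula as the question; the variable `cut` plays the role of traingdata, and the other names are illustrative.

```python
lines = [f"line {i}\n" for i in range(10)]
cut = len(lines) * 75 // 100             # 7, same formula as traingdata in the question

first = lines[:cut]                      # indices 0..6
rest = lines[cut:]                       # indices 7..9
off_by_one = lines[cut + 1:]             # indices 8..9, index 7 is lost
repeated = lines[:len(lines) - cut]      # what the question's code wrote: indices 0..2

assert first + rest == lines             # every line lands in exactly one part
assert len(first) + len(off_by_one) == len(lines) - 1
assert repeated == first[:3]             # a repeat of lines already in the first part
```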

【Discussion】:

【Solution 3】:

Shuffle first:

shuffle(lines)

Then you only need some list slicing to get your two sets:

from random import shuffle
TRAINING_RATIO = 0.75    # fraction of the lines to use as training data

...

shuffle(lines)
cut = int(len(lines) * TRAINING_RATIO)    # one cut index, so no line is dropped or duplicated
train, test = lines[:cut], lines[cut:]

In the end you have two lists, train and test. train will contain 75% of your data (rounded down to a whole number of lines). test will contain the rest.

This is done as follows (for train):

lines[:cut]

This takes everything from the start of the shuffled list up to the 75% mark. For test, it takes the remaining 25%:

lines[cut:]

For example, using a file that has the numbers 1-20 each on its own line (20 lines in total), with the trailing \n stripped:

Train: ['2', '17', '19', '6', '5', '3', '14', '7', '10', '18', '9', '20', '16', '4', '8']
Test: ['12', '15', '13', '1', '11']
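
The 20-line example can be reproduced in memory without a file. This is a sketch under a few assumptions: the helper name `split_lines`, the single cut index, and the seeded `random.Random` instance are illustrative choices, not part of the answer. The order of the items depends on the seed, so only the sizes are stated.

```python
import random

TRAINING_RATIO = 0.75

def split_lines(lines, ratio=TRAINING_RATIO, seed=None):
    """Return (train, test) after shuffling a copy of `lines`."""
    shuffled = list(lines)                 # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_lines([str(n) for n in range(1, 21)], seed=0)
print("Train:", train)
print("Test:", test)
print(len(train), len(test))               # 15 5
```

Passing a fixed `seed` makes the split repeatable between runs; leaving it as `None` gives a different split each time.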

【Discussion】:

【Solution 4】:

This shuffles the lines after reading them, then saves the two parts separately:

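import random

outfilename1 = "lines25.txt"
outfilename2 = "lines75.txt"

with open('w2.txt','r') as f:
    lines = f.readlines()

# make sure the last line ends with a newline, so it cannot merge with another line after shuffling
if lines:
    lines[-1] = lines[-1].rstrip('\n') + '\n'

random.shuffle(lines)
numlines = int(len(lines)*0.25)   # size of the 25% part

with open(outfilename1, 'w') as f:
    f.writelines(lines[:numlines])   # first 25% of the shuffled lines
with open(outfilename2, 'w') as f:
    f.writelines(lines[numlines:])   # remaining 75%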
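
The same idea can be wrapped in a function with the signature used in the question. This is a sketch rather than a definitive implementation: the `ratio` and `seed` parameters, the UTF-8 encoding, and the temporary file names in the usage part are all assumptions.

```python
import os
import random
import tempfile

def shuffle_split(infilename, outfilename1, outfilename2, ratio=0.75, seed=None):
    """Shuffle the lines of infilename; write `ratio` of them to outfilename1, the rest to outfilename2."""
    with open(infilename, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"          # keep the last line from merging with another after shuffling
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * ratio)
    with open(outfilename1, "w", encoding="utf-8") as f:
        f.writelines(lines[:cut])
    with open(outfilename2, "w", encoding="utf-8") as f:
        f.writelines(lines[cut:])

# Usage on a throwaway file with 8 lines and no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "word.txt")
    big = os.path.join(tmp, "train.txt")
    small = os.path.join(tmp, "test.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("\n".join(f"word{i}" for i in range(8)))
    shuffle_split(src, big, small, seed=1)
    with open(big, encoding="utf-8") as f:
        n_big = len(f.readlines())
    with open(small, encoding="utf-8") as f:
        n_small = len(f.readlines())
    print(n_big, n_small)          # 6 2
```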

【Discussion】: