Python: How to read a file and store certain columns in an array (answers)

【Question title】: Python: How to read file and store certain columns in array
【Posted】: 2016-04-07 18:57:56
【Question description】:

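I am reading a dataset from a file (whitespace-separated). I need to store all columns except the last one in an array data, and the last column in an array target.

Can you guide me on how to proceed?

This is what I have so far: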

with open(filename) as f:
    data = f.readlines()

Or should I read it line by line?

PS: The columns also have different data types.

Edit: sample data

faban       1   0   0.288   withspy
faban       2   0   0.243   withoutspy
simulated   1   0   0.159   withoutspy
faban       1   1   0.189   withoutspy

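【Comments】:

Can you provide sample data?
Please see the edit.
You may want to use the csv module.
Please also describe the expected output.
If you are going to do some analysis later, you might also look at pandas (pandas.pydata.org). It provides functions for reading data from CSV files. You can then separate the columns and process the data however you like.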
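
The csv suggestion above could look roughly like this. It is only a sketch: it assumes the columns are separated by runs of plain spaces (not tabs), and it reads the sample rows from an in-memory string instead of a real file. The function name read_columns is made up for illustration.

```python
import csv
import io

# Sample rows from the question, held in memory so the sketch is self-contained.
SAMPLE = """\
faban       1   0   0.288   withspy
faban       2   0   0.243   withoutspy
simulated   1   0   0.159   withoutspy
faban       1   1   0.189   withoutspy
"""

def read_columns(fobj):
    """Split each row into (all columns but the last, last column)."""
    # skipinitialspace=True makes the reader ignore the extra spaces
    # that follow each single-space delimiter.
    reader = csv.reader(fobj, delimiter=' ', skipinitialspace=True)
    data, target = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        data.append(row[:-1])
        target.append(row[-1])
    return data, target

data, target = read_columns(io.StringIO(SAMPLE))
```

With a real file, open(filename, newline='') would be passed in place of the StringIO object. All values are still strings at this point.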

Tags: python arrays list file io

【Solution 1】:

This works:

data = []
target = []
with open('faban.txt') as fobj:
    for line in fobj:
        row = line.split()
        data.append(row[:-1])
        target.append(row[-1])

Now:

>>> data
[['faban', '1', '0', '0.288'],
 ['faban', '2', '0', '0.243'],
 ['simulated', '1', '0', '0.159'],
 ['faban', '1', '1', '0.189']]

>>> target
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']
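
Since the question mentions that the columns have different data types, the string values above may need converting. One possible way, sketched here with a hypothetical helper named convert (not part of the original answer), is to try int, then float, and otherwise keep the string:

```python
def convert(value):
    """Return an int or float if the string parses as one, else the string itself."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# Two rows in the same shape as `data` above, repeated here so the sketch runs on its own.
rows = [['faban', '1', '0', '0.288'],
        ['simulated', '1', '0', '0.159']]
typed = [[convert(v) for v in row] for row in rows]
```

Whether guessing types per value is appropriate depends on the data; if the column types are known in advance, an explicit cast per column is safer.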

【Discussion】:

【Solution 2】:

The following works well:

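with open('<FILE>', 'r') as f:  # the with block closes the file afterwards
    lines = f.read().split('\n')
out = []
for line in lines:
    fields = [e for e in line.split(' ') if e]  # drop empty strings left by repeated spaces
    if fields:  # skip blank lines, e.g. the one after a trailing newline
        out.append(fields)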

out will have the form [['faban', '1', '0', '0.288', 'withspy'], [...], ...] (note that all elements are strings).
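
This answer stops at a single list of rows. To get the data and target arrays the question asks for, the rows could be sliced afterwards. A small sketch, with two sample rows written out by hand in place of the parsed file:

```python
# Rows in the same shape as `out` above (all strings).
out = [
    ['faban', '1', '0', '0.288', 'withspy'],
    ['faban', '2', '0', '0.243', 'withoutspy'],
]

data = [row[:-1] for row in out]   # every column except the last
target = [row[-1] for row in out]  # only the last column
```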

【Discussion】:

【Solution 3】:

I think numpy has a clean, simple solution here.

>>> import numpy as np
>>> data, target = np.array_split(np.loadtxt('file', dtype=str), [-1], axis=1)

Result:

>>> data.tolist()
[['faban', '1', '0', '0.288'], 
 ['faban', '2', '0', '0.243'], 
 ['simulated', '1', '0', '0.159'], 
 ['faban', '1', '1', '0.189']]
>>> target.flatten().tolist()
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']

【Discussion】:

【Solution 4】:

You can use pandas to read the data, iloc to subset it, values to get the values from the DataFrame, and the tolist method to convert the numpy array to a list:

import pandas as pd
df = pd.read_table('path_to_your_file', delim_whitespace=True, header=None)
print(df)
           0  1  2      3           4
0      faban  1  0  0.288     withspy
1      faban  2  0  0.243  withoutspy
2  simulated  1  0  0.159  withoutspy
3      faban  1  1  0.189  withoutspy


data = df.iloc[:,:-1].values.tolist()
target = df.iloc[:,-1].tolist()

print(data)
[['faban', 1, 0, 0.28800000000000003],
 ['faban', 2, 0, 0.243],
 ['simulated', 1, 0, 0.159],
 ['faban', 1, 1, 0.18899999999999997]]

print(target)
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']

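【Discussion】:

read_table was deprecated in pandas 0.24 (the deprecation was later reversed). An equivalent with read_csv: pd.read_csv('path_to_your_file', sep=r'\s+', header=None). Note that sep='\t' only works for tab-delimited files, while the sample data here is separated by spaces. As an extra note, you can name the columns with names=[...], passing one name per column, e.g. names=['foo', 'bar', 'baz', 'whatever', 'target'].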
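
Putting the comment above together, a read_csv version might look like the sketch below. The column names are made up for illustration, and the data is read from an in-memory string so the example runs without a file:

```python
import io

import pandas as pd

# Sample rows from the question, held in memory so the sketch is self-contained.
SAMPLE = """\
faban       1   0   0.288   withspy
faban       2   0   0.243   withoutspy
simulated   1   0   0.159   withoutspy
faban       1   1   0.189   withoutspy
"""

# Hypothetical column names, one per column in the file.
columns = ['source', 'a', 'b', 'score', 'target']

# sep=r'\s+' treats any run of whitespace as one delimiter.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r'\s+', header=None, names=columns)

data = df.iloc[:, :-1].values.tolist()  # all columns except the last
target = df['target'].tolist()          # last column, selected by name
```

Unlike the pure-Python versions, pandas infers a type per column here, so the numeric columns come back as numbers rather than strings.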