How to use numpy.genfromtxt when first column is string and the remaining columns are numbers? Answers

【Question title】: How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
【Posted】: 2021-10-29 00:37:43
【Question】:

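Basically, I have a bunch of data where the first column is a string (a label) and the remaining columns are numeric values. I run the following: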

data = numpy.genfromtxt('data.txt', delimiter = ',')

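This reads most of the data fine, but the label column just comes back as "nan". How can I deal with this?

【Question comments】:

What do you want to get for the label column?
@mgilson The label column is a string. Is that what you mean?
No, I mean what do you want to happen to that label column? Do you want it stored in the numpy array? Do you want it stored as a separate array? ...
Ultimately I want the extra column in its own separate array.

Tags: python numpy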

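【Solution 1】:

By default, np.genfromtxt uses dtype=float: that is why the string column is converted to NaN, because, after all, the labels are not numbers...

You can ask np.genfromtxt to try to guess the actual type of each column by using dtype=None (on Python 3, passing encoding=None as well makes the string column come back as str rather than bytes on older NumPy releases):

>>> import numpy as np
>>> from io import StringIO
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding=None)
>>> a
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '<U1'), ('f1', '<i8'), ('f2', '<i8')])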

You can access the columns by their field names, for example a['f0']...
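As a minimal sketch of that field access (the sample text and the variable names below are illustrative, not taken from the question's file), the label column can be pulled out as its own array while the numeric fields are read by name:

```python
import numpy as np
from io import StringIO

# Illustrative sample standing in for the question's data.txt
sample = "a,1,2\nb,3,4"

# dtype=None asks genfromtxt to infer one type per column;
# encoding=None keeps the string column as str rather than bytes
a = np.genfromtxt(StringIO(sample), delimiter=",", dtype=None, encoding=None)

labels = a["f0"]         # the string column as a 1-D array
first_numbers = a["f1"]  # the first numeric column

print(a.dtype.names)     # default field names assigned by genfromtxt
print(labels)
print(first_numbers)
```

The default names f0, f1, ... are assigned because no names were supplied; passing names=... to genfromtxt would replace them.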

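Using dtype=None is a good trick if you don't know what your columns should be. If you already know what types they should have, you can give an explicit dtype. For example, in our test we know that the first column is a string, the second an int, and we want the third to be a float. We would then use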

>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("U10", int, float))
array([('a', 1, 2.), ('b', 3, 4.)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])

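Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or an explicit, non-homogeneous dtype), you end up with a structured array.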
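Since the asker wanted the labels in a separate array, one way to split a structured array into labels plus a plain 2-D numeric array is sketched below. It assumes the helper numpy.lib.recfunctions.structured_to_unstructured is available (NumPy 1.16 or later); the sample text and variable names are illustrative.

```python
import numpy as np
from io import StringIO
from numpy.lib.recfunctions import structured_to_unstructured

# Illustrative sample: one label column followed by two numeric columns
sample = "a,1,2\nb,3,4"

rec = np.genfromtxt(StringIO(sample), delimiter=",",
                    dtype=("U10", int, float), encoding=None)

# The label field becomes its own 1-D string array
labels = rec["f0"]

# Select only the numeric fields, then flatten them into a regular 2-D float array
numbers = structured_to_unstructured(rec[["f1", "f2"]], dtype=float)

print(labels)
print(numbers)
print(numbers.shape)
```

Converting to a regular array this way makes ordinary slicing and arithmetic across columns possible, which a structured array does not support directly.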

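[Note: with dtype=None, the input is parsed a second time and the type of each column is updated to match the larger type possible: first we try a bool, then an int, then a float, then a complex, and if all else fails we keep a string. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regular expressions), but nothing has stuck so far.]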
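To see the guessing order in action, here is a small sketch on made-up data: one column that is all integers, one that mixes integers and decimals, and one that contains non-numeric text. The sample values are assumptions chosen for illustration.

```python
import numpy as np
from io import StringIO

# Column 0: only integers; column 1: ints and floats mixed; column 2: mostly text
sample = "1,1.5,x\n2,2,y\n3,4e3,3"

guessed = np.genfromtxt(StringIO(sample), delimiter=",",
                        dtype=None, encoding=None)

# dtype.kind is 'i' for signed integers, 'f' for floats, 'U' for str
kinds = [guessed.dtype[name].kind for name in guessed.dtype.names]
print(kinds)
```

The last column stays a string column even though one of its entries looks like a number, because a single type has to fit every row.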

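【Discussion】:

This is so much easier than the experiments with dtype I was attempting that it isn't even funny. :^) I was about to build a dict instead of using None... sigh
A lot of people have argued for dtype=None as the default. It would break backwards compatibility, though, so we are stuck with dtype=float. And yes, it's quite powerful, but its performance takes a hit...
@PierreGM: maybe Sniffer(**hints).sniff_dtype(sample) could be an efficient solution: no need to read all the input twice or to hard-code the dtype.
Good idea, I'll have to look into it. In any case, np.genfromtxt still needs work. Quotes are not handled properly, for example...
In Python 3 the import is from io, not from StringIO.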

【Solution 2】:

If your data file is structured like this

col1, col2, col3
   1,    2,    3
  10,   20,   30
 100,  200,  300

then numpy.genfromtxt can interpret the first line as column headers using the names=True option. With this you can access the data very conveniently by providing the column header:

data = np.genfromtxt('data.txt', delimiter=',', names=True)
print(data['col1'])    # array([   1.,   10.,  100.])
print(data['col2'])    # array([   2.,   20.,  200.])
print(data['col3'])    # array([   3.,   30.,  300.])

Since in your case the data is laid out like this

row1,   1,  10, 100
row2,   2,  20, 200
row3,   3,  30, 300

you can achieve something similar with the following code snippet:

labels = np.genfromtxt('data.txt', delimiter=',', usecols=0, dtype=str)
raw_data = np.genfromtxt('data.txt', delimiter=',')[:,1:]
data = {label: row for label, row in zip(labels, raw_data)}

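The first line reads the first column (the labels) into an array of strings. The second line reads all data from the file but discards the first column, which would otherwise be all NaN. The third line uses a dictionary comprehension to create a dictionary that can be used very much like the structured array that numpy.genfromtxt creates with the names=True option:

print(data['row1'])    # array([   1.,   10.,  100.])
print(data['row2'])    # array([   2.,   20.,  200.])
print(data['row3'])    # array([   3.,   30.,  300.])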
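A variant of the same idea, sketched here on in-memory text, skips the label column at parse time with usecols instead of slicing it off afterwards. It assumes the number of numeric columns is known in advance (three here); the text and variable names are illustrative.

```python
import numpy as np
from io import StringIO

# In-memory stand-in for a data.txt laid out as in the example above
text = "row1,   1,  10, 100\nrow2,   2,  20, 200\nrow3,   3,  30, 300\n"

# Read only column 0, as strings
labels = np.genfromtxt(StringIO(text), delimiter=",", usecols=0,
                       dtype=str, encoding=None)

# Read only the numeric columns, so no NaN column is ever produced
values = np.genfromtxt(StringIO(text), delimiter=",", usecols=(1, 2, 3))

# Map each label to its row of numbers
data = dict(zip(labels, values))
print(data["row2"])
```

Both approaches read the file twice; the difference is only whether the label column is parsed as float and then thrown away.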

【Discussion】:

【Solution 3】:

data=np.genfromtxt(csv_file, delimiter=',', dtype='unicode')

works fine for me.
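Note that reading everything with a string dtype returns a 2-D array in which the numbers are strings too, so they still have to be converted. A minimal sketch of that follow-up step, using dtype=str and in-memory sample text as assumptions for illustration:

```python
import numpy as np
from io import StringIO

# Illustrative sample standing in for csv_file
text = "a,1,2\nb,3,4"

# Every cell, numeric or not, is read as a string
table = np.genfromtxt(StringIO(text), delimiter=",", dtype=str, encoding=None)

labels = table[:, 0]                  # first column: the labels
values = table[:, 1:].astype(float)   # remaining columns converted to floats

print(labels)
print(values)
```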

【Discussion】:

【Solution 4】:

For a dataset in this format:

CONFIG000   1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98
CONFIG001   414.6   421.76  418.93  415.53  415.23  416.12  420.54  415.42
CONFIG010   1091.43 1079.2  1086.61 1086.58 1091.14 1080.58 1076.64 1083.67
CONFIG011   391.31  392.96  391.24  392.21  391.94  392.18  391.96  391.66
CONFIG100   1067.08 1062.1  1061.02 1068.24 1066.74 1052.38 1062.31 1064.28
CONFIG101   371.63  378.36  370.36  371.74  370.67  376.24  378.15  371.56
CONFIG110   1060.88 1072.13 1076.01 1069.52 1069.04 1068.72 1064.79 1066.66
CONFIG111   350.08  350.69  352.1   350.19  352.28  353.46  351.83  350.94

This code works for my application:

import numpy

def ShowData(data, names):
    # Print each row label followed by that row's values, one per line
    i = 0
    while i < data.shape[0]:
        print(names[i] + ": ")
        j = 0
        while j < data.shape[1]:
            print(data[i][j])
            j += 1
        print("")
        i += 1

def Main():
    print("The sample data is: ")
    fname = 'ANOVA.csv'
    # First pass: read every cell as a string to get the labels and the shape
    csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
    num_rows = csv.shape[0]
    num_cols = csv.shape[1]
    names = csv[:,0]
    # Second pass: read only the numeric columns as floats
    data = numpy.genfromtxt(fname, usecols = range(1,num_cols), delimiter=",")
    print(names)
    print(str(num_rows) + "x" + str(num_cols))
    print(data)
    ShowData(data, names)

if __name__ == "__main__":
    Main()

Python 2 output:

The sample data is:
['CONFIG000' 'CONFIG001' 'CONFIG010' 'CONFIG011' 'CONFIG100' 'CONFIG101'
 'CONFIG110' 'CONFIG111']
8x9
[[ 1080.65  1080.87  1068.76  1083.52  1084.96  1080.31  1081.75  1079.98]
 [  414.6    421.76   418.93   415.53   415.23   416.12   420.54   415.42]
 [ 1091.43  1079.2   1086.61  1086.58  1091.14  1080.58  1076.64  1083.67]
 [  391.31   392.96   391.24   392.21   391.94   392.18   391.96   391.66]
 [ 1067.08  1062.1   1061.02  1068.24  1066.74  1052.38  1062.31  1064.28]
 [  371.63   378.36   370.36   371.74   370.67   376.24   378.15   371.56]
 [ 1060.88  1072.13  1076.01  1069.52  1069.04  1068.72  1064.79  1066.66]
 [  350.08   350.69   352.1    350.19   352.28   353.46   351.83   350.94]]
CONFIG000:
1080.65
1080.87
1068.76
1083.52
1084.96
1080.31
1081.75
1079.98

CONFIG001:
414.6
421.76
418.93
415.53
415.23
416.12
420.54
415.42

CONFIG010:
1091.43
1079.2
1086.61
1086.58
1091.14
1080.58
1076.64
1083.67

CONFIG011:
391.31
392.96
391.24
392.21
391.94
392.18
391.96
391.66

CONFIG100:
1067.08
1062.1
1061.02
1068.24
1066.74
1052.38
1062.31
1064.28

CONFIG101:
371.63
378.36
370.36
371.74
370.67
376.24
378.15
371.56

CONFIG110:
1060.88
1072.13
1076.01
1069.52
1069.04
1068.72
1064.79
1066.66

CONFIG111:
350.08
350.69
352.1
350.19
352.28
353.46
351.83
350.94

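【Discussion】:

【Solution 5】:

You can use numpy.recfromcsv(filename): the type of each column is determined automatically (as if you used np.genfromtxt() with dtype=None), and delimiter="," is the default. It is basically a shortcut for the np.genfromtxt(filename, delimiter=",", dtype=None) call that Pierre GM pointed out in his answer.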
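One caveat, stated as far as I know rather than as a guarantee: recfromcsv also defaults to names=True, so it reads the first line as a header, and newer NumPy releases may no longer expose it in the top-level namespace. A rough genfromtxt equivalent is sketched below on an illustrative sample that does have a header row; the column names are made up.

```python
import numpy as np
from io import StringIO

# Illustrative CSV text with a header line (not the question's file)
text = "label,x,y\na,1,2.5\nb,3,4.5"

# names=True takes field names from the first line; dtype=None guesses types
rec = np.genfromtxt(StringIO(text), delimiter=",", dtype=None,
                    names=True, encoding=None)

# Viewing as a recarray allows attribute-style access to the fields
rec = rec.view(np.recarray)

print(rec.label)
print(rec.x)
print(rec.y)
```

For a file without a header, such as the one in the question, names=True would consume the first data row, so it should be left out there.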

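【Discussion】:

【Solution 6】:

Here is a working example from start to finish:

Suppose I want to import the numbers from a file while skipping its first line: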

 I like trains #this is the first line, a string

1 \t 2 \t 3   #\t is to signify that the delimiter (separator) is a tab and not a comma

4 \t 5 \t 6

then run the following code:

import numpy as np              # provides genfromtxt
from pathlib import Path        # builds file paths so the folder is not retyped for every file in it

path = Path(r'some_path')       # folder holding your files, like r'C:my comp\folder\folder2'; the r prefix makes it a raw string, so the backslashes of a Windows path are read as plain text
fileNames = ['I like trains.txt',
             'die potato.txt']

# skip_header=1 drops the first (text) line; the rest is split on tabs
data = np.genfromtxt(path / fileNames[0], delimiter='\t', skip_header=1)

which produces this result:

data = [1 2 3
        4 5 6]

Each number sits in its own cell and can be accessed individually.
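A self-contained sketch of the same read, using in-memory text instead of a file on disk (an assumption made so the example runs anywhere), shows how individual cells can be addressed:

```python
import numpy as np
from io import StringIO

# Same layout as the sample file: one text line, then tab-separated numbers
text = "I like trains\n1\t2\t3\n4\t5\t6\n"

# skip_header=1 ignores the first line; the default dtype is float
data = np.genfromtxt(StringIO(text), delimiter="\t", skip_header=1)

print(data.shape)   # rows x columns
print(data[0, 1])   # first row, second column
print(data[:, 2])   # the whole third column
```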

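【Discussion】:

A for loop can be used to iterate over all the file names, and a list comprehension can be used to build many lists at once, but that gets very messy when dealing with a lot of files, so giving each file its own list makes things easier to follow.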