How to read a string with newlines and store it in a Pandas dataframe or Python list: Answers

【Question title】: How to read a string with newlines and store into Pandas dataframe or python list
【Posted】: 2020-07-28 18:13:23
【Question description】:

I read in a large text file that uses a custom data format, like this:

file_object = open(file, "r")
contents = file_object.read()

Printing contents gives the following (the whole "object" is just one string with newlines):

object name {
    # Data Type 1
    burgers [taste="good" type="food"];
    sushi [taste="good" type="food"];

    # Data Type 2
    NYC [population="300" type="urban"];
    
    # Data Type 3
    NYC -> CHI [distance="15.0"];

    LA -> SF [distnace="2.0"];
}

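The data is split into 3 sections, each marked by a #. Blank lines may appear inconsistently within and between sections, so I want to first remove all empty lines, and then work out how to remove the tabs/spaces in front of each line of data, like this: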

object name {
# Data Type 1
burgers [taste="good" type="food"];
sushi [taste="good" type="food"];
# Data Type 2
NYC [population="300" type="urban"];
# Data Type 3
NYC -> CHI [distance="15.0"];
LA -> SF [distnace="2.0"];
}

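From there I want to work out how to split it into the 3 corresponding sections. I am not sure which data structure is best, since the format is all over the place (or whether there is an easier way to read this thing). Any suggestions would be much appreciated!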

【Question comments】:

You may not need to read and preprocess everything first: for line in file_object: line=line.strip() if 0==len(line): continue is a fairly common idiom.
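The idiom from the comment above, laid out as a small runnable sketch. It assumes `io.StringIO` as a stand-in for the opened file so that it runs without a file on disk, and the sample text is abbreviated from the question; the name `cleaned` is only illustrative.

```python
import io

# Stand-in for the opened file; a real file object iterates line by line the same way.
sample = (
    'object name {\n'
    '    # Data Type 1\n'
    '    burgers [taste="good" type="food"];\n'
    '\n'
    '    \n'
    '}\n'
)
file_object = io.StringIO(sample)

cleaned = []
for line in file_object:
    line = line.strip()   # drop leading/trailing whitespace, including the newline
    if 0 == len(line):    # skip lines that were empty or whitespace-only
        continue
    cleaned.append(line)

print(cleaned)
```

Because the check happens after `strip()`, lines that contain only spaces or tabs are skipped as well as truly empty ones.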

Tags: python pandas dataframe file data-structures

【Solution 1】:

The code is as follows:

contents = """
object name {
    # Data Type 1
    burgers [taste="good" type="food"];
    sushi [taste="good" type="food"];

    # Data Type 2
    NYC [population="300" type="urban"];

    # Data Type 3
    NYC -> CHI [distance="15.0"];

    LA -> SF [distnace="2.0"];
}
"""

all_lines = contents.split("\n")

# strip each line and keep it only if something is left (drops whitespace-only lines too)
selected_lines = [line.strip() for line in all_lines if line.strip()]

new_contents = "\n".join(selected_lines)

print(new_contents)

The result is in new_contents.

Edit (in reply to a comment):

At this point you can split the string into sections:

lines = new_contents.split("\n")

# remove first and last lines
lines = lines[1:-1]

sections = {}
for line in lines:
    if line.startswith("#"):
        # create new key (Data Type X), dropping the leading "# "
        key = line[2:]
        # value of new key is an empty list
        sections[key] = []
    else:
        # append row to the current key (Data Type X)
        sections[key].append(line)

print(sections)

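【Comments】:

How do I put these 3 sections into a data structure, such as a list of string lines or a dataframe?
See my edit - sections will be a dictionary, with each key holding a list of strings. Perhaps you can use it to convert sections into a dataframe: pandas.pydata.org/pandas-docs/stable/reference/api/…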
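One possible way to get from the sections dictionary to something tabular is to flatten it into one record per data line. This is only a sketch: the input below is an abbreviated, hand-written copy of what the loop in the answer produces, and the field names `section` and `line` are hypothetical choices, not anything the answer prescribes.

```python
# Abbreviated stand-in for the `sections` dictionary built in the answer.
sections = {
    "Data Type 1": [
        'burgers [taste="good" type="food"];',
        'sushi [taste="good" type="food"];',
    ],
    "Data Type 2": ['NYC [population="300" type="urban"];'],
}

# One record per data line, tagged with the section it came from.
rows = [
    {"section": name, "line": line}
    for name, lines in sections.items()
    for line in lines
]

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed to the `pandas.DataFrame` constructor, which should give a frame with a `section` column and a `line` column, one row per data line.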

【Solution 2】:

I would process the lines as they are read and store them in a dictionary of lists, keyed by section header:

with open(file, "r") as file_object:
    data = {}
    section = None
    for line in file_object:
        line = line.strip()
        if 0 == len(line):
            continue
        if line.startswith('#'):
            section = []
            data[line[1:]] = section
        elif section is not None:
            section.append(line)

You should end up with the following dictionary:

{' Data Type 1': [
    'burgers [taste="good" type="food"];',
    'sushi [taste="good" type="food"];'
    ],

 ' Data Type 2': ['NYC [population="300" type="urban"];'],
 ' Data Type 3': [
    'NYC -> CHI [distance="15.0"];',
    'LA -> SF [distnace="2.0"];',
    '}'
    ]
}
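Note that the last list in that result ends with '}', because the closing brace of the object is appended like any other line, and that each key keeps the space that followed the '#'. A minimal sketch of one way to avoid both, assuming the wrapper lines are exactly those that end in '{' or equal '}' (an assumption about the format, not something the question states); `parse_sections` is a hypothetical helper name.

```python
def parse_sections(lines):
    """Group stripped, non-empty lines under the preceding '#' header."""
    data = {}
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed wrapper syntax: 'object name {' opens the object, '}' closes it.
        if line.endswith("{") or line == "}":
            continue
        if line.startswith("#"):
            section = []
            data[line[1:].strip()] = section  # key without '#' and its padding
        elif section is not None:
            section.append(line)
    return data


# Sample text abbreviated from the question (attribute spelling kept as posted).
text = """object name {
    # Data Type 3
    NYC -> CHI [distance="15.0"];

    LA -> SF [distnace="2.0"];
}"""
print(parse_sections(text.splitlines()))
```

The function accepts any iterable of lines, so an open file object could be passed in place of `text.splitlines()`.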

【Comments】: