Convert a txt file to csv, separating specific lines into columns

【Question title】: Convert txt file to csv, separating specific lines into columns
【Posted】: 2022-11-23 06:24:41
【Question description】:

I currently have data like this:

noms sommets
0000 Abbesses
0001 Alexandre Dumas
...
coord sommets
0000 308 536
0001 472 386
0002 193 404

When converted to CSV, the data should look like this, with three columns:

nom	sommets	coord sommets
0000	Abbesses	308 536

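However, everything in the data comes as one continuous run of lines, which makes it hard to work with. What is the solution? I am trying to convert it from txt to csv.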

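【Comments】:

What does your data look like right now?
Ah, sorry, I just edited the question.
Does this answer your question? converting TXT to CSV python
I don't think so.
Is the data provided in the same file, or are they separate files?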

Tags: python csv txt

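【Solution 1】:

You can do this without any imports.

Because of noise in the data, there are a few safety checks.

I am also using a dictionary, because looking up key/value pairs in a dict is very fast.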

with open("metro", encoding="latin-1") as infile:
    data = infile.read().splitlines()

nom_start = "noms sommets"
coord_start = "coord sommets"
end = "arcs values"
mode = None

# use a dict as lookups on dicts are stupidly fast.
result = {}

for line in data:
    # this one is needed due to the first line
    if mode is None:
        if line == nom_start:
            mode = nom_start
        continue
    line = line.strip()
    # safety check
    if line != "":
        if line == end:
            # skip the end data
            break
        key, value = line.split(maxsplit=1)
        if mode == nom_start:
            if line != coord_start:
                result[key] = {"sommets": value}
            else:
                mode = coord_start
        else:
            result[key]["coord sommets"] = value


# CSV separator
SEP = ";"
with open("output.csv", "w", encoding="latin-1") as outfile:
    # CSV header
    outfile.write(f"noms{SEP}sommets{SEP}coord sommets\n")
    for key, val in result.items():
        outfile.write(f'{key}{SEP}{val["sommets"]}{SEP}{val["coord sommets"]}\n')
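
As a possible variation on the writing step above, the standard `csv` module can build the rows and quote any field that happens to contain the separator. This is only a sketch: the function name `result_to_csv` is an assumption, the `result` dict merely mimics the shape built by the answer, and the `9999` entry is a made-up row included solely to show the quoting.

```python
import csv
import io

# Same shape as the `result` dict built above. The last entry is a made-up
# row whose name contains the separator, included only to show the quoting.
result = {
    "0000": {"sommets": "Abbesses", "coord sommets": "308 536"},
    "0001": {"sommets": "Alexandre Dumas", "coord sommets": "472 386"},
    "9999": {"sommets": "Hypothetical;Name", "coord sommets": "0 0"},
}


def result_to_csv(result, sep=";"):
    """Serialize the dict to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerow(["noms", "sommets", "coord sommets"])
    for key, val in result.items():
        # .get() avoids a KeyError when a key appears in only one section.
        writer.writerow([key, val.get("sommets", ""), val.get("coord sommets", "")])
    return buffer.getvalue()


print(result_to_csv(result))
```

When writing to a real file instead of a `StringIO`, the `csv` documentation recommends opening it with `newline=""`, for example `open("output.csv", "w", newline="", encoding="latin-1")`.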

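【Discussion】:

【Solution 2】:

Interesting problem. I assume the file contains more columns, or key/variable sets, than the example shows, so you don't want to hard-code the column names.

I would create a new empty DataFrame, then read the input file line by line and check whether each line is the next new column name (one that does not start with digits). For each column I would build a dictionary from its values, then merge that dictionary into the new DataFrame as a new column.

So I would do something like this: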

import pandas as pd

# create an Empty DataFrame object
df_new = pd.DataFrame({"recordkey": []})

# read all input lines
inputfilename = "inputfile.txt"
with open(inputfilename, "r") as file1:
    Lines = file1.readlines()

tmpdict = {}
colname = ""

# iterate through all lines
for idx in range(len(Lines)):
    line = Lines[idx]
    # this is assuming all keys are exactly 4 digits
    iscolname = not (line[:4].isdigit())
    
    if not iscolname:
        # split on the first space for key and value
        tmp = line.split(" ", 1)
        getkey = tmp[0].strip()
        getvalue = tmp[1].strip()

        # add to dictionary
        tmpdict[getkey] = getvalue

    # new column or last line
    if iscolname or idx == len(Lines)-1:
        # new column (except skip for first line of file)
        if colname != "":
            # create new column from dictionary
            df_tmp = pd.DataFrame(tmpdict.items(), columns=["recordkey", colname])
            df_new = df_new.merge(df_tmp, how='outer', on='recordkey')

        # keep new column name
        colname = line.strip()
        tmpdict = {}

# display dataframe
print(df_new)

# write dataframe to csv
fileoutput = "outputfile.csv"
df_new.to_csv(fileoutput, sep=",", index=False)
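
For readers who would rather avoid pandas, the same idea (treat any line that does not start with digits as a new column name, collect key/value pairs per column, then join on the key) can be sketched with the standard library. The function names `parse_sections` and `sections_to_rows` are assumptions made for this illustration, and like the code above it assumes that every record key is exactly four digits.

```python
def parse_sections(text):
    """Map each header line to a {record key: value} dict, in file order."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if not line[:4].isdigit():
            # Not a record line, so treat it as the next column name.
            current = sections.setdefault(line, {})
            continue
        if current is None:
            continue  # record line before any header: ignore it
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    return sections


def sections_to_rows(sections):
    """Outer-join all sections on the record key, keeping first-seen order."""
    keys = []
    seen = set()
    for column in sections.values():
        for key in column:
            if key not in seen:
                seen.add(key)
                keys.append(key)
    header = ["recordkey", *sections]
    rows = [[key, *(col.get(key, "") for col in sections.values())] for key in keys]
    return header, rows


sample = """noms sommets
0000 Abbesses
0001 Alexandre Dumas
coord sommets
0000 308 536
0001 472 386
0002 193 404
"""

header, rows = sections_to_rows(parse_sections(sample))
print(header)
for row in rows:
    print(row)
```

A key that is missing from one section, such as `0002` in the sample, gets an empty string in that column, which mirrors what the outer merge does with `NaN`.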

【Discussion】:

【Solution 3】:

from pathlib import Path

import pandas as pd

f = Path("metro")

lines = [[], []]
file_num = 0

for line in f.read_text().split("\n"):
    if not line:
        continue
    if line.startswith("coord"):
        file_num = 1
    lines[file_num].append(line.split(maxsplit=1))


def get_df(data):
    df1 = pd.DataFrame(data)
    df1.columns = df1.iloc[0]
    df1 = df1.drop(index=0)
    df1.columns.name = None
    return df1


df1 = get_df(lines[0])
df2 = get_df(lines[1])

df2.columns = [df1.columns[0], " ".join(df2.columns)]

res = pd.merge(df1, df2, how="outer", on="noms")
#    noms          sommets coord sommets
# 0  0000         Abbesses       308 536
# 1  0001  Alexandre Dumas       472 386
# 2  0002              NaN       193 404
res.to_csv("data.csv")

Edit: to fix the encoding problem, pass the encoding you want to read_text().

for line in f.read_text(encoding="latin-1").split("\n"):
    ...
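
To illustrate why the encoding argument matters, here is a small sketch that reproduces this kind of decode failure and tries a list of encodings in order. The helper `decode_with_fallback` and the sample bytes are assumptions made for the example; they do not come from the asker's file.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the data")


# In latin-1, "é" is the single byte 0xE9. In UTF-8 that byte announces a
# multi-byte sequence, so a following ASCII character makes it invalid.
raw = "café au lait".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)

text, used = decode_with_fallback(raw)
print(used, text)
```

Note that latin-1 maps every possible byte to a character, so it never raises. That makes it a convenient last resort, but it will silently produce wrong characters if the file is actually in some other encoding.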

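【Discussion】:

What happens if the data is not utf-8? I get: 'utf-8' codec can't decode byte 0xe9 in position 81: invalid continuation byte
Which line gives you this problem? The pandas part should be independent of it. It's hard to say without seeing the full error.
The problem starts at 'for line in f.read_text().split("\n"):'. I think the problem is in the data; normally I would encode it as 'latin-1'.
One workaround is to read it as bytes with f.read_bytes(), but then you have to replace every str in the for loop with bytes, e.g. b'\n' and so on. Before creating the DataFrame, you also need to convert the parsed results from bytes to str. You should figure out why the data contains such characters.
Data sample: drive.google.com/file/d/18QGnzcJ4PYaVY_qnFZ1cwVlCNZc1zUSt/…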
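
The bytes-based workaround mentioned in the comments could look roughly like the sketch below: the parsing loop works on `bytes` throughout and decodes each piece to `str` only at the end, before any DataFrame is built. The function name `parse_bytes`, the default `latin-1` encoding, and the `0001` row with a non-ASCII name are assumptions made for illustration.

```python
def parse_bytes(raw, encoding="latin-1"):
    """Split raw bytes into two sections of [key, value] pairs, decoded to str."""
    sections = [[], []]
    index = 0
    for line in raw.split(b"\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(b"coord"):
            index = 1  # everything from here on belongs to the second section
        parts = line.split(maxsplit=1)
        # Decode only after parsing, so the loop never touches str.
        sections[index].append([part.decode(encoding) for part in parts])
    return sections


# Made-up sample: byte 0xE9 is "é" in latin-1 and is not valid UTF-8 here.
raw = b"noms sommets\n0000 Abbesses\n0001 R\xe9publique\ncoord sommets\n0000 308 536\n"

names, coords = parse_bytes(raw)
print(names)
print(coords)
```

With a real file, `raw` would come from `Path("metro").read_bytes()`. In most cases passing `encoding=` to `read_text()` is simpler; the bytes route mainly helps when the encoding is not known in advance.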