【Question title】: Pandas remove line break and convert text file to CSV
【Posted】: 2021-02-22 15:31:36
【Question】:

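I have a text file from which I want to remove the line breaks and add a header so that it can be converted to a CSV file.

The file looks like this: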

3G LOJISTIK VE HAVACILIK HIZMETLARI LTD., No. 3/182 Altintepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions [SDGT]
[IFSR] (Linked To: MAHAN AIR).

7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia) [SDNTK].

The code I am using:

import pandas as pd

sdnlist = pd.DataFrame(pd.read_csv('sndlist.txt',delimiter="\t"))
sdnlist.to_csv('sdnlist.csv',index=False)
colnames=["a","b", "c", "d"]
sndlist_data = pd.read_csv("sdnlist.csv",names=colnames)
sndlist_data.head()

The desired output simply splits everything on commas, where (a, b, c, ...) are the header names:

  a        b            c        d         c           

3G LO...  No. 3/18.... Ista.... Turk..... Sancti... - Subject to....

A sample of the text file is available on Pastebin.

The full text file is taken from the following link: FULL SDN TEXT

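【Comments】:

  • Could you add what the desired output CSV should look like for that sample? It would also help to show how you have tried to process it.
  • Hey @MartinEvans, the desired output CSV would be split into fields by commas, as is normal for a CSV file. I have edited my original question with an example.

Tags: python pandas csv split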


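【Solution 1】:

You can use Python's itertools.groupby() function to read a whole block at a time. Each block can then be joined into a single line and split at what appear to be the separators, namely commas and semicolons. A regular expression can locate a comma inside parentheses and replace it with a different character, e.g. -.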
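To see the grouping step in isolation, here is a minimal sketch that runs groupby() over an in-memory list instead of the file. The names `sample_lines` and `iter_blocks` are hypothetical and only serve the illustration; the sample text is the two records quoted in the question.

```python
from itertools import groupby

# Hypothetical in-memory stand-in for the lines of sdnlist.txt,
# using the two records shown in the question.
sample_lines = [
    "3G LOJISTIK VE HAVACILIK HIZMETLARI LTD., No. 3/182 Altintepe\n",
    "Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional\n",
    "Sanctions Information - Subject to Secondary Sanctions [SDGT]\n",
    "[IFSR] (Linked To: MAHAN AIR).\n",
    "\n",
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,\n",
    "Colombia; Matricula Mercantil No 1978075 (Colombia) [SDNTK].\n",
]


def iter_blocks(lines):
    """Yield each run of consecutive non-blank lines as one joined string."""
    # The key is True for text lines and False for blank lines, so groupby()
    # alternates between "record" runs and "separator" runs.
    for is_text, block in groupby(lines, key=lambda line: line.strip() != ""):
        if is_text:
            yield " ".join(line.strip() for line in block)


blocks = list(iter_blocks(sample_lines))
for block in blocks:
    print(block)
```

Each printed line is one complete record with its internal line breaks removed. The full script that follows applies the same idea directly to the file and then splits each record into fields.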

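from itertools import groupby
import csv
import io
import re

with open('sdnlist.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(list('abcdefg'))  # header row: a, b, ..., g

    # groupby() yields alternating runs of non-blank (key is True) and blank lines
    for key, block in groupby(f_input, lambda x: x.strip() != ''):
        if key:
            # Join the block into a single line and treat semicolons as commas
            single_line = ' '.join(block).replace('\n', '').replace(';', ',')
            # Replace a comma inside parentheses with '-' (raw string avoids invalid-escape warnings)
            single_line = re.sub(r'(\([^)]*?)(,)([^)]*?\))', r'\1-\3', single_line)
            # Let the csv module split the line into fields
            row = next(csv.reader(io.StringIO(single_line), skipinitialspace=True))
            csv_output.writerow(row)
            #print('\n'.join(row) + '\n')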

This should give you the following output:

a,b,c,d,e,f,g
3G LOJISTIK VE HAVACILIK HIZMETLARI LTD.,No. 3/182 Altintepe Bagdat Cad. Istasyon Yolu Sok.,Istanbul 34840,Turkey,Additional Sanctions Information - Subject to Secondary Sanctions [SDGT] [IFSR] (Linked To: MAHAN AIR).
7 KARNES,Avenida Ciudad de Cali No. 15A-91,Local A06-07,Bogota,Colombia,Matricula Mercantil No 1978075 (Colombia) [SDNTK].
7 MAKARA PHARY CO.,LTD.,Deaum Mien,Daeum Mien,Ta Khmau,Kandal 8252,Cambodia,Company Number 00037307 (Cambodia) [GLOMAG] (Linked To: SOPHARY- Kim).
7TH OF TIR (a.k.a. 7TH OF TIR COMPLEX- a.k.a. 7TH OF TIR INDUSTRIAL COMPLEX,a.k.a. 7TH OF TIR INDUSTRIES,a.k.a. 7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN,a.k.a. MOJTAMAE SANATE HAFTOME TIR,a.k.a. SANAYE HAFTOME TIR,a.k.a. SEVENTH OF TIR),Mobarakeh Road Km 45,Isfahan,Iran,P.O. Box 81465-478,Isfahan,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
7TH OF TIR COMPLEX (a.k.a. 7TH OF TIR- a.k.a. 7TH OF TIR INDUSTRIAL COMPLEX,a.k.a. 7TH OF TIR INDUSTRIES,a.k.a. 7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN,a.k.a. MOJTAMAE SANATE HAFTOME TIR,a.k.a. SANAYE HAFTOME TIR,a.k.a. SEVENTH OF TIR),Mobarakeh Road Km 45,Isfahan,Iran,P.O. Box 81465-478,Isfahan,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
7TH OF TIR INDUSTRIAL COMPLEX (a.k.a. 7TH OF TIR- a.k.a. 7TH OF TIR COMPLEX,a.k.a. 7TH OF TIR INDUSTRIES,a.k.a. 7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN,a.k.a. MOJTAMAE SANATE HAFTOME TIR,a.k.a. SANAYE HAFTOME TIR,a.k.a. SEVENTH OF TIR),Mobarakeh Road Km 45,Isfahan,Iran,P.O. Box 81465-478,Isfahan,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
7TH OF TIR INDUSTRIES (a.k.a. 7TH OF TIR- a.k.a. 7TH OF TIR COMPLEX,a.k.a. 7TH OF TIR INDUSTRIAL COMPLEX,a.k.a. 7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN,a.k.a. MOJTAMAE SANATE HAFTOME TIR,a.k.a. SANAYE HAFTOME TIR,a.k.a. SEVENTH OF TIR),Mobarakeh Road Km 45,Isfahan,Iran,P.O. Box 81465-478,Isfahan,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN (a.k.a. 7TH OF TIR- a.k.a. 7TH OF TIR COMPLEX,a.k.a. 7TH OF TIR INDUSTRIAL COMPLEX,a.k.a. 7TH OF TIR INDUSTRIES,a.k.a. MOJTAMAE SANATE HAFTOME TIR,a.k.a. SANAYE HAFTOME TIR,a.k.a. SEVENTH OF TIR),Mobarakeh Road Km 45,Isfahan,Iran,P.O. Box 81465-478,Isfahan,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
8TH IMAM INDUSTRIES GROUP (a.k.a. CRUISE MISSILE INDUSTRY GROUP- a.k.a. CRUISE SYSTEMS INDUSTRY GROUP,a.k.a. NAVAL DEFENCE MISSILE INDUSTRY GROUP,a.k.a. SAMEN AL-A'EMMEH INDUSTRIES GROUP),Tehran,Iran,Additional Sanctions Information - Subject to Secondary Sanctions [NPWMD] [IFSR].
"14 STAR SHIPPING MANAGEMENT (a.k.a. FOURTEEN STAR SHIPPING MANAGEMENT- a.k.a. ""FOURTEEN STARS"")",United Arab Emirates,Additional Sanctions Information - Subject to Secondary Sanctions [SDGT] (Linked To: MEHDI GROUP).
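Note that the rows above do not all have the same number of fields, and several have more than the seven header names a to g. Tools that expect a rectangular table may struggle with that. One possible workaround, sketched here under assumptions, is to pad every row to the width of the longest one and generate a matching header. The helper `pad_rows` and the `field_N` column names are hypothetical, and `raw` stands in for two data rows of output.csv.

```python
import csv
import io

# Hypothetical stand-in for two data rows of output.csv (taken from the output above).
raw = (
    "3G LOJISTIK VE HAVACILIK HIZMETLARI LTD.,No. 3/182 Altintepe Bagdat Cad. "
    "Istasyon Yolu Sok.,Istanbul 34840,Turkey,Additional Sanctions Information - "
    "Subject to Secondary Sanctions [SDGT] [IFSR] (Linked To: MAHAN AIR).\n"
    "7 KARNES,Avenida Ciudad de Cali No. 15A-91,Local A06-07,Bogota,Colombia,"
    "Matricula Mercantil No 1978075 (Colombia) [SDNTK].\n"
)


def pad_rows(rows, fill=""):
    """Return (header, rows) where every row has the width of the longest row."""
    width = max((len(row) for row in rows), default=0)
    # Assumed naming scheme: field_1, field_2, ... (not part of the original answer).
    header = [f"field_{i}" for i in range(1, width + 1)]
    padded = [row + [fill] * (width - len(row)) for row in rows]
    return header, padded


rows = list(csv.reader(io.StringIO(raw)))
header, padded = pad_rows(rows)

# Write the rectangular result back out as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(padded)
print(out.getvalue())
```

Padding only makes the file rectangular. It does not make a given column mean the same thing in every row.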

You will still have difficulty picking out the addresses.
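One limitation is visible in the output above: the pattern consumes a whole parenthesised group per match, so only the first comma in each group becomes -, and long a.k.a. lists are still split. A possible variation, offered as a sketch rather than as part of the original answer, passes a function to re.sub() so that every comma in a group is replaced. The names `PAREN_GROUP` and `protect_commas` are hypothetical, and the sample line is an abbreviated reconstruction of one record after the semicolons have been converted to commas.

```python
import csv
import io
import re

# Matches one (...) group that contains no nested parentheses.
PAREN_GROUP = re.compile(r"\([^()]*\)")


def protect_commas(text, replacement="-"):
    """Replace every comma inside each non-nested parenthesised group."""
    return PAREN_GROUP.sub(
        lambda match: match.group(0).replace(",", replacement), text
    )


# Abbreviated, reconstructed sample line (an assumption for illustration).
line = (
    "8TH IMAM INDUSTRIES GROUP (a.k.a. CRUISE MISSILE INDUSTRY GROUP, "
    "a.k.a. CRUISE SYSTEMS INDUSTRY GROUP, a.k.a. NAVAL DEFENCE MISSILE "
    "INDUSTRY GROUP), Tehran, Iran"
)

protected = protect_commas(line)
row = next(csv.reader(io.StringIO(protected), skipinitialspace=True))
print(row)
```

With nested parentheses, only the innermost groups are matched, so commas in an outer group would still act as separators. Records containing double quotes may also need separate handling by the csv reader.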

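【Comments】:

  • Hey @MartinEvans, thanks for the detailed answer. Some fields fit well, but others do not, for example this one: 14 STAR SHIPPING MANAGEMENT (a.k.a. FOURTEEN STAR SHIPPING MANAGEMENT,"a.k.a. ""FOURTEEN STARS"")",United Arab Emirates,Additional Sanctions Information - Subject to Secondary Sanctions [SDGT] (Linked To: MEHDI GROUP). Is there a way to ignore the commas inside (), since the code splits on those as well?
  • Probably yes, but you would need to edit your question to include more of the problematic sample lines (and update your desired output to show exactly where the splits should fall). Alternatively, you could post a link to the file using something like pastebin.com.
  • I have just added a sample from the text file on Pastebin, along with a link to the full data.
  • That data is certainly not easy to parse. You will still have difficulty extracting the address columns. I have added code to remove the commas inside parentheses.
  • Thank you very much for your help. I was wondering whether there is a way to automate this process for files similar to this one (unstructured text files).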