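【Question Title】: Split each line in a file based on delimiters
【Posted】: 2022-01-19 21:35:31
【Question】:

This is sample data from a file. I want to split each line in the file and add it to a DataFrame. In some cases a parent has more than one child, so whenever there are multiple children, Child2 Name and DOB columns must be added as well.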

(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024
(P324) Shiva Bhupati 01/01/1994 – Vinitha B 04/08/2024
(P356) Karthikeyan chandrashekar 22/02/1991 – Kanishka P 10/03/2014
(P366) Kalyani Manoj 23/01/1975 - Vandana M 15/05/1995 - Chandana M 18/11/1998 

This is the code I tried, but it only splits by considering "-":

with open("text.txt") as read_file:
    file_contents = read_file.readlines()
content_list = []
temp = []
for each_line in file_contents:
    temp = each_line.replace("–", " ").split()

    content_list.append(temp)

print(content_list)

Current output:

[['(P322)', 'Rashmika', 'Chadda', '15/05/1995', 'Rashmi', 'Chadda', 'Teega', '12/02/2024'], ['(P324)', 'Shiva', 'Bhupati', '01/01/1994', 'Vinitha', 'B', 'Sahu', '04/08/2024'], ['(P356)', 'Karthikeyan', 'chandrashekar', '22/02/1991', 'Kanishka', 'P', '10/03/2014'], ['(P366)', 'Kalyani', 'Manoj', '23/01/1975', '-', 'Vandana', 'M', '15/05/1995', '-', 'Chandana', 'M', '18/11/1998']]

The final output should look like this:

Code Parent_Name DOB Child1_Name DOB Child2_Name DOB
P322 Rashmika Chadda 15/05/1995 Rashmi C 12/02/2024
P324 Shiva Bhupati 01/01/1994 Vinitha B 04/08/2024
P356 Karthikeyan chandrashekar 22/02/1991 Kanishka P 10/03/2014
P366 Kalyani Manoj 23/01/1975 Vandana M 15/05/1995 Chandana M 18/11/1998

【Comments】:

  • You need to pass an argument to split. The schema of the resulting data is also broken, because you have 3 columns with the same name, "DOB".

Tags: python split nlp delimiter
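
To illustrate the first point of that comment, here is a minimal sketch. The sample line is copied from the question; the variable names are only illustrative. With no argument, `str.split()` breaks on every run of whitespace, so names and dates come apart, while passing the separator keeps each person's fields together.

```python
line = "(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024"

# No argument: split on any whitespace, so every word becomes its own item.
by_whitespace = line.split()

# With an argument: split only where an en dash is surrounded by spaces.
by_dash = line.split(" – ")

print(by_whitespace)
print(by_dash)
```

Note that this only handles the en dash; the last sample line in the question uses a plain hyphen, so it would need a second separator or a regular expression.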


【Solution 1】:

I'm not sure whether you want the result as a list or as something else. To get a list:

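import re

# assumes text.txt from the question, saved as UTF-8
with open("text.txt", encoding="utf-8") as read_file:
    text = read_file.readlines()

result = []
for t in text:
    # remove the \n at the end of each line
    t = t.strip()
    if not t:
        continue
    # remove the parentheses you don't want
    t = t.replace("(", "").replace(")", "")
    # split on the dash separator (the sample uses both "–" and "-")
    t = re.split(r"\s+[–-]\s+", t)

    # reconstruct
    for i, person in enumerate(t):
        person = person.split()
        # the first token of the parent entry is the code
        if i == 0:
            res = [person.pop(0)]
        # the last token is the DOB, everything before it is the name
        res.extend([" ".join(person[:-1]), person[-1]])

    result.append(res)

print(result)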

This gives the following output:

[['P322', 'Rashmika Chadda', '15/05/1995', 'Rashmi C', '12/02/2024'], ['P324', 'Shiva Bhupati', '01/01/1994', 'Vinitha B', '04/08/2024'], ['P356', 'Karthikeyan chandrashekar', '22/02/1991', 'Kanishka P', '10/03/2014'], ['P366', 'Kalyani Manoj', '23/01/1975', 'Vandana M', '15/05/1995', 'Chandana M', '18/11/1998']]

You can use a dictionary to organize the data further:

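import re

# assumes text.txt from the question, saved as UTF-8
with open("text.txt", encoding="utf-8") as read_file:
    text = read_file.readlines()

result = {}
for t in text:
    # remove the \n at the end of each line
    t = t.strip()
    if not t:
        continue
    # remove the parentheses you don't want
    t = t.replace("(", "").replace(")", "")
    # split on the dash separator (the sample uses both "–" and "-")
    t = re.split(r"\s+[–-]\s+", t)

    for i, person in enumerate(t):
        # split into tokens: the last one is the DOB, the rest is the name
        person = person.split()
        if i == 0:
            # the first token of the parent entry is the code
            code = person.pop(0)
            result[code] = {"parent_name": " ".join(person[:-1]), "parent_DOB": person[-1], "children": []}
        else:
            result[code]["children"].append({f"child{i}_name": " ".join(person[:-1]), f"child{i}_DOB": person[-1]})

print(result)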

This gives the following output (shown pretty-printed):

{'P322': {'children': [{'child1_DOB': '12/02/2024',
    'child1_name': 'Rashmi C'}],
  'parent_DOB': '15/05/1995',
  'parent_name': 'Rashmika Chadda'},
 'P324': {'children': [{'child1_DOB': '04/08/2024',
    'child1_name': 'Vinitha B'}],
  'parent_DOB': '01/01/1994',
  'parent_name': 'Shiva Bhupati'},
 'P356': {'children': [{'child1_DOB': '10/03/2014',
    'child1_name': 'Kanishka P'}],
  'parent_DOB': '22/02/1991',
  'parent_name': 'Karthikeyan chandrashekar'},
 'P366': {'children': [{'child1_DOB': '15/05/1995',
    'child1_name': 'Vandana M'},
   {'child2_DOB': '18/11/1998', 'child2_name': 'Chandana M'}],
  'parent_DOB': '23/01/1975',
  'parent_name': 'Kalyani Manoj'}}

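Finally, to get an actual table you will need to use pandas, but that requires fixing the maximum number of children in advance so that the empty cells can be filled.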
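
As a rough sketch of that last step (not part of the original answer), the rows can be padded to the widest one before building the table. The helper names `parse_line` and `build_table`, the column names and the use of `None` for empty cells are assumptions made for illustration. The two sample lines are copied from the question. The block itself uses only the standard library.

```python
import re

# Sample lines copied from the question; the last one uses "-" instead of "–".
lines = [
    "(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024",
    "(P366) Kalyani Manoj 23/01/1975 - Vandana M 15/05/1995 - Chandana M 18/11/1998",
]


def parse_line(line):
    """Return [code, parent_name, parent_dob, child1_name, child1_dob, ...]."""
    parts = re.split(r"\s+[–-]\s+", line.strip())
    row = []
    for i, part in enumerate(parts):
        tokens = part.split()
        if i == 0:
            # The first token of the parent entry is the code, e.g. "(P322)".
            row.append(tokens.pop(0).strip("()"))
        # The last token is the date of birth; the rest is the name.
        row.extend([" ".join(tokens[:-1]), tokens[-1]])
    return row


def build_table(lines):
    """Pad every row to the widest one and give each column a unique name."""
    rows = [parse_line(line) for line in lines if line.strip()]
    width = max(len(row) for row in rows)
    max_children = (width - 3) // 2
    columns = ["Code", "Parent_Name", "Parent_DOB"]
    for n in range(1, max_children + 1):
        columns += [f"Child{n}_Name", f"Child{n}_DOB"]
    # None marks the empty cells for parents with fewer children.
    padded = [row + [None] * (width - len(row)) for row in rows]
    return columns, padded


columns, rows = build_table(lines)
print(columns)
for row in rows:
    print(row)
```

With pandas installed, `pd.DataFrame(rows, columns=columns)` should turn these into the table. The distinct `Parent_DOB` and `ChildN_DOB` names avoid the duplicate "DOB" columns pointed out in the comments on the question.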

