【Question title】: Parsing and searching a CSV file
【Posted】: 2022-01-28 22:12:56
【Question】:

I want to parse a contact-list CSV file that looks like this:

First Name  Last Name  Full Name                            Short Name  Phone Number
Jenny       Smith      CN=Jenny Smith/OU=CORP/O=COMPANY     jesmi       6468675309
Mary        Poppin     CN=Mary Poppins/OU=STORE/O=COMPANY   mapop       7005555578
Tony        Stark      CN=Tony Stark/OU=STORE/O=COMPANY     tostar      6007777798
Peter       Parker     CN=Peter Parker/OU=NEWS/O=COMPANY    pepar       5008889090

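I'd like to search the "Full Name" column for the string "OU=STORE", set aside every row that contains "OU=STORE", and move those rows into their own csv file called "store.csv". Then repeat the same process for "OU=CORP" and "OU=NEWS".


This is what I'd like the output to look like:

Once the process is complete, store.csv should contain only this information: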

First Name  Last Name  Full Name                            Short Name  Phone Number
Mary        Poppin     CN=Mary Poppins/OU=STORE/O=COMPANY   mapop       7005555578
Tony        Stark      CN=Tony Stark/OU=STORE/O=COMPANY     tostar      6007777798

corp.csv

First Name  Last Name  Full Name                            Short Name  Phone Number
Jenny       Smith      CN=Jenny Smith/OU=CORP/O=COMPANY     jesmi       6468675309

news.csv

First Name  Last Name  Full Name                            Short Name  Phone Number    
Peter       Parker     CN=Peter Parker/OU=NEWS/O=COMPANY    pepar       5008889090

Here is the small script I have so far, but I'm not sure what to do at the end:

import pandas as pd
import csv

#this is the source folder    
source_dir = 'C:/Users/username/documents/contacts/contactslist.csv'

#this is the folder where I want to move the parsed data.
store_target_dir = 'C:/Users/username/documents/contacts/store/'
corp_target_dir = 'C:/Users/username/documents/contacts/corp/'
news_target_dir = 'C:/Users/username/documents/contacts/news/'

col_list = ["Full Name"]

store = 'OU=STORE'
corp = 'OU=CORP'
news = 'OU=NEWS'

#When it comes time to move the data to their folders with their csv name
csvName = store_target_dir + "/" + "store.csv"
csvName2 = corp_target_dir + "/" + "corp.csv"
csvName3 = news_target_dir + "/" +"news.csv"

#opening the file
file = open(source_dir)

#reading the csv file
df = pd.read_csv(file)

【Discussion】:

    Tags: python pandas dataframe csv


    【Solution 1】:

    To filter your DataFrame, you can do the following:

    # key is the value you are looking for, e.g. 'OU=STORE'
    indices = [key in value for value in df['Full Name']]
    subset = df[indices]
    

    indices is a list of booleans indicating whether each row contains key.
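
The same filter can probably also be written with pandas' vectorized string methods. The following is a minimal sketch, not the answerer's code: the DataFrame is rebuilt inline from the sample rows in the question (only two columns, for brevity), and `mask` is a name chosen here for illustration.

```python
import pandas as pd

# Sample rows copied from the question (two columns only, for brevity).
df = pd.DataFrame({
    'Full Name': [
        'CN=Jenny Smith/OU=CORP/O=COMPANY',
        'CN=Mary Poppins/OU=STORE/O=COMPANY',
        'CN=Tony Stark/OU=STORE/O=COMPANY',
        'CN=Peter Parker/OU=NEWS/O=COMPANY',
    ],
    'Short Name': ['jesmi', 'mapop', 'tostar', 'pepar'],
})

key = 'OU=STORE'

# regex=False treats key as a literal substring;
# na=False counts missing names as non-matches instead of propagating NaN.
mask = df['Full Name'].str.contains(key, regex=False, na=False)
subset = df[mask]
```

Compared with the list comprehension, this version does not raise a TypeError when a cell in the column is empty (NaN), which may matter for real contact exports.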

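    【Discussion】:

    • Hi Tobiaaa, thanks again for your answer. I have a follow-up question that I'd rather not open a whole new post for. What if the key variable is a list of 4 or more strings that we need to iterate over? I tried key[i], but it doesn't work.
    • So you want to filter on several categories? Can you give me an example of what a list of keys might look like? If you want subset to contain every row whose "Full Name" contains at least one item from the key list, you can repeat the list comprehension above for each item in the key list and append to the indices list each time.
    • Hi Tobiaaa, I actually ended up posting a new question. It's here: stackoverflow.com/questions/71399887/…
    • The list looks like this: key = ['store', 'pharmacy', 'str1']. The earlier code only used key = 'store'. I was thinking the code would look something like [[i for i in key] in row for row in df['Names']] and [[i for i in key] not in row for row in df['Names']], to keep the same output as before.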
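
For the multi-key case raised in the comments, one possible approach is to test each row against every key with `any()`. This is only a sketch under assumptions: the `Names` column comes from the comment, but the row values below are invented for illustration, and `matching` / `remaining` are hypothetical names for the "in" and "not in" results.

```python
import pandas as pd

# Hypothetical data: the row values are made up; only the column name
# 'Names' and the key list come from the comment thread.
df = pd.DataFrame({
    'Names': ['main store', 'city pharmacy', 'head office', 'str1 depot'],
})

keys = ['store', 'pharmacy', 'str1']

# True for rows whose name contains at least one of the keys.
mask = df['Names'].apply(lambda name: any(key in name for key in keys))

matching = df[mask]    # rows containing any key
remaining = df[~mask]  # rows containing none of the keys
```

The expression `[i for i in key] in row` from the comment checks whether a whole list is a member of the string, which is why it does not work; `any(key in name ...)` checks the keys one at a time.
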
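    【Solution 2】:

    You can extract the OU= value and add it as another column. .unique() can then be used to determine the 3 possible values, and each CSV is created based on that value. For example: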

    import pandas as pd
    
    df = pd.read_csv('contactslist.csv', dtype={'Phone Number': str})
    df['file'] = df['Full Name'].str.extract(r'OU=(\S+)/')
    
    for key in df['file'].unique():
        df_filtered = df.loc[df['file'] == key]
        df_filtered = df_filtered.drop(['file'], axis=1)
        df_filtered.to_csv(f"{key}.csv", index=False)      
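
A variant of the same idea can be sketched with `groupby`, which avoids adding and dropping a helper column. Note that the loop above names each file after the raw OU value (for example STORE.csv); the sketch below lowercases it to match the file names asked for in the question. The inline DataFrame, the temporary output folder, and the names `unit`, `out_dir` and `written` are assumptions made so the example is safe to run.

```python
import os
import tempfile

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'First Name': ['Jenny', 'Mary', 'Tony', 'Peter'],
    'Last Name': ['Smith', 'Poppin', 'Stark', 'Parker'],
    'Full Name': [
        'CN=Jenny Smith/OU=CORP/O=COMPANY',
        'CN=Mary Poppins/OU=STORE/O=COMPANY',
        'CN=Tony Stark/OU=STORE/O=COMPANY',
        'CN=Peter Parker/OU=NEWS/O=COMPANY',
    ],
    'Short Name': ['jesmi', 'mapop', 'tostar', 'pepar'],
    'Phone Number': ['6468675309', '7005555578', '6007777798', '5008889090'],
})

# Extract the OU value as a Series (expand=False) and lowercase it for file names.
unit = (
    df['Full Name']
    .str.extract(r'OU=([^/]+)', expand=False)
    .str.lower()
    .rename('unit')
)

# Hypothetical output location: a temporary folder instead of the real contacts folder.
out_dir = tempfile.mkdtemp()

written = {}
# Grouping by the Series keeps df unchanged; rows without an OU value are dropped.
for key, group in df.groupby(unit):
    path = os.path.join(out_dir, f'{key}.csv')
    group.to_csv(path, index=False)
    written[key] = path
```

When reading the real file, `dtype={'Phone Number': str}` (as in the answer above) is still worth keeping so phone numbers are not parsed as integers.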
    

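    【Discussion】:

      【Solution 3】:

      Trying not to change your code too much, a solution might look like this (the parenthesized with statement requires Python 3.10 or later):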

      
      import os, os.path
      import csv
      
      #this is the source folder
      original_contacts_filename = r'C:\Users\username\documents\contacts\contactslist.csv'
      
      #this is the folder where I want to move the parsed data.
      store_target_dir = r'C:\Users\username\documents\contacts\store'
      corp_target_dir = r'C:\Users\username\documents\contacts\corp'
      news_target_dir = r'C:\Users\username\documents\contacts\news'
      
      os.makedirs(store_target_dir, exist_ok=True)
      os.makedirs(corp_target_dir, exist_ok=True)
      os.makedirs(news_target_dir, exist_ok=True)
      
      store = 'OU=STORE'
      corp = 'OU=CORP'
      news = 'OU=NEWS'
      
      #When it comes time to move the data to their folders with their csv name
      csv_name = os.path.join(store_target_dir, "store.csv")
      csv_name2 = os.path.join(corp_target_dir, "corp.csv")
      csv_name3 = os.path.join(news_target_dir, "news.csv")
      
      with (
          open(original_contacts_filename, newline='') as original_contacts_file,
          open(csv_name, mode='w', newline='') as csv_file, 
          open(csv_name2, mode='w', newline='') as csv_file2, 
          open(csv_name3, mode='w', newline='') as csv_file3):
      
          original_contacts = csv.DictReader(original_contacts_file)
          
          store_destination = csv.writer(csv_file)
          corp_destination = csv.writer(csv_file2)
          news_destination = csv.writer(csv_file3)
      
          output_headers = ('First Name', 'Last Name', 'Full Name', 'Short Name', 'Phone Number')
          store_destination.writerow(output_headers)
          corp_destination.writerow(output_headers)
          news_destination.writerow(output_headers)
      
          for current_contact in original_contacts:
              # Pick the writer whose OU marker appears in the contact's full name
              if store in current_contact['Full Name']:
                  output_destination = store_destination
              elif corp in current_contact['Full Name']:
                  output_destination = corp_destination
              elif news in current_contact['Full Name']:
                  output_destination = news_destination
              else:
                  output_destination = None
              
              if output_destination is not None:
                  output_destination.writerow(current_contact[column] for column in output_headers)
      

      But we can see a lot of repetition here, which is usually a code smell. We can simplify the code like this:

      import os, os.path
      import csv
      import re
      
      source_directory = r'C:\Users\username\documents\contacts'
      # Full path to the source file, so the script does not depend on the working directory
      original_contacts_filename = os.path.join(source_directory, 'contactslist.csv')
      
      corporate_units_expected = ('store', 'corp', 'news')
      
      target_directory = r'C:\Users\username\documents\contacts'
      
      # Create the target folders first: open(..., 'w') fails if the folder is missing
      for current_unit in corporate_units_expected:
          os.makedirs(os.path.join(target_directory, current_unit), exist_ok=True)
      
      # Map each unit to a (path, open file) pair
      target_files_info = {
          current_unit: (
              current_name := os.path.join(target_directory, current_unit, f'{current_unit}.csv'),
              open(current_name, 'w', newline='')
          )
          for current_unit in corporate_units_expected
      }
          
      matcher = re.compile(r'OU=([ 0-9a-zA-Z]+)')
      
      with (
          open(original_contacts_filename, newline='') as original_contacts_file,
          target_files_info['store'][1], target_files_info['corp'][1], target_files_info['news'][1]
      ):
          original_contacts = csv.DictReader(original_contacts_file)
      
          writers = { 
              current_unit: csv.writer(current_target[1])
              for (current_unit, current_target) in target_files_info.items()
          }
      
          output_headers = ('First Name', 'Last Name', 'Full Name', 'Short Name', 'Phone Number')
      
          for current_writer in writers.values():
              current_writer.writerow(output_headers)
          
          for current_contact in original_contacts:
              if match_found := matcher.search(current_contact['Full Name']):
                  # Skip units that have no target file instead of raising KeyError
                  current_writer = writers.get(match_found[1].lower())
                  if current_writer is not None:
                      current_writer.writerow(current_contact[column] for column in output_headers)
      

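      We could also handle the case where we don't know in advance how many files the entries will be sorted into, but it gets more complicated, because we can't use the with statement directly.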
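
One way this unknown-number-of-files case might be handled is with `contextlib.ExitStack` from the standard library, which lets files be opened on demand and still closes all of them when the block exits. This is a sketch, not the answerer's code: `split_by_unit`, `SAMPLE`, and the use of a temporary folder are all hypothetical, and the sample assumes the real file is comma-separated.

```python
import csv
import io
import os
import re
import tempfile
from contextlib import ExitStack

# Sample data from the question, written as comma-separated text
# (an assumption about the real file's delimiter).
SAMPLE = """First Name,Last Name,Full Name,Short Name,Phone Number
Jenny,Smith,CN=Jenny Smith/OU=CORP/O=COMPANY,jesmi,6468675309
Mary,Poppin,CN=Mary Poppins/OU=STORE/O=COMPANY,mapop,7005555578
Tony,Stark,CN=Tony Stark/OU=STORE/O=COMPANY,tostar,6007777798
Peter,Parker,CN=Peter Parker/OU=NEWS/O=COMPANY,pepar,5008889090
"""

matcher = re.compile(r'OU=([ 0-9a-zA-Z]+)')


def split_by_unit(source_file, target_directory):
    """Write one <unit>/<unit>.csv per OU value found; return {unit: path}."""
    reader = csv.DictReader(source_file)
    headers = reader.fieldnames
    writers = {}
    paths = {}
    # ExitStack closes every file registered with enter_context() on exit.
    with ExitStack() as stack:
        for contact in reader:
            match_found = matcher.search(contact['Full Name'])
            if not match_found:
                continue  # no OU= part, nothing to file this row under
            unit = match_found[1].lower()
            if unit not in writers:
                # First time this unit appears: create its folder and file.
                unit_dir = os.path.join(target_directory, unit)
                os.makedirs(unit_dir, exist_ok=True)
                paths[unit] = os.path.join(unit_dir, f'{unit}.csv')
                target = stack.enter_context(open(paths[unit], 'w', newline=''))
                writers[unit] = csv.writer(target)
                writers[unit].writerow(headers)
            writers[unit].writerow(contact[column] for column in headers)
    return paths


# Usage with a temporary folder, so the sketch does not touch real data.
result = split_by_unit(io.StringIO(SAMPLE), tempfile.mkdtemp())
```

To run it against the real file, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `newline=''`, and the temporary folder by the contacts folder.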

      【Discussion】:
