Answers: Python program that extracts most frequently found names from a .csv file

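【Question title】: Python program that extracts most frequently found names from .csv file
【Posted】: 2020-02-28 15:06:38
【Question】:

I created a program that generates 5000 random names, SSNs, cities, addresses, and emails and stores them in a fakeprofile.csv file. I am trying to extract the most frequently used names from the file. I got the program to run without syntax errors, but it does not extract the frequent names. Here is the code: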

import re
import statistics
file_open = open('fakeprofile.csv').read()
frequent_names = re.findall('[A-Z][a-z]*', file_open)
print(frequent_names)

Sample from the file:

Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225
Reynoldsmouth, VA 72465 stevenserin@stein.biz
Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314
Hubbardville, IN 06723 kaitlinthomas@bennett-carter.com
Stephanie Lewis 442-20-1279 Jacquelineshire 650 Gutierrez Forge Apt. 839
West Christianbury, TN 13654 ukelley@gmail.com
Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038
Garciaview, WI 58624 kshields@yahoo.com
Aaron Moreno 171-30-7715 Port Taraburgh 56672 Wagner Path
Lake Christopher, VA 37884 lucasscott@nguyen.info
Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731
South Randyburgh, NJ 79367 daniellewebb@yahoo.com
Brittney Mcmillan 334-44-0321 Lisahaven PSC 3856, Box 2428
APO AE 03215 kevin95@hotmail.com
Amanda Perkins 327-31-6610 Perryville 8750 Hurst Harbor Apt. 929

Sample output:

', 'Lake', 'Brianna', 'P', 'A', 'Michael', 'Smith', 'Harveymouth', 'Patricia', 'Tunnel', 'West', 'William', 'G', 'A', 'Charles', 'Perkins', 'Lake', 'Marie', 'Lisa', 'Overpass', 'Suite', 'Kennedymouth', 'C', 'A', 'Barbara', 'Perez', 'Billyshire', 'Joshua', 'Village', 'Cindymouth', 'W', 'I', 'Curtis', 'Simmons', 'North', 'Mitchellport', 'Gordon', 'Crest', 'Suite', 'Jacksonburgh', 'C', 'O', 'Cameron', 'Berg', 'South', 'Dean', 'Christina', 'Coves', 'Williamton', 'T', 'N', 'Maria', 'Williams', 'North', 'Judith', 'Carson', 'Overpass', 'Apt', 'West', 'Amandastad', 'N', 'M', 'Hannah', 'Dennis', 'Rodriguezmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Laura', 'Richardson', 'Lake', 'Kayla', 'Johnson', 'Place', 'Suite', 'Port', 'Jennifermouth', 'N', 'H', 'John', 'Lawson', 'Hintonhaven', 'Thomas', 'Via', 'Mossport', 'N', 'J', 'Jennifer', 'Hill', 'East', 'Phillip', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Cody', 'Jackson', 'Lake', 'Jessicamouth', 'Snyder', 'Ways', 'Apt', 'New', 'Stacey', 'M', 'E', 'Ryan', 'Friedman', 'Shahburgh', 'Jerry', 'Pike', 'Suite', 'Toddfort', 'N', 'V', 'Kathleen', 'Fox', 'Ferrellmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'P', 'Michael', 'Thompson', 'Port', 'Jessica', 'Boone', 'Spurs', 'Suite', 'Port', 'Ashleyland', 'C', 'O', 'Christopher', 'Marsh', 'North', 'Catherine', 'Scott', 'Trail', 'Apt', 'Baileyburgh', 'F', 'L', 'Richard', 'Rangel', 'New', 'Anna', 'Ray', 'Drive', 'Apt', 'Nunezland', 'I', 'A', 'Connor', 'Stanton', 'Troyshire', 'Rodgers', 'Hill', 'West', 'Annmouth', 'N', 'H', 'James', 'Medina',

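My problem is that I cannot extract the most frequently found names while also avoiding those stray capital letters. Instead, I extract every capitalized word (including the unwanted single capital letters); what you see above is only a small part of everything that was extracted. I noticed that the first names always appear on the odd lines, and I am trying to capture the most common first names from those odd lines.

The fakeprofile.csv file is created by this program: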

import csv
import faker
from faker import Faker
fake = Faker()
name = fake.name(); print(name)
ssn = fake.ssn(); print(ssn)
city = fake.city(); print(city)
address = fake.address(); print(address)
email = fake.email(); print(email)
profile = fake.simple_profile()
for i,j in profile.items():
    print('{}: {}'.format(i,j))
print('Name: {}, SSN: {}, City: {}, Address: {}, Email: {}'.format(name,ssn,city,address,email))
with open('fakeprofile.csv', 'w') as f:
    for i in range(0,5001):
        print(f'{fake.name()} {fake.ssn()} {fake.city()} {fake.address()} {fake.email()}', file=f)

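【Comments】:

Are you asking a question about regular expressions? Can you show a sample of your file? To sort a counter by number of occurrences, you can use [(k,v) for k,v in sorted(count.items(), key=lambda x: x[1], reverse=True)]
No. I've updated the post with sample data from the .csv file.
Does the data follow some pattern? The input file you shared doesn't look like valid CSV.
The data has the format (name,ssn,city,address,email); it is generated by another program and stored in the fakeprofile.csv file.
@Pkd Then that format doesn't seem to be respected; I can hardly see any commas. Could you be more specific about what the exact format is?

Tags: python-3.x csv extraction names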

【Solution 1】:

Does this do what you want?

import collections

# Read in all lines into a list
with open('fakeprofile.csv') as f:
    lines = f.readlines()
# Throw out every other line
lines = [line for i, line in enumerate(lines) if i%2 == 0]
# Keep only first word of each line
names = [line.split()[0] for line in lines]
# Find most common names
n = 3
frequent_names = collections.Counter(names).most_common(n)
# Display most common names
for name, count in frequent_names:
    print(name, count)

To do the counting, it uses collections.Counter and its most_common() method.
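
To illustrate how most_common() behaves, here is a minimal sketch that runs on a few lines copied from the sample in the question instead of on fakeprofile.csv. The names sample_lines, first_names and top_names are made up for this illustration.

```python
from collections import Counter

# A few lines copied from the sample in the question (two lines per record).
sample_lines = [
    "Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225",
    "Reynoldsmouth, VA 72465 stevenserin@stein.biz",
    "Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314",
    "Hubbardville, IN 06723 kaitlinthomas@bennett-carter.com",
    "Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731",
    "South Randyburgh, NJ 79367 daniellewebb@yahoo.com",
]

# Keep the 1st, 3rd, 5th ... lines, then take the first word of each.
first_names = [line.split()[0] for line in sample_lines[::2]]

# most_common(n) returns up to n (value, count) pairs, highest count first.
top_names = Counter(first_names).most_common(2)
print(top_names)  # [('Alicia', 2), ('Nicole', 1)]
```

Asking for more entries than there are distinct names simply returns all of them, so n does not need to be checked against the data first.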

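【Comments】:

The code shows names like Apt, Suite, West, and so on. What I'm looking for is the first word (the first name) of each odd line.
That's not at all what your question says??
Ah, I needed to specify which names I'm looking for. In the output above, you'll notice that every odd line (1, 3, 5, 7, etc.) has a first name, and I'm looking for the most common first names from those odd lines.

【Solution 2】: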

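I think it would be better to use the pandas library for the CSV handling (to collect the information you need) and then apply a Python collections tool such as Counter(df['name']) to it. Otherwise, could you give more information about the CSV file?

Thank you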
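
This suggestion presupposes a file with real columns, which the space-separated fakeprofile.csv is not. As a rough sketch of what a column-based version could look like, the following uses the standard-library csv module in place of pandas and an in-memory buffer in place of a file. The rows and the column names (name, ssn, city) are assumptions made up for this illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; a generator script could write fake.name() etc. the same way.
rows = [
    ("Alicia Walters", "419-52-4141", "Yorkstad"),
    ("Nicole Duffy", "212-38-9009", "West Timothy"),
    ("Alicia Zimmerman", "286-88-9507", "Barberstad"),
]

# Write real CSV: csv.writer adds the commas and quotes fields when needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "ssn", "city"])  # header row (assumed column names)
writer.writerows(rows)

# Read it back by column name and count the first word of each name.
buffer.seek(0)
reader = csv.DictReader(buffer)
first_name_counts = Counter(row["name"].split()[0] for row in reader)
print(first_name_counts.most_common(1))  # [('Alicia', 2)]
```

Once the fields are separated properly, a multi-line address stays inside one quoted field, so there is no need to reason about odd and even lines at all.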

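【Comments】:

【Solution 3】:

So the main problem you have is that the regular expression you use, [A-Z][a-z]*, matches every capital letter, even one standing alone, because * allows zero lowercase letters after it. What you are interested in is the first word of each odd line.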
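
The effect is easy to reproduce on a single record from the sample. The sketch below first runs the pattern from the question, then a variant anchored to the start of each line. The anchored pattern is only an assumption about what a regex-based fix might look like, and it shows that a regex alone is not enough, since the second line of a record also starts with a capitalized word.

```python
import re

# One two-line record copied from the sample in the question.
record = (
    "Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225\n"
    "Reynoldsmouth, VA 72465 stevenserin@stein.biz\n"
)

# Pattern from the question: '*' allows zero lowercase letters,
# so the 'V' and 'A' of the state code match on their own.
all_matches = re.findall(r"[A-Z][a-z]*", record)
print(all_matches)

# Assumed variant: with re.MULTILINE, '^' anchors at the start of every line,
# and '+' requires at least one lowercase letter.
line_starts = re.findall(r"^[A-Z][a-z]+", record, flags=re.MULTILINE)
print(line_starts)  # ['Alicia', 'Reynoldsmouth']
```

The city on the second line still gets through, which is why the lines have to be selected by position as well.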

You can do something along these lines:

# Count with a plain dict (collections.Counter would work as well).
dico_count = {}
with open('fakeprofile.csv') as file_open:  # context manager closes the file
    # enumerate() keeps line_number in step with the file, starting at 1
    for line_number, line in enumerate(file_open, start=1):
        if line_number % 2 != 0:  # odd line: starts with the name
            spt = line.split()
            if spt:  # skip blank lines
                dico_count[spt[0]] = dico_count.get(spt[0], 0) + 1

# (name, count) pairs sorted by count, most frequent first
frequent_name_counter = sorted(dico_count.items(), key=lambda x: x[1], reverse=True)
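
The odd/even test relies on every record taking up exactly two lines. A possible alternative, sketched here under the assumption that every record starts with a name followed by an SSN of the form ddd-dd-dddd (as in the sample), is to recognise the first line of a record by that SSN instead. The function name count_first_names and the pattern SSN_LINE are made up for this illustration, and the input lines are adapted from the sample rather than copied from it. A name that begins with a title such as "Dr." would be counted under the title.

```python
import re
from collections import Counter

# Assumed layout of a record's first line: "<name words> <SSN> <rest>".
SSN_LINE = re.compile(r"^(\S+)\s.*?\b\d{3}-\d{2}-\d{4}\b")

def count_first_names(lines):
    """Count the first word of every line that contains an SSN."""
    counts = Counter()
    for line in lines:
        match = SSN_LINE.match(line)
        if match:  # lines without an SSN (address remainder, blanks) are skipped
            counts[match.group(1)] += 1
    return counts

# Hypothetical input adapted from the sample, with a blank line thrown in.
sample = [
    "Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038",
    "Garciaview, WI 58624 kshields@yahoo.com",
    "",
    "Michael Thompson 171-30-7715 Port Taraburgh 56672 Wagner Path",
    "Lake Christopher, VA 37884 lucasscott@nguyen.info",
]
print(count_first_names(sample).most_common(1))  # [('Michael', 2)]
```

With the real file, the same function could be called as count_first_names(open('fakeprofile.csv')), since a file object yields its lines one by one.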

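【Comments】:

This doesn't work for me. The code I posted above is used together with the code that creates the fakeprofile.csv file, which I have now added to the post.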