【Question title】: Extracting name from line
【Posted】: 2013-06-27 20:13:56
【Question】:

I have data in the following format:

Bxxxx, Mxxxx F  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Axxxx Brown  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Bxxxx    Mobile   AL (123) 555-8011  NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639    -99.053238
Axxxx, Rxxxx Lunsford   Athens   AL (123) 555-8119  NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision   English 99.804501   -99.971283
Axxxx, Mxxxx    Mobile   AL (123) 555-5963  NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision   English 99.68639    -99.053238
Axxxx, Txxxx    Mountain Brook   AL (123) 555-3099  NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery    English 99.50214    -99.75557
Axxxx, Lxxxx    Birmingham   AL (123) 555-4550  NCC Addictions and Dependency, Eating Disorders English 99.52029    -99.8115
Axxxx, Wxxxx    Birmingham   AL (123) 555-2328  NCC     English 99.52029    -99.8115
Axxxx, Rxxxx    Mobile   AL (123) 555-9411  NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639    -99.053238

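and I only need to extract the person's name. Ideally, I could use humanName to get a bunch of name objects with the fields name.firstname.middlename.lastname.title...

I have tried iterating until I hit the first two consecutive capital letters, which represent the state, then storing everything before that in a list and calling humanName on it, but that was a disaster. I don't want to keep going down that path.

Is there a way to detect where words start and end? That might help...

Suggestions?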

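【Comments】:

  • Show us some of the problematic lines.
  • Is the file tab-delimited?
  • Use str.split() to separate the words. It won't be simple, because your data has no clear delimiter between fields. The problem is devising a rule that distinguishes the name from the city and state that follow it. For example, search the words for the first upper-case two-letter state code. The word before it is then the city, and the words before that should be the person's name. But this fails if the city is two words, like New York.
  • How many lines do you have?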
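
The split-on-the-state idea from the comments can be sketched with a regular expression. This is a minimal sketch, assuming every line contains a two-letter state code immediately followed by a phone number shaped like `(123) 555-2281`, as in the sample data. The pattern and the helper name `split_prefix` are illustrative and not from the thread. It only isolates the name-plus-city prefix; telling the name from the city still needs a list of known cities.

```python
import re

# Assumed layout: "<name and city> <ST> (ddd) ddd-dddd <rest of line>"
STATE_PHONE = re.compile(r"\s+([A-Z]{2})\s+(\(\d{3}\)\s*\d{3}-\d{4})")

def split_prefix(line):
    """Return (name_and_city, state, phone), or None if the anchor is missing."""
    match = STATE_PHONE.search(line)
    if match is None:
        return None
    # Collapse runs of spaces or tabs in everything left of the state code
    prefix = " ".join(line[:match.start()].split())
    return prefix, match.group(1), match.group(2)

sample = "Axxxx, Txxxx    Mountain Brook   AL (123) 555-3099  NCC Addictions and Dependency"
print(split_prefix(sample))
# ('Axxxx, Txxxx Mountain Brook', 'AL', '(123) 555-3099')
```

Anchoring on the state and phone together avoids stopping early on other capitalized pairs, since a bare two-letter check would also match the start of any upper-case word.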

Tags: python text-parsing


【Solution 1】:

Your best bet is to find a different data source. Seriously. This one is a mess.

If you can't do that, then I would do something like this:

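  1. Replace all double spaces with single spaces.
  2. Split the line on spaces.
  3. Take the last 2 items in the list. These are the latitude and longitude.
  4. Loop backwards through the list, looking each item up in a list of potential languages. When a lookup fails, you are done with languages.
  5. Join the remaining list items back together with spaces.
  6. In that line, find the first opening parenthesis. Read about 13 or 14 characters, replace all punctuation with the empty string, and reformat it as a normal phone number.
  7. Split the rest of the line after the phone number on commas.
  8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to the certifications. Otherwise, add it to the areas of practice.
  9. Go back to the index you found in step 6 and take everything up to that point. Split it on spaces and take the last item. That is the state. What remains is the name and the city!
  10. Take the first 2 items of the space-split line. For now, this is your best guess at the name.
  11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
  12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
  13. In the US data file, split every line on tabs. Take the data at indices 2 and 4, which are the city name and the state abbreviation. Loop through all the data and insert each row into a new list, joined in the form abbreviation + ":" + city name (e.g. AK:Sand Point).
  14. Build every possible concatenation of the items remaining in the line, in the same format as step 13. So for the second line you end up with AL:Brown Birmingham and AL:Birmingham.
  15. Go through each combination and search for it in the list you created in step 13. If it is found, remove it from the split list.
  16. Add all remaining items in the split list to the person's name.
  17. If needed, split the name on the comma. index[0] is the last name and index[1] is all the remaining names. Don't make any assumptions about middle names.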
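
Step 6 can be sketched on its own. This is a minimal sketch, assuming the phone number always looks like `(123) 555-2281` (14 characters) and that "normal" means `ddd-ddd-dddd`. Both the output format and the helper name `normalize_phone` are assumptions made for illustration.

```python
import re

def normalize_phone(line):
    """Find the first '(' and reformat the number that follows as ddd-ddd-dddd."""
    start = line.find("(")
    if start == -1:
        return None
    raw = line[start:start + 14]        # "(123) 555-2281" is 14 characters long
    digits = re.sub(r"\D", "", raw)     # drop parentheses, spaces and the dash
    if len(digits) != 10:
        return None                     # not a complete 10-digit number
    return "{}-{}-{}".format(digits[:3], digits[3:6], digits[6:])

print(normalize_phone("Axxxx, Wxxxx    Birmingham   AL (123) 555-2328  NCC     English"))
# 123-555-2328
```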

Just for giggles, I implemented this. Enjoy.

import itertools

# this list of languages could be longer and should read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]

languages = [language.lower() for language in languages]

# Loop through US.txt and format it. Download from geonames.org.
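cities = set()  # a set keeps the "state:city" lookups further down fast
with open('US.txt', 'r', encoding='utf-8') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        # index 4 is the state abbreviation, index 2 is the place name
        cities.add("{}:{}".format(line_split[4], line_split[2]))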

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header

    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")

        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []

        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use phone number as the special key to split on
        line = " ".join(line_split)

        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()

        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []

        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if not specialty:
                continue  # nothing was listed after the phone number
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]

        line_split = line_split[2:]

        # Add initials
        if len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

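        # Every contiguous run of the remaining words (e.g. "Brown Birmingham", "Brown",
        # "Birmingham"), longest first, to see if we're dealing with a city or a name.
        # Full-length permutations alone would miss a two-word city after a middle name.
        combos = [" ".join(line_split[i:j])
                  for i, j in itertools.combinations(range(len(line_split) + 1), 2)]
        combos.sort(key=len, reverse=True)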

        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names
        last_name, first_name = [piece.strip() for piece in name.split(",")]

        print(first_name, last_name)
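
The city-matching part of the script can be tried in isolation, without either data file. This is a minimal sketch: the function name `split_name_and_city` and the three-entry city set are hypothetical stand-ins, and the real lookup would use the set built from US.txt. It tries every contiguous run of the words after the first two, longest first, and treats whatever is not part of the matched city as name.

```python
import itertools

def split_name_and_city(words, state, known_cities):
    """Split the words left of the state into (name_words, city).

    known_cities holds "ST:City" strings, in the same form as the US.txt list.
    """
    # The first two words ("Last," and "First") are always treated as name
    name_words, rest = words[:2], words[2:]
    # Every contiguous run of the remaining words, longest run first
    spans = sorted(itertools.combinations(range(len(rest) + 1), 2),
                   key=lambda span: span[1] - span[0], reverse=True)
    for start, end in spans:
        candidate = " ".join(rest[start:end])
        if "{}:{}".format(state, candidate) in known_cities:
            # Words outside the matched city belong to the name
            return name_words + rest[:start] + rest[end:], candidate
    return name_words + rest, ""

cities = {"AL:Birmingham", "AL:Mountain Brook", "AL:Athens"}  # tiny stand-in for US.txt
print(split_name_and_city("Axxxx, Axxxx Brown Birmingham".split(), "AL", cities))
# (['Axxxx,', 'Axxxx', 'Brown'], 'Birmingham')
print(split_name_and_city("Axxxx, Txxxx Mountain Brook".split(), "AL", cities))
# (['Axxxx,', 'Txxxx'], 'Mountain Brook')
```

A middle name that happens to be a city in the same state would still be misread, so the result remains a best guess.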

【Comments】:

  • Glad you like it. I love file parsing, for some ungodly reason.
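【Solution 2】:

Not a code answer, but it looks like you could get most or all of the data from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. The names are easy to find there.

【Comments】:

  • Appreciate it, but the data is nationwide. Alabama is just the tip of the counselor iceberg...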