【Posted】:2013-06-27 20:13:56
【Question】:
I have data in the following format:
Bxxxx, Mxxxx F Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Axxxx Brown Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Bxxxx Mobile AL (123) 555-8011 NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639 -99.053238
Axxxx, Rxxxx Lunsford Athens AL (123) 555-8119 NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision English 99.804501 -99.971283
Axxxx, Mxxxx Mobile AL (123) 555-5963 NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision English 99.68639 -99.053238
Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery English 99.50214 -99.75557
Axxxx, Lxxxx Birmingham AL (123) 555-4550 NCC Addictions and Dependency, Eating Disorders English 99.52029 -99.8115
Axxxx, Wxxxx Birmingham AL (123) 555-2328 NCC English 99.52029 -99.8115
Axxxx, Rxxxx Mobile AL (123) 555-9411 NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639 -99.053238
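and I only need to extract the people's names. Ideally, I could use humanName to get a bunch of name objects with the fields name.first, name.middle, name.last, name.title...
I tried iterating until I hit the first two consecutive capital letters, which represent the state, storing everything before that point in a list, and then calling humanName, but that was a disaster. I don't want to keep pursuing that approach.
Is there a way to detect where a word starts and ends? That might help...
Suggestions?
【Comments】: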
-
Show us some of the problematic lines.
-
Is the file tab-delimited?
-
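Use str.split() to separate the words. This won't be simple, because your data has no clear delimiter between fields. The problem is to devise a rule that distinguishes the name from the city and state that follow it. E.g., search the words for the first uppercase two-letter state code? Then the preceding word is the city, and the words before that should be the person's name. But this fails if the city is two words, like New York.
-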
How many lines do you have?
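
A minimal sketch of the rule discussed in the comments, in case it helps. It assumes every line contains a two-letter state code immediately followed by a phone number shaped like `(123) 555-2281`, as in the sample data, and it works around the two-word-city problem with a lookup set. `KNOWN_CITIES` and `extract_name` are hypothetical names made up for this illustration, and the set would have to be filled in from your real data.

```python
import re

# Hypothetical lookup of city names (an assumption, not part of the original data).
# Multi-word cities such as "Mountain Brook" must be listed explicitly.
KNOWN_CITIES = {"Birmingham", "Mobile", "Athens", "Mountain Brook"}

# A two-letter uppercase state code followed by a phone number like "(123) 555-2281".
# Anchoring on the phone number avoids matching other uppercase pairs in the line.
STATE_PHONE = re.compile(r"\s([A-Z]{2})\s\(\d{3}\)\s\d{3}-\d{4}")


def extract_name(line, cities=KNOWN_CITIES):
    """Return the raw name text that precedes the city, or None if no rule applies."""
    match = STATE_PHONE.search(line)
    if match is None:
        return None
    head = line[:match.start()]  # e.g. "Axxxx, Txxxx Mountain Brook"
    # Try longer city names first so "Mountain Brook" is preferred over a shorter suffix.
    for city in sorted(cities, key=len, reverse=True):
        if head.endswith(" " + city):
            return head[: -len(city)].strip()
    return None  # city not in the lookup set; needs manual handling


if __name__ == "__main__":
    sample = "Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency English 99.50214 -99.75557"
    print(extract_name(sample))
```

The returned string still has the form "Last, First Middle" and could then be handed to whatever name parser you are using. Lines whose city is missing from the set come back as `None`, so they can be collected and reviewed by hand rather than parsed wrongly.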
Tags: python text-parsing