【Question title】: Cleaning Mixed Geographic Data (R)
【Posted】: 2020-09-11 10:06:44
【Question】:

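I have a very ugly column in a dataset that mixes states and cities (both domestic and international). The rest of the data is numeric and has nothing to do with geography. Is there a way to use text analysis to work out what each value is? The end goal is to split the column into separate state and city columns and add a third column showing the country.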

   c("Arizona", "(not set)", "Arizona", "(not set)", "California", 
"California", "New York", "Texas", "New York", "Texas", "England", 
"Illinois", "Florida", "Maharashtra", "Massachusetts", "Virginia", 
"Maryland", "Florida", "Karnataka", "Pennsylvania", "Arizona", 
"New Jersey", "Illinois", "District of Columbia", "Delhi", "Ohio", 
"Ontario", "Georgia", "Colorado", "Washington", "Michigan", "Virginia", 
"North Carolina", "England", "Maryland", "Pennsylvania", "Colorado", 
"Utah", "Arizona", "New Jersey", "District of Columbia", "Tamil Nadu", 
"North Carolina", "Arizona", "Massachusetts", "Tokyo", "Andhra Pradesh", 
"Minnesota", "Washington", "Tainan City", "Michigan", "Arizona", 
"Maharashtra", "Federal District", "Ile-de-France", "Utah", "Georgia", 
"Metro Manila", "Ontario", "Connecticut")

【Comments】:

  • Posting an image of the data is not a good idea. Can you post some of the data itself, especially the bad data?
  • @ShanR Apologies, corrected.

Tags: r nlp geo


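【Solution 1】:

Depending on how exhaustive you want the search to be, you can download one or more of the files under https://download.geonames.org/export/dump/ and search one or more of their columns. For the test data you provided, I was able to do this: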

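# Download the GeoNames cities500 dump to a temporary file (binary mode for the zip)
temp <- tempfile(fileext = ".zip")
download.file("https://download.geonames.org/export/dump/cities500.zip", temp, mode = "wb")
# The file is tab-separated with no header row; quote = "" keeps names containing quotes intact
cities500 <- read.delim(unz(temp, "cities500.txt"), header = FALSE, quote = "",
                        stringsAsFactors = FALSE, encoding = "UTF-8")
unlink(temp)  # remove the downloaded archive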

c("Arizona", "(not set)", "Arizona", "(not set)", "California", 
  "California", "New York", "Texas", "New York", "Texas", "England", 
  "Illinois", "Florida", "Maharashtra", "Massachusetts", "Virginia", 
  "Maryland", "Florida", "Karnataka", "Pennsylvania", "Arizona", 
  "New Jersey", "Illinois", "District of Columbia", "Delhi", "Ohio", 
  "Ontario", "Georgia", "Colorado", "Washington", "Michigan", "Virginia", 
  "North Carolina", "England", "Maryland", "Pennsylvania", "Colorado", 
  "Utah", "Arizona", "New Jersey", "District of Columbia", "Tamil Nadu", 
  "North Carolina", "Arizona", "Massachusetts", "Tokyo", "Andhra Pradesh", 
  "Minnesota", "Washington", "Tainan City", "Michigan", "Arizona", 
  "Maharashtra", "Federal District", "Ile-de-France", "Utah", "Georgia", 
  "Metro Manila", "Ontario", "Connecticut") %in% cities500$V2

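The `%in%` call returns one TRUE/FALSE per row, which is hard to read for a long column. A minimal sketch of a friendlier check, assuming you only care about the distinct values: `known_names` here is a small hypothetical stand-in for `cities500$V2`, and treating "(not set)" as missing is my assumption about what that placeholder means.

```r
# Hypothetical stand-in for cities500$V2 (the real column holds far more names)
known_names <- c("Tokyo", "Delhi", "New York", "Washington")

places <- c("Arizona", "(not set)", "New York", "Tokyo", "Delhi", "Tokyo")

# Assumption: "(not set)" is a placeholder and should be treated as missing
places[places == "(not set)"] <- NA

# Check each distinct value once instead of every row
distinct_places <- unique(places[!is.na(places)])
matched   <- distinct_places[distinct_places %in% known_names]
unmatched <- setdiff(distinct_places, known_names)

print(matched)
print(unmatched)
```

The `unmatched` values are the ones to try against a different dump file or column.
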
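Note that I did not test your input exhaustively, only enough to show that this is possible. The site has multiple dump files, each with multiple columns, so you will need to experiment to find the right combination.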
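To get from a membership test to the three columns asked for in the question, one option is to match against a table of regions first and a table of cities second. The sketch below is only an illustration: the `admin1` and `cities` data frames are tiny hypothetical samples, and `classify_places` is a helper name I made up. With real data, the region table could plausibly come from the `admin1CodesASCII.txt` dump and the city table from `cities500.txt`, where the country code sits in its own column.

```r
# Hypothetical sample lookup tables (not real GeoNames extracts)
admin1 <- data.frame(
  name    = c("Arizona", "England", "Maharashtra", "Ontario"),
  country = c("US", "GB", "IN", "CA"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  name    = c("Tokyo", "Delhi"),
  country = c("JP", "IN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each value as a state/region or a city
classify_places <- function(places, admin1, cities) {
  state_idx <- match(places, admin1$name)  # NA when not a known region
  city_idx  <- match(places, cities$name)  # NA when not a known city
  is_state  <- !is.na(state_idx)
  is_city   <- !is_state & !is.na(city_idx)  # region match takes priority
  data.frame(
    input   = places,
    state   = ifelse(is_state, places, NA_character_),
    city    = ifelse(is_city, places, NA_character_),
    country = ifelse(is_state, admin1$country[state_idx],
                     ifelse(is_city, cities$country[city_idx], NA_character_)),
    stringsAsFactors = FALSE
  )
}

places <- c("Arizona", "(not set)", "England", "Tokyo", "Delhi", "Ontario")
result <- classify_places(places, admin1, cities)
print(result)
```

Exact matching with `match()` only returns the first hit, so ambiguous names such as "Georgia" or "Washington" would still need a manual rule or extra context to resolve.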
