Splitting the location column into Zipcode, Latitude and Longitude: answers

【Question title】: Splitting the location column into Zipcode, Latitude and Longitude
【Posted】: 2017-05-23 22:22:55
【Question description】:

My data frame has a column that contains the zip code, latitude and longitude:

Location

"10007 (40.71363051943297, -74.00913138370635)"
"10002 (40.71612146793143, -73.98583147024613)"
"10012 (40.72553802086304, -73.99789641059084)"
"10009 (40.72664935898081, -73.97911148500697)"

I need to split it into three separate columns: Zip-Code, Latitude and Longitude.

I tried this:

extract(Location, c("Zip-Code","Latitude", "Longitude"), "\\(([^,]+), ([^)]+)\\)")

I want to use the latitude and longitude to plot a map with ggmap.

Thanks

【Comments】:

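Try library(tidyr); separate(df1, Location, into = c("Zip_Code", "Latitude", "Longitude"), sep=",*\\s+\\(*|\\)", merge="extra")
The parts are on different lines, not on the same line. I tried your solution and it says "invalid column specification".
I mean, are 10007 and "(40.71363051943297, -74.00913138370635)" on separate lines or on the same line? My solution assumes each value looks like "10007 (40.71363051943297, -74.00913138370635)".
unlist(strsplit(xxx, "(,|\\(|\\))")) should help. Then extract the first, third and fourth elements from the result.
Shouldn't the argument to separate be extra="merge" (or extra="drop") rather than merge="extra"?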
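
Following up on the last comment, here is a minimal sketch of both tidyr approaches. It assumes each value sits on one line in the form "zip (lat, lon)"; the data frame name `df1` and the output column names are illustrative. `extract()` needs one capture group per output column, whereas the attempt in the question has two groups for three names. For `separate()`, `extra = "drop"` seems the better fit with this `sep`, because the closing parenthesis produces a trailing empty piece that should be discarded.

```r
library(tidyr)

# Sample data mirroring the question; `df1` is an illustrative name.
df1 <- data.frame(
  Location = c("10007 (40.71363051943297, -74.00913138370635)",
               "10002 (40.71612146793143, -73.98583147024613)"),
  stringsAsFactors = FALSE
)

# extract(): three capture groups, one per new column.
by_extract <- extract(
  df1, Location,
  into  = c("Zip_Code", "Latitude", "Longitude"),
  regex = "([0-9]+)\\s+\\(([^,]+),\\s*([^)]+)\\)"
)

# separate(): split on " (", ", " and ")"; extra = "drop" discards
# the empty piece left after the closing parenthesis.
by_separate <- separate(
  df1, Location,
  into  = c("Zip_Code", "Latitude", "Longitude"),
  sep   = ",*\\s+\\(*|\\)",
  extra = "drop"
)
```

Both results hold character columns; adding `convert = TRUE` to either call asks tidyr to convert them to suitable types.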

Tags: r dataframe r-markdown ggmap

【Solution 1】:

Basic regex extraction:

library(purrr)

c("10007 (40.71363051943297, -74.00913138370635)", "10002 (40.71612146793143, -73.98583147024613)",
  "10012 (40.72553802086304, -73.99789641059084)", "10009 (40.72664935898081, -73.97911148500697)") %>%
  stringi::stri_match_all_regex("([[:digit:]]+)[[:space:]]+\\(([[:digit:]\\.\\-]+),[[:space:]]+([[:digit:]\\.\\-]+)\\)") %>%
  map_df(dplyr::as_data_frame) %>%
  dplyr::select(zip=V2, latitude=V3, longitude=V4)
## # A tibble: 4 × 3
##     zip          latitude          longitude
##   <chr>             <chr>              <chr>
## 1 10007 40.71363051943297 -74.00913138370635
## 2 10002 40.71612146793143 -73.98583147024613
## 3 10012 40.72553802086304 -73.99789641059084
## 4 10009 40.72664935898081 -73.97911148500697

A more readable version:

library(purrr)
library(stringi)
library(dplyr)

dat <- c("10007 (40.71363051943297, -74.00913138370635)",
         "10002 (40.71612146793143, -73.98583147024613)",
         "10012 (40.72553802086304, -73.99789641059084)", 
         "10009 (40.72664935898081, -73.97911148500697)")

zip <- "([[:digit:]]+)"
num <- "([[:digit:]\\.\\-]+)"
space <- "[[:space:]]+"
lp <- "\\("
rp <- "\\)"
comma <- ","

match_str <- zip %s+% space %s+% lp %s+% num %s+% comma %s+% space %s+% num %s+% rp

dat %>%
  stri_match_all_regex(match_str) %>%
  map_df(as_data_frame) %>%
  select(zip=V2, latitude=V3, longitude=V4)
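
Both versions above return all three columns as character strings (note the <chr> headers in the printed tibble), and `as_data_frame()` has since been deprecated in favour of `as_tibble()`. Plotting coordinates usually requires numeric values, so here is a minimal base R sketch of the same extraction with a numeric conversion step. The names `num`, `pattern`, `parts` and `coords` are illustrative, and it assumes every element matches the pattern.

```r
dat <- c("10007 (40.71363051943297, -74.00913138370635)",
         "10002 (40.71612146793143, -73.98583147024613)",
         "10012 (40.72553802086304, -73.99789641059084)",
         "10009 (40.72664935898081, -73.97911148500697)")

# A signed decimal number, captured as one group.
num <- "(-?[0-9]+\\.?[0-9]*)"
pattern <- paste0("^([0-9]+)[[:space:]]+\\(", num, ",[[:space:]]*", num, "\\)$")

# regexec() records the full match plus each capture group;
# regmatches() turns that into one character vector of length 4 per input.
m <- regmatches(dat, regexec(pattern, dat))

# Stack into a matrix: column 1 is the full match, columns 2-4 the groups.
# Non-matching inputs would yield empty vectors, so this assumes all match.
parts <- do.call(rbind, m)

coords <- data.frame(
  zip       = parts[, 2],              # kept as character to preserve leading zeros
  latitude  = as.numeric(parts[, 3]),
  longitude = as.numeric(parts[, 4]),
  stringsAsFactors = FALSE
)
```

The resulting `coords` data frame has numeric `latitude` and `longitude` columns that can be mapped to the y and x aesthetics of a plot.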

【Comments】:

【Solution 2】:

s.tmp = "10007 (40.71363051943297, -74.00913138370635)"

For the zip code:

gsub('([0-9]+) .*', '\\1', s.tmp)

Latitude:

gsub('.*\\((.*),.*', '\\1', s.tmp)

Longitude:

gsub('.*, (.*)\\).*', '\\1', s.tmp)

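【Comments】:

Thanks, the code almost works here. I was able to separate the latitude and longitude, but ZIP still contains the whole row: 10007 (40.71363051943297, -74.00913138370635) 40.71363051943297 -74.00913138370635. The first value is the ZIP produced by the code.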
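
One possible explanation for the behaviour described in this comment, offered as an assumption rather than a confirmed diagnosis: the zip pattern '([0-9]+) .*' requires a literal space directly after the digits. If the zip code and the coordinates are separated by a newline or a tab instead, nothing matches and gsub returns the input unchanged. The sketch below uses `[[:space:]]+` so that any whitespace is accepted, and applies a single anchored pattern to a whole vector; `locations` and `pattern` are illustrative names.

```r
# Two inputs: one with a plain space, one with a newline after the zip code.
locations <- c("10007 (40.71363051943297, -74.00913138370635)",
               "10002\n(40.71612146793143, -73.98583147024613)")

# One anchored pattern with three capture groups: zip, latitude, longitude.
pattern <- paste0("^[[:space:]]*([0-9]+)[[:space:]]+",
                  "\\(([^,]+),[[:space:]]*([^)]+)\\)[[:space:]]*$")

# sub() is vectorised, so each call handles every element at once.
zip       <- sub(pattern, "\\1", locations)
latitude  <- as.numeric(sub(pattern, "\\2", locations))
longitude <- as.numeric(sub(pattern, "\\3", locations))
```

Because the pattern is anchored at both ends, an element that does not fit the expected shape is returned unchanged by `sub()`, which makes malformed rows easier to spot than with a partial match.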