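【Posted】: 2019-06-03 19:15:26
【Question】:
I have a data frame containing a column of strings that I want to label with the following categories: City, Country, and Continent. I use gsub to replace every city with "City", every country with "Country", and every continent with "Continent".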
#This is what I have
dataframe
Color Letter Words
red   A      Paris,Asia,parrot,Antarctica,North America,cat,lizard
blue  A      Panama,New York,Africa,dog,Tokyo,Washington DC,fish
red   B      Copenhagen,bird,USA,Japan,Chicago,Mexico,insect
blue  B      Israel,Antarctica,horse,South America,North America,turtle,Brazil
#This is what I want
dataframe
Color Letter New
red   A      City,Continent
blue  A      Country,City,Continent
red   B      City,Country
blue  B      Country,Continent
#This is the code I have so far
dataframe$New <- NA
#groups all the cities
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City", x)})
#groups all the countries
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Panama|USA|Japan|Mexico|Israel|Brazil", "Country", x)})
#groups all the continents
dataframe$New <- lapply(dataframe$Words, function(x) {
  gsub("Asia|Antarctica|Africa|North America|South America", "Continent", x)})
dataframe$Words <- NULL
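How can I keep dataframe$New from being overwritten each time, and how can I remove the extra words (such as fish, horse, cat)?
The data above is an example based on a very large dataset. In the dataset, the Words column contains many duplicates. See some sample rows from dataframe$Words below: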
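To make the two problems concrete: each of the three calls reads from dataframe$Words and assigns to dataframe$New, so only the last result survives, and gsub leaves unmatched words in place. One possible way around both, sketched here under assumptions, is to split each string on commas, look every word up in a named vector, drop the words that are not found, and keep the unique labels in order of first appearance. The names `categories` and `label_words` are hypothetical helpers introduced for illustration, and the lookup only covers the words listed in the gsub patterns above.

```r
# Hypothetical lookup table: names are known words, values are their category.
categories <- c(
  setNames(rep("City", 6),
           c("Paris", "New York", "Tokyo", "Washington DC", "Copenhagen", "Chicago")),
  setNames(rep("Country", 6),
           c("Panama", "USA", "Japan", "Mexico", "Israel", "Brazil")),
  setNames(rep("Continent", 5),
           c("Asia", "Antarctica", "Africa", "North America", "South America"))
)

# Illustrative helper (not from the original post): turn one comma-separated
# string into its unique category labels, dropping words with no category.
label_words <- function(x) {
  tokens <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  labels <- unname(categories[tokens])      # NA for words not in the lookup
  labels <- unique(labels[!is.na(labels)])  # keep order of first appearance
  paste(labels, collapse = ",")
}

# Sample data rebuilt from the question so the sketch runs on its own.
dataframe <- data.frame(
  Color  = c("red", "blue", "red", "blue"),
  Letter = c("A", "A", "B", "B"),
  Words  = c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
             "Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
             "Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
             "Israel,Antarctica,horse,South America,North America,turtle,Brazil"),
  stringsAsFactors = FALSE
)

# A single assignment to New, so nothing is overwritten along the way.
dataframe$New <- vapply(dataframe$Words, label_words, character(1),
                        USE.NAMES = FALSE)
dataframe$Words <- NULL
```

On the sample rows this should give the New column shown under "This is what I want". A lookup on whole words also avoids partial matches that a regex alternation could produce inside longer words.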
Words
Panama,Paris
Panama,Israel,cat
Panama,Paris,horse,
Panama,Asia
Panama
Panama,Chicago
Israel,Chicago
Israel,lizard,Paris
Israel,Panama,horse,Africa
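Since the Words column repeats a lot, it may be worth labelling each distinct string only once and mapping the result back to every row. The following is only a sketch of that idea: `lookup` and `label_words` are hypothetical names, the lookup covers just the words appearing in the sample rows above, and the last element of `words` repeats the first one purely to mimic the duplication described.

```r
# Hypothetical lookup restricted to the words in the sample rows.
lookup <- c(Paris = "City", Chicago = "City",
            Panama = "Country", Israel = "Country",
            Asia = "Continent", Africa = "Continent")

# Illustrative helper: unique category labels for one comma-separated string.
label_words <- function(x) {
  tokens <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  labels <- unname(lookup[tokens])          # NA for unknown words
  paste(unique(labels[!is.na(labels)]), collapse = ",")
}

words <- c("Panama,Paris", "Panama,Israel,cat", "Panama,Paris,horse,",
           "Panama,Asia", "Panama", "Panama,Chicago", "Israel,Chicago",
           "Israel,lizard,Paris", "Israel,Panama,horse,Africa",
           "Panama,Paris")  # repeated on purpose for illustration

# Label each distinct string once, then map the labels back to all rows.
unique_words  <- unique(words)
unique_labels <- vapply(unique_words, label_words, character(1),
                        USE.NAMES = FALSE)
labels <- unique_labels[match(words, unique_words)]
```

With many repeated strings, `label_words` runs once per distinct value rather than once per row, and `match` does the mapping back.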
【Comments】: