I want to filter data with dplyr using a condition. The condition I want to test is whether a country-year combination has two versions.
df <- data.frame(country = c("country1", "country2", "country1", "country2", "country3"), year = rep(2011,5), version = c("versionA", "versionA", "versionB", "versionB", "versionB"))
Here is my attempt, written after reading a related answer:
df %>%
group_by(country, year) %>%
{if unique(version)==1 . else filter(version == "versionA")}
What I hope to get is data like this:
country year version
country1 2011 versionA
country2 2011 versionA
country3 2011 versionB
Answered on 2020-03-28 18:26:05
To count the number of unique values in each group we can use n_distinct, and then filter the rows based on that count.
library(dplyr)
df %>%
group_by(country, year) %>%
filter(if(n_distinct(version) == 2) version == 'versionA' else TRUE)
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
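The per-group behaviour of this filter can be checked on a small example in which versionA is not the first row of its group. This is only a sketch: the data frame `df2` and the name `res` are made up for illustration and are not part of the original question.

```r
library(dplyr)

# Hypothetical data: versionB is listed before versionA for country1
df2 <- data.frame(
  country = c("country1", "country1", "country3"),
  year    = c(2011, 2011, 2011),
  version = c("versionB", "versionA", "versionB")
)

res <- df2 %>%
  group_by(country, year) %>%
  # n_distinct() is evaluated once per group, so the if/else selects
  # one rule per group: keep versionA rows, or keep every row
  filter(if (n_distinct(version) == 2) version == "versionA" else TRUE) %>%
  ungroup()

res
```

Because the condition tests the value of version rather than its position, the row order inside a group should not matter here.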
Answered on 2020-03-28 18:21:31
After grouping by 'country' and 'year', keep 'versionA' if the group has more than one distinct version; otherwise keep the first row of the group.
library(dplyr)
df %>%
group_by(country, year) %>%
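# The first-row rule applies only to single-version groups; otherwise a
# non-versionA first row would be kept as well
filter((n_distinct(version) > 1 & version == 'versionA') | (n_distinct(version) == 1 & row_number() == 1))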
# A tibble: 3 x 3
# Groups: country, year [3]
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
The same logic can also be written with if/else:
df %>%
group_by(country, year) %>%
filter(if(n_distinct(version) > 1) version == 'versionA'
else row_number() ==1)
# A tibble: 3 x 3
# Groups: country, year [3]
# country year version
# <fct> <dbl> <fct>
#1 country1 2011 versionA
#2 country2 2011 versionA
#3 country3 2011 versionB
Another option is to use arrange, so that 'versionA' sorts first within each group, and then take the first row:
df %>%
arrange(country, year, version != 'versionA') %>%
group_by(country, year) %>%
slice(1)
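The sort key `version != 'versionA'` works because logical values sort with FALSE before TRUE. A minimal base R sketch of that ordering, using made-up version labels (`versionC` is hypothetical and does not appear in the question):

```r
# Hypothetical version labels, with versionA deliberately not first
version <- c("versionB", "versionA", "versionC")

# FALSE for versionA, TRUE for everything else
key <- version != "versionA"

# FALSE sorts before TRUE, so versionA moves to the front;
# the remaining elements keep their original relative order
ord <- order(key)
version[ord]
# → "versionA" "versionB" "versionC"
```

After such a sort, taking the first row of each group picks versionA whenever it is present.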
Or use summarise:
df %>%
group_by(country, year) %>%
summarise(version = if(n_distinct(version) > 1) 'versionA' else first(version))
Or use data.table:
library(data.table)
# uniqueN() is data.table's own distinct counter, so dplyr does not need to be loaded
setDT(df)[, .SD[if(uniqueN(version) > 1) version == 'versionA'
else 1], .(country, year)]
Answered on 2020-03-28 19:39:07
Base R one-liner (thanks @akrun):
df[!(duplicated(df[1:2])),]
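When duplicated() is given a data frame it compares whole rows, so passing the first two columns treats each country/year pair as one key. A small sketch on made-up data (the data frame `d` is hypothetical) in which the same country appears in two different years:

```r
# Hypothetical data: country1 appears in two different years
d <- data.frame(
  country = c("country1", "country1", "country1"),
  year    = c(2011, 2011, 2012),
  version = c("versionA", "versionB", "versionA")
)

# Rows are compared on country AND year together, so only the second
# row repeats an earlier country/year pair
dup <- duplicated(d[1:2])

# Keep the first row of each country/year pair
d[!dup, ]
```

Note that this keeps whichever row comes first in each group, so it returns versionA only if the data are ordered that way.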
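Base R one-liner using the column names:
# The second argument of duplicated() is `incomparables`, not a second key,
# so the two key columns are combined into one vector first
df[!duplicated(paste(df$country, df$year)), ]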
Tidyverse solution:
library(tidyverse)
df %>%
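# duplicated(country, year) would treat year as `incomparables`;
# distinct() keeps the first row of each country/year pair
distinct(country, year, .keep_all = TRUE)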
A more general base R solution:
# Number the rows within each country/year group:
df$tmp <- ave(seq_len(nrow(df)), df$country, df$year, FUN = seq_along)
# Subset the dataframe to hold only the first record for each year/country:
df[df$tmp == 1, ]
A more general tidyverse solution, which sorts by version first so that 'versionA' comes first within each group:
df %>%
arrange(version) %>%
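# duplicated(country, year) would treat year as `incomparables`;
# distinct() keeps the first row of each country/year pair
distinct(country, year, .keep_all = TRUE)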
https://stackoverflow.com/questions/60909423