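I have a dataset on people's residential mobility that I want to use to study certain aspects of a disease. I received the data in Excel format. The problem is that each person has multiple rows, one for each location where that person has lived. A dummy example looks like this: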
dput(df)
df <- structure(list(caseid = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L), startdate = c(2001L, 2008L,
2012L, 1990L, 2009L, 1998L, 1999L, 2002L, 2003L, 2007L, 2011L,
2015L, 2003L, 2013L, 2013L, 2019L, 2004L), enddate = c(2008L,
2012L, NA, 2009L, NA, 1999L, 2002L, 2003L, 2007L, 2011L, 2015L,
NA, 2013L, 2013L, 2019L, NA, NA), cityno = c(418L, 20L, 420L,
544L, 132L, 312L, 350L, 312L, 350L, 99L, 99L, 351L, 862L, 863L,
861L, 43L, 449L), location = c("north", "south", "north", "mid",
"south", "mid", "mid", "mid", "mid", "south", "south", "mid",
"north", "north", "north", "south", "north")), class = "data.frame", row.names = c(NA,
-17L))
> df
   caseid startdate enddate cityno location
1       1      2001    2008    418    north
2       1      2008    2012     20    south
3       1      2012      NA    420    north
4       2      1990    2009    544      mid
5       2      2009      NA    132    south
6       3      1998    1999    312      mid
7       3      1999    2002    350      mid
8       3      2002    2003    312      mid
9       3      2003    2007    350      mid
10      3      2007    2011     99    south
11      3      2011    2015     99    south
12      3      2015      NA    351      mid
13      4      2003    2013    862    north
14      4      2013    2013    863    north
15      4      2013    2019    861    north
16      4      2019      NA     43    south
17      5      2004      NA    449    north
Each ID has a different number of locations, and there is no variable that labels the residences (residence1, residence2, etc.).
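For example, I want to study whether living in a particular location increases the chance of developing the disease. As a first step, I thought of reshaping the data to wide format so that each ID can be classified by its residence locations. I tried spread and dcast, but both need an extra variable that identifies the residence (in dcast, to build the formula). I could not work out how to create such a variable, and I am not sure I am on the right track. Any ideas?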
PS: This is my first time posting a question here, so please let me know if I have not provided the right information.
Posted on 2020-12-27 16:05:03
You can create a separate column for the residence number and then pivot to wide format:
library(dplyr)
library(tidyr)
# Number the residences within each caseid, then spread them into columns
df %>% group_by(caseid) %>% mutate(residence = paste0("Residence", row_number())) %>%
pivot_wider(id_cols = caseid, values_from = c(startdate, enddate, cityno), names_from = residence,
names_glue = "{residence}_{.value}")
# A tibble: 5 x 22
# Groups: caseid [5]
  caseid Residence1_star~ Residence2_star~ Residence3_star~ Residence4_star~ Residence5_star~ Residence6_star~
   <int>            <int>            <int>            <int>            <int>            <int>            <int>
1      1             2001             2008             2012               NA               NA               NA
2      2             1990             2009               NA               NA               NA               NA
3      3             1998             1999             2002             2003             2007             2011
4      4             2003             2013             2013             2019               NA               NA
5      5             2004               NA               NA               NA               NA               NA
# ... with 15 more variables: Residence7_startdate <int>, Residence1_enddate <int>, Residence2_enddate <int>,
# Residence3_enddate <int>, Residence4_enddate <int>, Residence5_enddate <int>, Residence6_enddate <int>,
# Residence7_enddate <int>, Residence1_cityno <int>, Residence2_cityno <int>, Residence3_cityno <int>,
# Residence4_cityno <int>, Residence5_cityno <int>, Residence6_cityno <int>, Residence7_cityno <int>
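The groups of Residence columns expand to the maximum number of residences found for any caseid (seven in this example), and shorter histories are padded with NA. If this is not what you want, please add your desired output to the question.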
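If you would rather avoid extra packages, the same residence-index idea can be sketched in base R. This is only a minimal sketch: it uses the first five rows of the example data, assumes the rows of each caseid are already in chronological order, and the names residence and wide are made up for illustration.

```r
# Subset of the example data (caseid 1 and 2 only)
df <- data.frame(
  caseid    = c(1L, 1L, 1L, 2L, 2L),
  startdate = c(2001L, 2008L, 2012L, 1990L, 2009L),
  enddate   = c(2008L, 2012L, NA, 2009L, NA),
  cityno    = c(418L, 20L, 420L, 544L, 132L)
)

# Sequential residence number within each caseid
# (assumes rows are already sorted by startdate within caseid)
df$residence <- ave(df$caseid, df$caseid, FUN = seq_along)

# One row per caseid; columns are named startdate.1, enddate.1, cityno.1, ...
wide <- reshape(df, idvar = "caseid", timevar = "residence", direction = "wide")
```

As with pivot_wider, people with fewer residences get NA in the unused columns.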
https://stackoverflow.com/questions/65450140