Question about reshaping data from long to wide format

Stack Overflow user

Asked 2020-12-25 17:11:18

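I have a dataset on people's residential mobility that I want to use to study aspects of certain diseases. I received the data in Excel format, and the problem is that each person has multiple rows, one for each location where that person has lived. A dummy example looks like this: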

dput(df)
df <- structure(list(caseid = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L), startdate = c(2001L, 2008L, 
2012L, 1990L, 2009L, 1998L, 1999L, 2002L, 2003L, 2007L, 2011L, 
2015L, 2003L, 2013L, 2013L, 2019L, 2004L), enddate = c(2008L, 
2012L, NA, 2009L, NA, 1999L, 2002L, 2003L, 2007L, 2011L, 2015L, 
NA, 2013L, 2013L, 2019L, NA, NA), cityno = c(418L, 20L, 420L, 
544L, 132L, 312L, 350L, 312L, 350L, 99L, 99L, 351L, 862L, 863L, 
861L, 43L, 449L), location = c("north", "south", "north", "mid", 
"south", "mid", "mid", "mid", "mid", "south", "south", "mid", 
"north", "north", "north", "south", "north")), class = "data.frame", row.names = c(NA, 
-17L))

> df
   caseid startdate enddate cityno location
1       1      2001    2008    418    north
2       1      2008    2012     20    south
3       1      2012      NA    420    north
4       2      1990    2009    544      mid
5       2      2009      NA    132    south
6       3      1998    1999    312      mid
7       3      1999    2002    350      mid
8       3      2002    2003    312      mid
9       3      2003    2007    350      mid
10      3      2007    2011     99    south
11      3      2011    2015     99    south
12      3      2015      NA    351      mid
13      4      2003    2013    862    north
14      4      2013    2013    863    north
15      4      2013    2019    861    north
16      4      2019      NA     43    south
17      5      2004      NA    449    north

Each ID has a different number of locations, and there is no variable holding a residence label (residence1, residence2, etc.).

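For example, I want to study whether living in a particular location increases the odds of getting the disease. I think a first step would be to reshape the data to wide format so that each ID can be classified by its residence locations. I tried spread and dcast, but they need an extra variable describing the residence (to build the formula in dcast). I couldn't find how to create such a variable, and I'm not sure I'm going about this the right way. Any ideas?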

PS: This is my first time posting a question here, so please let me know if I haven't provided the right information.

1 Answer

Stack Overflow user

Answered 2020-12-27 16:05:03

You can create a separate column for the residence number and then pivot to wide format:

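library(dplyr)
library(tidyr)
# Number each person's rows in order of appearance, then spread those numbers into columns
df %>% group_by(caseid) %>% mutate(residence = paste0("Residence", row_number())) %>%
  pivot_wider(id_cols = caseid, values_from = c(startdate, enddate, cityno), names_from = residence,
              names_glue = "{residence}_{.value}")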

# A tibble: 5 x 22
# Groups:   caseid [5]
  caseid Residence1_star~ Residence2_star~ Residence3_star~ Residence4_star~ Residence5_star~ Residence6_star~
   <int>            <int>            <int>            <int>            <int>            <int>            <int>
1      1             2001             2008             2012               NA               NA               NA
2      2             1990             2009               NA               NA               NA               NA
3      3             1998             1999             2002             2003             2007             2011
4      4             2003             2013             2013             2019               NA               NA
5      5             2004               NA               NA               NA               NA               NA
# ... with 15 more variables: Residence7_startdate <int>, Residence1_enddate <int>, Residence2_enddate <int>,
#   Residence3_enddate <int>, Residence4_enddate <int>, Residence5_enddate <int>, Residence6_enddate <int>,
#   Residence7_enddate <int>, Residence1_cityno <int>, Residence2_cityno <int>, Residence3_cityno <int>,
#   Residence4_cityno <int>, Residence5_cityno <int>, Residence6_cityno <int>, Residence7_cityno <int>
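
A possible variation, not part of the original answer: the call above leaves `location` out of `values_from`, so the wide table has no location columns, even though the question is about classifying people by where they lived. The sketch below adds `location` and calls `ungroup()` so the result is no longer grouped by `caseid`. It assumes dplyr and tidyr (1.1.0 or later, for `names_glue`) are installed. `df_small` and `wide` are illustrative names, and `df_small` is just the first two cases from the question's data.

```r
library(dplyr)
library(tidyr)

# Illustrative subset of the question's data: caseid 1 and 2 only
df_small <- data.frame(
  caseid    = c(1L, 1L, 1L, 2L, 2L),
  startdate = c(2001L, 2008L, 2012L, 1990L, 2009L),
  enddate   = c(2008L, 2012L, NA, 2009L, NA),
  cityno    = c(418L, 20L, 420L, 544L, 132L),
  location  = c("north", "south", "north", "mid", "south")
)

wide <- df_small %>%
  group_by(caseid) %>%
  # Number each person's rows in the order they appear
  mutate(residence = paste0("Residence", row_number())) %>%
  # Drop the grouping so the reshaped table is a plain tibble
  ungroup() %>%
  pivot_wider(
    id_cols     = caseid,
    names_from  = residence,
    # location is included here, unlike in the answer's call
    values_from = c(startdate, enddate, cityno, location),
    names_glue  = "{residence}_{.value}"
  )

print(wide)
```

This relies on the rows already being in chronological order within each `caseid`, as they are in the example. If that is not guaranteed, sorting with `arrange(caseid, startdate)` before `group_by()` would be one way to make the residence numbers follow the dates.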