这与前面的问题here密切相关。然而,我需要一些稍微不同的东西。
我有一个很长的对象列表,我需要将它们划分为较小的列表,每个列表都有一定数量的条目。我需要能够为不同的任务改变列表的长度。问题是每个对象在单个列表中只能出现一次。
# Create some example data...
# Make a list of objects.
LIST <- c('Oranges', 'Toast', 'Truck', 'Dog', 'Hippo', 'Bottle', 'Hope', 'Mint', 'Red', 'Trees', 'Watch', 'Cup', 'Pencil', 'Lunch', 'Paper', 'Peanuts', 'Cloud', 'Forever', 'Ocean', 'Train', 'Fork', 'Moon', 'Horse', 'Parrot', 'Leaves', 'Book', 'Cheese', 'Tin', 'Bag', 'Socks', 'Lemons', 'Blue', 'Plane', 'Hammock', 'Roof', 'Wind', 'Green', 'Chocolate', 'Car', 'Distance')
# Generate a longer list, with a random sequence and number of repetitions for each entry.
set.seed(123)
LONG.LIST <- data.frame(Name = (sample(LIST, size = 200, replace = TRUE)))
print(LONG.LIST)
Name
1 Cup
2 Distance
3 Roof
4 Pencil
5 Lunch
6 Toast
7 Watch
8 Bottle
9 Car
10 Roof
11 Lunch
12 Forever
13 Cheese
14 Oranges
15 Ocean
16 Chocolate
17 Socks
18 Leaves
19 Oranges
20 Distance
21 Green
22 Paper
23 Red
24 Paper
25 Trees
26 Chocolate
27 Bottle
28 Dog
29 Wind
30 Parrot
etc....对于参数,假设我想创建一系列包含20个项目的列表。使用上面生成的示例,'Distance'出现在位置'2‘和位置'20','Lunch'出现在'5’和'11‘,'Oranges'出现在'14’和19',因此没有重复的第一个列表需要扩展以包括'Green'、'Paper'和'Red'。然后,第二个列表将从位置24的'Paper'开始。然而,我不想被限制在20个长度内,有时我可能希望它是10或25。
在下面的@LAP中添加注释,这有助于描述我的问题;“检查向量,直到找到20个唯一项,将它们放在一起,丢弃重复项,然后继续向量,直到找到下20个唯一项,依此类推,直到向量的末尾,用NA填充最后一部分。
“单独的列表只需要自身是唯一的。两个或多个列表之间可能存在重复项。”
最后一个列表可能是不完整的,所以最好用'NA's填充它。理想情况下,每个列表中的条目应该是按字母顺序排列的。
最有用的输出是数据帧中的每列一个列表。
发布于 2018-02-22 22:41:03
好的,这是一个部分的答案,因为我想我已经得到了你需要的大部分。
请注意,对于大数据,这可能会很慢。
首先,您可以初始化一个列表,其中的空向量数量与您想要的分组数量一样多。在这个例子中,我们希望从一个200个项目的向量中创建10个20个的组。
首先,我们创建可复制的数据:
LIST <- c('Oranges', 'Toast', 'Truck', 'Dog', 'Hippo', 'Bottle', 'Hope', 'Mint', 'Red',
'Trees', 'Watch', 'Cup', 'Pencil', 'Lunch', 'Paper', 'Peanuts', 'Cloud', 'Forever',
'Ocean', 'Train', 'Fork', 'Moon', 'Horse', 'Parrot', 'Leaves', 'Book', 'Cheese',
'Tin', 'Bag', 'Socks', 'Lemons', 'Blue', 'Plane', 'Hammock', 'Roof', 'Wind', 'Green',
'Chocolate', 'Car', 'Distance')
set.seed(123)
LONG.LIST <- data.frame(Name = (sample(LIST, size = 200, replace = TRUE)), stringsAsFactors = F)
test <- vector("list", 10)然后初始化两个计数器:
i <- 1
j <- 1现在我们使用一个while循环,该循环一直运行到i大于我们的向量中要拆分的项数(因此当i > 200时它停止)。在这个循环中,我们检查列表中的当前子向量j是否小于20。如果是,我们添加一个项目并进行重复数据删除,如果不是,我们向j添加1以跳转到下一个子向量。
while(i <= nrow(LONG.LIST)){
if(length(test[[j]]) < 20){
test[[j]] <- c(test[[j]], LONG.LIST$Name[i])
test[[j]] <- unique(test[[j]])
i <- i+1
}else{
j <- j+1
}
}这是我们的结果:
> test
[[1]]
[1] "Lunch" "Cheese" "Truck" "Roof" "Hope" "Mint" "Lemons" "Pencil" "Hippo" "Moon"
[11] "Car" "Chocolate" "Trees" "Distance" "Dog" "Bag" "Paper" "Peanuts" "Ocean" "Wind"
[[2]]
[1] "Hippo" "Wind" "Mint" "Plane" "Trees" "Truck" "Lemons" "Watch" "Chocolate" "Train"
[11] "Dog" "Lunch" "Green" "Horse" "Toast" "Distance" "Cloud" "Hammock" "Fork" "Paper"
[[3]]
[1] "Watch" "Hope" "Paper" "Socks" "Bag" "Plane" "Bottle" "Green" "Lunch" "Fork"
[11] "Mint" "Hippo" "Chocolate" "Car" "Trees" "Toast" "Forever" "Red" "Wind" "Ocean"
[[4]]
[1] "Car" "Lunch" "Toast" "Lemons" "Moon" "Socks" "Hippo" "Pencil" "Blue" "Fork" "Paper"
[12] "Distance" "Cloud" "Train" "Wind" "Watch" "Bottle" "Forever" "Green" "Bag"
[[5]]
[1] "Train" "Cheese" "Bottle" "Fork" "Paper" "Green" "Leaves" "Blue" "Toast" "Parrot" "Lemons" "Dog"
[13] "Hammock" "Ocean" "Red" "Peanuts" "Pencil" "Bag" "Horse" "Hope"
[[6]]
[1] "Oranges" "Truck" "Hippo" "Trees" "Parrot" "Red" "Hope" "Cloud" "Tin" "Bag"
[11] "Pencil" "Cup" "Dog" "Leaves" "Chocolate" "Mint" "Plane" "Moon" "Fork" "Green"
[[7]]
[1] "Tin" "Mint" "Book" "Bag" "Roof" "Hope" "Socks" "Watch" "Paper" "Peanuts"
[11] "Cup" "Distance" "Leaves" "Bottle" "Cloud" "Horse" "Trees" "Oranges" "Chocolate" "Toast"
[[8]]
[1] "Horse" "Watch" "Chocolate" "Tin" "Red" "Train"
[[9]]
NULL
[[10]]
NULL现在我们只需要用NA填充最后的向量。这可能会以不同的方式完成,但它可以完成工作:
for(i in 1:length(test)){
if(length(test[[i]]) < 20){
test[[i]] <- c(test[[i]], rep(NA, 20 - length(test[[i]])))
}
}发布于 2018-02-22 22:39:30
这是一个潜在的答案,它不是很漂亮,但我认为这是你想要的:
首先是数据:
LIST <- c('Oranges', 'Toast', 'Truck', 'Dog', 'Hippo', 'Bottle', 'Hope',
'Mint', 'Red', 'Trees', 'Watch', 'Cup', 'Pencil', 'Lunch', 'Paper',
'Peanuts', 'Cloud', 'Forever', 'Ocean', 'Train', 'Fork', 'Moon',
'Horse', 'Parrot', 'Leaves', 'Book', 'Cheese', 'Tin', 'Bag',
'Socks', 'Lemons', 'Blue', 'Plane', 'Hammock', 'Roof', 'Wind',
'Green', 'Chocolate', 'Car', 'Distance')
set.seed(123)
LONG.LIST <- data.frame(Name = (sample(LIST, size = 200, replace = TRUE)))创建一个函数,该函数将从数据帧的顶部找到20个唯一元素,并根据该元素将数据帧拆分为两个列表元素:
library(tidyverse)
spliter <- function(df){
df %>%
as.tibble()%>%
mutate(Name = as.character(Name),
dup = !duplicated(Name),
cum = cumsum(dup),
splt = ifelse(cum <= 20, 0, 1)) %>%
{split(df, .$splt)}
}现在,将此函数应用于结果列表的第二个元素,直到没有可拆分的内容为止,删除每个列表元素中的重复项:
b <- spliter(LONG.LIST)
c1 <- list(b[[1]] %>%
filter(!duplicated(Name)))
i <- 1
while(length(b) != 1){
i <- i+1
b <- spliter(b[[2]])
c1[[i]] <- b[[1]] %>%
filter(!duplicated(Name))
}如果需要,用NA填充最后一个元素:
c1 <- lapply(c1, function(x){
if(nrow(x) < 20){
data.frame(Name = c(as.character(x$Name), rep(NA_character_, (20-length(x$Name)))))
} else( x)
})合并到数据帧:
do.call(cbind, c1)
Name Name Name Name Name Name Name
1 Cup Green Wind Mint Book Hammock Parrot
2 Blue Tin Paper Bottle Pencil Trees Hammock
3 Cloud Blue Cheese Cheese Red Dog Pencil
4 Wind Oranges Dog Lunch Paper Socks Bag
5 Chocolate Train Peanuts Pencil Distance Train Watch
6 Toast Lemons Watch Blue Hope Peanuts Train
7 Moon Red Plane Dog Dog Hippo Horse
8 Horse Pencil Forever Ocean Bottle Horse Green
9 Ocean Trees Blue Fork Tin Red Distance
10 Car Bottle Lemons Parrot Leaves Forever Leaves
11 Tin Cloud Book Train Wind Fork Chocolate
12 Hippo Paper Bag Car Cheese Paper Ocean
13 Trees Hope Oranges Wind Socks Book Cloud
14 Lunch Ocean Train Green Fork Moon Cheese
15 Book Watch Red Leaves Plane Cloud Hope
16 Distance Roof Leaves Cloud Blue Watch <NA>
17 Cheese Toast Hippo Chocolate Forever Mint <NA>
18 Bag Forever Trees Truck Cloud Roof <NA>
19 Parrot Hippo Cloud Bag Oranges Cheese <NA>
20 Bottle Horse Distance Moon Mint Leaves <NA>下面是ngm的答案中的一个函数:
miss <- function(y, split){
require(tidyverse)
spliter <- function(df){
df %>%
as.tibble()%>%
mutate(Name = as.character(Name),
dup = !duplicated(Name),
cum = cumsum(dup),
splt = ifelse(cum <= split, 0, 1)) %>%
{split(df, .$splt)}
}
b <- spliter(y)
c1 <- list(b[[1]] %>%
filter(!duplicated(Name)))
i <- 1
while(length(b) != 1){
i <- i+1
b <- spliter(b[[2]])
c1[[i]] <- b[[1]] %>%
filter(!duplicated(Name))
}
c1 <- lapply(c1, function(x){
if(nrow(x) < 20){
data.frame(Name = c(as.character(x$Name), rep(NA_character_, (20-length(x$Name)))))
} else( x)
})
return(do.call(cbind, c1))
}用法:
miss(LONG.LIST, 20 )发布于 2018-02-22 22:58:41
此函数svu (“唯一拆分向量”)接受一个向量,并根据您的规范生成一个数据帧。
我不明白为什么输入会是列表或数据框。将输入设为向量似乎更自然。
words <- c('Oranges', 'Toast', 'Truck', 'Dog', 'Hippo', 'Bottle', 'Hope', 'Mint', 'Red', 'Trees', 'Watch', 'Cup', 'Pencil', 'Lunch', 'Paper', 'Peanuts', 'Cloud', 'Forever', 'Ocean', 'Train', 'Fork', 'Moon', 'Horse', 'Parrot', 'Leaves', 'Book', 'Cheese', 'Tin', 'Bag', 'Socks', 'Lemons', 'Blue', 'Plane', 'Hammock', 'Roof', 'Wind', 'Green', 'Chocolate', 'Car', 'Distance')
set.seed(123)
more_words <- sample(words, size = 200, replace = TRUE)
# x is the original vector and n is the desired number of
# words in each column of the resulting data frame.
svu <- function(x, n) {
# How many eventual columns?
n_cols <- trunc(length(x)/n)
# That many eventual columns all filled with NA for now.
vec_list <- lapply(1:n_cols, function(x) rep(NA, n))
# For each word...
for(string in x) {
for(i in 1:n_cols) {
if(!(string %in% vec_list[[i]]) && sum(is.na(vec_list[[i]])) > 0) {
# ...add it to a non-full column not containing that word.
vec_list[[i]][min(which(is.na(vec_list[[i]])))] <- string
break
}
}
}
# Make it a data frame
data.frame(do.call(cbind, vec_list), stringsAsFactors = FALSE)
}试试看:
svu(more_words, 20)
#> X1 X2 X3 X4 X5 X6 X7
#> 1 Cup Wind Wind Wind Wind Plane Wind
#> 2 Blue Ocean Car Bottle Plane Forever Bottle
#> 3 Cloud Horse Tin Watch Forever Wind Pencil
#> 4 Wind Toast Cloud Plane Blue Cheese Distance
#> 5 Chocolate Car Bottle Forever Hippo Mint Hope
#> 6 Toast Tin Trees Blue Mint Blue Dog
#> 7 Moon Moon Ocean Lemons Bottle Lunch Tin
#> 8 Horse Cup Watch Book Cheese Train Leaves
#> 9 Ocean Green Roof Bag Lunch Bottle Cheese
#> 10 Car Blue Toast Oranges Pencil Pencil Socks
#> 11 Tin Oranges Forever Train Dog Truck Fork
#> 12 Hippo Train Blue Red Ocean Chocolate Plane
#> 13 Trees Lemons Hippo Peanuts Fork Bag Blue
#> 14 Lunch Red Horse Leaves Parrot Moon Forever
#> 15 Book Pencil Red Paper Train Car Cloud
#> 16 Distance Trees Lemons Hippo Car Parrot Oranges
#> 17 Cheese Bottle Paper Trees Green Cloud Mint
#> 18 Bag Cloud Cheese Cheese Leaves Book Hammock
#> 19 Parrot Paper Dog Cloud Cloud Red Trees
#> 20 Bottle Hope Peanuts Distance Chocolate Paper Train
#> X8 X9 X10
#> 1 Wind Wind Trees
#> 2 Pencil Pencil Paper
#> 3 Bottle Trees Red
#> 4 Cheese Peanuts Socks
#> 5 Distance Red Roof
#> 6 Trees Paper Pencil
#> 7 Dog Socks Parrot
#> 8 Socks Book Watch
#> 9 Hammock Mint Green
#> 10 Peanuts Roof Distance
#> 11 Hippo Cheese Leaves
#> 12 Horse Leaves Chocolate
#> 13 Red Moon Ocean
#> 14 Forever Parrot Cloud
#> 15 Fork Hammock Cheese
#> 16 Paper Bag Hope
#> 17 Book Watch Horse
#> 18 Moon Train <NA>
#> 19 Cloud Horse <NA>
#> 20 Watch Green <NA>https://stackoverflow.com/questions/48928383
复制相似问题