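The R function caret::findCorrelation searches a correlation matrix and returns an integer vector of the variables that, if removed, would reduce the pairwise correlations among the remaining variables. Here is the R code for this function: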
function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
    if (names & is.null(colnames(x)))
        stop("'x' must have column names when `names = TRUE`")
    out <- if (exact)
        findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
    else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
    out
    if (names)
        out <- colnames(x)[out]
    out
}
and the function findCorrelation_fast, which is the one I am interested in (optional arguments removed):
findCorrelation_fast <- function(x, cutoff = .90)
{
    if(any(!complete.cases(x)))
        stop("The correlation matrix has some missing values.")
    averageCorr <- colMeans(abs(x))
    averageCorr <- as.numeric(as.factor(averageCorr))
    x[lower.tri(x, diag = TRUE)] <- NA
    combsAboveCutoff <- which(abs(x) > cutoff)
    colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
    rowsToCheck <- combsAboveCutoff %% nrow(x)
    colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
    rowsToDiscard <- !colsToDiscard
    deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
    deletecol <- unique(deletecol)
    deletecol
}
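The colsToCheck and rowsToCheck lines rely on R storing matrices column by column, so which() returns 1-based column-major positions. Below is a minimal NumPy sketch of the same mapping with 0-based indices. The 4x4 matrix and all variable names are made up for illustration and are not part of caret.

```python
import numpy as np

# Hypothetical 4x4 symmetric "correlation" matrix, for illustration only.
x = np.array([
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.30, 0.92],
    [0.10, 0.30, 1.00, 0.40],
    [0.20, 0.92, 0.40, 1.00],
])
n = x.shape[0]
cutoff = 0.9

# Keep only the strict upper triangle, mirroring lower.tri(x, diag = TRUE) <- NA.
upper = np.triu(np.abs(x), k=1)

# Column-major (Fortran-order) linear positions of entries above the cutoff.
# R's which() would report these same positions plus one.
linear = np.flatnonzero(upper.ravel(order="F") > cutoff)

# Recover row and column from each linear position (0-based here).
cols = linear // n  # R, 1-based: ceiling(pos / nrow(x))
rows = linear % n   # R, 1-based: pos %% nrow(x)

# np.unravel_index performs the same conversion directly.
rows_check, cols_check = np.unravel_index(linear, x.shape, order="F")
```

In the strict upper triangle the row index is always smaller than the column index, so the row is never the last one and the R expression `combsAboveCutoff %% nrow(x)` never evaluates to 0 here.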
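With the help of pandas, I am writing a function in Python 3 that mimics the intent of this function. My implementation contains a nested for loop, which, as far as I know, is not the most efficient way to get the desired result. The original R function does the job without any loops.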
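My two questions are:
1. How can I replace the nested for loop with a vectorized implementation?
2. findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construct looks very foreign to me, and it seems central to porting the R function successfully. Can anyone explain what this line is doing? My gut feeling is that it is very clever and exploits some behaviour unique to R.
My Python implementation and an example of its use: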
import numpy as np
import pandas as pd

# calculate pair-wise correlations
def findCorrelated(corrmat, cutoff = 0.8):
    ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
    # args:
    #   corrmat: a correlation matrix
    #   cutoff: pairwise absolute correlation cutoff
    # returns:
    #   positions of the variables to be removed
    if len(corrmat) != len(corrmat.columns):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis = 1)
    # set lower triangle and diagonal of correlation matrix to NA
    # (k = 1 excludes the diagonal; the np.bool alias no longer exists in recent NumPy)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k = 1).astype(bool))
    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs.corr of a or b is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if abs(corrmat.iloc[row, col]) > cutoff:
                if averageCorr.iloc[row] > averageCorr.iloc[col]:
                    to_delete.append(row)
                else:
                    to_delete.append(col)
    to_delete = list(set(to_delete))
    return to_delete
# generate some data
df = pd.DataFrame(np.random.randn(50,25))
# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff = 0.01) #set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
uncorrelated = df.drop(df.columns[removeCols], axis = 1, inplace = False)
print('Uncorrelated variables:')
print(uncorrelated)
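One way the nested loop could be replaced, offered as a sketch and not as a verified port of caret: mask the strict upper triangle, locate the pairs above the cutoff with np.where, and compare the two mean absolute correlations elementwise. The name find_correlated_vectorized is made up for this example. It compares absolute correlations, as the R code does, and on a tie it drops the column, which follows the loop above and may differ from R.

```python
import numpy as np
import pandas as pd


def find_correlated_vectorized(corrmat: pd.DataFrame, cutoff: float = 0.8) -> list:
    """Return positions of columns to drop, without explicit Python loops."""
    if corrmat.shape[0] != corrmat.shape[1]:
        raise ValueError("Correlation matrix is not square")

    abs_corr = corrmat.abs().to_numpy()
    average_corr = abs_corr.mean(axis=1)

    # Strict upper triangle: each pair (row < col) is considered exactly once.
    upper_mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    rows, cols = np.where(upper_mask & (abs_corr > cutoff))

    # For each offending pair, drop the member with the larger mean
    # absolute correlation; when the means are equal, drop the column.
    drop_row = average_corr[rows] > average_corr[cols]
    to_delete = np.where(drop_row, rows, cols)

    return sorted(set(to_delete.tolist()))


# Small hand-made example (values are illustrative only).
example = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.5],
     [0.1, 0.5, 1.0]]
)
print(find_correlated_vectorized(example, cutoff=0.8))
```

The function returns positional indices, so the matching columns can be dropped with something like df.drop(df.columns[positions], axis=1).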
Posted on 2021-02-08 14:14:09
This was asked four years ago, but I came here looking for a Python implementation myself, coming from R.
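Regarding question 2: as.numeric(as.factor(x)) gives the rank order of each value of x.
as.factor() turns the correlation values into a factor and assigns the levels in ascending numeric order. In effect it converts the numbers to character labels while keeping their relative order.
as.numeric() then converts those ordered levels back to numbers, so the lowest value of x becomes 1 and the highest becomes length(unique(x)). Tied values get the same integer, because as.factor() assigns them to the same level.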
Note that the as.numeric(as.factor(x)) snippet does not work on matrices, only on vectors.
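In Python, the same ranking can be approximated with a dense rank. The sketch below assumes a 1-D input and shows two routes that should agree: np.unique with return_inverse=True, and pandas' Series.rank(method="dense"). The helper name dense_rank and the sample values are made up for this example.

```python
import numpy as np
import pandas as pd


def dense_rank(values):
    """1-based dense rank: the smallest value maps to 1 and ties share a rank."""
    values = np.asarray(values).ravel()
    # np.unique sorts the distinct values; `inverse` gives, for each element,
    # the 0-based position of its value in that sorted array.
    _, inverse = np.unique(values, return_inverse=True)
    return inverse + 1


x = np.array([0.30, 0.10, 0.30, 0.75])
ranks_numpy = dense_rank(x)
ranks_pandas = pd.Series(x).rank(method="dense").astype(int).to_numpy()
print(ranks_numpy, ranks_pandas)
```

Because only the ordering of averageCorr is used in the later comparison, comparing the raw mean absolute correlations directly should give the same decisions as comparing their dense ranks.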
https://stackoverflow.com/questions/41761332