一些SQL数据库通过REGR_SLOPE、REGR_R2、REGR_SXX等一系列函数对值进行线性回归。BigQuery用户如何模拟这些函数?
发布于 2017-01-12 17:06:23
编辑器编辑:请参阅 linear regression is now natively supported in BigQuery。--Fh
REGR_函数可以转换为涉及BigQuery实现的聚合的表达式,但是要小心处理NULL。REGR_函数采用由自变量(X)和因变量"y“组成的对,忽略这两个变量都为NULL的对。
下面的查询一次计算所有REGR_表达式。它使用BigQuery标准SQL编写,并利用用户定义的SQL函数使查询更加简洁。通过在表达式中测试x和y的空值,这个版本保留了键。换句话说,如果某个键的所有对都包含NULL,则结果将为NULL。
CREATE TEMPORARY FUNCTION NonNullX(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, x, NULL));
CREATE TEMPORARY FUNCTION NonNullY(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, y, NULL));
CREATE TEMPORARY FUNCTION NonNullCount(y FLOAT64, x FLOAT64)
RETURNS INT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, 1, 0));
CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));
SELECT k,
AVG(NonNullX(y, x)) AS REGR_AVGX,
AVG(NonNullY(y, x)) AS REGR_AVGY,
SUM(NonNullCount(y, x)) AS REGR_COUNT,
AVG(NonNullY(y, x))-(COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)))*AVG(NonNullX(y, x)) AS REGR_INTERCEPT,
REGR_R2_COEF(VAR_POP(NonNullY(y, x)), VAR_POP(NonNullX(y, x)), CORR(NonNullY(y, x), NonNullX(y, x))) AS REGR_R2,
COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)) AS REGR_SLOPE,
SUM(NonNullCount(y, x)) * VAR_POP(NonNullX(y, x)) AS REGR_SXX,
SUM(NonNullCount(y, x)) * COVAR_POP(NonNullY(y, x), NonNullX(y, x)) AS REGR_SXY,
SUM(NonNullCount(y, x)) * VAR_POP(NonNullY(y, x)) AS REGR_SYY
FROM Temp.SimpleRegressionData
GROUP BY k;
如果您不关心有空结果,您可以在WHERE子句中筛选出包含NULL的对,聚合表达式将简化如下:
CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));
SELECT k,
AVG(x) AS REGR_AVGX,
AVG(y) AS REGR_AVGY,
SUM(1) AS REGR_COUNT,
AVG(y)-(COVAR_POP(y, x)/VAR_POP(x))*AVG(x) AS REGR_INTERCEPT,
REGR_R2_COEF(VAR_POP(y), VAR_POP(x), CORR(y, x)) AS REGR_R2,
COVAR_POP(y, x)/VAR_POP(x) AS REGR_SLOPE,
SUM(1) * VAR_POP(x) AS REGR_SXX,
SUM(1) * COVAR_POP(y, x) AS REGR_SXY,
SUM(1) * VAR_POP(y) AS REGR_SYY
FROM Temp.SimpleRegressionData
WHERE x IS NOT NULL AND Y IS NOT NULL
GROUP BY k;
请注意,如果某个键的所有非空x值相等,这些查询中的任何一个都可能被零错误除以。
https://stackoverflow.com/questions/41625819
复制