
Statistics arXiv Daily Digest [8.18]

Official account: arXiv每日学术速递 (arXiv Daily Academic Digest)
Published 2021-08-24 16:25:21


stat (Statistics): 15 papers in total

【1】 Spatio-temporal Parking Behaviour Forecasting and Analysis Before and During COVID-19 Link: https://arxiv.org/abs/2108.07731

Authors: Shuhui Gong, Xiaopeng Mo, Rui Cao, Yu Liu, Wei Tu, Ruibin Bai
Affiliations: School of Information Engineering, China University of Geosciences, Beijing, China; School of Computer Science, University of Nottingham Ningbo, Ningbo, China; Dept. of LSGI & SCRI, The Hong Kong Polytechnic, Hong Kong, China; Institute of Remote Sensing and
Comments: DeepSpatial '21: 2nd ACM SIGKDD Workshop on Deep Learning for Spatiotemporal Data, Applications, and Systems (this https URL)
Abstract: Parking demand forecasting and behaviour analysis have received increasing attention in recent years because of their critical role in mitigating traffic congestion and understanding travel behaviours. However, previous studies usually consider only temporal dependence and ignore the spatial correlations among parking lots for parking prediction. This is mainly due to the lack of direct physical connections or observable interactions between them. Thus, how to quantify the spatial correlation remains a significant challenge. To bridge the gap, in this study, we propose a spatial-aware parking prediction framework, which includes two steps, i.e. spatial connection graph construction and spatio-temporal forecasting. A case study in Ningbo, China is conducted using parking data of over one million records before and during COVID-19. The results show that the approach outperforms baseline methods on parking occupancy forecasting, especially for cases with high temporal irregularity such as during COVID-19. Our work has revealed the impact of the pandemic on parking behaviour and also accentuated the importance of modelling spatial dependence in parking behaviour forecasting, which can benefit future studies on epidemiology and human travel behaviours.

【2】 Semi-parametric Bayesian Additive Regression Trees Link: https://arxiv.org/abs/2108.07636

Authors: Estevão B. Prado, Andrew C. Parnell, Nathan McJames, Ann O'Shea, Rafael A. Moral
Affiliations: Hamilton Institute, University, Co. Kildare, Ireland; Department of Mathematics & Statistics; Insight Centre for Data Analytics
Abstract: We propose a new semi-parametric model based on Bayesian Additive Regression Trees (BART). In our approach, the response variable is approximated by a linear predictor and a BART model, where the first component is responsible for estimating the main effects and BART accounts for the non-specified interactions and non-linearities. The novelty in our approach lies in the way we change tree generation moves in BART to deal with confounding between the parametric and non-parametric components when they have covariates in common. Through synthetic and real-world examples, we demonstrate that the performance of the new semi-parametric BART is competitive when compared to regression models and other tree-based methods. The implementation of the proposed method is available at https://github.com/ebprado/SP-BART.

【3】 Non-Asymptotic Bounds for the $\ell_{\infty}$ Estimator in Linear Regression with Uniform Noise Link: https://arxiv.org/abs/2108.07630

Authors: Yufei Yi, Matey Neykov
Affiliations: Department of Statistics & Data Science, Carnegie Mellon University
Comments: 34 pages, 1 figure, 1 table
Abstract: The Chebyshev or $\ell_{\infty}$ estimator is an unconventional alternative to the ordinary least squares in solving linear regressions. It is defined as the minimizer of the $\ell_{\infty}$ objective function \begin{align*} \hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta}} \|\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\|_{\infty}. \end{align*} The asymptotic distribution of the Chebyshev estimator under a fixed number of covariates was recently studied (Knight, 2020), yet finite sample guarantees and generalizations to high-dimensional settings remain open. In this paper, we develop non-asymptotic upper bounds on the estimation error $\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^*\|_2$ for a Chebyshev estimator $\hat{\boldsymbol{\beta}}$, in a regression setting with uniformly distributed noise $\varepsilon_i\sim U([-a,a])$ where $a$ is either known or unknown. With relatively mild assumptions on the (random) design matrix $\mathbf{X}$, we can bound the error rate by $\frac{C_p}{n}$ with high probability, for some constant $C_p$ depending on the dimension $p$ and the law of the design. Furthermore, we illustrate that there exist designs for which the Chebyshev estimator is (nearly) minimax optimal. In addition, we show that "Chebyshev's LASSO" has advantages over the regular LASSO in high dimensional situations, provided that the noise is uniform. Specifically, we argue that it achieves a much faster rate of estimation under certain assumptions on the growth rate of the sparsity level and the ambient dimension with respect to the sample size.
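
The $\ell_{\infty}$ objective in the abstract above is not differentiable, but it can be rewritten as a linear program. This is a standard reformulation and is not taken from the abstract itself; the auxiliary scalar $t$ and the row notation $\mathbf{x}_i^{\top}$ (the $i$-th row of $\mathbf{X}$) are introduced here only for illustration:

```latex
\begin{align*}
\min_{\boldsymbol{\beta} \in \mathbb{R}^p,\; t \ge 0} \quad & t \\
\text{subject to} \quad & -t \le Y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta} \le t,
\qquad i = 1, \dots, n.
\end{align*}
```

At any optimum, $t$ equals $\|\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\|_{\infty}$, so the $\boldsymbol{\beta}$ component of a solution is a Chebyshev estimator.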

【4】 Testing Multiple Linear Regression Systems with Metamorphic Testing Link: https://arxiv.org/abs/2108.07584

Authors: Quang-Hung Luu, Man F. Lau, Sebastian P. H. Ng, Tsong Yueh Chen
Affiliations: Department of Computer Science and Software Engineering, Swinburne University of Technology, Hawthorn, Australia
Comments: 24 pages, 5 figures, 7 tables. The Journal of Systems and Software (2021)
Abstract: Regression is one of the most commonly used statistical techniques. However, testing regression systems is a great challenge because a test oracle is generally absent. In this paper, we show that Metamorphic Testing is an effective approach to test multiple linear regression systems. In doing so, we identify intrinsic mathematical properties of linear regression, and then propose 11 Metamorphic Relations to be used for testing. Their effectiveness is examined using mutation analysis with a range of different regression programs. We further look at how the testing could be adopted in a more effective way. Our work can be applied to examine the reliability of regression-based predictive systems, which are widely used in economics, engineering and science, as well as the regression calculations carried out by statistical users.
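
To make the idea of a metamorphic relation concrete, here is a minimal sketch. It assumes a single-predictor least-squares fit rather than the multiple regression studied in the paper, and the three relations checked (shifting the response, scaling the response, reordering the observations) are well-known properties of ordinary least squares chosen for illustration. They are not claimed to be among the paper's 11 relations, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def close(a: float, b: float, tol: float = 1e-9) -> bool:
    """Floating-point comparison with an absolute tolerance."""
    return abs(a - b) <= tol


def check_relations(xs: List[float], ys: List[float]) -> Dict[str, bool]:
    """Check three illustrative metamorphic relations without knowing the 'true' fit."""
    a, b = fit_line(xs, ys)
    results = {}

    # Shifting every response by c should shift the intercept by c only.
    c = 5.0
    a1, b1 = fit_line(xs, [y + c for y in ys])
    results["shift_y"] = close(a1, a + c) and close(b1, b)

    # Scaling every response by k should scale both coefficients by k.
    k = 3.0
    a2, b2 = fit_line(xs, [k * y for y in ys])
    results["scale_y"] = close(a2, k * a) and close(b2, k * b)

    # Reordering the observations should not change the fit.
    a3, b3 = fit_line(xs[::-1], ys[::-1])
    results["permute"] = close(a3, a) and close(b3, b)

    return results


if __name__ == "__main__":
    print(check_relations([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 8.1, 9.8]))
```

No expected coefficients are needed: each check compares two runs of the same program, which is what lets this style of testing work without a test oracle.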

【5】 Modelling Time-Varying First and Second-Order Structure of Time Series via Wavelets and Differencing Link: https://arxiv.org/abs/2108.07550

Authors: Euan T. McGonigle, Rebecca Killick, Matthew A. Nunes
Affiliations: STOR-i Centre for Doctoral Training, Lancaster University; School of Mathematics, University of Bristol; Department of Mathematics and Statistics, Lancaster University; Department of Mathematics and Statistics, University of Bath
Abstract: Most time series observed in practice exhibit time-varying trend (first-order) and autocovariance (second-order) behaviour. Differencing is a commonly-used technique to remove the trend in such series, in order to estimate the time-varying second-order structure (of the differenced series). However, often we require inference on the second-order behaviour of the original series, for example, when performing trend estimation. In this article, we propose a method, using differencing, to jointly estimate the time-varying trend and second-order structure of a nonstationary time series, within the locally stationary wavelet modelling framework. We develop a wavelet-based estimator of the second-order structure of the original time series based on the differenced estimate, and show how this can be incorporated into the estimation of the trend of the time series. We perform a simulation study to investigate the performance of the methodology, and demonstrate the utility of the method by analysing data examples from environmental and biomedical science.
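
The motivation in this abstract can be seen on a toy series. The sketch below uses made-up numbers and a plain population variance as a crude second-order summary; it does not implement the wavelet estimator of the paper. It only shows that first differencing removes a linear trend while also changing the variability of the fluctuating component, which is why the second-order structure of the original series has to be recovered separately.

```python
from typing import List


def difference(series: List[float]) -> List[float]:
    """First-order difference: d[t] = x[t+1] - x[t]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def variance(values: List[float]) -> float:
    """Population variance, used here as a crude second-order summary."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical series: linear trend 3 + 2t plus an alternating +/-1 component.
trend = [3.0 + 2.0 * t for t in range(6)]
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
series = [m + e for m, e in zip(trend, noise)]

trend_diff = difference(trend)  # constant: the linear trend is removed
noise_diff = difference(noise)  # the fluctuation is rescaled by differencing

if __name__ == "__main__":
    print("differenced trend:", trend_diff)
    print("variance before/after differencing the fluctuation:",
          variance(noise), variance(noise_diff))
```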

【6】 Technical report: Impact of evaluation metrics and sampling on the comparison of machine learning methods for biodiversity indicators prediction Link: https://arxiv.org/abs/2108.07480

Authors: Geneviève Robin, Cathia Le Hasif
Affiliations: CNRS, Laboratoire de Mathématiques et Modélisation d’Évry, UEVE, France; Laboratoire de Mathématiques et Modélisation d’Évry, UEVE, France
Abstract: Machine learning (ML) approaches are used more and more widely in biodiversity monitoring. In particular, an important application is the problem of predicting biodiversity indicators such as species abundance, species occurrence or species richness, based on predictor sets containing, e.g., climatic and anthropogenic factors. Considering the impressive number of different ML methods available in the literature and the pace at which they are being published, it is crucial to develop uniform evaluation procedures, to allow the production of sound and fair empirical studies. However, defining fair evaluation procedures is challenging: because of well-documented intrinsic properties of biodiversity indicators, such as their zero-inflation and over-dispersion, it is not trivial to design good sampling schemes for cross-validation nor good evaluation metrics. Indeed, the classical Mean Squared Error (MSE) fails to capture subtle differences in the performance of different methods, particularly in terms of prediction of very small or very large values (e.g., zero counts or large counts). In this report, we illustrate this phenomenon by comparing ten statistical and machine learning models on the task of predicting waterbird abundance in the North-African area, based on geographical, meteorological and spatio-temporal factors. Our results highlight that different off-the-shelf evaluation metrics and cross-validation sampling approaches yield drastically different rankings of the metrics, and fail to capture interpretable conclusions.

【7】 Causal Inference with Noncompliance and Unknown Interference Link: https://arxiv.org/abs/2108.07455

Authors: Tadao Hoshino, Takahide Yanagi
Abstract: In this paper, we investigate a treatment effect model in which individuals interact in a social network and may not comply with the assigned treatments. We introduce a new concept of exposure mapping, which summarizes spillover effects into a fixed dimensional statistic of instrumental variables, and we call this mapping the instrumental exposure mapping (IEM). We investigate identification conditions for the intention-to-treat effect and the average causal effect for compliers, while explicitly considering the possibility of misspecification of IEM. Based on our identification results, we develop nonparametric estimation procedures for the treatment parameters. Their asymptotic properties, including consistency and asymptotic normality, are investigated using an approximate neighborhood interference framework by Leung (2021). For an empirical illustration of our proposed method, we revisit Paluck et al.'s (2016) experimental data on the anti-conflict intervention school program.

【8】 Limiting distributions of graph-based test statistics Link: https://arxiv.org/abs/2108.07446

Authors: Yejiong Zhu, Hao Chen
Affiliations: University of California, Davis
Abstract: Two-sample tests utilizing a similarity graph on observations are useful for high-dimensional data and non-Euclidean data due to their flexibility and good performance under a wide range of alternatives. Existing works mainly focused on sparse graphs, such as graphs with the number of edges of the order of the number of observations. However, the tests have better performance with denser graphs under many settings. In this work, we establish the theoretical ground for graph-based tests with graphs that are much denser than those in existing works.

【9】 InfoGram and Admissible Machine Learning Link: https://arxiv.org/abs/2108.07380

Authors: Subhadeep Mukhopadhyay
Comments: Keywords: Admissible machine learning; InfoGram; L-Features; Information-theory; ALFA-testing, Algorithmic risk management; Fairness; Interpretability; COREml; FINEml
Abstract: We have entered a new era of machine learning (ML), where the most accurate algorithm with superior predictive power may not even be deployable, unless it is admissible under the regulatory constraints. This has led to great interest in developing fair, transparent and trustworthy ML methods. The purpose of this article is to introduce a new information-theoretic learning framework (admissible machine learning) and algorithmic risk-management tools (InfoGram, L-features, ALFA-testing) that can guide an analyst to redesign off-the-shelf ML methods to be regulatory compliant, while maintaining good prediction accuracy. We have illustrated our approach using several real-data examples from financial sectors, biomedical research, marketing campaigns, and the criminal justice system.

【10】 Density Sharpening: Principles and Applications to Discrete Data Analysis Link: https://arxiv.org/abs/2108.07372

Authors: Subhadeep Mukhopadhyay
Abstract: This article introduces a general statistical modeling principle called "Density Sharpening" and applies it to the analysis of discrete count data. The underlying foundation is based on a new theory of nonparametric approximation and smoothing methods for discrete distributions which play a useful role in explaining and uniting a large class of applied statistical methods. The proposed modeling framework is illustrated using several real applications, from seismology to healthcare to physics.

【11】 Detecting changes in covariance via random matrix theory Link: https://arxiv.org/abs/2108.07340

Authors: Sean Ryan, Rebecca Killick
Abstract: A novel method is proposed for detecting changes in the covariance structure of moderate dimensional time series. This non-linear test statistic has a number of useful properties. Most importantly, it is independent of the underlying structure of the covariance matrix. We discuss how results from Random Matrix Theory can be used to study the behaviour of our test statistic in a moderate dimensional setting (i.e. the number of variables is comparable to the length of the data). In particular, we demonstrate that the test statistic converges pointwise to a normal distribution under the null hypothesis. We evaluate the performance of the proposed approach on a range of simulated datasets and find that it outperforms a range of alternative recently proposed methods. Finally, we use our approach to study changes in the amount of water on the surface of a plot of soil which feeds into model development for degradation of surface piping.

【12】 Augmenting control arms with Real-World Data for cancer trials: Hybrid control arm methods and considerations Link: https://arxiv.org/abs/2108.07335

Authors: W. Katherine Tan, Brian D. Segal, Melissa D. Curtis, Shrujal S. Baxi, William B. Capra, Elizabeth Garrett-Mayer, Brian P. Hobbs, David S. Hong, Rebecca A. Hubbard, Jiawen Zhu, Somnath Sarkar, Meghna Samant
Affiliations: Flatiron Health, Inc., New York, NY; Genentech, South San Francisco, CA; American Society of Clinical Oncology Center for Research and Analytics (CENTRA), Alexandria, VA; Dell Medical School, University of Texas, Austin, TX
Comments: 71 pgs (with supplemental) 3 Tables, 4 Figures
Abstract: Randomized controlled trials (RCTs) are the gold standard for assessing drug safety and efficacy. However, RCTs have some drawbacks which have led to the use of single-arm studies to make certain internal drug development and regulatory decisions, particularly in oncology. Hybrid controlled trials with real-world data (RWD), in which the control arm is composed of both trial and real-world patients, have the potential to help address some of the shortcomings of both RCTs and single-arm studies in particular situations, such as when a disease has low prevalence or when the standard of care to be used in the control arm is ineffective or highly toxic and an experimental therapy shows early promise. This paper discusses why it may be beneficial to consider hybrid controlled trials with RWD, what such a design entails, when it may be appropriate, and how to conduct the analyses. We propose a novel two-step borrowing method for the construction of hybrid control arms. We use simulations to demonstrate the operating characteristics of dynamic and static borrowing methods, and highlight the trade-offs and analytic decisions that study teams will need to address when designing a hybrid study.

【13】 Digital Divide: Mapping the geodemographics of internet accessibility across Great Britain Link: https://arxiv.org/abs/2108.07699

Authors: Claire Powell, Luke Burns
Affiliations: University of Leeds, Leeds, United Kingdom
Comments: 46 pages, 16 figures, 4 tables
Abstract: Aim: This research proposes the first solely sociodemographic measure of digital accessibility for Great Britain. Digital inaccessibility affects circa 10 million people who are unable to access or make full use of the internet, particularly impacting the disadvantaged in society. Method: A geodemographic classification is developed, analysing literature-guided sociodemographic variables at the district level. Analysis: Resultant clusters are analysed against their sociodemographic variables and spatial extent. Findings suggest three at-risk clusters exist, "Metropolitan Minority Struggle", "Indian Metropolitan Living" and "Pakistani-Bangladeshi Inequality". These are validated through nationwide Ofcom telecommunications performance data and specific case studies using Office for National Statistics internet usage data. Conclusion: Using solely contemporary and open-source sociodemographic variables, this paper enhances previous digital accessibility research. The identification of digitally inaccessible areas allows focussed local and national government resource and policy targeting, particularly important as a key data source and methodology post-2021, following the expected final nationwide census.

【14】 Fine-tuning is Fine in Federated Learning Link: https://arxiv.org/abs/2108.07313

Authors: Gary Cheng, Karan Chadha, John Duchi
Comments: 40 pages (10 main pages, 30 appendix pages), 13 figures
Abstract: We study the performance of federated learning algorithms and their variants in an asymptotic framework. Our starting point is the formulation of federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We propose a linear regression model, where, for a given client, we theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning and suggests that personalization is central to federated learning. Our theory suggests that Fine-tuned Federated Averaging (FTFA), i.e., Federated Averaging followed by local training, and the ridge regularized variant Ridge-tuned Federated Averaging (RTFA) are competitive with more sophisticated meta-learning and proximal-regularized approaches. In addition to being conceptually simpler, FTFA and RTFA are computationally more efficient than their competitors. We corroborate our theoretical claims with extensive experiments on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets.
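
The abstract describes FTFA as Federated Averaging followed by local training. The sketch below illustrates that two-stage structure on a deliberately tiny problem: a scalar linear model `y = w * x` with squared loss and three clients whose true slopes differ. The client data, step size, number of rounds, and the unweighted server average are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs held by one client


def loss(w: float, data: Data) -> float:
    """Mean squared error of the scalar model y = w * x on one client's data."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def grad(w: float, data: Data) -> float:
    """Derivative of the client's mean squared error with respect to w."""
    return sum(-2.0 * x * (y - w * x) for x, y in data) / len(data)


def local_train(w: float, data: Data, steps: int, lr: float) -> float:
    """Plain gradient descent on one client's data, starting from w."""
    for _ in range(steps):
        w -= lr * grad(w, data)
    return w


def federated_averaging(clients: List[Data], rounds: int,
                        local_steps: int, lr: float) -> float:
    """Each round: every client trains locally from the shared model,
    then the server takes an unweighted average of the results."""
    w = 0.0
    for _ in range(rounds):
        updates = [local_train(w, data, local_steps, lr) for data in clients]
        w = sum(updates) / len(updates)
    return w


def fine_tune(w_global: float, data: Data, steps: int, lr: float) -> float:
    """Personalisation step: continue training the shared model on one client."""
    return local_train(w_global, data, steps, lr)


# Hypothetical clients with true slopes 1, 2 and 3.
clients = [
    [(1.0, 1.0), (2.0, 2.0)],
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 3.0), (2.0, 6.0)],
]

w_global = federated_averaging(clients, rounds=50, local_steps=2, lr=0.05)
w_personal = fine_tune(w_global, clients[0], steps=10, lr=0.05)

if __name__ == "__main__":
    print("global model:", w_global, "loss on client 0:", loss(w_global, clients[0]))
    print("fine-tuned model:", w_personal, "loss on client 0:", loss(w_personal, clients[0]))
```

In this toy setup the shared model settles near the average slope, which fits no single client well, and a few local steps move it toward the first client's own slope. That is only meant to show the mechanics of the two stages, not the paper's asymptotic comparison.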

【15】 Understanding the factors driving the opioid epidemic using machine learning Link: https://arxiv.org/abs/2108.07301

Authors: Sachin Gavali, Chuming Chen, Julie Cowart, Xi Peng, Shanshan Ding, Cathy Wu, Tammy Anderson
Affiliations: University of Delaware, Newark, DE, USA
Comments: Submitted to IEEE International Conference on Bioinformatics & Biomedicine 2021
Abstract: In recent years, the US has experienced an opioid epidemic with an unprecedented number of drug overdose deaths. Research finds such overdose deaths are linked to neighborhood-level traits, thus providing opportunity to identify effective interventions. Typically, techniques such as Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE) are used to document neighborhood-level factors significant in explaining such adverse outcomes. These techniques are, however, less equipped to ascertain non-linear relationships between confounding factors. Hence, in this study we apply machine learning based techniques to identify opioid risks of neighborhoods in Delaware and explore the correlation of these factors using Shapley Additive explanations (SHAP). We discovered that the factors related to neighborhood environment, followed by education and then crime, were highly correlated with higher opioid risk. We also explored the change in these correlations over the years to understand the changing dynamics of the epidemic. Furthermore, we discovered that, as the epidemic has shifted from legal (i.e., prescription opioids) to illegal (e.g., heroin and fentanyl) drugs in recent years, the correlation of environment, crime and health related variables with the opioid risk has increased significantly while the correlation of economic and socio-demographic variables has decreased. The correlation of education related factors has been higher from the start and has increased slightly in recent years, suggesting a need for increased awareness about the opioid epidemic.
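
SHAP attributions are built on Shapley values from cooperative game theory. As a rough illustration of that underlying quantity, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all orderings. The feature labels echo the abstract, but the weights and the interaction term are invented for this example. It does not use the SHAP library and does not reproduce the authors' model or results.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(features: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings.
    Cost grows factorially, so this is only practical for a handful of features."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(present: FrozenSet[str]) -> float:
    """Hypothetical risk score: additive weights plus one interaction term."""
    score = 0.0
    if "environment" in present:
        score += 3.0
    if "education" in present:
        score += 2.0
    if "crime" in present:
        score += 1.0
    if "environment" in present and "crime" in present:
        score += 2.0  # invented interaction, shared equally by the two features
    return score


if __name__ == "__main__":
    phi = shapley_values(["environment", "education", "crime"], toy_risk)
    print(phi)
```

The attributions sum to the score of the full feature set minus the score of the empty set, which is the efficiency property that makes Shapley values an additive explanation.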

This article was shared from the WeChat official account arXiv每日学术速递 (arXiv Daily Academic Digest).
Originally published: 2021-08-18
