
Visit www.arxivdaily.com for daily digests with abstracts, covering CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, bookmarking, posting, and more.
stat (Statistics): 33 papers in total
【1】 Joint Shapley values: a measure of joint feature importance
Authors: Chris Harris, Richard Pymar, Colin Rowat
Affiliations: Visual Alpha, Tokyo, Japan; Economics, Mathematics and Statistics, Birkbeck College, University of London, UK; University of Birmingham, UK
Comments: Source code available at this https URL
Link: https://arxiv.org/abs/2107.11357
Abstract: The Shapley value is one of the most widely used model-agnostic measures of feature importance in explainable AI: it has clear axiomatic foundations, is guaranteed to uniquely exist, and has a clear interpretation as a feature's average effect on a model's prediction. We introduce joint Shapley values, which directly extend the Shapley axioms. This preserves the classic Shapley value's intuitions: joint Shapley values measure a set of features' average effect on a model's prediction. We prove the uniqueness of joint Shapley values, for any order of explanation. Results for games show that joint Shapley values present different insights from existing interaction indices, which assess the effect of a feature within a set of features. Deriving joint Shapley values in ML attribution problems thus gives us the first measure of the joint effect of sets of features on model predictions. In a dataset with binary features, we present a presence-adjusted method for calculating global values that retains the efficiency property.
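The classic (single-feature) Shapley value that this paper extends can be computed exactly for small games by averaging a player's marginal contributions over all coalitions. The sketch below is not the paper's joint-Shapley method, only the standard baseline it builds on, using the textbook "glove game" as an example:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values for a cooperative game.
    v maps a frozenset of players to a real-valued payoff."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))
        phi[i] = total
    return phi

# Glove game: player 0 holds a left glove, players 1 and 2 each hold a
# right glove; a coalition earns 1 only if it can form at least one pair.
v = lambda S: 1.0 if 0 in S and (1 in S or 2 in S) else 0.0
phi = shapley_values([0, 1, 2], v)  # known answer: {0: 2/3, 1: 1/6, 2: 1/6}
```

The efficiency axiom (values sum to the grand-coalition payoff) is the property the paper's presence-adjusted global values are designed to retain.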
【2】 RiLACS: Risk-Limiting Audits via Confidence Sequences
Authors: Ian Waudby-Smith, Philip B. Stark, Aaditya Ramdas
Affiliations: Carnegie Mellon University; University of California, Berkeley
Link: https://arxiv.org/abs/2107.11323
Abstract: Accurately determining the outcome of an election is a complex task with many potential sources of error, ranging from software glitches in voting machines to procedural lapses to outright fraud. Risk-limiting audits (RLAs) are statistically principled "incremental" hand counts that provide statistical assurance that reported outcomes accurately reflect the validly cast votes. We present a suite of tools for conducting RLAs using confidence sequences -- sequences of confidence sets which uniformly capture an electoral parameter of interest from the start of an audit to the point of an exhaustive recount with high probability. Adopting the SHANGRLA framework, we design nonnegative martingales which yield computationally and statistically efficient confidence sequences and RLAs for a wide variety of election types.
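The core device is a nonnegative martingale: under the null that the mean of [0,1]-bounded ballot scores is at most 1/2, a "betting" wealth process stays small with high probability, so crossing 1/α justifies stopping the audit (Ville's inequality). A minimal sketch with a fixed bet λ (the paper designs far better betting schemes; the names and numbers here are illustrative):

```python
def audit_martingale(x, lam=0.75, mu0=0.5, alpha=0.05):
    """Nonnegative test supermartingale for H0: mean of [0,1] ballot
    scores <= mu0.  Requires 0 < lam <= 1/mu0 so wealth stays >= 0.
    Stop and confirm the reported outcome once wealth >= 1/alpha."""
    wealth = 1.0
    for t, xi in enumerate(x, start=1):
        assert 0.0 <= xi <= 1.0
        wealth *= 1.0 + lam * (xi - mu0)
        if wealth >= 1.0 / alpha:
            return t, wealth   # audit can stop early
    return None, wealth        # inconclusive: proceed toward a full count

# Illustrative sample: 70% of audited ballots score 1 (reported winner).
ballots = ([1] * 7 + [0] * 3) * 4
stop, w = audit_martingale(ballots)  # stops after 17 ballots here
```

Running intersections of the implied confidence sets give the time-uniform confidence sequence; the fixed-λ bet is only the simplest member of the family the paper studies.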
【3】 Bayesian Precision Factor Analysis for High-dimensional Sparse Gaussian Graphical Models
Authors: Noirrit Kiran Chandra, Peter Mueller, Abhra Sarkar
Affiliations: Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX, USA; Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
Link: https://arxiv.org/abs/2107.11316
Abstract: Gaussian graphical models are popular tools for studying the dependence relationships between different random variables. We propose a novel approach to Gaussian graphical models that relies on decomposing the precision matrix encoding the conditional independence relationships into a low rank and a diagonal component. Such decompositions are already popular for modeling large covariance matrices as they admit a latent factor based representation that allows easy inference, but are yet to garner widespread use in precision matrix models due to their computational intractability. We show that a simple latent variable representation for such a decomposition in fact exists for precision matrices as well. The latent variable construction provides fundamentally novel insights into Gaussian graphical models. It is also immediately useful in Bayesian settings in achieving efficient posterior inference via a straightforward Gibbs sampler that scales very well to high-dimensional problems far beyond the limits of the current state-of-the-art. The ability to efficiently explore the full posterior space allows the model uncertainty to be easily assessed and the underlying graph to be determined via a novel posterior false discovery rate control procedure. The decomposition also crucially allows us to adapt sparsity inducing priors to shrink insignificant off-diagonal entries toward zero, making the approach adaptable to high-dimensional small-sample-size sparse settings. We evaluate the method's empirical performance through synthetic experiments and illustrate its practical utility in data sets from two different application domains.
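What makes a "low rank plus diagonal" precision matrix computationally attractive is the Woodbury identity: inverting a p × p precision of this form needs only a k × k solve, which is what keeps latent-variable Gibbs updates cheap. A small numerical check (sizes and names are illustrative, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 50, 3                       # dimension and latent rank (illustrative)
Lam = rng.normal(size=(p, k))      # low-rank loadings
delta = rng.uniform(1.0, 2.0, p)   # positive diagonal entries

# Precision matrix: low-rank component plus diagonal component.
Omega = Lam @ Lam.T + np.diag(delta)

# Woodbury identity: (D + L L^T)^{-1}
#   = D^{-1} - D^{-1} L (I_k + L^T D^{-1} L)^{-1} L^T D^{-1},
# so the implied covariance costs a k x k solve instead of a p x p inverse.
Dinv = np.diag(1.0 / delta)
core = np.eye(k) + Lam.T @ Dinv @ Lam
Sigma = Dinv - Dinv @ Lam @ np.linalg.solve(core, Lam.T @ Dinv)

assert np.allclose(Sigma, np.linalg.inv(Omega))
```

Zeros in the off-diagonal of Omega are the conditional independencies the graph encodes, which is why shrinking insignificant off-diagonal entries recovers a sparse graph.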
【4】 Bootstrapping Whittle Estimators
Authors: Jens-Peter Kreiss, Efstathios Paparoditis
Comments: 34 pages, 2 figures
Link: https://arxiv.org/abs/2107.11270
Abstract: Fitting parametric models by optimizing frequency domain objective functions is an attractive approach to parameter estimation in time series analysis. Whittle estimators are a prominent example in this context. Under weak conditions and the (realistic) assumption that the true spectral density of the underlying process does not necessarily belong to the parametric class of spectral densities fitted, the distribution of Whittle estimators typically depends on difficult-to-estimate characteristics of the underlying process. This makes the implementation of asymptotic results for the construction of confidence intervals, or for assessing the variability of estimators, difficult in practice. This paper proposes a frequency domain bootstrap method to estimate the distribution of Whittle estimators which is asymptotically valid under assumptions that not only allow for (possible) model misspecification but also for weak dependence conditions which are satisfied by a wide range of stationary stochastic processes. Adaptations of the bootstrap procedure to incorporate different modifications of Whittle estimators proposed in the literature, such as tapered, de-biased, or boundary-extended Whittle estimators, are also considered. Simulations demonstrate the capabilities of the proposed bootstrap method and its good finite sample performance. A real-life data analysis is also presented.
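For readers unfamiliar with Whittle estimation: it minimizes a frequency-domain objective built from the periodogram and the candidate spectral density. A minimal sketch for an AR(1) model with known unit innovation variance (grid search stands in for a proper optimizer; this is not the paper's bootstrap, only the estimator being bootstrapped):

```python
import numpy as np

rng = np.random.default_rng(1)
n, phi_true = 2000, 0.6
x = np.zeros(n)
for t in range(1, n):                  # simulate an AR(1) process
    x[t] = phi_true * x[t - 1] + rng.normal()

# Periodogram at the positive Fourier frequencies.
freqs = 2 * np.pi * np.arange(1, n // 2) / n
I = np.abs(np.fft.fft(x)[1:n // 2]) ** 2 / (2 * np.pi * n)

def whittle_loss(phi, sigma2=1.0):
    # AR(1) spectral density: f(w) = sigma2 / (2*pi*|1 - phi*e^{-iw}|^2).
    f = sigma2 / (2 * np.pi * (1 - 2 * phi * np.cos(freqs) + phi ** 2))
    return np.sum(np.log(f) + I / f)

grid = np.linspace(-0.95, 0.95, 381)
phi_hat = grid[np.argmin([whittle_loss(g) for g in grid])]
```

The paper's frequency-domain bootstrap would resample periodogram ordinates to approximate the sampling distribution of `phi_hat`, which matters precisely when the AR(1) class is misspecified.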
【5】 State, global and local parameter estimation using local ensemble Kalman filters: applications to online machine learning of chaotic dynamics
Authors: Quentin Malartic, Alban Farchi, Marc Bocquet
Affiliations: CEREA, École des Ponts and EDF R&D, Île-de-France, France; LMD/IPSL, ENS, PSL Université, École Polytechnique, Institut Polytechnique de Paris, Sorbonne Université, CNRS, Paris, France
Link: https://arxiv.org/abs/2107.11253
Abstract: Recent studies have shown that it is possible to combine machine learning methods with data assimilation to reconstruct a dynamical system using only sparse and noisy observations of that system. The same approach can be used to correct the error of a knowledge-based model. The resulting surrogate model is hybrid, with a statistical part supplementing a physical part. In practice, the correction can be added as an integrated term (i.e. in the model resolvent) or directly inside the tendencies of the physical model. The resolvent correction is easy to implement. The tendency correction is more technical, in particular it requires the adjoint of the physical model, but also more flexible. We use the two-scale Lorenz model to compare the two methods. The accuracy in long-range forecast experiments is somewhat similar between the surrogate models using the resolvent correction and the tendency correction. By contrast, the surrogate models using the tendency correction significantly outperform the surrogate models using the resolvent correction in data assimilation experiments. Finally, we show that the tendency correction opens the possibility to make online model error correction, i.e. improving the model progressively as new observations become available. The resulting algorithm can be seen as a new formulation of weak-constraint 4D-Var. We compare online and offline learning using the same framework with the two-scale Lorenz system, and show that with online learning, it is possible to extract all the information from sparse and noisy observations.
【6】 Multivariate Methods for Detection of Rubbery Rot in Storage Apples by Monitoring Volatile Organic Compounds: An Example of Multivariate Generalised Mixed Models
Authors: J. S. Pelck, H. Holthusen, M. Edelenbos, A. Luca, R. Labouriau
Affiliations: Department of Mathematics, Aarhus University; Department of Food Science, Aarhus University
Comments: 11 pages and 1 figure
Link: https://arxiv.org/abs/2107.11233
Abstract: This article is a case study illustrating the use of a multivariate statistical method for screening potential chemical markers for early detection of post-harvest disease in storage fruit. We simultaneously measure a range of volatile organic compounds (VOCs) and two measures of severity of disease infection in apples under storage: the number of apples presenting visible symptoms and the lesion area. We use multivariate generalised linear mixed models (MGLMMs) for studying association patterns of those simultaneously observed responses via the covariance structure of random components. Remarkably, those MGLMMs can be used to represent patterns of association between quantities of different statistical nature. In the particular example considered in this paper, there are positive responses (concentrations of VOCs, Gamma-distribution-based models), positive responses possibly containing observations with zero values (lesion area, compound-Poisson-distribution-based models) and binomially distributed responses (proportion of apples presenting infection symptoms). We represent patterns of association inferred with the MGLMMs using graphical models (a network represented by a graph), which allow us to eliminate spurious associations due to a cascade of indirect correlations between the responses.
【7】 A hierarchical prior for generalized linear models based on predictions for the mean response
Authors: Ethan M. Alt, Matthew A. Psioda, Joseph G. Ibrahim
Affiliations: Department of Biostatistics, University of North Carolina, Chapel Hill, NC
Link: https://arxiv.org/abs/2107.11195
Abstract: There has been increased interest in using prior information in statistical analyses. For example, in rare diseases, it can be difficult to establish treatment efficacy based solely on data from a prospective study due to low sample sizes. To overcome this issue, an informative prior for the treatment effect may be elicited. We develop a novel extension of the conjugate prior of Chen and Ibrahim (2003) that enables practitioners to elicit a prior prediction for the mean response for generalized linear models, treating the prediction as random. We refer to the hierarchical prior as the hierarchical prediction prior (HPP). For i.i.d. settings and the normal linear model, we derive cases for which the hyperprior is a conjugate prior. We also develop an extension of the HPP for situations where summary statistics from a previous study are available, drawing comparisons with the power prior. The HPP allows for discounting based on the quality of individual level predictions, having the potential to provide efficiency gains (e.g., lower MSE) where predictions are incompatible with the data. An efficient Markov chain Monte Carlo algorithm is developed. Applications illustrate that inferences under the HPP are more robust to prior-data conflict compared to selected non-hierarchical priors.
【8】 Estimation of sparse linear dynamic networks using the stable spline horseshoe prior
Authors: Gianluigi Pillonetto
Affiliations: Department of Information Engineering, University of Padova
Link: https://arxiv.org/abs/2107.11155
Abstract: Identification of so-called dynamic networks is one of the most challenging problems to have appeared recently in the control literature. Such systems consist of large-scale interconnected systems, also called modules. To recover the full network dynamics, the two crucial steps are topology detection, where one has to infer from data which connections are active, and module estimation. Since only a small percentage of connections are effective in many real systems, the problem also has fundamental connections with group-sparse estimation. In particular, in the linear setting modules correspond to unknown impulse responses expected to have null norm except in a small fraction of samples. This paper introduces a new Bayesian approach for linear dynamic network identification where impulse responses are described through the combination of two particular prior distributions. The first is a block version of the horseshoe prior, a model possessing important global-local shrinkage features. The second is the stable spline prior, which encodes information on the smooth exponential decay of the modules. The resulting model is called the stable spline horseshoe (SSH) prior. It implements aggressive shrinkage of small impulse responses, while larger impulse responses are conveniently subject to stable spline regularization. Inference is performed by a Markov chain Monte Carlo scheme, tailored to the dynamic context and able to efficiently return the posterior of the modules in sampled form. We include numerical studies showing how the new approach can accurately reconstruct sparse network dynamics even when thousands of unknown impulse response coefficients must be inferred from data sets of relatively small size.
【9】 A comparison of combined data assimilation and machine learning methods for offline and online model error correction
Authors: Alban Farchi, Marc Bocquet, Patrick Laloyaux, Massimo Bonavita, Quentin Malartic
Link: https://arxiv.org/abs/2107.11114
Abstract: Recent studies have shown that it is possible to combine machine learning methods with data assimilation to reconstruct a dynamical system using only sparse and noisy observations of that system. The same approach can be used to correct the error of a knowledge-based model. The resulting surrogate model is hybrid, with a statistical part supplementing a physical part. In practice, the correction can be added as an integrated term (i.e. in the model resolvent) or directly inside the tendencies of the physical model. The resolvent correction is easy to implement. The tendency correction is more technical, in particular it requires the adjoint of the physical model, but also more flexible. We use the two-scale Lorenz model to compare the two methods. The accuracy in long-range forecast experiments is somewhat similar between the surrogate models using the resolvent correction and the tendency correction. By contrast, the surrogate models using the tendency correction significantly outperform the surrogate models using the resolvent correction in data assimilation experiments. Finally, we show that the tendency correction opens the possibility to make online model error correction, i.e. improving the model progressively as new observations become available. The resulting algorithm can be seen as a new formulation of weak-constraint 4D-Var. We compare online and offline learning using the same framework with the two-scale Lorenz system, and show that with online learning, it is possible to extract all the information from sparse and noisy observations.
【10】 Kernel regression for cause-specific hazard models with time-dependent coefficients
Authors: Xiaomeng Qi, Zhangsheng Yu
Affiliations: School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China; SJTU-Yale Joint Centre for Biostatistics, School of Life Science, Shanghai Jiao Tong University, Shanghai, China
Link: https://arxiv.org/abs/2107.11025
Abstract: Competing risk data appear widely in modern biomedical research. Cause-specific hazard models have often been used to deal with competing risk data over the past two decades. The kernel likelihood method has not yet been studied for the cause-specific hazard model with time-varying coefficients. We propose to use the local partial log-likelihood approach for nonparametric time-varying coefficient estimation. Simulation studies demonstrate that our proposed nonparametric kernel estimator performs well under the assumed finite-sample settings. Finally, we apply the proposed method to analyze a diabetes dialysis study with competing causes of death.
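The local partial log-likelihood idea is to estimate the coefficient at each time point using kernel weights centered there. The paper does this for hazards; as a simplified stand-in, the same local kernel-weighting can be shown in a least-squares model with a time-varying coefficient (all names and numbers below are illustrative, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
t = rng.uniform(0, 2, n)                 # observation times
z = rng.normal(size=n)                   # covariate
beta = np.sin(t)                         # true time-varying coefficient
y = beta * z + 0.5 * rng.normal(size=n)  # outcome

def local_beta(t0, h=0.15):
    """Local-constant kernel estimate of beta(t0): weighted least squares
    with Gaussian kernel weights in time, bandwidth h."""
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)
    return np.sum(w * z * y) / np.sum(w * z ** 2)

b1 = local_beta(1.0)  # should be close to sin(1) ~ 0.84
```

In the hazard setting the inner weighted least squares is replaced by maximizing a kernel-weighted partial log-likelihood, but the bandwidth trade-off (bias vs. variance) is the same.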
【11】 Post-Treatment Confounding in Causal Mediation Studies: A Cutting-Edge Problem and A Novel Solution via Sensitivity Analysis
Authors: Guanglei Hong, Fan Yang, Xu Qin
Affiliations (equal first authors): University of Chicago; University of Colorado Denver
Link: https://arxiv.org/abs/2107.11014
Abstract: In causal mediation studies that decompose an average treatment effect into a natural indirect effect (NIE) and a natural direct effect (NDE), examples of post-treatment confounding are abundant. Past research has generally considered it infeasible to adjust for a post-treatment confounder of the mediator-outcome relationship due to incomplete information: it is observed under the actual treatment condition while missing under the counterfactual treatment condition. This study proposes a new sensitivity analysis strategy for handling post-treatment confounding and incorporates it into weighting-based causal mediation analysis without making extra identification assumptions. Under the sequential ignorability of the treatment assignment and of the mediator, we obtain the conditional distribution of the post-treatment confounder under the counterfactual treatment as a function of not just pretreatment covariates but also its counterpart under the actual treatment. The sensitivity analysis then generates a bound for the NIE and for the NDE over a plausible range of the conditional correlation between the post-treatment confounder under the actual and under the counterfactual conditions. Implemented through either imputation or integration, the strategy is suitable for binary as well as continuous measures of post-treatment confounders. Simulation results demonstrate major strengths and potential limitations of this new solution. A re-analysis of the National Evaluation of Welfare-to-Work Strategies (NEWWS) Riverside data reveals that the initial analytic results are sensitive to omitted post-treatment confounding.
【12】 Robust Estimation of High-Dimensional Vector Autoregressive Models
Authors: Di Wang, Ruey S. Tsay
Affiliations: Booth School of Business, University of Chicago
Comments: 37 pages, 4 figures
Link: https://arxiv.org/abs/2107.11002
Abstract: High-dimensional time series data appear in many scientific areas in the current data-rich environment. Analysis of such data poses new challenges to data analysts because of not only the complicated dynamic dependence between the series, but also the existence of aberrant observations, such as missing values, contaminated observations, and heavy-tailed distributions. For high-dimensional vector autoregressive (VAR) models, we introduce a unified estimation procedure that is robust to model misspecification, heavy-tailed noise contamination, and conditional heteroscedasticity. The proposed methodology enjoys both statistical optimality and computational efficiency, and can handle many popular high-dimensional models, such as sparse, reduced-rank, banded, and network-structured VAR models. With proper regularization and data truncation, the estimation convergence rates are shown to be nearly optimal under a bounded fourth moment condition. Consistency of the proposed estimators is also established under a relaxed bounded $(2+2\epsilon)$-th moment condition, for some $\epsilon\in(0,1)$, with slower convergence rates associated with $\epsilon$. The efficacy of the proposed estimation methods is demonstrated by simulation and a real example.
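The "data truncation" ingredient can be illustrated in a toy low-dimensional setting: winsorize heavy-tailed observations elementwise at a threshold, then run least squares on the truncated series. This sketch omits the paper's regularization and structure (the threshold and sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 4000
A = np.array([[0.5, 0.2, 0.0],
              [0.0, 0.4, 0.1],
              [0.1, 0.0, 0.3]])           # true VAR(1) transition matrix

# Heavy-tailed innovations: Student-t with 3 degrees of freedom.
x = np.zeros((n, d))
for t in range(1, n):
    x[t] = A @ x[t - 1] + rng.standard_t(df=3, size=d)

tau = 8.0                                 # truncation level (tuning parameter)
xt = np.clip(x, -tau, tau)                # elementwise winsorization

# Least squares on the truncated series: regress x_t on x_{t-1}.
Y, X = xt[1:], xt[:-1]
A_hat = np.linalg.solve(X.T @ X, X.T @ Y).T
```

Truncation caps the influence of extreme observations, which is what yields near-optimal rates under only a bounded fourth moment; in high dimensions the least-squares step would be replaced by a regularized estimator matching the assumed structure.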
【13】 A note on sharp oracle bounds for Slope and Lasso
Authors: Zhiyong Zhou
Affiliations: Department of Statistics, Zhejiang University City College, Hangzhou, China
Link: https://arxiv.org/abs/2107.10974
Abstract: In this paper, we study sharp oracle bounds for Slope and Lasso. We generalize the results of Bellec et al. (2018) to allow the case where the parameter vector is not exactly sparse, and obtain optimal bounds for the $\ell_q$ estimation errors with $1\leq q\leq \infty$ by using some extended Restricted Eigenvalue-type conditions.
【14】 The decomposition of the higher-order homology embedding constructed from the k-Laplacian
Authors: Yu-Chia Chen, Marina Meilă
Affiliations: Electrical & Computer Engineering, University of Washington, Seattle, WA; Department of Statistics, University of Washington
Link: https://arxiv.org/abs/2107.10970
Abstract: The null space of the $k$-th order Laplacian $\mathbf{\mathcal L}_k$, known as the $k$-th homology vector space, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian $\mathbf{\mathcal L}_0$ has spurred new research and applications, such as spectral clustering algorithms with theoretical guarantees and estimators of the Stochastic Block Model. In this work, we investigate the geometry of the $k$-th homology embedding and focus on cases reminiscent of spectral clustering. Namely, we analyze the connected sum of manifolds as a perturbation to the direct sum of their homology embeddings. We propose an algorithm to factorize the homology embedding into subspaces corresponding to a manifold's simplest topological components. The proposed framework is applied to the shortest homologous loop detection problem, a problem known to be NP-hard in general. Our spectral loop detection algorithm scales better than existing methods and is effective on diverse data such as point clouds and images.
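The $k=0$ case is the familiar one: the null space of the graph Laplacian has dimension equal to the number of connected components, and the null-space embedding is constant on each component, which is exactly what spectral clustering exploits. A minimal check (the graph is an arbitrary example):

```python
import numpy as np

# Graph with two connected components: a triangle {0,1,2} and an edge {3,4}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L0 = np.diag(A.sum(axis=1)) - A           # 0-th order (graph) Laplacian

eigvals, eigvecs = np.linalg.eigh(L0)
n_components = int(np.sum(eigvals < 1e-10))  # dim of the null space
V = eigvecs[:, :n_components]             # homology-0 ("null space") embedding
```

For $k \geq 1$ the analogous null space counts independent $k$-dimensional holes; the paper's contribution is to decompose that higher-order embedding into per-component subspaces the way `V` above separates connected components.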
【15】 Inference for High Dimensional Censored Quantile Regression
Authors: Zhe Fei, Qi Zheng, Hyokyoung G. Hong, Yi Li
Affiliations: Department of Biostatistics, University of California, Los Angeles; Department of Bioinformatics and Biostatistics, University of Louisville; Department of Statistics and Probability, Michigan State University
Link: https://arxiv.org/abs/2107.10959
Abstract: With the availability of high dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients' survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. To our knowledge, there is little work available on drawing inference for the effects of high dimensional predictors in censored quantile regression. This paper proposes a novel procedure to draw inference on all predictors within the framework of global censored quantile regression, which investigates covariate-response associations over an interval of quantile levels, instead of a few discrete values. The proposed estimator combines a sequence of low dimensional model estimates that are based on multi-sample splittings and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level. Simulation studies indicate that our procedure can properly quantify the uncertainty of the estimates in high dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in lung cancer pathways on patients' survival, using the Boston Lung Cancer Survival Cohort, a cancer epidemiology study on the molecular mechanism of lung cancer.
【16】 Linear Polytree Structural Equation Models: Structural Learning and Inverse Correlation Estimation
Authors: Xingmei Lou, Yu Hu, Xiaodong Li
Affiliations: Department of Statistics, University of California, Davis; Department of Mathematics and Division of Life Science, Hong Kong University of Science and Technology
Comments: 27 pages, 3 figures
Link: https://arxiv.org/abs/2107.10955
Abstract: We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Specifically, under both Gaussian and sub-Gaussian models, we study the sample size conditions for the well-known Chow-Liu algorithm to exactly recover the equivalence class of the polytree, which is uniquely represented by a CPDAG. We also study the error rate for the estimation of the inverse correlation matrix under such models. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of the method when the ground truth graphical structure can only be approximated by a polytree.
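The Chow-Liu algorithm recovers a tree skeleton as the maximum-weight spanning tree under pairwise mutual information, which for Gaussian data is a monotone function of the squared correlation. A minimal sketch on a hand-picked linear polytree (coefficients, sample size, and the plain Kruskal implementation are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
# Linear polytree SEM:  X0 -> X1 -> X2  and  X1 -> X3.
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])

R = np.corrcoef(X, rowvar=False)
# Gaussian pairwise mutual information: -0.5*log(1 - rho^2); the added
# identity keeps the diagonal at log(1) = 0.
mi = -0.5 * np.log(1 - R ** 2 + np.eye(4))

# Chow-Liu: maximum-weight spanning tree on MI weights (Kruskal).
pairs = sorted(((mi[i, j], i, j) for i in range(4) for j in range(i + 1, 4)),
               reverse=True)
parent = list(range(4))
def find(a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a
tree = []
for w, i, j in pairs:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j))
```

This recovers the undirected skeleton {0-1, 1-2, 1-3}; orienting edges up to the CPDAG equivalence class (and the paper's sample-size conditions for when this succeeds) is the harder part the paper analyzes.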
【17】 On Integral Theorems: Monte Carlo Estimators and Optimal Functions
Authors: Nhat Ho, Stephen G. Walker
Affiliations: Department of Statistics and Data Sciences, University of Texas at Austin; Department of Mathematics, University of Texas at Austin
Comments: 18 pages, 2 figures. arXiv admin note: text overlap with arXiv:2106.06608
Link: https://arxiv.org/abs/2107.10947
Abstract: We introduce a class of integral theorems based on cyclic functions and Riemann sums approximating integrals. The Fourier integral theorem, derived as a combination of a transform and inverse transform, arises as a special case. The integral theorems provide natural estimators of density functions via Monte Carlo integration. Assessments of the quality of the density estimators can be used to obtain optimal cyclic functions which minimize square integrals. Our proof techniques rely on a variational approach in ordinary differential equations and the Cauchy residue theorem in complex analysis.
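The Fourier special case gives a concrete Monte Carlo density estimator: averaging the sinc kernel $\sin(R(x - Y_i)) / (\pi (x - Y_i))$ over samples $Y_i$. A minimal sketch (the sample, truncation level $R$, and evaluation point are illustrative; the paper's general cyclic-function estimators replace the sine):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=20000)        # sample from the "unknown" density

def f_hat(x, R=3.0):
    """Monte Carlo density estimator from the Fourier integral theorem:
    f_hat(x) = (1/n) * sum_i sin(R*(x - y_i)) / (pi*(x - y_i)).
    np.sinc(z) = sin(pi*z)/(pi*z), so sin(R*u)/(pi*u) = (R/pi)*sinc(R*u/pi),
    which also handles u = 0 safely."""
    u = x - y
    return (R / np.pi) * np.mean(np.sinc(R * u / np.pi))

est = f_hat(0.0)                  # target: standard normal density at 0
```

Unlike a kernel density estimator, there is no bandwidth per se; $R$ truncates the frequency integral, and the paper's "optimal cyclic functions" question is which periodic function in place of the sine minimizes the resulting squared error.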
【18】 Estimating survival parameters under conditionally independent left truncation
Authors: Arjun Sondhi
Affiliations: Flatiron Health
Link: https://arxiv.org/abs/2107.10911
Abstract: EHR-derived databases are commonly subject to left truncation, a type of selection bias induced due to patients needing to survive long enough to satisfy certain entry criteria. Standard methods to adjust for left truncation bias rely on an assumption of marginal independence between entry and survival times, which may not always be satisfied in practice. In this work, we examine how a weaker assumption of conditional independence can result in unbiased estimation of common statistical parameters. In particular, we show the estimability of conditional parameters in a truncated dataset, and of marginal parameters that leverage reference data containing non-truncated data on confounders. The latter is complementary to observational causal inference methodology applied to real world external comparators, which is a common use case for real world databases. We implement our proposed methods in simulation studies, demonstrating unbiased estimation and valid statistical inference. We also illustrate estimation of a survival distribution under conditionally independent left truncation in a real world clinico-genomic database.
【19】 Laplace and Saddlepoint Approximations in High Dimensions 标题:高维空间中的Laplace和鞍点逼近
作者:Yanbo Tang,Nancy Reid 机构:Department of Statistical Sciences, University of Toronto, Toronto, Canada, Vector Institute, Toronto, Canada 备注:43 pages 链接:https://arxiv.org/abs/2107.10885 摘要:我们研究了拉普拉斯近似和鞍点近似在高维环境中的行为,其中模型的维数允许随观测次数的增加而增大。我们考虑了联合密度、边缘后验密度和条件密度的近似。我们的结果表明,在对模型最温和的假设下,对于拉普拉斯近似和鞍点近似,只要 $p = o(n^{1/4})$,联合密度近似的误差即为 $O(p^4/n)$,且在附加假设下还可能进一步改进。对于边缘后验密度的近似,我们得到了更强的结果。 摘要:We examine the behaviour of the Laplace and saddlepoint approximations in the high-dimensional setting, where the dimension of the model is allowed to increase with the number of observations. Approximations to the joint density, the marginal posterior density and the conditional density are considered. Our results show that under the mildest assumptions on the model, the error of the joint density approximation is $O(p^4/n)$ if $p = o(n^{1/4})$ for the Laplace approximation and saddlepoint approximation, with improvements being possible under additional assumptions. Stronger results are obtained for the approximation to the marginal posterior density.
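作为对拉普拉斯近似本身的一个最小示例(与该文的高维分析无关,仅为背景说明),对一维积分 $\int e^{-n h(\theta)}\,d\theta$ 的经典拉普拉斯近似为 $e^{-n h(\hat\theta)}\sqrt{2\pi/(n\,h''(\hat\theta))}$;当 h 为二次函数时该近似是精确的:

```python
import numpy as np

def laplace_approx(h, hess, theta_hat, n):
    """一维拉普拉斯近似:
    integral exp(-n*h(theta)) d(theta)
      ≈ exp(-n*h(theta_hat)) * sqrt(2*pi / (n * h''(theta_hat)))"""
    return np.exp(-n * h(theta_hat)) * np.sqrt(2 * np.pi / (n * hess(theta_hat)))

# 例:h(theta) = theta^2 / 2,极小点 theta_hat = 0,真值为 sqrt(2*pi/n)
n = 50
approx = laplace_approx(lambda t: t**2 / 2, lambda t: 1.0, 0.0, n)
exact = np.sqrt(2 * np.pi / n)
```

文中研究的正是当维数 p 随 n 增长时,这类近似(及其鞍点对应物)的误差如何退化。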
【20】 Structured second-order methods via natural gradient descent 标题:基于自然梯度下降的结构化二阶方法
作者:Wu Lin,Frank Nielsen,Mohammad Emtiyaz Khan,Mark Schmidt 机构:University of British Columbia, Alberta Machine Intelligence Institute 备注:ICML workshop paper. arXiv admin note: substantial text overlap with arXiv:2102.07405 链接:https://arxiv.org/abs/2107.10884 摘要:在本文中,我们提出了通过在结构化参数空间上执行自然梯度下降得到的新的结构化二阶方法和结构化自适应梯度方法。在无梯度、自适应梯度和二阶方法等多种情形下,自然梯度下降都是设计新算法的一种有吸引力的途径。我们的结构化方法不仅具有结构不变性,而且具有简单的表达式。最后,我们在确定性非凸问题和深度学习问题上验证了所提方法的有效性。 摘要:In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
【21】 A local approach to parameter space reduction for regression and classification tasks 标题:用于回归和分类任务的参数空间约简的局部方法
作者:Francesco Romor,Marco Tezzele,Gianluigi Rozza 机构:Mathematics Area, mathLab, SISSA, via Bonomea , I-, Trieste, Italy 链接:https://arxiv.org/abs/2107.10867 摘要:通常,为形状设计或其他涉及代理模型定义的应用所选择的参数空间中,存在目标函数高度正则或性质良好的子域。因此,若将研究限制在这些子域内并分别处理,就能得到更精确的近似。这种方法的缺点是某些应用中数据可能稀缺;但当数据量相对于参数空间维数和目标函数复杂度而言足够充裕时,分区或局部研究是有益的。在这项工作中,我们提出了一种称为局部活动子空间(LAS)的新方法,该方法探索了活动子空间与监督聚类技术的协同作用,以便在参数空间中进行更有效的降维以设计精确的响应面。我们还开发了一个程序来利用局部活动子空间信息进行分类任务。将该技术作为参数空间(或向量值输出情形下的输出空间)的预处理步骤,对代理模型的构建效果显著。 摘要:Frequently, the parameter space, chosen for shape design or other applications that involve the definition of a surrogate model, presents subdomains where the objective function of interest is highly regular or well behaved. So, it could be approximated more accurately if restricted to those subdomains and studied separately. The drawback of this approach is the possible scarcity of data in some applications, but in those, where a quantity of data, moderately abundant considering the parameter space dimension and the complexity of the objective function, is available, partitioned or local studies are beneficial. In this work we propose a new method called local active subspaces (LAS), which explores the synergies of active subspaces with supervised clustering techniques in order to perform a more efficient dimension reduction in the parameter space for the design of accurate response surfaces. We also developed a procedure to exploit the local active subspace information for classification tasks. Using this technique as a preprocessing step onto the parameter space, or output space in case of vectorial outputs, brings remarkable results for the purpose of surrogate modelling.
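作为背景示意(这里只演示全局活动子空间这一步;论文的 LAS 方法还需结合监督聚类做局部化),活动子空间可由梯度样本的未中心化协方差矩阵 $C=\frac{1}{M}\sum_i \nabla f(x_i)\nabla f(x_i)^{\top}$ 的主特征向量估计。以下示例中的函数 f 与采样方式均为演示所设:

```python
import numpy as np

rng = np.random.default_rng(1)

# 目标函数 f(x) = (w^T x)^2:真实的一维活动子空间由 w 张成
w = np.array([3.0, 1.0, 0.0])
grad_f = lambda x: 2 * (w @ x) * w          # f 的解析梯度

X = rng.uniform(-1, 1, size=(500, 3))       # 在参数空间中均匀采样
G = np.array([grad_f(x) for x in X])        # 梯度样本

C = G.T @ G / len(G)                        # 未中心化的梯度协方差矩阵
eigvals, eigvecs = np.linalg.eigh(C)        # 特征值升序排列
active_dir = eigvecs[:, -1]                 # 最大特征值对应的活动方向
```

恢复出的 active_dir 与 w/‖w‖ 在符号意义下一致;LAS 则在聚类得到的各子域内分别做这一步。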
【22】 Graph Pseudometrics from a Topological Point of View 标题:从拓扑角度看图的伪度量
作者:Ana Lucia Garcia-Pulido,Kathryn Hess,Jane Tan,Katharine Turner,Bei Wang,Naya Yerolemou 机构: University of Oxford 备注:27 pages, 7 figures 链接:https://arxiv.org/abs/2107.11329 摘要:为了更好地理解有向图的拓扑性质,我们研究了有向图的伪度量。与有向图相关联的有向标志复合体在网络科学和拓扑学之间提供了一个有用的桥梁。事实上,人们经常观察到,现实世界网络所呈现的现象反映了它们的标志复合体的拓扑结构,例如,通过Betti数或单纯形计数来测量。由于精确地确定这些拓扑特征的计算代价很高(甚至是不可行的),因此在有向图集上建立伪度量是非常有价值的,它既可以检测拓扑差异,又可以有效地进行计算。为了便于这方面的工作,我们引入了一些方法来度量伪度量图对有向图的拓扑结构的捕捉程度。然后,我们使用这些方法来评估一些建立良好的伪度量,使用来自几个随机图族的测试数据。 摘要:We explore pseudometrics for directed graphs in order to better understand their topological properties. The directed flag complex associated to a directed graph provides a useful bridge between network science and topology. Indeed, it has often been observed that phenomena exhibited by real-world networks reflect the topology of their flag complexes, as measured, for example, by Betti numbers or simplex counts. As it is often computationally expensive (or even unfeasible) to determine such topological features exactly, it would be extremely valuable to have pseudometrics on the set of directed graphs that can both detect the topological differences and be computed efficiently. To facilitate work in this direction, we introduce methods to measure how well a graph pseudometric captures the topology of a directed graph. We then use these methods to evaluate some well-established pseudometrics, using test data drawn from several families of random graphs.
【23】 Optimization on manifolds: A symplectic approach 标题:流形上的优化:一种辛方法
作者:Guilherme França,Alessandro Barp,Mark Girolami,Michael I. Jordan 机构:University of California, Berkeley, USA, University of Cambridge, UK 链接:https://arxiv.org/abs/2107.11231 摘要:利用动力系统和微分方程数值分析的工具来理解和构造新的优化方法已经引起了人们极大的兴趣。特别是最近出现了一种新的范式,它应用力学和几何积分的思想来获得欧氏空间上的加速优化方法。考虑到加速方法是许多机器学习应用背后的主力,这将产生重要的影响。在本文中,我们基于这些进展,提出了一个耗散和约束哈密顿系统的框架,适用于求解任意光滑流形上的优化问题。重要的是,这使我们能够利用辛积分的成熟理论导出"速率匹配"耗散积分器。这为流形上的优化问题提供了一个新的视角,即收敛性保证可由辛几何中的经典论证和后向误差分析直接构造性地得到。此外,我们构造了蛙跳法的两个易于实现的耗散推广:一个适用于李群和齐次空间,依赖于可解析处理的测地线流或其收缩映射(retraction);另一个适用于约束子流形,基于著名的 RATTLE 积分器的耗散推广。 摘要:There has been great interest in using tools from dynamical systems and numerical analysis of differential equations to understand and construct new optimization methods. In particular, recently a new paradigm has emerged that applies ideas from mechanics and geometric integration to obtain accelerated optimization methods on Euclidean spaces. This has important consequences given that accelerated methods are the workhorses behind many machine learning applications. In this paper we build upon these advances and propose a framework for dissipative and constrained Hamiltonian systems that is suitable for solving optimization problems on arbitrary smooth manifolds. Importantly, this allows us to leverage the well-established theory of symplectic integration to derive "rate-matching" dissipative integrators. This brings a new perspective to optimization on manifolds whereby convergence guarantees follow by construction from classical arguments in symplectic geometry and backward error analysis. Moreover, we construct two dissipative generalizations of leapfrog that are straightforward to implement: one for Lie groups and homogeneous spaces, that relies on the tractable geodesic flow or a retraction thereof, and the other for constrained submanifolds that is based on a dissipative generalization of the famous RATTLE integrator.
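下面是一个欧氏空间上耗散蛙跳积分器的最小草图(仅为示意;论文处理的是李群、齐次空间与约束子流形上的版本,步长与阻尼参数均为演示所设)。动量在每个半步乘以常数因子 exp(-gamma*h/2) 以耗散能量,使轨迹收敛到目标函数的极小点:

```python
import numpy as np

def dissipative_leapfrog(grad, x0, steps=200, h=0.1, gamma=1.0):
    """保守 leapfrog 的一个耗散变体(欧氏空间示意):
    每个半步将动量乘以 exp(-gamma*h/2) 以引入摩擦。"""
    x = np.array(x0, float)
    p = np.zeros_like(x)
    damp = np.exp(-gamma * h / 2)
    for _ in range(steps):
        p = damp * p - (h / 2) * grad(x)      # 半步动量更新(含耗散)
        x = x + h * p                         # 整步位置更新
        p = damp * (p - (h / 2) * grad(x))    # 半步动量更新(含耗散)
    return x

# 例:最小化二次函数 f(x) = x^T A x / 2,极小点为原点
A = np.diag([1.0, 10.0])
xmin = dissipative_leapfrog(lambda x: A @ x, [3.0, -2.0])
```

这类离散化可以看作动量方法的几何积分视角;流形版本把整步位置更新换成测地线流或收缩映射。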
【24】 Signature asymptotics, empirical processes, and optimal transport 标题:签名渐近性、经验过程和最优传输
作者:Thomas Cass,Remy Messadene 机构:Department of Mathematics, Imperial College London, Alan Turing Institute, London 备注:18 pages, 1 figure 链接:https://arxiv.org/abs/2107.11203 摘要:粗糙路径理论提供了签名这一概念:一个分级张量族,它在可忽略的等价类意义下刻画向量值数据的有序流。近年来,签名在时间序列分析、机器学习、深度学习以及最近的核方法中得到了越来越多的应用。在这篇文章中,我们为签名渐近性、经验过程理论和Wasserstein距离之间的联系奠定了理论基础,从而可将后两者的视野与工具引入对前者的研究。我们的主要贡献是证明Hambly-Lyons极限可以重新解释为关于来自同一潜在分布的样本的两个独立经验测度之间Wasserstein距离渐近行为的陈述。在这里研究的环境中,这些测度来自一个概率分布的样本,而该分布由基础路径的几何性质决定。关于这些对象收敛速度的一般问题,在Bobkov和Ledoux最近的专著中得到了深入研究。利用这些结果,我们将Hambly和Lyons的原始结果从$C^3$曲线推广到一大类$C^2$曲线。最后,我们给出了一个用二阶微分方程计算极限的显式方法。 摘要:Rough path theory provides one with the notion of signature, a graded family of tensors which characterise, up to a negligible equivalence class, an ordered stream of vector-valued data. In the last few years, use of the signature has gained traction in time-series analysis, machine learning, deep learning and more recently in kernel methods. In this article, we lay down the theoretical foundations for a connection between signature asymptotics, the theory of empirical processes, and Wasserstein distances, opening up the landscape and toolkit of the second and third in the study of the first. Our main contribution is to show that the Hambly-Lyons limit can be reinterpreted as a statement about the asymptotic behaviour of Wasserstein distances between two independent empirical measures of samples from the same underlying distribution. In the setting studied here, these measures are derived from samples from a probability distribution which is determined by geometrical properties of the underlying path. The general question of rates of convergence for these objects has been studied in depth in the recent monograph of Bobkov and Ledoux. By using these results, we generalise the original result of Hambly and Lyons from $C^3$ curves to a broad class of $C^2$ ones. We conclude by providing an explicit way to compute the limit in terms of a second-order differential equation.
【25】 Data-driven optimization of reliability using buffered failure probability 标题:基于缓冲失效概率的数据驱动可靠性优化
作者:Ji-Eun Byun,Johannes O. Royset 机构:Department of Civil, Environmental and Geomatic Engineering, University College London, Operations Research Department, Naval Postgraduate School, California, United States 备注:32 pages 链接:https://arxiv.org/abs/2107.11176 摘要:复杂工程系统的设计和运行依赖于可靠性优化。这种优化要求我们考虑以复杂的高维概率分布表示的不确定性,对于这种分布,可能只有样本或数据可用。然而,使用数据或样本往往会降低计算效率,特别是因为常规失效概率是用指示函数估计的,而指示函数的梯度在零处无定义。为了解决这一问题,利用缓冲失效概率,本文提出了缓冲优化与可靠性方法(BORM),以实现高效的、数据驱动的可靠性优化。所提出的公式、算法和策略大大提高了优化的计算效率,从而满足了高维非线性问题的需要。此外,本文还给出了估计可靠性灵敏度的解析公式,而这在使用常规失效概率时十分困难。在多种不同分布的情况下,对缓冲失效概率进行了深入研究,提出了一种新的尾重度量,称为缓冲尾指数。三个数值算例验证了所提出的优化方法的有效性和准确性,突出了缓冲失效概率在数据驱动可靠性分析中的独特优势。 摘要:Design and operation of complex engineering systems rely on reliability optimization. Such optimization requires us to account for uncertainties expressed in terms of complicated, high-dimensional probability distributions, for which only samples or data might be available. However, using data or samples often degrades the computational efficiency, particularly as the conventional failure probability is estimated using the indicator function whose gradient is not defined at zero. To address this issue, by leveraging the buffered failure probability, the paper develops the buffered optimization and reliability method (BORM) for efficient, data-driven optimization of reliability. The proposed formulations, algorithms, and strategies greatly improve the computational efficiency of the optimization and thereby address the needs of high-dimensional and nonlinear problems. In addition, an analytical formula is developed to estimate the reliability sensitivity, a subject fraught with difficulty when using the conventional failure probability. The buffered failure probability is thoroughly investigated in the context of many different distributions, leading to a novel measure of tail-heaviness called the buffered tail index.
The efficiency and accuracy of the proposed optimization methodology are demonstrated by three numerical examples, which underline the unique advantages of the buffered failure probability for data-driven reliability analysis.
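缓冲失效概率可以从样本直接估计:它近似等于使"最差(最大)的 k 个样本的均值仍不小于 0"的最大尾部比例 k/n。下面是按此思路写的一个经验计算草图(约定 g ≥ 0 为失效;分布与样本量均为演示所设),并与常规失效概率对比:

```python
import numpy as np

def buffered_failure_probability(g_samples):
    """经验缓冲失效概率(示意):
    返回最大的尾部比例 k/n,使得最大的 k 个 g 样本的均值 >= 0。"""
    s = np.sort(np.asarray(g_samples))[::-1]              # 降序排列
    tail_means = np.cumsum(s) / np.arange(1, len(s) + 1)  # 前 k 个样本的均值,随 k 递减
    k = np.searchsorted(-tail_means, 0, side='right')     # 满足 tail_means >= 0 的个数
    return k / len(s)

rng = np.random.default_rng(2)
g = rng.standard_normal(100000) - 2.0     # 性能函数样本 g = Z - 2
pof = np.mean(g >= 0)                     # 常规失效概率(指示函数均值)
bpof = buffered_failure_probability(g)    # 缓冲失效概率,总是 >= pof
```

与指示函数不同,缓冲失效概率对样本是连续、(拟)凸的函数,这正是 BORM 能高效做数据驱动优化的原因。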
【26】 Constellation: Learning relational abstractions over objects for compositional imagination 标题:Constellation:学习对象上的关系抽象以实现组合式想象
作者:James C. R. Whittington,Rishabh Kabra,Loic Matthey,Christopher P. Burgess,Alexander Lerchner 机构:University of Oxford, DeepMind, Wayve (work done at DeepMind) 链接:https://arxiv.org/abs/2107.11153 摘要:学习视觉场景的结构化表示是目前连接感知和推理的主要瓶颈。虽然基于槽位(slot)的模型在学习将场景分割为多组对象方面已取得令人兴奋的进展,但学习整组对象的配置性质仍有待探索。为了解决这个问题,我们引入了Constellation,一个学习静态视觉场景的关系抽象的网络,并将这些抽象概括为感官的特殊性,从而为抽象关系推理提供了一个潜在的基础。我们进一步证明,这个基础,连同语言联想,提供了一种以新的方式想象感官内容的方法。这项工作是迈向视觉关系的显式表示并将其用于复杂认知过程的第一步。 摘要:Learning structured representations of visual scenes is currently a major bottleneck to bridging perception with reasoning. While there has been exciting progress with slot-based models, which learn to segment scenes into sets of objects, learning configurational properties of entire groups of objects is still under-explored. To address this problem, we introduce Constellation, a network that learns relational abstractions of static visual scenes, and generalises these abstractions over sensory particularities, thus offering a potential basis for abstract relational reasoning. We further show that this basis, along with language association, provides a means to imagine sensory content in new ways. This work is a first step in the explicit representation of visual relationships and using them for complex cognitive procedures.
【27】 High Dimensional Differentially Private Stochastic Optimization with Heavy-tailed Data 标题:具有重尾数据的高维差分隐私随机优化
作者:Lijie Hu,Shuo Ni,Hanshen Xiao,Di Wang 机构:King Abdullah University of Science and Technology, Saudi Arabia, University of Southern California, United States, Massachusetts Institute of Technology 链接:https://arxiv.org/abs/2107.11136 摘要:差分隐私随机凸优化(DP-SCO)作为机器学习、统计和差分隐私等领域中最基本的问题之一,近年来得到了广泛的研究。然而,以往的工作大多只能处理规则数据分布或低维空间中的不规则数据。为了更好地理解不规则数据分布所带来的挑战,本文首次研究了高维空间中具有重尾数据的DP-SCO问题。在第一部分中,我们主要讨论一些多面体约束(如$\ell_1$-范数球)上的问题。我们证明了在$\epsilon$-DP模型中,如果损失函数是光滑的且其梯度具有有界二阶矩,则可以(以高概率)得到$\tilde{O}(\frac{\log d}{(n\epsilon)^\frac{1}{3}})$的误差界(超额总体风险),其中,$n$是样本大小,$d$是底层空间的维数。接下来,对于LASSO,如果数据分布具有有界的四阶矩,我们在$(\epsilon,\delta)$-DP模型中将该界改进为$\tilde{O}(\frac{\log d}{(n\epsilon)^\frac{2}{5}})$。第二部分研究了重尾数据下的稀疏学习。我们首先回顾稀疏线性模型,并提出一个截断的DP-IHT方法,其输出可以达到$\tilde{O}(\frac{s^{*2}\log d}{n\epsilon})$的误差,其中,$s^*$是底层参数的稀疏度。然后我们研究了稀疏性(即$\ell_0$-范数)约束下的一个更一般的问题,并证明了如果损失函数是光滑且强凸的,则可以达到$\tilde{O}(\frac{s^{*\frac{3}{2}}\log d}{n\epsilon})$的误差,该误差在$\tilde{O}(\sqrt{s^*})$因子内接近最优。 摘要:As one of the most fundamental problems in machine learning, statistics and differential privacy, Differentially Private Stochastic Convex Optimization (DP-SCO) has been extensively studied in recent years. However, most of the previous work can only handle either regular data distribution or irregular data in the low dimensional space case. To better understand the challenges arising from irregular data distribution, in this paper we provide the first study on the problem of DP-SCO with heavy-tailed data in the high dimensional space. In the first part we focus on the problem over some polytope constraint (such as the $\ell_1$-norm ball). We show that if the loss function is smooth and its gradient has bounded second order moment, it is possible to get a (high probability) error bound (excess population risk) of $\tilde{O}(\frac{\log d}{(n\epsilon)^\frac{1}{3}})$ in the $\epsilon$-DP model, where $n$ is the sample size and $d$ is the dimensionality of the underlying space. 
Next, for LASSO, if the data distribution that has bounded fourth-order moments, we improve the bound to $\tilde{O}(\frac{\log d}{(n\epsilon)^\frac{2}{5}})$ in the $(\epsilon, \delta)$-DP model. In the second part of the paper, we study sparse learning with heavy-tailed data. We first revisit the sparse linear model and propose a truncated DP-IHT method whose output could achieve an error of $\tilde{O}(\frac{s^{*2}\log d}{n\epsilon})$, where $s^*$ is the sparsity of the underlying parameter. Then we study a more general problem over the sparsity ({\em i.e.,} $\ell_0$-norm) constraint, and show that it is possible to achieve an error of $\tilde{O}(\frac{s^{*\frac{3}{2}}\log d}{n\epsilon})$, which is also near optimal up to a factor of $\tilde{O}{(\sqrt{s^*})}$, if the loss function is smooth and strongly convex.
【28】 Reference Class Selection in Similarity-Based Forecasting of Sales Growth 标题:基于相似度的销售增长预测中的参考类选择
作者:Etienne Theising,Dominik Wied,Daniel Ziggel 机构:Institute of Econometrics and Statistics, University of Cologne, Flossbach von Storch AG 链接:https://arxiv.org/abs/2107.11133 摘要:本文提出了一种为分析师的销售预测寻找合适外部视角的方法。其思想是为每个被分析的公司分别找到参考类,即对等组。也就是说,把与目标公司在特定预测变量上具有相似性的其他公司纳入考虑。如果预测的销售分布与实际分布尽可能接近,则这些类别被认为是最优的。通过对估计的概率积分变换进行拟合优度检验和比较预测分位数来衡量预测质量。该方法应用于由21808家美国公司组成的1950-2019年数据集,并进行了描述性分析。特别是,过去的营业利润率似乎是未来销售分布的良好预测指标。通过一个案例研究,将我们的预测与实际分析师的估计进行比较,强调了我们的方法在实践中的相关性。 摘要:This paper proposes a method to find appropriate outside views for sales forecasts of analysts. The idea is to find reference classes, i.e. peer groups, for each analyzed company separately. Hence, additional companies are considered that share similarities to the firm of interest with respect to a specific predictor. The classes are regarded to be optimal if the forecasted sales distributions match the actual distributions as closely as possible. The forecast quality is measured by applying goodness-of-fit tests on the estimated probability integral transformations and by comparing the predicted quantiles. The method is applied on a data set consisting of 21,808 US firms over the time period 1950 - 2019, which is also descriptively analyzed. It appears that in particular the past operating margins are good predictors for the distribution of future sales. A case study with a comparison of our forecasts with actual analysts' estimates emphasizes the relevance of our approach in practice.
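文中用概率积分变换(PIT)上的拟合优度检验来衡量预测质量:若预测分布 F 正确,则 F(Y) 应服从 Uniform(0,1)。下面用合成数据给出一个最小示意(分布参数均为演示所设):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=2.0, size=2000)   # "实际"观测

# 用正确的预测分布做 PIT:结果应近似均匀
pit = stats.norm.cdf(y, loc=1.0, scale=2.0)
ks_stat, p_value = stats.kstest(pit, 'uniform')  # KS 拟合优度检验

# 用错误的预测分布做 PIT:均匀性被强烈拒绝
pit_bad = stats.norm.cdf(y, loc=0.0, scale=1.0)
p_bad = stats.kstest(pit_bad, 'uniform').pvalue
```

文中即以这类检验(以及预测分位数的比较)来挑选使预测销售分布最接近实际分布的参考类。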
【29】 COVID-19 and the gig economy in Poland 标题:COVID-19与波兰的零工经济
作者:Maciej Beręsewicz,Dagmara Nikulin 机构:Poznań University of Economics and Business, Poland; Gdańsk University of Technology, Poland 链接:https://arxiv.org/abs/2107.11124 摘要:我们使用一个覆盖几乎所有目标人群的数据集,基于从智能手机上被动收集的数据,来衡量第一次COVID-19浪潮对波兰gig经济的影响。特别是,我们专注于交通(Uber,Bolt)和送货(Wolt,Takeaway,Glover,DeliGoo)应用,这使得我们能够区分这个市场的需求和供应部分。基于贝叶斯结构时间序列模型,我们估计了第一个COVID-19波对活跃司机和信使数量的因果影响。与反事实对照组相比,Wolt和Glover显著增加(分别为15%和24%),Uber和Bolt略有减少(分别为-3%和-7%)。Uber和Bolt的变化可以部分解释为新法律(所谓的Uber-Lex)的前景,该法律已经在2019年宣布,旨在规范平台驱动程序的工作。 摘要:We use a dataset covering nearly the entire target population based on passively collected data from smartphones to measure the impact of the first COVID-19 wave on the gig economy in Poland. In particular, we focus on transportation (Uber, Bolt) and delivery (Wolt, Takeaway, Glover, DeliGoo) apps, which make it possible to distinguish between the demand and supply part of this market. Based on Bayesian structural time-series models, we estimate the causal impact of the first COVID-19 wave on the number of active drivers and couriers. We show a significant relative increase for Wolt and Glover (15% and 24%) and a slight relative decrease for Uber and Bolt (-3% and -7%) in comparison to a counterfactual control. The change for Uber and Bolt can be partially explained by the prospect of a new law (the so-called Uber Lex), which was already announced in 2019 and is intended to regulate the work of platform drivers.
【30】 LocalGLMnet: interpretable deep learning for tabular data 标题:LocalGLMnet:表格数据的可解释深度学习
作者:Ronald Richman,Mario V. Wüthrich 链接:https://arxiv.org/abs/2107.11059 摘要:深度学习模型在统计建模中得到了广泛的应用,因为它们导致了非常有竞争力的回归模型,通常比经典的统计模型(如广义线性模型)表现更好。深度学习模型的缺点是其解很难解释和说明,变量选择也不容易,因为深度学习模型在内部以不透明的方式解决特征工程和变量选择问题。受广义线性模型吸引人的结构启发,我们提出了一种新的网络结构,该结构与广义线性模型具有相似的特性,但得益于表示学习的艺术,它提供了优越的预测能力。这种新的体系结构允许对表格数据做变量选择,并对校准好的深度学习模型进行解释;事实上,我们的方法提供了一种秉承Shapley值与积分梯度精神的加性分解。 摘要:Deep learning models have gained great popularity in statistical modeling because they lead to very competitive regression models, often outperforming classical statistical models such as generalized linear models. The disadvantage of deep learning models is that their solutions are difficult to interpret and explain, and variable selection is not easily possible because deep learning models solve feature engineering and variable selection internally in a nontransparent way. Inspired by the appealing structure of generalized linear models, we propose a new network architecture that shares similar features as generalized linear models, but provides superior predictive power benefiting from the art of representation learning. This new architecture allows for variable selection of tabular data and for interpretation of the calibrated deep learning model, in fact, our approach provides an additive decomposition in the spirit of Shapley values and integrated gradients.
【31】 Implicit Rate-Constrained Optimization of Non-decomposable Objectives 标题:不可分解目标的隐式速率约束优化
作者:Abhishek Kumar,Harikrishna Narasimhan,Andrew Cotter 备注:ICML 2021 链接:https://arxiv.org/abs/2107.10960 摘要:我们考虑机器学习中出现的一类常见约束优化问题:在约束另一感兴趣度量的同时,优化具有某种阈值形式的不可分解评价指标。这类问题的例子包括在固定假阳性率下优化假阴性率、在固定召回率下优化精确率,以及优化精确率-召回率曲线或ROC曲线下的面积等。我们的核心思想是构建一个速率约束优化模型,借助隐函数定理将阈值参数表示为模型参数的函数。我们展示了如何用标准的基于梯度的方法求解由此产生的优化问题。在基准数据集上的实验证明了所提方法相对于现有最新方法的有效性。 摘要:We consider a popular family of constrained optimization problems arising in machine learning that involve optimizing a non-decomposable evaluation metric with a certain thresholded form, while constraining another metric of interest. Examples of such problems include optimizing the false negative rate at a fixed false positive rate, optimizing precision at a fixed recall, optimizing the area under the precision-recall or ROC curves, etc. Our key idea is to formulate a rate-constrained optimization that expresses the threshold parameter as a function of the model parameters via the Implicit Function theorem. We show how the resulting optimization problem can be solved using standard gradient based methods. Experiments on benchmark datasets demonstrate the effectiveness of our proposed method over existing state-of-the-art approaches for these problems.
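下面用合成得分演示"固定假阳性率下的阈值"这一前向计算:阈值被写成负类得分的分位数,即模型输出的函数(论文进一步用隐函数定理对该映射求导以做梯度优化;此处仅示意,数据与参数均为演示所设):

```python
import numpy as np

def fnr_at_fixed_fpr(pos_scores, neg_scores, target_fpr=0.05):
    """阈值取为负类得分的 (1 - target_fpr) 分位数,
    从而阈值是得分(进而是模型参数)的函数;返回该阈值下的假阴性率。"""
    thr = np.quantile(neg_scores, 1 - target_fpr)
    fnr = np.mean(pos_scores < thr)
    return fnr, thr

rng = np.random.default_rng(4)
neg = rng.normal(0.0, 1.0, 5000)   # 负类得分 ~ N(0, 1)
pos = rng.normal(2.0, 1.0, 5000)   # 正类得分 ~ N(2, 1)
fnr, thr = fnr_at_fixed_fpr(pos, neg, target_fpr=0.05)
```

这里阈值约为负类得分的 0.95 分位数(对标准正态约 1.645);论文的贡献在于把这个阈值当作模型参数的隐函数,使整个指标可以端到端地用梯度优化。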
【32】 Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions 标题:利用UMAP检查音频数据,实现域偏移条件下的无监督异常检测
作者:Andres Fernandez,Mark D. Plumbley 机构:Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK 备注:Submitted for publication 链接:https://arxiv.org/abs/2107.10880 摘要:无监督异常检测(UAD)的目标是在只有非异常(正常)数据的情况下检测异常信号。在域偏移条件下的UAD(UAD-S)中,数据还会受到通常事先未知的上下文变化的影响。受2021年版《声场景和事件的检测与分类》(DCASE)挑战赛上UAD-S任务中遇到的困难的启发,我们对DCASE UAD-S数据集的对数STFT、对数mel以及预训练Look, Listen and Learn (L3)表示的均匀流形近似与投影(UMAP)进行了目视检查。在探索性研究中,我们关注可分离性(SEP)和判别支持(DSUP)这两种性质,并提出了若干有助于诊断以及发展后续表示与检测方法的假设。特别地,我们假设输入长度和预训练可能调节SEP与DSUP之间的一种重要权衡。我们的代码以及生成的UMAP图均已公开。 摘要:The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximations and Projections (UMAPs) for log-STFT, log-mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our exploratory investigation, we look for two qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate diagnosis and development of further representation and detection approaches. Particularly, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available.
【33】 Filament Plots for Data Visualization 标题:用于数据可视化的丝状图
作者:Nate Strawn 机构:Department of Mathematics and Statistics, Georgetown University, Washington, D.C., USA 备注:33 pages, 13 figures 链接:https://arxiv.org/abs/2107.10869 摘要:我们考虑由Frenet-Serret方程生成、并由最优光滑的2D Andrews图诱导的曲线,构造了Andrews图的一种计算成本低廉的3D扩展。我们考虑从欧氏数据空间到2D曲线无穷维空间的线性等距,并参数化那些在给定数据集上(平均而言)产生最优光滑曲线的线性等距。这组最优等距具有多个自由度,并且(利用最近关于广义高斯和的结果)我们确定了其中一个特殊成员,它具有渐近投影"巡游"(tour)性质。最后,我们考虑由这些2D Andrews图诱导的单位长度3D曲线(细丝),其中线性等距性质将距离保留为"相对总平方曲率"。文末在多个数据集上展示了细丝图。代码位于https://github.com/n8epi/filaments 摘要:We construct a computationally inexpensive 3D extension of Andrew's plots by considering curves generated by Frenet-Serret equations and induced by optimally smooth 2D Andrew's plots. We consider linear isometries from a Euclidean data space to infinite dimensional spaces of 2D curves, and parametrize the linear isometries that produce (on average) optimally smooth curves over a given dataset. This set of optimal isometries admits many degrees of freedom, and (using recent results on generalized Gauss sums) we identify a particular member of this set which admits an asymptotic projective "tour" property. Finally, we consider the unit-length 3D curves (filaments) induced by these 2D Andrew's plots, where the linear isometry property preserves distances as "relative total square curvatures". This work concludes by illustrating filament plots for several datasets. Code is available at https://github.com/n8epi/filaments
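作为背景,经典的2D Andrews图把每个数据点映射为一条三角函数级数曲线;文中的细丝图是在其最优光滑版本之上的3D扩展。下面是经典Andrews曲线的一个最小实现(非论文的最优等距构造):

```python
import numpy as np

def andrews_curve(x, t):
    """经典 Andrews 曲线:
    f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    out = np.full_like(np.asarray(t, float), x[0] / np.sqrt(2))
    for k, xi in enumerate(x[1:]):
        harmonic = k // 2 + 1                              # 第几个谐波
        out += xi * (np.sin(harmonic * t) if k % 2 == 0 else np.cos(harmonic * t))
    return out

t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([1.0, 2.0, 3.0], t)   # 一个三维数据点对应的曲线
```

在 t = 0 处,正弦项消失,曲线值为 x1/√2 + x3 + x5 + …;论文的构造进一步要求数据空间到曲线空间的映射是线性等距并使曲线(平均)最光滑。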