Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favorites, posting, and more!
cs.CV: 101 papers today
Transformer (2 papers)
【1】 Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue
Authors: Shoya Matsumori, Kosuke Shingyouchi, Yuki Abe, Yosuke Fukuchi, Komei Sugiura, Michita Imai
Affiliation: Keio University
Link: https://arxiv.org/abs/2106.15550
Abstract: Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems. In particular, goal-oriented visual dialogue, where the aim of the agent is to seek information by asking questions during a turn-taking dialogue, has been gaining scholarly attention recently. While several existing models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based questions or absolute spatial questions. This might be problematic for complex scenes where the objects share attributes or in cases where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions. We train our model with two variants of CLEVR Ask datasets. The results of the quantitative and qualitative evaluations show that UniQer outperforms the baseline.
【2】 Multi-Exit Vision Transformer for Dynamic Inference
Authors: Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis
Affiliation: DIGIT, Department of Electrical and Computer Engineering, Aarhus University, Denmark
Link: https://arxiv.org/abs/2106.15183
Abstract: Deep neural networks can be converted to multi-exit architectures by inserting early exit branches after some of their intermediate layers. This allows their inference process to become dynamic, which is useful for time-critical IoT applications with stringent latency requirements but time-variant communication and computation resources, in particular in edge computing systems and IoT networks where the exact computation time budget is variable and not known beforehand. Vision Transformer is a recently proposed architecture which has since found many applications across various domains of computer vision. In this work, we propose seven different architectures for early exit branches that can be used for dynamic inference in Vision Transformer backbones. Through extensive experiments involving both classification and regression problems, we show that each one of our proposed architectures could prove useful in the trade-off between accuracy and speed.
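As a rough illustration of the early-exit idea, here is a minimal PyTorch sketch (not the authors' code) of attaching classifier heads to intermediate layers of a ViT-style backbone; the exit positions, head design, and confidence threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiExitViT(nn.Module):
    """A ViT-style backbone with early-exit classifier heads on intermediate layers."""
    def __init__(self, encoder_layers, embed_dim=768, num_classes=10, exit_after=(3, 6, 9)):
        super().__init__()
        self.layers = nn.ModuleList(encoder_layers)
        # One lightweight early-exit head per chosen intermediate layer.
        self.exits = nn.ModuleDict({str(i): nn.Linear(embed_dim, num_classes)
                                    for i in exit_after})
        self.final_head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens, threshold=0.9):
        for i, layer in enumerate(self.layers):
            tokens = layer(tokens)
            if str(i) in self.exits:
                logits = self.exits[str(i)](tokens[:, 0])  # classify the [CLS] token
                conf = logits.softmax(dim=-1).max(dim=-1).values
                if bool((conf > threshold).all()):  # stop early if confident
                    return logits
        return self.final_head(tokens[:, 0])
```

Easy inputs exit at shallow layers while hard ones run the full depth, which is what makes the latency adaptive to the available time budget.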
Detection (14 papers)
【1】 Hate speech detection using static BERT embeddings
Authors: Gaurav Rajput, Narinder Singh Punn, Sanjay Kumar Sonbhadra, Sonali Agarwal
Affiliation: Indian Institute of Information Technology, Allahabad, Uttar Pradesh, India
Link: https://arxiv.org/abs/2106.15537
Abstract: With the increasing popularity of social media platforms, hate speech is emerging as a major concern: abusive speech that targets specific group characteristics, such as gender, religion, or ethnicity, to spread violence. Earlier, people used to deliver hate speech verbally, but now, with the expansion of technology, some people deliberately use social media platforms to spread hate by posting, sharing, commenting, etc. Whether it is the Christchurch mosque shootings or hate crimes against Asians in the West, it has been observed that the convicts are heavily influenced by hateful text present online. Even though AI systems are in place to flag such text, one of the key challenges is to reduce the false positive rate (marking non-hate as hate) so that these systems can detect hate speech without undermining freedom of expression. In this paper, we use the ETHOS hate speech detection dataset and analyze the performance of hate speech detection classifiers when replacing or integrating word embeddings (fastText (FT), GloVe (GV), or FT + GV) with static BERT embeddings (BE). Extensive experimental trials show that the neural network performs better with static BE compared to using FT, GV, or FT + GV as word embeddings. In comparison to fine-tuned BERT, one metric that significantly improves is specificity.
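For context, static BERT embeddings can be built by reading fixed per-token vectors out of BERT rather than running contextual inference; below is a minimal sketch using the HuggingFace transformers library (one common construction, not necessarily the paper's exact recipe):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def static_embedding(word: str) -> torch.Tensor:
    """A fixed vector per word, taken from BERT's input embedding table."""
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    table = model.get_input_embeddings().weight  # (vocab_size, 768)
    # Average sub-word vectors so multi-piece words still map to one vector.
    return table[ids].mean(dim=0).detach()

vec = static_embedding("hate")  # a fixed 768-d vector, independent of context
print(vec.shape)                # torch.Size([768])
```

Such vectors can then replace fastText or GloVe in an otherwise unchanged classifier, which is the comparison the abstract describes.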
【2】 Fast and Accurate Road Crack Detection Based on Adaptive Cost-Sensitive Loss Function
Authors: Kai Li, Bo Wang, Yingjie Tian, Zhiquan Qi
Link: https://arxiv.org/abs/2106.15510
Abstract: Numerous detection problems in computer vision, including road crack detection, suffer from an exceedingly severe foreground-background imbalance. Fortunately, modification of the loss function appears to solve this puzzle once and for all. In this paper, we propose a pixel-based adaptive weighted cross-entropy loss in conjunction with the Jaccard distance to facilitate high-quality pixel-level road crack detection. Our work profoundly demonstrates the influence of loss functions on detection outcomes and sheds light on the sophisticated consecutive improvements in the realm of crack detection. Specifically, to verify the effectiveness of the proposed loss, we conduct extensive experiments on four public databases, i.e., CrackForest, AigleRN, Crack360, and BJN260. Compared with the vanilla weighted cross-entropy, the proposed loss significantly speeds up the training process while retaining the test accuracy.
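A minimal sketch of the general recipe described here, i.e., a foreground-weighted cross-entropy combined with a soft Jaccard distance; the adaptive weighting scheme below is an illustrative assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_ce_jaccard_loss(logits, target, eps=1e-6, jaccard_weight=0.5):
    """logits, target: (B, H, W); target holds {0,1} crack masks."""
    # Adaptive positive weight: the rarer the foreground, the larger the weight.
    pos_frac = target.float().mean().clamp(min=eps)
    pos_weight = (1.0 - pos_frac) / pos_frac
    ce = F.binary_cross_entropy_with_logits(
        logits, target.float(), pos_weight=pos_weight)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    union = prob.sum() + target.sum() - inter
    jaccard = 1.0 - (inter + eps) / (union + eps)  # soft Jaccard distance
    return ce + jaccard_weight * jaccard
```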
【3】 Detecting Cattle and Elk in the Wild from Space
Authors: Caleb Robinson, Anthony Ortiz, Lacey Hughey, Jared A. Stabach, Juan M. Lavista Ferres
Affiliation: Smithsonian Conservation Biology Institute
Note: Presented at the KDD 2021 Fragile Earth Workshop
Link: https://arxiv.org/abs/2106.15448
Abstract: Localizing and counting large ungulates -- hoofed mammals like cows and elk -- in very high-resolution satellite imagery is an important task for supporting ecological studies. Prior work has shown that this is feasible with deep learning based methods and sub-meter multi-spectral satellite imagery. We extend this line of work by proposing a baseline method, CowNet, that simultaneously estimates the number of animals in an image (counts) and predicts their location at a pixel level (localizes). We also propose a methodology for evaluating such models on counting and localization tasks across large scenes that takes into account the uncertainty of noisy labels and the information needed by stakeholders in ecological monitoring tasks. Finally, we benchmark our baseline method against state-of-the-art vision methods for counting objects in scenes. We specifically test the temporal generalization of the resulting models over a large landscape in Point Reyes Seashore, CA. We find that the LC-FCN model performs the best and achieves an average precision between 0.56 and 0.61 and an average recall between 0.78 and 0.92 over three held-out test scenes.
【4】 TANet++: Triple Attention Network with Filtered Pointcloud on 3D Detection
Authors: Cong Ma
Affiliation: Peking University, Beijing, China
Note: 3 pages, 2 figures
Link: https://arxiv.org/abs/2106.15366
Abstract: TANet is one of the state-of-the-art 3D object detection methods on the KITTI and JRDB benchmarks; the network contains a Triple Attention module and a Coarse-to-Fine Regression module to improve the robustness and accuracy of 3D detection. However, the raw input data (point clouds) contain a lot of noise introduced during data collection, which further affects the training of the model. For example, when an object is far from the robot, the sensor can hardly obtain enough points for it. If objects containing only a few points are fed into the model together with normal samples during training, the detector will struggle to distinguish whether an individual with few points belongs to an object or to the background. In this paper, we propose TANet++ to improve performance on 3D detection, which adopts a novel strategy for training TANet. To reduce the negative impact of such weak samples, the training strategy first filters the training data, and then TANet++ is trained on the remaining data. The experimental results show that the AP score of TANet++ is 8.98% higher than that of TANet on the JRDB benchmark.
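The filtering step could look roughly like the sketch below, which drops ground-truth boxes supported by too few LiDAR points; the axis-aligned box check, data layout, and threshold are illustrative assumptions, not TANet++ specifics:

```python
import numpy as np

def points_in_box(points, center, size):
    """Axis-aligned containment count: points (N, 3), center (3,), size (3,)."""
    lo, hi = center - size / 2.0, center + size / 2.0
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return int(mask.sum())

def filter_weak_boxes(points, boxes, min_points=10):
    """Keep only ground-truth boxes with enough supporting points.
    boxes: list of (center, size) numpy-array pairs."""
    return [b for b in boxes if points_in_box(points, *b) >= min_points]
```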
【5】 TNCR: Table Net Detection and Classification Dataset
Authors: Abdelrahman Abdallah, Alexander Berendeyev, Islam Nuradin, Daniyar Nurseitov
Affiliations: Department of Machine Learning & Data Science, Satbayev University, Almaty, Kazakhstan; National Open Research Laboratory for Information and Space Technologies, Satbayev University
Link: https://arxiv.org/abs/2106.15322
Abstract: We present TNCR, a new table dataset with varying image quality collected from free websites. The TNCR dataset can be used for table detection in scanned document images and their classification into 5 different classes. TNCR contains 9428 high-quality labeled images. In this paper, we have implemented state-of-the-art deep learning-based methods for table detection to create several strong baselines. Cascade Mask R-CNN with a ResNeXt-101-64x4d backbone network achieves the highest performance compared to other methods, with a precision of 79.7%, recall of 89.8%, and F1 score of 84.4% on the TNCR dataset. We have made TNCR open source in the hope of encouraging more deep learning approaches to table detection, classification, and structure recognition. The dataset and trained model checkpoints are available at https://github.com/abdoelsayed2016/TNCR_Dataset.
【6】 On-board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery
Authors: Maria Pia Del Rosso, Alessandro Sebastianelli, Dario Spiller, Pierre Philippe Mathieu, Silvia Liberata Ullo
Note: This is an ongoing work to be revised and submitted to a journal
Link: https://arxiv.org/abs/2106.15281
Abstract: In recent years, the growth of Machine Learning algorithms in a variety of different applications has raised numerous studies on the applicability of these algorithms in real scenarios. Among all, one of the hardest scenarios, due to its physical requirements, is the aerospace one. In this context, the authors of this work aim to propose a first prototype and a feasibility study for an AI model to be 'loaded' on board. As a case study, the authors decided to investigate the detection of volcanic eruptions as a method to swiftly produce alerts. Two Convolutional Neural Networks have been proposed and created, also showing how to correctly implement them on real hardware and how the complexity of a CNN can be adapted to fit computational requirements.
【7】 SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection
Authors: Ranyu Ning, Can Zhang, Yuexian Zou
Affiliations: ADSPLAB, School of ECE, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
Note: Accepted by ICASSP 2021
Link: https://arxiv.org/abs/2106.15258
Abstract: Temporal action detection (TAD) is a challenging task which aims to temporally localize and recognize human actions in untrimmed videos. Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale of action instances are set by designers. Obviously, such anchor-based TAD methods limit the generalization capability and lead to performance degradation when videos contain rich action variation. In this study, we explore removing the requirement of pre-defined anchors for TAD methods. A novel TAD model termed Selective Receptive Field Network (SRF-Net) is developed, in which the location offsets and classification scores at each temporal location are directly estimated in the feature map, and SRF-Net is trained in an end-to-end manner. Innovatively, a building block called Selective Receptive Field Convolution (SRFC) is designed, which is able to adaptively adjust its receptive field size according to multiple scales of input information at each temporal location in the feature map. Extensive experiments are conducted on the THUMOS14 dataset, and superior results are reported compared to state-of-the-art TAD approaches.
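For intuition, a block in the spirit of selective-kernel convolutions, mixing two temporal receptive fields with weights predicted per temporal location, might look like this minimal sketch (an illustration of the idea only, not the paper's SRFC implementation):

```python
import torch
import torch.nn as nn

class SelectiveReceptiveField1d(nn.Module):
    """Mixes a small and a large temporal receptive field with learned,
    per-location branch weights (selective-kernel style)."""
    def __init__(self, channels):
        super().__init__()
        self.small = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.large = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        # Predict two branch weights at every temporal position.
        self.gate = nn.Sequential(nn.Conv1d(channels, 2, kernel_size=1),
                                  nn.Softmax(dim=1))

    def forward(self, x):              # x: (B, C, T)
        s, l = self.small(x), self.large(x)
        w = self.gate(s + l)           # (B, 2, T), sums to 1 over the branches
        return w[:, 0:1] * s + w[:, 1:2] * l
```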
【8】 Spatio-Temporal Context for Action Detection
Authors: Manuel Sarmiento Calderó, David Varas, Elisenda Bou-Balust
Affiliation: Apple
Note: Computer Vision and Pattern Recognition Workshop
Link: https://arxiv.org/abs/2106.15171
Abstract: Research in action detection has grown in recent years, as it plays a key role in video understanding. Modelling the interactions (either spatial or temporal) between actors and their context has proven to be essential for this task. While recent works use spatial features with aggregated temporal information, this work proposes to use non-aggregated temporal information. This is done by adding an attention-based method that leverages spatio-temporal interactions between elements in the scene along the clip. The main contribution of this work is the introduction of two cross attention blocks to effectively model the spatial relations and capture short-range temporal interactions. Experiments on the AVA dataset show the advantages of the proposed approach, which models spatio-temporal relations between relevant elements in the scene, outperforming other methods that model actor interactions with their context by +0.31 mAP.
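A generic cross-attention block of the kind described, with actor features attending to context features, can be sketched in a few lines (dimensions and normalization placement are illustrative assumptions, not the paper's exact block):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Actor tokens (queries) attend over context tokens (keys/values)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, actors, context):
        # actors: (B, Na, D) queries; context: (B, Nc, D) keys/values.
        out, _ = self.attn(actors, context, context)
        return self.norm(actors + out)  # residual connection + layer norm
```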
【9】 Do Not Deceive Your Employer with a Virtual Background: A Video Conferencing Manipulation-Detection System
Authors: Mauro Conti, Simone Milani, Ehsan Nowroozi, Gabriele Orazi
Affiliations: Department of Mathematics, University of Padua, Italy; Department of Information Engineering, University of Padua, Italy
Note: 6 pages
Link: https://arxiv.org/abs/2106.15130
Abstract: Last-generation video conferencing software allows users to utilize a virtual background to conceal their personal environment due to privacy concerns, especially in official meetings with other employers. On the other hand, users may want to fool people in a meeting by using a virtual background to conceal where they are. In this case, developing tools to detect virtual backgrounds used to fool people in meetings plays an important role. Besides, such detectors must prove robust against different kinds of attacks, since a malicious user can fool the detector by applying a set of adversarial editing steps on the video to conceal any revealing footprint. In this paper, we study the feasibility of an efficient tool to detect whether a video-conferencing user's background is real. In particular, we provide the first tool which computes pixel co-occurrence matrices and uses them to search for inconsistencies among spectral and spatial bands. Our experiments confirm that cross co-occurrence matrices improve the robustness of the detector against different kinds of attacks. The performance of this work is especially noteworthy with regard to color SPAM features, and with regard to robustness against post-processing such as geometric transformations, filtering, contrast enhancement, and JPEG compression with different quality factors.
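For reference, a pixel co-occurrence matrix between two bands counts how often quantized values co-occur at a fixed spatial offset; below is a minimal sketch (bin count and offset are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def cooccurrence(a, b, bins=8, offset=(0, 1)):
    """a, b: (H, W) uint8 channels. Counts co-occurrences of quantized
    values between each pixel of `a` and its offset neighbor in `b`."""
    qa, qb = a // (256 // bins), b // (256 // bins)  # quantize to `bins` levels
    dy, dx = offset
    h, w = qa.shape
    src = qa[: h - dy, : w - dx].ravel()
    dst = qb[dy:, dx:].ravel()
    mat = np.zeros((bins, bins), dtype=np.int64)
    np.add.at(mat, (src, dst), 1)  # histogram of (value, neighbor-value) pairs
    return mat
```

Passing two different color or spectral channels as `a` and `b` yields the cross co-occurrence statistics that the detector inspects for inconsistencies.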
【10】 EVPropNet: Detecting Drones By Finding Propellers For Mid-Air Landing And Following
Authors: Nitin J. Sanket, Chahat Deep Singh, Chethan M. Parameshwara, Cornelia Fermüller, Guido C. H. E. de Croon, Yiannis Aloimonos
Note: 11 pages, 10 figures, 6 tables. Accepted at Robotics: Science and Systems (RSS) 2021
Link: https://arxiv.org/abs/2106.15045
Abstract: The rapid rise in accessibility of unmanned aerial vehicles or drones poses a threat to general security and confidentiality. Most of the commercially available or custom-built drones are multi-rotors and are comprised of multiple propellers. Since these propellers rotate at a high speed, they are generally the fastest moving parts of an image and cannot be directly "seen" by a classical camera without severe motion blur. We utilize a class of sensors that are particularly suitable for such scenarios called event cameras, which have a high temporal resolution, low latency, and high dynamic range. In this paper, we model the geometry of a propeller and use it to generate simulated events which are used to train a deep neural network called EVPropNet to detect propellers from the data of an event camera. EVPropNet directly transfers to the real world without any fine-tuning or retraining. We present two applications of our network: (a) tracking and following an unmarked drone and (b) landing on a near-hover drone. We successfully evaluate and demonstrate the proposed approach in many real-world experiments with different propeller shapes and sizes. Our network can detect propellers at a rate of 85.1% even when 60% of the propeller is occluded and can run at up to 35 Hz on a 2W power budget. To our knowledge, this is the first deep learning-based solution for detecting propellers (to detect drones). Finally, our applications also show impressive success rates of 92% and 90% for the tracking and landing tasks respectively.
【11】 An Uncertainty Estimation Framework for Probabilistic Object Detection
Authors: Zongyao Lyu, Nolan B. Gutierrez, William J. Beksi
Affiliation: University of Texas at Arlington
Note: To be published in the 2021 International Conference on Automation Science and Engineering (CASE)
Link: https://arxiv.org/abs/2106.15007
Abstract: In this paper, we introduce a new technique that combines two popular methods to estimate uncertainty in object detection. Quantifying uncertainty is critical in real-world robotic applications. Traditional detection models can be ambiguous even when they provide a high-probability output. Robot actions based on high-confidence, yet unreliable, predictions may result in serious repercussions. Our framework employs deep ensembles and Monte Carlo dropout for approximating predictive uncertainty, and it improves upon the uncertainty estimation quality of the baseline method. The proposed approach is evaluated on publicly available synthetic image datasets captured from sequences of video.
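The combination can be sketched as follows: run every ensemble member several times with dropout left active and aggregate the predictions (a minimal classification-style sketch under stated assumptions; the paper's detection-specific machinery is omitted):

```python
import torch

def predictive_distribution(models, x, mc_samples=10):
    """models: list of trained nn.Modules containing dropout layers; x: input batch.
    Returns mean class probabilities and predictive entropy."""
    probs = []
    for m in models:
        m.train()  # keep dropout active at inference time (MC dropout)
        with torch.no_grad():
            for _ in range(mc_samples):
                probs.append(m(x).softmax(dim=-1))
    p = torch.stack(probs).mean(dim=0)                     # ensemble + MC mean
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)  # uncertainty proxy
    return p, entropy
```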
【12】 Object Detection Based Handwriting Localization
Authors: Yuli Wu, Yucheng Hu, Suting Miao
Affiliations: Rheinisch-Westfälische Technische Hochschule Aachen, Germany; Nanjing Normal University, China; SAP Innovation Center Network (ICN) Nanjing, China
Note: ICDAR 2021 Workshop: Industrial Applications of Document Analysis and Recognition
Link: https://arxiv.org/abs/2106.14989
Abstract: We present an object detection based approach to localize handwritten regions from documents, which initially aims to enhance anonymization during data transmission. The concatenated fusion of original and preprocessed images containing both printed texts and handwritten notes or signatures is fed into a convolutional neural network, where bounding boxes are learned to detect the handwriting. Afterwards, the handwritten regions can be processed (e.g., replaced with redacted signatures) to conceal personally identifiable information (PII). This processing pipeline, based on the deep learning network Cascade R-CNN, works at 10 fps on a GPU during inference, which ensures enhanced anonymization with minimal computational overhead. Furthermore, impressive generalizability has been empirically showcased: the trained model, based on an English-dominant dataset, works well on fictitious unseen invoices, even in Chinese. The proposed approach is also expected to facilitate other tasks such as handwriting recognition and signature verification.
【13】 Achieving Real-Time Object Detection on Mobile Devices with Neural Pruning Search
Authors: Pu Zhao, Wei Niu, Geng Yuan, Yuxuan Cai, Bin Ren, Yanzhi Wang, Xue Lin
Affiliations: Northeastern University, Boston, MA; William & Mary, Williamsburg, VA
Note: Presented at the HiPEAC 2021 workshop (CogArch 2021)
Link: https://arxiv.org/abs/2106.14943
Abstract: Object detection plays an important role in self-driving cars for security development. However, mobile systems on self-driving cars with limited computation resources lead to difficulties for object detection. To facilitate this, we propose a compiler-aware neural pruning search framework to achieve high-speed inference on autonomous vehicles for 2D and 3D object detection. The framework automatically searches the pruning scheme and rate for each layer to find a best-suited pruning for optimizing detection accuracy and speed performance under compiler optimization. Our experiments demonstrate that, for the first time, the proposed method achieves (close-to) real-time, 55 ms and 99 ms inference times for YOLOv4-based 2D object detection and PointPillars-based 3D detection, respectively, on an off-the-shelf mobile phone with minor (or no) accuracy loss.
【14】 Cosmic-CoNN: A Cosmic Ray Detection Deep-Learning Framework, Dataset, and Toolkit
Authors: Chengyuan Xu, Curtis McCully, Boning Dong, D. Andrew Howell, Pradeep Sen
Affiliations: Media Arts and Technology, University of California, Santa Barbara, CA, USA; Department of Computer Science, University of California, Santa Barbara, CA, USA; Las Cumbres Observatory, Goleta, CA, USA
Note: 18 pages, 13 figures, 3 tables. Submitted to AAS Journals. See this https URL for the open-source software and this https URL for the dataset
Link: https://arxiv.org/abs/2106.14922
Abstract: Rejecting cosmic rays (CRs) is essential for the scientific interpretation of CCD-captured data, but detecting CRs in single-exposure images has remained challenging. Conventional CR-detection algorithms require tuning multiple parameters experimentally, making it hard to automate across different instruments or observation requests. Recent work using deep learning to train CR-detection models has demonstrated promising results. However, instrument-specific models suffer from performance loss on images from ground-based facilities not included in the training data. In this work, we present Cosmic-CoNN, a deep-learning framework designed to produce generic CR-detection models. We build a large, diverse ground-based CR dataset leveraging thousands of images from the Las Cumbres Observatory global telescope network to produce a generic CR-detection model which achieves a 99.91% true-positive detection rate and maintains over 96.40% true-positive rates on unseen data from Gemini GMOS-N/S, with a false-positive rate of 0.01%. Apart from the open-source framework and dataset, we also build a suite of tools including console commands, a web-based application, and Python APIs to make automatic, robust CR detection widely accessible to the community of astronomers.
Classification | Recognition (12 papers)
【1】 A Systematic Evaluation of Domain Adaptation in Facial Expression Recognition
Authors: Yan San Kong, Varsha Suresh, Jonathan Soh, Desmond C. Ong
Affiliations: Department of Information Systems and Analytics, National University of Singapore; Department of Computer Science, National University of Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
Link: https://arxiv.org/abs/2106.15453
Abstract: Facial expression recognition is a commercially important application, but one common limitation is that applications often need to make predictions on out-of-sample distributions, where target images may have very different properties from the images the model was trained on. How well, or badly, do these models do on unseen target domains? In this paper, we provide a systematic evaluation of domain adaptation in facial expression recognition. Using state-of-the-art transfer learning techniques and six commonly-used facial expression datasets (three collected in the lab and three "in-the-wild"), we conduct extensive round-robin experiments to examine the classification accuracies for a state-of-the-art CNN model. We also perform multi-source experiments where we examine a model's ability to transfer from multiple source datasets, including (i) within-setting (e.g., lab to lab), (ii) cross-setting (e.g., in-the-wild to lab), and (iii) mixed-setting (e.g., lab and wild to lab) transfer learning experiments. We find sobering results: the accuracy of transfer learning is not high, and varies idiosyncratically with the target dataset, and to a lesser extent with the source dataset. Generally, the best settings for transfer include fine-tuning the weights of a pre-trained model, and we find that training with more datasets, regardless of setting, improves transfer performance. We end with a discussion of the need for more, and regular, systematic investigations into the generalizability of FER models, especially for deployed applications.
【2】 Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks
Authors: Muhammad Usman Yaseen, Ashiq Anjum, Giancarlo Fortino, Antonio Liotta, Amir Hussain
Affiliations: Department of Computer Science, COMSATS University, Pakistan; Department of Informatics, University of Leicester, UK; Department of Informatics, University of Calabria, Italy; Free University of Bozen-Bolzano, Italy
Link: https://arxiv.org/abs/2106.15329
Abstract: Object recognition from live video streams comes with numerous challenges, such as variation in illumination conditions and poses. Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition. Yet, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets. To address this problem, we propose a new CNN method based on orientation fusion for visual object recognition. The proposed cloud-based video analytics system pioneers the use of bi-dimensional empirical mode decomposition to split a video frame into intrinsic mode functions (IMFs). We further propose applying the Riesz transform to these IMFs to produce monogenic object components, which are in turn used for the training of CNNs. Past works have demonstrated how the object orientation component may be used to pursue accuracy levels as high as 93%. Herein we demonstrate how a feature-fusion strategy over the orientation components leads to further improving visual recognition accuracy to 97%. We also assess the scalability of our method, looking at both the number and the size of the video streams under scrutiny. We carry out extensive experimentation on the publicly available Yale dataset, including also a self-generated video dataset, finding significant improvements (both in accuracy and scale) in comparison to AlexNet, LeNet, and SE-ResNeXt, which are the three most commonly used deep learning models for visual object recognition and classification.
【3】 Effective Evaluation of Deep Active Learning on Image Classification Tasks
Authors: Nathan Beck, Durga Sivasubramanian, Apurva Dani, Ganesh Ramakrishnan, Rishabh Iyer
Affiliations: The University of Texas at Dallas; Indian Institute of Technology, Bombay; AIFY Innovation Labs
Note: 9 pages, 6 figures, and 3 tables in the main paper; 23 pages, 15 figures, and 14 tables in total. Submitted to and currently under review for NeurIPS 2021
Link: https://arxiv.org/abs/2106.15324
Abstract: With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a state-of-the-art approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying amounts of redundancy and numbers of examples per class. Finally, we provide several insights for AL practitioners to consider in future work, such as the effect of the AL batch size, the effect of initialization, the importance of retraining a new model at every round, and other insights.
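As a reference point, the uncertainty-sampling baseline mentioned above selects, each round, the pool examples the current model is least confident about; a minimal sketch follows (least-confidence scoring; names and batch size are illustrative):

```python
import torch

def select_batch(model, pool_loader, batch_size=100):
    """Score the unlabeled pool by 1 - max softmax probability; pick the top-k."""
    model.eval()
    scores, indices = [], []
    with torch.no_grad():
        for idx, x in pool_loader:           # idx: positions in the unlabeled pool
            conf = model(x).softmax(dim=-1).max(dim=-1).values
            scores.append(1.0 - conf)
            indices.append(idx)
    scores, indices = torch.cat(scores), torch.cat(indices)
    top = scores.topk(batch_size).indices
    return indices[top]                      # send these examples to the annotator
```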
【4】 Face Identification Proficiency Test Designed Using Item Response Theory
Authors: Géraldine Jeckeln, Ying Hu, Jacqueline G. Cavazos, Amy N. Yates, Carina A. Hahn, Larry Tang, Jonathon Phillips, Alice J. O'Toole
Affiliations: The University of Texas at Dallas; National Institute of Standards and Technology; University of Central Florida
Note: 17 pages (including references), 7 figures
Link: https://arxiv.org/abs/2106.15323
Abstract: Measures of face identification proficiency are essential to ensure accurate and consistent performance by professional forensic face examiners and others who perform face identification tasks in applied scenarios. Current proficiency tests rely on static sets of stimulus items and so cannot be administered validly to the same individual multiple times. To create a proficiency test, a large number of items of "known" difficulty must be assembled. Multiple tests of equal difficulty can then be constructed using subsets of items. Here, we introduce a proficiency test, the Triad Identity Matching (TIM) test, based on stimulus difficulty measures derived from Item Response Theory (IRT). Participants view face-image "triads" (N=225) (two images of one identity and one image of a different identity) and select the different identity. In Experiment 1, university students (N=197) showed wide-ranging accuracy on the TIM test. Furthermore, IRT modeling demonstrated that the TIM test produces items of various difficulty levels. In Experiment 2, IRT-based item difficulty measures were used to partition the TIM test into three equally "easy" and three equally "difficult" subsets. Simulation results indicated that the full set, as well as curated subsets, of the TIM items yielded reliable estimates of subject ability. In summary, the TIM test can provide a starting point for developing a framework that is flexible, calibrated, and adaptive to measure proficiency across various ability levels (e.g., professionals or populations with face processing deficits).
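For reference, a standard two-parameter logistic (2PL) IRT model, one common choice for jointly estimating item difficulty and subject ability (the abstract does not specify which IRT variant was used), gives the probability that subject $j$ answers item $i$ correctly as

```latex
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}},
```

where $\theta_j$ is the subject's ability, $b_i$ the item's difficulty, and $a_i$ its discrimination; fitting such a model to response data yields the per-item difficulty estimates used to assemble subsets of matched difficulty.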
【5】 Cells are Actors: Social Network Analysis with Classical ML for SOTA Histology Image Classification
Authors: Neda Zamanitajeddin, Mostafa Jahanifar, Nasir Rajpoot
Affiliation: Tissue Image Analytics Centre, Department of Computer Science, University of Warwick, Coventry, UK
Link: https://arxiv.org/abs/2106.15299
Abstract: The digitization of histology images and the advent of new computational methods, like deep learning, have helped the automatic grading of colorectal adenocarcinoma cancer (CRA). Present automated CRA grading methods, however, usually use tiny image patches and thus fail to integrate the entire tissue micro-architecture for grading purposes. To tackle these challenges, we propose to use a statistical network analysis method to describe the complex structure of the tissue micro-environment by modelling nuclei and their connections as a network. We show that by analyzing only the interactions between the cells in a network, we can extract highly discriminative statistical features for CRA grading. Unlike other deep learning or convolutional graph-based approaches, our method is highly scalable (it can be used for cell networks consisting of millions of nodes), completely explainable, and computationally inexpensive. We create cell networks on a broad CRC histology image dataset, experiment with our method, and report state-of-the-art performance for the prediction of three-class CRA grading.
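A minimal sketch of this kind of pipeline, assuming the networkx and scipy libraries: build a graph over nuclei centroids by connecting nearby cells, then read off network statistics as features (the radius and the chosen statistics are illustrative assumptions, not the paper's feature set):

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def cell_network_features(centroids, radius=30.0):
    """centroids: (N, 2) nuclei positions; connect nuclei within `radius` pixels."""
    tree = cKDTree(centroids)
    g = nx.Graph()
    g.add_nodes_from(range(len(centroids)))
    g.add_edges_from(tree.query_pairs(radius))  # all pairs closer than radius
    degrees = np.array([d for _, d in g.degree()])
    return {
        "mean_degree": degrees.mean(),          # average cell connectivity
        "clustering": nx.average_clustering(g), # local cohesion of the tissue
        "density": nx.density(g),
    }
```

Such feature vectors can then feed a classical classifier, which is what keeps the approach scalable and explainable.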
【6】 MFR 2021: Masked Face Recognition Competition
Authors: Fadi Boutros, Naser Damer, Jan Niklas Kolf, Kiran Raja, Florian Kirchbuchner, Raghavendra Ramachandra, Arjan Kuijper, Pengcheng Fang, Chao Zhang, Fei Wang, David Montero, Naiara Aginako, Basilio Sierra, Marcos Nieto, Mustafa Ekrem Erakin, Ugur Demir, Hazim Kemal Ekenel, Asaki Kataoka, Kohei Ichikawa, Shizuma Kubo, Jie Zhang, Mingjie He, Dan Han, Shiguang Shan, Klemen Grm, Vitomir Štruc, Sachith Seneviratne, Nuran Kasthuriarachchi, Sanka Rasnayaka, Pedro C. Neto, Ana F. Sequeira, Joao Ribeiro Pinto, Mohsen Saffari, Jaime S. Cardoso
Affiliations: Fraunhofer Institute for Computer Graphics Research IGD, Germany; TU Darmstadt, Germany; Norwegian University of Science and Technology, Norway; TYAI, China; VICOMTECH, Spain; University of the Basque Country, Spain
Note: Accepted at the International Joint Conference on Biometrics (IJCB 2021)
Link: https://arxiv.org/abs/2106.15288
Abstract: This paper presents a summary of the Masked Face Recognition Competitions (MFR) held within the 2021 International Joint Conference on Biometrics (IJCB 2021). The competition attracted a total of 10 participating teams with valid submissions. The affiliations of these teams are diverse and associated with academia and industry in nine different countries. These teams successfully submitted 18 valid solutions. The competition was designed to motivate solutions aiming at enhancing the face recognition accuracy of masked faces. Moreover, the competition considered the deployability of the proposed solutions by taking the compactness of the face recognition models into account. A private dataset representing a collaborative, multi-session, real masked capture scenario was used to evaluate the submitted solutions. In comparison to one of the top-performing academic face recognition solutions, 10 out of the 18 submitted solutions did score higher masked face verification accuracy.
【7】 Similarity Embedding Networks for Robust Human Activity Recognition
Authors: Chenglin Li, Carrie Lu Tong, Di Niu, Bei Jiang, Xiao Zuo, Lei Cheng, Jian Xiong, Jianming Yang
Affiliation: University of Alberta
Link: https://arxiv.org/abs/2106.15283
Abstract: Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers. The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space, and can be effectively trained on a small dataset and even on a noisy dataset with mislabeled samples. Based on the learned embeddings, we further propose both nonparametric and parametric approaches for activity recognition. Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks, is robust to mislabeled samples in the training set, and can also be used to effectively denoise a noisy dataset.
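The pairwise similarity loss described here is in the family of contrastive losses; a minimal sketch follows (the margin and squared-Euclidean form are illustrative assumptions, not necessarily the paper's exact loss):

```python
import torch

def pairwise_similarity_loss(z1, z2, same, margin=1.0):
    """z1, z2: (B, D) embeddings of a pair of samples;
    same: (B,) float tensor, 1.0 if the pair shares a class, else 0.0."""
    d = (z1 - z2).norm(dim=1)
    pos = same * d.pow(2)                                # pull same-class pairs together
    neg = (1 - same) * (margin - d).clamp(min=0).pow(2)  # push others past the margin
    return (pos + neg).mean()
```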
【8】 Domain-Class Correlation Decomposition for Generalizable Person Re-Identification
Authors: Kaiwen Yang, Xinmei Tian
Affiliation: CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China
Note: 10 pages, 5 figures
Link: https://arxiv.org/abs/2106.15206
Abstract: Domain generalization in person re-identification is a highly important and practical task, in which a model trained with data from several source domains is expected to generalize well to unseen target domains. Domain adversarial learning is a promising domain generalization method that aims to remove domain information in the latent representation through adversarial training. However, in person re-identification, the domain and class are correlated, and we theoretically show that domain adversarial learning will lose certain information about class due to this domain-class correlation. Inspired by causal inference, we propose to perform interventions on the domain factor $d$, aiming to decompose the domain-class correlation. To achieve this goal, we propose estimating the resulting representation $z^{*}$ caused by the intervention through first- and second-order statistical characteristic matching. Specifically, we build a memory bank to store the statistical characteristics of each domain. Then, we use the newly generated samples $\{z^{*},y,d^{*}\}$ to compute the loss function. These samples are domain-class correlation decomposed; thus, we can learn a domain-invariant representation that captures more class-related features. Extensive experiments show that our model outperforms the state-of-the-art methods on the large-scale domain generalization Re-ID benchmark.
【9】 Inconspicuous Adversarial Patches for Fooling Image Recognition Systems on Mobile Devices
Authors: Tao Bai, Jinqi Luo, Jun Zhao
Affiliation: School of Computer Science and Engineering, Nanyang Technological University
Note: arXiv admin note: substantial text overlap with arXiv:2009.09774
Link: https://arxiv.org/abs/2106.15202
Abstract: Deep learning based image recognition systems have been widely deployed on mobile devices in today's world. In recent studies, however, deep learning models have been shown to be vulnerable to adversarial examples. One variant of adversarial examples, called the adversarial patch, draws researchers' attention due to its strong attack ability. Though adversarial patches achieve high attack success rates, they are easily detected because of the visual inconsistency between the patches and the original images. Besides, adversarial patch generation in the literature usually requires a large amount of data, which is computationally expensive and time-consuming. To tackle these challenges, we propose an approach to generate inconspicuous adversarial patches with one single image. In our approach, we first decide the patch locations based on the perceptual sensitivity of victim models, then produce adversarial patches in a coarse-to-fine way by utilizing multiple-scale generators and discriminators. The patches are encouraged to be consistent with the background images via adversarial training while preserving strong attack ability. Our approach shows strong attack ability in white-box settings and excellent transferability in black-box settings through extensive experiments on various models with different architectures and training methods. Compared to other adversarial patches, our adversarial patches carry negligible risk of being detected and can evade human observation, which is supported by illustrations of saliency maps and results of user evaluations. Lastly, we show that our adversarial patches can be applied in the physical world.
【10】 Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition
Authors: Yi-Fan Song, Zhang Zhang, Caifeng Shan, Liang Wang
Affiliation: National Laboratory of Pattern Recognition (NLPR), Institute of Automation
Note: 15 pages, 12 tables, 10 figures; submitted to IEEE T-PAMI. arXiv admin note: text overlap with arXiv:2010.09978
Link: https://arxiv.org/abs/2106.15125
Abstract: One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the recent state-of-the-art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation cost of model architectures on large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early-fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on this baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small numbers of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of the NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.
【11】 ElephantBook: A Semi-Automated Human-in-the-Loop System for Elephant Re-Identification
Authors: Peter Kulits, Jake Wall, Anka Bedetti, Michelle Henley, Sara Beery
Link: https://arxiv.org/abs/2106.15083
Abstract: African elephants are vital to their ecosystems, but their populations are threatened by a rise in human-elephant conflict and poaching. Monitoring population dynamics is essential in conservation efforts; however, tracking elephants is a difficult task, usually relying on the invasive and sometimes dangerous placement of GPS collars. Although there have been many recent successes in the use of computer vision techniques for automated identification of other species, identification of elephants is extremely difficult and typically requires expertise as well as familiarity with elephants in the population. We have built and deployed a web-based platform and database for human-in-the-loop re-identification of elephants combining manual attribute labeling and state-of-the-art computer vision algorithms, known as ElephantBook. Our system is currently in use at the Mara Elephant Project, helping monitor the protected and at-risk population of elephants in the Greater Maasai Mara ecosystem. ElephantBook makes elephant re-identification usable by non-experts and scalable for use by multiple conservation NGOs.
【12】 Improving Transferability of Adversarial Patches on Face Recognition with Generative Models
Authors: Zihao Xiao, Xianfeng Gao, Chilin Fu, Yinpeng Dong, Wei Gao, Xiaolu Zhang, Jun Zhou, Jun Zhu
Affiliations: RealAI; Ant Financial; Tsinghua University; Beijing Institute of Technology; Nanyang Technological University
Note: Accepted by CVPR 2021. Based on the camera-ready version; some typos are fixed
Link: https://arxiv.org/abs/2106.15058
Abstract: Face recognition is greatly improved by deep convolutional neural networks (CNNs). Recently, these face recognition models have been used for identity authentication in security-sensitive applications. However, deep CNNs are vulnerable to adversarial patches, which are physically realizable and stealthy, raising new security concerns for the real-world applications of these models. In this paper, we evaluate the robustness of face recognition models against adversarial patches based on transferability, where the attacker has limited accessibility to the target models. First, we extend existing transfer-based attack techniques to generate transferable adversarial patches. However, we observe that the transferability is sensitive to initialization and degrades when the perturbation magnitude is large, indicating overfitting to the substitute models. Second, we propose to regularize the adversarial patches on a low-dimensional data manifold. The manifold is represented by generative models pre-trained on legitimate human face images. Using face-like features as adversarial perturbations through optimization on the manifold, we show that the gaps between the responses of substitute models and target models dramatically decrease, exhibiting better transferability. Extensive digital-world experiments are conducted to demonstrate the superiority of the proposed method in the black-box setting. We apply the proposed method in the physical world as well.
Segmentation | Semantics (12 papers)
【1】 Segmentation with Multiple Acceptable Annotations: A Case Study of Myocardial Segmentation in Contrast Echocardiography
Authors: Dewen Zeng, Mingqi Li, Yukun Ding, Xiaowei Xu, Qiu Xie, Ruixue Xu, Hongwen Fei, Meiping Huang, Jian Zhuang, Yiyu Shi
Affiliations: University of Notre Dame, Notre Dame, USA; Guangdong Provincial People's Hospital, Guangzhou, China
Note: 12 pages
Link: https://arxiv.org/abs/2106.15597
Abstract: Most existing deep learning-based frameworks for image segmentation assume that a unique ground truth is known and can be used for performance evaluation. This is true for many applications, but not all. Myocardial segmentation of Myocardial Contrast Echocardiography (MCE), a critical task in automatic myocardial perfusion analysis, is an example. Due to the low resolution and serious artifacts in MCE data, annotations from different cardiologists can vary significantly, and it is hard to tell which one is the best. In this case, how can we find a good way to evaluate segmentation performance, and how do we train the neural network? In this paper, we address the first problem by proposing a new extended Dice to effectively evaluate segmentation performance when multiple acceptable ground truths are available. Then, based on our proposed metric, we solve the second problem by further incorporating the new metric into a loss function that enables neural networks to flexibly learn general features of the myocardium. Experiment results on our clinical MCE dataset demonstrate that the neural network trained with the proposed loss function outperforms existing ones that try to obtain a unique ground truth from multiple annotations, both quantitatively and qualitatively. Finally, our grading study shows that using extended Dice as an evaluation metric can better identify segmentation results that need manual correction compared with using Dice.
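One plausible way to score a prediction against multiple acceptable annotations, shown here purely as an illustration of the idea (the paper's exact extended Dice definition may differ): treat pixels where all annotators agree as fixed, and accept either label where they disagree.

```python
import numpy as np

def extended_dice(pred, annotations, eps=1e-6):
    """pred: (H, W) binary mask; annotations: (K, H, W) binary masks."""
    anns = np.stack(annotations).astype(bool)
    agree_fg = anns.all(axis=0)        # every annotator says foreground
    agree_bg = (~anns).all(axis=0)     # every annotator says background
    # In disputed regions, snap the reference to whatever the model predicts.
    ref = np.where(agree_fg, True, np.where(agree_bg, False, pred.astype(bool)))
    inter = np.logical_and(pred.astype(bool), ref).sum()
    return (2.0 * inter + eps) / (pred.sum() + ref.sum() + eps)
```

Under this illustrative scoring, a prediction is penalized only where it contradicts a unanimous annotation, which matches the intuition that inter-observer disagreement regions should not count as errors.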
【2】 IMENet: Joint 3D Semantic Scene Completion and 2D Semantic Segmentation through Iterative Mutual Enhancement 标题:IMENet:基于迭代相互增强的联合3D语义场景补全和2D语义分割
作者:Jie Li,Laiyan Ding,Rui Huang 机构:The Chinese University of Hong Kong, Shenzhen, Shenzhen Institute of Artificial Intelligence and Robotics for Society 备注:Accepted by IJCAI 2021 链接:https://arxiv.org/abs/2106.15413 摘要:三维语义场景完成和二维语义分割是两个紧密相关的任务,它们都是室内场景理解的关键,因为它们使用正相关的高级特征预测相同的语义类。目前的方法是从早期融合的RGB-D图像中提取二维特征进行二维分割,以提高三维场景的完整性。我们认为,这种序贯方案不能保证这两个任务充分地相互受益,并提出了一种迭代互增强网络(IMENet)来联合求解这两个任务,在预测后期对这两个任务进行交互细化。具体来说,两个细化模块是在一个统一的框架下为这两个任务开发的。第一个是2D可变形上下文金字塔(DCP)模块,它接收来自当前3D预测的投影以细化2D预测。提出了一种三维变形深度注意(DDA)模块,利用二维预测的结果更新粗三维预测。这种迭代融合在后期阶段发生在两个任务的稳定高级特征上。在NYU和NYUCAD数据集上的大量实验验证了所提出的迭代后期融合方案的有效性,并且我们的方法在3D语义场景完成和2D语义分割方面都优于现有的方法。 摘要:3D semantic scene completion and 2D semantic segmentation are two tightly correlated tasks that are both essential for indoor scene understanding, because they predict the same semantic classes, using positively correlated high-level features. Current methods use 2D features extracted from early-fused RGB-D images for 2D segmentation to improve 3D scene completion. We argue that this sequential scheme does not ensure these two tasks fully benefit each other, and present an Iterative Mutual Enhancement Network (IMENet) to solve them jointly, which interactively refines the two tasks at the late prediction stage. Specifically, two refinement modules are developed under a unified framework for the two tasks. The first is a 2D Deformable Context Pyramid (DCP) module, which receives the projection from the current 3D predictions to refine the 2D predictions. In turn, a 3D Deformable Depth Attention (DDA) module is proposed to leverage the reprojected results from 2D predictions to update the coarse 3D predictions. This iterative fusion happens to the stable high-level features of both tasks at a late stage. Extensive experiments on NYU and NYUCAD datasets verify the effectiveness of the proposed iterative late fusion scheme, and our approach outperforms the state of the art on both 3D semantic scene completion and 2D semantic segmentation.
【3】 Probabilistic Attention for Interactive Segmentation 标题:用于交互式分割的概率注意力
作者:Prasad Gabbur,Manjot Bilkhu,Javier Movellan 机构:Apple 备注:17 pages, 8 figures 链接:https://arxiv.org/abs/2106.15338 摘要:我们给出了注意力机制的概率解释,并证明Transformer中的标准点积注意力是最大后验(MAP)推理的特例。所提出的方法建议使用期望最大化(EM)算法在线调整键(key)和值(value)模型参数。当外部主体(如标注者)在推理时提供关于某些token正确取值的信息(例如某些像素的语义类别),且需要以有原则的方式将这一新信息传播到其他token时,这种方法非常有用。我们在一个交互式语义分割任务上演示了该方法,其中标注者和模型在线协作以提高标注效率。使用标准基准,我们观察到键自适应在低反馈条件下提升了模型性能($\sim10\%$ mIoU),而值传播在高反馈条件下提升了模型响应性。我们概率注意力模型的PyTorch层实现将公开发布。 摘要:We provide a probabilistic interpretation of attention and show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference. The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters. This approach is useful for cases in which external agents, e.g., annotators, provide inference-time information about the correct values of some tokens, e.g, the semantic category of some pixels, and we need for this new information to propagate to other tokens in a principled manner. We illustrate the approach on an interactive semantic segmentation task in which annotators and models collaborate online to improve annotation efficiency. Using standard benchmarks, we observe that key adaptation boosts model performance ($\sim10\%$ mIoU) in the low feedback regime and value propagation improves model responsiveness in the high feedback regime. A PyTorch layer implementation of our probabilistic attention model will be made publicly available.
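下面的NumPy示意展示了点积注意力的概率读法:softmax权重可视为"哪个key解释了该query"的后验责任(responsibility),输出则是按后验加权的值的期望。末尾"标注者修正某个token的值并重新计算期望"仅为说明信息传播思想的假设性演示,并非论文EM算法的实现。

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Under a probabilistic reading,
    each softmax row is a posterior over which key 'explains' the
    query, and the output is the posterior-weighted mean of the
    values (an E-step-like expectation)."""
    d = Q.shape[-1]
    resp = softmax(Q @ K.T / np.sqrt(d))   # posterior responsibilities
    return resp @ V, resp

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out, resp = attention(Q, K, V)

# Illustrative inference-time feedback: an annotator fixes the value of
# key 0 (e.g., corrects a token's class score); recomputing the
# expectation propagates the correction to every query attending to it.
V[0] = 10.0
out_updated, _ = attention(Q, K, V)
print(np.abs(out_updated - out).max())
```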
【4】 Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder 标题:基于本征自动编码器的图像字幕评价的对比语义相似度学习
作者:Chao Zeng,Tiesong Zhao,Sam Kwong 机构:City University of Hong Kong, Fuzhou University 链接:https://arxiv.org/abs/2106.15312 摘要:自动评估图像字幕的质量是一项非常具有挑战性的工作,因为人类语言非常灵活,相同的意思可以有不同的表达方式。目前大多数字幕评价指标都依赖于候选字幕与真值标注语句之间的token级匹配,这通常忽略了句子层面的信息。受自动编码机制和对比表征学习进展的启发,我们提出了一种基于学习的图像字幕评价指标,称之为内在图像字幕评价($I^2CE$)。我们开发了三种递进的模型结构来学习句子级表征——单分支模型、双分支模型和三分支模型。实证测试表明,与现有的图像字幕评价指标相比,采用双分支结构训练的$I^2CE$与人类判断具有更好的一致性。此外,我们选择了几种最先进的图像字幕模型,并在MS-COCO数据集上测试了它们在现有指标和所提出的$I^2CE$上的表现。实验结果表明,我们提出的方法与其他现有指标给出的得分具有良好的一致性。因此,所提出的指标可以作为衡量字幕之间内在信息的一种新指标,与现有指标形成互补。 摘要:Automatically evaluating the quality of image captions can be very challenging since human language is quite flexible that there can be various expressions for the same meaning. Most of the current captioning metrics rely on token level matching between candidate caption and the ground truth label sentences. It usually neglects the sentence-level information. Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation($I^2CE$). We develop three progressive model structures to learn the sentence level representations--single branch model, dual branches model, and triple branches model. Our empirical tests show that $I^2CE$ trained with dual branches structure achieves better consistency with human judgments to contemporary image captioning evaluation metrics. Furthermore, We select several state-of-the-art image captioning models and test their performances on the MS COCO dataset concerning both contemporary metrics and the proposed $I^2CE$. Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics. On this concern, the proposed metric could serve as a novel indicator of the intrinsic information between captions, which may be complementary to the existing ones.
【5】 Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures 标题:用于手术过程整体建模的多模态语义场景图
作者:Ege Özsoy,Evin Pınar Örnek,Ulrich Eck,Federico Tombari,Nassir Navab 机构: Computer Aided Medical Procedures, Technische Universität München, Germany, Google, Computer Aided Medical Procedures, Johns Hopkins University, Baltimore, USA 链接:https://arxiv.org/abs/2106.15309 摘要:从计算机科学的观点来看,外科领域模型需要是一个概念模型,包含行为和数据。因此,它应该为参与者、设备、工具及其复杂的交互和数据流建模。为了捕捉和建模这些,我们利用最新的计算机视觉方法从相机视图生成三维场景图。然后介绍了多模态语义场景图(MSSG),旨在为外科手术提供统一的符号、时空和语义表示。该方法旨在建立外科领域不同组成部分之间的关系模型,包括医务人员、成像系统和外科设备,为外科手术的整体理解和建模开辟道路。然后,我们使用MSSG引入一个动态生成的图形用户界面工具,用于外科手术分析,可用于许多应用,包括流程优化、手术室(OR)设计和自动报告生成。我们最后证明,所提出的MSSG也可以用于同步不同的复杂手术过程。虽然该系统在得到验证之前还需要集成到实际的手术室中,但本会议论文的主要目的是通过第一个基于MVOR数据集的原型部分实现,向学界介绍这一新概念的基本原理。 摘要:From a computer science viewpoint, a surgical domain model needs to be a conceptual one incorporating both behavior and data. It should therefore model actors, devices, tools, their complex interactions and data flow. To capture and model these, we take advantage of the latest computer vision methodologies for generating 3D scene graphs from camera views. We then introduce the Multimodal Semantic Scene Graph (MSSG) which aims at providing a unified symbolic, spatiotemporal and semantic representation of surgical procedures. This methodology aims at modeling the relationship between different components in surgical domain including medical staff, imaging systems, and surgical devices, opening the path towards holistic understanding and modeling of surgical procedures. We then use MSSG to introduce a dynamically generated graphical user interface tool for surgical procedure analysis which could be used for many applications including process optimization, OR design and automatic report generation. We finally demonstrate that the proposed MSSGs could also be used for synchronizing different complex surgical procedures. While the system still needs to be integrated into real operating rooms before getting validated, this conference paper aims mainly at providing the community with the basic principles of this novel concept through a first prototypal partial realization based on MVOR dataset.
【6】 Tackling Catastrophic Forgetting and Background Shift in Continual Semantic Segmentation 标题:应对持续语义分割中的灾难性遗忘和背景漂移
作者:Arthur Douillard,Yifu Chen,Arnaud Dapogny,Matthieu Cord 机构: which requiresthe model to never forget these classes in the case of learning only 1 Sorbonne Universit´e 备注:Under review at IEEE TPAMI, journal extension of arXiv:2011.11390 链接:https://arxiv.org/abs/2106.15287 摘要:目前,深度学习方法广泛应用于处理语义分割等需要大量数据集和强大计算能力的计算机视觉任务。语义分割的持续学习(CSS)是一种新兴趋势,它通过不断添加新类来更新旧模型。然而,持续学习方法通常容易导致灾难性遗忘。这个问题在CSS中进一步加剧:在每一步中,先前迭代中的旧类都会被归并为背景。在本文中,我们提出了Local POD,一种多尺度池化蒸馏方案,在特征层面保持长距离和短距离的空间关系。此外,我们还设计了一种基于熵的伪标注方法,利用旧模型预测的类别对背景进行标注,以应对背景漂移并避免对旧类的灾难性遗忘。最后,我们提出了一种特别适合分割任务的新型重放(rehearsal)方法。我们的方法称为PLOP,在现有CSS场景以及新提出的具有挑战性的基准中均显著优于最先进的方法。 摘要:Deep learning approaches are nowadays ubiquitously used to tackle computer vision tasks such as semantic segmentation, requiring large datasets and substantial computational power. Continual learning for semantic segmentation (CSS) is an emerging trend that consists in updating an old model by sequentially adding new classes. However, continual learning methods are usually prone to catastrophic forgetting. This issue is further aggravated in CSS where, at each step, old classes from previous iterations are collapsed into the background. In this paper, we propose Local POD, a multi-scale pooling distillation scheme that preserves long- and short-range spatial relationships at feature level. Furthermore, we design an entropy-based pseudo-labelling of the background w.r.t. classes predicted by the old model to deal with background shift and avoid catastrophic forgetting of the old classes. Finally, we introduce a novel rehearsal method that is particularly suited for segmentation. Our approach, called PLOP, significantly outperforms state-of-the-art methods in existing CSS scenarios, as well as in newly proposed challenging benchmarks.
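作为说明,下面用PyTorch给出一个简化的多尺度池化蒸馏示意(Local POD风格):在若干尺度上把特征图划分为网格,在每个网格单元内分别沿高、宽方向池化,并惩罚新旧模型池化统计量之间的L2距离。尺度集合与归一化方式为示例假设,可能与论文实现不同。

```python
import torch

def local_pod(feat_new, feat_old, scales=(1, 2, 4)):
    """Simplified multi-scale pooling distillation (Local POD flavour).
    For each scale s, split the HxW feature map into an s x s grid and,
    inside every cell, pool along height and along width; penalise the
    L2 distance between pooled statistics of new and old models.
    Pooling keeps spatial structure while ignoring per-pixel values."""
    loss = feat_new.new_zeros(())
    B, C, H, W = feat_new.shape
    for s in scales:
        hs, ws = H // s, W // s
        for i in range(s):
            for j in range(s):
                n = feat_new[:, :, i*hs:(i+1)*hs, j*ws:(j+1)*ws]
                o = feat_old[:, :, i*hs:(i+1)*hs, j*ws:(j+1)*ws]
                # width-pooled (B,C,hs) and height-pooled (B,C,ws) slices
                loss = loss + torch.norm(n.mean(3) - o.mean(3), p=2)
                loss = loss + torch.norm(n.mean(2) - o.mean(2), p=2)
    return loss / len(scales)

new = torch.randn(2, 16, 32, 32, requires_grad=True)  # current model features
old = torch.randn(2, 16, 32, 32)                      # frozen old-model features
print(local_pod(new, old))
```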
【7】 Perception-aware Multi-sensor Fusion for 3D LiDAR Semantic Segmentation 标题:面向三维LiDAR语义分割的感知感知多传感器融合
作者:Zhuangwei Zhuang,Rong Li,Yuanqing Li,Kui Jia,Qicheng Wang,Mingkui Tan 机构:South China University of Technology, China, Shenzhen Youjia Innov Tech Co., Ltd, China 备注:11 pages,9 figures 链接:https://arxiv.org/abs/2106.15277 摘要:基于三维激光雷达(light detection and ranging,简称3D LiDAR)的语义分割在场景理解中有着重要的应用,例如自动驾驶和机器人技术。例如,对于装备RGB相机和LiDAR的自动驾驶汽车来说,融合来自不同传感器的互补信息对于鲁棒和准确的分割是至关重要的。然而,由于两种模态之间的巨大差异,现有的基于融合的方法可能无法获得令人满意的性能。在这项工作中,我们研究了一种称为感知感知多传感器融合(PMF)的协作融合方案,以利用两种模态的感知信息,即RGB图像的外观信息和点云的空间深度信息。为此,我们首先将点云投影到相机坐标系中,为RGB图像提供空间深度信息。然后,我们提出了一种双流网络,分别从两种模态中提取特征,并通过有效的基于残差的融合模块对特征进行融合。此外,我们提出了额外的感知损失(perception-aware losses)来度量两种模态之间的巨大感知差异。在两个基准数据集上的大量实验表明了该方法的优越性。例如,在nuScenes上,我们的PMF在mIoU上比最先进的方法高出0.8%。 摘要:3D LiDAR (light detection and ranging) based semantic segmentation is important in scene understanding for many applications, such as auto-driving and robotics. For example, for autonomous cars equipped with RGB cameras and LiDAR, it is crucial to fuse complementary information from different sensors for robust and accurate segmentation. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we first project point clouds to the camera coordinates to provide spatio-depth information for RGB images. Then, we propose a two-stream network to extract features from the two modalities, separately, and fuse the features by effective residual-based fusion modules. Moreover, we propose additional perception-aware losses to measure the great perceptual difference between the two modalities. Extensive experiments on two benchmark data sets show the superiority of our method. For example, on nuScenes, our PMF outperforms the state-of-the-art method by 0.8% in mIoU.
【8】 Predicting the Solar Potential of Rooftops using Image Segmentation and Structured Data 标题:基于图像分割和结构化数据的屋顶太阳能潜力预测
作者:Daniel de Barros Soares,François Andrieux,Bastien Hell,Julien Lenhardt,Jordi Badosa,Sylvain Gavoille,Stéphane Gaiffas,Emmanuel Bacry 机构:namR, Paris, France, ENSTA Paris, LMD, Ecole polytechnique, IP Paris, Palaiseau, France, LPSM, Université de Paris, DMA, Ecole normale supérieure, CEREMADE, Université Paris Dauphine 链接:https://arxiv.org/abs/2106.15268 摘要:估算屋顶光伏发电系统的发电量是一个耗时的过程,需要现场测量,这是一项难以大规模实现的任务。在本文中,我们提出了一种方法来估计屋顶太阳能潜力的基础上,他们的位置和建筑特点,以及他们每年收到的太阳辐射量。该方法一方面利用计算机视觉实现屋顶截面和屋顶对象的语义分割,另一方面利用基于结构化建筑特征的机器学习模型预测屋顶坡度。然后,我们用几何方法计算了可以安装在屋顶上的太阳能电池板的方位角和最大数量。最后,我们计算出精确的遮光掩模,并将其与太阳辐射数据相结合,使我们能够估计屋顶的年太阳能潜力。 摘要:Estimating the amount of electricity that can be produced by rooftop photovoltaic systems is a time-consuming process that requires on-site measurements, a difficult task to achieve on a large scale. In this paper, we present an approach to estimate the solar potential of rooftops based on their location and architectural characteristics, as well as the amount of solar radiation they receive annually. Our technique uses computer vision to achieve semantic segmentation of roof sections and roof objects on the one hand, and a machine learning model based on structured building features to predict roof pitch on the other hand. We then compute the azimuth and maximum number of solar panels that can be installed on a rooftop with geometric approaches. Finally, we compute precise shading masks and combine them with solar irradiation data that enables us to estimate the yearly solar potential of a rooftop.
【9】 Predicting Depth from Semantic Segmentation using Game Engine Dataset 标题:利用游戏引擎数据集从语义分割预测深度
作者:Mohammad Amin Kashi 机构:K. N. Toosi University of Technology 备注:79 pages, Master's thesis at K. N. Toosi University of Technology, supervised by Professor Hamid D. Taghirad 链接:https://arxiv.org/abs/2106.15257 摘要:深度感知是机器人理解周围环境的基础。根据认知神经科学的观点,视觉深度知觉方法可分为三类,即双目视觉、主动视觉和图像视觉。前两类已经详细研究了几十年。然而,近年来随着深度学习方法的出现,对第三类的探索研究才获得势头,目前仍处于起步阶段。在认知神经科学中,图像深度知觉机制依赖于对所见物体的知觉。受此启发,本文研究了卷积神经网络中物体感知与深度估计的关系。为此,我们开发了一种新的网络结构,它基于一个简单的深度估计网络,只使用一幅图像作为输入。我们提出的结构使用图像和图像的语义标签作为输入。我们使用语义标签作为对象感知的输出。与原网络的性能比较表明,在所考察的案例中,新结构可将深度估计的距离相对误差降低52%。大多数实验研究都是在游戏引擎生成的合成数据集上进行的,目的是将性能比较与非合成数据集不准确的深度和语义标签的影响隔离开来。结果表明,在没有合适的数据集的情况下,特定的合成数据集可用于深度网络的训练。此外,我们还发现,在这些情况下,语义标签的使用提高了网络对从合成训练数据向非合成测试数据的域偏移的鲁棒性。 摘要:Depth perception is fundamental for robots to understand the surrounding environment. As the view of cognitive neuroscience, visual depth perception methods are divided into three categories, namely binocular, active, and pictorial. The first two categories have been studied for decades in detail. However, research for the exploration of the third category is still in its infancy and has got momentum by the advent of deep learning methods in recent years. In cognitive neuroscience, it is known that pictorial depth perception mechanisms are dependent on the perception of seen objects. Inspired by this fact, in this thesis, we investigated the relation of perception of objects and depth estimation convolutional neural networks. For this purpose, we developed new network structures based on a simple depth estimation network that only used a single image at its input. Our proposed structures use both an image and a semantic label of the image as their input. We used semantic labels as the output of object perception. The obtained results of performance comparison between the developed network and original network showed that our novel structures can improve the performance of depth estimation by 52\% of relative error of distance in the examined cases. Most of the experimental studies were carried out on synthetic datasets that were generated by game engines to isolate the performance comparison from the effect of inaccurate depth and semantic labels of non-synthetic datasets. It is shown that particular synthetic datasets may be used for training of depth networks in cases that an appropriate dataset is not available. Furthermore, we showed that in these cases, usage of semantic labels improves the robustness of the network against domain shift from synthetic training data to non-synthetic test data.
【10】 Face Sketch Synthesis via Semantic-Driven Generative Adversarial Network 标题:基于语义驱动的生成性对抗网络的人脸素描合成
作者:Xingqun Qi,Muyi Sun,Weining Wang,Xiaoxiao Dong,Qi Li,Caifeng Shan 机构:School of Automation, Beijing University of Posts and Telecommunications, Beijing, China, Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China, Artificial Intelligence Research, CAS, Jiaozhou, Qingdao, China 链接:https://arxiv.org/abs/2106.15121 摘要:近年来,随着深度神经网络的发展,人脸草图合成技术取得了长足的进步。素描肖像的精细描绘促进了数字娱乐和执法等广泛的应用。然而,由于真实场景中光照的变化和背景的复杂,精确、逼真的人脸草图生成仍然是一项具有挑战性的任务。为了应对这些挑战,我们提出了一种新的语义驱动的生成对抗网络(SDGAN),它嵌入了全局结构级的风格注入和局部类级的知识重加权。具体来说,我们对输入的人脸照片进行人脸显著性检测,以提供整体的人脸纹理结构,作为一种全局的先验信息。此外,我们利用人脸解析布局作为语义级的空间先验,在SDGAN的生成器中实施全局结构级风格注入。此外,为了增强细节的真实感,我们提出了一种新的自适应重加权损失(ARLoss),它致力于平衡不同语义类的贡献。我们在CUFS和CUFSF数据集上的大量实验表明,我们提出的算法达到了最先进的性能。 摘要:Face sketch synthesis has made significant progress with the development of deep neural networks in these years. The delicate depiction of sketch portraits facilitates a wide range of applications like digital entertainment and law enforcement. However, accurate and realistic face sketch generation is still a challenging task due to the illumination variations and complex backgrounds in the real scenes. To tackle these challenges, we propose a novel Semantic-Driven Generative Adversarial Network (SDGAN) which embeds global structure-level style injection and local class-level knowledge re-weighting. Specifically, we conduct facial saliency detection on the input face photos to provide overall facial texture structure, which could be used as a global type of prior information. In addition, we exploit face parsing layouts as the semantic-level spatial prior to enforce globally structural style injection in the generator of SDGAN. Furthermore, to enhance the realistic effect of the details, we propose a novel Adaptive Re-weighting Loss (ARLoss) which dedicates to balance the contributions of different semantic classes. Experimentally, our extensive experiments on CUFS and CUFSF datasets show that our proposed algorithm achieves state-of-the-art performance.
【11】 An Efficient Cervical Whole Slide Image Analysis Framework Based on Multi-scale Semantic and Spatial Features using Deep Learning 标题:一种基于深度学习、利用多尺度语义和空间特征的高效宫颈全切片图像分析框架
作者:Ziquan Wei,Shenghua Cheng,Xiuli Liu,Shaoqun Zeng 机构:ID , Collaborative Innovation Center for Biomedical Engineering, Wuhan National Laboratory for Optoelectronics-Huazhong, University of Science and Technology, Wuhan, Hubei , China 备注:16 pages, 8 figures, journal article 链接:https://arxiv.org/abs/2106.15113 摘要:数字化千兆像素全切片图像(WSI)在临床诊断中有着广泛的应用,而WSI的自动分析是计算机辅助诊断的关键。目前,从ResNet分类器编码的大量局部图像块中分析概率图或特征图的综合描述符,是WSI级预测的主要方式。然而,对于性能欠佳的上游编码器而言,宫颈切片中稀疏且微小的病变细胞的特征表示仍然具有挑战性,而尚未被利用的宫颈细胞空间表示则是可用于语义分析的特征。此外,带重叠的图像块采样和重复处理会导致效率低下和不可预测的副作用。本研究通过丰富多尺度连通性,设计了一种新颖的内联连接网络(InCNet),在空间信息的额外监督下,构建了一个名为You Only Look Cytopathology Once(YOLCO)的轻量级模型。该模型允许将输入尺寸放大到百万像素,从而可以在不重叠的情况下拼接WSI,用于在两个尺度上收集特征和预测的平均重复次数从$10^3\sim10^4$减少到$10^1\sim10^2$。基于Transformer对综合的多尺度多任务特征进行分类,在来自四台扫描设备、共2,019张切片的多队列数据集上,实验结果显示WSI分类的AUC得分达到$0.872$,优于最佳传统方法,且速度快$2.51\times$。 摘要:Digital gigapixel whole slide image (WSI) is widely used in clinical diagnosis, and automated WSI analysis is key for computer-aided diagnosis. Currently, analyzing the integrated descriptor of probabilities or feature maps from massive local patches encoded by ResNet classifier is the main manner for WSI-level prediction. Feature representations of the sparse and tiny lesion cells in cervical slides, however, are still challengeable for the under-promoted upstream encoders, while the unused spatial representations of cervical cells are the available features to supply the semantics analysis. As well as patches sampling with overlap and repetitive processing incur the inefficiency and the unpredictable side effect. This study designs a novel inline connection network (InCNet) by enriching the multi-scale connectivity to build the lightweight model named You Only Look Cytopathology Once (YOLCO) with the additional supervision of spatial information. The proposed model allows the input size enlarged to megapixel that can stitch the WSI without any overlap by the average repeats decreased from $10^3\sim10^4$ to $10^1\sim10^2$ for collecting features and predictions at two scales. Based on Transformer for classifying the integrated multi-scale multi-task features, the experimental results appear $0.872$ AUC score better and $2.51\times$ faster than the best conventional method in WSI classification on multicohort datasets of 2,019 slides from four scanning devices.
【12】 Striking the Right Balance: Recall Loss for Semantic Segmentation 标题:取得恰当的平衡:语义分割的召回损失
作者:Junjiao Tian,Niluthpol Mithun,Zach Seymour,Han-Pang Chiu,Zsolt Kira 机构:Georgia Institute of Technology, SRI International 链接:https://arxiv.org/abs/2106.14917 摘要:类不平衡是语义分割等计算机视觉应用中的一个基本问题。特别是,训练数据集中不均匀的类别分布往往导致代表性不足的类别表现不佳。许多工作提出用预先计算的、基于类别统计量(如样本数和类别间隔)的权重对标准交叉熵损失函数进行加权。这些方法有两个主要缺点:1)不断上调少数类的权重会在语义分割中引入过多的误报;2)少数类不一定是困难类。其结果是由于过多的误报而导致精度偏低。为此,我们提出了一种困难类挖掘损失:通过重塑标准交叉熵损失,使其基于瞬时召回性能动态加权每个类的损失。我们发现新的召回损失在标准交叉熵损失和逆频率加权损失之间逐渐过渡。召回损失还可以提高平均准确率,同时提供有竞争力的平均交并比(IoU)性能。在Synthia数据集上,使用DeepLab-ResNet18,与交叉熵损失相比,召回损失在平均准确率上实现了9%的相对改进。代码位于https://github.com/PotatoTian/recall-semseg. 摘要:Class imbalance is a fundamental problem in computer vision applications such as semantic segmentation. Specifically, uneven class distributions in a training dataset often result in unsatisfactory performance on under-represented classes. Many works have proposed to weight the standard cross entropy loss function with pre-computed weights based on class statistics, such as the number of samples and class margins. There are two major drawbacks to these methods: 1) constantly up-weighting minority classes can introduce excessive false positives in semantic segmentation; 2) a minority class is not necessarily a hard class. The consequence is low precision due to excessive false positives. In this regard, we propose a hard-class mining loss by reshaping the vanilla cross entropy loss such that it weights the loss for each class dynamically based on instantaneous recall performance. We show that the novel recall loss changes gradually between the standard cross entropy loss and the inverse frequency weighted loss. Recall loss also leads to improved mean accuracy while offering competitive mean Intersection over Union (IoU) performance. On Synthia dataset, recall loss achieves 9% relative improvement on mean accuracy with competitive mean IoU using DeepLab-ResNet18 compared to the cross entropy loss. Code available at https://github.com/PotatoTian/recall-semseg.
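按摘要的描述,召回损失用每个类的瞬时召回率动态加权交叉熵。下面是一种可能的实现示意(假设权重取 1 - recall_c,并在当前batch上估计召回率),并非作者的官方代码:

```python
import torch
import torch.nn.functional as F

def recall_ce_loss(logits, targets, num_classes, eps=1e-7):
    """Cross entropy re-weighted by instantaneous per-class recall
    (a plausible instantiation of the 'recall loss' idea: hard,
    low-recall classes are up-weighted, well-recalled ones decay
    toward zero weight). logits: (N, C), targets: (N,)."""
    with torch.no_grad():
        preds = logits.argmax(dim=1)
        weights = torch.ones(num_classes, device=logits.device)
        for c in range(num_classes):
            mask = targets == c
            if mask.any():
                recall_c = (preds[mask] == c).float().mean()
                weights[c] = 1.0 - recall_c + eps   # low recall -> high weight
    return F.cross_entropy(logits, targets, weight=weights)

logits = torch.randn(64, 5)
targets = torch.randint(0, 5, (64,))
print(recall_ce_loss(logits, targets, num_classes=5))
```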
Zero/Few Shot|迁移|域适配|自适应(2篇)
【1】 Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation 标题:基于化身原型生成和适配的无源领域适配
作者:Zhen Qiu,Yifan Zhang,Hongbin Lin,Shuaicheng Niu,Yanxia Liu,Qing Du,Mingkui Tan 机构:School of Software Engineering, South China University of Technology, School of Computing, National University of Singapore, Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Pazhou Laboratory 备注:Accepted by IJCAI 2021 链接:https://arxiv.org/abs/2106.15326 摘要:我们研究了一个实际的域自适应问题,称为无源无监督域自适应(UDA)问题,在该问题中,由于数据隐私问题,我们无法访问源域数据,但只有一个预先训练的源模型和未标记的目标数据可用。然而,这个任务非常困难,因为有一个关键的挑战:源数据和目标域标签的缺乏使得模型自适应非常具有挑战性。为了解决这个问题,我们建议挖掘源模型中隐藏的知识,并利用它来生成源化身原型(即每个源类的代表性特征)以及用于域对齐的目标伪标签。为此,本文提出了一种对比原型生成与自适应(CPGA)方法。具体来说,CPGA包括两个阶段:(1)原型生成:通过探索源模型的分类边界信息,训练原型生成器,通过对比学习生成化身原型(2) 原型自适应:基于生成的源原型和目标伪标记,我们提出了一种新的鲁棒对比原型自适应策略,将每个伪标记的目标数据与相应的源原型对齐。在三个UDA基准数据集上的实验表明了该方法的有效性和优越性。 摘要:We study a practical domain adaptation task, called source-free unsupervised domain adaptation (UDA) problem, in which we cannot access source domain data due to data privacy issues but only a pre-trained source model and unlabeled target data are available. This task, however, is very difficult due to one key challenge: the lack of source data and target domain labels makes model adaptation very challenging. To address this, we propose to mine the hidden knowledge in the source model and exploit it to generate source avatar prototypes (i.e., representative features for each source class) as well as target pseudo labels for domain alignment. To this end, we propose a Contrastive Prototype Generation and Adaptation (CPGA) method. Specifically, CPGA consists of two stages: (1) prototype generation: by exploring the classification boundary information of the source model, we train a prototype generator to generate avatar prototypes via contrastive learning. (2) prototype adaptation: based on the generated source prototypes and target pseudo labels, we develop a new robust contrastive prototype adaptation strategy to align each pseudo-labeled target data to the corresponding source prototypes. Extensive experiments on three UDA benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
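下面给出CPGA第二阶段"原型自适应"思想的一个简化示意:将带伪标签的目标特征通过温度缩放的余弦相似度拉向对应类的源化身原型、推离其他原型。损失形式与温度取值为示例假设,论文提出的鲁棒对比策略更为复杂。

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(features, pseudo_labels, prototypes, tau=0.1):
    """Contrastive alignment of pseudo-labelled target features to class
    prototypes (a simplified stand-in for CPGA's adaptation stage):
    each feature is pulled toward its pseudo-class prototype and pushed
    away from the others via a temperature-scaled softmax.
    features: (N, d), prototypes: (C, d), pseudo_labels: (N,)."""
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = f @ p.t() / tau          # (N, C) cosine similarities
    return F.cross_entropy(logits, pseudo_labels)

feats = torch.randn(32, 128)          # target-domain features
protos = torch.randn(10, 128)         # generated source avatar prototypes
pseudo = torch.randint(0, 10, (32,))  # pseudo labels for target samples
print(prototype_alignment_loss(feats, pseudo, protos))
```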
【2】 Adaptive Sample Selection for Robust Learning under Label Noise 标题:标签噪声下鲁棒学习的自适应样本选择
作者:Deep Patel,P. S. Sastry 机构:Department of Electrical Engineering, Indian Institute of Science, Bangalore, Karnataka 备注:Preprint. Under review 链接:https://arxiv.org/abs/2106.15292 摘要:深度神经网络(DNNs)已被证明在存在噪声标注数据时容易发生记忆效应或过拟合。针对这种噪声数据下的鲁棒学习问题,已有多种算法被提出。其中一类重要算法依赖于受课程学习启发的样本选择策略。例如,许多算法使用"小损失技巧",即选择损失值低于某个阈值的部分样本进行训练。这些算法对这些阈值非常敏感,而阈值很难确定或学习。通常,这些算法还需要标签噪声率等信息,而这些信息在实际中通常是不可用的。在本文中,我们提出了一个数据相关的自适应样本选择策略,它只依赖于给定小批量的批次统计信息来提供对标签噪声的鲁棒性。该算法不需要任何额外的样本选择超参数,不需要任何噪声率信息,也不需要访问带有干净标签的单独数据。我们在基准测试数据集上验证了算法的有效性。 摘要:Deep Neural Networks (DNNs) have been shown to be susceptible to memorization or overfitting in the presence of noisily labelled data. For the problem of robust learning under such noisy data, several algorithms have been proposed. A prominent class of algorithms rely on sample selection strategies, motivated by curriculum learning. For example, many algorithms use the `small loss trick' wherein a fraction of samples with loss values below a certain threshold are selected for training. These algorithms are sensitive to such thresholds, and it is difficult to fix or learn these thresholds. Often, these algorithms also require information such as label noise rates which are typically unavailable in practice. In this paper, we propose a data-dependent, adaptive sample selection strategy that relies only on batch statistics of a given mini-batch to provide robustness against label noise. The algorithm does not have any additional hyperparameters for sample selection, does not need any information on noise rates, and does not need access to separate data with clean labels. We empirically demonstrate the effectiveness of our algorithm on benchmark datasets.
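摘要指出选择规则只依赖当前mini-batch的统计量且无需额外超参数。下面的PyTorch示意给出一种可能的实现:以batch内逐样本损失的均值为阈值,只对损失低于阈值(更可能干净)的样本回传梯度。具体统计量的选取是示例假设,未必与论文一致:

```python
import torch
import torch.nn.functional as F

def batch_adaptive_selection_loss(logits, labels):
    """Train only on samples whose loss falls below a threshold derived
    from the current mini-batch itself (here: the batch mean loss) --
    a hyperparameter-free selection rule in the spirit of the paper;
    the authors' exact statistic may differ."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    threshold = per_sample.mean().detach()      # batch statistic, no tuning
    selected = per_sample <= threshold          # likely-clean samples
    if selected.any():
        return per_sample[selected].mean()
    return per_sample.mean()

logits = torch.randn(128, 10, requires_grad=True)
labels = torch.randint(0, 10, (128,))
loss = batch_adaptive_selection_loss(logits, labels)
loss.backward()
print(loss.item())
```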
半弱无监督|主动学习|不确定性(5篇)
【1】 Uncertainty-Guided Progressive GANs for Medical Image Translation 标题:不确定性引导的渐进式GAN在医学图像翻译中的应用
作者:Uddeshya Upadhyay,Yanbei Chen,Tobias Hepp,Sergios Gatidis,Zeynep Akata 机构: University of T¨ubingen, Max Planck Institute for Intelligent Systems 备注:accepted at MICCAI 2021, code is released here: https://github.com/ExplainableML/UncerGuidedI2I 链接:https://arxiv.org/abs/2106.15542 摘要:图像到图像的转换在处理衰减校正、运动校正、欠采样重建和去噪等各种医学成像任务中起着至关重要的作用。生成性对抗网络已被证明在为这些任务生成高保真图像方面达到了最先进的水平。然而,最先进的基于GAN的框架并不能估计网络预测中的不确定性,这对于做出明智的医疗决策和医学专家随后的修订至关重要,而且最近已经证明可以提高模型的性能和可解释性。在这项工作中,我们提出了一个不确定性引导的图像到图像的渐进学习方案。通过将偶然不确定性(aleatoric uncertainty)作为以渐进方式训练的GAN的注意力图,我们逐步生成保真度不断提高的图像。我们证明了我们的模型在三个具有挑战性的医学图像翻译任务上的有效性,包括PET到CT的翻译、欠采样MRI重建和MRI运动伪影校正。我们的模型在三个不同的任务中都有很好的推广,并且在数据有限的情况下,在完全监督和弱监督的情况下提高了性能。代码在此处发布:https://github.com/ExplainableML/UncerGuidedI2I 摘要:Image-to-image translation plays a vital role in tackling various medical imaging tasks such as attenuation correction, motion correction, undersampled reconstruction, and denoising. Generative adversarial networks have been shown to achieve the state-of-the-art in generating high fidelity images for these tasks. However, the state-of-the-art GAN-based frameworks do not estimate the uncertainty in the predictions made by the network that is essential for making informed medical decisions and subsequent revision by medical experts and has recently been shown to improve the performance and interpretability of the model. In this work, we propose an uncertainty-guided progressive learning scheme for image-to-image translation. By incorporating aleatoric uncertainty as attention maps for GANs trained in a progressive manner, we generate images of increasing fidelity progressively. We demonstrate the efficacy of our model on three challenging medical image translation tasks, including PET to CT translation, undersampled MRI reconstruction, and MRI motion artefact correction. Our model generalizes well in three different tasks and improves performance over state of the art under full-supervision and weak-supervision with limited data. Code is released here: https://github.com/ExplainableML/UncerGuidedI2I
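下面的示意展示"以偶然不确定性作为注意力图"的基本思路:网络预测逐像素的均值与尺度(此处用异方差Laplace负对数似然训练),再用不确定性对下一阶段的细化量进行门控。门控形式(sigmoid)为示例假设,并非论文的确切实现:

```python
import torch

def aleatoric_l1_nll(pred_mean, pred_log_b, target):
    """Heteroscedastic Laplace negative log-likelihood: the generator
    predicts a per-pixel mean and scale b, so pixels it cannot fit are
    explained by a large b (= high aleatoric uncertainty)."""
    b = pred_log_b.exp()
    return (torch.abs(target - pred_mean) / b + pred_log_b).mean()

mean_k = torch.rand(1, 1, 64, 64)      # stage-k translated image
log_b_k = torch.randn(1, 1, 64, 64)    # stage-k predicted log-scale
target = torch.rand(1, 1, 64, 64)
print(aleatoric_l1_nll(mean_k, log_b_k, target))

# Assumed gating form: the next stage's correction concentrates on the
# pixels the previous stage was uncertain about.
residual = torch.randn(1, 1, 64, 64)   # stage-(k+1) proposed correction
attention = torch.sigmoid(log_b_k)     # high uncertainty -> strong update
refined = mean_k + attention * residual
print(refined.shape)
```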
【2】 Where is the disease? Semi-supervised pseudo-normality synthesis from an abnormal image 标题:疾病在哪里?基于异常图像的半监督伪正常图像合成
作者:Yuanqi Du,Quan Quan,Hu Han,S. Kevin Zhou 机构: George Mason University, Institute of Computing Technology, Chi-, nese Academy of Sciences 链接:https://arxiv.org/abs/2106.15345 摘要:伪正常合成(pseudo-normality synthesis)指通过计算从异常图像(如含病变图像)生成伪正常图像,从病变检测、数据增强到临床手术建议等多个角度来看都是关键技术。然而,在缺乏病变信息的情况下,生成高质量的伪正常图像是一个挑战。因此,以往工作引入昂贵的病灶分割数据为生成模型提供病灶信息,以提高合成图像的质量。本文旨在缓解伪正常图像生成过程中对大量病灶分割数据的需求。我们提出了一种半监督医学图像生成学习网络(SMILE),该网络不仅利用有限的带分割掩模的医学图像,而且利用不带分割掩模的海量医学图像生成逼真的伪正常图像。大量的实验表明,我们的模型在数据增强任务上比目前最好的模型高出6%,在生成高质量的图像上比现有的模型高出3%。此外,本文提出的半监督学习算法仅需50%的分割数据,就可以获得与监督学习模型相当的医学图像合成质量。 摘要:Pseudo-normality synthesis, which computationally generates a pseudo-normal image from an abnormal one (e.g., with lesions), is critical in many perspectives, from lesion detection, data augmentation to clinical surgery suggestion. However, it is challenging to generate high-quality pseudo-normal images in the absence of the lesion information. Thus, expensive lesion segmentation data have been introduced to provide lesion information for the generative models and improve the quality of the synthetic images. In this paper, we aim to alleviate the need of a large amount of lesion segmentation data when generating pseudo-normal images. We propose a Semi-supervised Medical Image generative LEarning network (SMILE) which not only utilizes limited medical images with segmentation masks, but also leverages massive medical images without segmentation masks to generate realistic pseudo-normal images. Extensive experiments show that our model outperforms the best state-of-the-art model by up to 6% for data augmentation task and 3% in generating high-quality images. Moreover, the proposed semi-supervised learning achieves comparable medical image synthesis quality with supervised learning model, using only 50% of segmentation data.
【3】 Understanding Cognitive Fatigue from fMRI Scans with Self-supervised Learning 标题:利用自我监督学习从fMRI扫描中理解认知疲劳
作者:Ashish Jaiswal,Ashwin Ramesh Babu,Mohammad Zaki Zadeh,Fillia Makedon,Glenn Wylie 机构:The University of Texas at Arlington, Arlington, TX , Kessler Foundation, East Hanover, New Jersey 备注:8 pages, 5 figures, 2 tables 链接:https://arxiv.org/abs/2106.15009 摘要:功能磁共振成像(functional magnetic resonance imaging,fMRI)是一种神经成像技术,它根据受试者执行的任务,通过捕捉大脑不同区域的血氧水平来记录神经活动。基于功能磁共振成像数据预测一个人认知疲劳状态的问题还没有得到充分的研究。本文提出将认知疲劳状态划分为六个不同的层次,从无疲劳到极端疲劳状态,作为一个多类别的分类问题来解决这一问题。我们建立了一个时空模型,利用卷积神经网络(CNN)进行空间特征提取,并利用长短时记忆(LSTM)网络对4D功能磁共振扫描进行时间建模。我们还应用了一种称为MoCo的自监督方法在公共数据集BOLD5000上对我们的模型进行预训练,并在标注数据集上对其进行微调,以分类认知疲劳。我们的新数据集包含来自创伤性脑损伤(TBI)患者和健康对照者(HCs)在执行一系列认知任务时的fMRI扫描。这种方法建立了一种从功能磁共振成像数据分析认知疲劳的最新技术,并优于以往解决这一问题的方法。 摘要:Functional magnetic resonance imaging (fMRI) is a neuroimaging technique that records neural activations in the brain by capturing the blood oxygen level in different regions based on the task performed by a subject. Given fMRI data, the problem of predicting the state of cognitive fatigue in a person has not been investigated to its full extent. This paper proposes tackling this issue as a multi-class classification problem by dividing the state of cognitive fatigue into six different levels, ranging from no-fatigue to extreme fatigue conditions. We built a spatio-temporal model that uses convolutional neural networks (CNN) for spatial feature extraction and a long short-term memory (LSTM) network for temporal modeling of 4D fMRI scans. We also applied a self-supervised method called MoCo to pre-train our model on a public dataset BOLD5000 and fine-tuned it on our labeled dataset to classify cognitive fatigue. Our novel dataset contains fMRI scans from Traumatic Brain Injury (TBI) patients and healthy controls (HCs) while performing a series of cognitive tasks. This method establishes a state-of-the-art technique to analyze cognitive fatigue from fMRI data and beats previous approaches to solve this problem.
【4】 A Mixed-Supervision Multilevel GAN Framework for Image Quality Enhancement 标题:一种用于图像质量增强的混合监督多级GAN框架
作者:Uddeshya Upadhyay,Suyash Awate 机构:Computer Science and Engineering, Indian Institute of Technology, Bombay, India 备注:MICCAI 2019 链接:https://arxiv.org/abs/2106.15575 摘要:用于图像质量增强的深度神经网络通常需要大量精心整理的训练数据,这些数据由成对的低质量图像及其对应的高质量图像组成。虽然高质量图像采集通常昂贵且耗时,但中等质量图像的采集速度更快,设备成本更低,并且可以大量获取。因此,我们提出了一种新的生成对抗网络(GAN),它可以利用多个质量级别(例如高质量和中等质量)的训练数据来提高性能,同时限制数据整理的成本。我们将混合监督GAN应用于(i)组织病理学图像的超分辨率,以及(ii)结合超分辨率和手术烟雾去除来增强腹腔镜图像。在大量临床和临床前数据集上的结果表明,我们的混合监督GAN优于现有技术。 摘要:Deep neural networks for image quality enhancement typically need large quantities of highly-curated training data comprising pairs of low-quality images and their corresponding high-quality images. While high-quality image acquisition is typically expensive and time-consuming, medium-quality images are faster to acquire, at lower equipment costs, and available in larger quantities. Thus, we propose a novel generative adversarial network (GAN) that can leverage training data at multiple levels of quality (e.g., high and medium quality) to improve performance while limiting costs of data curation. We apply our mixed-supervision GAN to (i) super-resolve histopathology images and (ii) enhance laparoscopy images by combining super-resolution and surgical smoke removal. Results on large clinical and pre-clinical datasets show the benefits of our mixed-supervision GAN over the state of the art.
【5】 Two-Stage Self-Supervised Cycle-Consistency Network for Reconstruction of Thin-Slice MR Images 标题:用于薄层MR图像重建的两级自监督循环一致性网络
作者:Zhiyang Lu,Zheng Li,Jun Wang,Jun Shi,Dinggang Shen 机构: Key laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International, Research Laboratory of Specialty Fiber Optics and Advanced Communication, School of Communication and Information Engineering, Shanghai University, China 链接:https://arxiv.org/abs/2106.15395 摘要:厚层磁共振(MR)图像在冠状面和矢状面上常出现结构模糊,给诊断和图像后处理带来一定的危害。深度学习(DL)在从低分辨率(LR)图像重建高分辨率(HR)薄层MR图像方面显示出巨大的潜力,本文称之为切片插值任务。然而,由于难以采集大量成对的LR-HR MR图像,传统的完全监督DL模型无法得到有效训练以获得鲁棒性能。为此,我们提出了一种新的用于MR切片插值的两阶段自监督循环一致性网络(TSCNet),并为无监督DL网络训练设计了一种两阶段自监督学习(SSL)策略。在第一阶段SSL中,沿输入LR图像的矢状和冠状方向合成成对的LR-HR图像进行网络预训练;然后在第二阶段SSL中设计基于三重轴向切片的循环插值程序进行进一步细化。利用更多沿各方向具有丰富上下文的训练样本作为指导,保证了插值性能的提升。此外,还提出了一种新的循环一致性约束来监督该循环过程,鼓励网络重建更真实的HR图像。在真实MRI数据集上的实验结果表明,TSCNet的性能优于传统算法及其他基于SSL的算法,并取得了与完全监督算法相当的定性和定量结果。 摘要:The thick-slice magnetic resonance (MR) images are often structurally blurred in coronal and sagittal views, which causes harm to diagnosis and image post-processing. Deep learning (DL) has shown great potential to reconstruct the high-resolution (HR) thin-slice MR images from those low-resolution (LR) cases, which we refer to as the slice interpolation task in this work. However, since it is generally difficult to sample abundant paired LR-HR MR images, the classical fully supervised DL-based models cannot be effectively trained to get robust performance. To this end, we propose a novel Two-stage Self-supervised Cycle-consistency Network (TSCNet) for MR slice interpolation, in which a two-stage self-supervised learning (SSL) strategy is developed for unsupervised DL network training. The paired LR-HR images are synthesized along the sagittal and coronal directions of input LR images for network pretraining in the first-stage SSL, and then a cyclic interpolation procedure based on triplet axial slices is designed in the second-stage SSL for further refinement. More training samples with rich contexts along all directions are exploited as guidance to guarantee the improved interpolation performance. Moreover, a new cycle-consistency constraint is proposed to supervise this cyclic procedure, which encourages the network to reconstruct more realistic HR images. The experimental results on a real MRI dataset indicate that TSCNet achieves superior performance over the conventional and other SSL-based algorithms, and obtains competitive qualitative and quantitative results compared with the fully supervised algorithm.
时序|行为识别|姿态|视频|运动估计(3篇)
【1】 A Behavior-aware Graph Convolution Network Model for Video Recommendation 标题:一种行为感知的视频推荐图卷积网络模型
作者:Wei Zhuo,Kunchi Liu,Taofeng Xue,Beihong Jin,Beibei Li,Xinzhou Dong,He Chen,Wenhai Pan,Xuejian Zhang,Shuo Zhou 机构: MX Media Co., Ltd., Singapore, Singapore, State Key Laboratory of Computer Science, Institute of Software, Chinese, University of Chinese Academy of Sciences, Beijing, China 链接:https://arxiv.org/abs/2106.15402 摘要:用户与视频的交互是视频推荐的主要数据源。尽管现有的视频推荐方法很多,但用户在视频上的行为,即用户与视频之间的复杂关系,还没有得到充分的研究。在本文中,我们提出了一个名为Sagittarius的模型。Sagittarius采用图卷积神经网络来捕捉用户和视频之间的相互影响。特别是,Sagittarius通过加权来区分不同的用户行为,并将用户行为的语义融合到用户和视频的嵌入中。此外,Sagittarius结合多个优化目标学习用户和视频嵌入,并利用学习到的用户和视频嵌入实现视频推荐。在多个数据集上的实验结果表明,Sagittarius在召回率(recall)、唯一召回率(unique recall)和NDCG方面优于多种最先进的模型。 摘要:Interactions between users and videos are the major data source of performing video recommendation. Despite lots of existing recommendation methods, user behaviors on videos, which imply the complex relations between users and videos, are still far from being fully explored. In the paper, we present a model named Sagittarius. Sagittarius adopts a graph convolutional neural network to capture the influence between users and videos. In particular, Sagittarius differentiates between different user behaviors by weighting and fuses the semantics of user behaviors into the embeddings of users and videos. Moreover, Sagittarius combines multiple optimization objectives to learn user and video embeddings and then achieves the video recommendation by the learned user and video embeddings. The experimental results on multiple datasets show that Sagittarius outperforms several state-of-the-art models in terms of recall, unique recall and NDCG.
【2】 Boggart: Accelerating Retrospective Video Analytics via Model-Agnostic Ingest Processing 标题:Boggart:通过模型无关的摄取处理加速回溯式视频分析
作者:Neil Agarwal,Ravi Netravali 机构:Princeton University 链接:https://arxiv.org/abs/2106.15315 摘要:由于需要考虑大量的帧以及在每帧上运行卷积神经网络(CNN)的高成本,很难对视频数据集上的回溯式查询提供快速响应。一个自然的解决方案是在视频摄取时提前执行一部分必要的计算。然而,现有的摄取时系统需要预先知道未来查询将使用的特定CNN——鉴于CNN体系结构和训练数据集/方法的空间不断增长,这是一个苛刻的前提条件。本文介绍了Boggart,一个以模型无关方式提供摄取时加速的回溯式视频分析系统。我们的基本见解是,传统的计算机视觉(CV)算法能够执行的计算,可用于加速使用各种CNN的广泛查询。在此基础上,Boggart在摄取时审慎地采用多种运动跟踪算法,以识别潜在目标及其跨帧轨迹。然后,在查询时,Boggart使用几种新技术来收集满足目标准确率所需的最小CNN结果样本:(1)一种高效发现CV与CNN输出之间不可避免的差异的聚类策略,以及(2)一组保持精度的传播技术,用于沿每条轨迹安全地扩展采样结果。在大量视频、CNN和查询上,Boggart始终达到准确率目标,且仅在3-54%的帧上有节制地使用CNN。 摘要:Delivering fast responses to retrospective queries on video datasets is difficult due to the large number of frames to consider and the high costs of running convolutional neural networks (CNNs) on each one. A natural solution is to perform a subset of the necessary computations ahead of time, as video is ingested. However, existing ingest-time systems require knowledge of the specific CNN that will be used in future queries -- a challenging requisite given the evergrowing space of CNN architectures and training datasets/methodologies. This paper presents Boggart, a retrospective video analytics system that delivers ingest-time speedups in a model-agnostic manner. Our underlying insight is that traditional computer vision (CV) algorithms are capable of performing computations that can be used to accelerate diverse queries with wide-ranging CNNs. Building on this, at ingest-time, Boggart carefully employs a variety of motion tracking algorithms to identify potential objects and their trajectories across frames. Then, at query-time, Boggart uses several novel techniques to collect the smallest sample of CNN results required to meet the target accuracy: (1) a clustering strategy to efficiently unearth the inevitable discrepancies between CV- and CNN-generated outputs, and (2) a set of accuracy-preserving propagation techniques to safely extend sampled results along each trajectory. Across many videos, CNNs, and queries Boggart consistently meets accuracy targets while using CNNs sparingly (on 3-54% of frames).
【3】 Towards Fast and Accurate Multi-Person Pose Estimation on Mobile Devices 标题:面向移动设备的快速准确的多人位姿估计
作者:Xuan Shen,Geng Yuan,Wei Niu,Xiaolong Ma,Jiexiong Guan,Zhengang Li,Bin Ren,Yanzhi Wang 机构:Northeastern University, College of William & Mary 链接:https://arxiv.org/abs/2106.15304 摘要:随着自主驾驶、异常行为检测和行为识别技术的迅速发展,对基于多人姿态估计的应用提出了越来越高的要求,尤其是在移动平台上。然而,为了获得较高的精度,现有的方法往往具有较大的模型尺寸和复杂的后处理算法,计算量大,端到端延迟长。为了解决这个问题,我们提出了一个架构优化和权值剪枝框架来加速移动设备上多人姿态估计的推理。与典型的轻量级多人姿态估计器相比,我们的优化框架使模型推理速度提高了2.51倍,精度更高。 摘要:The rapid development of autonomous driving, abnormal behavior detection, and behavior recognition makes an increasing demand for multi-person pose estimation-based applications, especially on mobile platforms. However, to achieve high accuracy, state-of-the-art methods tend to have a large model size and complex post-processing algorithm, which costs intense computation and long end-to-end latency. To solve this problem, we propose an architecture optimization and weight pruning framework to accelerate inference of multi-person pose estimation on mobile devices. With our optimization framework, we achieve up to 2.51x faster model inference speed with higher accuracy compared to representative lightweight multi-person pose estimator.
医学相关(1篇)
【1】 Data augmentation for deep learning based accelerated MRI reconstruction with limited data 标题:基于深度学习的有限数据MRI加速重建的数据增强
作者:Zalan Fabian,Reinhard Heckel,Mahdi Soltanolkotabi 备注:27 pages, 19 figures, to be published in ICML2021 链接:https://arxiv.org/abs/2106.14947 摘要:深度神经网络已经成为图像恢复和重建任务中非常成功的工具。这些网络通常是端到端训练的,直接从图像的噪声或损坏的测量值重建图像。为了达到最先进的性能,对大型和多样化的图像集的训练被认为是至关重要的。然而,收集大量的训练图像通常是困难和/或昂贵的。受数据增强(DA)在分类问题上取得成功的启发,本文提出了一种用于MRI加速重建的数据增强管道,并研究了它在各种情况下减少所需训练数据的有效性。我们的DA管道MRAugment专门设计用于利用医学成像测量中存在的不变性,因为忽略问题物理特性的朴素DA策略会失效。通过对多个数据集的广泛研究,我们证明了在低数据区DA可以防止过度拟合,并且可以在使用更少训练数据的同时匹配甚至超越现有技术,而在高数据区DA的回报是递减的。此外,我们的研究结果表明,DA可以提高模型对测试分布的各种变化的鲁棒性。 摘要:Deep neural networks have emerged as very successful tools for image restoration and reconstruction tasks. These networks are often trained end-to-end to directly reconstruct an image from a noisy or corrupted measurement of that image. To achieve state-of-the-art performance, training on large and diverse sets of images is considered critical. However, it is often difficult and/or expensive to collect large amounts of training images. Inspired by the success of Data Augmentation (DA) for classification problems, in this paper, we propose a pipeline for data augmentation for accelerated MRI reconstruction and study its effectiveness at reducing the required training data in a variety of settings. Our DA pipeline, MRAugment, is specifically designed to utilize the invariances present in medical imaging measurements as naive DA strategies that neglect the physics of the problem fail. Through extensive studies on multiple datasets we demonstrate that in the low-data regime DA prevents overfitting and can match or even surpass the state of the art while using significantly fewer training data, whereas in the high-data regime it has diminishing returns. Furthermore, our findings show that DA can improve the robustness of the model against various shifts in the test distribution.
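MRAugment的核心在于增强需与MRI的物理前向模型保持一致:先在图像域做增强,再重新模拟欠采样k空间测量,从而得到物理一致的(网络输入,目标)训练对。下面是一个最小的NumPy示意(以翻转为示例增强,掩码与采样率为示例假设):

```python
import numpy as np

def augment_then_resample(image, mask, rng):
    """Physics-consistent augmentation for accelerated MRI (MRAugment
    flavour): augment the clean image, then re-simulate the measurement
    by masking its k-space, so the (measurement, target) pair stays
    consistent with the forward model."""
    aug = image
    if rng.random() < 0.5:
        aug = np.flip(aug, axis=0)                 # example augmentation
    if rng.random() < 0.5:
        aug = np.flip(aug, axis=1)
    kspace = np.fft.fftshift(np.fft.fft2(aug))     # forward model
    measured = kspace * mask                       # undersampling
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(measured)))
    return zero_filled, aug                        # (network input, target)

rng = np.random.default_rng(0)
img = rng.random((128, 128))
mask = rng.random((128, 128)) < 0.25               # keep ~25% of k-space
x, y = augment_then_resample(img, mask, rng)
print(x.shape, y.shape)
```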
GAN|对抗|攻击|生成相关(9篇)
【1】 Spiking-GAN: A Spiking Generative Adversarial Network Using Time-To-First-Spike Coding 标题:Spiking-GAN:一种使用首脉冲时间编码的脉冲生成对抗网络
作者:Vineet Kotariya,Udayan Ganguly 机构:Department of Electrical Engineering, IIT-Bombay, Mumbai, India 链接:https://arxiv.org/abs/2106.15420 摘要:脉冲神经网络(SNNs)在以高能效的方式解决深度学习问题上显示出巨大潜力。然而,它们仍然局限于简单的分类任务。在本文中,我们提出了第一个基于脉冲的生成对抗网络(GAN)——Spiking-GAN。它采用一种称为首脉冲时间编码(time-to-first-spike coding)的时间编码方案,并使用时域近似反向传播进行训练。我们的网络使用具有极长不应期的简单积分放电(integrate-and-fire,IF)神经元,这确保每个神经元最多发放一个脉冲,使得该模型比基于脉冲发放率的系统稀疏得多。我们改进的时间域损失函数"Aggressive TTFS"与以前的工作相比,将网络的推理时间缩短了33%以上,并使网络中的脉冲数减少了11%以上。实验表明,用该方法在MNIST数据集上训练网络可以生成高质量的样本,从而证明了该框架在脉冲域中解决此类问题的潜力。 摘要:Spiking Neural Networks (SNNs) have shown great potential in solving deep learning problems in an energy-efficient manner. However, they are still limited to simple classification tasks. In this paper, we propose Spiking-GAN, the first spike-based Generative Adversarial Network (GAN). It employs a kind of temporal coding scheme called time-to-first-spike coding. We train it using approximate backpropagation in the temporal domain. We use simple integrate-and-fire (IF) neurons with very high refractory period for our network which ensures a maximum of one spike per neuron. This makes the model much sparser than a spike rate-based system. Our modified temporal loss function called 'Aggressive TTFS' improves the inference time of the network by over 33% and reduces the number of spikes in the network by more than 11% compared to previous works. Our experiments show that on training the network on the MNIST dataset using this approach, we can generate high quality samples. Thereby demonstrating the potential of this framework for solving such problems in the spiking domain.
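首脉冲时间编码的思想是:每个输入至多发放一个脉冲,输入越强发放越早。下面用NumPy给出一个线性延迟编码的最小示意(具体映射形式为示例假设,可能与论文采用的编码不同):

```python
import numpy as np

def time_to_first_spike(intensities, t_max=100):
    """Time-to-first-spike coding: each input emits at most one spike,
    and stronger inputs fire earlier. A simple linear latency code is
    assumed here; the exact mapping in the paper may differ."""
    x = np.clip(intensities, 0.0, 1.0)
    spike_times = np.round((1.0 - x) * t_max).astype(int)
    return spike_times   # one spike time per input; t_max = weakest input

pixels = np.array([0.0, 0.25, 0.9, 1.0])   # e.g., normalized pixel values
print(time_to_first_spike(pixels))          # [100  75  10   0]
```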
【2】 Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization 标题:利用基于深度学习的人类数字化的高效真实感数据生成框架
作者:C. Symeonidis,P. Nousi,P. Tosidis,K. Tsampazis,N. Passalis,A. Tefas,N. Nikolaidis 机构:Artificial Intelligence and Information Analysis Lab, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece 链接:https://arxiv.org/abs/2106.15409 摘要:有监督深度学习算法的性能在很大程度上取决于用于训练的数据的规模、质量和多样性。收集和手动注释大量数据既费时又费钱。在与视觉人类中心感知相关的任务中,由于有关隐私的立法,此类数据的收集和分发也可能面临限制。此外,复杂系统的设计和测试,例如机器人,通常采用基于深度学习的感知模型,可能会面临严重的困难,因为即使是在真实和大规模数据集上训练的最新方法也不能始终充分发挥作用,因为它们不能适应虚拟世界和真实世界数据之间的视觉差异。为了解决和减轻这些问题的影响,我们提出了一种自动生成具有注释的真实合成数据的方法,用于a)人员检测、b)人脸识别和c)人体姿势估计。该方法以真实背景图像为输入,用不同姿态的人像填充。我们不使用手工制作的三维人体模型,而是建议使用通过深度学习方法生成的模型,进一步降低数据集的创建成本,同时保持较高的真实感。此外,我们还提供了开源且易于使用的工具来实现拟议的管道,允许为各种任务生成高度真实的合成数据集。在相应的任务中进行的基准测试和评估表明,合成数据可以有效地作为真实数据的补充。 摘要:The performance of supervised deep learning algorithms depends significantly on the scale, quality and diversity of the data used for their training. Collecting and manually annotating large amount of data can be both time-consuming and costly tasks to perform. In the case of tasks related to visual human-centric perception, the collection and distribution of such data may also face restrictions due to legislation regarding privacy. In addition, the design and testing of complex systems, e.g., robots, which often employ deep learning-based perception models, may face severe difficulties as even state-of-the-art methods trained on real and large-scale datasets cannot always perform adequately as they have not adapted to the visual differences between the virtual and the real world data. As an attempt to tackle and mitigate the effect of these issues, we present a method that automatically generates realistic synthetic data with annotations for a) person detection, b) face recognition, and c) human pose estimation. The proposed method takes as input real background images and populates them with human figures in various poses. Instead of using hand-made 3D human models, we propose the use of models generated through deep learning methods, further reducing the dataset creation costs, while maintaining a high level of realism. In addition, we provide open-source and easy to use tools that implement the proposed pipeline, allowing for generating highly-realistic synthetic datasets for a variety of tasks. A benchmarking and evaluation in the corresponding tasks shows that synthetic data can be effectively used as a supplement to real data.
【3】 Multi-stage Optimization based Adversarial Training 标题:基于多阶段优化的对抗性训练
作者:Xiaosen Wang,Chuanbiao Song,Liwei Wang,Kun He 机构: School of Computer Science and Technology, Huazhong University of Science and Technology, School of Electronics Engineering and Computer Sciences, Peking University 备注:13 pages 链接:https://arxiv.org/abs/2106.15357 摘要:在对抗鲁棒性领域,通常采用单步对抗训练来快速建立对抗鲁棒性模型。然而,单步对抗性训练极有可能导致灾难性的过度拟合,因为经过几个训练周期后,很难产生强大的对抗性例子来不断提高对抗的鲁棒性。在这项工作中,我们通过在单步对抗训练中引入多步对抗例子来避免灾难性的过度拟合。然后,为了平衡生成多步对抗性示例所需的大量训练开销,我们提出了一种基于多阶段优化的对抗性训练(MOAT)方法,该方法对混合良性示例、单步对抗性示例和多步对抗性示例进行阶段性训练。这样,模型的总体训练开销大大减少,同时避免了灾难性的过度拟合。在CIFAR-10和CIFAR-100数据集上的大量实验表明,在相同的训练开销下,该方法比单步或多步对抗训练方法具有更好的鲁棒性。 摘要:In the field of adversarial robustness, there is a common practice that adopts the single-step adversarial training for quickly developing adversarially robust models. However, the single-step adversarial training is most likely to cause catastrophic overfitting, as after a few training epochs it will be hard to generate strong adversarial examples to continuously boost the adversarial robustness. In this work, we aim to avoid the catastrophic overfitting by introducing multi-step adversarial examples during the single-step adversarial training. Then, to balance the large training overhead of generating multi-step adversarial examples, we propose a Multi-stage Optimization based Adversarial Training (MOAT) method that periodically trains the model on mixed benign examples, single-step adversarial examples, and multi-step adversarial examples stage by stage. In this way, the overall training overhead is reduced significantly, meanwhile, the model could avoid catastrophic overfitting. Extensive experiments on CIFAR-10 and CIFAR-100 datasets demonstrate that under similar amount of training overhead, the proposed MOAT exhibits better robustness than either single-step or multi-step adversarial training methods.
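下面的PyTorch示意给出MOAT"分阶段混合良性、单步与多步对抗样本"思想的一种可能实现:按周期在良性样本、FGSM单步样本与PGD多步样本之间轮换,使多数batch保持低开销,同时周期性的多步阶段抑制灾难性过拟合。阶段调度(period=3)与攻击超参数均为示例假设,并非论文的确切配置:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step adversarial example."""
    x = x.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Multi-step adversarial example (PGD with L-inf projection)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def moat_loss(model, x, y, epoch, eps=8/255, alpha=2/255, steps=7, period=3):
    """Stage-wise batch construction in the MOAT spirit: cycle through
    benign, single-step and multi-step examples so most epochs stay
    cheap while periodic multi-step epochs counter catastrophic
    overfitting. The schedule is an illustrative assumption."""
    stage = epoch % period
    if stage == 0:
        x_in = x                                      # benign stage
    elif stage == 1:
        x_in = fgsm(model, x, y, eps)                 # cheap single-step stage
    else:
        x_in = pgd(model, x, y, eps, alpha, steps)    # strong multi-step stage
    return F.cross_entropy(model(x_in), y)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
for epoch in range(3):
    print(epoch, moat_loss(model, x, y, epoch).item())
```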
【4】 Image Inpainting Using Wasserstein Generative Adversarial Imputation Network 标题:基于Wasserstein生成性对抗性补偿网络的图像修复
作者:Daniel Vašata,Tomáš Halama,Magda Friedjungová 机构:Czech Technical University in Prague, Prague, Czech Republic 备注:To be published in conference proceedings of ICANN 2021 链接:https://arxiv.org/abs/2106.15341 摘要:图像修复是计算机视觉中的一项重要任务,它主要解决图像中缺失区域的重建问题。本文的目的是介绍一种基于Wasserstein生成对抗性插补网络的图像修复模型。模型的生成网络使用具有不同膨胀率的卷积层的构建块,以及帮助模型再现输出的精细细节的跳过连接。这种结合产生了一个通用的插补模型,能够以足够的质量处理各种情况下的缺失。为了在实验上证明这一点,我们同时训练模型来处理三种情况:随机丢失像素,丢失各种较小的正方形区域,以及在图像中心丢失一个正方形。结果表明,我们的模型在所有场景下都能获得高质量的修复结果。使用峰值信噪比和结构相似性指数对两个真实基准数据集CelebA faces和Paris StreetView的性能进行了评估。我们的模型的结果与双调和插补和其他一些最先进的图像修复方法进行了比较。 摘要:Image inpainting is one of the important tasks in computer vision which focuses on the reconstruction of missing regions in an image. The aim of this paper is to introduce an image inpainting model based on Wasserstein Generative Adversarial Imputation Network. The generator network of the model uses building blocks of convolutional layers with different dilation rates, together with skip connections that help the model reproduce fine details of the output. This combination yields a universal imputation model that is able to handle various scenarios of missingness with sufficient quality. To show this experimentally, the model is simultaneously trained to deal with three scenarios given by missing pixels at random, missing various smaller square regions, and one missing square placed in the center of the image. It turns out that our model achieves high-quality inpainting results on all scenarios. Performance is evaluated using peak signal-to-noise ratio and structural similarity index on two real-world benchmark datasets, CelebA faces and Paris StreetView. The results of our model are compared to biharmonic imputation and to some of the other state-of-the-art image inpainting methods.
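下面用PyTorch给出"不同膨胀率卷积构建块+跳跃连接"的一个示意模块:并联的膨胀卷积分支扩大感受野以覆盖缺失区域之外的上下文,跳跃连接保留精细细节。通道数与膨胀率均为示例假设,并非论文生成器的具体配置:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Building block with parallel convolutions at several dilation
    rates plus a skip connection, in the spirit of the WGAIN generator:
    dilated branches grow the receptive field to reach far beyond the
    hole, while the skip path keeps fine local detail."""
    def __init__(self, ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates
        )
        self.merge = nn.Conv2d(ch * len(rates), ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return x + self.merge(y)   # skip connection preserves detail

block = DilatedBlock(32)
print(block(torch.randn(1, 32, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```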
【5】 SE-MD: A Single-encoder multiple-decoder deep network for point cloud generation from 2D images 标题:SE-MD:一种用于从二维图像生成点云的单编码器多解码器深度网络
作者:Abdul Mueed Hafiz,Rouf Ul Alam Bhat,Shabir Ahmad Parah,M. Hassaballah 链接:https://arxiv.org/abs/2106.15325 摘要:从单个二维RGB图像生成三维模型是一项具有挑战性的计算机视觉研究课题。针对同一问题,已经提出了使用传统网络体系结构的各种技术。然而,目前的研究工作还很有限,存在着各种各样的问题,如使用低效的三维表示格式、弱的三维模型生成主干、无法生成稠密点云、稠密点云生成的后处理依赖性以及RGB图像中轮廓的依赖性。本文提出了一种新的二维RGB图像到点云的转换技术,该技术利用网络结构中的并行化概念,以其高效、健壮和简单的模型改进了该领域的研究现状。它不仅利用了点云的高效和丰富的三维表示,而且利用了一种新颖而健壮的点云生成主干来解决当前普遍存在的问题。这涉及使用单个编码器-多解码器深度网络架构,其中每个解码器生成特定的固定视点。然后融合所有视点生成密集点云。对该技术进行了各种实验,并将其性能与其它先进技术进行了比较,取得了显著的效果。代码位于https://github.com/mueedhafiz1982/ 摘要:3D model generation from single 2D RGB images is a challenging and actively researched computer vision task. Various techniques using conventional network architectures have been proposed for the same. However, the body of research work is limited and there are various issues like using inefficient 3D representation formats, weak 3D model generation backbones, inability to generate dense point clouds, dependence of post-processing for generation of dense point clouds, and dependence on silhouettes in RGB images. In this paper, a novel 2D RGB image to point cloud conversion technique is proposed, which improves the state of art in the field due to its efficient, robust and simple model by using the concept of parallelization in network architecture. It not only uses the efficient and rich 3D representation of point clouds, but also uses a novel and robust point cloud generation backbone in order to address the prevalent issues. This involves using a single-encoder multiple-decoder deep network architecture wherein each decoder generates certain fixed viewpoints. This is followed by fusing all the viewpoints to generate a dense point cloud. Various experiments are conducted on the technique and its performance is compared with those of other state of the art techniques and impressive gains in performance are demonstrated. Code is available at https://github.com/mueedhafiz1982/
【6】 Cascaded Diffusion Models for High Fidelity Image Generation 标题:用于高保真图像生成的级联扩散模型
作者:Jonathan Ho,Chitwan Saharia,William Chan,David J. Fleet,Mohammad Norouzi,Tim Salimans 机构:Google Research 链接:https://arxiv.org/abs/2106.15282 摘要:我们证明了级联扩散模型能够在类条件ImageNet生成挑战上生成高保真图像,而不需要任何辅助图像分类器的帮助来提高样本质量。级联扩散模型包括多个扩散模型的管道,这些扩散模型生成分辨率不断提高的图像,首先是最低分辨率的标准扩散模型,然后是一个或多个超分辨率扩散模型,这些模型依次对图像进行上采样并添加更高分辨率的细节。我们发现级联管道的样本质量关键依赖于条件增强(conditioning augmentation),即我们提出的对输入超分辨率模型的低分辨率条件数据进行数据增强的方法。我们的实验表明,条件增强可以防止级联模型采样过程中的误差累积,帮助我们训练出在64x64、128x128和256x256分辨率下FID分数分别达到1.48、3.52和4.88的级联管道,优于BigGAN-deep。 摘要:We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation challenge, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep.
【7】 SDL: New data generation tools for full-level annotated document layout 标题:SDL:用于全级注释文档布局的新数据生成工具
作者:Son Nguyen Truong 机构:School of Environment and Society, Tokyo Institute of Technology, Tokyo 链接:https://arxiv.org/abs/2106.15117 摘要:我们提出了一种新的文档处理数据生成工具。该工具侧重于在普通类型文档中提供最大级别的视觉信息,范围从字符位置到段落级别的位置。它还支持在低资源语言上使用大型数据集,并提供一种处理文档文本完整层级信息的方法。这些数据生成工具附带了一个32万张越南语合成文档图像的数据集,以及在其他语言中生成类似规模数据集的说明。存储库位于:https://github.com/tson1997/SDL-Document-Image-Generation 摘要:We present a novel data generation tool for document processing. The tool focuses on providing a maximal level of visual information in a normal type document, ranging from character position to paragraph-level position. It also enables working with a large dataset on low-resource languages as well as providing a mean of processing thorough full-level information of the documented text. The data generation tools come with a dataset of 320000 Vietnamese synthetic document images and an instruction to generate a dataset of similar size in other languages. The repository can be found at: https://github.com/tson1997/SDL-Document-Image-Generation
【8】 Constructing Forest Biomass Prediction Maps from Radar Backscatter by Sequential Regression with a Conditional Generative Adversarial Network 标题:基于条件生成对抗网络的序贯回归雷达后向散射构建森林生物量预测图
作者:Sara Björk,Stian Normann Anfinsen,Erik Næsset,Terje Gobakken,Eliakimu Zahabu 机构: Norwegian University of Life Sciences 链接:https://arxiv.org/abs/2106.15020 摘要:研究了利用合成孔径雷达(SAR)强度图像构建地上生物量(AGB)预测图的方法。其目的是改进传统的基于SAR强度的回归模型,用有限的AGB原位测量数据进行训练。虽然采集成本很高,但机载激光扫描(ALS)传感器的数据与AGB高度相关。因此,我们建议使用基于ALS数据的AGB预测作为SAR数据的替代响应变量,并采用顺序建模的方式。这大大增加了训练数据量。为了模拟SAR强度与ALS预测AGB之间的回归函数,我们建议使用条件生成对抗网络(cGAN),即Pix2Pix卷积神经网络。这使得现有的基于ALS的AGB预测图的重建成为可能。所生成的综合基于ALS的AGB预测与从同一区域训练的传统非序列回归模型中检索到的基于ALS的AGB预测进行了定性和定量评估。结果表明,所提出的体系结构能够捕获实际数据的特征。这表明使用ALS引导的生成模型是从SAR强度预测AGB的一个很有前途的途径。对这一领域的进一步研究有可能提供大规模和低成本的AGB预测。 摘要:This paper studies construction of above-ground biomass (AGB) prediction maps from synthetic aperture radar (SAR) intensity images. The purpose is to improve traditional regression models based on SAR intensity, trained with a limited amount of AGB in situ measurements. Although it is costly to collect, data from airborne laser scanning (ALS) sensors are highly correlated with AGB. Therefore, we propose using AGB predictions based on ALS data as surrogate response variables for SAR data in a sequential modelling fashion. This increases the amount of training data dramatically. To model the regression function between SAR intensity and ALS-predicted AGB we propose to utilise a conditional generative adversarial network (cGAN), i.e. the Pix2Pix convolutional neural network. This enables the recreation of existing ALS-based AGB prediction maps. The generated synthesised ALS-based AGB predictions are evaluated qualitatively and quantitatively against ALS-based AGB predictions retrieved from a traditional non-sequential regression model trained in the same area. Results show that the proposed architecture manages to capture characteristics of the actual data. This suggests that the use of ALS-guided generative models is a promising avenue for AGB prediction from SAR intensity. Further research on this area has the potential of providing both large-scale and low-cost predictions of AGB.
【9】 Are conditional GANs explicitly conditional? 标题:条件GAN是显式条件化的吗?
作者:Houssem-eddine Boulahbal,Adrian Voicila,Andrew Comport 机构:Renault Software Labs, CNRS-I3S, Sophia Antipolis University 链接:https://arxiv.org/abs/2106.15011 摘要:本文提出了条件生成对抗网络(cGANs)的两个重要贡献,以改进利用该体系结构的各种应用。第一个主要贡献是对cGAN的分析,以表明它们并不是显式条件化的。特别地,我们将证明鉴别器乃至整个cGAN不会自动学习输入之间的条件关系。第二个贡献是一种称为acontrario的新方法,它通过一种新的acontrario损失,为对抗体系结构的两个部分显式地建模条件性,该损失通过训练鉴别器来学习无条件(不利)示例。这带来了一种新型的GAN数据扩充方法(acontrario学习),它允许使用不利示例将生成器的搜索空间限制为条件输出。我们通过提出一种概率分布分析,开展了大量实验来评估鉴别器的条件性。在语义图像合成、图像分割和单目深度预测等不同应用上与cGAN体系结构的比较表明,在众所周知的数据集上,按多种度量衡量的性能均有显著提升,这些度量包括Fréchet Inception距离(FID)、平均交并比(mIoU)、对数均方根误差(RMSE log)和统计差异箱数(NDB)。 摘要:This paper proposes two important contributions for conditional Generative Adversarial Networks (cGANs) to improve the wide variety of applications that exploit this architecture. The first main contribution is an analysis of cGANs to show that they are not explicitly conditional. In particular, it will be shown that the discriminator and subsequently the cGAN does not automatically learn the conditionality between inputs. The second contribution is a new method, called acontrario, that explicitly models conditionality for both parts of the adversarial architecture via a novel acontrario loss that involves training the discriminator to learn unconditional (adverse) examples. This leads to a novel type of data augmentation approach for GANs (acontrario learning) which allows to restrict the search space of the generator to conditional outputs using adverse examples. Extensive experimentation is carried out to evaluate the conditionality of the discriminator by proposing a probability distribution analysis. Comparisons with the cGAN architecture for different applications show significant improvements in performance on well known datasets including, semantic image synthesis, image segmentation and monocular depth prediction using different metrics including Fréchet Inception Distance (FID), mean Intersection over Union (mIoU), Root Mean Square Error log (RMSE log) and Number of statistically-Different Bins (NDB).
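acontrario 损失的要点是让鉴别器把"真实图像 + 错误条件"的错配对也判为假,从而被迫显式学习条件性。下面是一个判别器损失的示意草图(PyTorch;配对方式与各项权重均为假设,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def d_loss_acontrario(D, x_real, c_real, x_fake, c_fake, c_mismatch):
    bce = F.binary_cross_entropy_with_logits
    s_real = D(x_real, c_real)        # 真实图像 + 正确条件 -> 判为真
    s_fake = D(x_fake, c_fake)        # 生成图像 + 其条件   -> 判为假
    s_adv  = D(x_real, c_mismatch)    # 真实图像 + 错误条件 -> 判为假(acontrario 样本)
    return (bce(s_real, torch.ones_like(s_real))
            + bce(s_fake, torch.zeros_like(s_fake))
            + bce(s_adv, torch.zeros_like(s_adv)))
```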
自动驾驶|车辆|车道检测等(1篇)
【1】 Autonomous Driving Implementation in an Experimental Environment 标题:自主驾驶在实验环境中的实现
作者:Namig Aliyev,Oguzhan Sezer,Mehmet Turan Guzel 机构:Department of Computer Engineering, Sakarya University 备注:8 pages, 21 figures. This is a bachelor's thesis research report and was supported by the Scientific and Technological Research Council of Turkey 链接:https://arxiv.org/abs/2106.15274 摘要:自主系统需要识别环境,而要将其安全地付诸实践还有很长的路要走。在自动驾驶系统中,障碍物和红绿灯的检测与车道跟踪同样重要。在本研究中,我们开发了一套自主驾驶系统,并在设计的实验环境中进行了测试。在该系统中,采用带摄像头的模型车进行车道跟踪和避障实验,研究自主驾驶行为。训练卷积神经网络模型进行车道跟踪。针对车辆避障,分别建立了拐角检测、光流、扩展焦点、碰撞时间、平衡计算和决策机制。 摘要:Autonomous systems require identifying the environment and it has a long way to go before putting it safely into practice. In autonomous driving systems, the detection of obstacles and traffic lights are of importance as well as lane tracking. In this study, an autonomous driving system is developed and tested in the experimental environment designed for this purpose. In this system, a model vehicle having a camera is used to trace the lanes and avoid obstacles to experimentally study autonomous driving behavior. Convolutional Neural Network models were trained for Lane tracking. For the vehicle to avoid obstacles, corner detection, optical flow, focus of expansion, time to collision, balance calculation, and decision mechanism were created, respectively.
OCR|文本相关(1篇)
【1】 Text Prior Guided Scene Text Image Super-resolution 标题:文本先验引导的场景文本图像超分辨率
作者:Jianqi Ma,Shi Guo,Lei Zhang 机构:Dept. of Computing, The Hong Kong Polytechnic University 链接:https://arxiv.org/abs/2106.15368 摘要:场景文本图像超分辨率(STISR)旨在提高低分辨率(LR)场景文本图像的分辨率和视觉质量,从而提高文本识别的性能。然而,现有的STISR方法大多将文本图像视为自然场景图像,忽略了文本的类别信息。在本文中,我们做了一个富有启发性的尝试,将文本类别先验嵌入到STISR模型训练中。具体地说,我们采用字符概率序列作为文本先验,它可以方便地从文本识别模型中获得。文本先验提供类别指导,以恢复高分辨率(HR)文本图像;另一方面,重构后的HR图像又能反过来细化文本先验。最后,我们提出了一个用于STISR的多阶段文本先验引导超分辨率(TPGSR)框架。在基准TextZoom数据集上的实验表明,TPGSR不仅能有效地提高场景文本图像的视觉质量,而且比现有的STISR方法显著提高了文本识别的准确率。在TextZoom上训练的模型对其它数据集中的LR图像也具有一定的泛化能力。 摘要:Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images, and consequently boost the performance of text recognition. However, most of existing STISR methods regard text images as natural scene images, ignoring the categorical information of text. In this paper, we make an inspiring attempt to embed categorical text prior into STISR model training. Specifically, we adopt the character probability sequence as the text prior, which can be obtained conveniently from a text recognition model. The text prior provides categorical guidance to recover high-resolution (HR) text images. On the other hand, the reconstructed HR image can refine the text prior in return. Finally, we present a multi-stage text prior guided super-resolution (TPGSR) framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR can not only effectively improve the visual quality of scene text images, but also significantly improve the text recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates certain generalization capability to the LR images in other datasets.
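文本先验与重建结果交替细化的多阶段思路可以示意如下(PyTorch;识别器接口、先验投影方式与阶段数均为假设,非论文官方实现):

```python
import torch
import torch.nn as nn

class TextPriorGuidedSR(nn.Module):
    """多阶段示意:识别器输出的字符概率序列作为先验,引导超分网络;重建结果再更新先验。"""
    def __init__(self, sr_net, recognizer, num_classes=37, feat_dim=64):
        super().__init__()
        self.sr_net, self.recognizer = sr_net, recognizer
        self.prior_proj = nn.Conv1d(num_classes, feat_dim, kernel_size=1)

    def forward(self, lr_img, stages=3):
        sr = lr_img
        for _ in range(stages):
            with torch.no_grad():
                prior = self.recognizer(sr).softmax(-1)           # (B, L, C) 字符概率序列
            prior_feat = self.prior_proj(prior.transpose(1, 2))   # (B, D, L)
            sr = self.sr_net(lr_img, prior_feat)                  # 以文本先验为引导重建 HR 图像
        return sr
```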
Attention注意力(2篇)
【1】 Soft Attention: Does it Actually Help to Learn Social Interactions in Pedestrian Trajectory Prediction? 标题:软注意:在行人轨迹预测中学习社会互动真的有帮助吗?
作者:Laurent Boucaud,Daniel Aloise,Nicolas Saunier 机构:Department of Computer Engineering 备注:This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 链接:https://arxiv.org/abs/2106.15321 摘要:我们考虑利用行人自身的运动历史和周围行人的运动历史(称为社会信息)预测行人未来路径的问题。自从关于Social-LSTM的开创性论文发表以来,深度学习已经成为建模社会互动对行人运动影响的主要工具。这些模型能够学习社会互动的论证依赖于对这些模型的消融研究:通过两个标准度量,即平均位移误差和最终位移误差,比较有无社会交互模块的模型。然而,这些复杂的模型最近被简单的匀速方法所超越。这让人质疑它们是否真的建模了社会互动,也让人质疑上述论证的有效性。本文聚焦于使用软注意机制进行社会互动建模的深度学习模型,并研究其在预测时是否使用了社会信息。我们在ETH和UCY数据集上对四种最先进的方法进行了两个实验,这些数据集在以前的工作中也使用过。首先,用随机噪声代替社会信息对模型进行训练,并与用实际社会信息训练的模型进行比较。第二,我们使用门控机制和$L_0$惩罚,允许模型关闭其内部组件。这些模型一致地学会剪除其软注意机制。在两个实验中,收敛过程和预测性能都没有改变。这说明模型忽略了软注意机制和社会信息。 摘要:We consider the problem of predicting the future path of a pedestrian using its motion history and the motion history of the surrounding pedestrians, called social information. Since the seminal paper on Social-LSTM, deep-learning has become the main tool used to model the impact of social interactions on a pedestrian's motion. The demonstration that these models can learn social interactions relies on an ablative study of these models. The models are compared with and without their social interactions module on two standard metrics, the Average Displacement Error and Final Displacement Error. Yet, these complex models were recently outperformed by a simple constant-velocity approach. This questions if they actually allow to model social interactions as well as the validity of the proof. In this paper, we focus on the deep-learning models with a soft-attention mechanism for social interaction modeling and study whether they use social information at prediction time. We conduct two experiments across four state-of-the-art approaches on the ETH and UCY datasets, which were also used in previous work. First, the models are trained by replacing the social information with random noise and compared to model trained with actual social information. Second, we use a gating mechanism along with a $L_0$ penalty, allowing models to shut down their inner components. The models consistently learn to prune their soft-attention mechanism. For both experiments, neither the course of the convergence nor the prediction performance were altered. This demonstrates that the soft-attention mechanism and therefore the social information are ignored by the models.
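文中"门控机制 + $L_0$ 惩罚"的一种常见实现是 hard-concrete 松弛。下面是一个简化草图(PyTorch;超参数取该松弛的常用默认值,属假设,未必与论文设置一致):

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """可学习门控:z 趋于 0 即模型"关闭"软注意分支;l0_penalty() 加入总损失。"""
    def __init__(self, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, attn_out, residual):
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0, 1)
        return z * attn_out + residual

    def l0_penalty(self):
        # 门取非零值的概率,作为 L0 正则项
        return torch.sigmoid(self.log_alpha
                             - self.beta * math.log(-self.gamma / self.zeta))
```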
【2】 Towards Understanding the Effectiveness of Attention Mechanism 标题:走向理解注意机制的有效性
作者:Xiang Ye,Zihang He,Heng Wang,Yong Li 机构:School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 链接:https://arxiv.org/abs/2106.15067 摘要:注意机制是提高卷积神经网络(CNN)在计算机视觉任务中性能的一种广泛应用的方法。尽管其应用普遍,我们对其效力的来源仍缺乏了解。人们普遍认为,它的有效性来源于视觉注意解释,即主张将注意力集中在输入数据的重要部分,而不是将输入数据全部摄取。在本文中,我们发现特征的注意权重与其重要性之间只有微弱的一致性。相反,我们验证了特征图相乘在注意机制中的关键作用,并揭示了特征图相乘对CNN所学景观的根本影响:由于特征图相乘带来的高阶非线性,它对CNN起到了正则化作用,使它们与普通(vanilla)CNN相比,在真实样本附近学习到更平滑、更稳定的景观。这种平滑性和稳定性使得CNN在真实样本之间的行为更具预测性和稳定性,从而使CNN的生成效果更好。此外,受特征图相乘有效性的启发,我们设计了特征图相乘网络(FMMNet),简单地用特征图相乘代替ResNet中的特征图相加。FMMNet在各种数据集上的性能都优于ResNet,这表明即使没有现有方法中精心设计的注意机制,特征图相乘对提高性能也起着至关重要的作用。 摘要:Attention Mechanism is a widely used method for improving the performance of convolutional neural networks (CNNs) on computer vision tasks. Despite its pervasiveness, we have a poor understanding of what its effectiveness stems from. It is popularly believed that its effectiveness stems from the visual attention explanation, advocating focusing on the important part of input data rather than ingesting the entire input. In this paper, we find that there is only a weak consistency between the attention weights of features and their importance. Instead, we verify the crucial role of feature map multiplication in attention mechanism and uncover a fundamental impact of feature map multiplication on the learned landscapes of CNNs: with the high order non-linearity brought by the feature map multiplication, it played a regularization role on CNNs, which made them learn smoother and more stable landscapes near real samples compared to vanilla CNNs. This smoothness and stability induce a more predictive and stable behavior in-between real samples, and make CNNs generate better. Moreover, motivated by the proposed effectiveness of feature map multiplication, we design feature map multiplication network (FMMNet) by simply replacing the feature map addition in ResNet with feature map multiplication. FMMNet outperforms ResNet on various datasets, and this indicates that feature map multiplication plays a vital role in improving the performance even without finely designed attention mechanism in existing methods.
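FMMNet 的核心改动可以用一个残差块示意(PyTorch;块内层数与归一化方式为假设):

```python
import torch.nn as nn

class FMMBlock(nn.Module):
    """把 ResNet 残差块末尾的特征图相加替换为特征图相乘。"""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x * self.body(x)   # ResNet 中此处为 x + self.body(x)
```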
人脸|人群计数(1篇)
【1】 Deep Learning for Face Anti-Spoofing: A Survey 标题:深度学习在人脸反欺骗中的研究进展
作者:Zitong Yu,Yunxiao Qin,Xiaobai Li,Chenxu Zhao,Zhen Lei,Guoying Zhao 机构:Northwestern Polytechnical University 备注:submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 链接:https://arxiv.org/abs/2106.14948 摘要:人脸反欺骗(FAS)由于在防止人脸识别系统遭受表示攻击(PAs)中的重要作用,近年来受到了越来越多的关注。随着越来越多具有新颖类型的真实PAs的出现,传统的基于手工特征的FAS方法由于其有限的表示能力而变得不可靠。近十年来,随着大规模学术数据集的出现,基于深度学习的FAS在这一领域取得了令人瞩目的成绩并占据主导地位。然而,该领域现有的综述主要集中在手工设计的特征上,这些内容已经过时,对FAS社区的发展缺乏启发。在这篇论文中,为了促进未来的研究,我们首次全面回顾了基于深度学习的FAS的最新进展。它涵盖了几个新颖而有见地的组成部分:1)除了使用二进制标签进行监督(例如,"0"表示真实,而"1"表示PAs)之外,我们还研究了最近使用像素级监督的方法(例如,伪深度图);2)除了传统的数据集内评价方法外,我们还收集和分析了专门为领域泛化和开集FAS设计的最新方法;3)除商用RGB相机外,本文还总结了在多模态(如深度和红外)或专用(如光场和闪光)传感器下的深度学习应用。我们通过强调当前存在的问题和突出潜在的前景来结束这项调查。 摘要:Face anti-spoofing (FAS) has lately attracted increasing attention due to its vital role in securing face recognition systems from presentation attacks (PAs). As more and more realistic PAs with novel types spring up, traditional FAS methods based on handcrafted features become unreliable due to their limited representation capacity. With the emergence of large-scale academic datasets in the recent decade, deep learning based FAS achieves remarkable performance and dominates this area. However, existing reviews in this field mainly focus on the handcrafted features, which are outdated and uninspiring for the progress of FAS community. In this paper, to stimulate future research, we present the first comprehensive review of recent advances in deep learning based FAS. It covers several novel and insightful components: 1) besides supervision with binary label (e.g., '0' for bonafide vs. '1' for PAs), we also investigate recent methods with pixel-wise supervision (e.g., pseudo depth map); 2) in addition to traditional intra-dataset evaluation, we collect and analyze the latest methods specially designed for domain generalization and open-set FAS; and 3) besides commercial RGB camera, we summarize the deep learning applications under multi-modal (e.g., depth and infrared) or specialized (e.g., light field and flash) sensors. We conclude this survey by emphasizing current open issues and highlighting potential prospects.
表征学习(2篇)
【1】 Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model 标题:TextVQA挑战赛2021的获胜者团队Mia:使用预先训练的序列到序列模型学习视觉和语言表征
作者:Yixuan Qiao,Hao Chen,Jun Wang,Yihao Chen,Xianbin Ye,Ziliang Li,Xianbiao Qi,Peng Gao,Guotong Xie 机构:SFE Deeplearning Platform, Ping An Health Technology, Beijing, China; Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China 备注:Winner of TextVQA 2021 链接:https://arxiv.org/abs/2106.15332 摘要:TextVQA要求模型阅读并推理图像中的文本,以回答关于它们的问题。具体来说,模型需要引入图像中出现的文本这一新模态,并在其上进行推理来回答TextVQA问题。在本次挑战赛中,我们将生成式模型T5用于TextVQA任务。基于HuggingFace存储库中预训练好的检查点T5-3B,我们设计了另外两个预训练任务,即掩码语言建模(MLM)和相对位置预测(RPP),以更好地对齐对象特征和场景文本。在预训练阶段,编码器负责融合问题文本、对象文本标签、场景文本标签、对象视觉特征、场景视觉特征等多种模态;之后,解码器逐步生成文本序列,默认采用交叉熵损失。我们在预训练中使用大规模场景文本数据集,然后仅使用TextVQA数据集对T5-3B进行微调。 摘要:TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. In this challenge, we use generative model T5 for TextVQA task. Based on pre-trained checkpoint T5-3B from HuggingFace repository, two other pre-training tasks including masked language modeling(MLM) and relative position prediction(RPP) are designed to better align object feature and scene text. In the stage of pre-training, encoder is dedicate to handle the fusion among multiple modalities: question text, object text labels, scene text labels, object visual features, scene visual features. After that decoder generates the text sequence step-by-step, cross entropy loss is required by default. We use a large-scale scene text dataset in pre-training and then fine-tune the T5-3B with the TextVQA dataset only.
【2】 Open-Set Representation Learning through Combinatorial Embedding 标题:基于组合嵌入的开集表示学习
作者:Geeho Kim,Bohyung Han 机构:Computer Vision Lab. & ASRI, Seoul National University 备注:12 pages, 4 figures 链接:https://arxiv.org/abs/2106.15278 摘要:视觉识别任务通常仅限于处理一小部分类,原因很简单:其余类的标签不可用。我们感兴趣的是,基于有标签类和无标签类的实例进行表示学习,从而识别数据集中的新概念,并将识别范围扩展到已知类和新类。为了解决这个具有挑战性的任务,我们提出了一种组合学习方法,该方法利用多个有监督的元分类器在异构标签空间上给出的组合知识,自然地将示例聚类到未见类中。我们还引入了一种度量学习策略来估计成对伪标签,以改进未标记示例的表示,它有效地保留了已知类和新类之间的语义关系。该算法通过联合优化来发现新概念:既增强未见类的可判别性,又学习可泛化到新类的已知类表示。我们的大量实验表明,该方法在多个图像检索和新类发现基准测试上均取得了显著的性能提升。 摘要:Visual recognition tasks are often limited to dealing with a small subset of classes simply because the labels for the remaining classes are unavailable. We are interested in identifying novel concepts in a dataset through representation learning based on the examples in both labeled and unlabeled classes, and extending the horizon of recognition to both known and novel classes. To address this challenging task, we propose a combinatorial learning approach, which naturally clusters the examples in unseen classes using the compositional knowledge given by multiple supervised meta-classifiers on heterogeneous label spaces. We also introduce a metric learning strategy to estimate pairwise pseudo-labels for improving representations of unlabeled examples, which preserves semantic relations across known and novel classes effectively. The proposed algorithm discovers novel concepts via a joint optimization of enhancing the discriminativeness of unseen classes as well as learning the representations of known classes generalizable to novel ones. Our extensive experiments demonstrate remarkable performance gains by the proposed approach in multiple image retrieval and novel class discovery benchmarks.
噪声标签Label Noise(1篇)
【1】 How Does Heterogeneous Label Noise Impact Generalization in Neural Nets? 标题:异构标签噪声如何影响神经网络中的泛化?
作者:Bidur Khanal,Christopher Kanan 机构:Rochester Institute of Technology, Paige, Cornell Tech 链接:https://arxiv.org/abs/2106.15475 摘要:在现实世界的计算机视觉数据集中,错误标记的样本(即标签噪声)很常见。虽然以前的工作已经研究过标签噪声对深度神经网络学习的影响,但这些研究只集中在同质标签噪声上,即所有类别的标签噪声程度相同。然而,在现实世界中,标签噪声往往是异质的,一些类别受到的影响比其他类别更大。在这里,我们填补了文献中的这一空白。我们假设,除非存在从受噪声影响的类到无标签噪声类的迁移,否则异质标签噪声只会影响存在标签噪声的类。为了验证这一假设,我们设计了一系列使用MNIST、CIFAR-10、CIFAR-100和MS-COCO的计算机视觉研究,在训练多类、多任务和多标签系统的过程中施加异质标签噪声。我们的结果提供了支持这一假设的证据:除非存在迁移,标签噪声只影响被其污染的类。 摘要:Incorrectly labeled examples, or label noise, is common in real-world computer vision datasets. While the impact of label noise on learning in deep neural networks has been studied in prior work, these studies have exclusively focused on homogeneous label noise, i.e., the degree of label noise is the same across all categories. However, in the real-world, label noise is often heterogeneous, with some categories being affected to a greater extent than others. Here, we address this gap in the literature. We hypothesized that heterogeneous label noise would only affect the classes that had label noise unless there was transfer from those classes to the classes without label noise. To test this hypothesis, we designed a series of computer vision studies using MNIST, CIFAR-10, CIFAR-100, and MS-COCO where we imposed heterogeneous label noise during the training of multi-class, multi-task, and multi-label systems. Our results provide evidence in support of our hypothesis: label noise only affects the class affected by it unless there is transfer.
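构造异质标签噪声的实验协议可以示意如下(NumPy;受污染类别的选择与噪声比例为假设):

```python
import numpy as np

def inject_heterogeneous_noise(labels, noisy_classes, rate, num_classes, seed=0):
    """仅对指定类别按给定比例随机翻转标签,其余类别保持干净。"""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    for c in noisy_classes:
        idx = np.where(labels == c)[0]
        flip = rng.choice(idx, size=int(rate * len(idx)), replace=False)
        for i in flip:
            labels[i] = rng.choice([k for k in range(num_classes) if k != c])
    return labels

# 例:只污染 CIFAR-10 的前三个类别,各翻转 40% 标签
# noisy = inject_heterogeneous_noise(train_labels, noisy_classes=[0, 1, 2], rate=0.4, num_classes=10)
```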
蒸馏|知识提取(1篇)
【1】 ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations 标题:ScanBank:一种用于扫描电子论文图形提取的基准数据集
作者:Sampanna Yashwant Kahu,William A. Ingram,Edward A. Fox,Jian Wu 机构:Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA; Department of Computer Science, Old Dominion University, Norfolk, VA 备注:16 pages, 3 figures, submitted to ACM/IEEE Joint Conference on Digital Libraries 链接:https://arxiv.org/abs/2106.15320 摘要:我们专注于电子学位论文(ETD),旨在改善其获取途径并扩大其效用:超过600万篇ETD已公开可用,它们构成了帮助跨学科研究和教育的重要语料库。随着新的原生数字文档不断加入,以及数百万篇旧学位论文被转换为数字形式、以电子方式在机构知识库中传播,该语料库持续增长。在ETD中,与其他学术著作一样,图表能以简洁的方式传达大量信息。虽然已经有人提出了从原生数字PDF中提取图形和表格的方法,但它们在扫描版ETD上效果不佳。针对这一问题,我们对最先进图形提取系统的评估结论是:它们在扫描PDF上表现不佳,原因在于它们只在原生数字文档上训练过。为了解决这一限制,我们提出了ScanBank,这是一个包含10000张扫描页面图像的新数据集,其中的3300个图形或表格由人工标注。我们利用该数据集训练基于YOLOv5的深度神经网络模型,从扫描版ETD中准确提取图形和表格。我们提出并回答了若干重要的研究问题,旨在找到从扫描文档中提取图形的更好方法。其中之一是:把数据增强技术应用于原生数字文档,用以训练更适合从扫描文档提取图形的模型,这种做法对训练有何价值。据我们所知,ScanBank是第一个面向扫描版ETD图形和表格提取的人工标注数据集。在ScanBank上训练的基于YOLOv5的模型,以相当大的优势超过了现有可比的开源免费基线方法。 摘要:We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value for training, of data augmentation techniques applied to born-digital documents which are used to train models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction for scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
多模态(1篇)
【1】 Multimodal Trajectory Prediction Conditioned on Lane-Graph Traversals 标题:基于车道图遍历条件的多模态轨迹预测
作者:Nachiket Deo,Eric M. Wolff,Oscar Beijbom 机构:University of California San Diego, Motional 链接:https://arxiv.org/abs/2106.15004 摘要:准确预测周围车辆的未来运动需要对目标和驾驶行为中固有的不确定性进行推理。这种不确定性可以粗略地分解为横向(如保持车道、转弯)和纵向(如加速、制动)两类。我们提出了一种新方法,将学习到的离散策略展开与聚焦于车道图子集的解码器相结合。策略展开根据当前观测探索不同的目标,确保模型捕捉到横向变化;纵向变化则由我们新颖的潜变量解码器捕捉,该解码器以车道图的不同子集为条件。我们的模型在nuScenes运动预测数据集上达到了最先进的性能,并且定性地展示了出色的场景遵从性。详细的消融实验强调了策略展开和解码器架构二者的重要性。 摘要:Accurately predicting the future motion of surrounding vehicles requires reasoning about the inherent uncertainty in goals and driving behavior. This uncertainty can be loosely decoupled into lateral (e.g., keeping lane, turning) and longitudinal (e.g., accelerating, braking). We present a novel method that combines learned discrete policy rollouts with a focused decoder on subsets of the lane graph. The policy rollouts explore different goals given our current observations, ensuring that the model captures lateral variability. The longitudinal variability is captured by our novel latent variable model decoder that is conditioned on various subsets of the lane graph. Our model achieves state-of-the-art performance on the nuScenes motion prediction dataset, and qualitatively demonstrates excellent scene compliance. Detailed ablations highlight the importance of both the policy rollouts and the decoder architecture.
3D|3D重建等相关(3篇)
【1】 Automatic 2D-3D Registration without Contrast Agent during Neurovascular Interventions 标题:神经血管介入治疗中无造影剂的2D-3D自动配准
作者:Robert Homan,René van Rijsselt,Daniel Ruijters 链接:https://arxiv.org/abs/2106.15308 摘要:将实时荧光透视图像与血管系统的三维旋转重建融合,可以在微创神经血管治疗中导航血管内装置,同时减少有害碘造影剂的使用。透视图像和三维重建的对准使用X射线C臂几何体的传感器信息初始化。患者的运动随后由基于图像的配准算法校正,该算法利用三维重建的数字重建射线照片,基于梯度差相似性测量。该算法不要求透视图像中的血管充满碘造影剂,而是依赖图像中的梯度(骨结构、鼻窦)作为标志性特征。本文研究了基于图像的配准算法的精度、鲁棒性和计算时间。在体模实验中,97%的配准尝试通过了剩余配准误差小于1mm平移和3°旋转的成功标准。本文建立了一种新的二维-三维配准验证方法,无需改变临床工作流程,如附加基准标记。因此,这种方法可以回顾性地应用于已有的临床资料。对于临床数据实验,87%的配准尝试通过了剩余平移误差<1mm的标准,84%的配准尝试通过了旋转误差<3°的标准。 摘要:Fusing live fluoroscopy images with a 3D rotational reconstruction of the vasculature allows to navigate endovascular devices in minimally invasive neuro-vascular treatment, while reducing the usage of harmful iodine contrast medium. The alignment of the fluoroscopy images and the 3D reconstruction is initialized using the sensor information of the X-ray C-arm geometry. Patient motion is then corrected by an image-based registration algorithm, based on a gradient difference similarity measure using digital reconstructed radiographs of the 3D reconstruction. This algorithm does not require the vessels in the fluoroscopy image to be filled with iodine contrast agent, but rather relies on gradients in the image (bone structures, sinuses) as landmark features. This paper investigates the accuracy, robustness and computation time aspects of the image-based registration algorithm. Using phantom experiments 97% of the registration attempts passed the success criterion of a residual registration error of less than 1 mm translation and 3° rotation. The paper establishes a new method for validation of 2D-3D registration without requiring changes to the clinical workflow, such as attaching fiducial markers. As a consequence, this method can be retrospectively applied to pre-existing clinical data. For clinical data experiments, 87% of the registration attempts passed the criterion of a residual translational error of < 1 mm, and 84% possessed a rotational error of < 3°.
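摘要中的梯度差相似性度量大致形如 GD = Σ A/(A + (∇I_fluoro − ∇I_DRR)²)。下面给出一个简化草图(NumPy;归一化常数取图像方差、略去灰度尺度因子,均为假设性简化):

```python
import numpy as np

def gradient_difference_similarity(fluoro, drr):
    """荧光透视图像与数字重建射线照片(DRR)之间的梯度差相似性(值越大越相似)。"""
    gv = np.diff(fluoro, axis=0) - np.diff(drr, axis=0)   # 垂直方向梯度差
    gh = np.diff(fluoro, axis=1) - np.diff(drr, axis=1)   # 水平方向梯度差
    a_v = np.var(fluoro)
    a_h = np.var(fluoro)
    return np.sum(a_v / (a_v + gv ** 2)) + np.sum(a_h / (a_h + gh ** 2))
```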
【2】 Roof Damage Assessment from Automated 3D Building Models 标题:基于自动化三维建筑模型的屋顶损伤评估
作者:Kenichi Sugihara,Martin Wallace,Kongwen Zhang,Youry Khmelevsky 机构:Gifu-Keizai University, Kitagata-chou, Ogaki-city, Gifu-Pref., Japan; Okanagan College, KLO Rd., Kelowna, BC, Canada 链接:https://arxiv.org/abs/2106.15294 摘要:三维建筑建模在城市规划及相关领域中具有重要的意义。这样的三维模型可以用于在变化发生之前和之后,从单个建筑物到整个城市的多尺度城市图像的可视化。这种能力在规划师、地理设计师和建筑师的日常工作和特殊项目中非常重要。在这项研究中,我们实现了一种新的三维建筑模型的方法,包括地理信息系统(GIS)和三维计算机图形(3DCG)组件的集成,这些组件可以从建筑足迹(多边形)生成三维房屋模型,以及自动生成简单和复杂的屋顶几何图形,以便快速报告屋顶区域损坏情况。这些多边形(脚印)通常是正交的。一个复杂的正交多边形可以分割成一组矩形。提出的GIS和3DCG集成系统将正交建筑多边形划分为一组矩形,并在这些矩形上放置矩形屋顶和长方体建筑体。由于技术人员是使用数字化仪手动绘制这些多边形的,这取决于航空照片,因此并非所有建筑多边形都是精确正交的。但是,当放置一组长方体作为用于创建建筑的建筑体时,如果建筑多边形不是精确正交的,则这些长方体之间可能存在间隙或重叠。在我们的方案中,将近似正交的建筑多边形分割并校正成一组相互正交的矩形,每个矩形知道哪个矩形与哪个矩形相邻,哪个矩形的边与哪个矩形相邻,这样就避免了建筑体组合时不必要的门窗相交。 摘要:The 3D building modelling is important in urban planning and related domains that draw upon the content of 3D models of urban scenes. Such 3D models can be used to visualize city images at multiple scales from individual buildings to entire cities prior to and after a change has occurred. This ability is of great importance in day-to-day work and special projects undertaken by planners, geo-designers, and architects. In this research, we implemented a novel approach to 3D building models for such matter, which included the integration of geographic information systems (GIS) and 3D Computer Graphics (3DCG) components that generate 3D house models from building footprints (polygons), and the automated generation of simple and complex roof geometries for rapid roof area damage reporting. These polygons (footprints) are usually orthogonal. A complicated orthogonal polygon can be partitioned into a set of rectangles. The proposed GIS and 3DCG integrated system partitions orthogonal building polygons into a set of rectangles and places rectangular roofs and box-shaped building bodies on these rectangles. Since technicians are drawing these polygons manually with digitizers, depending on aerial photos, not all building polygons are precisely orthogonal. But, when placing a set of boxes as building bodies for creating the buildings, there may be gaps or overlaps between these boxes if building polygons are not precisely orthogonal. In our proposal, after approximately orthogonal building polygons are partitioned and rectified into a set of mutually orthogonal rectangles, each rectangle knows which rectangle is adjacent to and which edge of the rectangle is adjacent to, which will avoid unwanted intersection of windows and doors when building bodies combined.
【3】 Xihe: A 3D Vision-based Lighting Estimation Framework for Mobile Augmented Reality 标题:Xihe:一种基于三维视觉的移动增强现实光照估计框架
作者:Yiqin Zhao,Tian Guo 机构:Worcester Polytechnic Institute 链接:https://arxiv.org/abs/2106.15280 摘要:全向照明为实现空间变化的真实感3D渲染提供了基础,这是移动增强现实应用的理想属性。然而,在实践中,估计全向光照可能是具有挑战性的,由于诸如渲染位置的部分全景、固有的环境光照和移动用户动态等限制。随着移动3D视觉的发展,一个新的机遇出现了,包括内置的高精度深度传感器和基于深度学习的算法,它们提供了更好地感知和理解物理环境的方法。本文围绕三维视觉的核心思想,设计了一个边缘辅助框架Xihe,为移动AR应用提供实时准确的全向光照估计。具体来说,我们开发了一种新的采样技术,可以有效地压缩移动设备上生成的原始点云输入。这项技术是基于我们最近的三维室内数据集的实证分析得出的,在我们基于三维视觉的照明估计器管道设计中起着关键作用。为了达到实时性的目标,我们开发了一个定制的GPU流水线用于设备上的点云处理,并使用了一种编码技术来减少网络传输的字节数。最后,我们提出了一种自适应触发策略,允许Xihe跳过不必要的光照估计,并提供了一种与移动AR生态系统集成的时间相干渲染的实用方法。我们使用Xihe的API开发的参考移动应用程序来评估Xihe的光照估计精度和时间。结果表明,Xihe算法每次光照估计最快可达20.67ms,估计精度比现有的神经网络高9.4%。 摘要:Omnidirectional lighting provides the foundation for achieving spatially-variant photorealistic 3D rendering, a desirable property for mobile augmented reality applications. However, in practice, estimating omnidirectional lighting can be challenging due to limitations such as partial panoramas of the rendering positions, and the inherent environment lighting and mobile user dynamics. A new opportunity arises recently with the advancements in mobile 3D vision, including built-in high-accuracy depth sensors and deep learning-powered algorithms, which provide the means to better sense and understand the physical surroundings. Centering the key idea of 3D vision, in this work, we design an edge-assisted framework called Xihe to provide mobile AR applications the ability to obtain accurate omnidirectional lighting estimation in real time. Specifically, we develop a novel sampling technique that efficiently compresses the raw point cloud input generated at the mobile device. This technique is derived based on our empirical analysis of a recent 3D indoor dataset and plays a key role in our 3D vision-based lighting estimator pipeline design. To achieve the real-time goal, we develop a tailored GPU pipeline for on-device point cloud processing and use an encoding technique that reduces network transmitted bytes. Finally, we present an adaptive triggering strategy that allows Xihe to skip unnecessary lighting estimations and a practical way to provide temporal coherent rendering integration with the mobile AR ecosystem. We evaluate both the lighting estimation accuracy and time of Xihe using a reference mobile application developed with Xihe's APIs. Our results show that Xihe takes as fast as 20.67ms per lighting estimation and achieves 9.4% better estimation accuracy than a state-of-the-art neural network.
其他神经网络|深度学习|模型|建模(12篇)
【1】 Multiple Graph Learning for Scalable Multi-view Clustering 标题:用于可伸缩多视图聚类的多图学习
作者:Tianyu Jiang,Quanxue Gao 机构: Xidian University 链接:https://arxiv.org/abs/2106.15382 摘要:基于图的多视图聚类由于能够有效地刻画多媒体数据之间的复杂结构和关系而成为当前研究的热点。然而,现有的方法存在以下不足:(1)由于图的构造和特征分解,使得大规模的图学习效率低下甚至失败(2) 它们不能很好地利用嵌入在不同视图图形中的互补信息和空间结构。为了更好地利用互补信息,解决基于图的多视图聚类的可扩展性问题,提出了一种基于少量锚定点和张量Schatten p-范数最小化的多图学习模型。具体来说,我们通过锚图为每个视图构造一个隐藏的、可处理的大图,并利用张量Schatten p-范数正则化器很好地挖掘了不同视图锚图中的互补信息。最后,我们提出了一个有效的算法,该算法随数据大小线性扩展,以解决我们提出的模型。在多个数据集上的大量实验结果表明,本文提出的方法优于一些最新的多视图聚类算法。 摘要:Graph-based multi-view clustering has become an active topic due to the efficiency in characterizing both the complex structure and relationship between multimedia data. However, existing methods have the following shortcomings: (1) They are inefficient or even fail for graph learning in large scale due to the graph construction and eigen-decomposition. (2) They cannot well exploit both the complementary information and spatial structure embedded in graphs of different views. To well exploit complementary information and tackle the scalability issue plaguing graph-based multi-view clustering, we propose an efficient multiple graph learning model via a small number of anchor points and tensor Schatten p-norm minimization. Specifically, we construct a hidden and tractable large graph by anchor graph for each view and well exploit complementary information embedded in anchor graphs of different views by tensor Schatten p-norm regularizer. Finally, we develop an efficient algorithm, which scales linearly with the data size, to solve our proposed model. Extensive experimental results on several datasets indicate that our proposed method outperforms some state-of-the-art multi-view clustering algorithms.
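根据摘要推断,该模型的目标函数大致可以写成如下形式(记号与具体形式均为假设,仅作示意):

```latex
\min_{\{Z^{(v)}\}}\ \sum_{v=1}^{V}\bigl\|X^{(v)} - A^{(v)} Z^{(v)}\bigr\|_F^2
\;+\; \lambda\,\bigl\|\mathcal{Z}\bigr\|_{S_p}^{p},
\qquad
\mathcal{Z}=\Phi\bigl(Z^{(1)},\dots,Z^{(V)}\bigr),
```

其中 $A^{(v)}$ 为第 $v$ 个视图的锚点矩阵,$Z^{(v)}$ 为对应的锚图,$\Phi(\cdot)$ 将各视图锚图堆叠成张量,$\|\cdot\|_{S_p}$ 为张量 Schatten p-范数正则项,用于挖掘视图间的互补信息。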
【2】 Quantifying urban streetscapes with deep learning: focus on aesthetic evaluation 标题:基于深度学习的城市街景量化--以审美评价为中心
作者:Yusuke Kumakoshi,Shigeaki Onoda,Tetsuya Takahashi,Yuji Yoshimura 备注:4 pages, 3 figures 链接:https://arxiv.org/abs/2106.15361 摘要:城市街景的无序会对人们感知到的街景审美品质产生负面影响。建筑立面上广告牌的存在被认为是造成这种无序的一个重要因素,但尚未发展出可扩展的量化方法。为了填补这一空白,本文报告了我们的深度学习模型在东京一个独特数据集上的表现,该模型用于分别识别街景中被立面和广告牌覆盖的区域。以交并比(IoU)衡量,该模型达到了63.17%的准确率,从而使研究者和实践者能够结合人们的偏好数据,获得关于城市街景设计的见解。 摘要:The disorder of urban streetscapes would negatively affect people's perception of their aesthetic quality. The presence of billboards on building facades has been regarded as an important factor of the disorder, but its quantification methodology has not yet been developed in a scalable manner. To fill the gap, this paper reports the performance of our deep learning model on a unique data set prepared in Tokyo to recognize the areas covered by facades and billboards in streetscapes, respectively. The model achieved 63.17 % of accuracy, measured by Intersection-over-Union (IoU), thus enabling researchers and practitioners to obtain insights on urban streetscape design by combining data of people's preferences.
【3】 LB-CNN: An Open Source Framework for Fast Training of Light Binary Convolutional Neural Networks using Chainer and Cupy 标题:LB-CNN:利用Chainer和Cupy快速训练轻型二进制卷积神经网络的开源框架
作者:Radu Dogaru,Ioana Dogaru 机构:Dept. of Applied and Information Engineering, University “Politehnica” of Bucharest, Bucharest, Romania 备注:6 pages, includes reference to code (Jupyter - Python notebook) 链接:https://arxiv.org/abs/2106.15350 摘要:轻二进制卷积神经网络(LB-CNN)在许多工业应用中需要在低能耗计算平台上实现时特别有用。本文介绍了一种优化紧凑LB-CNN的框架,并对其有效性进行了评价。该框架是免费提供的,可以在免费访问的云平台上运行,因此不需要重大投资。优化后的模型以标准化的.h5格式保存,可以作为专用工具的输入,进一步部署到特定技术中,从而实现各种智能图像传感器的快速发展。加速我们模型优化的主要因素,特别是二进制卷积核的选择,是Chainer/Cupy机器学习库,它为将输出层训练成一个极限学习机提供了显著的加速。包括使用Keras/Tensorflow对输出层进行额外的训练,因为这样可以提高精度。对于广泛使用的数据集,包括MNIST、GTSRB、ORL和VGG,结果显示在精确度和复杂性之间有很好的折衷。特别是,对于人脸识别问题,经过仔细优化的LB-CNN模型提供了高达100%的准确率。这种TinyML解决方案非常适合需要低能耗图像识别的工业应用。 摘要:Light binary convolutional neural networks (LB-CNN) are particularly useful when implemented in low-energy computing platforms as required in many industrial applications. Herein, a framework for optimizing compact LB-CNN is introduced and its effectiveness is evaluated. The framework is freely available and may run on free-access cloud platforms, thus requiring no major investments. The optimized model is saved in the standardized .h5 format and can be used as input to specialized tools for further deployment into specific technologies, thus enabling the rapid development of various intelligent image sensors. The main ingredient in accelerating the optimization of our model, particularly the selection of binary convolution kernels, is the Chainer/Cupy machine learning library offering significant speed-ups for training the output layer as an extreme-learning machine. Additional training of the output layer using Keras/Tensorflow is included, as it allows an increase in accuracy. Results for widely used datasets including MNIST, GTSRB, ORL, VGG show very good compromise between accuracy and complexity. Particularly, for face recognition problems a carefully optimized LB-CNN model provides up to 100% accuracies. Such TinyML solutions are well suited for industrial applications requiring image recognition with low energy consumption.
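摘要提到用 Chainer/CuPy 把输出层当作极限学习机(ELM)来训练,其核心是 GPU 上一次岭回归闭式解。下面是一个示意(CuPy;正则系数为假设值):

```python
import cupy as cp

def train_elm_output_layer(H, T, lam=1e-3):
    """H: (N, D) 二值卷积特征;T: (N, C) one-hot 标签;返回输出层权重 W。"""
    H = cp.asarray(H, dtype=cp.float32)
    T = cp.asarray(T, dtype=cp.float32)
    A = H.T @ H + lam * cp.eye(H.shape[1], dtype=cp.float32)
    W = cp.linalg.solve(A, H.T @ T)   # 闭式最小二乘解,GPU 上一步完成
    return cp.asnumpy(W)
```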
【4】 Deep Learning for Multi-View Stereo via Plane Sweep: A Survey 标题:基于平面扫描的多视点立体深度学习研究综述
作者:Qingtian Zhu 机构:Graphics and Interaction Lab, Dept. of EECS, Peking University, Beijing, China 链接:https://arxiv.org/abs/2106.15328 摘要:三维重建技术在自动驾驶、机器人技术、虚拟现实等领域有着广泛的应用,近年来受到越来越多的关注。深度学习作为人工智能领域的一种主流技术,已经成功地应用于解决各种计算机视觉问题。然而,三维重建的深度学习由于其独特的挑战和不同的管道仍处于初级阶段。为了促进未来的研究,本文综述了基于图像的三维重建的关键任务&多视点立体(MVS)深度学习方法的最新进展。它还提出了几个公开数据集的比较结果,有深刻的观察和启发未来的研究方向。 摘要:3D reconstruction has lately attracted increasing attention due to its wide application in many areas, such as autonomous driving, robotics and virtual reality. As a dominant technique in artificial intelligence, deep learning has been successfully adopted to solve various computer vision problems. However, deep learning for 3D reconstruction is still at its infancy due to its unique challenges and varying pipelines. To stimulate future research, this paper presents a review of recent progress in deep learning methods for Multi-view Stereo (MVS), which is considered as a crucial task of image-based 3D reconstruction. It also presents comparative results on several publicly available datasets, with insightful observations and inspiring future research directions.
【5】 Joint Learning of Portrait Intrinsic Decomposition and Relighting 标题:人像本征分解与重光照的联合学习
作者:Mona Zehni,Shaona Ghosh,Krishna Sridhar,Sethu Raman 机构:Department of ECE and CSL, University of Illinois at Urbana-Champaign, Apple Inc. 链接:https://arxiv.org/abs/2106.15305 摘要:逆渲染是将图像分解为其本征成分(即反照率、法线和光照)的问题。为了从单幅图像求解这一不适定问题,现有最先进的从明暗恢复形状(shape from shading)方法大多在合成或真实数据集上对所有成分进行全监督训练。在这里,我们提出了一种新的自监督训练范式:1)减少对分解任务的全监督需求,2)将重照明任务纳入考虑。我们引入了新的自监督损失项,利用多照明图像(同一场景在不同照明下的图像)之间的一致性。我们的方法适用于多照明数据集。我们在两种设置下应用该训练方法:1)在合成与真实数据的混合上训练,2)在监督有限的真实数据集上训练。我们展示了该训练范式在本征分解和重照明两方面的有效性,并表明在有限监督设置下,去掉自监督损失项后模型在两个任务上都会表现不佳。我们提供了在SfSNet、CelebA和Photoface数据集上的全面实验结果,并在真实场景图像上验证了方法的性能。 摘要:Inverse rendering is the problem of decomposing an image into its intrinsic components, i.e. albedo, normal and lighting. To solve this ill-posed problem from single image, state-of-the-art methods in shape from shading mostly resort to supervised training on all the components on either synthetic or real datasets. Here, we propose a new self-supervised training paradigm that 1) reduces the need for full supervision on the decomposition task and 2) takes into account the relighting task. We introduce new self-supervised loss terms that leverage the consistencies between multi-lit images (images of the same scene under different illuminations). Our approach is applicable to multi-lit datasets. We apply our training approach in two settings: 1) train on a mixture of synthetic and real data, 2) train on real datasets with limited supervision. We show-case the effectiveness of our training paradigm on both intrinsic decomposition and relighting and demonstrate how the model struggles in both tasks without the self-supervised loss terms in limited supervision settings. We provide results of comprehensive experiments on SfSNet, CelebA and Photoface datasets and verify the performance of our approach on images in the wild.
【6】 VolterraNet: A higher order convolutional network with group equivariance for homogeneous manifolds 标题:VolterraNet:一种用于齐次流形的群等变高阶卷积网络
作者:Monami Banerjee,Rudrasis Chakraborty,Jose Bouza,Baba C. Vemuri 机构:University of California; University of Florida 备注:IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 链接:https://arxiv.org/abs/2106.15301 摘要:卷积神经网络由于其平移等变性,在基于图像的学习任务中取得了很大的成功。最近的工作将卷积神经网络的传统卷积层推广到非欧氏空间,并证明了广义卷积运算的群等变性。本文针对定义为黎曼齐次空间上函数样本的数据,提出了一种新的高阶Volterra卷积神经网络(VolterraNet)。与传统卷积的结果类似,我们证明了Volterra函数卷积对黎曼齐次空间所容许的等距群的作用是等变的,并且在一定的限制条件下,任何非线性等变函数都可以表示为我们的齐次空间Volterra卷积,从而推广了欧氏空间中Volterra展开式的非线性平移等变刻画。我们还证明了二阶函数卷积运算可以表示为级联卷积运算,从而得到一个高效的实现。除此之外,我们还提出了一个扩张(dilated)VolterraNet模型。这些进展使参数量相对于基线非欧氏CNN大幅降低。为了证明VolterraNet的性能,我们给出了若干真实数据实验,包括球形MNIST、原子能、Shrec17数据集上的分类任务,以及扩散MRI数据上的分组测试。我们还给出了与最先进方法的性能比较。 摘要:Convolutional neural networks have been highly successful in image-based learning tasks due to their translation equivariance property. Recent work has generalized the traditional convolutional layer of a convolutional neural network to non-Euclidean spaces and shown group equivariance of the generalized convolution operation. In this paper, we present a novel higher order Volterra convolutional neural network (VolterraNet) for data defined as samples of functions on Riemannian homogeneous spaces. Analogous to the result for traditional convolutions, we prove that the Volterra functional convolutions are equivariant to the action of the isometry group admitted by the Riemannian homogeneous spaces, and under some restrictions, any non-linear equivariant function can be expressed as our homogeneous space Volterra convolution, generalizing the non-linear shift equivariant characterization of Volterra expansions in Euclidean space. We also prove that second order functional convolution operations can be represented as cascaded convolutions which leads to an efficient implementation. Beyond this, we also propose a dilated VolterraNet model. These advances lead to large parameter reductions relative to baseline non-Euclidean CNNs. To demonstrate the efficacy of the VolterraNet performance, we present several real data experiments involving classification tasks on spherical-MNIST, atomic energy, Shrec17 data sets, and group testing on diffusion MRI data. Performance comparisons to the state-of-the-art are also presented.
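"二阶函数卷积可表示为级联卷积"这一点,在欧氏情形下可以用如下草图说明(PyTorch;仅作示意,黎曼齐次空间上的实现需换成相应的广义卷积):

```python
import torch.nn as nn

class SecondOrderVolterra(nn.Module):
    """一阶项 + 二阶项:二阶项由两支一阶卷积的逐元素乘积(级联)实现。"""
    def __init__(self, ch):
        super().__init__()
        self.first = nn.Conv2d(ch, ch, 3, padding=1)
        self.second_a = nn.Conv2d(ch, ch, 3, padding=1)
        self.second_b = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return self.first(x) + self.second_a(x) * self.second_b(x)
```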
【7】 Evaluating Deep Neural Networks for Image Document Enhancement 标题:深度神经网络在图像文档增强中的评价
作者:Lucas N. Kirsten,Ricardo Piccoli,Ricardo Ribani 机构:Department of Print Software, HP Inc. – R&D, Porto Alegre – RS,-, Brazil 备注:12 pages, 6 figures, 2 tables, CBDAR conference 链接:https://arxiv.org/abs/2106.15286 摘要:这项工作评估了六个国家的最先进的深层神经网络(DNN)架构应用于增强相机捕获的文件图像的问题。利用图像质量评价(IQA)指标对每个网络的结果进行了定性和定量评价,并与现有的基于传统计算机视觉技术的方法进行了比较。与现有算法相比,性能最好的体系结构通常产生了良好的增强效果,这表明使用DNNs进行文档图像增强是可能的。此外,性能最好的体系结构可以作为未来使用深度学习技术进行文档增强研究的基线。本文的主要贡献是:一个可以进一步改进以提供更好结果的深度学习技术的基线,以及一个使用IQA度量来定量比较从神经网络产生的图像与地面真实值的评估方法。 摘要:This work evaluates six state-of-the-art deep neural network (DNN) architectures applied to the problem of enhancing camera-captured document images. The results from each network were evaluated both qualitatively and quantitatively using Image Quality Assessment (IQA) metrics, and also compared with an existing approach based on traditional computer vision techniques. The best performing architectures generally produced good enhancement compared to the existing algorithm, showing that it is possible to use DNNs for document image enhancement. Furthermore, the best performing architectures could work as a baseline for future investigations on document enhancement using deep learning techniques. The main contributions of this paper are: a baseline of deep learning techniques that can be further improved to provide better results, and a evaluation methodology using IQA metrics for quantitatively comparing the produced images from the neural networks to a ground truth.
【8】 AutoNovel: Automatically Discovering and Learning Novel Visual Categories 标题:AutoNovel:自动发现和学习新的视觉类别
作者:Kai Han,Sylvestre-Alvise Rebuffi,Sébastien Ehrhardt,Andrea Vedaldi,Andrew Zisserman 机构:Department of Engineering Science, University of Oxford 备注:TPAMI 2021, code: this http URL arXiv admin note: substantial text overlap with arXiv:2002.05714 链接:https://arxiv.org/abs/2106.15252 摘要:我们处理的问题是:在给定其他类的带标签样本的情况下,发现图像集合中的新类。我们提出了一种称为AutoNovel的新方法来解决这个问题,它结合了三个思想:(1)我们认为,仅使用标记数据引导图像表示的常见做法会引入不必要的偏差,而利用自监督学习在标记数据与未标记数据的并集上从零开始训练表示可以避免这一点;(2)我们利用排序统计量,将模型关于标记类的知识迁移到未标记图像的聚类问题上;(3)通过在数据的标记子集和未标记子集上优化联合目标函数来训练数据表示,同时改进标记数据的监督分类和未标记数据的聚类。此外,我们还提出了一种在新类别数目未知的情况下估计类别数目的方法。我们在标准分类基准上对AutoNovel进行了评估,其表现大大优于现有的新类别发现方法。此外,我们还证明了AutoNovel可以用于完全无监督的图像聚类,取得了很好的效果。 摘要:We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. Moreover, we propose a method to estimate the number of classes for the case where the number of new categories is not known a priori. We evaluate AutoNovel on standard classification benchmarks and substantially outperform current methods for novel category discovery. In addition, we also show that AutoNovel can be used for fully unsupervised image clustering, achieving promising results.
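用排序统计量生成成对伪标签的思路可以示意如下(PyTorch;top-k 的 k 为假设值,仅作示意性简化):

```python
import torch

def ranking_stats_pairwise_labels(feats, k=5):
    """若两样本特征幅值的 top-k 维度集合一致,则视为同类(伪正样本对)。"""
    topk = feats.abs().topk(k, dim=1).indices          # (N, k)
    sets = [set(t.tolist()) for t in topk]
    n = len(sets)
    pseudo = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            pseudo[i, j] = float(sets[i] == sets[j])
    return pseudo                                       # (N, N) 成对伪标签矩阵
```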
【9】 O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning 标题:O2O-Afford:无标注的大规模对象-对象可供性学习
作者:Kaichun Mo,Yuzhe Qin,Fanbo Xiang,Hao Su,Leonidas Guibas 机构:Stanford University, UCSD 链接:https://arxiv.org/abs/2106.15087 摘要:与计算机视觉和机器人学中大量关于建模、感知和理解智能体-对象(如人-物、手-物、机器人-物)交互的文献相反,很少有人研究对象-对象交互任务,而它在机器人操作和规划任务中也起着重要的作用。在我们的日常生活中,有着丰富的对象-对象交互场景空间,如将对象放在凌乱的桌面上、将对象放进抽屉中、使用工具推动对象等。本文提出了一个统一的可供性(affordance)学习框架来学习各种任务中的对象-对象交互。通过使用物理模拟(SAPIEN)和数千个具有丰富几何多样性的ShapeNet模型构建四个对象-对象交互任务环境,我们能够在不需要人工标注或演示的情况下进行大规模的对象-对象可供性学习。在技术贡献的核心部分,我们提出了一个对象核点卷积网络来推理两个对象之间的细致交互。在大规模合成数据和真实数据上的实验证明了该方法的有效性。有关代码、数据、视频和更多资料,请参阅项目网页:https://cs.stanford.edu/~kaichun/o2oafford 摘要:Contrary to the vast literature in modeling, perceiving, and understanding agent-object (e.g., human-object, hand-object, robot-object) interaction in computer vision and robotics, very few past works have studied the task of object-object interaction, which also plays an important role in robotic manipulation and planning tasks. There is a rich space of object-object interaction scenarios in our daily life, such as placing an object on a messy tabletop, fitting an object inside a drawer, pushing an object using a tool, etc. In this paper, we propose a unified affordance learning framework to learn object-object interaction for various tasks. By constructing four object-object interaction task environments using physical simulation (SAPIEN) and thousands of ShapeNet models with rich geometric diversity, we are able to conduct large-scale object-object affordance learning without the need for human annotations or demonstrations. At the core of technical contribution, we propose an object-kernel point convolution network to reason about detailed interaction between two objects. Experiments on large-scale synthetic data and real-world data prove the effectiveness of the proposed approach. Please refer to the project webpage for code, data, video, and more materials: https://cs.stanford.edu/~kaichun/o2oafford
【10】 GuidedMix-Net: Learning to Improve Pseudo Masks Using Labeled Images as Reference 标题:GuidedMix-Net:学习以标签图像为参考改进伪掩模
作者:Peng Tu,Yawen Huang,Rongrong Ji,Feng Zheng,Ling Shao 机构:Department of Information Science and Engineering, Xiamen University 备注:11 pages 链接:https://arxiv.org/abs/2106.15064 摘要:半监督学习是一个具有挑战性的问题,它的目标是通过从有限数量的标记样本中学习来构造模型。为了解决这个问题,人们提出了许多方法,其中大多数集中在利用未标记实例一致性的预测来正则化网络。然而,将标记和未标记的数据分开处理往往会导致从标记样本中学习到的大量先验知识被丢弃,并且无法挖掘标记和未标记图像对之间的特征交互。本文提出了一种新的半监督语义切分方法GuidedMix-Net,它利用标记信息指导未标记实例的学习。具体来说,我们首先在标记和未标记的数据之间引入一个特征对齐目标来捕获潜在的相似图像对,然后从中生成混合输入。提出的基于聚类假设的互信息传递(MITrans),是一个强大的知识模块,可以进一步细化混合数据空间中未标记数据的特征。为了充分利用标记样本,指导未标记数据的学习,我们进一步提出了一个掩码生成模块,为未标记数据生成高质量的伪掩码。结合有标记数据的有监督学习,利用混合数据生成的伪掩模联合学习未标记数据的预测。在PASCAL VOC 2012、PASCAL Context和Cityscapes上的大量实验证明了我们的GuidedMix网络的有效性,与以前最先进的方法相比,它实现了具有竞争力的细分精度,并显著地将mIoU提高了+7%。 摘要:Semi-supervised learning is a challenging problem which aims to construct a model by learning from a limited number of labeled examples. Numerous methods have been proposed to tackle this problem, with most focusing on utilizing the predictions of unlabeled instances consistency alone to regularize networks. However, treating labeled and unlabeled data separately often leads to the discarding of mass prior knowledge learned from the labeled examples, and failure to mine the feature interaction between the labeled and unlabeled image pairs. In this paper, we propose a novel method for semi-supervised semantic segmentation named GuidedMix-Net, by leveraging labeled information to guide the learning of unlabeled instances. Specifically, we first introduce a feature alignment objective between labeled and unlabeled data to capture potentially similar image pairs and then generate mixed inputs from them. The proposed mutual information transfer (MITrans), based on the cluster assumption, is shown to be a powerful knowledge module for further progressive refining features of unlabeled data in the mixed data space. To take advantage of the labeled examples and guide unlabeled data learning, we further propose a mask generation module to generate high-quality pseudo masks for the unlabeled data. Along with supervised learning for labeled data, the prediction of unlabeled data is jointly learned with the generated pseudo masks from the mixed data. Extensive experiments on PASCAL VOC 2012, PASCAL-Context and Cityscapes demonstrate the effectiveness of our GuidedMix-Net, which achieves competitive segmentation accuracy and significantly improves the mIoU by +7% compared to previous state-of-the-art approaches.
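由"相似的有标签/无标签图像对"生成混合输入的做法,大致可示意如下(PyTorch;配对策略与混合系数分布为假设,非论文官方实现):

```python
import torch

def guided_mix(x_labeled, x_unlabeled, alpha=0.5):
    """按 Beta 分布采样系数,对有标签与无标签图像做线性混合。"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_labeled + (1.0 - lam) * x_unlabeled, lam

# 混合输入送入网络后,其预测与由有标签样本引导生成的伪掩模共同构成无标签数据的监督信号
```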
【11】 Fast Training of Neural Lumigraph Representations using Meta Learning 标题:基于元学习的神经Lumigraph表示快速训练
作者:Alexander W. Bergman,Petr Kellnhofer,Gordon Wetzstein 机构:Stanford University 备注:Project website: this http URL 链接:https://arxiv.org/abs/2106.14942 摘要:新视图合成是机器学习和计算机视觉中一个长期存在的问题。最近,神经场景表示和渲染技术的发展取得了重大进展,这些技术可以从任意视图合成照片级真实感图像。但是,这些表示的训练速度非常慢,而且渲染速度通常也很慢。受基于图像的渲染的神经变体的启发,我们开发了一种新的神经渲染方法,目标是快速学习高质量的表示,且该表示可以实时渲染。我们的方法MetaNLR++通过神经形状表示与基于2D CNN的图像特征提取、聚合和重投影的独特组合来实现这一点。为了将表示的收敛时间缩短到几分钟,我们利用元学习来学习神经形状和图像特征的先验,从而加速训练。然后,可以使用传统的图形技术提取优化后的形状和图像特征,并进行实时渲染。我们证明,MetaNLR++只需竞争方法所需时间的一小部分,即可取得相当或更好的新视图合成结果。 摘要:Novel view synthesis is a long-standing problem in machine learning and computer vision. Significant progress has recently been made in developing neural scene representations and rendering techniques that synthesize photorealistic images from arbitrary views. These representations, however, are extremely slow to train and often also slow to render. Inspired by neural variants of image-based rendering, we develop a new neural rendering approach with the goal of quickly learning a high-quality representation which can also be rendered in real-time. Our approach, MetaNLR++, accomplishes this by using a unique combination of a neural shape representation and 2D CNN-based image feature extraction, aggregation, and re-projection. To push representation convergence times down to minutes, we leverage meta learning to learn neural shape and image feature priors which accelerate training. The optimized shape and image features can then be extracted using traditional graphics techniques and rendered in real time. We show that MetaNLR++ achieves similar or better novel view synthesis results in a fraction of the time that competing methods require.
【12】 Unified Framework for Spectral Dimensionality Reduction, Maximum Variance Unfolding, and Kernel Learning By Semidefinite Programming: Tutorial and Survey 标题:谱降维、最大方差展开与半定规划核学习的统一框架:教程与综述
作者:Benyamin Ghojogh,Ali Ghodsi,Fakhri Karray,Mark Crowley 机构:Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada, Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science 备注:To appear as a part of an upcoming textbook on dimensionality reduction and manifold learning 链接:https://arxiv.org/abs/2106.15379 摘要:这是一篇关于谱降维方法的统一、基于半定规划(SDP)的核学习、最大方差展开(MVU,又称半定嵌入,SDE)及其变体的教程与综述。我们首先解释了如何将各种谱降维方法统一为具有不同核的核主成分分析(PCA)。这种统一可以解释为特征函数学习,或将核表示为距离矩阵的函数。随后,既然各种谱方法都统一为核PCA,我们便转而学习能把数据流形展开至最大方差的最佳核。我们先简要介绍面向转导任务的SDP核学习,然后详细介绍MVU,并介绍利用最近邻图、按类展开、Fisher准则和有色MVU实现有监督MVU的各种版本。我们还利用特征函数和核映射解释了MVU的样本外扩展。最后,我们介绍了MVU的其他变体,包括动作保持嵌入(action respecting embedding)、松弛MVU和用于大数据的landmark MVU。 摘要:This is a tutorial and survey paper on unification of spectral dimensionality reduction methods, kernel learning by Semidefinite Programming (SDP), Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE), and its variants. We first explain how the spectral dimensionality reduction methods can be unified as kernel Principal Component Analysis (PCA) with different kernels. This unification can be interpreted as eigenfunction learning or representation of kernel in terms of distance matrix. Then, since the spectral methods are unified as kernel PCA, we say let us learn the best kernel for unfolding the manifold of data to its maximum variance. We first briefly introduce kernel learning by SDP for the transduction task. Then, we explain MVU in detail. Various versions of supervised MVU using nearest neighbors graph, by class-wise unfolding, by Fisher criterion, and by colored MVU are explained. We also explain out-of-sample extension of MVU using eigenfunctions and kernel mapping. Finally, we introduce other variants of MVU including action respecting embedding, relaxed MVU, and landmark MVU for big data.
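作为参照,MVU 的标准半定规划形式如下(这是文献中的通行写法,非本文新内容):

```latex
\begin{aligned}
\max_{K}\quad & \operatorname{tr}(K) \\
\text{s.t.}\quad & K \succeq 0,\qquad \textstyle\sum_{i,j} K_{ij} = 0,\\
& K_{ii} - 2K_{ij} + K_{jj} = \lVert x_i - x_j\rVert_2^2,\quad \forall (i,j)\in\mathcal{N},
\end{aligned}
```

其中 $\mathcal{N}$ 为近邻对集合;求解得到的核矩阵 $K$ 再做特征分解即得低维嵌入。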
其他(16篇)
【1】 An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild 标题:一幅图像胜过千言万语:迈向真实场景下的解纠缠
作者:Aviv Gabbay,Niv Cohen,Yedid Hoshen 机构:School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel 备注:Project page: this http URL 链接:https://arxiv.org/abs/2106.15610 摘要:无监督解纠缠已被证明是理论上不可能没有归纳偏见的模型和数据。作为一种替代方法,最近的方法依赖于有限的监督来分离变异因素并允许其可识别性。虽然只有有限数量的观察才需要注释真正的生成因子,但我们认为,列举所有描述真实世界图像分布的变异因子是不可行的。为此,我们提出了一种方法来分离一组只被部分标记的因子,以及分离一组从未被明确指定的互补剩余因子。我们在这一具有挑战性的环境中取得的成功,在合成基准上得到了证明,这使得我们能够利用现成的图像描述符,以最少的手动工作对真实图像域(例如人脸)中的属性子集进行部分注释。具体来说,我们使用最近的语言图像嵌入模型(CLIP)以Zero-Shot的方式注释一组感兴趣的属性,并展示最先进的分离图像处理结果。 摘要:Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, gives rise to leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g. of human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
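用 CLIP 做 Zero-Shot 属性标注的典型用法如下(Python,OpenAI CLIP 官方接口;提示词仅为假设示例):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def annotate_attribute(image_path, prompts=("a person with glasses",
                                            "a person without glasses")):
    """比较图像与一组属性描述文本的相似度,返回最匹配的属性索引。"""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(list(prompts)).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1)
    return int(probs.argmax())
```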
【2】 Framework for an Intelligent Affect Aware Smart Home Environment for Elderly People 标题:面向老年人的智能情感感知智能家居环境框架
作者:Nirmalya Thakur,Chia Y. Han 机构:edu Department of Electrical Engineering and Computer Science University of Cincinnati Cincinnati 备注:None 链接:https://arxiv.org/abs/2106.15599 摘要:在过去的几十年里,老年人的人口一直在快速增长,预计他们的人口在不久的将来还会进一步增加。随着年龄的增长,老年人面临着身体残疾、认知问题、记忆力减退和行为紊乱等问题,这与他们日益增长的需求有关。为了减轻他们在世界经济中的财政负担,提高他们的生活质量,必须开发具有适应性、辅助性和智能性的基于技术的解决方案。智能情感感知系统不仅可以分析,而且可以预测老年人在物联网环境中与技术的日常交互中的行为,具有巨大的潜力,可以作为改善智能家居中老年人用户体验的长期解决方案。因此,这项工作提出了一个老年人智能情感感知环境的框架,不仅可以分析他们互动的情感成分,而且可以预测他们可能的用户体验,甚至在他们开始在给定的智能家居环境中从事任何活动之前。这种对用户体验的预测将为增强用户体验提供空间,从而增强此类智能系统的辅助性和适应性。为了支持这一框架在改善智能家居中老年人生活质量方面的有效性,我们在三个数据集上进行了测试,并对结果进行了介绍和讨论。 摘要:The population of elderly people has been increasing at a rapid rate over the last few decades and their population is expected to further increase in the upcoming future. Their increasing population is associated with their increasing needs due to problems like physical disabilities, cognitive issues, weakened memory and disorganized behavior, that elderly people face with increasing age. To reduce their financial burden on the world economy and to enhance their quality of life, it is essential to develop technology-based solutions that are adaptive, assistive and intelligent in nature. Intelligent Affect Aware Systems that can not only analyze but also predict the behavior of elderly people in the context of their day to day interactions with technology in an IoT-based environment, holds immense potential for serving as a long-term solution for improving the user experience of elderly in smart homes. This work therefore proposes the framework for an Intelligent Affect Aware environment for elderly people that can not only analyze the affective components of their interactions but also predict their likely user experience even before they start engaging in any activity in the given smart home environment. This forecasting of user experience would provide scope for enhancing the same, thereby increasing the assistive and adaptive nature of such intelligent systems. To uphold the efficacy of this proposed framework for improving the quality of life of elderly people in smart homes, it has been tested on three datasets and the results are presented and discussed.
【3】 Evaluation of Automated Image Descriptions for Visually Impaired Students
Authors: Anett Hoppe, David Morris, Ralph Ewerth. Note: None. Link: https://arxiv.org/abs/2106.15553
Abstract: Illustrations are widely used in education, and sometimes, alternatives are not available for visually impaired students. Therefore, those students would benefit greatly from an automatic illustration description system, but only if those descriptions were complete, correct, and easily understandable using a screenreader. In this paper, we report on a study for the assessment of automated image descriptions. We interviewed experts to establish evaluation criteria, which we then used to create an evaluation questionnaire for sighted non-expert raters, and description templates. We used this questionnaire to evaluate the quality of descriptions which could be generated with a template-based automatic image describer. We present evidence that these templates have the potential to generate useful descriptions, and that the questionnaire identifies problems with description templates.
【4】 Study of visual processing techniques for dynamic speckles: a comparative analysis
Authors: Amit Chatterjee, Jitendra Dhanotiya, Vimal Bhatia, Shashi Prakash. Affiliations: Signals and Software Group, Discipline of Electrical Engineering, Indian Institute of Technology Indore, Indore, India; Photonics Laboratory, Department of Electronics and Instrumentation. Note: None. Link: https://arxiv.org/abs/2106.15507
Abstract: The main visual techniques used to obtain information from speckle patterns are the Fujii method, generalized difference, weighted generalized difference, mean windowed difference, structural function (SF), modified SF, etc. In this work, a comparative analysis of major visual techniques for a natural gum sample is carried out. The obtained results conclusively establish the SF-based method as an optimum tool for visual inspection of dynamic speckle data.
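For readers unfamiliar with these descriptors, a minimal NumPy sketch of two of them, the Fujii method and the generalized difference, is given below. The formulas follow the standard dynamic-speckle literature; the random frame stack is a placeholder, and this is not the paper's implementation.

```python
# Minimal sketch of two classic dynamic-speckle activity descriptors
# (illustrative, not the paper's code).
import numpy as np

def fujii(stack):
    """stack: (N, H, W) speckle frame sequence -> activity map (H, W)."""
    stack = stack.astype(np.float64)
    num = np.abs(np.diff(stack, axis=0))   # |I_k - I_{k+1}|
    den = stack[:-1] + stack[1:] + 1e-12   # I_k + I_{k+1}
    return (num / den).sum(axis=0)

def generalized_difference(stack):
    """GD(i, j) = sum over all frame pairs of |I_k - I_l| (quadratic in N)."""
    stack = stack.astype(np.float64)
    n = stack.shape[0]
    gd = np.zeros(stack.shape[1:])
    for k in range(n):
        for l in range(k + 1, n):
            gd += np.abs(stack[k] - stack[l])
    return gd

frames = np.random.rand(32, 64, 64)  # stand-in for a recorded speckle sequence
activity = fujii(frames)
```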
【5】 Improved Padding in CNNs for Quantitative Susceptibility Mapping
Authors: Juan Liu. Link: https://arxiv.org/abs/2106.15331
Abstract: Recently, deep learning methods have been proposed for quantitative susceptibility mapping (QSM) data processing: background field removal, field-to-source inversion, and single-step QSM reconstruction. However, the conventional padding mechanism used in convolutional neural networks (CNNs) can introduce spatial artifacts, especially in QSM background field removal and single-step QSM, which require inference from total fields with extremely large values at the edge boundaries of the volume of interest. To address this issue, we propose an improved padding technique which utilizes the neighboring valid voxels to estimate the invalid voxels of feature maps at volume boundaries in the neural networks. Studies using simulated and in-vivo data show that the proposed padding greatly improves estimation accuracy and reduces artifacts in the results in the tasks of background field removal, field-to-source inversion, and single-step QSM reconstruction.
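The contrast between conventional zero padding and boundary values estimated from valid neighbours can be seen directly in PyTorch. The sketch below uses the built-in `replicate` padding mode as a rough stand-in for the paper's neighbour-based estimate, which operates on feature maps inside the network.

```python
# Contrast of padding modes for a 3D convolution (illustrative; the paper estimates
# boundary voxels from neighbouring valid voxels of the feature maps, for which
# PyTorch's 'replicate' mode is only a rough stand-in).
import torch
import torch.nn as nn

volume = torch.randn(1, 1, 48, 48, 48)  # e.g. a local field map patch

conv_zero = nn.Conv3d(1, 8, kernel_size=3, padding=1, padding_mode="zeros")
conv_repl = nn.Conv3d(1, 8, kernel_size=3, padding=1, padding_mode="replicate")

out_zero = conv_zero(volume)  # zeros leak into boundary responses -> edge artifacts
out_repl = conv_repl(volume)  # borders filled from valid neighbours instead
```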
【6】 Patch-Based Image Restoration using Expectation Propagation
Authors: Dan Yao, Stephen McLaughlin, Yoann Altmann. Affiliations: School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, United Kingdom. Link: https://arxiv.org/abs/2106.15327
Abstract: This paper presents a new Expectation Propagation (EP) framework for image restoration using patch-based prior distributions. While Monte Carlo techniques are classically used to sample from intractable posterior distributions, they can suffer from scalability issues in high-dimensional inference problems such as image restoration. To address this issue, EP is used here to approximate the posterior distributions using products of multivariate Gaussian densities. Moreover, imposing structural constraints on the covariance matrices of these densities allows for greater scalability and distributed computation. While the method is naturally suited to handle additive Gaussian observation noise, it can also be extended to non-Gaussian noise. Experiments conducted for denoising, inpainting and deconvolution problems with Gaussian and Poisson noise illustrate the potential benefits of such a flexible approximate Bayesian method for uncertainty quantification in imaging problems, at a reduced computational cost compared to sampling techniques.
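Underlying such an EP scheme is the fact that Gaussian densities multiply cheaply in information form: precision matrices add, and so do precision-adjusted means. A minimal sketch of this fusion step is shown below; it is illustrative only and omits the paper's patch decomposition and structured covariances.

```python
# Minimal sketch: combining Gaussian factors in information form, the basic
# operation behind EP's product-of-Gaussians posterior approximation.
# Illustrative only; not the paper's patch-based solver.
import numpy as np

def fuse_gaussians(mu_a, cov_a, mu_b, cov_b):
    """Product of two Gaussian densities, up to normalisation."""
    lam_a, lam_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
    lam = lam_a + lam_b                 # precisions add
    eta = lam_a @ mu_a + lam_b @ mu_b   # precision-adjusted means add
    cov = np.linalg.inv(lam)
    return cov @ eta, cov

prior_mu, prior_cov = np.zeros(2), np.eye(2)              # prior approximation
lik_mu, lik_cov = np.array([1.0, 0.5]), 0.1 * np.eye(2)   # Gaussian likelihood factor
post_mu, post_cov = fuse_gaussians(prior_mu, prior_cov, lik_mu, lik_cov)
```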
【7】 Serial-EMD: Fast Empirical Mode Decomposition Method for Multi-dimensional Signals Based on Serialization
Authors: Jin Zhang, Fan Feng, Pere Marti-Puig, Cesar F. Caiafa, Zhe Sun, Feng Duan, Jordi Solé-Casals. Affiliations: College of Computer Science, Nankai University, Tianjin, China; College of Artificial Intelligence, Nankai University, Tianjin, China; Data and Signal Processing Group, University of Vic—Central University of Catalonia, Vic, Catalonia, Spain. Note: 19 pages, 17 figures. Link: https://arxiv.org/abs/2106.15319
Abstract: Empirical mode decomposition (EMD) has developed into a prominent tool for adaptive, scale-based signal analysis in various fields like robotics, security and biomedical engineering. Since the dramatic increase in the amount of data puts forward higher requirements for the capability of real-time signal analysis, it is difficult for existing EMD and its variants to trade off the growth of data dimension and the speed of signal analysis. In order to decompose multi-dimensional signals at a faster speed, we present a novel signal-serialization method (serial-EMD), which concatenates multi-variate or multi-dimensional signals into a one-dimensional signal and uses various one-dimensional EMD algorithms to decompose it. To verify the effects of the proposed method, synthetic multi-variate time series, artificial 2D images with various textures and real-world facial images are tested. Compared with existing multi-EMD algorithms, the decomposition time becomes significantly reduced. In addition, the results of facial recognition with Intrinsic Mode Functions (IMFs) extracted using our method can achieve a higher accuracy than those obtained by existing multi-EMD algorithms, which demonstrates the superior performance of our method in terms of the quality of IMFs. Furthermore, this method can provide a new perspective to optimize the existing EMD algorithms, that is, transforming the structure of the input signal rather than being constrained by developing envelope computation techniques or signal decomposition methods. In summary, the study suggests that the serial-EMD technique is a highly competitive and fast alternative for multi-dimensional signal analysis.
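The serialization idea is straightforward to prototype: flatten the signal so neighbouring samples stay contiguous, run any off-the-shelf 1D EMD, and fold the resulting IMFs back. The sketch below uses the PyEMD package and a serpentine row ordering; the ordering is an assumed choice, not necessarily the paper's.

```python
# Sketch of serial-EMD: serialize a 2D image to 1D, decompose with a 1D EMD,
# reshape each IMF back. Uses the PyEMD package (pip install EMD-signal);
# the serpentine scan keeps adjacent pixels contiguous across row boundaries.
import numpy as np
from PyEMD import EMD

def serialize(img):
    s = img.copy()
    s[1::2] = s[1::2, ::-1]  # reverse every other row (serpentine scan)
    return s.ravel()

def deserialize(sig, shape):
    s = sig.reshape(shape).copy()
    s[1::2] = s[1::2, ::-1]  # undo the serpentine ordering
    return s

img = np.random.rand(64, 64)                          # stand-in for a 2D signal
imfs_1d = EMD()(serialize(img))                       # (n_imfs, H * W)
imfs_2d = [deserialize(m, img.shape) for m in imfs_1d]
```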
【8】 Analysing Affective Behavior in the second ABAW2 Competition
Authors: Dimitrios Kollias, Irene Kotsia, Elnar Hajiyev, Stefanos Zafeiriou. Affiliations: University of Greenwich, UK; Middlesex University London, UK; Realeyes; Imperial College London, UK. Link: https://arxiv.org/abs/2106.15318
Abstract: The Affective Behavior Analysis in-the-wild (ABAW2) 2021 Competition is the second Competition that aims at automatically analyzing affect, following the first very successful ABAW Competition held in conjunction with IEEE FG 2020. ABAW2 is split into three Challenges, each one addressing one of the three main behavior tasks of valence-arousal estimation, basic expression classification and action unit detection. All three Challenges are based on a common benchmark database, Aff-Wild2, which is a large scale in-the-wild database and the first one to be annotated for all these three tasks. In this paper, we describe this Competition, to be held in conjunction with ICCV 2021. We present the three Challenges, with the utilized Competition corpora. We outline the evaluation metrics and present the baseline system with its results. More information regarding the Competition is provided in the Competition site: https://ibug.doc.ic.ac.uk/resources/iccv-2021-2nd-abaw.
【9】 Artificial Intelligence in Minimally Invasive Interventional Treatment
Authors: Daniel Ruijters. Affiliations: Philips Healthcare, Image Guided Therapy Systems Innovation, the Netherlands; Technische Universiteit Eindhoven, Dept. Electrical Engineering, Eindhoven. Link: https://arxiv.org/abs/2106.15306
Abstract: Minimally invasive image guided treatment procedures often employ advanced image processing algorithms. The recent developments of artificial intelligence algorithms harbor potential to further enhance this domain. In this article we explore several application areas within the minimally invasive treatment space and discuss the deployment of artificial intelligence within these areas.
【10】 Convolutional Sparse Coding Fast Approximation with Application to Seismic Reflectivity Estimation
Authors: Deborah Pereg, Israel Cohen, Anthony A. Vassiliou. Link: https://arxiv.org/abs/2106.15296
Abstract: In sparse coding, we attempt to extract features of input vectors, assuming that the data is inherently structured as a sparse superposition of basic building blocks. Similarly, neural networks perform a given task by learning features of the training data set. Recently both data-driven and model-driven feature extracting methods have become extremely popular and have achieved remarkable results. Nevertheless, practical implementations are often too slow to be employed in real-life scenarios, especially for real-time applications. We propose an accelerated, upgraded version of the classic iterative thresholding algorithm that produces a good approximation of the convolutional sparse code within 2-5 iterations. The speed advantage is gained mostly from the observation that most solvers are slowed down by inefficient global thresholding. The main idea is to normalize each data point by the local receptive field energy, before applying a threshold. This way, the natural inclination towards strong feature expressions is suppressed, so that one can rely on a global threshold that can be easily approximated, or learned during training. The proposed algorithm can be employed with a known predetermined dictionary, or with a trained dictionary. The trained version is implemented as a neural net designed as the unfolding of the proposed solver. The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios. We also provide theoretical guarantees for stable support recovery. Namely, we prove that under certain conditions the true support is perfectly recovered within the first iteration.
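The key normalization step can be sketched in a few lines of NumPy: divide each sample by its local receptive-field energy, apply one global soft threshold, and restore the scale. The window length and threshold below are illustrative values, and the sketch stands in for one iteration of the paper's solver, not the full unfolded network.

```python
# Sketch of locally normalized soft thresholding: divide by local receptive-field
# energy, apply one global threshold, undo the normalization. Window size and
# lambda are illustrative choices, not the paper's settings.
import numpy as np

def local_energy(x, win=21):
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(x**2, kernel, mode="same") + 1e-12)

def normalized_soft_threshold(x, lam=0.1, win=21):
    e = local_energy(x, win)
    z = x / e                                           # suppress strong-feature bias
    z = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # one global soft threshold
    return z * e                                        # restore original scale

x = np.random.randn(1024) * np.linspace(0.1, 2.0, 1024)  # signal with varying local energy
sparse_code = normalized_soft_threshold(x)
```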
【11】 Using Robust Regression to Find Font Usage Trends
Authors: Kaigen Tsuji, Daichi Haraguchi, Seiichi Uchida, Brian Kenji Iwana. Affiliations: Kyushu University, Fukuoka, Japan. Note: 16 pages with 10 figures. Accepted at ICDAR 2021 Workshop on Machine Learning (WML 2021, 3rd edition). Link: https://arxiv.org/abs/2106.15232
Abstract: Fonts have had trends throughout their history, not only in when they were invented but also in their usage and popularity. In this paper, we attempt to specifically find the trends in font usage using robust regression on a large collection of text images. We utilize movie posters as the source of fonts for this task because movie posters can represent time periods by using their release date. In addition, movie posters are documents that are carefully designed and represent a wide range of fonts. To understand the relationship between the fonts of movie posters and time, we use a regression Convolutional Neural Network (CNN) to estimate the release year of a movie using an isolated title text image. Due to the difficulty of the task, we propose the use of a hybrid training regimen that uses a combination of Mean Squared Error (MSE) and Tukey's biweight loss. Furthermore, we perform a thorough analysis on the trends of fonts through time.
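The hybrid objective is easy to state in PyTorch. In the sketch below, the tuning constant c = 4.685 is the conventional default for Tukey's biweight and the mixing weight alpha is an assumption; the paper's exact combination scheme may differ.

```python
# Sketch of a hybrid MSE + Tukey's biweight regression loss. c = 4.685 is the
# standard robust-statistics tuning constant; alpha is an assumed mixing weight.
import torch

def tukey_biweight(residual, c=4.685):
    r = residual.abs()
    inlier = (1 - (1 - (r / c) ** 2) ** 3) * (c**2 / 6)   # bounded penalty inside |r| <= c
    return torch.where(r <= c, inlier, torch.full_like(r, c**2 / 6))

def hybrid_loss(pred, target, alpha=0.5):
    residual = pred - target
    return (alpha * residual**2 + (1 - alpha) * tukey_biweight(residual)).mean()

pred = torch.randn(8, requires_grad=True)  # predicted (normalized) release years
target = torch.randn(8)
loss = hybrid_loss(pred, target)
loss.backward()
```

The biweight term caps the penalty for gross outliers (mislabeled or atypical posters), while the MSE term keeps gradients informative for inliers.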
【12】 Wrong Colored Vermeer: Color-Symmetric Image Distortion
Authors: Hendrik Richter. Affiliations: HTWK Leipzig University of Applied Sciences, Leipzig, Germany. Link: https://arxiv.org/abs/2106.15179
Abstract: Color symmetry implies that the colors of geometrical objects are assigned according to their symmetry properties. It is defined by associating the elements of the symmetry group with a color permutation. I use this concept for generative art and apply symmetry-consistent color distortions to images of paintings by Johannes Vermeer. The color permutations are realized as mappings of the HSV color space onto itself.
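As a small worked example of such a mapping, the sketch below applies a cyclic hue shift, an order-n permutation of the hue circle, uniformly to an image, so that the recolouring is consistent across the whole picture. The choice n = 4 is illustrative, not taken from the paper.

```python
# Sketch: a cyclic hue permutation as a mapping of HSV space onto itself.
# Shifting hue by k/n for k = 0..n-1 realizes an order-n cyclic colour symmetry.
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def hue_permutation(rgb, k, n):
    """rgb: float image in [0, 1], shape (H, W, 3); returns the k-th recoloured copy."""
    hsv = rgb_to_hsv(rgb)
    hsv[..., 0] = (hsv[..., 0] + k / n) % 1.0  # permute the hue circle
    return hsv_to_rgb(hsv)

img = np.random.rand(128, 128, 3)  # stand-in for a painting scan
variants = [hue_permutation(img, k, n=4) for k in range(4)]
```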
【13】 TUCaN: Progressively Teaching Colourisation to Capsules
Authors: Rita Pucci, Niki Martinel. Affiliations: Machine Learning and Perception Lab, University of Udine, Udine. Link: https://arxiv.org/abs/2106.15176
Abstract: Automatic image colourisation is the computer vision research path that studies how to colourise greyscale images (for restoration). Deep learning techniques have improved image colourisation, yielding astonishing results. These differ by various factors, such as structural differences, input types, user assistance, etc. Most of them base their architectural structure on convolutional layers with no emphasis on layers specialised in object feature extraction. We introduce a novel downsampling-upsampling architecture named TUCaN (Tiny UCapsNet) that exploits the collaboration of convolutional layers and capsule layers to obtain a neat colourisation of entities present in every single image. This is obtained by enforcing collaboration among such layers by skip and residual connections. We pose the problem as a per-pixel colour classification task that identifies colours as a bin in a quantized space. To train the network, in contrast with the standard end-to-end learning method, we propose a progressive learning scheme to extract the context of objects by only manipulating the learning process without changing the model. In this scheme, the upsampling starts from the reconstruction of low resolution images and progressively grows to high resolution images throughout the training phase. Experimental results on three benchmark datasets show that our approach with the ImageNet10k dataset outperforms existing methods on standard quality metrics and achieves state-of-the-art performance on image colourisation. We performed a user study to quantify the perceptual realism of the colourisation results, demonstrating that progressive learning lets TUCaN achieve better colours than the end-to-end scheme, and pointing out the limitations of the existing evaluation metrics.
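To make the per-pixel classification formulation concrete, the sketch below builds integer bin targets by uniformly quantizing the a*b* chroma plane of an image; the 16 x 16 grid and the clipped value range are assumptions, not TUCaN's actual binning.

```python
# Sketch: per-pixel colour-classification targets via uniform quantization of the
# a*b* chroma plane. The 16 x 16 grid and the [-110, 110] clipping are assumed.
import numpy as np
from skimage.color import rgb2lab

def ab_bin_targets(rgb, bins=16):
    """rgb in [0, 1], shape (H, W, 3) -> integer bin index per pixel, shape (H, W)."""
    ab = rgb2lab(rgb)[..., 1:]                                    # drop L, keep (a*, b*)
    idx = np.clip(((ab + 110.0) / 220.0 * bins).astype(int), 0, bins - 1)
    return idx[..., 0] * bins + idx[..., 1]                       # flatten grid to one class id

img = np.random.rand(64, 64, 3)
targets = ab_bin_targets(img)  # per-pixel class labels for a cross-entropy loss
```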
【14】 An End-to-End Autofocus Camera for Iris on the Move
Authors: Leyuan Wang, Kunbo Zhang, Yunlong Wang, Zhenan Sun. Affiliations: School of Artificial Intelligence, UCAS; Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, CASIA. Note: 8 pages, 7 figures, International Joint Conference on Biometrics 2021. Link: https://arxiv.org/abs/2106.15069
Abstract: For distant iris recognition, a long focal length lens is generally used to ensure the resolution of iris images, which reduces the depth of field and leads to potential defocus blur. To accommodate users at different distances, it is necessary to control focus quickly and accurately, and for users in motion, it is expected to maintain the correct focus on the iris area continuously. In this paper, we introduce a novel rapid autofocus camera for active refocusing of the iris area of moving objects using a focus-tunable lens. Our end-to-end computational algorithm can predict the best focus position from one single blurred image and generate a lens diopter control signal automatically. This scene-based active manipulation method enables real-time focus tracking of the iris area of a moving object. We built a testing bench to collect real-world focal stacks for evaluation of the autofocus methods. Our camera has reached an autofocus speed of over 50 fps. The results demonstrate the advantages of our proposed camera for biometric perception in static and dynamic scenes. The code is available at https://github.com/Debatrix/AquulaCam.
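The control loop described above reduces to a regression network mapping one blurred crop to a scalar diopter command. The toy model below shows the shape of such a predictor; the architecture and layer sizes are assumptions for illustration and bear no relation to the authors' released network.

```python
# Minimal sketch of a focus-control regressor: one blurred iris crop in, one
# scalar diopter command out. Layer sizes are illustrative, not the paper's model.
import torch
import torch.nn as nn

class FocusRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # predicted diopter command for the tunable lens

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

blurred = torch.randn(1, 1, 128, 128)    # a single defocused iris crop
diopter_cmd = FocusRegressor()(blurred)  # signal to drive the focus-tunable lens
```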
【15】 How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures
Authors: Stylianos I. Venieris, Ioannis Panopoulos, Ilias Leontiadis, Iakovos S. Venieris. Affiliations: Samsung AI Center, Cambridge, UK; National Technical University of Athens, Athens, Greece. Note: Invited paper at the 32nd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2021. Link: https://arxiv.org/abs/2106.15021
Abstract: The unprecedented performance of deep neural networks (DNNs) has led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition. Nevertheless, deploying such AI models across commodity devices faces significant challenges: large computational cost, multiple performance objectives, hardware heterogeneity and a common need for high accuracy, together pose critical problems to the deployment of DNNs across the various embedded and mobile devices in the wild. As such, we have yet to witness the mainstream usage of state-of-the-art deep learning algorithms across consumer devices. In this paper, we provide preliminary answers to this potentially game-changing question by presenting an array of design techniques for efficient AI systems. We start by examining the major roadblocks when targeting both programmable processors and custom accelerators. Then, we present diverse methods for achieving real-time performance following a cross-stack approach. These span model-, system- and hardware-level techniques, and their combination. Our findings provide illustrative examples of AI systems that do not overburden mobile hardware, while also indicating how they can improve inference accuracy. Moreover, we showcase how custom ASIC- and FPGA-based accelerators can be an enabling factor for next-generation AI applications, such as multi-DNN systems. Collectively, these results highlight the critical need for further exploration as to how the various cross-stack solutions can be best combined in order to bring the latest advances in deep learning close to users, in a robust and efficient manner.
【16】 IREM: High-Resolution Magnetic Resonance (MR) Image Reconstruction via Implicit Neural Representation
Authors: Qing Wu, Yuwei Li, Lan Xu, Ruiming Feng, Hongjiang Wei, Qing Yang, Boliang Yu, Xiaozhao Liu, Jingyi Yu, Yuyao Zhang. Affiliations: School of Information Science and Technology, ShanghaiTech University, Shanghai, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; Institute of Brain-Intelligence Technology, Zhangjiang Laboratory. Note: 8 pages, 6 figures, conference. Link: https://arxiv.org/abs/2106.15097
Abstract: For collecting high-quality high-resolution (HR) MR images, we propose a novel image reconstruction network named IREM, which is trained on multiple low-resolution (LR) MR images and achieves an arbitrary up-sampling rate for HR image reconstruction. In this work, we suppose the desired HR image to be an implicit continuous function of the 3D image spatial coordinate, and the thick-slice LR images to be several sparse discrete samplings of this function. Then the super-resolution (SR) task is to learn the continuous volumetric function from limited observations using a fully-connected neural network combined with Fourier feature positional encoding. By simply minimizing the error between the network prediction and the acquired LR image intensity across each imaging plane, IREM is trained to represent a continuous model of the observed tissue anatomy. Experimental results indicate that IREM succeeds in representing high frequency image features, and in real scene data collection, IREM reduces scan time and achieves high-quality high-resolution MR imaging in terms of SNR and local image detail.
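The representation at the core of this approach, an MLP over Fourier-feature-encoded coordinates, is compact enough to sketch. Below, coordinates are projected by a random Gaussian matrix B and passed through sin/cos before a small fully-connected network; the feature count, the scale of B and the layer widths are assumed values, not the paper's configuration.

```python
# Sketch of an implicit volume: Fourier feature positional encoding of (x, y, z)
# followed by an MLP predicting intensity. Feature count, scale of B and layer
# widths are illustrative choices.
import torch
import torch.nn as nn

class ImplicitVolume(nn.Module):
    def __init__(self, n_feats=128, scale=10.0, hidden=256):
        super().__init__()
        self.register_buffer("B", torch.randn(3, n_feats) * scale)  # random projection
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                    # xyz: (N, 3) coordinates in [0, 1]
        proj = 2 * torch.pi * xyz @ self.B
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)                 # predicted intensity at each point

coords = torch.rand(1024, 3)                   # sample points on the acquired LR slices
intensity = ImplicitVolume()(coords)           # fit by MSE against the observed voxels
```

Training amounts to regressing the network output against the LR voxel intensities at their known 3D locations; the HR volume is then obtained by querying the fitted function on an arbitrarily dense grid.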