q-fin (Quantitative Finance), 4 papers in total
cs.SD (Sound), 7 papers in total
eess.AS (Audio and Speech Processing), 7 papers in total
1. q-fin (Quantitative Finance):
【1】 MobilityCoins -- A new currency for the multimodal urban transportation system
作者:Klaus Bogenberger, Philipp Blum, Florian Dandl, Lisa-Sophie Hamm, Allister Loder, Patrick Malcom, Martin Margreiter, Natalie Sautter
Affiliation: Chair of Traffic Engineering and Control, Department of Mobility Systems Engineering, School of Engineering and Design, Technical University of Munich, Germany
Link: https://arxiv.org/abs/2107.13441
Abstract: The MobilityCoin is a new, all-encompassing currency for the management of the multimodal urban transportation system. MobilityCoins include and replace various existing transport policy instruments while also incentivizing a shift to more sustainable modes and empowering the public to vote for infrastructure measures.
【2】 Renewable Energy Targets and Unintended Storage Cycling: Implications for Energy Modeling
作者:Martin Kittel, Wolf-Peter Schill
Affiliation: DIW Berlin, Department of Energy, Transportation, Environment, Mohrenstraße, Berlin, Germany
Link: https://arxiv.org/abs/2107.13380
Abstract: To decarbonize the economy, many governments have set targets for the use of renewable energy sources. These are often formulated as relative shares of electricity demand or supply. Implementing the respective constraints in energy models is a surprisingly delicate issue: they may cause a modeling artifact of excessive electricity storage use. We introduce this phenomenon as "unintended storage cycling", which can be detected in the case of simultaneous storage charging and discharging. In this paper, we provide an analytical representation of different approaches for implementing minimum renewable share constraints in models, and show how these may lead to unintended storage cycling. Using a parsimonious optimization model, we quantify the related distortions of optimal dispatch and investment decisions as well as market prices, and identify important drivers of the phenomenon. Finally, we provide recommendations on how to avoid the distorting effects of unintended storage cycling in energy modeling.
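As a rough illustration of the detection criterion described above, the sketch below (with made-up hourly dispatch numbers and a demand-based minimum renewable share) flags time steps in which a storage unit charges and discharges simultaneously; it is an editor's illustration, not the paper's model.

```python
import numpy as np

# Hypothetical hourly dispatch results from an energy model (MWh per hour).
renewable_gen = np.array([40.0, 55.0, 30.0, 80.0])
demand        = np.array([60.0, 60.0, 60.0, 60.0])
storage_in    = np.array([ 0.0, 10.0,  5.0, 20.0])   # charging
storage_out   = np.array([ 5.0,  0.0,  5.0,  0.0])   # discharging

# One demand-side formulation of a minimum renewable share constraint:
# sum(renewable generation) >= share * sum(demand).
min_share = 0.6
share_met = renewable_gen.sum() >= min_share * demand.sum()

# Unintended storage cycling shows up wherever the solution charges and
# discharges the same storage within the same time step.
tol = 1e-6
cycling_hours = np.where((storage_in > tol) & (storage_out > tol))[0]
print(f"share constraint satisfied: {share_met}; "
      f"simultaneous charge/discharge in hours {cycling_hours.tolist()}")
```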
【3】 Combining Machine Learning Classifiers for Stock Trading with Effective Feature Extraction
作者:A. K. M. Amanat Ullah, Fahim Imtiaz, Miftah Uddin Md Ihsan, Md. Golam Rabiul Alam, Mahbub Majumdar
Affiliation: Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Link: https://arxiv.org/abs/2107.13148
Abstract: The unpredictability and volatility of the stock market render it challenging to make a substantial profit using any generalized scheme. This paper discusses our machine learning model, which can make a significant amount of profit in the US stock market by performing live trading on the Quantopian platform while using resources free of cost. Our top approach uses ensemble learning with four classifiers (Gaussian Naive Bayes, Decision Tree, Logistic Regression with L1 regularization, and Stochastic Gradient Descent) to decide whether to go long or short on a particular stock. Our best model, trading daily between July 2011 and January 2019, generated a 54.35% profit. Finally, our work shows that mixtures of weighted classifiers perform better than any individual predictor at making trading decisions in the stock market.
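The classifier combination named in the abstract can be sketched with scikit-learn's VotingClassifier; the toy features, labels and voting weights below are placeholders chosen for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (e.g. daily technical indicators) and long/short labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = long, 0 = short

# Weighted hard-voting mixture of the four classifiers named in the abstract;
# the weights are placeholders, not the values used in the paper.
ensemble = VotingClassifier(
    estimators=[
        ("gnb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("l1_lr", make_pipeline(StandardScaler(),
                                LogisticRegression(penalty="l1", solver="liblinear"))),
        ("sgd", make_pipeline(StandardScaler(), SGDClassifier(loss="hinge"))),
    ],
    voting="hard",
    weights=[1, 1, 2, 1],
)
ensemble.fit(X[:400], y[:400])
print("Held-out accuracy:", ensemble.score(X[400:], y[400:]))
```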
【4】 Towards an enabling environment for social accountability in Bangladesh
作者:Hossain Ahmed Taufiq
Affiliation: Oregon State University; Independent University Bangladesh
Note: Proceedings: Conference on 'Accountability in Bangladesh: Issues & Debates', Global Studies and Governance Program, Independent University Bangladesh (IUB), 29 March 2018
Link: https://arxiv.org/abs/2107.13128
Abstract: Social accountability refers to promoting good governance by making ruling elites more responsive. In Bangladesh, where the bureaucracy and legislature operate with little effective accountability or checks and balances, traditional horizontal or vertical accountability has proved to be very blunt and weak. In the presence of such faulty mechanisms, ordinary citizens' access to information is frequently denied, and their voices are kept mute. This impedes the formation of an enabling environment: activists and civil society institutions representing ordinary people's interests are actively discouraged and become vulnerable to retribution. Social accountability, on the other hand, provides an enabling environment for activists and civil society institutions to operate freely, so that leaders and the administration become more accountable to the people. An enabling environment means providing legal protection, enhancing the availability of information, increasing citizens' voice, strengthening institutional and public service capacities, and directing incentives that foster accountability. Donors allocate significant shares of resources to encouraging civil society to partner with elites rather than holding them accountable. This paper advocates for a stronger legal environment to protect critical civil society and whistle-blowers, and for independent grant-makers tasked with building strong, self-regulating social accountability institutions. Keywords: Accountability, Legal Protection, Efficiency, Civil Society, Responsiveness
2. cs.SD (Sound):
【1】 On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-level Perceptual Features
作者:Shreyan Chowdhury, Gerhard Widmer
Affiliation: Institute of Computational Perception, Johannes Kepler University Linz, Austria; LIT AI Lab, Linz Institute of Technology, Austria
Note: In Proceedings of the 22nd International Society for Music Information Retrieval (ISMIR) Conference, Online, 2021
Link: https://arxiv.org/abs/2107.13231
Abstract: Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features for their effectiveness in predicting the arousal and valence of six different performances (by six famous pianists) of Bach's Well-Tempered Clavier Book 1. These features include low-level acoustic features, score-based features, features extracted using a pre-trained emotion model, and Mid-level perceptual features. We compare their predictive power by evaluating them on several experiments designed to test performance-wise or piece-wise variation of emotion. We find that Mid-level features contribute significantly to the performance-wise variation of both arousal and valence, even better than the pre-trained emotion model. Our findings add to the evidence that Mid-level perceptual features are an important representation of musical attributes for several tasks, and specifically, in this case, for capturing the expressive aspects of music that manifest as the perceived emotion of a musical performance.
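A minimal sketch of the kind of feature-set comparison the abstract describes, using ridge regression and cross-validated R^2 on placeholder data; the actual features, annotation targets and models used in the paper differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder feature sets for N performance excerpts (the paper compares
# low-level acoustic, score-based, emotion-model and mid-level features).
rng = np.random.default_rng(0)
n = 120
feature_sets = {
    "low_level": rng.normal(size=(n, 60)),
    "mid_level": rng.normal(size=(n, 7)),   # a small set of perceptual ratings
}
arousal = rng.normal(size=n)                # placeholder annotation targets

# Compare the predictive power of each feature set with cross-validated R^2.
for name, X in feature_sets.items():
    r2 = cross_val_score(Ridge(alpha=1.0), X, arousal, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {r2.mean():.3f}")
```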
【2】 Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
作者:Javier Naranjo-Alcazar, Sergi Perez-Castanos, Aaron Lopez-Garcia, Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri
Affiliation: Universitat de Valencia
Link: https://arxiv.org/abs/2107.13180
Abstract: The use of multiple, semantically correlated sources can provide complementary information that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks which are trained in isolation on audio and visual data, respectively, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the information from the audio and visual streams is fused at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method using the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains synchronized audio and video recordings from 12 European cities in 10 different scene classes. The proposed model provides an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters) in the evaluation results of the DCASE 2021 Challenge.
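A simplified PyTorch sketch of the two-stage fusion idea (early fusion of per-time-step features into a bidirectional recurrent layer, then late fusion with the subnetworks' own predictions). All dimensions, the pooling, and the averaging rule are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Illustrative early + late fusion head; the audio/visual backbones and
    all dimensions are placeholders, not the paper's configuration."""
    def __init__(self, audio_dim=512, visual_dim=512, hidden=128, n_classes=10):
        super().__init__()
        # Early fusion: concatenated per-time-step features -> bidirectional GRU.
        self.rnn = nn.GRU(audio_dim + visual_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.early_head = nn.Linear(2 * hidden, n_classes)
        # Heads standing in for each subnetwork's own independent prediction.
        self.audio_head = nn.Linear(audio_dim, n_classes)
        self.visual_head = nn.Linear(visual_dim, n_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (batch, time, dim) from the last conv blocks.
        fused, _ = self.rnn(torch.cat([audio_feats, visual_feats], dim=-1))
        early_logits = self.early_head(fused.mean(dim=1))
        # Late fusion: average the early-fusion output with the two independent
        # predictions (the paper's exact combination rule may differ).
        audio_logits = self.audio_head(audio_feats.mean(dim=1))
        visual_logits = self.visual_head(visual_feats.mean(dim=1))
        return (early_logits + audio_logits + visual_logits) / 3

model = TwoStageFusion()
scores = model(torch.randn(2, 20, 512), torch.randn(2, 20, 512))
print(scores.shape)  # torch.Size([2, 10])
```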
【3】 CycleGAN-based Non-parallel Speech Enhancement with an Adaptive Attention-in-attention Mechanism
作者:Guochen Yu, Yutian Wang, Chengshi Zheng, Hui Wang, Qin Zhang
Affiliation: State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Note: Submitted to APSIPA-ASC 2021
Link: https://arxiv.org/abs/2107.13143
Abstract: Non-parallel training is a difficult but essential task for DNN-based speech enhancement methods, owing to the lack of adequate noisy and paired clean speech corpora in many real scenarios. In this paper, we propose a novel adaptive attention-in-attention CycleGAN (AIA-CycleGAN) for non-parallel speech enhancement. In previous CycleGAN-based non-parallel speech enhancement methods, the limited mapping ability of the generator may cause performance degradation and insufficient feature learning. To alleviate this degradation, we propose an integration of adaptive time-frequency attention (ATFA) and adaptive hierarchical attention (AHA) to form an attention-in-attention (AIA) module for more flexible feature learning during the mapping procedure. More specifically, ATFA can capture long-range temporal-spectral contextual information for more effective feature representations, while AHA can flexibly aggregate different intermediate feature maps by weights depending on the global context. Extensive experimental results demonstrate that the proposed approach consistently outperforms previous GAN-based and CycleGAN-based methods in non-parallel training. Moreover, experiments in parallel training verify that the proposed AIA-CycleGAN also outperforms most advanced GAN-based speech enhancement approaches, especially in maintaining speech integrity and reducing speech distortion.
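One possible reading of the hierarchical-attention idea sketched below: intermediate feature maps are pooled to a global context, scored, and aggregated with softmax weights. This is an editor's illustration only, not the authors' AHA implementation.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Illustrative aggregation of intermediate feature maps with
    context-dependent weights; not the paper's exact AHA module."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, feature_maps):
        # feature_maps: list of (batch, channels, freq, time) tensors of equal shape.
        stacked = torch.stack(feature_maps, dim=1)           # (B, N, C, F, T)
        context = stacked.mean(dim=(3, 4))                   # global average pool -> (B, N, C)
        weights = torch.softmax(self.score(context), dim=1)  # one weight per map, (B, N, 1)
        return (stacked * weights[..., None, None]).sum(dim=1)  # weighted sum -> (B, C, F, T)

aha = HierarchicalAttention(channels=64)
maps = [torch.randn(2, 64, 32, 100) for _ in range(4)]
print(aha(maps).shape)  # torch.Size([2, 64, 32, 100])
```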
【4】 Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition
作者:Samuel Kessler, Bethan Thomas, Salah Karout
Affiliation: University of Oxford; Huawei R&D Cambridge
Note: 11 pages, 9 figures including references and appendix. Accepted at ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Link: https://arxiv.org/abs/2107.13530
Abstract: We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL) and applying these for automatic speech recognition. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and finetuning on small annotated datasets is a promising direction for building speech recognition systems. Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data. SSL models have produced state-of-the-art results for ASR. However, these models are very expensive to pretrain with self-supervision. We tackle the problem of learning new language representations continually from audio without forgetting a previous language representation. We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining on a new language task. Our continual-wav2vec2 model can decrease pretraining times by 32% when learning a new language task, and learn this new audio-language representation without forgetting previous language representations.
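The knowledge-transfer idea can be caricatured as warm-starting pretraining on the new language from the previous language's checkpoint rather than from random weights. The generic PyTorch sketch below uses placeholder names (build_wav2vec2, the checkpoint path, and the assumption that the model returns its self-supervised loss); it is not the authors' implementation.

```python
import torch

def continue_pretraining(build_wav2vec2, prev_ckpt_path, new_language_loader, steps=1000):
    """Warm-start self-supervised pretraining on a new language from the
    checkpoint obtained on the previous language (all names are placeholders)."""
    model = build_wav2vec2()
    model.load_state_dict(torch.load(prev_ckpt_path))   # transfer previous-task knowledge
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _, batch in zip(range(steps), new_language_loader):
        loss = model(batch)          # assumed to return the self-supervised loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```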
【5】 Vowel-based Meeteilon dialect identification using a Random Forest classifier
作者:Thangjam Clarinda Devi, Kabita Thaoroijam
Affiliation: Department of Computer Science and Engineering, Indian Institute of Information Technology Manipur, Manipur, India
Note: 5 pages, double column, 8 figures, 1 table. Already presented as a poster at OCOCOSDA 2020 but not yet published
Link: https://arxiv.org/abs/2107.13419
Abstract: This paper presents a vowel-based dialect identification system for Meeteilon. For this work, a vowel dataset is created using the Meeteilon Speech Corpora available at the Linguistic Data Consortium for Indian Languages (LDC-IL). Spectral features such as formant frequencies (F1, F2 and F3) and prosodic features such as pitch (F0), energy, intensity and segment duration values are extracted from monophthong vowel sounds. A random forest classifier, a decision tree-based ensemble algorithm, is used for classification of three major dialects of Meeteilon, namely Imphal, Kakching and Sekmai. The model shows an average dialect identification accuracy of around 61.57%. Spectral and prosodic features are found to play a significant role in Meeteilon dialect classification.
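A minimal scikit-learn sketch of the setup described above, with a placeholder feature table (formants, pitch, energy, intensity, duration) and random labels standing in for the three dialects; the real corpus and features come from LDC-IL.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder vowel-level feature table: F1, F2, F3, F0 (pitch),
# energy, intensity and segment duration, as listed in the abstract.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = rng.integers(0, 3, size=300)   # 0 = Imphal, 1 = Kakching, 2 = Sekmai

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean dialect-identification accuracy: {acc.mean():.3f}")
```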
【6】 Deep learning based cough detection camera using enhanced features
作者:Gyeong-Tae Lee, Hyeonuk Nam, Seong-Hu Kim, Sang-Min Choi, Youngkey Kim, Yong-Hwa Park
Affiliation: Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea; SM Instruments Inc., Daejeon, South Korea
Note: Submitted to Expert Systems With Applications
Link: https://arxiv.org/abs/2107.13260
Abstract: Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of cough sounds. The cough detection model is a binary classifier whose input is a two-second acoustic feature and whose output is one of two inferences (Cough or Others). Data augmentation was performed on the collected audio files to alleviate class imbalance and reflect various background noises in practical environments. For effective featuring of the cough sound, conventional features such as spectrograms, mel-scaled spectrograms, and mel-frequency cepstral coefficients (MFCC) were reinforced by utilizing their velocity (V) and acceleration (A) maps. VGGNet, GoogLeNet, and ResNet were simplified to binary classifiers and named V-net, G-net, and R-net, respectively. To find the best combination of features and networks, training was performed for a total of 39 cases and the performance was confirmed using the test F1 score. Finally, a test F1 score of 91.9% (test accuracy of 97.2%) was achieved by G-net with the MFCC-V-A feature (named Spectroflow), an acoustic feature effective for use in cough detection. The trained cough detection model was integrated with a sound camera (i.e., one that visualizes sound sources using a beamforming microphone array). In a pilot test, the cough detection camera detected coughing sounds with an F1 score of 90.0% (accuracy of 96.0%), and the cough location in the camera image was tracked in real time.
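The velocity and acceleration maps of MFCC mentioned above are commonly computed as first- and second-order deltas; the librosa sketch below uses a placeholder file path and illustrative parameters, and the exact stacking may differ from the paper's MFCC-V-A feature.

```python
import numpy as np
import librosa

# Load a two-second clip (path and sample rate are placeholders).
y, sr = librosa.load("cough_clip.wav", sr=16000, duration=2.0)

# MFCC plus its first (velocity) and second (acceleration) derivatives.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
velocity = librosa.feature.delta(mfcc, order=1)
acceleration = librosa.feature.delta(mfcc, order=2)

# Stack into a 3-channel, image-like input for a CNN binary classifier.
feature = np.stack([mfcc, velocity, acceleration], axis=0)
print(feature.shape)  # (3, 40, n_frames)
```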
【7】 A Visual Domain Transfer Learning Approach for Heartbeat Sound Classification
作者:Uddipan Mukherjee, Sidharth Pancholi
Link: https://arxiv.org/abs/2107.13237
Abstract: Heart disease is the most common cause of human mortality, accounting for almost one-third of deaths throughout the world. Detecting the disease early increases a patient's chances of survival, and there are several ways the signs of heart disease can be detected early. This research proposes converting cleansed and normalized heart sounds into visual mel-scale spectrograms and then using visual domain transfer learning approaches to automatically extract features and categorize heart sounds. Some previous studies found that the spectrograms of various types of heart sounds are visually distinguishable to human eyes, which motivated this study to experiment with visual domain classification approaches for automated heart sound classification. Convolutional neural network-based architectures, e.g. ResNet and MobileNetV2, are used as automated feature extractors from the spectrograms. These well-accepted models from the image domain were shown to learn generalized feature representations of cardiac sounds collected from different environments with varying amplitude and noise levels. The model evaluation criteria used were categorical accuracy, precision, recall, and AUROC, as the chosen dataset is unbalanced. The proposed approach has been implemented on datasets A and B of the PASCAL heart sound collection and achieves ~90% categorical accuracy and an AUROC of ~0.97 for both sets.
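A small sketch of the mel-spectrogram plus transfer-learning pipeline described above, using an ImageNet-pretrained MobileNetV2 from torchvision (version >= 0.13 assumed for the weights argument). The file path, sample rate, class count and preprocessing are placeholders, and input normalization is omitted for brevity.

```python
import librosa
import torch
import torch.nn as nn
import torchvision

# Turn a heart-sound recording (placeholder path) into a mel-scale
# spectrogram "image" with three identical channels.
y, sr = librosa.load("heartbeat.wav", sr=4000, duration=5.0)
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
img = torch.tensor(mel, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)

# Adapt an ImageNet-pretrained MobileNetV2 to a small number of heart-sound classes.
model = torchvision.models.mobilenet_v2(weights="DEFAULT")
model.classifier[1] = nn.Linear(model.last_channel, 3)  # placeholder class count

model.eval()
with torch.no_grad():
    logits = model(img.unsqueeze(0))
print(logits.shape)  # torch.Size([1, 3])
```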
3. eess.AS (Audio and Speech Processing):
【1】 Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition
作者:Samuel Kessler, Bethan Thomas, Salah Karout
Affiliation: University of Oxford; Huawei R&D Cambridge
Note: 11 pages, 9 figures including references and appendix. Accepted at ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Link: https://arxiv.org/abs/2107.13530
Abstract: We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL) and applying these for automatic speech recognition. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and finetuning on small annotated datasets is a promising direction for building speech recognition systems. Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data. SSL models have produced state-of-the-art results for ASR. However, these models are very expensive to pretrain with self-supervision. We tackle the problem of learning new language representations continually from audio without forgetting a previous language representation. We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining on a new language task. Our continual-wav2vec2 model can decrease pretraining times by 32% when learning a new language task, and learn this new audio-language representation without forgetting previous language representations.
【2】 Vowel-based Meeteilon dialect identification using a Random Forest classifier
作者:Thangjam Clarinda Devi, Kabita Thaoroijam
Affiliation: Department of Computer Science and Engineering, Indian Institute of Information Technology Manipur, Manipur, India
Note: 5 pages, double column, 8 figures, 1 table. Already presented as a poster at OCOCOSDA 2020 but not yet published
Link: https://arxiv.org/abs/2107.13419
Abstract: This paper presents a vowel-based dialect identification system for Meeteilon. For this work, a vowel dataset is created using the Meeteilon Speech Corpora available at the Linguistic Data Consortium for Indian Languages (LDC-IL). Spectral features such as formant frequencies (F1, F2 and F3) and prosodic features such as pitch (F0), energy, intensity and segment duration values are extracted from monophthong vowel sounds. A random forest classifier, a decision tree-based ensemble algorithm, is used for classification of three major dialects of Meeteilon, namely Imphal, Kakching and Sekmai. The model shows an average dialect identification accuracy of around 61.57%. Spectral and prosodic features are found to play a significant role in Meeteilon dialect classification.
【3】 Deep learning based cough detection camera using enhanced features
作者:Gyeong-Tae Lee, Hyeonuk Nam, Seong-Hu Kim, Sang-Min Choi, Youngkey Kim, Yong-Hwa Park
Affiliation: Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea; SM Instruments Inc., Daejeon, South Korea
Note: Submitted to Expert Systems With Applications
Link: https://arxiv.org/abs/2107.13260
Abstract: Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of cough sounds. The cough detection model is a binary classifier whose input is a two-second acoustic feature and whose output is one of two inferences (Cough or Others). Data augmentation was performed on the collected audio files to alleviate class imbalance and reflect various background noises in practical environments. For effective featuring of the cough sound, conventional features such as spectrograms, mel-scaled spectrograms, and mel-frequency cepstral coefficients (MFCC) were reinforced by utilizing their velocity (V) and acceleration (A) maps. VGGNet, GoogLeNet, and ResNet were simplified to binary classifiers and named V-net, G-net, and R-net, respectively. To find the best combination of features and networks, training was performed for a total of 39 cases and the performance was confirmed using the test F1 score. Finally, a test F1 score of 91.9% (test accuracy of 97.2%) was achieved by G-net with the MFCC-V-A feature (named Spectroflow), an acoustic feature effective for use in cough detection. The trained cough detection model was integrated with a sound camera (i.e., one that visualizes sound sources using a beamforming microphone array). In a pilot test, the cough detection camera detected coughing sounds with an F1 score of 90.0% (accuracy of 96.0%), and the cough location in the camera image was tracked in real time.
【4】 A Visual Domain Transfer Learning Approach for Heartbeat Sound Classification
作者:Uddipan Mukherjee, Sidharth Pancholi
Link: https://arxiv.org/abs/2107.13237
Abstract: Heart disease is the most common cause of human mortality, accounting for almost one-third of deaths throughout the world. Detecting the disease early increases a patient's chances of survival, and there are several ways the signs of heart disease can be detected early. This research proposes converting cleansed and normalized heart sounds into visual mel-scale spectrograms and then using visual domain transfer learning approaches to automatically extract features and categorize heart sounds. Some previous studies found that the spectrograms of various types of heart sounds are visually distinguishable to human eyes, which motivated this study to experiment with visual domain classification approaches for automated heart sound classification. Convolutional neural network-based architectures, e.g. ResNet and MobileNetV2, are used as automated feature extractors from the spectrograms. These well-accepted models from the image domain were shown to learn generalized feature representations of cardiac sounds collected from different environments with varying amplitude and noise levels. The model evaluation criteria used were categorical accuracy, precision, recall, and AUROC, as the chosen dataset is unbalanced. The proposed approach has been implemented on datasets A and B of the PASCAL heart sound collection and achieves ~90% categorical accuracy and an AUROC of ~0.97 for both sets.
【5】 On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-level Perceptual Features
作者:Shreyan Chowdhury, Gerhard Widmer
Affiliation: Institute of Computational Perception, Johannes Kepler University Linz, Austria; LIT AI Lab, Linz Institute of Technology, Austria
Note: In Proceedings of the 22nd International Society for Music Information Retrieval (ISMIR) Conference, Online, 2021
Link: https://arxiv.org/abs/2107.13231
Abstract: Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features for their effectiveness in predicting the arousal and valence of six different performances (by six famous pianists) of Bach's Well-Tempered Clavier Book 1. These features include low-level acoustic features, score-based features, features extracted using a pre-trained emotion model, and Mid-level perceptual features. We compare their predictive power by evaluating them on several experiments designed to test performance-wise or piece-wise variation of emotion. We find that Mid-level features contribute significantly to the performance-wise variation of both arousal and valence, even better than the pre-trained emotion model. Our findings add to the evidence that Mid-level perceptual features are an important representation of musical attributes for several tasks, and specifically, in this case, for capturing the expressive aspects of music that manifest as the perceived emotion of a musical performance.
【6】 Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
作者:Javier Naranjo-Alcazar, Sergi Perez-Castanos, Aaron Lopez-Garcia, Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri
Affiliation: Universitat de Valencia
Link: https://arxiv.org/abs/2107.13180
Abstract: The use of multiple, semantically correlated sources can provide complementary information that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks which are trained in isolation on audio and visual data, respectively, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the information from the audio and visual streams is fused at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method using the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains synchronized audio and video recordings from 12 European cities in 10 different scene classes. The proposed model provides an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters) in the evaluation results of the DCASE 2021 Challenge.
【7】 CycleGAN-based Non-parallel Speech Enhancement with an Adaptive Attention-in-attention Mechanism
作者:Guochen Yu, Yutian Wang, Chengshi Zheng, Hui Wang, Qin Zhang
Affiliation: State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Note: Submitted to APSIPA-ASC 2021
Link: https://arxiv.org/abs/2107.13143
Abstract: Non-parallel training is a difficult but essential task for DNN-based speech enhancement methods, owing to the lack of adequate noisy and paired clean speech corpora in many real scenarios. In this paper, we propose a novel adaptive attention-in-attention CycleGAN (AIA-CycleGAN) for non-parallel speech enhancement. In previous CycleGAN-based non-parallel speech enhancement methods, the limited mapping ability of the generator may cause performance degradation and insufficient feature learning. To alleviate this degradation, we propose an integration of adaptive time-frequency attention (ATFA) and adaptive hierarchical attention (AHA) to form an attention-in-attention (AIA) module for more flexible feature learning during the mapping procedure. More specifically, ATFA can capture long-range temporal-spectral contextual information for more effective feature representations, while AHA can flexibly aggregate different intermediate feature maps by weights depending on the global context. Extensive experimental results demonstrate that the proposed approach consistently outperforms previous GAN-based and CycleGAN-based methods in non-parallel training. Moreover, experiments in parallel training verify that the proposed AIA-CycleGAN also outperforms most advanced GAN-based speech enhancement approaches, especially in maintaining speech integrity and reducing speech distortion.