
自然语言处理学术速递[6.30]

公众号-arXiv每日学术速递
发布于 2021-07-02 16:24:18


cs.CL 方向,今日共计27篇

BERT(1篇)

【1】 Hate speech detection using static BERT embeddings 标题:基于静态BERT嵌入的仇恨言论检测

作者:Gaurav Rajput,Narinder Singh punn,Sanjay Kumar Sonbhadra,Sonali Agarwal 机构:Agarwal, Indian Institute of Information Technology, Allahabad, Uttar Pradesh , India 链接:https://arxiv.org/abs/2106.15537 摘要:随着社交媒体平台的日益普及,仇恨言论正在成为一个主要关注点,它表达针对特定群体特征(如性别、宗教或种族)的辱骂性言论,以传播暴力。以前人们用口头表达仇恨,但现在随着科技的发展,一些人故意利用社交媒体平台通过发帖、分享、评论等方式传播仇恨,无论是基督城清真寺枪击案还是西方针对亚洲人的仇恨犯罪,据观察,罪犯受到网上仇恨文字的影响很大。尽管人工智能系统已经到位,可以标记这些文本,但关键的挑战之一是降低误报率(将非仇恨标记为仇恨),以便这些系统能够在不损害言论自由的情况下检测仇恨言论。本文利用ETHOS仇恨语音检测数据集,通过将单词嵌入(fastText、GloVe或FT+GV)与静态BERT嵌入(BE)进行替换或集成,分析了仇恨语音检测分类器的性能。通过大量的实验研究发现,与使用FT、GV或FT+GV作为单词嵌入相比,使用静态BE的神经网络具有更好的性能。与微调的BERT相比,一个显著改善的指标是特异性。 摘要:With increasing popularity of social media platforms hate speech is emerging as a major concern, where it expresses abusive speech that targets specific group characteristics, such as gender, religion or ethnicity to spread violence. Earlier people use to verbally deliver hate speeches but now with the expansion of technology, some people are deliberately using social media platforms to spread hate by posting, sharing, commenting, etc. Whether it is Christchurch mosque shootings or hate crimes against Asians in west, it has been observed that the convicts are very much influenced from hate text present online. Even though AI systems are in place to flag such text but one of the key challenges is to reduce the false positive rate (marking non hate as hate), so that these systems can detect hate speech without undermining the freedom of expression. In this paper, we use ETHOS hate speech detection dataset and analyze the performance of hate speech detection classifier by replacing or integrating the word embeddings (fastText (FT), GloVe (GV) or FT + GV) with static BERT embeddings (BE). With the extensive experimental trails it is observed that the neural network performed better with static BE compared to using FT, GV or FT + GV as word embeddings. In comparison to fine-tuned BERT, one metric that significantly improved is specificity.
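下面给出一个极简示意代码(非论文官方实现):用预训练BERT输入端的词向量查表矩阵充当"静态BERT嵌入",对句子取平均后接一个简单分类器。论文中静态BE的具体构造方式、网络结构以及ETHOS数据的处理均以原文为准;示例中的样本数据与LogisticRegression分类器仅为演示假设。

```python
# 示意:以 BERT 输入端词向量矩阵作为静态嵌入,平均池化后训练一个简单分类器
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
emb_matrix = bert.get_input_embeddings().weight.detach()     # [vocab_size, 768],静态查表,不依赖上下文

def sentence_vec(text: str) -> torch.Tensor:
    ids = tokenizer(text, truncation=True, max_length=128)["input_ids"]
    return emb_matrix[torch.tensor(ids)].mean(dim=0)          # 对 wordpiece 向量取平均

texts = ["you are awesome", "I hate this group of people"]    # 假设的样例数据,实际应使用 ETHOS
labels = [0, 1]                                                # 0: 非仇恨, 1: 仇恨
X = torch.stack([sentence_vec(t) for t in texts]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)         # 论文使用神经网络,此处用 LR 代替
print(clf.predict(X))
```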

QA|VQA|问答|对话(1篇)

【1】 Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering 标题:BioASQ 2021综述:第九届BioASQ大规模生物医学语义索引与问答挑战

作者:Anastasios Nentidis,Georgios Katsimpras,Eirini Vandorou,Anastasia Krithara,Luis Gasco,Martin Krallinger,Georgios Paliouras 机构: National Center for Scientific Research “Demokritos”, Athens, Greece, Aristotle University of Thessaloniki, Thessaloniki, Greece, Barcelona Supercomputing Center, Barcelona, Spain 备注:25 pages, 15 tables, 3 figures. arXiv admin note: text overlap with arXiv:2106.14618 链接:https://arxiv.org/abs/2106.14885 摘要:推进大规模生物医学语义索引和问答技术的发展是BioASQ挑战的主要焦点。BioASQ组织了各自的任务,不同的团队开发了基于相同基准数据集的系统,这些数据集代表了生物医学领域专家的真实信息需求。本文以2021年评估论坛(CLEF)的会议和实验室为背景,对第九版BioASQ挑战进行了概述。今年,一项新的问答任务命名为Synergy,它的引入是为了支持研究COVID-19疾病的研究人员,并测量参与研究的团队在问题仍在发展时辨别信息的能力。共有42支拥有170多个系统的队伍报名参加挑战赛的4项任务。与前几年类似,评估结果显示,与基线相比,绩效有所提高,这表明该领域的最新技术水平不断提高。 摘要:Advancing the state-of-the-art in large-scale biomedical semantic indexing and question answering is the main focus of the BioASQ challenge. BioASQ organizes respective tasks where different teams develop systems that are evaluated on the same benchmark datasets that represent the real information needs of experts in the biomedical domain. This paper presents an overview of the ninth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2021. In this year, a new question answering task, named Synergy, is introduced to support researchers studying the COVID-19 disease and measure the ability of the participating teams to discern information while the problem is still developing. In total, 42 teams with more than 170 systems were registered to participate in the four tasks of the challenge. The evaluation results, similarly to previous years, show a performance gain against the baselines which indicates the continuous improvement of the state-of-the-art in this field.

机器翻译(3篇)

【1】 Rethinking the Evaluation of Neural Machine Translation 标题:对神经机器翻译评价的再思考

作者:Jianhao Yan,Chenming Wu,Fandong Meng,Jie Zhou 机构:WeChat AI, Tencent, China 备注:Submitted to NeurIPS 2021 链接:https://arxiv.org/abs/2106.15217 摘要:神经机器翻译系统的评估通常建立在特定解码方法(例如,波束搜索)的生成翻译的基础上,其评估指标超过生成翻译(例如,BLEU)。然而,该评估框架存在启发式搜索算法带来的高搜索错误,并且受到对一个最佳候选进行评估的性质的限制。本文提出了一种新的评估协议,不仅避免了搜索错误的影响,而且从模型排序的角度提供了系统级的评估。特别地,我们的方法是基于我们新提出的精确top-$k$解码而不是波束搜索。我们的方法评估模型错误的距离候选空间评分的参考和模型分别。在WMT'14英德版上的大量实验表明,糟糕的排名能力与众所周知的波束搜索诅咒有关,最先进的Transformer模型面临着严重的排名错误。通过评估各种模型架构和技术,我们提供了几个有趣的发现。最后,为了在与原始波束搜索相同的时间开销下有效地逼近精确搜索算法,我们提出了一种最小堆增强波束搜索算法。 摘要:The evaluation of neural machine translation systems is usually built upon generated translation of a certain decoding method (e.g., beam search) with evaluation metrics over the generated translation (e.g., BLEU). However, this evaluation framework suffers from high search errors brought by heuristic search algorithms and is limited by its nature of evaluation over one best candidate. In this paper, we propose a novel evaluation protocol, which not only avoids the effect of search errors but provides a system-level evaluation in the perspective of model ranking. In particular, our method is based on our newly proposed exact top-$k$ decoding instead of beam search. Our approach evaluates model errors by the distance between the candidate spaces scored by the references and the model respectively. Extensive experiments on WMT'14 English-German demonstrate that bad ranking ability is connected to the well-known beam search curse, and state-of-the-art Transformer models are facing serious ranking errors. By evaluating various model architectures and techniques, we provide several interesting findings. Finally, to effectively approximate the exact search algorithm with same time cost as original beam search, we present a minimum heap augmented beam search algorithm.
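针对文中提到的"最小堆增强波束搜索",下面给出一个与论文实现无关的示意草图:用大小为 k 的最小堆维护已完成假设,并以堆顶分数作为前缀剪枝的下界。打分函数、剪枝条件与玩具词表均为自行假设。

```python
# 示意:最小堆维护当前最优的 k 个完整假设,堆顶分数作为前缀剪枝的下界
import heapq
import math

def heap_augmented_beam_search(step_fn, bos, eos, beam_size=4, k=4, max_len=20):
    """step_fn(prefix) -> [(token, logprob), ...];返回得分最高的 k 个完整序列"""
    finished = []                                    # 最小堆:(score, seq)
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in step_fn(seq):
                candidates.append((score + lp, seq + [tok]))
        beams = []
        for score, seq in sorted(candidates, reverse=True)[:beam_size]:
            if seq[-1] == eos:
                heapq.heappush(finished, (score, seq))
                if len(finished) > k:
                    heapq.heappop(finished)           # 淘汰最差的完整假设
            elif len(finished) < k or score > finished[0][0]:
                beams.append((score, seq))            # 低于堆顶下界的前缀可直接剪掉
        if not beams:
            break
    return sorted(finished, reverse=True)

toy = lambda seq: [(0, math.log(0.5)), (1, math.log(0.3)), (2, math.log(0.2))]  # 玩具打分,0 为 eos
print(heap_augmented_beam_search(toy, bos=3, eos=0, beam_size=2, k=2, max_len=5))
```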

【2】 Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers 标题:机器翻译研究的科学可信度:769篇论文的元评价

作者:Benjamin Marie,Atsushi Fujita,Raphael Rubino 机构:National Institute of Information and Communications Technology, -, Hikaridai, Seika-cho, Soraku-gun, Kyoto,-, Japan 备注:Camera-ready for ACL2021 链接:https://arxiv.org/abs/2106.15195 摘要:本文提出了第一个大规模的机器翻译元评价方法。我们对2010年至2020年发表的769篇研究论文中的机器翻译评价进行了注释。我们的研究表明,机器翻译自动评价的实践在过去十年中发生了巨大的变化,并遵循了相关的趋势。越来越多的机器翻译评估完全依赖于BLEU分数之间的差异来得出结论,而没有进行任何统计显著性检验或人为评估,同时至少提出了108个声称比BLEU更好的指标。在最近的论文中,机器翻译评估倾向于复制和比较以前工作中的自动度量分数,以声称一种方法或算法的优越性,而没有确认使用了完全相同的训练、验证和测试数据,也没有确认度量分数的可比性。此外,报告标准化度量分数的工具还远远没有被机器翻译界广泛采用。在展示了这些缺陷的累积如何导致可疑的评估之后,我们提出了一个指导方针,鼓励更好的自动机器翻译评估以及一个简单的元评估评分方法来评估其可信度。 摘要:This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have dramatically changed during the past decade and follow concerning trends. An increasing number of MT evaluations exclusively rely on differences between BLEU scores to draw conclusions, without performing any kind of statistical significance testing nor human evaluation, while at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy and compare automatic metric scores from previous work to claim the superiority of a method or an algorithm without confirming neither exactly the same training, validating, and testing data have been used nor the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation along with a simple meta-evaluation scoring method to assess its credibility.
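论文呼吁在比较 BLEU 分数时进行统计显著性检验。下面是常用的配对自助重采样检验的示意实现(并非论文提供的工具;重采样次数、阈值与变量名均为演示假设):

```python
# 示意:对两套译文的 BLEU 差异做配对自助重采样显著性检验
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """sys_a/sys_b: 两个系统的译文列表;refs: 参考译文列表;返回近似 p 值"""
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]               # 句子级有放回重采样
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        wins += bleu_a > bleu_b
    return 1.0 - wins / n_samples                             # A 不优于 B 的比例,作为近似 p 值

# p = paired_bootstrap(hyps_a, hyps_b, references)
# 仅当 p 低于显著性水平(如 0.05)时,才宜声称系统 A 显著优于系统 B
```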

【3】 Neural Machine Translation for Low-Resource Languages: A Survey 标题:面向低资源语言的神经机器翻译研究综述

作者:Surangika Ranathunga,En-Shiun Annie Lee,Marjana Prifti Skenduli,Ravi Shekhar,Mehreen Alam,Rishemjit Kaur 机构: University of Moratuwa, University of Toronto, University of New York Tirana, Queen Mary University, National University of Computer and Emerging Sciences 备注:35 pages, 8 figures 链接:https://arxiv.org/abs/2106.15115 摘要:神经机器翻译(NMT)在不到十年的时间里经历了巨大的增长,已经进入了一个成熟阶段。虽然被认为是机器翻译中应用最广泛的解决方案,但由于大型并行语料库的不可用,它在低资源语言对上的性能仍然低于高资源语言对上的性能。因此,针对低资源语言对的NMT技术的实现成为近年来NMT研究领域的一个热点,并由此产生了大量的研究报道。本文对低资源语言NMT(LRL-NMT)的研究进展进行了详细的综述,并对最流行的解决方案进行了定量分析。基于我们在回顾以往工作的基础上得出的结论,本文提供了一套针对给定LRL数据环境选择NMT技术的指南。它还提出了LRL-NMT研究领域的整体观点,并提供了进一步加强LRL-NMT研究工作的建议清单。 摘要:Neural Machine Translation (NMT) has seen a tremendous spurt of growth in less than ten years, and has already entered a mature phase. While considered as the most widely used solution for Machine Translation, its performance on low-resource language pairs still remains sub-optimal compared to the high-resource counterparts, due to the unavailability of large parallel corpora. Therefore, the implementation of NMT techniques for low-resource language pairs has been receiving the spotlight in the recent NMT research arena, thus leading to a substantial amount of research reported on this topic. This paper presents a detailed survey of research advancements in low-resource language NMT (LRL-NMT), along with a quantitative analysis aimed at identifying the most popular solutions. Based on our findings from reviewing previous work, this survey paper provides a set of guidelines to select the possible NMT technique for a given LRL data setting. It also presents a holistic view of the LRL-NMT research landscape and provides a list of recommendations to further enhance the research efforts on LRL-NMT.

Graph|知识图谱|Knowledge(5篇)

【1】 Few-Shot Electronic Health Record Coding through Graph Contrastive Learning 标题:基于图对比学习的Few-Shot电子病历编码

作者:Shanshan Wang,Pengjie Ren,Zhumin Chen,Zhaochun Ren,Huasheng Liang,Qiang Yan,Evangelos Kanoulas,Maarten de Rijke 链接:https://arxiv.org/abs/2106.15467 摘要:电子健康记录(EHR)编码的任务是为每个EHR分配ICD代码。以往的研究大多只关注于频繁的ICD码,对罕见和频繁的ICD码的处理方法也不尽相同。这些方法对常见的ICD码有很好的性能,但由于ICD码的分布极不均衡,对少数ICD码的性能很不理想。我们试图通过使用基于对比图的EHR编码框架CoGraph来提高频繁和罕见ICD码的性能,CoGraph将EHR编码重新转换为一个小镜头学习任务。首先,我们为每个EHR构造一个异构EHR词实体(HEWE)图,其中从EHR中提取的词和实体作为节点,它们之间的关系作为边。然后,CoGraph从不同的ICD码中学习不同的图之间的相似性和不相似性,以便在它们之间传递信息。在少数镜头学习场景中,该模型在训练过程中只能访问频繁的ICD码,这可能会迫使它编码只对频繁的ICD码有用的特征。为了降低这种风险,CoGraph设计了两种图形对比学习方案GSCL和GECL,它们利用HEWE图形结构对可转换特征进行编码。GSCL利用HEWE图中不同子图的内部相关性,而GECL利用不同临床阶段HEWE图之间的相互相关性。在micimic-III基准数据集上的实验表明,CoGraph在EHR编码方面,不仅在频繁ICD码上,而且在稀有码上,在多个评价指标上都显著优于现有的编码方法。在频繁ICD码上,GSCL和GECL分别提高了1.31%和0.61%的分类精度和F1,在稀有ICD码上CoGraph分别提高了2.12%和2.95%。 摘要:Electronic health record (EHR) coding is the task of assigning ICD codes to each EHR. Most previous studies either only focus on the frequent ICD codes or treat rare and frequent ICD codes in the same way. These methods perform well on frequent ICD codes but due to the extremely unbalanced distribution of ICD codes, the performance on rare ones is far from satisfactory. We seek to improve the performance for both frequent and rare ICD codes by using a contrastive graph-based EHR coding framework, CoGraph, which re-casts EHR coding as a few-shot learning task. First, we construct a heterogeneous EHR word-entity (HEWE) graph for each EHR, where the words and entities extracted from an EHR serve as nodes and the relations between them serve as edges. Then, CoGraph learns similarities and dissimilarities between HEWE graphs from different ICD codes so that information can be transferred among them. In a few-shot learning scenario, the model only has access to frequent ICD codes during training, which might force it to encode features that are useful for frequent ICD codes only. To mitigate this risk, CoGraph devises two graph contrastive learning schemes, GSCL and GECL, that exploit the HEWE graph structures so as to encode transferable features. GSCL utilizes the intra-correlation of different sub-graphs sampled from HEWE graphs while GECL exploits the inter-correlation among HEWE graphs at different clinical stages. Experiments on the MIMIC-III benchmark dataset show that CoGraph significantly outperforms state-of-the-art methods on EHR coding, not only on frequent ICD codes, but also on rare codes, in terms of several evaluation indicators. On frequent ICD codes, GSCL and GECL improve the classification accuracy and F1 by 1.31% and 0.61%, respectively, and on rare ICD codes CoGraph has more obvious improvements by 2.12% and 2.95%.
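CoGraph 的 GSCL/GECL 属于图对比学习。下面用 PyTorch 给出对比学习中常见的 InfoNCE 损失草图,仅示意"同一图的两个视图互为正例、批内其余为负例"的思想;正负例的具体构造(子图内相关、不同临床阶段间相关)以原文为准,示例中的随机向量为假设输入。

```python
# 示意:InfoNCE 对比损失——同一 HEWE 图的两个视图互为正例,批内其余样本为负例
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """z1, z2: [batch, dim],分别为同一图两个视图的表示"""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # 两两余弦相似度
    targets = torch.arange(z1.size(0))            # 对角线位置为正例
    return F.cross_entropy(logits, targets)

z_a = torch.randn(8, 128)                         # 假设:8 个图视图的表示向量
z_b = torch.randn(8, 128)
print(info_nce(z_a, z_b).item())
```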

【2】 Leveraging Static Models for Link Prediction in Temporal Knowledge Graphs 标题:利用静态模型进行时态知识图中的链接预测

作者:Wessel Radstok,Mel Chekol 机构:nlUtrecht UniversityAbstract 链接:https://arxiv.org/abs/2106.15223 摘要:在知识图嵌入(KGE)中包含事实的时间范围为改进结果嵌入提供了重要的机会,从而提高了下游应用程序的性能。然而,很少有研究致力于这一领域,与没有时间范围的训练模型(静态模型)相比,许多已开展的研究报告只略微改善了结果。此外,他们没有利用静态模型的现有工作,而是引入了特定于时态知识图的新模型。我们提出了一种新的观点,通过集中精力处理数据来利用现有静态嵌入模型的能力。我们的方法SpliMe从信号处理领域和早期的图形嵌入工作中得到了启发。我们证明了SpliMe与时态KGE的最新技术相竞争或优于后者。此外,我们揭示了当前用于评估时态图上静态模型性能的过程中存在的问题,并介绍了两种方法来抵消这些问题。 摘要:The inclusion of temporal scopes of facts in knowledge graph embedding (KGE) presents significant opportunities for improving the resulting embeddings, and consequently for increased performance in downstream applications. Yet, little research effort has focussed on this area and much of the carried out research reports only marginally improved results compared to models trained without temporal scopes (static models). Furthermore, rather than leveraging existing work on static models, they introduce new models specific to temporal knowledge graphs. We propose a novel perspective that takes advantage of the power of existing static embedding models by focussing effort on manipulating the data instead. Our method, SpliMe, draws inspiration from the field of signal processing and early work in graph embedding. We show that SpliMe competes with or outperforms the current state of the art in temporal KGE. Additionally, we uncover issues with the procedure currently used to assess the performance of static models on temporal graphs and introduce two ways to counteract them.
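SpliMe 的核心是"改造数据而非改造模型"。下面是一个示意性的预处理草图:把带时间范围的事实改写成"关系@时间片"的静态三元组,再交给任意静态 KGE 模型训练;具体的 split/merge 策略、时间粒度与示例事实均为自行假设,并非论文原始实现。

```python
# 示意:把 (头实体, 关系, 尾实体, 起始年, 结束年) 拆成按时间片标注关系的静态三元组
from collections import namedtuple

Fact = namedtuple("Fact", "head relation tail start end")

def split_by_year(facts, bucket=1):
    """bucket 为时间片宽度(年);返回可直接喂给静态 KGE 模型的三元组列表"""
    triples = []
    for f in facts:
        for year in range(f.start, f.end + 1, bucket):
            triples.append((f.head, f"{f.relation}@{year}", f.tail))
    return triples

facts = [Fact("Obama", "presidentOf", "USA", 2009, 2017)]     # 假设的样例事实
for t in split_by_year(facts, bucket=4):
    print(t)    # ('Obama', 'presidentOf@2009', 'USA') 等
```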

【3】 Topic-to-Essay Generation with Comprehensive Knowledge Enhancement 标题:具有全面知识增强功能的主题到论文生成

作者:Zhiyue Liu,Jiahai Wang,Zhenghong Li 机构:School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China 备注:20 pages 链接:https://arxiv.org/abs/2106.15142 摘要:在自然语言生成过程中,生成高质量、多样化的主题文章是一项具有挑战性的任务。由于几个给定的主题只提供有限的源信息,利用各种主题相关的知识是提高论文生成性能的关键。然而,以往的工作不能充分利用这些知识来促进生成过程。本文旨在通过从内部知识和外部知识中提取信息来提高论文的生成效率。因此,本文提出了一个综合知识增强的主题到论文生成模型TEGKE。为了提高内部知识,主题和相关论文都作为源信息提供给教师网络。然后,从教师网络中获取信息特征,并将其转化为学生网络,学生网络只以主题作为输入,但提供与教师网络相似的信息。针对外部知识的增强,提出了一种主题知识图编码器。与以往只使用常识库中主题的最近邻不同,我们的主题知识图编码器可以利用常识知识图中更多的结构和语义信息来方便论文的生成。此外,还提出了基于Wasserstein距离的对抗性训练方法来提高生成质量。实验结果表明,TEGKE在自动评估和人工评估两方面都达到了最先进的水平。 摘要:Generating high-quality and diverse essays with a set of topics is a challenging task in natural language generation. Since several given topics only provide limited source information, utilizing various topic-related knowledge is essential for improving essay generation performance. However, previous works cannot sufficiently use that knowledge to facilitate the generation procedure. This paper aims to improve essay generation by extracting information from both internal and external knowledge. Thus, a topic-to-essay generation model with comprehensive knowledge enhancement, named TEGKE, is proposed. For internal knowledge enhancement, both topics and related essays are fed to a teacher network as source information. Then, informative features would be obtained from the teacher network and transferred to a student network which only takes topics as input but provides comparable information compared with the teacher network. For external knowledge enhancement, a topic knowledge graph encoder is proposed. Unlike the previous works only using the nearest neighbors of topics in the commonsense base, our topic knowledge graph encoder could exploit more structural and semantic information of the commonsense knowledge graph to facilitate essay generation. Moreover, the adversarial training based on the Wasserstein distance is proposed to improve generation quality. Experimental results demonstrate that TEGKE could achieve state-of-the-art performance on both automatic and human evaluation.

【4】 Time-Aware Language Models as Temporal Knowledge Bases 标题:时间感知语言模型作为时态知识库

作者:Bhuwan Dhingra,Jeremy R. Cole,Julian Martin Eisenschlos,Daniel Gillick,Jacob Eisenstein,William W. Cohen 机构:Google Research 链接:https://arxiv.org/abs/2106.15110 摘要:从总统的名字到勒布朗·詹姆斯效力的篮球队,许多事实都有一个有效期。但是,语言模型(LMs)是在特定时刻采集数据的快照上训练的,这会限制它们的实用性,特别是在封闭的书本环境中,训练前的语料库必须包含模型应该记住的事实。我们介绍了一个诊断数据集,旨在探测随时间变化的LMs事实知识,并突出LMs在光谱两端的问题——即在特定时间数据切片上训练的LMs,以及在广泛时间数据上训练的LMs。为了缓解这些问题,我们提出了一种简单的技术来联合建模文本及其时间戳。这提高了对训练时间段内已知事实的记忆,以及对未来时间段内未知事实预测的校准。我们还表明,使用时态上下文训练的模型可以在新数据到达时有效地“刷新”,而不需要从头开始重新训练。 摘要:Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. But language models (LMs) are trained on snapshots of data collected at a specific moment in time, and this can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum -- those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently ``refreshed'' as new data arrives, without the need for retraining from scratch.
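文中"联合建模文本及其时间戳"的做法,可以示意为在构造训练样本时给文本加上时间前缀(具体的拼接格式与特殊标记为自行假设,并非论文原始实现):

```python
# 示意:构造训练样本时把时间戳拼接到文本前面,让模型同时建模时间与内容
def add_time_prefix(example: dict) -> str:
    return f"year: {example['year']} text: {example['text']}"

corpus = [
    {"year": 2017, "text": "The president of the USA is Donald Trump."},
    {"year": 2021, "text": "The president of the USA is Joe Biden."},
]
train_texts = [add_time_prefix(ex) for ex in corpus]
print(train_texts[0])
# 推断时同样带上目标年份前缀,例如:
# "year: 2021 text: The president of the USA is [MASK]."
```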

【5】 Automatic Construction of Enterprise Knowledge Base 标题:企业知识库的自动构建

作者:Junyi Chai,Yujie He,Homa Hashemi,Bing Li,Daraksha Parveen,Ranganath Kondapally,Wenjin Xu 机构:Microsoft Corporation 链接:https://arxiv.org/abs/2106.15085 摘要:在本文中,我们提出了一个自动化的知识库建设系统,从大型企业的文件与最小的努力,人为干预。在设计和部署这样一个面向企业的知识挖掘系统时,我们面临着数据分布转移、性能评估、法规遵从性要求等实际问题。我们利用最先进的深度学习模型在每个文档级别提取信息(命名实体和定义),然后进一步应用经典机器学习技术处理全局统计信息以改进知识库。实验结果在实际的企业文档中报告。此系统当前是Microsoft 365服务的一部分。 摘要:In this paper, we present an automatic knowledge base construction system from large scale enterprise documents with minimal efforts of human intervention. In the design and deployment of such a knowledge mining system for enterprise, we faced several challenges including data distributional shift, performance evaluation, compliance requirements and other practical issues. We leveraged state-of-the-art deep learning models to extract information (named entities and definitions) at per document level, then further applied classical machine learning techniques to process global statistical information to improve the knowledge base. Experimental results are reported on actual enterprise documents. This system is currently serving as part of a Microsoft 365 service.

摘要|信息提取(1篇)

【1】 Topic Modeling Based Extractive Text Summarization 标题:基于主题建模的抽取文本摘要

作者:Kalliath Abdul Rasheed Issam,Shivam Patel,Subalalitha C. N 机构:Engineering, SRM Institute of Science and Technology, Kattankulathur, Institute of Science and Technology, Kattankulathur, Chennai, India. Email:, SRM Institute of Science and Technology, Kattankulathur, Chennai, India. 备注:None 链接:https://arxiv.org/abs/2106.15313 摘要:文本摘要是一种识别文本文档中重要信息的方法。这种计算技术的目的是通过只包含源文本中存在的相关和显著信息来生成源文本的较短版本。在本文中,我们提出了一种新的方法来总结文本文档,方法是基于主题建模技术产生的潜在主题对文本文档内容进行聚类,并为每个已识别的文本聚类生成摘要。所有提取的子摘要稍后被合并以生成任何给定源文档的摘要。我们利用较少使用和具有挑战性的WikiHow数据集进行文本摘要。此数据集不同于可用于文本摘要的常用新闻数据集。众所周知的新闻数据集在其源文本的前几行中呈现其最重要的信息,这使得它们的摘要与总结WikiHow数据集相比不那么具有挑战性。与这些新闻数据集相反,WikiHow数据集中的文档是使用通用方法编写的,具有较少的抽象性和较高的压缩比,因此对生成摘要提出了更大的挑战。当前许多最先进的文本摘要技术都倾向于为了简洁而消除源文档中的重要信息。我们提出的技术旨在捕获源文档中存在的所有不同信息。尽管数据集被证明具有挑战性,但在我们的实验装置中进行了广泛的测试之后,我们发现,与其他已发表的抽取和抽象文本摘要模型相比,我们的模型产生了令人鼓舞的结果和摘要。 摘要:Text summarization is an approach for identifying important information present within text documents. This computational technique aims to generate shorter versions of the source text, by including only the relevant and salient information present within the source text. In this paper, we propose a novel method to summarize a text document by clustering its contents based on latent topics produced using topic modeling techniques and by generating extractive summaries for each of the identified text clusters. All extractive sub-summaries are later combined to generate a summary for any given source document. We utilize the lesser used and challenging WikiHow dataset in our approach to text summarization. This dataset is unlike the commonly used news datasets which are available for text summarization. The well-known news datasets present their most important information in the first few lines of their source texts, which make their summarization a lesser challenging task when compared to summarizing the WikiHow dataset. Contrary to these news datasets, the documents in the WikiHow dataset are written using a generalized approach and have lesser abstractedness and higher compression ratio, thus proposing a greater challenge to generate summaries. A lot of the current state-of-the-art text summarization techniques tend to eliminate important information present in source documents in the favor of brevity. Our proposed technique aims to capture all the varied information present in source documents. Although the dataset proved challenging, after performing extensive tests within our experimental setup, we have discovered that our model produces encouraging ROUGE results and summaries when compared to the other published extractive and abstractive text summarization models.
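下面用 scikit-learn 给出"主题建模 + 按主题抽取代表句"这一流程的极简示意(非论文实现):LDA 得到句子的主题分布,按主题分簇后每簇取概率最高的句子拼成摘要;句子切分、主题数与样例句子均为演示假设。

```python
# 示意:LDA 得到句子主题分布 -> 按主题分簇 -> 每簇取主题概率最高的句子
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

def topic_extractive_summary(sentences, n_topics=2):
    X = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(X)                          # [句子数, 主题数]
    summary = []
    for k in range(n_topics):
        members = np.where(doc_topic.argmax(axis=1) == k)[0]
        if len(members) > 0:
            best = members[doc_topic[members, k].argmax()]    # 该主题下最具代表性的句子
            summary.append(sentences[best])
    return " ".join(summary)

sents = ["The cat sat on the mat and purred loudly.",
         "Dogs love playing fetch in the park.",
         "Stock prices fell sharply after the earnings report.",
         "Investors worried about rising interest rates."]
print(topic_extractive_summary(sents, n_topics=2))
```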

推理|分析|理解|解释(3篇)

【1】 On the Interaction of Belief Bias and Explanations 标题:论信念偏差与解释的交互作用

作者:Ana Valeria Gonzalez,Anna Rogers,Anders Søgaard 机构:University of Copenhagen, Department of Computer Science, Copenhagen Centre for Social Data Science 备注:accepted at findings of ACL 2021 链接:https://arxiv.org/abs/2106.15355 摘要:近年来,人们提出了大量的解释性方法,但对如何评价这些方法却没有达成共识。虽然自动度量允许快速进行基准测试,但尚不清楚此类度量如何通过解释反映人与人之间的交互。人的评估是最重要的,但以前的协议没有考虑到信念偏见影响人的表现,这可能导致误导性的结论。我们提供了一个信念偏见的概述,它在人类评估中的作用,以及NLP从业者如何解释它的想法。对于两种实验范式,我们提出了一个基于梯度的解释性案例研究,介绍了解释人类先前信念的简单方法:不同质量的模型和对抗性的例子。我们的结论表明,有关最高性能的方法改变时,引入这样的控制,指出会计的重要性,信念偏差的评估。 摘要:A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans' prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.

【2】 Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis 标题:自动生成的反事实在情感分析中的有效性探讨

作者:Yang Linyi,Li Jiazheng,Cunningham Pádraig,Zhang Yue,Smyth Barry,Dong Ruihai 机构: University College Dublin 2 School of Computer Science, University College Dublin 3 School of Engineering, Westlake University 4 Institute of Advanced Technology, Westlake Institute for Advanced Study{linyi 备注:ACL-21, Main Conference, Long Paper 链接:https://arxiv.org/abs/2106.15231 摘要:近年来,虽然最先进的NLP模型在许多任务中都取得了优异的性能,但人们对其鲁棒性以及对训练和测试数据中可能存在的系统偏差的潜在敏感性提出了重要的问题。当在现场面对分布外的数据时,这些问题表现为性能问题。最近的一个解决方案是使用反事实增强的数据集,以减少对原始数据中可能存在的虚假模式的依赖。生成高质量的增强数据可能既昂贵又耗时,因为它通常需要人工反馈和众包工作。在这项工作中,我们提出了一个替代方案,通过描述和评估一种自动生成反事实数据的方法,用于数据扩充和解释。通过对多个不同数据集的综合评估,并使用各种最先进的基准,说明了我们的方法如何在模型性能方面取得显著改进,这一改进与基于原始数据的模型训练相比,甚至与利用人工生成的增强数据训练的模型相比。 摘要:While state-of-the-art NLP models have been achieving the excellent performance of a wide range of tasks in recent years, important questions are being raised about their robustness and their underlying sensitivity to systematic biases that may exist in their training and test data. Such issues come to be manifest in performance problems when faced with out-of-distribution data in the field. One recent solution has been to use counterfactually augmented datasets in order to reduce any reliance on spurious patterns that may exist in the original data. Producing high-quality augmented data can be costly and time-consuming as it usually needs to involve human feedback and crowdsourcing efforts. In this work, we propose an alternative by describing and evaluating an approach to automatically generating counterfactual data for data augmentation and explanation. A comprehensive evaluation on several different datasets and using a variety of state-of-the-art benchmarks demonstrate how our approach can achieve significant improvements in model performance when compared to models training on the original data and even when compared to models trained with the benefit of human-generated augmented data.

【3】 Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding 标题:对可分解任务端到端评价的再思考--以口语理解为例

作者:Siddhant Arora,Alissa Ostapenko,Vijay Viswanathan,Siddharth Dalmia,Florian Metze,Shinji Watanabe,Alan W Black 机构:Language Technologies Institute, Carnegie Mellon University, USA 备注:INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.15065 摘要:可分解任务是复杂的,由一系列子任务组成。例如,口语意图预测结合了自动语音识别和自然语言理解。然而,现有的基准通常只提供表面级子任务的示例。因此,在这些基准上具有相似性能的模型在其他子任务上可能存在未观察到的性能差异。为了在竞争性的端到端架构之间进行有见地的比较,我们提出了一个框架来构造健壮的测试集,该框架使用子任务特定效用函数上的坐标上升。给定一个可分解任务的数据集,我们的方法为每个子任务创建一个测试集,以单独评估端到端模型的子组件。以口语理解为例,我们为Fluent语音命令和Snips-SmartLights数据集生成新的split。每一组都有两个测试集:一个测试集测试被试的自然语言理解能力,另一个测试集测试被试的语言处理能力。我们的拆分确定了在原始测试集上彼此相差不超过1%的端到端系统之间高达10%的性能差距。这些性能差距允许在不同体系结构之间进行更现实和可操作的比较,从而推动未来的模型开发。我们为社区发布分裂和工具。 摘要:Decomposable tasks are complex and comprise of a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.

GAN|对抗|攻击|生成相关(2篇)

【1】 Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation 标题:不要从字面上理解:文本生成的编辑不变序列损失

作者:Guangyi Liu,Zichao Yang,Tianhua Tao,Xiaodan Liang,Zhen Li,Bowen Zhou,Shuguang Cui,Zhiting Hu 机构:Chinese University of Hong Kong, Shenzhen, Carnegie Mellon University, Tsinghua University, Sun Yat-Sen University, JD AI Research, UC San Diego 备注:10 pages, 5 figures 链接:https://arxiv.org/abs/2106.15078 摘要:神经文本生成模型通常通过序列交叉熵损失最大化对数似然来训练,这鼓励了目标序列和生成序列之间的精确逐标记匹配。当目标序列不完美时,例如当目标序列被噪声破坏时,或者当只有弱序列监督可用时,这种训练目标是次优的。为了解决这一问题,我们提出了一种新的编辑不变序列丢失(EISL),它计算目标n-gram与生成序列中所有n-gram的匹配丢失。EISL从卷积网络(ConvNets)中获得灵感,卷积网络对图像具有平移不变性,因此对n-gram的平移具有鲁棒性,能够容忍对目标序列的编辑。此外,EISL的计算本质上是以目标n-gram为核的卷积运算,易于在现有库中实现。为了验证EISL的有效性,我们在三个任务上进行了实验:含噪声目标序列的机器翻译、无监督文本风格转换和非自回归机器翻译。实验结果表明,在这三个任务上,我们的方法都明显优于交叉熵损失法。 摘要:Neural text generation models are typically trained by maximizing log-likelihood with the sequence cross entropy loss, which encourages an exact token-by-token match between a target sequence with a generated sequence. Such training objective is sub-optimal when the target sequence not perfect, e.g., when the target sequence is corrupted with noises, or when only weak sequence supervision is available. To address this challenge, we propose a novel Edit-Invariant Sequence Loss (EISL), which computes the matching loss of a target n-gram with all n-grams in the generated sequence. EISL draws inspirations from convolutional networks (ConvNets) which are shift-invariant to images, hence is robust to the shift of n-grams to tolerate edits in the target sequences. Moreover, the computation of EISL is essentially a convolution operation with target n-grams as kernels, which is easy to implement with existing libraries. To demonstrate the effectiveness of EISL, we conduct experiments on three tasks: machine translation with noisy target sequences, unsupervised text style transfer, and non-autoregressive machine translation. Experimental results show our method significantly outperforms cross entropy loss on these three tasks.
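EISL 本质上是把目标 n-gram 当作卷积核在生成序列上滑动匹配。下面是这一思想的极简 PyTorch 草图(用显式循环代替真正的卷积实现,并以 logsumexp 聚合各位置得分;与论文的具体实现和归一化方式可能不同):

```python
# 示意:目标 n-gram 在生成序列各位置上滑动匹配,取 logsumexp 作为该 n-gram 的匹配得分
import torch

def eisl_loss(logp, target, n=2):
    """logp: [T_gen, V] 生成端每个位置的对数概率;target: [T_tgt] 目标 token id"""
    ngrams = target.unfold(0, n, 1)                        # [N_gram, n]
    scores = []
    for g in ngrams:                                       # 每个目标 n-gram
        pos_scores = []
        for j in range(logp.size(0) - n + 1):              # 在生成序列上滑动窗口
            pos_scores.append(sum(logp[j + i, g[i]] for i in range(n)))
        scores.append(torch.logsumexp(torch.stack(pos_scores), dim=0))
    return -torch.stack(scores).mean()                     # 负的平均匹配得分作为损失

logp = torch.log_softmax(torch.randn(10, 100), dim=-1)     # 假设:生成长度 10、词表 100
target = torch.randint(0, 100, (8,))
print(eisl_loss(logp, target, n=2).item())
```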

【2】 GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis 标题:GANSpeech:高保真多说话人语音合成的对抗性训练

作者:Jinhyeok Yang,Jae-Sung Bae,Taejun Bak,Youngik Kim,Hoon-Young Cho 机构:Speech AI Lab, NCSOFT, Republic of Korea 备注:Accepted to INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.15153 摘要:神经网络多说话人文本到语音(TTS)模型的最新进展使得用单一模型生成相当好的语音质量成为可能,并且使得用有限的训练数据合成说话人的语音成为可能。利用多说话人模型对目标说话人数据进行微调可以获得更好的语音质量,但与实际语音样本相比仍存在差距,且模型依赖于说话人。在这项工作中,我们提出了GANSpeech,这是一个高保真的多说话人TTS模型,它采用了非自回归多说话人TTS模型的对抗性训练方法。此外,本文还提出了一种简单而有效的对抗性训练中特征匹配丢失的自动缩放方法。在主观听力测试中,GANSpeech显著优于基线多说话人FastSpeech和FastSpeech2模型,并且显示出比特定说话人微调FastSpeech2更好的MOS分数。 摘要:Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning to the target speaker data with the multi-speaker model can achieve better quality, however, there still exists a gap compared to the real speech sample and the model depends on the speaker. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker TTS model that adopts the adversarial training method to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for feature matching loss used in adversarial training. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models, and showed a better MOS score than the speaker-specific fine-tuned FastSpeech2.

半/弱/无监督|不确定性(1篇)

【1】 Unsupervised Technique To Conversational Machine Reading 标题:无监督技术在对话式机器阅读中的应用

作者:Peter Ochieng,Dennis Mugambi 机构:Department of Computing and Informatics, Taita Taveta University, P.O. Box , – , Voi,Kenya., Dennis Kaburu, Jomo Kenyatta University of Agriculture and Technology, P.O. Box , CITY SQUARE,Nairobi ,Kenya. 链接:https://arxiv.org/abs/2106.15247 摘要:会话式机器阅读(CMR)工具近年来发展迅速。现有的工具依赖于有监督学习技术,需要有标记的数据集进行训练。监督技术需要为每个新的规则文本创建一个手动标记的数据集。这是乏味和容易出错的。本文介绍并论证了无监督学习技术在CMR开发中的应用。具体来说,我们演示了如何无监督学习可以用于规则提取和蕴涵模块的CMR。与目前最好的CMR工具相比,我们开发的框架报告微观平均精度提高了3.3%,宏观平均精度提高了1.4%。 摘要:Conversational machine reading (CMR) tools have seen a rapid progress in the recent past. The current existing tools rely on the supervised learning technique which require labeled dataset for their training. The supervised technique necessitates that for every new rule text, a manually labeled dataset must be created. This is tedious and error prone. This paper introduces and demonstrates how unsupervised learning technique can be applied in the development of CMR. Specifically, we demonstrate how unsupervised learning can be used in rule extraction and entailment modules of CMR. Compared to the current best CMR tool, our developed framework reports 3.3% improvement in micro averaged accuracy and 1.4 % improvement in macro averaged accuracy.

识别/分类(4篇)

【1】 Classification of Consumer Belief Statements From Social Media 标题:社交媒体中消费者信念声明的分类

作者:Gerhard Hagerer,Wenbin Le,Hannah Danner,Georg Groh 机构:Social Computing Research Group, Technical University of Munich, Chair of Marketing and Consumer Research, Technical University of Munich 链接:https://arxiv.org/abs/2106.15498 摘要:社交媒体提供大量的信息进行市场调查,以满足客户的需求。进行这项研究的一种方法是,领域专家将用户生成的内容收集并分类为复杂的细粒度类结构。在许多这样的情况下,很少的数据会遇到复杂的注释。目前还不完全了解如何将其成功地用于分类。我们检验了专家标签与a)许多细粒度类和b)少数抽象类一起使用时的分类精度。对于场景b),我们比较了领域专家给出的抽象类标签作为基线和自动分层聚类。我们将此与另一个基线进行比较,其中整个类结构是由完全无监督的聚类方法给出的。通过这样做,这项工作可以作为一个例子,说明如何复杂的专家注释是潜在的有益的,并可以利用在高度特定领域的意见挖掘的最佳方式。通过对一系列技术和实验的探索,我们发现自动类抽象方法,特别是无监督方法,在文本分类任务上比领域专家基线表现得非常好。这有可能激发意见挖掘应用程序,以便在实践中支持市场研究人员,并激发大规模的细粒度自动内容分析。 摘要:Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.
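文中比较了领域专家给出的抽象分组与自动层次聚类得到的抽象类。下面是"对细粒度类的文本表示做层次聚类以得到抽象类"的示意草图(以 TF-IDF 质心作为类表示、簇数与样例数据均为演示假设):

```python
# 示意:对细粒度类的 TF-IDF 质心做层次聚类,自动归并出抽象类
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

fine_grained = {                                              # 假设:细粒度类 -> 该类的样本文本
    "taste_good": ["tastes great", "delicious flavor"],
    "taste_bad": ["bland and boring taste"],
    "price_high": ["far too expensive"],
    "price_fair": ["good value for money"],
}
vec = TfidfVectorizer().fit([t for texts in fine_grained.values() for t in texts])
centroids = np.vstack([np.asarray(vec.transform(texts).mean(axis=0))
                       for texts in fine_grained.values()])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
for cls, abstract_id in zip(fine_grained, labels):
    print(cls, "-> 抽象类", abstract_id)
```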

【2】 Representation based meta-learning for few-shot spoken intent recognition 标题:基于表征的元学习在Few-Shot口语意图识别中的应用

作者:Ashish Mittal,Samarth Bharadwaj,Shreya Khare,Saneem Chemmengath,Karthik Sankaranarayanan,Brian Kingsbury 机构:IBM Research AI 备注:Accepted paper at Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October, 2020 链接:https://arxiv.org/abs/2106.15238 摘要:语音意图检测已经成为一种流行的方法,可以方便地与各种智能设备进行接口。然而,这样的系统仅限于意图术语或命令的预设列表,这限制了个人设备对新意图的快速定制。本文提出了一种基于元学习范式的具有任务不可知表征的多镜头口语意图分类方法。具体来说,我们利用流行的基于表征的元学习来构建一个任务无关的话语表征,然后使用线性分类器进行预测。我们在两个流行的语音意图分类数据集:Google命令集和Fluent语音命令集上开发了一个新的实验协议,在此基础上对三种方法进行了评估。对于新类的5次(1次)分类,该框架在googlecommands数据集上的平均分类准确率为88.6%(76.3%),在Fluent Speech Commands数据集上的平均分类准确率为78.5%(64.2%)。其性能与传统的有监督分类模型相当,且训练样本丰富。 摘要:Spoken intent detection has become a popular approach to interface with various smart devices with ease. However, such systems are limited to the preset list of intents-terms or commands, which restricts the quick customization of personal devices to new intents. This paper presents a few-shot spoken intent classification approach with task-agnostic representations via meta-learning paradigm. Specifically, we leverage the popular representation-based meta-learning learning to build a task-agnostic representation of utterances, that then use a linear classifier for prediction. We evaluate three such approaches on our novel experimental protocol developed on two popular spoken intent classification datasets: Google Commands and the Fluent Speech Commands dataset. For a 5-shot (1-shot) classification of novel classes, the proposed framework provides an average classification accuracy of 88.6% (76.3%) on the Google Commands dataset, and 78.5% (64.2%) on the Fluent Speech Commands dataset. The performance is comparable to traditionally supervised classification models with abundant training samples.
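基于表征的元学习(原型网络)的核心步骤可示意如下:支持集按类求原型向量,查询样本按最近原型分类。话语编码器在此用随机向量代替,仅为演示;真实系统应替换为论文中的话语/语音编码器,分类器结构也以原文为准。

```python
# 示意:原型网络分类——支持集按类求原型,查询样本按最近原型归类
import torch

def prototypical_predict(support_emb, support_labels, query_emb):
    classes = sorted(set(support_labels.tolist()))
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query_emb, protos)          # 欧氏距离,[查询数, 类别数]
    return torch.tensor(classes)[dists.argmin(dim=1)]

# 5-way 1-shot 的玩具例子:嵌入用随机向量代替真实编码器输出
support_emb = torch.randn(5, 64)                    # 每类 1 个支持样本的嵌入
support_labels = torch.arange(5)
query_emb = support_emb + 0.05 * torch.randn(5, 64)
print(prototypical_predict(support_emb, support_labels, query_emb))   # 期望约为 0..4
```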

【3】 New Arabic Medical Dataset for Diseases Classification 标题:用于疾病分类的新的阿拉伯医学数据集

作者:Jaafar Hammoud,Aleksandra Vatian,Natalia Dobrenko,Nikolai Vedernikov,Anatoly Shalyto,Natalia Gusarova 机构: ITMO University, Kronverksky pr. , Saint Petersburg, Russia 链接:https://arxiv.org/abs/2106.15236 摘要:阿拉伯语的数据集非常缺乏,适合训练深度学习模型,现有的数据集包括一般的非专业分类。在这项工作中,我们介绍了一个新的阿拉伯医学数据集,其中包括2000份医疗文件收集自几个阿拉伯医学网站,除了阿拉伯医学百科全书。该数据集是为文本分类任务而建立的,包括10类(血液、骨骼、心血管、耳、内分泌、眼、胃肠、免疫、肝脏和肾脏)疾病。通过对Google的BERT、基于大型阿拉伯语语料库BERT的Arabert和基于阿拉伯语医学语料库Arabert的AraBioNER三个预训练模型进行微调,对数据集进行实验。 摘要:The Arabic language suffers from a great shortage of datasets suitable for training deep learning models, and the existing ones include general non-specialized classifications. In this work, we introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites, in addition to the Arab Medical Encyclopedia. The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological) diseases. Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.

【4】 Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition 标题:面向少样本命名实体识别的其他类混杂词学习

作者:Meihan Tong,Shuai Wang,Bin Xu,Yixin Cao,Minghui Liu,Lei Hou,Juanzi Li 机构:Knowledge Engineering Laboratory, Tsinghua University, Beijing, China, SLP Group, AI Technology Department, JOYY Inc, China, S-Lab Nanyang Technological University, Singapore 链接:https://arxiv.org/abs/2106.15167 摘要:少数镜头命名实体识别(NER)只利用少数注释来识别和分类命名实体。典型网络在很少的时间内表现出优越的性能。然而,现有的原型方法无法区分其他类词中丰富的语义,这将加剧在Few-Shot场景下的过度拟合。为了解决这个问题,我们提出了一个新的模型,从其他类中挖掘未定义类(MUCO),该模型可以自动从其他类中归纳出不同的未定义类,从而提高系统的稳定性。利用这些额外的标记未定义类,我们的方法将提高NER分类器的识别能力,增强对具有备用语义知识的预定义类的理解。实验结果表明,在四个NER基准上,我们的模型在1拍和5拍两种情况下都优于五种最先进的模型。我们将在验收后发布代码。源代码发布在https://github.com/shuaiwa16/OtherClassNER.git上。 摘要:Few-shot Named Entity Recognition (NER) exploits only a handful of annotations to identify and classify named entity mentions. Prototypical network shows superior performance on few-shot NER. However, existing prototypical methods fail to differentiate rich semantics in other-class words, which will aggravate overfitting under few shot scenario. To address the issue, we propose a novel model, Mining Undefined Classes from Other-class (MUCO), that can automatically induce different undefined classes from the other class to improve few-shot NER. With these extra-labeled undefined classes, our method will improve the discriminative ability of NER classifier and enhance the understanding of predefined classes with stand-by semantic knowledge. Experimental results demonstrate that our model outperforms five state-of-the-art models in both 1-shot and 5-shots settings on four NER benchmarks. We will release the code upon acceptance. The source code is released on https: //github.com/shuaiwa16/OtherClassNER.git.

Word2Vec|文本|单词(2篇)

【1】 Language Lexicons for Hindi-English Multilingual Text Processing 标题:印英多语种文本处理的语言词典

作者:Mohd Zeeshan Ansari,Tanvir Ahmad,Noaima Bari 机构:Department of Computer Engineering, Jamia Millia Islamia, India., Department of Electrical Engineering, Jamia Millia Islamia, India. 链接:https://arxiv.org/abs/2106.15105 摘要:文本文档中的语言识别是根据文档内容自动检测文档中所包含语言的过程。现有的语言识别技术假定一个文档包含一组固定语言中的文本,然而,当处理包含一种以上可能语言的内容的多语言文档时,这种假定是不正确的。由于印地语-英语混合语言处理任务缺乏大型标准语料库,本文提出了一种支持多种语言处理任务的新型词汇库语言词典。这些词汇是通过学习拼音印地语和英语词汇的量词而建立起来的。所设计的词典比其主要语料来源具有更丰富的数量特征,而这些语料来源是通过可视化技术揭示出来的。 摘要:Language Identification in textual documents is the process of automatically detecting the language contained in a document based on its content. The present Language Identification techniques presume that a document contains text in one of the fixed set of languages, however, this presumption is incorrect when dealing with multilingual document which includes content in more than one possible language. Due to the unavailability of large standard corpora for Hindi-English mixed lingual language processing tasks we propose the language lexicons, a novel kind of lexical database that supports several multilingual language processing tasks. These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary. The designed lexicons possess richer quantitative characteristic than its primary source of collection which is revealed using the visualization techniques.

【2】 A Simple and Efficient Probabilistic Language model for Code-Mixed Text 标题:一种简单高效的代码混合文本概率语言模型

作者:M Zeeshan Ansari,Tanvir Ahmad,M M Sufyan Beg,Asma Ikram 机构: Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India, Department of Computer Engineering, Aligarh Muslim University, Aligarh, India 链接:https://arxiv.org/abs/2106.15102 摘要:传统的自然语言处理方法由于话语的口语化和非同质性,不适应社会化媒体文本。值得注意的是,在信息检索、命名实体识别、关系抽取等信息抽取应用中,多语种文档中的语言标识被确定为前面的子任务,在代码混合的文档中,这个问题通常更具挑战性,在这种文档中,在构建文本时,外语单词被抽取到基础语言中。单词嵌入是表示文本文档的强大语言建模工具,有助于获得单词或文档之间的相似性。我们提出了一种简单的概率方法来为代码混合文本构建有效的单词嵌入,并以从Twitter中删除的印地语英语短测试消息的语言识别为例进行了说明。我们使用双向LSTMs和SVMs检验了它在分类任务中的有效性,并观察了它在现有各种代码混合嵌入中的改进效果 摘要:The conventional natural language processing approaches are not accustomed to the social media text due to colloquial discourse and non-homogeneous characteristics. Significantly, the language identification in a multilingual document is ascertained to be a preceding subtask in several information extraction applications such as information retrieval, named entity recognition, relation extraction, etc. The problem is often more challenging in code-mixed documents wherein foreign languages words are drawn into base language while framing the text. The word embeddings are powerful language modeling tools for representation of text documents useful in obtaining similarity between words or documents. We present a simple probabilistic approach for building efficient word embedding for code-mixed text and exemplifying it over language identification of Hindi-English short test messages scrapped from Twitter. We examine its efficacy for the classification task using bidirectional LSTMs and SVMs and observe its improved scores over various existing code-mixed embeddings
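作为词级语言识别这一任务的一个示意(并非论文提出的概率词嵌入方法本身),下面用字符 bigram 加一平滑的对数似然区分罗马化印地语与英语词;训练词表与平滑方式均为演示假设。

```python
# 示意:字符 bigram + 加一平滑的对数似然,对混合文本做词级语言识别
import math
from collections import Counter

def char_ngrams(word, n=2):
    w = f"<{word.lower()}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def train(words):
    counts = Counter(g for w in words for g in char_ngrams(w))
    return counts, sum(counts.values())

def score(word, model):
    counts, total = model
    return sum(math.log((counts[g] + 1) / (total + len(counts))) for g in char_ngrams(word))

en = train(["the", "price", "awesome", "today", "weather", "very", "good"])
hi = train(["kya", "bahut", "accha", "nahi", "mausam", "aaj", "hai"])   # 罗马化印地语示例词
for w in "aaj weather bahut accha hai".split():
    print(w, "->", "en" if score(w, en) > score(w, hi) else "hi")
```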

其他神经网络|深度学习|模型|建模(1篇)

【1】 Differential Privacy for Credit Risk Model 标题:面向信用风险模型的差分隐私

作者:Tabish Maniar,Alekhya Akkinepally,Anantha Sharma 机构:Synechron Innovation Lab 备注:7 pages, 3 figures, 2 tables 链接:https://arxiv.org/abs/2106.15343 摘要:使用机器学习算法来模拟用户行为和驱动业务决策已经变得越来越普遍,特别是为自动化决策提供智能建议。这导致越来越多地使用客户的个人数据来分析客户行为和预测他们对公司产品的兴趣。越来越多地使用这些客户个人数据可以带来更好的模型,但也可能导致客户数据被泄露、逆向工程和错误处理。在本文中,我们评估差异隐私作为解决这些隐私问题的解决方案,将隐私保护纳入预测模型开发的数据工程和模型训练阶段。我们感兴趣的是在操作环境中的实用实现,这需要一个通用的差异私有建模框架,我们评估了LeapYear中应用于信用风险建模领域的一个工具。信用风险模型是银行业和金融业的一种主要建模方法,通过分析用户数据来确定银行的总预期损失。我们研究了差别隐私权在信用风险模型中的应用,并评估了差别隐私模型和非差别隐私模型的性能。信用风险模型是银行业和金融业的一种主要建模方法,通过分析用户数据来确定银行的总预期损失。本文探讨了差分隐私权在信用风险模型中的应用,并用差分隐私模型对一个非差分隐私模型的性能进行了评价。 摘要:The use of machine learning algorithms to model user behavior and drive business decisions has become increasingly commonplace, specifically providing intelligent recommendations to automated decision making. This has led to an increase in the use of customers personal data to analyze customer behavior and predict their interests in a companys products. Increased use of this customer personal data can lead to better models but also to the potential of customer data being leaked, reverse engineered, and mishandled. In this paper, we assess differential privacy as a solution to address these privacy problems by building privacy protections into the data engineering and model training stages of predictive model development. Our interest is a pragmatic implementation in an operational environment, which necessitates a general purpose differentially private modeling framework, and we evaluate one such tool from LeapYear as applied to the Credit Risk modeling domain. Credit Risk Model is a major modeling methodology in banking and finance where user data is analyzed to determine the total Expected Loss to the bank. We examine the application of differential privacy on the credit risk model and evaluate the performance of a Differentially Private Model with a Non Differentially Private Model. Credit Risk Model is a major modeling methodology in banking and finance where users data is analyzed to determine the total Expected Loss to the bank. In this paper, we explore the application of differential privacy on the credit risk model and evaluate the performance of a Non Differentially Private Model with Differentially Private Model.
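差分隐私的基本思想可用拉普拉斯机制示意:对统计量加入与查询敏感度成正比、与 epsilon 成反比的噪声。LeapYear 为商用工具、实现细节未公开,下例只演示一般原理,数据与取值范围均为假设:

```python
# 示意:拉普拉斯机制——对均值查询加入与敏感度/epsilon 成比例的噪声
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=np.random.default_rng(0)):
    clipped = np.clip(values, lower, upper)               # 限幅以界定敏感度
    sensitivity = (upper - lower) / len(values)           # 均值查询的 L1 敏感度
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

incomes = np.array([42_000, 55_000, 61_000, 380_000])     # 假设的收入数据
print("非私有均值:", incomes.mean())
print("epsilon=1 的私有均值:", dp_mean(incomes, 0, 200_000, epsilon=1.0))
```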

其他(3篇)

【1】 TWAG: A Topic-Guided Wikipedia Abstract Generator 标题:TWAG:一个主题制导的维基百科摘要生成器

作者:Fangwei Zhu,Shangqing Tu,Jiaxin Shi,Juanzi Li,Lei Hou,Tong Cui 机构:Dept. of Computer Sci.&Tech., BNRist, Tsinghua University, Beijing , China, KIRC, Institute for Artificial Intelligence, Tsinghua University, School of Computer Science and Engineering, Beihang University, Noah’s Ark Lab, Huawei Inc. 备注:Accepted by ACL 2021 链接:https://arxiv.org/abs/2106.15135 摘要:Wikipedia摘要生成旨在从web源中提取Wikipedia摘要,并通过采用多文档摘要技术取得了显著的成功。然而,以往的研究一般将摘要视为纯文本,忽略了它是对某个实体的描述,可以分解为不同的主题。在本文中,我们提出了一个两阶段的模型TWAG,用主题信息指导摘要的生成。首先,我们检测每个输入段落的主题,使用一个在现有Wikipedia文章上训练的分类器将输入文档划分为不同的主题。然后,我们预测每个抽象句子的主题分布,并使用指针生成器网络从主题感知的表示中解码句子。我们在WikiCatSum数据集上评估了我们的模型,结果表明\modelnames优于现有的各种基线,并且能够生成全面的摘要。我们的代码和数据集可以通过\url访问{https://github.com/THU-KEG/TWAG} 摘要:Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that \modelnames outperforms various existing baselines and is capable of generating comprehensive abstracts. Our code and dataset can be accessed at \url{https://github.com/THU-KEG/TWAG}

【2】 Sexism in the Judiciary 标题:司法机构中的性别歧视

作者:Noa Baker Gillis 机构:Tel Aviv, Israel 备注:To be published in GeBNLP 2021 conference proceedings 链接:https://arxiv.org/abs/2106.15103 摘要:我们分析了670万份判例文件,以确定我国司法系统中是否存在性别偏见。我们发现目前NLP中的偏见检测方法不足以确定案例法数据库中的性别偏见,并提出了一种替代方法。我们证明了现有算法的不一致结果是先前研究对偏差本身定义的结果。偏见检测算法依赖于一组词来表示偏见(例如,“薪水”、“工作”和“老板”来表示就业是文本中针对女性的潜在偏见主题)。然而,建立这些词组的方法有几个弱点,主要是词表是基于研究人员自己的直觉。我们建议两种新的方法来自动创建单词列表来表示偏见。我们发现我们的方法优于目前的NLP偏差检测方法。我们的研究提高了自然语言处理技术检测偏见的能力,突出了在有影响力的判例法中存在的性别偏见。为了检验我们的NLP偏差检测方法的性能,我们将判例法中的偏差结果与过去100年美国人口普查中妇女参与劳动力的数据进行回归。 摘要:We analyze 6.7 million case law documents to determine the presence of gender bias within our judicial system. We find that current bias detectino methods in NLP are insufficient to determine gender bias in our case law database and propose an alternative approach. We show that existing algorithms' inconsistent results are consequences of prior research's definition of biases themselves. Bias detection algorithms rely on groups of words to represent bias (e.g., 'salary,' 'job,' and 'boss' to represent employment as a potentially biased theme against women in text). However, the methods to build these groups of words have several weaknesses, primarily that the word lists are based on the researchers' own intuitions. We suggest two new methods of automating the creation of word lists to represent biases. We find that our methods outperform current NLP bias detection methods. Our research improves the capabilities of NLP technology to detect bias and highlights gender biases present in influential case law. In order test our NLP bias detection method's performance, we regress our results of bias in case law against U.S census data of women's participation in the workforce in the last 100 years.

【3】 A Survey on Neural Speech Synthesis 标题:神经语音合成技术综述

作者:Xu Tan,Tao Qin,Frank Soong,Tie-Yan Liu 机构:Microsoft Research Asia 备注:A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 447 references 链接:https://arxiv.org/abs/2106.15561 摘要:文本到语音(Text-to-speech,简称TTS)是语音、语言和机器学习领域的一个研究热点,在工业领域有着广泛的应用。近年来,随着深度学习和人工智能的发展,基于神经网络的TTS技术显著提高了合成语音的质量。本文对神经TTS进行了全面的综述,旨在对神经TTS的研究现状和发展趋势有一个很好的认识。我们重点讨论了神经TTS的关键组成部分,包括文本分析、声学模型和声码器,以及一些高级主题,包括快速TTS、低资源TTS、鲁棒TTS、表达TTS和自适应TTS等。我们进一步总结了与TTS相关的资源(如数据集,并讨论未来的研究方向。这项调查可以服务于学术研究人员和行业从业人员的TTS工作。 摘要:Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.

机器翻译,仅供参考
