
Natural Language Processing Academic Digest [8.24]

By the WeChat official account "arXiv每日学术速递" (arXiv Daily Academic Digest)
Published 2021-08-25 16:12:01

Update! The H5 page now supports collapsible abstracts for a better reading experience! Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, and offers search, bookmarking, and more!

cs.CL: 26 papers in total today.

Transformer (1 paper)

【1】 A Unified Transformer-based Framework for Duplex Text Normalization
Link: https://arxiv.org/abs/2108.09889

Authors: Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji
Affiliations: NVIDIA; University of Illinois at Urbana-Champaign
Note: Under Review
Abstract: Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition, respectively. Many methods have been proposed for either TN or ITN, ranging from weighted finite-state transducers to neural networks. Despite their impressive performance, these methods aim to tackle only one of the two tasks but not both. As a result, in a complete spoken dialog system, two separate models for TN and ITN need to be built. This heterogeneity increases the technical complexity of the system, which in turn increases the cost of maintenance in a production setting. Motivated by this observation, we propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN. Combined with a simple but effective data augmentation method, our systems achieve state-of-the-art results on the Google TN dataset for English and Russian. They can also reach over 95% sentence-level accuracy on an internal English TN dataset without any additional fine-tuning. In addition, we also create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on the dataset. Overall, experimental results demonstrate the proposed duplex text normalization framework is highly effective and applicable to a range of domains and languages.
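
The general idea of a "duplex" system, one seq2seq model serving both normalization directions, can be illustrated with a task-prefix sketch. This is only a minimal illustration using a generic pretrained T5 checkpoint; the prefixes, examples, and model choice are assumptions, not the authors' architecture or training setup.

```python
# Minimal sketch: one seq2seq model serving both directions of text normalization.
# Assumes a generic pretrained T5 checkpoint; the task prefixes ("tn:" / "itn:")
# and the example inputs are illustrative, not the paper's actual setup.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def duplex_normalize(text: str, direction: str) -> str:
    """direction is 'tn' (written -> spoken) or 'itn' (spoken -> written)."""
    inputs = tokenizer(f"{direction}: {text}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# In practice the model would first be fine-tuned on (prefixed input, target) pairs
# covering both directions, e.g. ("tn: $5", "five dollars") and ("itn: five dollars", "$5").
print(duplex_normalize("$5", "tn"))
print(duplex_normalize("five dollars", "itn"))
```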

BERT (2 papers)

【1】 Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches
Link: https://arxiv.org/abs/2108.10197

Authors: Leonard Dahlmann, Tomer Lancewicki
Affiliations: eBay Inc.
Abstract: The Bidirectional Encoder Representations from Transformers (BERT) model has been radically improving the performance of many Natural Language Processing (NLP) tasks such as Text Classification and Named Entity Recognition (NER) applications. However, it is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size. We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM). The model is capable of inferring an input in at most 0.2ms on CPU. BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task. We achieve this result in two phases. First, we create a pre-trained model, called eBERT, which is the original BERT architecture trained with our unique item title corpus. We then fine-tune eBERT for the QTR task. Second, we train the BertBiLSTM model to mimic the eBERT model's performance through a process called Knowledge Distillation (KD) and show the effect of data augmentation to achieve the resembling goal. Experimental results show that the proposed model outperforms other compact and production-ready models.
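
The distillation step, training the compact BertBiLSTM student to mimic the eBERT teacher, typically reduces to a soft-target loss of the following form. This is a generic KD loss sketch with made-up tensor shapes and our own choice of temperature and mixing weight; the paper's exact objective may differ.

```python
# Generic knowledge-distillation loss: the student mimics the teacher's softened
# logits while also fitting the hard labels. Temperature/alpha values are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy example: batch of 4, binary relevance classification (relevant / not relevant).
student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```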

【2】 UzBERT: pretraining a BERT model for Uzbek
Link: https://arxiv.org/abs/2108.09814

Authors: B. Mansurov, A. Mansurov
Affiliations: Copper City Labs
Note: 9 pages, 1 table
Abstract: Pretrained language models based on the Transformer architecture have achieved state-of-the-art results in various natural language processing tasks such as part-of-speech tagging, named entity recognition, and question answering. However, no such monolingual model for the Uzbek language is publicly available. In this paper, we introduce UzBERT, a pretrained Uzbek language model based on the BERT architecture. Our model greatly outperforms multilingual BERT on masked language model accuracy. We make the model publicly available under the MIT open-source license.

Semantic Analysis (2 papers)

【1】 Semantic-Preserving Adversarial Text Attacks
Link: https://arxiv.org/abs/2108.10015

Authors: Xinghao Yang, Weifeng Liu, James Bailey, Tianqing Zhu, Dacheng Tao, Wei Liu
Affiliations: Bailey is with the Department of Computing and Information Systems, University of Melbourne
Note: 12 pages, 3 figures, 10 tables
Abstract: Deep neural networks (DNNs) are known to be vulnerable to adversarial images, while their robustness in text classification is rarely studied. Several lines of text attack methods have been proposed in the literature, including character-level, word-level, and sentence-level attacks. However, it is still a challenge to minimize the number of word changes necessary to induce misclassification, while simultaneously ensuring lexical correctness, syntactic soundness, and semantic similarity. In this paper, we propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models. Our method has four major merits. Firstly, we propose to attack text documents not only at the unigram word level but also at the bigram level which better keeps semantics and avoids producing meaningless outputs. Secondly, we propose a hybrid method to replace the input words with options among both their synonyms candidates and sememe candidates, which greatly enriches the potential substitutions compared to only using synonyms. Thirdly, we design an optimization algorithm, i.e., Semantic Preservation Optimization (SPO), to determine the priority of word replacements, aiming to reduce the modification cost. Finally, we further improve the SPO with a semantic Filter (named SPOF) to find the adversarial example with the highest semantic similarity. We evaluate the effectiveness of our BU-SPO and BU-SPOF on IMDB, AG's News, and Yahoo! Answers text datasets by attacking four popular DNNs models. Results show that our methods achieve the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
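
The family of word-substitution attacks this paper belongs to can be sketched with a simple greedy loop that tries candidate replacements and keeps whichever most lowers the victim model's confidence. The victim model and synonym table below are dummies for illustration only; BU-SPO additionally uses bigram candidates, sememe candidates, and a semantic filter.

```python
# Sketch of a greedy word-substitution attack: for each word, try candidate
# replacements and keep the one that lowers the target-class probability most.
# The "victim" classifier and the synonym table are hand-made stand-ins.
def victim_prob_positive(tokens):
    # Dummy classifier: probability of "positive" grows with occurrences of good/great.
    score = sum(tok in {"good", "great"} for tok in tokens)
    return min(1.0, 0.2 + 0.4 * score)

synonyms = {"good": ["decent", "fine"], "great": ["notable", "fine"]}

def greedy_attack(sentence, max_changes=2):
    tokens = sentence.lower().split()
    for _ in range(max_changes):
        best = (victim_prob_positive(tokens), None, None)
        for i, tok in enumerate(tokens):
            for cand in synonyms.get(tok, []):
                trial = tokens[:i] + [cand] + tokens[i + 1:]
                p = victim_prob_positive(trial)
                if p < best[0]:
                    best = (p, i, cand)
        if best[1] is None:          # no substitution helps any more
            break
        tokens[best[1]] = best[2]
    return " ".join(tokens)

print(greedy_attack("the movie was good and the cast was great"))
```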

【2】 Palmira: A Deep Deformable Network for Instance Segmentation of Dense and Uneven Layouts in Handwritten Manuscripts
Link: https://arxiv.org/abs/2108.09436

Authors: Prema Satish Sharan, Sowmya Aitha, Amandeep Kumar, Abhishek Trivedi, Aaron Augustine, Ravi Kiran Sarvadevabhatla
Affiliations: Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
Note: Accepted at ICDAR-21. Watch teaser video this https URL, code and pretrained models this https URL, project page this https URL
Abstract: Handwritten documents are often characterized by dense and uneven layout. Despite advances, standard deep network based approaches for semantic layout segmentation are not robust to complex deformations seen across semantic regions. This phenomenon is especially pronounced for the low-resource Indic palm-leaf manuscript domain. To address the issue, we first introduce Indiscapes2, a new large-scale diverse dataset of Indic manuscripts with semantic layout annotations. Indiscapes2 contains documents from four different historical collections and is 150% larger than its predecessor, Indiscapes. We also propose a novel deep network Palmira for robust, deformation-aware instance segmentation of regions in handwritten manuscripts. We also report Hausdorff distance and its variants as a boundary-aware performance measure. Our experiments demonstrate that Palmira provides robust layouts, outperforms strong baseline approaches and ablative variants. We also include qualitative results on Arabic, South-East Asian and Hebrew historical manuscripts to showcase the generalization capability of Palmira.
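
The boundary-aware evaluation mentioned here, the Hausdorff distance between a predicted and a ground-truth region boundary, can be computed directly with SciPy. The coordinates below are made-up stand-ins for polygon boundary points from a manuscript page.

```python
# Symmetric Hausdorff distance between two boundaries represented as point sets.
# The coordinates are illustrative stand-ins for predicted vs. ground-truth polygons.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

pred_boundary = np.array([[0, 0], [10, 0], [10, 5], [0, 5]], dtype=float)
gt_boundary   = np.array([[1, 0], [11, 1], [10, 6], [0, 4]], dtype=float)

d_forward  = directed_hausdorff(pred_boundary, gt_boundary)[0]
d_backward = directed_hausdorff(gt_boundary, pred_boundary)[0]
hausdorff = max(d_forward, d_backward)
print(f"Hausdorff distance: {hausdorff:.2f}")
```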

Graph | Knowledge Graph | Knowledge (2 papers)

【1】 A Hierarchical Entity Graph Convolutional Network for Relation Extraction across Documents
Link: https://arxiv.org/abs/2108.09505

Authors: Tapas Nayak, Hwee Tou Ng
Affiliations: Department of Computer Science, Indian Institute of Technology Kharagpur; National University of Singapore
Note: Accepted in RANLP 2021
Abstract: Distantly supervised datasets for relation extraction mostly focus on sentence-level extraction, and they cover very few relations. In this work, we propose cross-document relation extraction, where the two entities of a relation tuple appear in two different documents that are connected via a chain of common entities. Following this idea, we create a dataset for two-hop relation extraction, where each chain contains exactly two documents. Our proposed dataset covers a higher number of relations than the publicly available sentence-level datasets. We also propose a hierarchical entity graph convolutional network (HEGCN) model for this task that improves performance by 1.1% F1 score on our two-hop relation extraction dataset, compared to some strong neural baselines.

【2】 CushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements
Link: https://arxiv.org/abs/2108.09484

Authors: Lifeng Han, Irina Sorokina, Gleb Erofeev, Serge Gladkoff
Affiliations: ADAPT Research Centre, DCU, Ireland; Logrus Global, Translation & Localization
Note: Extended work from MT SUMMIT 2021: Gleb Erofeev, Irina Sorokina, Lifeng Han, and Serge Gladkoff. 2021. cushLEPOR uses LABSE distilled knowledge to improve correlation with human translation evaluations. In Proceedings for the MT summit - User Track (In Press), online. Association for Computational Linguistics & AMTA
Abstract: Human evaluation has always been expensive while researchers struggle to trust the automatic metrics. To address this, we propose to customise traditional metrics by taking advantages of the pre-trained language models (PLMs) and the limited available human labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python portable version we developed which achieved the automatic tuning of the weighting parameters in hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR) which uses LABSE distilled knowledge model to improve the metric agreement with human judgements by automatically optimised factor weights regarding the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards human evaluation data based on MQM and pSQM framework on English-German and Chinese-English language pairs. The experimental investigations show cushLEPOR boosts hLEPOR performances towards better agreements to PLMs like LABSE with much lower cost, and better agreements to human evaluations including MQM and pSQM scores, and yields much better performances than BLEU (data available at https://github.com/poethan/cushLEPOR).
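
The weight-tuning idea behind cushLEPOR, choosing metric factor weights that maximize agreement with human scores, can be sketched as a small optimization over correlation. The per-factor scores below are synthetic stand-ins for the real hLEPOR components, and the optimizer choice is ours, not necessarily the authors'.

```python
# Sketch: tune the factor weights of a composite MT metric so that its segment-level
# scores correlate better with human judgements. All data here is synthetic.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
factor_scores = rng.random((200, 3))   # stand-ins for per-segment hLEPOR factor scores
human_scores = factor_scores @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.05, 200)

def neg_correlation(weights):
    weights = np.abs(weights) / np.abs(weights).sum()   # keep weights positive, normalised
    metric = factor_scores @ weights
    return -pearsonr(metric, human_scores)[0]

result = minimize(neg_correlation, x0=np.ones(3) / 3, method="Nelder-Mead")
best = np.abs(result.x) / np.abs(result.x).sum()
print("tuned weights:", best, "correlation:", -result.fun)
```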

Summarization | Information Extraction (1 paper)

【1】 Hierarchical Summarization for Longform Spoken Dialog
Link: https://arxiv.org/abs/2108.09597

Authors: Daniel Li, Thomas Chen, Albert Tung, Lydia Chilton
Affiliations: Columbia University, New York, New York, USA; Microsoft, Redmond, Washington, USA; Stanford University, Palo Alto, California, USA
Abstract: Every day we are surrounded by spoken dialog. This medium delivers rich diverse streams of information auditorily; however, systematically understanding dialog can often be non-trivial. Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remains markedly poor, especially when compared to written prose. Furthermore, compared to understanding text, auditory communication poses many additional challenges such as speaker disfluencies, informal prose styles, and lack of structure. These concerns all demonstrate the need for a distinctly speech tailored interactive system to help users understand and navigate the spoken language domain. While individual automatic speech recognition (ASR) and text summarization methods already exist, they are imperfect technologies; neither consider user purpose and intent nor address spoken language induced complications. Consequently, we design a two stage ASR and text summarization pipeline and propose a set of semantic segmentation and merging algorithms to resolve these speech modeling challenges. Our system enables users to easily browse and navigate content as well as recover from errors in these underlying technologies. Finally, we present an evaluation of the system which highlights user preference for hierarchical summarization as a tool to quickly skim audio and identify content of interest to the user.

Reasoning | Analysis | Understanding | Explanation (3 papers)

【1】 Towards Explainable Fact Checking
Link: https://arxiv.org/abs/2108.10274

Authors: Isabelle Augenstein
Note: Thesis presented to the University of Copenhagen Faculty of Science in partial fulfillment of the requirements for the degree of Doctor Scientiarum (Dr. Scient.)
Abstract: The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, from approaches to detect check-worthy claims and determining the stance of tweets towards claims, to methods to determine the veracity of claims given evidence documents. These automatic methods are often content-based, using natural language processing methods, which in turn utilise deep neural networks to learn higher-order features from text in order to make predictions. As deep neural networks are black-box models, their inner workings cannot be easily explained. At the same time, it is desirable to explain how they arrive at certain decisions, especially if they are to be used for decision making. While this has been known for some time, the issues this raises have been exacerbated by models increasing in size, and by EU legislation requiring models to be used for decision making to provide explanations, and, very recently, by legislation requiring online platforms operating in the EU to provide transparent reporting on their services. Despite this, current solutions for explainability are still lacking in the area of fact checking. This thesis presents my research on automatic fact checking, including claim check-worthiness detection, stance detection and veracity prediction. Its contributions go beyond fact checking, with the thesis proposing more general machine learning solutions for natural language processing in the area of learning with limited labelled data. Finally, the thesis presents some first solutions for explainable fact checking.

【2】 Analysis of Chronic Pain Experiences Based on Online Reports: the RRCP Dataset
Link: https://arxiv.org/abs/2108.10218

Authors: Diogo A. P. Nunes, David Martins de Matos, Joana Ferreira Gomes, Fani Neto
Affiliations: INESC-ID, Lisbon, Portugal; Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Biomedicina, Unidade de Biologia Experimental, Faculdade de Medicina, Universidade do Porto, Porto, Portugal
Note: 10 pages, 6 figures, 5 tables
Abstract: Chronic pain is recognized as a major health problem, with impacts at the economic, social, and individual levels. Being a private and subjective experience, dependent on a complex cognitive process involving the subject's past experiences, sociocultural embeddedness, as well as emotional and psychological loads, it is impossible to externally and impartially experience, describe, and interpret chronic pain as a purely noxious stimulus that would directly point to a causal agent and facilitate its mitigation. Verbal communication is, thus, key to convey relevant information to health professionals that would otherwise not be accessible to external entities. Specifically, what a patient suffering of chronic pain describes from the experience and how this information is disclosed reveals intrinsic qualities about the patient and the experience of pain itself. We present the Reddit Reports of Chronic Pain (RRCP) dataset, which comprises social media textual descriptions and discussion of various forms of chronic pain experiences, as reported from the perspective of different base pathologies. For each pathology, we identify the main concerns emergent of its consequent experience of chronic pain, as represented by the subset of documents explicitly related to it. This is obtained via document clustering in the latent space. By means of cosine similarity, we determine which concerns of different pathologies are core to all experiences of pain, and which are exclusive to certain forms. Finally, we argue that our unsupervised semantic analysis of descriptions of chronic pain echoes clinical research on how different pathologies manifest in terms of the chronic pain experience.
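
The analysis pipeline described here, clustering each pathology's documents in a latent space into "concerns" and then comparing concerns across pathologies with cosine similarity, can be sketched as follows. TF-IDF stands in for the latent document representation and the toy texts are invented; the paper's actual embeddings and cluster counts will differ.

```python
# Sketch: cluster documents of one pathology into "concerns", then compare concern
# centroids across pathologies with cosine similarity. TF-IDF stands in for the
# latent document embeddings; the texts are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs_a = ["constant back pain and poor sleep", "sleep is ruined by the pain",
          "cannot work because of flare ups", "lost my job due to flare ups"]
docs_b = ["migraine keeps me from sleeping", "sleep deprivation from headaches",
          "medication side effects worry me", "worried about long term medication"]

vectorizer = TfidfVectorizer().fit(docs_a + docs_b)
X_a, X_b = vectorizer.transform(docs_a), vectorizer.transform(docs_b)

concerns_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_a)
concerns_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_b)

# Pairwise similarity between concern centroids of the two pathologies:
# high values suggest a concern shared across pain conditions.
print(cosine_similarity(concerns_a.cluster_centers_, concerns_b.cluster_centers_))
```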

【3】 2020 U.S. Presidential Election: Analysis of Female and Male Users on Twitter
Link: https://arxiv.org/abs/2108.09416

Authors: Amir Karami, Spring B. Clark, Anderson Mackenzie, Dorathea Lee, Michael Zhu, Hannah R. Boyajieff, Bailey Goldschmidt
Affiliations: University of South Carolina, USA, Dora Lee, The order of authors is not finalized yet, TBD
Abstract: Social media is commonly used by the public during election campaigns to express their opinions regarding different issues. Among various social media channels, Twitter provides an efficient platform for researchers and politicians to explore public opinion regarding a wide range of topics such as economy and foreign policy. Current literature mainly focuses on analyzing the content of tweets without considering the gender of users. This research collects and analyzes a large number of tweets and uses computational, human coding, and statistical analyses to identify topics in more than 300,000 tweets posted during the 2020 U.S. presidential election and to compare female and male users regarding the average weight of the topics. Our findings are based upon a wide range of topics, such as tax, climate change, and the COVID-19 pandemic. Out of the topics, there exists a significant difference between female and male users for more than 70% of topics. Our research approach can inform studies in the areas of informatics, politics, and communication, and it can be used by political campaigns to obtain a gender-based understanding of public opinion.
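
One common way to operationalise the "average weight of the topics" per gender group is to fit a topic model over all tweets and average each tweet's topic distribution within each group. The abstract does not say which topic model the authors used, so the LDA choice below is an assumption and the tweets are invented.

```python
# Sketch: fit a topic model on tweets and compare mean topic weights between
# female and male users. LDA is an assumed choice; the tweets/groups are invented.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["cut taxes now", "taxes are too high", "climate change is real",
          "act on climate now", "wear a mask", "vaccines and the pandemic"]
gender = np.array(["F", "M", "F", "F", "M", "M"])

X = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_weights = lda.fit_transform(X)                 # per-tweet topic distribution

for g in ("F", "M"):
    print(g, topic_weights[gender == g].mean(axis=0))
```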

Semi-/Weakly-/Un-supervised | Uncertainty (1 paper)

【1】 Improving Distantly Supervised Relation Extraction with Self-Ensemble Noise Filtering
Link: https://arxiv.org/abs/2108.09689

Authors: Tapas Nayak, Navonil Majumder, Soujanya Poria
Affiliations: IIT Kharagpur, India; SUTD, Singapore
Note: Accepted in RANLP 2021. arXiv admin note: substantial text overlap with arXiv:2104.01799, arXiv:2103.16929
Abstract: Distantly supervised models are very popular for relation extraction since we can obtain a large amount of training data using the distant supervision method without human annotation. In distant supervision, a sentence is considered as a source of a tuple if the sentence contains both entities of the tuple. However, this condition is too permissive and does not guarantee the presence of relevant relation-specific information in the sentence. As such, distantly supervised training data contains much noise which adversely affects the performance of the models. In this paper, we propose a self-ensemble filtering mechanism to filter out the noisy samples during the training process. We evaluate our proposed framework on the New York Times dataset which is obtained via distant supervision. Our experiments with multiple state-of-the-art neural relation extraction models show that our proposed filtering mechanism improves the robustness of the models and increases their F1 scores.
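
A simple way to picture self-ensemble filtering is to keep a distantly labelled sample only when enough snapshots of the model, taken at different points during training, agree with its distant label. The snapshot predictions and the agreement threshold below are illustrative; the paper's exact filtering criterion may differ.

```python
# Sketch of self-ensemble noise filtering for distantly supervised data:
# keep a training sample only if enough model snapshots (e.g. per-epoch copies)
# agree with its distant label. Predictions here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
num_samples, num_snapshots = 10, 5
distant_labels = rng.integers(0, 3, size=num_samples)                    # noisy relation labels
snapshot_preds = rng.integers(0, 3, size=(num_snapshots, num_samples))   # per-epoch predictions

agreement = (snapshot_preds == distant_labels).mean(axis=0)   # fraction of snapshots agreeing
keep_mask = agreement >= 0.6                                  # illustrative threshold

print("agreement per sample:", agreement)
print("samples kept for the next training round:", np.where(keep_mask)[0])
```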

Detection (2 papers)

【1】 An Interpretable Approach to Hateful Meme Detection
Link: https://arxiv.org/abs/2108.10069

Authors: Tanvi Deshpande, Nitya Mani
Affiliations: Irvington High School, Fremont, CA, USA; Massachusetts Institute of Technology, Cambridge, MA, USA
Note: 5 pages. 2021 ACM International Conference on Multimodal Interaction (ICMI)
Abstract: Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message. We take an interpretable approach to hateful meme detection, using machine learning and simple heuristics to identify the features most important to classifying a meme as hateful. In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art transformer models on this challenging task.

【2】 Sarcasm Detection in Twitter -- Performance Impact when using Data Augmentation: Word Embeddings
Link: https://arxiv.org/abs/2108.09924

Authors: Alif Tri Handoyo, Hidayaturrahman, Derwin Suhartono
Affiliations: Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Note: 7 pages, 4 figures. arXiv admin note: text overlap with arXiv:2104.09261 by other authors
Abstract: Sarcasm is the use of words usually used to either mock or annoy someone, or for humorous purposes. Sarcasm is largely used in social networks and microblogging websites, where people mock or censure in a way that makes it difficult even for humans to tell if what is said is what is meant. Failure to identify sarcastic utterances in Natural Language Processing applications such as sentiment analysis and opinion mining will confuse classification algorithms and generate false results. Several studies on sarcasm detection have utilized different learning algorithms. However, most of these learning models have always focused on the contents of expression only, leaving the contextual information in isolation. As a result, they failed to capture the contextual information in the sarcastic expression. Moreover, some datasets used in several studies have an unbalanced dataset which impacting the model result. In this paper, we propose a contextual model for sarcasm identification in twitter using RoBERTa, and augmenting the dataset by applying Global Vector representation (GloVe) for the construction of word embedding and context learning to generate more data and balancing the dataset. The effectiveness of this technique is tested with various datasets and data augmentation settings. In particular, we achieve performance gain by 3.2% in the iSarcasm dataset when using data augmentation to increase 20% of data labeled as sarcastic, resulting F-score of 40.4% compared to 37.2% without data augmentation.
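
The embedding-based augmentation step, creating extra sarcastic examples by swapping words for nearest neighbours in GloVe space, can be sketched as follows. The gensim model name and the crude "replace every other known word" rule are assumptions; the paper's augmentation procedure may be more involved.

```python
# Sketch of embedding-based augmentation: replace some words in a tweet with their
# nearest neighbours in GloVe space to create additional (label-preserving) samples.
# The model name and the simple replacement rule are assumptions for illustration.
import gensim.downloader as api

glove = api.load("glove-twitter-25")   # small pretrained GloVe vectors

def augment(tweet: str) -> str:
    out = []
    for i, word in enumerate(tweet.lower().split()):
        if i % 2 == 1 and word in glove:               # crude choice of words to replace
            out.append(glove.most_similar(word, topn=1)[0][0])
        else:
            out.append(word)
    return " ".join(out)

print(augment("oh great another monday morning meeting"))
```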

Representation (1 paper)

【1】 Yseop at FinSim-3 Shared Task 2021: Specializing Financial Domain Learning with Phrase Representations
Link: https://arxiv.org/abs/2108.09485

Authors: Hanna Abi Akl, Dominique Mariko, Hugues de Mazancourt
Affiliations: Yseop
Note: To be published in ACL Anthology
Abstract: In this paper, we present our approaches for the FinSim-3 Shared Task 2021: Learning Semantic Similarities for the Financial Domain. The aim of this shared task is to correctly classify a list of given terms from the financial domain into the most relevant hypernym (or top-level) concept in an external ontology. For our system submission, we evaluate two methods: a Sentence-RoBERTa (SRoBERTa) embeddings model pre-trained on a custom corpus, and a dual word-sentence embeddings model that builds on the first method by improving the proposed baseline word embeddings construction using the FastText model to boost the classification performance. Our system ranks 2nd overall on both metrics, scoring 0.917 on Average Accuracy and 1.141 on Mean Rank.
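
The core classification step, embedding a financial term and assigning it to the nearest hypernym label, can be sketched with an off-the-shelf sentence-embedding model. The checkpoint name and the tiny hypernym list below are placeholders; the authors pre-train their own SRoBERTa on a custom financial corpus.

```python
# Sketch: classify a financial term into the most similar hypernym label by cosine
# similarity of sentence embeddings. Checkpoint and label set are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
hypernyms = ["Bonds", "Equity Index", "Credit Risk", "Interest Rate", "Option"]

def classify(term: str) -> str:
    term_emb = model.encode(term, convert_to_tensor=True)
    label_embs = model.encode(hypernyms, convert_to_tensor=True)
    scores = util.cos_sim(term_emb, label_embs)[0]
    return hypernyms[int(scores.argmax())]

print(classify("zero-coupon bond"))
```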

Word2Vec | Text | Words (1 paper)

【1】 How Cute is Pikachu? Gathering and Ranking Pokémon Properties from Data with Pokémon Word Embeddings
Link: https://arxiv.org/abs/2108.09546

Authors: Mika Hämäläinen, Khalid Alnajjar, Niko Partanen
Affiliations: Department of Digital Humanities, University of Helsinki
Note: English translation of Hämäläinen, M., Alnajjar, K. & Partanen, N. (2021). Nettikorpuksen avulla tuotettuja sanavektorimalleja Pokémonien ominaisuuksien kuvaamiseksi. In Saarikivi, T. & Saarikivi, J. (eds.) Turhan tiedon kirja -- Tutkimuksista pois jätettyjä sivuja
Abstract: We present different methods for obtaining descriptive properties automatically for the 151 original Pokémon. We train several different word embeddings models on a crawled Pokémon corpus, and use them to rank automatically English adjectives based on how characteristic they are to a given Pokémon. Based on our experiments, it is better to train a model with domain specific data than to use a pretrained model. Word2Vec produces less noise in the results than fastText model. Furthermore, we expand the list of properties for each Pokémon automatically. However, none of the methods is spot on and there is a considerable amount of noise in the different semantic models. Our models have been released on Zenodo.
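
The ranking procedure, training Word2Vec on a Pokémon corpus and ordering English adjectives by their similarity to a Pokémon's name, can be sketched with gensim. The toy corpus and adjective list below are invented stand-ins for the crawled data.

```python
# Sketch: train Word2Vec on a (toy) Pokémon corpus and rank adjectives by cosine
# similarity to a given Pokémon name. The sentences and adjectives are invented.
from gensim.models import Word2Vec

corpus = [
    ["pikachu", "is", "a", "cute", "yellow", "electric", "mouse"],
    ["pikachu", "is", "fast", "and", "cute"],
    ["onix", "is", "a", "huge", "heavy", "rock", "snake"],
    ["onix", "is", "strong", "and", "heavy"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, epochs=200, seed=0)

adjectives = ["cute", "heavy", "fast", "strong", "huge"]
ranked = sorted(adjectives, key=lambda adj: model.wv.similarity("pikachu", adj), reverse=True)
print(ranked)
```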

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 Metric Learning in Multilingual Sentence Similarity Measurement for Document Alignment
Link: https://arxiv.org/abs/2108.09495

Authors: Charith Rajitha, Lakmali Piyarathne, Dilan Sachintha, Surangika Ranathunga
Affiliations: Department of Computer Science and Engineering, University of Moratuwa, Katubedda, Sri Lanka
Abstract: Document alignment techniques based on multilingual sentence representations have recently shown state of the art results. However, these techniques rely on unsupervised distance measurement techniques, which cannot be fine-tuned to the task at hand. In this paper, instead of these unsupervised distance measurement techniques, we employ Metric Learning to derive task-specific distance measurements. These measurements are supervised, meaning that the distance measurement metric is trained using a parallel dataset. Using a dataset belonging to English, Sinhala, and Tamil, which belong to three different language families, we show that these task-specific supervised distance learning metrics outperform their unsupervised counterparts, for document alignment.
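
The supervised distance idea, learning a task-specific metric on top of multilingual sentence embeddings from parallel data, can be pictured as a small learned transform trained so that aligned pairs end up closer than random pairs. The linear map and triplet loss below are our simplification, not the paper's exact formulation, and the embeddings are random stand-ins.

```python
# Sketch of metric learning for alignment: learn a linear map W so that the distance
# ||Wx - Wy|| is small for aligned sentence pairs and large for random pairs.
# Embeddings are random stand-ins for multilingual sentence representations.
import torch
import torch.nn as nn

dim, n_pairs = 64, 256
torch.manual_seed(0)
src = torch.randn(n_pairs, dim)
tgt = src + 0.1 * torch.randn(n_pairs, dim)          # synthetic "aligned" targets
neg = torch.randn(n_pairs, dim)                      # random negatives

W = nn.Linear(dim, dim, bias=False)
opt = torch.optim.Adam(W.parameters(), lr=1e-3)
margin_loss = nn.TripletMarginLoss(margin=1.0)

for step in range(200):
    opt.zero_grad()
    loss = margin_loss(W(src), W(tgt), W(neg))       # anchor, positive, negative
    loss.backward()
    opt.step()

print("final triplet loss:", float(loss))
```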

【2】 BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation
Link: https://arxiv.org/abs/2108.09433

Authors: Abhishek Trivedi, Ravi Kiran Sarvadevabhatla
Affiliations: Centre for Visual Information Technology (CVIT), International Institute of Information Technology, Hyderabad, India
Note: Accepted at ICDAR-21 for oral presentation - watch video this https URL View webpage this http URL Code and pretrained models this https URL
Abstract: Precise boundary annotations of image regions can be crucial for downstream applications which rely on region-class semantics. Some document collections contain densely laid out, highly irregular and overlapping multi-class region instances with large range in aspect ratio. Fully automatic boundary estimation approaches tend to be data intensive, cannot handle variable-sized images and produce sub-optimal results for aforementioned images. To address these issues, we propose BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation. The variable-sized user selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using Hausdorff loss to obtain the final region boundary. Results on a challenging image manuscript dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning. We integrate BoundaryNet into a document annotation system and show that it provides high annotation throughput compared to manual and fully automatic alternatives.

Others (8 papers)

【1】 Legal Search in Case Law and Statute Law
Link: https://arxiv.org/abs/2108.10127

Authors: Julien Rossi, Evangelos Kanoulas
Affiliations: Amsterdam Business School, University of Amsterdam; Institute of Informatics, University of Amsterdam
Abstract: In this work we describe a method to identify document pairwise relevance in the context of a typical legal document collection: limited resources, long queries and long documents. We review the usage of generalized language models, including supervised and unsupervised learning. We observe how our method, while using text summaries, overperforms existing baselines based on full text, and motivate potential improvement directions for future work.

【2】 VerbCL: A Dataset of Verbatim Quotes for Highlight Extraction in Case Law
Link: https://arxiv.org/abs/2108.10120

Authors: Julien Rossi, Svitlana Vakulenko, Evangelos Kanoulas
Affiliations: University of Amsterdam, Netherlands
Note: CIKM 2021, Resource Track
Abstract: Citing legal opinions is a key part of legal argumentation, an expert task that requires retrieval, extraction and summarization of information from court decisions. The identification of legally salient parts in an opinion for the purpose of citation may be seen as a domain-specific formulation of a highlight extraction or passage retrieval task. As similar tasks in other domains such as web search show significant attention and improvement, progress in the legal domain is hindered by the lack of resources for training and evaluation. This paper presents a new dataset that consists of the citation graph of court opinions, which cite previously published court opinions in support of their arguments. In particular, we focus on the verbatim quotes, i.e., where the text of the original opinion is directly reused. With this approach, we explain the relative importance of different text spans of a court opinion by showcasing their usage in citations, and measuring their contribution to the relations between opinions in the citation graph. We release VerbCL, a large-scale dataset derived from CourtListener and introduce the task of highlight extraction as a single-document summarization task based on the citation graph establishing the first baseline results for this task on the VerbCL dataset.

【3】 Polarity in the Classroom: A Case Study Leveraging Peer Sentiment Toward Scalable Assessment
Link: https://arxiv.org/abs/2108.10068

Authors: Zachariah J. Beasley, Les A. Piegl, Paul Rosen
Abstract: Accurately grading open-ended assignments in large or massive open online courses (MOOCs) is non-trivial. Peer review is a promising solution but can be unreliable due to few reviewers and an unevaluated review form. To date, no work has 1) leveraged sentiment analysis in the peer-review process to inform or validate grades or 2) utilized aspect extraction to craft a review form from what students actually communicated. Our work utilizes, rather than discards, student data from review form comments to deliver better information to the instructor. In this work, we detail the process by which we create our domain-dependent lexicon and aspect-informed review form as well as our entire sentiment analysis algorithm which provides a fine-grained sentiment score from text alone. We end by analyzing validity and discussing conclusions from our corpus of over 6800 peer reviews from nine courses to understand the viability of sentiment in the classroom for increasing the information from and reliability of grading open-ended assignments in large courses.

【4】 Event Extraction by Associating Event Types and Argument Roles
Link: https://arxiv.org/abs/2108.10038

Authors: Qian Li, Shu Guo, Jia Wu, Jianxin Li, Jiawei Sheng, Lihong Wang, Xiaohan Dong, Hao Peng
Affiliations: Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China; School of Computer Science and Engineering, Beihang University, Beijing, China
Abstract: Event extraction (EE), which acquires structural event knowledge from texts, can be divided into two sub-tasks: event type classification and element extraction (namely identifying triggers and arguments under different role patterns). As different event types always own distinct extraction schemas (i.e., role patterns), previous work on EE usually follows an isolated learning paradigm, performing element extraction independently for different event types. It ignores meaningful associations among event types and argument roles, leading to relatively poor performance for less frequent types/roles. This paper proposes a novel neural association framework for the EE task. Given a document, it first performs type classification via constructing a document-level graph to associate sentence nodes of different types, and adopting a graph attention network to learn sentence embeddings. Then, element extraction is achieved by building a universal schema of argument roles, with a parameter inheritance mechanism to enhance role preference for extracted elements. As such, our model takes into account type and role associations during EE, enabling implicit information sharing among them. Experimental results show that our approach consistently outperforms most state-of-the-art EE methods in both sub-tasks. Particularly, for types/roles with less training data, the performance is superior to the existing methods.

【5】 Fluent: An AI Augmented Writing Tool for People who Stutter
Link: https://arxiv.org/abs/2108.09918

Authors: Bhavya Ghai, Klaus Mueller
Affiliations: Stony Brook University
Note: Accepted to ACM ASSETS 2021 conference
Abstract: Stuttering is a speech disorder which impacts the personal and professional lives of millions of people worldwide. To save themselves from stigma and discrimination, people who stutter (PWS) may adopt different strategies to conceal their stuttering. One of the common strategies is word substitution where an individual avoids saying a word they might stutter on and use an alternative instead. This process itself can cause stress and add more burden. In this work, we present Fluent, an AI augmented writing tool which assists PWS in writing scripts which they can speak more fluently. Fluent embodies a novel active learning based method of identifying words an individual might struggle pronouncing. Such words are highlighted in the interface. On hovering over any such word, Fluent presents a set of alternative words which have similar meaning but are easier to speak. The user is free to accept or ignore these suggestions. Based on such user interaction (feedback), Fluent continuously evolves its classifier to better suit the personalized needs of each user. We evaluated our tool by measuring its ability to identify difficult words for 10 simulated users. We found that our tool can identify difficult words with a mean accuracy of over 80% in under 20 interactions and it keeps improving with more feedback. Our tool can be beneficial for certain important life situations like giving a talk, presentation, etc. The source code for this tool has been made publicly accessible at github.com/bhavyaghai/Fluent.

【6】 Analyzing the Granularity and Cost of Annotation in Clinical Sequence Labeling
Link: https://arxiv.org/abs/2108.09913

Authors: Haozhan Sun, Chenchen Xu, Hanna Suominen
Affiliations: Duke University, Durham, United States; The Australian National University, Canberra, Australia
Abstract: Well-annotated datasets, as shown in recent top studies, are becoming more important for researchers than ever before in supervised machine learning (ML). However, the dataset annotation process and its related human labor costs remain overlooked. In this work, we analyze the relationship between the annotation granularity and ML performance in sequence labeling, using clinical records from nursing shift-change handover. We first study a model derived from textual language features alone, without additional information based on nursing knowledge. We find that this sequence tagger performs well in most categories under this granularity. Then, we further include the additional manual annotations by a nurse, and find the sequence tagging performance remaining nearly the same. Finally, we give a guideline and reference to the community arguing it is not necessary and even not recommended to annotate in detailed granularity because of a low Return on Investment. Therefore we recommend emphasizing other features, like textual knowledge, for researchers and practitioners as a cost-effective source for increasing the sequence labeling performance.

【7】 Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Link: https://arxiv.org/abs/2108.09479

Authors: Ming Yan, Haiyang Xu, Chenliang Li, Bin Bi, Junfeng Tian, Min Gui, Wei Wang
Affiliations: Alibaba Group
Abstract: Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Besides, the presence of object detection imposes unnecessary constraints on model designs and makes it difficult to support end-to-end training. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with the grid features. By pre-training only with in-domain datasets, the proposed Grid-VLP method can outperform most competitive region-based VLP methods on three examined vision-language understanding tasks. We hope that our findings help to further advance the state of the art of vision-language pre-training, and provide a new direction towards effective and efficient VLP.

【8】 One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles
Link: https://arxiv.org/abs/2108.09355

Authors: Zhengyi Ma, Zhicheng Dou, Yutao Zhu, Hanxun Zhong, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; Key Laboratory of Data Engineering and Knowledge Engineering, MOE
Note: Accepted By SIGIR 2021, Full Papers
Abstract: Personalized chatbots focus on endowing chatbots with a consistent personality to behave like real users, give more informative responses, and further act as personal assistants. Existing personalized approaches tried to incorporate several text descriptions as explicit user profiles. However, the acquisition of such explicit profiles is expensive and time-consuming, thus being impractical for large-scale real-world applications. Moreover, the restricted predefined profile neglects the language behavior of a real user and cannot be automatically updated together with the change of user interests. In this paper, we propose to learn implicit user profiles automatically from large-scale user dialogue history for building personalized chatbots. Specifically, leveraging the benefits of Transformer on language understanding, we train a personalized language model to construct a general user profile from the user's historical responses. To highlight the relevant historical responses to the input post, we further establish a key-value memory network of historical post-response pairs, and build a dynamic post-aware user profile. The dynamic profile mainly describes what and how the user has responded to similar posts in history. To explicitly utilize users' frequently used words, we design a personalized decoder to fuse two decoding strategies, including generating a word from the generic vocabulary and copying one word from the user's personalized vocabulary. Experiments on two real-world datasets show the significant improvement of our model compared with existing methods.

Originally published 2021-08-24. This article is shared from the WeChat official account "arXiv每日学术速递" and participates in the Tencent Cloud self-media syndication program. For copyright concerns, contact cloudcommunity@tencent.com for removal.
