
Hello Edge: Keyword Spotting on Microcontrollers

用户6026865
Published 2023-03-02 13:23:52

Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience.

Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms.

Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability.

The design of a neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers.

We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements.

We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with a similar number of parameters.

Introduction

Deep learning algorithms have evolved to a stage where they have surpassed human accuracies in a variety of cognitive tasks including image classification and conversational speech recognition.

Motivated by the recent breakthroughs in deep learning based speech recognition technologies, speech is increasingly becoming a more natural way to interact with consumer electronic devices, for example, Amazon Echo, Google Home and smart phones.

However, always-on speech recognition is not energy-efficient and may also cause network congestion when continuous audio streams are transmitted from billions of these devices to the cloud. Furthermore, such a cloud based solution adds latency to the application, which hurts user experience.

There are also privacy concerns when audio is continuously transmitted to the cloud. To mitigate these concerns, the devices first detect predefined keyword(s) such as "Alexa", "Ok Google", "Hey Siri", etc., which is commonly known as keyword spotting (KWS). Detection of a keyword wakes up the device and then activates the full scale speech recognition either on device or in the cloud.

In some applications, a sequence of keywords can be used as voice commands to a smart device such as a voice-enabled light bulb. Since a KWS system is always-on, it should have very low power consumption to maximize battery life.

On the other hand, the KWS system should detect the keywords with high accuracy and low latency, for the best user experience.

These conflicting system requirements have made KWS an active area of research ever since its inception over 50 years ago. Recently, with the renaissance of artificial neural networks in the form of deep learning algorithms, neural network (NN) based KWS has become very popular.

The low power consumption requirement for keyword spotting systems makes microcontrollers an obvious choice for deploying KWS in an always-on system.

Microcontrollers are low-cost, energy-efficient processors that are ubiquitous in our everyday life, with their presence in a variety of devices ranging from home appliances, automobiles and consumer electronics to wearables.

However, deployment of neural network based KWS on microcontrollers comes with the following challenges:

  • Limited memory footprint: Typical microcontroller systems have only tens to a few hundred KB of memory available. The entire neural network model, including input/output, weights and activations, has to fit within this small memory budget.
  • Limited compute resources: Since KWS is always-on, the real-time requirement limits the total number of operations per neural network inference.

These microcontroller resource constraints in conjunction with the high accuracy and low latency requirements of KWS call for a resource-constrained neural network architecture exploration to find lean neural network structures suitable for KWS, which is the primary focus of our work. The main contributions of this work are as follows:

  • We first train the popular KWS neural net models from the literature on the Google speech commands dataset and compare them in terms of accuracy, memory footprint and number of operations per inference.
  • In addition, we implement a new KWS model using depth-wise separable convolutions and point-wise convolutions, inspired by the success of the resource-efficient MobileNet [10] in computer vision. This model outperforms the other prior models in all aspects of accuracy, model size and number of operations.
  • Finally, we perform resource-constrained neural network architecture exploration and present a comprehensive comparison of different network architectures within a set of compute and memory constraints of typical microcontrollers. The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

Background

Keyword Spotting (KWS) System

A typical KWS system consists of a feature extractor and a neural network based classifier as shown in Fig. 1. First, the input speech signal of length L is framed into overlapping frames of length l with a stride s, giving a total of $T = \frac{L - l}{s} + 1$ frames. From each frame, F speech features are extracted, generating a total of T × F features for the entire input speech signal of length L.
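The framing arithmetic above can be sketched in a few lines of Python. The sample rate, frame length, stride and feature count below are illustrative assumptions, not values stated in this article, and the helper names are hypothetical.

```python
def num_frames(signal_len: int, frame_len: int, stride: int) -> int:
    """Number of overlapping frames, T = (L - l) / s + 1, using integer division."""
    if signal_len < frame_len:
        return 0
    return (signal_len - frame_len) // stride + 1


def frame_signal(signal, frame_len, stride):
    """Split a sequence of samples into overlapping frames of length frame_len."""
    count = num_frames(len(signal), frame_len, stride)
    return [signal[i * stride : i * stride + frame_len] for i in range(count)]


# Assumed example: 1 s of audio at 16 kHz, 40 ms frames (640 samples),
# 20 ms stride (320 samples), and F = 10 features per frame.
L, l, s, F = 16000, 640, 320, 10
T = num_frames(L, l, s)
print(T, T * F)  # prints: 49 490
```

With these assumed settings the classifier input would be a 49 × 10 feature matrix.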

Log-mel filter bank energies (LFBE) and Mel-frequency cepstral coefficients (MFCC) are the commonly used human-engineered speech features in deep learning based speech recognition, adapted from traditional speech processing techniques.

Feature extraction using LFBE or MFCC involves translating the time-domain speech signal into a set of frequency-domain spectral coefficients, which enables dimensionality compression of the input signal.

The extracted speech feature matrix is fed into a classifier module, which generates the probabilities for the output classes. In a real-world scenario where keywords need to be identified from a continuous audio stream, a posterior handling module averages the output probabilities of each output class over a period of time, improving the overall confidence of the prediction.
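As a rough illustration of the posterior handling step, the sketch below averages per-class probabilities over a sliding window of recent inferences. The class name `PosteriorSmoother`, the window size and the detection threshold are assumptions made for this example; the article does not specify them.

```python
from collections import deque


class PosteriorSmoother:
    """Averages classifier output probabilities over the last `window` inferences."""

    def __init__(self, num_classes: int, window: int):
        self.num_classes = num_classes
        self.history = deque(maxlen=window)  # oldest entries are dropped automatically

    def update(self, probs):
        """Add one probability vector and return the per-class running average."""
        if len(probs) != self.num_classes:
            raise ValueError("unexpected number of classes")
        self.history.append(list(probs))
        n = len(self.history)
        return [sum(frame[c] for frame in self.history) / n
                for c in range(self.num_classes)]


def detect(smoothed, threshold):
    """Return the index of the most likely class if it clears the threshold, else None."""
    best = max(range(len(smoothed)), key=lambda c: smoothed[c])
    return best if smoothed[best] >= threshold else None


smoother = PosteriorSmoother(num_classes=2, window=2)
for probs in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    averaged = smoother.update(probs)
    print(averaged, detect(averaged, threshold=0.7))
```

A longer window suppresses spurious single-frame detections at the cost of slower response, which is the trade-off the averaging is meant to manage.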

Traditional speech recognition technologies for KWS use Hidden Markov Models (HMMs) and Viterbi decoding. While these approaches achieve reasonable accuracies, they are hard to train and are computationally expensive during inference.

Other techniques explored for KWS include discriminative models adopting a large-margin problem formulation or recurrent neural networks (RNN). Although these methods significantly outperform HMM based KWS in terms of accuracy, they suffer from large detection latency.

KWS models using deep neural networks (DNN) based on fully-connected layers with rectified linear unit (ReLU) activation functions have been introduced, which outperform the HMM models with a very small detection latency.

Furthermore, low-rank approximation techniques are used to compress the DNN model weights, achieving similar accuracy with fewer hardware resources. The main drawback of DNNs is that they ignore the local temporal and spectral correlation in the input speech features.

In order to exploit these correlations, different variants of convolutional neural network (CNN) based KWS have been explored, which demonstrate higher accuracy than DNNs.

The drawback of CNNs in modeling time-varying signals (e.g. speech) is that they ignore long-term temporal dependencies. Combining the strengths of CNNs and RNNs, convolutional recurrent neural network based KWS has been investigated and shown to be robust to noise.

While all the prior KWS neural networks are trained with a cross entropy loss function, a max-pooling based loss function for training a KWS model with long short-term memory (LSTM) has been proposed, which achieves better accuracy than the DNNs and LSTMs trained with cross entropy loss.

Although many neural network models for KWS are presented in the literature, it is difficult to make a fair comparison between them as they are all trained and evaluated on different proprietary datasets (e.g. "TalkType" dataset, "Alexa" dataset, etc.) with different input speech features and audio duration.

Also, the primary focus of prior research has been to maximize the accuracy with a small memory footprint model, without explicit constraints of the underlying hardware, such as limits on the number of operations per inference.

In contrast, this work is more hardware-centric and targeted towards neural network architectures that maximize accuracy on microcontroller devices. The constraints on memory and compute significantly limit the neural network parameters and the number of operations.

Microcontroller Systems

A typical microcontroller system consists of a processor core, an on-chip SRAM block and an on-chip embedded flash.

Table 1 shows some commercially available microcontroller development boards with Arm Cortex-M processor cores with different compute capabilities running at different frequencies (16 MHz to 216 MHz), consisting of a wide range of on-chip memory (SRAM: 8 KB to 320 KB; Flash: 128 KB to 1 MB).

The program binary, usually preloaded into the non-volatile flash, is loaded into the SRAM at startup and the processor runs the program with the SRAM as the main data memory. Therefore, the size of the SRAM limits the size of memory that the software can use.

Other than the memory footprint, performance (i.e., operations per second) is also a constraining factor for running neural networks on microcontrollers.

Most microcontrollers are designed for embedded applications with low cost and high energy-efficiency as the primary targets, and do not have high throughput for compute-intensive workloads such as neural networks.

Some microcontrollers have integrated DSP instructions that can be useful for running neural network workloads. For example, Cortex-M4 and Cortex-M7 have integrated SIMD and MAC instructions that can be used to accelerate low-precision computation in neural networks.

Neural Network Architectures for KWS

This section gives an overview of all the different neural network architectures explored in this work including the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN) and depthwise separable convolutional neural network (DS-CNN).

Deep Neural Network (DNN)

The DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers. The input to the DNN is the flattened feature matrix, which feeds into a stack of d hidden fully-connected layers each with n neurons. Typically, each fully-connected layer is followed by a rectified linear unit (ReLU) based activation function. At the output is a linear layer followed by a softmax layer generating the output probabilities of the k keywords, which are used for further posterior handling.
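The structure described above (flatten, d hidden layers of n neurons with ReLU, linear output, softmax over k keywords) can be sketched in plain Python. This is a minimal illustration with hypothetical helper names, not the authors' implementation; the parameter count also shows why the flattened input size dominates the memory footprint of the first layer.

```python
import math


def dnn_param_count(input_size: int, d: int, n: int, k: int) -> int:
    """Weights plus biases for d hidden layers of n neurons and a k-way output layer."""
    sizes = [input_size] + [n] * d + [k]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


def dense(x, weights, biases):
    """Fully-connected layer: one weight row and one bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def dnn_forward(features, layers):
    """features: T x F matrix; layers: list of (weights, biases), last one is the output layer."""
    x = [v for row in features for v in row]  # flatten the feature matrix
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


# Tiny hand-written network: 2 inputs, one hidden layer of 2 neurons, 2 output classes.
layers = [([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
print(dnn_forward([[1.0, 2.0]], layers))
print(dnn_param_count(input_size=4, d=1, n=3, k=2))  # prints: 23
```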

Convolutional Neural Network (CNN)

One main drawback of DNN based KWS is that it fails to efficiently model the local temporal and spectral correlation in the speech features.

CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2-D convolution operations over it.

The convolution layers are typically followed by batch normalization, ReLU based activation functions and optional max/average pooling layers, which reduce the dimensionality of the features. During inference, the parameters of batch normalization can be folded into the weights of the convolution layers. In some cases, a linear low-rank layer, which is simply a fully-connected layer without non-linear activation, is added in between the convolution layers and dense layers for the purpose of reducing parameters and accelerating training.
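Folding batch normalization into the preceding convolution can be illustrated as follows. The sketch uses the standard batch-norm formula with per-output-channel parameters and treats the convolution at a single output position as a dot product; the function names and the `eps` default are assumptions for this example.

```python
import math


def fold_batch_norm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Return per-channel conv weights and biases with batch norm folded in.

    weights[c] is the flattened kernel of output channel c.
    """
    folded_w, folded_b = [], []
    for w, b, g, bt, mu, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([wi * scale for wi in w])
        folded_b.append((b - mu) * scale + bt)
    return folded_w, folded_b


def conv_at_position(patch, weights, biases):
    """Convolution output at one position: a dot product per output channel."""
    return [sum(wi * xi for wi, xi in zip(w, patch)) + b for w, b in zip(weights, biases)]


def batch_norm(values, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization applied per channel."""
    return [g * (x - mu) / math.sqrt(v + eps) + bt
            for x, g, bt, mu, v in zip(values, gamma, beta, mean, var)]


# Made-up parameters for two output channels and a two-element input patch.
weights, biases = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [1.0, 4.0]
patch = [3.0, -1.0]

reference = batch_norm(conv_at_position(patch, weights, biases), gamma, beta, mean, var)
fw, fb = fold_batch_norm(weights, biases, gamma, beta, mean, var)
folded = conv_at_position(patch, fw, fb)
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, folded)))  # prints: True
```

After folding, the batch-norm layer needs no separate parameters or operations at inference time, which matters under a tight memory and compute budget.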

Recurrent Neural Network (RNN)

RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition [20], language modeling [21], translation [22], etc. RNNs not only exploit the temporal relations within the input signal, but also capture the long-term dependencies using a "gating" mechanism.

Convolutional Recurrent Neural Network (CRNN)

The convolutional recurrent neural network is a hybrid of CNN and RNN, which takes advantage of both. It exploits the local temporal/spatial correlation using convolution layers and global temporal dependencies in the speech features using recurrent layers. As shown in Fig. 3, a CRNN model

Depthwise Separable Convolutional Neural Network (DS-CNN)

Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3-D convolution operation [29] and has been used to achieve compact network architectures in the area of computer vision
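To make the efficiency argument concrete, the sketch below compares the weight count and multiply-accumulate (MAC) count of a standard convolution with those of a depthwise convolution followed by a 1 × 1 pointwise convolution. Biases are ignored, and the layer dimensions are assumed values chosen only for illustration.

```python
def standard_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs of a standard convolution layer (biases ignored)."""
    params = k_h * k_w * c_in * c_out
    macs = params * out_h * out_w
    return params, macs


def ds_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k_h * k_w * c_in   # one k_h x k_w filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution mixing the channels
    params = depthwise + pointwise
    macs = params * out_h * out_w
    return params, macs


# Assumed layer: 3x3 kernel, 64 input and 64 output channels, 25x5 output map.
std_params, std_macs = standard_conv_cost(3, 3, 64, 64, 25, 5)
ds_params, ds_macs = ds_conv_cost(3, 3, 64, 64, 25, 5)
print(std_params, ds_params)  # prints: 36864 4672
print(ds_macs / std_macs)     # roughly 1/c_out + 1/(k_h * k_w)
```

For this assumed layer the separable form needs about one eighth of the weights and operations of the standard convolution.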

Experiments and Results

Conclusions

Hardware-optimized neural network architectures are key to getting efficient results on memory- and compute-constrained microcontrollers.

We trained various neural network architectures for keyword spotting published in the literature on the Google speech commands dataset to compare their accuracy and memory requirements vs. operations per inference, from the perspective of deployment on microcontroller systems.

We quantized representative trained 32-bit floating-point KWS models into 8-bit fixed-point versions, demonstrating that these models can easily be quantized for deployment without any loss in accuracy, even without retraining. Furthermore, we trained a new KWS model using depthwise separable convolution layers, inspired by MobileNet.
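The article does not describe the exact quantization scheme, so the following is only a sketch of one common approach: symmetric 8-bit fixed-point quantization with a power-of-two scale, where the number of fractional bits is chosen from the largest absolute value in a tensor. All function names are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick the number of fractional bits so the largest magnitude fits the signed range."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.ceil(math.log2(max_abs))  # bits needed for the integer part
    return total_bits - 1 - int_bits          # one bit is reserved for the sign


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round to nearest, and clip to the signed integer range."""
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [min(hi, max(lo, math.floor(v * scale + 0.5))) for v in values]


def dequantize(q_values, frac_bits):
    """Map fixed-point integers back to floating-point values."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in q_values]


weights = [0.5, -0.25, 0.9]           # made-up floating-point weights
frac_bits = choose_frac_bits(weights)
q = quantize(weights, frac_bits)
print(frac_bits, q)                   # prints: 7 [64, -32, 115]
print(dequantize(q, frac_bits))       # prints: [0.5, -0.25, 0.8984375]
```

A power-of-two scale means rescaling between layers reduces to bit shifts, which suits integer SIMD and MAC instructions on microcontroller cores.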

Based on typical microcontroller systems, we derived three sets of memory/compute constraints for the neural networks and performed resource-constrained neural network architecture exploration to find the best networks achieving maximum accuracy within these constraints.

In all three sets of memory/compute constraints, the depthwise separable CNN model (DS-CNN) achieves the best accuracies of 94.4%, 94.9% and 95.4% compared to the other model architectures within those constraints, which shows good scalability of the DS-CNN model.

The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

Originally published: 2022-12-29. This article is shared from the SmellLikeAISpirit WeChat official account. In case of infringement, please contact cloudcommunity@tencent.com for removal.
