💡💡💡 What this article covers: an analysis of YOLOv12's A2C2f and Area Attention innovations, and how to train the model on your own private dataset.
Paper: [2502.12524] YOLOv12: Attention-Centric Real-Time Object Detectors
Abstract:
Enhancing the network architecture of the YOLO framework has long been essential, but most improvements have focused on CNN-based designs, even though attention mechanisms have proven superior at modeling. This is because attention-based models have never matched the speed of CNN-based ones. This work proposes an attention-centric YOLO framework, YOLOv12, that matches the speed of previous CNN-based models while harnessing the performance benefits of attention.
YOLOv12 surpasses all popular real-time object detectors in accuracy at competitive inference speeds. Specifically, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLOv11-N by 2.1% / 1.2% mAP at comparable speed. This advantage holds across the other model scales as well. YOLOv12 also beats end-to-end real-time detectors derived from DETR: for example, YOLOv12-S outruns RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. See Figure 1 for more comparisons.
Architecture diagram:
This work aims to address these challenges and further builds an attention-centric YOLO framework, YOLOv12, with three key improvements.
First, a simple and efficient Area Attention module (A²) that reduces the computational complexity of attention in a very straightforward way while preserving a large receptive field, thereby improving speed.
Second, a Residual Efficient Layer Aggregation Network (R-ELAN) that tackles the optimization difficulties introduced by attention, especially in large-scale models. R-ELAN improves on the original ELAN in two ways: (i) a block-level residual design with a scaling technique, and (ii) a redesigned feature aggregation method.
Third, a set of architectural refinements to conventional attention to fit the YOLO system: adopting FlashAttention to address attention's memory-access bottleneck; removing designs such as positional encoding to make the model faster and cleaner; lowering the MLP ratio from 4 to 1.2 to balance compute between attention and the feed-forward network for better performance; reducing the depth of stacked blocks to ease optimization; and using convolution operations wherever possible for their computational efficiency.
In summary, YOLOv12's contributions are twofold: 1) it establishes an attention-centric, simple yet efficient YOLO framework that, through methodological innovation and architectural refinement, breaks the dominance of CNN models in the YOLO series; 2) without relying on extra techniques such as pretraining, YOLOv12 achieves state-of-the-art results with fast inference speed and higher detection accuracy, demonstrating its potential.
YOLOv12's Area Attention module (A2) divides the feature map into simple vertical or horizontal areas, reducing the computational complexity of the attention mechanism while maintaining a large receptive field.
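To make the partition trick concrete, here is a minimal, self-contained PyTorch sketch (illustrative only, not the ultralytics code): folding the area factor into the batch dimension means each attention call sees only N/area tokens, so the quadratic cost term shrinks from N^2 to area * (N/area)^2 = N^2/area.

import torch

B, N, C, area = 2, 1024, 64, 4            # N = H*W flattened tokens
tokens = torch.randn(B, N, C)

# Fold the area factor into the batch dimension: each of the B*area
# "sub-images" now attends over only N/area tokens.
local = tokens.reshape(B * area, N // area, C)

# Quadratic attention cost scales with (sequence length)^2 per batch item:
full_cost = B * N**2                      # global attention
area_cost = (B * area) * (N // area)**2   # area attention
print(full_cost // area_cost)             # 4 -> cost reduced by a factor of `area`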
The core source code is as follows.
Code location: ultralytics/nn/modules/block.py
class AAttn(nn.Module):
    """
    Area-attention module with the requirement of flash attention.

    Attributes:
        dim (int): Number of hidden channels;
        num_heads (int): Number of heads into which the attention mechanism is divided;
        area (int, optional): Number of areas the feature map is divided. Defaults to 1.

    Methods:
        forward: Performs a forward process of input tensor and outputs a tensor after the execution of the area attention mechanism.

    Examples:
        >>> import torch
        >>> from ultralytics.nn.modules import AAttn
        >>> model = AAttn(dim=64, num_heads=2, area=4)
        >>> x = torch.randn(2, 64, 128, 128)
        >>> output = model(x)
        >>> print(output.shape)

    Notes:
        recommend that dim//num_heads be a multiple of 32 or 64.
    """

    def __init__(self, dim, num_heads, area=1):
        """Initializes the area-attention module, a simple yet efficient attention module for YOLO."""
        super().__init__()
        self.area = area

        self.num_heads = num_heads
        self.head_dim = head_dim = dim // num_heads
        all_head_dim = head_dim * self.num_heads

        self.qkv = Conv(dim, all_head_dim * 3, 1, act=False)
        self.proj = Conv(all_head_dim, dim, 1, act=False)
        self.pe = Conv(all_head_dim, dim, 7, 1, 3, g=dim, act=False)

    def forward(self, x):
        """Processes the input tensor 'x' through the area-attention."""
        B, C, H, W = x.shape
        N = H * W

        qkv = self.qkv(x).flatten(2).transpose(1, 2)  # (B, N, 3C)

        if self.area > 1:
            # Fold the area factor into the batch dim: attention is computed per area.
            qkv = qkv.reshape(B * self.area, N // self.area, C * 3)
            B, N, _ = qkv.shape
        q, k, v = qkv.view(B, N, self.num_heads, self.head_dim * 3).split(
            [self.head_dim, self.head_dim, self.head_dim], dim=3
        )

        if x.is_cuda and USE_FLASH_ATTN:
            # FlashAttention path (runs in half precision).
            x = flash_attn_func(
                q.contiguous().half(),
                k.contiguous().half(),
                v.contiguous().half(),
            ).to(q.dtype)
        elif x.is_cuda and not USE_FLASH_ATTN:
            # PyTorch scaled_dot_product_attention fallback on GPU.
            x = sdpa(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3),
                     attn_mask=None, dropout_p=0.0, is_causal=False)
            x = x.permute(0, 2, 1, 3)
        else:
            # CPU fallback: manual, numerically stable softmax attention.
            q = q.permute(0, 2, 3, 1)
            k = k.permute(0, 2, 3, 1)
            v = v.permute(0, 2, 3, 1)
            attn = (q.transpose(-2, -1) @ k) * (self.head_dim ** -0.5)
            max_attn = attn.max(dim=-1, keepdim=True).values
            exp_attn = torch.exp(attn - max_attn)
            attn = exp_attn / exp_attn.sum(dim=-1, keepdim=True)
            x = v @ attn.transpose(-2, -1)
            x = x.permute(0, 3, 1, 2)
            v = v.permute(0, 3, 1, 2)

        if self.area > 1:
            # Unfold the area factor back out of the batch dim.
            x = x.reshape(B // self.area, N * self.area, C)
            v = v.reshape(B // self.area, N * self.area, C)
            B, N, _ = x.shape

        x = x.reshape(B, H, W, C).permute(0, 3, 1, 2)
        v = v.reshape(B, H, W, C).permute(0, 3, 1, 2)

        x = x + self.pe(v)  # add convolutional position encoding computed from v
        x = self.proj(x)
        return x
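A quick shape check of AAttn (a sketch assuming an ultralytics release that ships YOLOv12, where AAttn lives in ultralytics.nn.modules.block; on a CPU-only machine this exercises the manual-softmax fallback branch):

import torch
from ultralytics.nn.modules.block import AAttn  # assumes a YOLO12-era ultralytics release

m = AAttn(dim=64, num_heads=2, area=4)
x = torch.randn(2, 64, 32, 32)  # H*W = 1024 must be divisible by `area`
print(m(x).shape)               # torch.Size([2, 64, 32, 32]) -- spatial shape is preserved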
The A2C2f module ("Area-Attention Enhanced Cross-Feature module") is an improved feature-extraction module introduced in YOLOv12. It combines area attention with residual connections and is mainly intended to improve the efficiency and accuracy of feature extraction.
The A2C2f module is built from the following key components.
Code location: ultralytics/nn/modules/block.py
class ABlock(nn.Module):
    """
    ABlock class implementing an Area-Attention block with effective feature extraction.

    This class encapsulates the functionality for applying multi-head attention, with the feature map divided into
    areas, followed by feed-forward neural network layers.

    Attributes:
        dim (int): Number of hidden channels;
        num_heads (int): Number of heads into which the attention mechanism is divided;
        mlp_ratio (float, optional): MLP expansion ratio (or MLP hidden dimension ratio). Defaults to 1.2;
        area (int, optional): Number of areas the feature map is divided. Defaults to 1.

    Methods:
        forward: Performs a forward pass through the ABlock, applying area-attention and feed-forward layers.

    Examples:
        Create an ABlock and perform a forward pass
        >>> model = ABlock(dim=64, num_heads=2, mlp_ratio=1.2, area=4)
        >>> x = torch.randn(2, 64, 128, 128)
        >>> output = model(x)
        >>> print(output.shape)

    Notes:
        recommend that dim//num_heads be a multiple of 32 or 64.
    """

    def __init__(self, dim, num_heads, mlp_ratio=1.2, area=1):
        """Initializes the ABlock with area-attention and feed-forward layers for faster feature extraction."""
        super().__init__()

        self.attn = AAttn(dim, num_heads=num_heads, area=area)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(Conv(dim, mlp_hidden_dim, 1), Conv(mlp_hidden_dim, dim, 1, act=False))

        self.apply(self._init_weights)

    def _init_weights(self, m):
        """Initialize weights using a truncated normal distribution."""
        if isinstance(m, nn.Conv2d):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        """Executes a forward pass through ABlock, applying area-attention and feed-forward layers to the input tensor."""
        x = x + self.attn(x)  # residual area-attention (token mixing)
        x = x + self.mlp(x)   # residual 1x1-conv MLP (channel mixing)
        return x
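Like AAttn, ABlock preserves the input shape, so blocks can be stacked freely. Note the transformer-style structure: a residual attention step followed by a residual 1x1-conv MLP with the paper's reduced ratio of 1.2. A quick check under the same import assumption as above:

import torch
from ultralytics.nn.modules.block import ABlock  # same version assumption as AAttn above

blk = ABlock(dim=64, num_heads=2, mlp_ratio=1.2, area=4)
x = torch.randn(2, 64, 32, 32)
print(blk(x).shape)  # torch.Size([2, 64, 32, 32])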
class A2C2f(nn.Module):
    """
    A2C2f module with residual enhanced feature extraction using ABlock blocks with area-attention. Also known as R-ELAN.

    This class extends the C2f module by incorporating ABlock blocks for fast attention mechanisms and feature extraction.

    Attributes:
        c1 (int): Number of input channels;
        c2 (int): Number of output channels;
        n (int, optional): Number of 2xABlock modules to stack. Defaults to 1;
        a2 (bool, optional): Whether use area-attention. Defaults to True;
        area (int, optional): Number of areas the feature map is divided. Defaults to 1;
        residual (bool, optional): Whether use the residual (with layer scale). Defaults to False;
        mlp_ratio (float, optional): MLP expansion ratio (or MLP hidden dimension ratio). Defaults to 2.0;
        e (float, optional): Expansion ratio for R-ELAN modules. Defaults to 0.5;
        g (int, optional): Number of groups for grouped convolution. Defaults to 1;
        shortcut (bool, optional): Whether to use shortcut connection. Defaults to True;

    Methods:
        forward: Performs a forward pass through the A2C2f module.

    Examples:
        >>> import torch
        >>> from ultralytics.nn.modules import A2C2f
        >>> model = A2C2f(c1=64, c2=64, n=2, a2=True, area=4, residual=True, e=0.5)
        >>> x = torch.randn(2, 64, 128, 128)
        >>> output = model(x)
        >>> print(output.shape)
    """

    def __init__(self, c1, c2, n=1, a2=True, area=1, residual=False, mlp_ratio=2.0, e=0.5, g=1, shortcut=True):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        assert c_ % 32 == 0, "Dimension of ABlock must be a multiple of 32."

        # num_heads = c_ // 64 if c_ // 64 >= 2 else c_ // 32
        num_heads = c_ // 32

        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv((1 + n) * c_, c2, 1)  # optional act=FReLU(c2)

        init_values = 0.01  # or smaller
        self.gamma = nn.Parameter(init_values * torch.ones((c2)), requires_grad=True) if a2 and residual else None

        self.m = nn.ModuleList(
            nn.Sequential(*(ABlock(c_, num_heads, mlp_ratio, area) for _ in range(2)))
            if a2 else C3k(c_, c_, 2, shortcut, g)
            for _ in range(n)
        )

    def forward(self, x):
        """Forward pass through R-ELAN layer."""
        y = [self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        if self.gamma is not None:
            # Scaled residual connection (layer scale) around the whole module.
            return x + self.gamma.view(1, -1, 1, 1) * self.cv2(torch.cat(y, 1))
        return self.cv2(torch.cat(y, 1))
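When a2=True and residual=True, A2C2f wraps its entire output in a learnable per-channel scale gamma (initialized to 0.01), i.e. out = x + gamma * f(x); this is the block-level "residual design with scaling" of R-ELAN. A standalone sketch of the layer-scale idea (illustrative, not library code):

import torch
import torch.nn as nn

class LayerScaleResidual(nn.Module):
    """out = x + gamma * f(x), with gamma a small learnable per-channel scale."""

    def __init__(self, fn, channels, init_value=0.01):
        super().__init__()
        self.fn = fn
        self.gamma = nn.Parameter(init_value * torch.ones(channels))

    def forward(self, x):  # x: (B, C, H, W)
        return x + self.gamma.view(1, -1, 1, 1) * self.fn(x)

block = LayerScaleResidual(nn.Conv2d(64, 64, 3, padding=1), channels=64)
print(block(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])

Starting gamma near zero lets the block behave almost like an identity mapping early in training, which is what eases optimization for deeper attention stacks.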
Next, training on a private dataset. NEU-DET is a steel surface defect dataset with six defect classes and 1,800 images in total.
The classes are: 'crazing', 'inclusion', 'patches', 'pitted_surface', 'rolled-in_scale', 'scratches'.
Dataset download:
https://download.csdn.net/download/m0_63774211/89846379?spm=1001.2014.3001.5503
Label visualization:
The dataset configuration file (data/NEU-DET.yaml):

path: D:/ultralytics-main/data/NEU-DET  # dataset root dir
train: train.txt  # train images (relative to 'path')
val: val.txt  # val images (relative to 'path')

# number of classes
nc: 6

# class names
names:
  0: crazing
  1: inclusion
  2: patches
  3: pitted_surface
  4: rolled-in_scale
  5: scratches
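The train.txt / val.txt entries are image paths relative to path. If your copy of NEU-DET only has images split into train/val folders, a script along these lines can generate the lists (the images/train and images/val layout here is an assumption; adjust it to your actual folder structure):

from pathlib import Path

root = Path("D:/ultralytics-main/data/NEU-DET")  # dataset root from the yaml above
for split in ("train", "val"):
    imgs = sorted((root / "images" / split).glob("*.jpg"))  # hypothetical layout
    (root / f"{split}.txt").write_text("\n".join(str(p) for p in imgs))

With the yaml and file lists in place, the training script below uses the standard ultralytics API: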
import warnings
warnings.filterwarnings('ignore')

from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('ultralytics/cfg/models/v12/yolov12n.yaml')
    # model.load('yolo12n.pt')  # load pretrained weights
    model.train(data='data/NEU-DET.yaml',
                cache=False,
                imgsz=640,
                epochs=200,
                batch=16,
                close_mosaic=10,
                device='0',
                optimizer='SGD',  # using SGD
                project='runs/train',
                name='exp',
                )
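After training finishes, the metrics below can be reproduced by validating the best checkpoint (a sketch; the exact weights path depends on the project/name arguments above):

from ultralytics import YOLO

model = YOLO('runs/train/exp/weights/best.pt')  # produced by the training run above
metrics = model.val(data='data/NEU-DET.yaml', imgsz=640, batch=16)
print(metrics.box.map50)  # overall mAP50, ~0.763 for the baseline run below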
The baseline YOLOv12 achieves an mAP50 of 0.763:
YOLOv12n summary (fused): 352 layers, 2,557,898 parameters, 0 gradients, 6.3 GFLOPs
                Class    Images  Instances     Box(P         R     mAP50  mAP50-95): 100%|██████████| 11/11 [00:11<00:00,  1.04s/it]
                  all       324        747     0.718     0.714     0.763     0.435
              crazing        47        104     0.497     0.433     0.431     0.178
            inclusion        71        190     0.741     0.721     0.802     0.434
              patches        59        149     0.826     0.926     0.942     0.641
       pitted_surface        61         93     0.789     0.645     0.76      0.467
      rolled-in_scale        56        117     0.656     0.624     0.709     0.334
            scratches        54         94     0.8       0.936     0.935     0.556
Prediction results:
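To reproduce such predictions, run the trained model on new images with the same API (a sketch; the source path is illustrative):

from ultralytics import YOLO

model = YOLO('runs/train/exp/weights/best.pt')
results = model.predict(source='data/NEU-DET/images/val', imgsz=640, conf=0.25, save=True)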
Original article:
https://blog.csdn.net/m0_63774211/article/details/145771893
Original-content notice: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
If you believe it infringes your rights, please contact cloudcommunity@tencent.com for removal.