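Suppose the input feature map has shape [batch, modal_leng, patch_num, input_dim] and the window size is window_size; the number of windows is then patch_num // window_size (which assumes patch_num is divisible by window_size)... Specifically, the windows in the next layer are shifted toward the lower right by half a window (that is, by window_size // 2) relative to the previous layer... Overall architecture: Swin Transformer consists of multiple stages, each containing several Swin Transformer Blocks and a Patch Merging layer... reshaped into multi-head form ([batch, modal_leng, window_num, window_size, head_num, att_size]). permute is then used to reorder the dimensions so that the subsequent matrix multiplication is convenient... Positional encoding: x.permute(0, 3, 2, 1) swaps dimensions 1 and 3, so that the shape becomes [batch, modal_leng, patch_num, embedding_dim].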
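The window count and the half-window shift described above can be illustrated with a minimal sketch that works on plain patch indices along the patch_num axis, so it needs no tensor library. The helper names `partition_windows` and `cyclic_shift` are hypothetical and do not come from the original code. The sketch also assumes the shift is realized as a cyclic roll, which is how the reference Swin Transformer implementation does it; the source text does not say which mechanism this code uses.

```python
def partition_windows(patch_indices, window_size):
    """Split a list of patch indices into non-overlapping windows.

    Hypothetical helper for illustration; assumes the number of patches
    is divisible by window_size.
    """
    if len(patch_indices) % window_size != 0:
        raise ValueError("patch_num must be divisible by window_size")
    window_num = len(patch_indices) // window_size
    return [
        patch_indices[i * window_size:(i + 1) * window_size]
        for i in range(window_num)
    ]


def cyclic_shift(patch_indices, shift):
    """Roll the patch sequence left by `shift` positions (assumed mechanism).

    After re-partitioning, each window covers patches that sit `shift`
    positions further along the axis than in the regular layout.
    """
    shift %= len(patch_indices)
    return patch_indices[shift:] + patch_indices[:shift]


patch_num, window_size = 8, 4
patches = list(range(patch_num))

# Regular layer: patch_num // window_size = 2 windows.
regular = partition_windows(patches, window_size)

# Shifted layer: roll by half a window, then partition the same way.
shifted = partition_windows(cyclic_shift(patches, window_size // 2), window_size)

print(regular)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(shifted)  # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

In the shifted layout the first window now spans patches 2 to 5, which straddles the boundary between the two regular windows; this is what lets information flow across windows in alternating layers. The last window wraps around (patches 6, 7, 0, 1), which is why implementations based on a cyclic roll usually pair it with an attention mask.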