How to Systematically Analyze the Errors of Object Detection Models

AI算法与图像处理

Published on 2022-12-11 03:11:39

This article is included in the column: AI算法与图像处理

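Hi everyone, this is A Pan.

This post shares a systematic, data-driven approach to understanding what is holding back a model's performance.

Figure 1

Object detection in the real world is challenging, for the following reasons:

Lack of data is often the limiting factor. People tend to spend a lot of time on design decisions such as choosing a model architecture and tuning hyperparameters. In practice, however, the simplest way to improve performance on a real problem is usually data-driven: find the areas where the model struggles and collect additional data to improve performance there.
Dataset labels are often inconsistent. Building (or finding) a high-quality object detection dataset is very hard. Consider Figure 1: if the image were shown to two different people who were asked to label the objects present (that is, to add boxes and labels), the results would almost certainly differ. The position and size of the bounding boxes may vary somewhat and, depending on how challenging the image and the use case are, the annotators may add different boxes. On top of that, annotation is tedious, which makes errors and inconsistencies more likely over long sessions.
Standard metrics are hard to interpret. Mean Average Precision (mAP), the go-to metric for evaluating object detectors, is not intuitive and, unlike accuracy, precision, or recall in classification problems, makes it hard to get an accurate sense of how the model is doing. In fact, it does not help in finding the areas where the model performs poorly, let alone in designing a strategy to improve the situation.

In short, we usually have a less-than-ideal dataset, metrics that are hard to interpret, and a lack of tools for spotting problems in the dataset. Together, these factors make it hard to build intuition about the problem at hand, and it is often unclear how to follow a systematic, iterative approach to improving model performance.

While looking for tools to address this, I came across the paper TIDE: A General Toolbox for Identifying Object Detection Errors, which introduces one such method. The main idea is that, first, every bounding box predicted by the model is assigned to an error category (or deemed correct). Then, the negative impact of each error category on mAP is computed. This gives a measure of how much each error type matters and helps focus on the errors that hurt performance the most.

Paper: https://arxiv.org/abs/2008.08115

Although the paper comes with an implementation on GitHub, it has two main drawbacks:

It only provides the final mAP impact results, with no easy way to get the error classification of each prediction. In my view, being able to inspect the different categories and see which error type each prediction represents is very valuable. This information helps build good intuition about the problem.
Some important details of the implementation are not entirely clear from the paper, and I found them hard to understand from the available code without intensive debugging sessions.
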
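What is error analysis?

Before going further, it is important to clarify that error analysis and model evaluation are different things. Evaluation consists of obtaining a single metric that summarizes whether the model performs well overall. Error analysis, in contrast, can be thought of as debugging a machine learning system: inspecting the model's outputs and comparing them with the ground truth, which ultimately helps build intuition about the problem. It requires a deep look at both the data and the model, and often involves going through samples and predictions one by one.

Moreover, even when a model performs well, there may be samples it consistently struggles with (for example, wrong predictions affecting a minority that is barely represented in the training set), and for real-world systems it is important to know whether these could become a problem once the model is deployed. Error analysis is the process that helps you understand this.

For more on error analysis, these are great resources to dig deeper:

Deep Dive into Error Analysis and Model Debugging in Machine Learning (and Deep Learning): https://neptune.ai/blog/deep-dive-into-error-analysis-and-model-debugging-in-machine-learning-and-deep-learning
Responsible Machine Learning with Error Analysis: https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/responsible-machine-learning-with-error-analysis/ba-p/2141774

Now, let's use an example to illustrate a systematic approach that can typically be followed when working on an object detection problem.

Data

The first thing we need is data to use as an example. For this, we will use the MS COCO 2017 validation set, one of the most popular benchmark datasets for object detection. We first download it to our working folder, then load it and format it into convenient pandas DataFrames.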

# Download images and annotations
!curl http://images.cocodataset.org/zips/val2017.zip --output coco_valid.zip
!curl http://images.cocodataset.org/annotations/annotations_trainval2017.zip --output coco_valid_anns.zip
# Unzip images into coco_val2017/images
!mkdir coco_val2017/
!unzip -q coco_valid.zip -d coco_val2017/
!mv -f coco_val2017/val2017 coco_val2017/images
# Unzip and keep only valid annotations as coco_val2017/annotations.json
!unzip -q coco_valid_anns.zip -d coco_val2017
!mv -f coco_val2017/annotations/instances_val2017.json coco_val2017/annotations.json
!rm -rf coco_val2017/annotations
# Remove zip files downloaded
!rm -f coco_valid.zip
!rm -f coco_valid_anns.zip

# Copyright © 2022 Bernat Puig Camps
import json
from pathlib import Path
from typing import Tuple

import pandas as pd

DATA_PATH = Path("./coco_val2017")


def load_dataset(
    data_path: Path = DATA_PATH,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Read the COCO style json dataset and transform it into convenient DataFrames
    :return (images_df, targets_df):
        images_df: Columns "image_id" and "file_name"
        targets_df: Columns
            "target_id", "image_id", "xmin", "ymin", "xmax", "ymax", "label_id"
    """
    annotations_path = data_path / "annotations.json"

    with open(annotations_path, "r") as f:
        targets_json = json.load(f)

    images_df = pd.DataFrame.from_records(targets_json["images"])
    images_df.rename(columns={"id": "image_id"}, inplace=True)
    images_df = images_df[["image_id", "file_name"]]

    targets_df = pd.DataFrame.from_records(targets_json["annotations"])
    targets_df[["xmin", "ymin", "w", "h"]] = targets_df["bbox"].tolist()
    targets_df["xmax"] = targets_df["xmin"] + targets_df["w"]
    targets_df["ymax"] = targets_df["ymin"] + targets_df["h"]
    targets_df.reset_index(inplace=True)
    targets_df.rename(
        columns={"index": "target_id", "category_id": "label_id"}, inplace=True
    )
    targets_df = targets_df[
        ["target_id", "image_id", "label_id", "xmin", "ymin", "xmax", "ymax"]
    ]

    return images_df, targets_df

images_df, targets_df = load_dataset()

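Now, let's look at a few samples from the dataset to get a sense of what they look like.

Figure 2: Sample images from the dataset

Model

As mentioned, we want to use the predictions of a trained model to understand its shortcomings. For convenience and simplicity, we will use a model pretrained on the COCO dataset. This way we can skip training entirely (it is not the focus of this article) and the model will simply work out of the box.

Although there are many architectures, we will use Faster-RCNN with a ResNet50 backbone. The main reasons for this choice are that this architecture tends to perform quite well and that a PyTorch implementation is readily available.

To do any error analysis, we first need the model's predictions on the dataset. In addition, we will save the model loss for each sample (we will see why later in the article). Let's define a function that does this for us and stores the results in a pandas DataFrame, as we did for the targets.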

# Copyright © 2022 Bernat Puig Camps
from pathlib import Path

import pandas as pd
from PIL import Image
import torch
import torchvision


def get_predictions(
    images_path: Path, images_df: pd.DataFrame, targets_df: pd.DataFrame
):
    """Get predictions and losses of `model` for all images in `images_df`
    :param images_path: Directory that contains the image files.
    :param images_df: DataFrame with images.
    :param targets_df: DataFrame with ground truth target for images.
    :return preds_df: DataFrame with columns
        [
            "pred_id", "image_id", "image_loss", "label_id", "score",
            "xmin", "ymin", "xmax", "ymax"
        ]
    """
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=True
    )

    device = (
        torch.device("cuda:0")
        if torch.cuda.is_available()
        else torch.device("cpu")
    )
    model = model.to(device)

    preds_dfs = []
    for sample in images_df.itertuples():
        # We iterate over single samples (batch size of 1) because we need one
        # loss per image and PyTorch Faster-RCNN outputs losses per batch,
        # not sample.
        t_df = targets_df.query("image_id == @sample.image_id")

        image = Image.open(images_path / sample.file_name).convert("RGB")
        image_tensor = torchvision.transforms.ToTensor()(image).to(device)

        bboxes = t_df[["xmin", "ymin", "xmax", "ymax"]].values
        labels = t_df["label_id"].values
        if bboxes.size == 0:
            # PyTorch Faster-RCNN expects targets to be tensors that fulfill
            # len(boxes.shape) == 2 & boxes.shape[-1] == 4
            bboxes = torch.empty(0, 4)

        targets = {
            "boxes": torch.as_tensor(bboxes, dtype=torch.float32).to(device),
            "labels": torch.as_tensor(labels, dtype=torch.int64).to(device),
        }
        with torch.no_grad():
            # Faster-RCNN outputs losses only when train mode
            model.train()
            losses = model([image_tensor], [targets])
            # Faster-RCNN outputs predictions only when eval mode
            model.eval()
            preds = model([image_tensor])
        # Unify all sublosses into one (this is just one way of doing it)
        loss = sum(losses.values()).item()

        preds_dfs.append(
            pd.DataFrame(
                {
                    "image_id": sample.image_id,
                    "image_loss": loss,
                    "label_id": preds[0]["labels"].to("cpu"),
                    "score": preds[0]["scores"].to("cpu"),
                    "xmin": preds[0]["boxes"][:, 0].to("cpu"),
                    "ymin": preds[0]["boxes"][:, 1].to("cpu"),
                    "xmax": preds[0]["boxes"][:, 2].to("cpu"),
                    "ymax": preds[0]["boxes"][:, 3].to("cpu"),
                }
            )
        )

    preds_df = pd.concat(preds_dfs, ignore_index=True)
    preds_df = preds_df.reset_index().rename(columns={"index": "pred_id"})
    return preds_df[
        [
            "pred_id",
            "image_id",
            "label_id",
            "xmin",
            "ymin",
            "xmax",
            "ymax",
            "score",
            "image_loss",
        ]
    ]

# Images were unzipped into DATA_PATH / "images" in the download step
preds_df = get_predictions(DATA_PATH / "images", images_df, targets_df)

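Inspecting the loss

Before diving into the TIDE analysis, there is another tool that can prove helpful for error analysis: the model loss. The loss is designed to measure how good a prediction is. Therefore, the images with the highest loss are the ones the model finds hardest to predict. We can visualize them to try to understand what is going on. This approach is not unique to object detection: it can be used with any model that outputs a per-sample loss.

Before visualizing any image, it is worth checking the distribution of the loss. In general, we expect most images to have a relatively low loss and a few of them to have much higher values. If that is not the case and all samples show roughly the same value, looking at the highest ones makes little sense, because they are just the result of small variations around the mean.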
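A minimal sketch of such a check, written in plain Python so that it runs on its own. The function name `summarize_losses` and the sample values are illustrative assumptions, not part of the original code; with the `preds_df` built earlier, the per-image losses could be obtained by keeping one `image_loss` value per `image_id`.

```python
import statistics
from typing import Dict, List


def summarize_losses(losses: List[float]) -> Dict[str, float]:
    """Summarize per-image losses to judge whether the tail is worth inspecting."""
    ordered = sorted(losses)
    median = statistics.median(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": median,
        "max": ordered[-1],
        # A ratio far above 1 hints at a long right tail: a few images are
        # much harder than the typical one.
        "max_over_median": ordered[-1] / median if median > 0 else float("inf"),
    }


# Hypothetical per-image losses (not taken from the COCO run in this article)
example_losses = [0.1, 0.2, 0.2, 0.3, 0.6, 0.8, 5.5]
summary = summarize_losses(example_losses)
print(summary)
```

If the maximum sits close to the median, the "hardest" images are probably not meaningfully harder than the rest, and inspecting them is unlikely to be informative.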

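Figure 3: Loss distribution over the processed images

From the plot, we can confirm that the loss of most images is indeed below 1 and that the distribution is right-skewed, with some samples reaching a loss of almost 6 (barely visible in the plot). We also see a large peak close to 0 in the histogram and a large number of samples with a loss between 0.5 and 1. So, let's visualize the image with the highest loss, a sample from the peak near 0, and a sample with an average loss to understand the differences. Note that I manually picked images from a similar domain to simplify the reasoning, but the dataset covers a wide variety of domains.

Figure 4: Image with the highest loss

Figure 5: Image with a medium loss

Figure 6: Image with a low loss

For the image with the highest loss, we see two main issues:

There are many birds in the distance, probably flamingos. Their targets are inconsistent: a small fraction of them are labeled individually, while all of them are also covered by one large box. The model fails to find any of them. In fact, several of the images with very high loss show a similar situation: small, distant birds that the model cannot find.
There is an animal with no bounding box. This probably means that COCO has no category for it, so it was not labeled. Nonetheless, our model does recognize it as an animal and, for lack of a better option, labels it as a zebra, a horse, or a cow.

If our domain were identifying animals in the savanna, this image would suggest that making sure birds are labeled consistently and that we have boxes for all possible animals could lead to improvements. However, this is a hypothesis that should be explored further with more samples.

For the image with an average loss, we see correct predictions for most or all targets. The problem is the extra boxes that should not be there. This holds not only for the selected example but for most images with a loss in the 0.5 to 1 range. Based on this image, one potential way to improve performance is to get rid of the extra boxes by removing low-scoring predictions, performing box fusion, or applying non-maximum suppression.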
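Two of these ideas, score filtering and non-maximum suppression (NMS), can be sketched in a few lines. This is a plain-Python illustration rather than code from the article: `filter_and_nms`, the `(xmin, ymin, xmax, ymax, score)` tuple layout, and the default thresholds are all assumptions. The sketch ignores labels, so suppression is class-agnostic. In a PyTorch pipeline, `torchvision.ops.nms` performs the suppression step on tensors.

```python
from typing import List, Tuple

# (xmin, ymin, xmax, ymax, score)
ScoredBox = Tuple[float, float, float, float, float]


def _iou(a: ScoredBox, b: ScoredBox) -> float:
    """Intersection over Union of two boxes (the score entry is ignored)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - intersection
    )
    return intersection / union if union > 0 else 0.0


def filter_and_nms(
    boxes: List[ScoredBox],
    min_score: float = 0.5,
    iou_threshold: float = 0.5,
) -> List[ScoredBox]:
    """Drop low-scoring boxes, then greedily suppress overlapping ones."""
    candidates = sorted(
        (box for box in boxes if box[4] >= min_score),
        key=lambda box: box[4],
        reverse=True,
    )
    kept: List[ScoredBox] = []
    for box in candidates:
        # Keep a box only if it does not overlap too much with a better one
        if all(_iou(box, better) <= iou_threshold for better in kept):
            kept.append(box)
    return kept


example_boxes = [
    (0, 0, 10, 10, 0.9),
    (1, 1, 11, 11, 0.8),    # overlaps heavily with the first box
    (20, 20, 30, 30, 0.7),
    (50, 50, 60, 60, 0.2),  # below the score threshold
]
print(filter_and_nms(example_boxes))
```

Both thresholds trade false positives against recall, so they are best chosen by re-running the evaluation rather than by eye.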

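Finally, for the low-loss image, the model does an almost perfect job. In fact, it detects an extra bird that is not labeled in the ground truth but is indeed there!

Although this is only a limited example, it is easy to see that the loss is already a great tool for formulating hypotheses and pointing our efforts toward the most problematic samples. Above all, these samples usually provide valuable information about the problem, the model, and the dataset.

Error classification

Now, let's finally see how TIDE works and how we can use it for error analysis. Although I strongly recommend reading the paper for a deeper understanding, my goal is to give enough context here for you to use the tool successfully in your projects.

As mentioned above, TIDE either assigns each output prediction to an error category or deems it correct. To do so, it needs a mechanism to try to match each prediction with the target (bounding box) it was presumably trying to predict. The matching is based on Intersection over Union (IoU): the area of the overlap between two bounding boxes divided by the area of their union.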
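For reference, the IoU of two axis-aligned boxes is the area of their intersection divided by the area of their union. A small plain-Python sketch follows; the function name `iou_of_boxes` and the `(xmin, ymin, xmax, ymax)` tuple layout are assumptions chosen to mirror the column names used in this article.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou_of_boxes(a: Box, b: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes."""
    # Width and height of the overlap; zero when the boxes do not intersect
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 4 + 4 - 1 = 7
print(iou_of_boxes((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the boxes coincide, and 0 means they do not overlap at all.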

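Using two IoU thresholds, a foreground threshold (Tf) and a background threshold (Tb), we can define the following error types (explained in more detail in Section 2.2 of the TIDE paper):

Classification error (CLS): IoU >= Tf with a target of an incorrect class (i.e., localized correctly but classified incorrectly).
Localization error (LOC): Tb <= IoU < Tf with a target of the correct class (i.e., classified correctly but localized incorrectly).
Both Cls and Loc error (CLS & LOC): Tb <= IoU < Tf with a target of an incorrect class (i.e., classified and localized incorrectly).
Duplicate detection error (DUP): IoU >= Tf with a target of the correct class, but another higher-scoring detection has already matched that target (i.e., it would be correct if not for the higher-scoring detection).
Background error (BKG): IoU < Tb with all targets (i.e., background detected as foreground).
Missed target error (MISS): all undetected targets (false negatives) not already covered by classification or localization errors.

The figure below (taken from the paper) illustrates the different error types:

One detail I found unclear when reading the paper is that the algorithm assumes the model is doing the right thing as far as possible. To that end, it first tries to match each prediction to a target with the same label. This means, for instance, that a prediction will be matched to a target as a localization error before it is matched as a classification error (even if its IoU with the LOC target is lower).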
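The matching order can be condensed into a small decision function for a single prediction. This is a simplified sketch, assuming the best IoU against same-label targets and against all targets has already been computed; the function name, its arguments, and the example threshold values are illustrative assumptions.

```python
def classify_prediction(
    best_iou_same_label: float,
    best_iou_any_label: float,
    target_already_matched: bool,
    t_b: float = 0.1,  # background threshold Tb (example value)
    t_f: float = 0.5,  # foreground threshold Tf (example value)
) -> str:
    """Return the error type of one prediction, checking same-label targets first."""
    # 1) Give the model the benefit of the doubt: targets with the same label
    if best_iou_same_label >= t_f:
        return "duplicate" if target_already_matched else "correct"
    if best_iou_same_label >= t_b:
        return "localization"

    # 2) Only then look at targets regardless of their label
    if best_iou_any_label < t_b:
        return "background"
    if best_iou_any_label >= t_f:
        return "classification"
    return "cls & loc"


# Overlaps a wrong-label target at 0.8 and a same-label target at 0.3:
# the same-label match wins, so this is a localization error.
print(classify_prediction(0.3, 0.8, target_already_matched=False))
```

Because predictions are processed from highest to lowest score, `target_already_matched` is true exactly when a higher-scoring prediction has already claimed the target.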

Let's define some functions that classify these error types from the predictions and annotations in our DataFrames.

# Copyright © 2022 Bernat Puig Camps
from typing import Dict, Set, Tuple

import numpy as np
import pandas as pd
import torch
import torchvision

TARGETS_DF_COLUMNS = [
    "target_id",
    "image_id",
    "label_id",
    "xmin",
    "ymin",
    "xmax",
    "ymax",
]
PREDS_DF_COLUMNS = [
    "pred_id",
    "image_id",
    "label_id",
    "xmin",
    "ymin",
    "xmax",
    "ymax",
    "score",
]
ERRORS_DF_COLUMNS = ["pred_id", "target_id", "error_type"]

BACKGROUND_IOU_THRESHOLD = 0.1
FOREGROUND_IOU_THRESHOLD = 0.5


class ErrorType:
    OK = "correct"  # pred -> IoU >= foreground; target_label == pred_label; highest score
    CLS = "classification"  # pred -> IoU >= foreground; target_label != pred_label
    LOC = "localization"  # pred -> background <= IoU < foreground; target_label == pred_label
    CLS_LOC = "cls & loc"  # pred -> background <= IoU < foreground; target_label != pred_label
    DUP = "duplicate"  # pred -> IoU >= foreground; target_label == pred_label; not highest score
    BKG = "background"  # pred -> IoU < background with every target
    MISS = "missed"  # target -> no pred matched to it (CLS & LOC and BKG preds are never matched)


def classify_predictions_errors(
    targets_df: pd.DataFrame,
    preds_df: pd.DataFrame,
    iou_background: float = BACKGROUND_IOU_THRESHOLD,
    iou_foreground: float = FOREGROUND_IOU_THRESHOLD,
) -> pd.DataFrame:
    """Classify predictions
    We assume model is right as much as possible. Thus, in case of doubt
    (i.e matching two targets), a prediction will be first considered
    ErrorType.LOC before ErrorType.CLS.
    The error definition credit belongs to the following paper (refer to it for
    conceptual details):
        TIDE: A General Toolbox for Identifying Object Detection Errors
        https://arxiv.org/abs/2008.08115
    :param targets_df: DataFrame with all targets for all images with TARGETS_DF_COLUMNS.
    :param preds_df: DataFrame with all predictions for all images with PREDS_DF_COLUMNS.
    :param iou_background: Minimum IoU for a prediction not to be considered background.
    :param iou_foreground: Minimum IoU for a prediction to be considered foreground.
    :return errors_df: DataFrame with all error information with ERRORS_DF_COLUMNS
    """

    # Provide clarity on expectations and avoid confusing errors down the line
    assert (set(TARGETS_DF_COLUMNS) - set(targets_df.columns)) == set()
    assert (set(PREDS_DF_COLUMNS) - set(preds_df.columns)) == set()

    pred2error = dict()  # {pred_id: ErrorType}
    target2pred = (
        dict()
    )  # {target_id: pred_id}, require iou > iou_foreground & max score
    pred2target = dict()  # {pred_id: target_id}, require iou >= iou_background
    missed_targets = set()  # {target_id}

    # Higher scoring preds take precedence when multiple fulfill criteria
    preds_df = preds_df.sort_values(by="score", ascending=False)

    for image_id, im_preds_df in preds_df.groupby("image_id"):
        # Need to reset index to access dfs with same idx we access
        #   IoU matrix down the line
        im_targets_df = targets_df.query("image_id == @image_id").reset_index(
            drop=True
        )
        im_preds_df = im_preds_df.reset_index(drop=True)

        if im_targets_df.empty:
            pred2error = {**pred2error, **_process_empty_image(im_preds_df)}
        else:
            iou_matrix, iou_label_match_matrix = _compute_iou_matrices(
                im_targets_df, im_preds_df
            )

            # Iterate over all predictions. Higher scores first
            for pred_idx in range(len(im_preds_df)):
                match_found = _match_pred_to_target_with_same_label(
                    pred_idx,
                    pred2error,
                    pred2target,
                    target2pred,
                    iou_label_match_matrix,
                    im_targets_df,
                    im_preds_df,
                    iou_background,
                    iou_foreground,
                )
                if match_found:
                    continue

                _match_pred_wrong_label_or_background(
                    pred_idx,
                    pred2error,
                    pred2target,
                    iou_matrix,
                    im_targets_df,
                    im_preds_df,
                    iou_background,
                    iou_foreground,
                )

    missed_targets = _find_missed_targets(targets_df, pred2target)
    errors_df = _format_errors_as_dataframe(
        pred2error, pred2target, missed_targets
    )
    return errors_df[list(ERRORS_DF_COLUMNS)]


def _process_empty_image(im_preds_df: pd.DataFrame) -> Dict[int, str]:
    """In an image without targets, all predictions represent a background error"""
    return {
        pred_id: ErrorType.BKG for pred_id in im_preds_df["pred_id"].unique()
    }


def _compute_iou_matrices(
    im_targets_df: pd.DataFrame, im_preds_df: pd.DataFrame
) -> Tuple[np.array, np.array]:
    """Compute IoU matrix between all targets and preds in the image
    :param im_targets_df: DataFrame with targets for the image being processed.
    :param im_preds_df: DataFrame with preds for the image being processed.
    :return:
        iou_matrix: Matrix of size (n_targets, n_preds) with IoU between all
            targets & preds
        iou_label_match_matrix: Same as `iou_matrix` but 0 for all target-pred
            pair with different labels (i.e. IoU kept only if labels match).
    """
    # row indexes point to targets, column indexes to predictions
    iou_matrix = torchvision.ops.box_iou(
        torch.from_numpy(
            im_targets_df[["xmin", "ymin", "xmax", "ymax"]].values
        ),
        torch.from_numpy(im_preds_df[["xmin", "ymin", "xmax", "ymax"]].values),
    ).numpy()

    # boolean matrix with True iff target and pred have the same label
    label_match_matrix = (
        im_targets_df["label_id"].values[:, None]
        == im_preds_df["label_id"].values[None, :]
    )
    # IoU matrix with 0 in all target-pred pairs that have different label
    iou_label_match_matrix = iou_matrix * label_match_matrix
    return iou_matrix, iou_label_match_matrix


def _match_pred_to_target_with_same_label(
    pred_idx: int,
    pred2error: Dict[int, str],
    pred2target: Dict[int, int],
    target2pred: Dict[int, int],
    iou_label_match_matrix: np.array,
    im_targets_df: pd.DataFrame,
    im_preds_df: pd.DataFrame,
    iou_background: float,
    iou_foreground: float,
) -> bool:
    """Try to match `pred_idx` to a target with the same label and identify error (if any)
    If there is a match `pred2error`, `pred2target` and (maybe) `target2pred`
    are modified in place.
    Possible error types found in this function:
        ErrorType.OK, ErrorType.DUP, ErrorType.LOC
    :param pred_idx: Index of prediction based on score (index 0 is maximum score for image).
    :param pred2error: Dict mapping pred_id to error type.
    :param pred2target: Dict mapping pred_id to target_id (if match found with iou above background)
    :param target2pred: Dict mapping target_id to pred_id to pred considered correct (if any).
    :param iou_label_match_matrix: Matrix with size [n_targets, n_preds] with IoU between all preds
        and targets that share label (i.e. IoU = 0 if there is a label mismatch).
    :param im_targets_df: DataFrame with targets for the image being processed.
    :param im_preds_df: DataFrame with preds for the image being processed.
    :param iou_background: Minimum IoU to consider a pred not background for target.
    :param iou_foreground: Minimum IoU to consider a pred foreground for a target.
    :return matched: Whether or not there was a match and we could identify the pred error.
    """
    # Find highest overlapping target for pred processed
    target_idx = np.argmax(iou_label_match_matrix[:, pred_idx])
    iou = np.max(iou_label_match_matrix[:, pred_idx])
    target_id = im_targets_df.at[target_idx, "target_id"]
    pred_id = im_preds_df.at[pred_idx, "pred_id"]

    matched = False
    if iou >= iou_foreground:
        pred2target[pred_id] = target_id
        # Check if another prediction is already the match for target to
        #   identify duplicates
        if target2pred.get(target_id) is None:
            target2pred[target_id] = pred_id
            pred2error[pred_id] = ErrorType.OK
        else:
            pred2error[pred_id] = ErrorType.DUP
        matched = True

    elif iou_background <= iou < iou_foreground:
        pred2target[pred_id] = target_id
        pred2error[pred_id] = ErrorType.LOC
        matched = True
    return matched


def _match_pred_wrong_label_or_background(
    pred_idx: int,
    pred2error: Dict[int, str],
    pred2target: Dict[int, int],
    iou_matrix: np.array,
    im_targets_df: pd.DataFrame,
    im_preds_df: pd.DataFrame,
    iou_background: float,
    iou_foreground: float,
) -> None:
    """Try to match `pred_idx` to a target (with different label) and identify error
    If there is a match `pred2error` and  (maybe) `pred2target` are modified in place.
    Possible error types found in this function:
        ErrorType.BKG, ErrorType.CLS, ErrorType.CLS_LOC
    :param pred_idx: Index of prediction based on score (index 0 is maximum score for image).
    :param pred2error: Dict mapping pred_id to error type.
    :param pred2target: Dict mapping pred_id to target_id (if match found with iou above background)
    :param iou_matrix: Matrix with size [n_targets, n_preds] with IoU between all preds and targets.
    :param im_targets_df: DataFrame with targets for the image being processed.
    :param im_preds_df: DataFrame with preds for the image being processed.
    :param iou_background: Minimum IoU to consider a pred not background for target.
    :param iou_foreground: Minimum IoU to consider a pred foreground for a target.
    """
    # Find highest overlapping target for pred processed
    target_idx = np.argmax(iou_matrix[:, pred_idx])
    iou = np.max(iou_matrix[:, pred_idx])
    target_id = im_targets_df.at[target_idx, "target_id"]
    pred_id = im_preds_df.at[pred_idx, "pred_id"]

    if iou < iou_background:
        pred2error[pred_id] = ErrorType.BKG

    # preds with correct label do not get here. Thus, no need to check if label
    #   is wrong
    elif iou >= iou_foreground:
        pred2target[pred_id] = target_id
        pred2error[pred_id] = ErrorType.CLS
    else:
        # No match to target, as we cannot be sure model was remotely close to
        #   getting it right
        pred2error[pred_id] = ErrorType.CLS_LOC


def _find_missed_targets(
    im_targets_df: pd.DataFrame, pred2target: Dict[int, int]
) -> Set[int]:
    """Find targets in the processed image that were not matched by any prediction
    :param im_targets_df: DataFrame with targets for the image being processed.
    :param pred2target: Dict mapping pred_id to target_id (if match found with
        iou above background)
    :return missed_targets: Set of all the target ids that were missed
    """
    matched_targets = [t for t in pred2target.values() if t is not None]
    missed_targets = set(im_targets_df["target_id"]) - set(matched_targets)
    return missed_targets


def _format_errors_as_dataframe(
    pred2error: Dict[int, str],
    pred2target: Dict[int, int],
    missed_targets: Set[int],
) -> pd.DataFrame:
    """Use the variables used to classify errors to format them in a ready to use DataFrame
    :param pred2error: Dict mapping pred_id to error type.
    :param pred2target: Dict mapping pred_id to target_id (if match found with
        iou above background)
    :param missed_targets: Set of all the target ids that were missed
    :return: DataFrame with columns ERRORS_DF_COLUMNS
    """
    errors_df = pd.DataFrame.from_records(
        [
            {"pred_id": pred_id, "error_type": error}
            for pred_id, error in pred2error.items()
        ]
    )
    errors_df["target_id"] = None
    errors_df.set_index("pred_id", inplace=True)
    for pred_id, target_id in pred2target.items():
        errors_df.at[pred_id, "target_id"] = target_id

    missed_df = pd.DataFrame(
        {
            "pred_id": None,
            "error_type": ErrorType.MISS,
            "target_id": list(missed_targets),
        }
    )
    errors_df = pd.concat(
        [errors_df.reset_index(), missed_df], ignore_index=True
    ).astype(
        {"pred_id": float, "target_id": float, "error_type": pd.StringDtype()}
    )
    return errors_df

Now, let's use the classify_predictions_errors function to see what types of errors our model makes:

errors_df = classify_predictions_errors(targets_df, preds_df)
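A quick first look at the result is the number of entries per error type. With the DataFrame above, `errors_df["error_type"].value_counts()` gives this directly. The same idea in plain Python, on made-up labels and with a hypothetical helper named `error_shares`:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_shares(error_types: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each error type to (count, share of all entries), most frequent first."""
    counts = Counter(error_types)
    total = sum(counts.values())
    return {
        error: (count, count / total)
        for error, count in counts.most_common()
    }


# Made-up labels for illustration; not the output of the COCO run
example = ["correct", "background", "correct", "missed", "background", "correct"]
for error, (count, share) in error_shares(example).items():
    print(f"{error:<12} {count:>3} {share:.0%}")
```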

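Although the number of errors of each type is informative, it does not tell the whole story. Not all errors affect the metrics we care about in the same way. For some problems, having many background predictions may not matter, because false positives are not an issue. In others, they may be a big problem (for example, when identifying tumors in medical imaging). Classifying errors is important because it lets us inspect the predictions that represent a particular error and try to understand why it happens. However, the number of errors per category is often not enough to give a clear picture of the main issues in our use case.

Impact of the errors

In any real-world scenario, there is a metric, or a set of metrics, on which we want the model to perform well. Ideally, these metrics are aligned with the project goals and summarize well how successful the model is at the task at hand. In the previous section, we found the absolute counts of the different error types. How each of these error types affects our performance evaluation depends largely on the metric used. Therefore, we are interested in finding the error types that affect our target metric the most, so that we can direct our efforts accordingly.

The intuition is the same as the one introduced in the TIDE paper: we compute the metric using the model's predictions. Then, we fix (i.e., correct) one type of error at a time and recompute the metric to see what it would have been had the model not made that type of error. Finally, we define the impact of each error type as the difference between the metric value after the fix and the original value. This gives us a quantifiable measure of how much the metric we care about is penalized by each error type.

For this, we need to define what "fixing an error" means for each type. Once again, we simply use the approach introduced in the TIDE paper. The explanations here are more detailed than those in the paper (which calls them "oracles" rather than "fixes"), because some of the caveats can only be found deep in its implementation.

CLS fix: correct the label of the detection to the right one. The correction is applied only to predictions that represent a CLS error and are the highest-scoring prediction matched to a target (IoU >= Tb), where that target has no OK prediction matched to it. All CLS predictions that do not meet these conditions are dropped.
LOC fix: correct the bounding box of the detection so that it coincides with that of the matched target. The correction is applied only to predictions that represent a LOC error and are the highest-scoring prediction matched to a target (IoU >= Tb), where that target has no OK prediction matched to it. All LOC predictions that do not meet these conditions are dropped.
CLS & LOC fix: because we cannot be sure which target the detector was trying to match, we drop the prediction.
DUP fix: remove the duplicate detection.
BKG fix: remove the hallucinated detection.
MISS fix: remove the missed target.

It is important to note the following: none of the fixes above overlap. They are defined in this specific way so that corrections never conflict, and each prediction can (and will) be corrected in one and only one way. This is why, for example, for CLS and LOC errors only the highest-scoring prediction is corrected, and only under certain conditions. Consequently, if all fixes are applied at once, the result is always a perfect metric, because every target is matched perfectly once and only once. Nonetheless, adding up the individual impacts and the original metric value is not guaranteed to yield a perfect score for the metric.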
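A toy illustration of both points: impact as "metric after the fix minus baseline", and the fact that impacts do not add up to a perfect score. To keep it short, the sketch uses an F1 score computed from counts as a stand-in for mAP and only the three fixes that remove items (DUP, BKG, MISS). The counts are made up, and all names are illustrative assumptions.

```python
from typing import Dict


def f1_from_counts(n_correct: int, n_preds: int, n_targets: int) -> float:
    """F1 score from the number of correct predictions, predictions and targets."""
    if n_preds == 0 or n_targets == 0 or n_correct == 0:
        return 0.0
    precision = n_correct / n_preds
    recall = n_correct / n_targets
    return 2 * precision * recall / (precision + recall)


def metric(counts: Dict[str, int]) -> float:
    n_preds = counts["correct"] + counts["duplicate"] + counts["background"]
    n_targets = counts["correct"] + counts["missed"]
    return f1_from_counts(counts["correct"], n_preds, n_targets)


# Hypothetical error counts, not results from this article
counts = {"correct": 6, "duplicate": 2, "background": 2, "missed": 4}

baseline = metric(counts)
impact = {}
for error in ("duplicate", "background", "missed"):
    # Each of these fixes removes the offending predictions or targets
    fixed_counts = {**counts, error: 0}
    impact[error] = metric(fixed_counts) - baseline

all_fixed = metric({**counts, "duplicate": 0, "background": 0, "missed": 0})

print(f"baseline: {baseline:.3f}")
for error, delta in impact.items():
    print(f"fixing {error:<10} -> +{delta:.3f}")
print(f"baseline + sum of impacts: {baseline + sum(impact.values()):.3f}")
print(f"all fixes applied at once: {all_fixed:.3f}")
```

Applying all fixes together reaches a perfect score, while the baseline plus the sum of the individual impacts falls short of it. This is why impacts are best read as a ranking of error types rather than as additive shares.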

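Now let's see an example of this in practice. For this example, we will use the mean Average Precision (mAP) metric, as it is the standard go-to metric for object detection problems. If you are not familiar with mAP, I recommend watching this video starting at minute 37. It is one of the best explanations of the metric I have seen.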
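As a rough intuition for what mAP measures, here is a sketch of Average Precision (AP) for a single class; mAP then averages AP over classes. The sketch assumes predictions have already been matched to targets and sorted by descending score, and it uses all-point interpolation of the precision-recall curve. That is one common variant: COCO-style evaluators sample the curve at fixed recall points instead, so the values will not match such tools exactly. The function name and inputs are illustrative assumptions.

```python
from typing import List


def average_precision(is_true_positive: List[bool], n_targets: int) -> float:
    """AP for one class from TP/FP flags sorted by descending score."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")

    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_targets)

    # Make precision non-increasing: at each point use the best precision
    # achievable at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Two targets; the second-ranked prediction is a false positive
print(average_precision([True, False, True], n_targets=2))
# Removing that false positive (as the BKG fix would) yields a perfect AP
print(average_precision([True, True], n_targets=2))
```

The two calls also show the "fix and recompute" idea on a tiny scale: dropping the false positive raises AP, and the difference is the impact of that error.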

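Let's define some code to help us compute this metric. We will use the torchmetrics implementation, with some extra processing to convert the predictions and targets in our DataFrames into the format torchmetrics expects.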

# Copyright © 2022 Bernat Puig Camps
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision


class MyMeanAveragePrecision:
    """Wrapper for the torchmetrics MeanAveragePrecision exposing API we need"""

    def __init__(self, foreground_threshold):
        self.device = (
            torch.device("cuda:0")
            if torch.cuda.is_available()
            else torch.device("cpu")
        )
        self.map = MeanAveragePrecision(
            iou_thresholds=[foreground_threshold]
        ).to(self.device)

    def __call__(self, targets_df, preds_df):
        targets, preds = self._format_inputs(targets_df, preds_df)
        self.map.update(preds=preds, target=targets)
        result = self.map.compute()["map"].item()
        self.map.reset()
        return result

    def _format_inputs(self, targets_df, preds_df):
        image_ids = set(targets_df["image_id"]) | set(preds_df["image_id"])
        targets, preds = [], []
        for image_id in image_ids:
            im_targets_df = targets_df.query("image_id == @image_id")
            im_preds_df = preds_df.query("image_id == @image_id")
            targets.append(
                {
                    "boxes": torch.as_tensor(
                        im_targets_df[["xmin", "ymin", "xmax", "ymax"]].values,
                        dtype=torch.float32,
                    ).to(self.device),
                    "labels": torch.as_tensor(
                        im_targets_df["label_id"].values, dtype=torch.int64
                    ).to(self.device),
                }
            )
            preds.append(
                {
                    "boxes": torch.as_tensor(
                        im_preds_df[["xmin", "ymin", "xmax", "ymax"]].values,
                        dtype=torch.float32,
                    ).to(self.device),
                    "labels": torch.as_tensor(
                        im_preds_df["label_id"].values, dtype=torch.int64
                    ).to(self.device),
                    "scores": torch.as_tensor(
                        im_preds_df["score"].values, dtype=torch.float32
                    ).to(self.device),
                }
            )
        return targets, preds

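Now, let's define some functions to measure the impact of our errors on our metric. Since the metric can be any callable, this implementation is not coupled to the MeanAveragePrecision wrapper defined above.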

# Copyright © 2022 Bernat Puig Camps
from typing import Callable, Dict, Tuple

import pandas as pd

from classify_errors import PREDS_DF_COLUMNS, TARGETS_DF_COLUMNS, ErrorType


def calculate_error_impact(
    metric_name: str,
    metric: Callable,
    errors_df: pd.DataFrame,
    targets_df: pd.DataFrame,
    preds_df: pd.DataFrame,
) -> Dict[str, float]:
    """Calculate the `metric` and the independant impact each error type has on it
    Impact is defined as the (metric_after_fixing - metric_before_fixing).
    Note that all error impacts and the metric will not add to 1. Nonetheless,
    the errors (and fixes) are defined in such a way that applying all fixes
    would end up with a perfect metric score.
    :param metric_name: Name of the metric to display for logging purposes.
    :param metric: Callable that will be called as metric(targets_df, preds_df)
        and returns a float.
    :param errors_df: DataFrame with error classification for all preds and targets
    :param targets_df: DataFrame with the targets.
    :param preds_df: DataFrame with the predictions.
    :return impact: Dictionary with one key for the metric without fixing and
        one for each error type.
    """

    ensure_consistency(errors_df, targets_df, preds_df)

    metric_values = {
        ErrorType.CLS: metric(*fix_cls_error(errors_df, targets_df, preds_df)),
        ErrorType.LOC: metric(*fix_loc_error(errors_df, targets_df, preds_df)),
        ErrorType.CLS_LOC: metric(
            *fix_cls_loc_error(errors_df, targets_df, preds_df)
        ),
        ErrorType.DUP: metric(*fix_dup_error(errors_df, targets_df, preds_df)),
        ErrorType.BKG: metric(*fix_bkg_error(errors_df, targets_df, preds_df)),
        ErrorType.MISS: metric(
            *fix_miss_error(errors_df, targets_df, preds_df)
        ),
    }

    # Compute the metric on the actual results
    baseline_metric = metric(targets_df, preds_df)
    # Calculate the difference (impact) in the metric when fixing each error
    impact = {
        error: (error_metric - baseline_metric)
        for error, error_metric in metric_values.items()
    }
    impact[metric_name] = baseline_metric
    return impact


def ensure_consistency(
    errors_df: pd.DataFrame, targets_df: pd.DataFrame, preds_df: pd.DataFrame
):
    """Make sure that all targets are preds are accounted for in errors"""
    target_ids = set(targets_df["target_id"])
    pred_ids = set(preds_df["pred_id"])

    error_target_ids = set(errors_df.query("target_id.notnull()")["target_id"])
    error_pred_ids = set(errors_df.query("pred_id.notnull()")["pred_id"])

    if not target_ids == error_target_ids:
        raise ValueError(
            f"Missing target IDs in error_df: {target_ids - error_target_ids}"
        )

    if not pred_ids == error_pred_ids:
        raise ValueError(
            f"Missing pred IDs in error_df: {pred_ids - error_pred_ids}"
        )


def fix_cls_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return _fix_by_correcting_and_removing_preds(
        errors_df, targets_df, preds_df, ErrorType.CLS
    )


def fix_loc_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return _fix_by_correcting_and_removing_preds(
        errors_df, targets_df, preds_df, ErrorType.LOC
    )


def fix_cls_loc_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return _fix_by_removing_preds(
        errors_df, targets_df, preds_df, ErrorType.CLS_LOC
    )


def fix_bkg_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return _fix_by_removing_preds(
        errors_df, targets_df, preds_df, ErrorType.BKG
    )


def fix_dup_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    return _fix_by_removing_preds(
        errors_df, targets_df, preds_df, ErrorType.DUP
    )


def fix_miss_error(
    errors_df, targets_df, preds_df
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Fix missed targets by removing them
    Missed targets is the only type of errors that deals with targets rather
    than predictions
    :return: Fixed (`targets_df`, `preds_df`)
    """
    ensure_consistency(errors_df, targets_df, preds_df)

    targets_df = targets_df.merge(
        # Need to filter rest of errors or multi prediction per target makes
        #   target_df bigger
        errors_df.query("error_type == @ErrorType.MISS"),
        on="target_id",
        how="left",
    ).query("error_type.isnull()")
    return targets_df[TARGETS_DF_COLUMNS], preds_df


def _fix_by_correcting_and_removing_preds(
    errors_df: pd.DataFrame,
    targets_df: pd.DataFrame,
    preds_df: pd.DataFrame,
    error_type: ErrorType,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Correct predictions of `error_type` of unmatched target and remove the rest
    CLS and LOC errors are matched to targets. To assess their impact, we
    correct the highest scoring prediction for an unmatched target
    (no OK error for it).
        - For CLS, we set the label to the right one.
        - For LOC, we set the bounding box to match perfectly with the target's.
    The non-corrected predictions of `error_type` are removed from `preds_df`.
    The idea is to assess what happened if instead of missing a target due to an
    incorrect prediction, we would have had a correct one instead. The ones that
    are not highest-scoring for target would have been duplicates, so we remove
    them.
    :return: Fixed (`targets_df`, `preds_df`)
    """

    assert error_type in {
        ErrorType.CLS,
        ErrorType.LOC,
    }, f"error_type='{error_type}'"
    ensure_consistency(errors_df, targets_df, preds_df)

    cols_to_correct = {
        ErrorType.CLS: ["label_id"],
        ErrorType.LOC: ["xmin", "ymin", "xmax", "ymax"],
    }[error_type]

    # Add matched targets to relevant preds and sort so highest scoring is first.
    preds_df = (
        preds_df.merge(
            errors_df.query(
                "error_type in [@ErrorType.OK, @ErrorType.CLS, @ErrorType.LOC]"
            ),
            on="pred_id",
            how="left",
        )
        .merge(
            targets_df[["target_id"] + cols_to_correct],
            on="target_id",
            how="left",
            suffixes=("", "_target"),
        )
        .sort_values(by="score", ascending=False)
    )

    to_correct = preds_df["error_type"].eq(error_type)
    target_cols = [col + "_target" for col in cols_to_correct]
    preds_df.loc[to_correct, cols_to_correct] = preds_df.loc[
        to_correct, target_cols
    ].values

    to_drop = []
    for _, target_df in preds_df.groupby("target_id"):
        if target_df["error_type"].eq(ErrorType.OK).any():
            # If target has a correct prediction, drop all predictions of `error_type`
            to_drop += target_df.query("error_type == @error_type")[
                "pred_id"
            ].tolist()
        elif (
            target_df["error_type"].eq(error_type).any() and len(target_df) > 1
        ):
            # If target unmatched, drop all predictions of `error_type` that are
            #   not highest score
            to_keep = target_df["pred_id"].iloc[0]
            to_drop += target_df.query(
                "error_type == @error_type and pred_id != @to_keep"
            )["pred_id"].tolist()
    return (
        targets_df,
        preds_df.query("pred_id not in @to_drop")[PREDS_DF_COLUMNS],
    )


def _fix_by_removing_preds(
    errors_df: pd.DataFrame,
    targets_df: pd.DataFrame,
    preds_df: pd.DataFrame,
    error_type: ErrorType,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Fix the `error_type` by removing the predictions assigned to that error
    This is applicable to:
        - ErrorType.CLS_LOC and ErrorType.BKG because there is no target we
            could match it and be sure the model was "intending" to predict that.
        - ErrorType.DUP by definition.
    :return: Fixed (`targets_df`, `preds_df`)
    """

    assert error_type in {
        ErrorType.CLS_LOC,
        ErrorType.BKG,
        ErrorType.DUP,
    }, f"error_type='{error_type}'"
    ensure_consistency(errors_df, targets_df, preds_df)

    preds_df = preds_df.merge(errors_df, on="pred_id", how="left").query(
        "error_type != @error_type"
    )
    return targets_df, preds_df[PREDS_DF_COLUMNS]

Now, let's use the calculate_error_impact function to see how our errors affect our mAP.
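The function returns a dictionary with one entry per error type plus one entry, keyed by `metric_name`, holding the baseline metric. A small helper can turn that dictionary into a ranking. The helper name `rank_error_impacts` is hypothetical, and the numbers below are placeholders used only to exercise it; they are not the results of the COCO run discussed in this article.

```python
from typing import Dict, List, Tuple


def rank_error_impacts(
    impact: Dict[str, float], metric_name: str
) -> List[Tuple[str, float]]:
    """Sort error types by impact, largest first, leaving the baseline entry out."""
    errors = {key: value for key, value in impact.items() if key != metric_name}
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only; NOT measured results
placeholder_impact = {
    "classification": 0.02,
    "localization": 0.05,
    "missed": 0.10,
    "mAP@0.5": 0.40,
}
ranked = rank_error_impacts(placeholder_impact, metric_name="mAP@0.5")
for error, delta in ranked:
    print(f"{error:<15} +{delta:.3f}")
```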

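Unsurprisingly, the base value of the metric is fairly high. This is expected, as the model was trained specifically to do well on this validation set. Although most errors other than duplicates contribute something, missed targets and background predictions have the largest impact on performance.

Recalling the images we inspected in the loss section, these results may reinforce the hypotheses formulated there:

There are missing labels that the model detects correctly (for example, the bird drinking water next to the giraffes) but that are penalized as background errors.
Some objects are not labeled because the dataset has no category for them (for example, the animal grazing with the zebras), or they look like other objects that do have a category, and they are also penalized as background errors.
Some hard-to-detect objects that the model fails to find (for example, the flamingos or other small birds) may be labeled inconsistently, and they end up classified as missed errors.

The next step would be to explore these hypotheses in more depth to confirm or discard them. We can now look at the images containing predictions classified as background errors and check whether labels are missing from the ground truth. If so, we can fix them by adding the missing boxes and evaluate again. Hopefully, our mAP will go up and the contribution of background errors will go down. Note that, in this example, the problem lies in the data rather than in the model. That is, the model is doing better than the metric tells us, but we would not know that without a more thorough analysis.

As you can see, error analysis let us quickly formulate a few hypotheses about what may be limiting our model's performance, and with these hypotheses it is easier to design potential improvement strategies.

Summary

Here we explored how error analysis can be used to tackle object detection problems. It is important to note that this is an iterative process. The idea is to start somewhere, address the most pressing issue, then the next one, and the next, until your solution meets the required criteria. In most cases, this approach works much better than endless parameter tuning. Above all, it is a great tool for building intuition and a deeper understanding of the problem at hand, which in itself is very important for systems that end up in the real world.

Finally, the underlying idea is not unique to object detection. In most machine learning settings, there are ways to use the predictions of a trained model to steer the effort toward satisfactory results. I encourage you to think in these terms for the next problem you face, and I hope you find what I have written here useful!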

Code and source article:

https://gist.github.com/bepuca/1798371425b73cff60cdfa3c023ebff8

https://gist.github.com/bepuca/fdaac6acd9a1e0085726a33ee0341250

https://medium.com/data-science-at-microsoft/error-analysis-for-object-detection-models-338cb6534051

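That's all for today. If you enjoyed this post, your support is much appreciated. Thanks!

This article is part of the Tencent Cloud self-media sync exposure program and is shared from a WeChat official account.

Originally published: 2022-07-14. In case of infringement, please contact cloudcommunity@tencent.com for removal.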
