Implementing YOLO5Face by Hand

Put simply, YOLO5Face is YOLOv5 plus a Landmark branch: a single inference pass detects face bounding boxes and face keypoint coordinates at the same time.

Overview

The official code, deepcam-cn/yolov5-face, is based on an older version of ultralytics/yolov5. To take advantage of the latest yolov5 algorithms and training pipeline, I integrated YOLO5Face on top of ultralytics/yolov5 v7.0. The work touches four modules: the model implementation, image preprocessing, the loss function, and data post-processing.

Model Implementation

According to the paper, YOLO5Face makes five model-level changes:

  1. A new StemBlock replaces the original Focus input layer;
  2. A new CBS block replaces the original CSP block (C3);
  3. The SPP module uses smaller kernel sizes (3/5/7 instead of the original 5/9/11);
  4. An optional P6 output layer is added, fed by SPP-processed convolutional features;
  5. A Landmark branch is appended to the output layer to regress face keypoint coordinates.

Looking at the latest official YOLO5Face implementation, change 2 has been dropped and the C3 module is still used for feature extraction. Meanwhile, yolov5-v7.0 itself has already replaced the Focus module with a plain Conv layer and the SPP module with SPPF. Taking all of this into account, I made two modifications on top of yolov5-v7.0:

  1. Use a smaller kernel size for pooling in the SPPF module (SPPF(k=3) == SPP(k=(3, 5, 7)));
  2. Add a Landmark branch to the final output layer.

The configuration file, using yolov5s_v7_0.yaml as an example:

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 1  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [4,5, 6,8, 10,12]  # P3/8
  - [14,18, 21,27, 37,47]  # P4/16
  - [65,86, 129,172, 251,369]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 3]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, FaceDetect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

Comparing YOLOv5n with SPPF(k=5) and SPPF(k=3) on the WIDER FACE dataset shows nearly identical accuracy, so I kept the YOLOv5 default configuration (SPPF(k=5)).

| Model | Easy Val AP | Medium Val AP | Hard Val AP |
| :-- | :-- | :-- | :-- |
| YOLOv5n with SPPF(k=5) | 0.92797 | 0.90772 | 0.80272 |
| YOLOv5n with SPPF(k=3) | 0.92628 | 0.90603 | 0.80403 |

SPP vs. SPPF

The SPP module originates from the SPPNet classification model, which uses multi-level pooling so that a fixed-size output feature vector is produced regardless of the input image size. YOLOv3 first brought the SPP module into object detection, placing it between the Backbone and the Neck to capture multi-scale features with pooling layers of different kernel sizes (note: here the spatial size of the output feature map stays the same as the input). YOLOv5 later refined SPP into SPPF, which produces exactly the same result while running faster. The snippet below, taken from the yolov5 project, profiles the two modules:

# Profile
import torch
from utils.torch_utils import profile
from models.common import SPP, SPPF

m1 = SPP(1024, 1024)
m2 = SPPF(1024, 1024)
results = profile(input=torch.randn(16, 1024, 64, 64), ops=[m1, m2], n=100)

For the SPPF module, SPPF(k=5) == SPP(k=(5, 9, 13)) and SPPF(k=3) == SPP(k=(3, 5, 7)); a quick numerical check of this equivalence follows the two class definitions below.

class SPP(nn.Module):
    # Spatial Pyramid Pooling (SPP) layer https://arxiv.org/abs/1406.4729
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))


class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))
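The equivalence claimed above can be checked numerically. A minimal sketch (standalone, CPU only): chaining 5x5 max-pools with stride 1 and padding 2 two or three times gives exactly the same result as a single 9x9 or 13x13 pool, which is why SPPF can reuse intermediate pooling results instead of pooling three times from scratch.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
m5 = nn.MaxPool2d(5, stride=1, padding=2)
m9 = nn.MaxPool2d(9, stride=1, padding=4)
m13 = nn.MaxPool2d(13, stride=1, padding=6)
y2, y3 = m5(m5(x)), m5(m5(m5(x)))   # chained 5x5 pools, as in SPPF.forward
assert torch.equal(y2, m9(x))       # two chained k=5 pools == one k=9 pool
assert torch.equal(y3, m13(x))      # three chained k=5 pools == one k=13 pool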

FaceDetect

For YOLOv5's output layer (the Detect module), the final output feature has shape (bs, na, fw, fh, o_dim) (a short shape check follows the list):

  1. bs is the batch size;
  2. na is the number of anchors per layer, 3 by default;
  3. fw/fh are the spatial dimensions of the feature map;
  4. o_dim is the per-anchor output xy/wh/conf/probs; by default xy=2, wh=2, conf=1, n_classes=80 (taking COCO as the example), so o_dim has length 2+2+1+80=85.
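A quick shape walk-through with assumed values (one P3/8 feature map of a 640x640 input, nc=80), mirroring the view/permute done in the detection head:

import torch

bs, na, ny, nx, o_dim = 1, 3, 80, 80, 85    # o_dim = 2 + 2 + 1 + 80 for COCO; 16 for the 1-class face model below
feat = torch.randn(bs, na * o_dim, ny, nx)  # raw conv output (bs, 255, 80, 80)
feat = feat.view(bs, na, o_dim, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
print(feat.shape)                           # torch.Size([1, 3, 80, 80, 85])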

For YOLO5Face, a face is just another object, so the only addition is the prediction of 10 keypoint coordinates; o_dim is then composed of xy/wh/conf/probs/landmarks, with length 2+2+1+80+10=95. The face box and keypoints are decoded as follows:

\[ xy = (\mathrm{sigmoid}(xy) \cdot 2 + grid_i - 0.5) \cdot stride_i \qquad wh = (\mathrm{sigmoid}(wh) \cdot 2)^2 \cdot anchor_i \]

\[ probs = \mathrm{sigmoid}(probs) \qquad landmarks = landmarks \cdot anchor_i + grid_i \cdot stride_i \]

  1. xy is the predicted box center: sigmoid-normalized, then combined with the grid coordinate and the stride to scale back to the input image;
  2. wh is the predicted box width and height: sigmoid-normalized, then combined with the stride and the anchor size to scale back to the input image;
  3. probs is trained with BCELoss, so a per-class sigmoid is enough to turn each class item into a probability;
  4. the landmark coordinates follow the box formulation, combining the grid coordinate, the stride, and the anchor size.

Note: in the formulas above, anchor_i has already been multiplied by the stride; otherwise the wh and landmark formulas would read:

\[ wh = (\mathrm{sigmoid}(wh) \cdot 2)^2 \cdot anchor_i \cdot stride_i \]

\[ landmarks = (landmarks \cdot anchor_i + grid_i) \cdot stride_i \]
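Putting the formulas together, here is a simplified decoding sketch for a single output scale with nc=1. This is my own condensation of FaceDetect.forward shown below, assuming (as in the actual code) that grid already carries the -0.5 offset and anchor has already been multiplied by the stride:

import torch

def decode_face(p, grid, anchor, stride):
    # p: raw head output (bs, na, ny, nx, 16) for nc=1; grid/anchor: (1, na, ny, nx, 2)
    p[..., :6] = p[..., :6].sigmoid()                       # xy, wh, conf, cls
    xy, wh, conf, cls, lmk = p.split((2, 2, 1, 1, 10), -1)
    xy = (xy * 2 + grid) * stride                           # box center, input-image pixels
    wh = (wh * 2) ** 2 * anchor                             # box size, input-image pixels
    lmk = (lmk.reshape(*lmk.shape[:-1], 5, 2) * anchor.unsqueeze(-2)
           + (grid.unsqueeze(-2) + 0.5) * stride)           # 5 keypoints, input-image pixels
    return torch.cat((xy, wh, conf, cls, lmk.flatten(-2)), -1)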

class FaceDetect(nn.Module):
    # YOLOv5 Detect head for detection models
    stride = None  # strides computed during build
    dynamic = False  # force grid reconstruction
    export = False  # export mode

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
        super().__init__()
        self.nc = nc  # number of classes
        self.n_landmarks = 10  # number of landmarks
        # self.no = nc + 5  # number of outputs per anchor
        self.no = nc + 5 + self.n_landmarks  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.empty(0) for _ in range(self.nl)]  # init grid
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]  # init anchor grid
        self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
        self.inplace = inplace  # use inplace ops (e.g. slice assignment)

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                # Detect (boxes only)
                # xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                x[i][..., :(5 + self.nc)] = torch.sigmoid(x[i][..., :(5 + self.nc)])
                xy, wh, conf, cls, landmarks = x[i].split((2, 2, 1, self.nc, self.n_landmarks), 4)
                xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh

                landmarks = landmarks.reshape(bs, self.na, ny, nx, -1, 2) * self.anchor_grid[i][:, :, :, :, None] + \
                            (self.grid[i][:, :, :, :, None] + 0.5) * self.stride[i]
                landmarks = landmarks.reshape(bs, self.na, ny, nx, -1)

                # y = torch.cat((xy, wh, conf), 4)
                y = torch.cat((xy, wh, conf, cls, landmarks), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

    def _make_grid(self, nx=20, ny=20, i=0, torch_1_10=check_version(torch.__version__, '1.10.0')):
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # grid shape
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing='ij') if torch_1_10 else torch.meshgrid(y, x)  # torch>=0.7 compatibility
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5  # add grid offset, i.e. y = 2.0 * x - 0.5
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)
        return grid, anchor_grid

Image Preprocessing

YOLO5Face treats the face as a generic object, so all image-preprocessing routines in the yolov5 project can be applied to face data. However, because YOLO5Face also detects face keypoints, each face label carries keypoint coordinates in addition to the bounding box, and the existing preprocessing code must be extended to transform boxes and keypoints together. In yolov5-v7.0, the Nano and Small models are trained with hyp.scratch-low.yaml and the larger models with hyp.scratch-high.yaml; the custom YOLO5Face project uses hyp.scratch-low.yaml throughout. The preprocessing changes therefore fall into three areas:

  1. loading and saving label files;
  2. coordinate transforms for face boxes and face keypoints;
  3. the image-preprocessing functions themselves.

Label Files

The original yolov5 label-file format is:

# cls_id x_center y_center box_w box_h
0 0.234375 0.361639 0.037109 0.061493
0 0.038085 0.330161 0.03125 0.051244

YOLO5Face appends the coordinates of 5 face keypoints to each bounding box:

# cls_id x_center y_center box_w box_h kp1_x kp1_y kp2_x kp2_y kp3_x kp3_y kp4_x kp4_y kp5_x kp5_y
0 0.234375 0.361639 0.037109 0.061493 0.226597 0.348409 0.244384 0.348409 0.236275 0.358998 0.229212 0.373509 0.242553 0.373901
0 0.038085 0.330161 0.03125 0.051244 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0

Note 1: the 5 keypoints are the left eye, right eye, nose tip, left mouth corner, and right mouth corner.

Note 2: keypoints that could not be annotated (occlusion, very small faces, etc.) are marked with -1.0.
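For reference, a one-off reader for this extended format might look like the following (a hypothetical helper, not part of the project code):

import numpy as np

def parse_face_label_line(line):
    # cls_id, normalized xywh box, then 5 (x, y) keypoints; -1.0 marks a missing keypoint
    v = np.array(line.split(), dtype=np.float32)
    return int(v[0]), v[1:5], v[5:15].reshape(5, 2)

cls_id, box, keypoints = parse_face_label_line(
    "0 0.234375 0.361639 0.037109 0.061493 "
    "0.226597 0.348409 0.244384 0.348409 0.236275 0.358998 0.229212 0.373509 0.242553 0.373901")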

In the yolov5 project, label files are read and validated by the verify_image_label function:

def verify_image_label(args):
    # Verify one image-label pair
    im_file, lb_file, prefix = args
    nm, nf, ne, nc, msg, segments = 0, 0, 0, 0, '', []  # number (missing, found, empty, corrupt), message, segments
    try:
        # verify images
        im = Image.open(im_file)
        im.verify()  # PIL verify
        shape = exif_size(im)  # image size
        assert (shape[0] > 9) & (shape[1] > 9), f'image size {shape} <10 pixels'
        assert im.format.lower() in IMG_FORMATS, f'invalid image format {im.format}'
        if im.format.lower() in ('jpg', 'jpeg'):
            with open(im_file, 'rb') as f:
                f.seek(-2, 2)
                if f.read() != b'\xff\xd9':  # corrupt JPEG
                    ImageOps.exif_transpose(Image.open(im_file)).save(im_file, 'JPEG', subsampling=0, quality=100)
                    msg = f'{prefix}WARNING ⚠️ {im_file}: corrupt JPEG restored and saved'

        # verify labels
        if os.path.isfile(lb_file):
            nf = 1  # label found
            with open(lb_file) as f:
                lb = [x.split() for x in f.read().strip().splitlines() if len(x)]
                # if any(len(x) > 6 for x in lb):  # is segment
                #     classes = np.array([x[0] for x in lb], dtype=np.float32)
                #     segments = [np.array(x[1:], dtype=np.float32).reshape(-1, 2) for x in lb]  # (cls, xy1...)
                #     lb = np.concatenate((classes.reshape(-1, 1), segments2boxes(segments)), 1)  # (cls, xywh)
                lb = np.array(lb, dtype=np.float32)
            nl = len(lb)
            if nl:
                # assert lb.shape[1] == 5, f'labels require 5 columns, {lb.shape[1]} columns detected'
                assert lb.shape[1] == (5 + 10), f'labels require 15 columns, {lb.shape[1]} columns detected'
                # assert (lb >= 0).all(), f'negative label values {lb[lb < 0]}'
                # assert (lb[:, 1:] <= 1).all(), f'non-normalized or out of bounds coordinates {lb[:, 1:][lb[:, 1:] > 1]}'
                _, i = np.unique(lb, axis=0, return_index=True)
                if len(i) < nl:  # duplicate row check
                    lb = lb[i]  # remove duplicates
                    if segments:
                        segments = [segments[x] for x in i]
                    msg = f'{prefix}WARNING ⚠️ {im_file}: {nl - len(i)} duplicate labels removed'
            else:
                ne = 1  # label empty
                # lb = np.zeros((0, 5), dtype=np.float32)
                lb = np.zeros((0, 15), dtype=np.float32)
        else:
            nm = 1  # label missing
            # lb = np.zeros((0, 5), dtype=np.float32)
            lb = np.zeros((0, 15), dtype=np.float32)
        return im_file, lb, shape, segments, nm, nf, ne, nc, msg
    except Exception as e:
        nc = 1
        msg = f'{prefix}WARNING ⚠️ {im_file}: ignoring corrupt image/label: {e}'
        return [None, None, None, None, nm, nf, ne, nc, msg]

Coordinate Transforms

During preprocessing, coordinates are converted back and forth between xywhn and xyxy through two helper functions:

  1. xywhn2xyxy
  2. xyxy2xywhn

While transforming the box coordinates, the keypoint coordinates must be transformed as well:

labels = self.labels[index].copy()
if labels.size:  # normalized xywh to pixel xyxy format
    # labels[:, 1:] = xywhn2xyxy(labels[:, 1:], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])
    labels[:, 1:5] = xywhn2xyxy(labels[:, 1:5], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])

    labels[:, 5::2] = np.where(labels[:, 5::2] < 0, -1, labels[:, 5::2] * w + pad[0])
    labels[:, 6::2] = np.where(labels[:, 6::2] < 0, -1, labels[:, 6::2] * h + pad[1])

nl = len(labels)  # number of labels
if nl:
    labels[:, 1:5] = xyxy2xywhn(labels[:, 1:5], w=img.shape[1], h=img.shape[0], clip=True, eps=1E-3)

    # normalized landmark x 0-1
    labels[:, 5::2] = np.where(labels[:, 5::2] < 0, -1, labels[:, 5::2] / img.shape[1])
    # normalized landmark y 0-1
    labels[:, 6::2] = np.where(labels[:, 6::2] < 0, -1, labels[:, 6::2] / img.shape[0])
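A tiny numeric example (made-up values) of the landmark part of the transform above: the normalized x/y values scale with the image width/height plus padding, while the -1 sentinel for missing keypoints must survive untouched.

import numpy as np

label = np.array([[0, 0.5, 0.5, 0.2, 0.3,
                   0.40, 0.45, -1.0, -1.0, 0.50, 0.55, 0.46, 0.60, 0.54, 0.60]])
w, h, padw, padh = 640, 640, 16, 0
label[:, 5::2] = np.where(label[:, 5::2] < 0, -1, label[:, 5::2] * w + padw)
label[:, 6::2] = np.where(label[:, 6::2] < 0, -1, label[:, 6::2] * h + padh)
print(label[0, 5:9])  # [272. 288.  -1.  -1.] -> first keypoint scaled, missing keypoint kept as -1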

Preprocessing Functions

The following preprocessing functions need changes (note: only the hyp.scratch-low.yaml configuration is considered):

  1. mosaic
  2. random_perspective
  3. letterbox
  4. fliplr

Mosaic

A brief walk-through of the mosaic augmentation in the yolov5 project:

  1. Initialization
    1. Create the output image img4 at twice the model input size (img_size*2, img_size*2), filled with the default value 114;
    2. Randomly sample the mosaic center (xc, yc) in the range [img_size // 2, img_size + img_size // 2].
  2. Iterate over 4 randomly chosen images
    1. The first image is placed in the top-left corner of img4, cropped from the bottom-right part of the source image;
    2. The second image is placed in the top-right corner of img4, cropped from the bottom-left part of the source image;
    3. The third image is placed in the bottom-left corner of img4, cropped from the top-right part of the source image;
    4. The fourth image is placed in the bottom-right corner of img4, cropped from the top-left part of the source image;
    5. Note: how much of each sub-image is kept depends on the mosaic center; in the ideal case the center lies at the middle of img4 (i.e. (xc, yc) == (img_size, img_size)).
  3. Random perspective
    1. Apply a random perspective transform to img4, producing the final model input size (img_size, img_size).

Within the mosaic function, the coordinate handling consists of two parts:

  1. converting the normalized box and keypoint coordinates to the coordinate system of the cropped mosaic canvas;
  2. transforming boxes and keypoints inside the random-perspective function.

def load_mosaic(self, index):
    # YOLOv5 4-mosaic loader. Loads 1 image + 3 random images into a 4-image mosaic
    labels4, segments4 = [], []
    s = self.img_size
    yc, xc = (int(random.uniform(-x, 2 * s + x)) for x in self.mosaic_border)  # mosaic center x, y
    indices = [index] + random.choices(self.indices, k=3)  # 3 additional image indices
    random.shuffle(indices)
    for i, index in enumerate(indices):
        # Load image
        img, _, (h, w) = self.load_image(index)

        # place img in img4
        if i == 0:  # top left
            img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
        elif i == 1:  # top right
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # bottom left
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        elif i == 3:  # bottom right
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]
        padw = x1a - x1b
        padh = y1a - y1b

        # Labels
        labels, segments = self.labels[index].copy(), self.segments[index].copy()
        if labels.size:
            # labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padw, padh)  # normalized xywh to pixel xyxy format
            labels[:, 1:5] = xywhn2xyxy(labels[:, 1:5], w, h, padw, padh)  # normalized xywh to pixel xyxy format

            labels[:, 5::2] = np.where(labels[:, 5::2] < 0, -1, labels[:, 5::2] * w + padw)
            labels[:, 6::2] = np.where(labels[:, 6::2] < 0, -1, labels[:, 6::2] * h + padh)

            segments = [xyn2xy(x, w, h, padw, padh) for x in segments]

        labels4.append(labels)
        segments4.extend(segments)

    # Concat/clip labels
    labels4 = np.concatenate(labels4, 0)
    for x in (labels4[:, 1:], *segments4):
        # np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()
        np.clip(x[:, :4], 0, 2 * s, out=x[:, :4])  # clip when using random_perspective()

        x[:, 4::2] = np.where(x[:, 4::2] < 0, -1, x[:, 4::2])
        x[:, 4::2] = np.where(x[:, 4::2] > 2 * s, -1, x[:, 4::2])
        x[:, 5::2] = np.where(x[:, 5::2] < 0, -1, x[:, 5::2])
        x[:, 5::2] = np.where(x[:, 5::2] > 2 * s, -1, x[:, 5::2])
    # img4, labels4 = replicate(img4, labels4)  # replicate

    # Augment
    img4, labels4, segments4 = copy_paste(img4, labels4, segments4, p=self.hyp['copy_paste'])
    img4, labels4 = random_perspective(img4,
                                       labels4,
                                       segments4,
                                       degrees=self.hyp['degrees'],
                                       translate=self.hyp['translate'],
                                       scale=self.hyp['scale'],
                                       shear=self.hyp['shear'],
                                       perspective=self.hyp['perspective'],
                                       border=self.mosaic_border)  # border to remove

    return img4, labels4

Random Perspective

The random_perspective function in the yolov5 project applies the following geometric transforms:

  1. Center: shift the image center to the coordinate origin;
  2. Perspective: add a random perspective component along the x and y axes;
  3. Rotation and Scale:
    1. sample a rotation angle within the configured range and rotate the image;
    2. sample a scale factor and apply it to the image;
  4. Shear: add a random shear angle in the x and y directions;
  5. Translation: translate the image randomly in x and y.

If no perspective component is used, i.e. only an affine transform (rotation, scale, shear, translation) is needed, cv2.warpAffine is used; otherwise cv2.warpPerspective is used. Box corners and keypoints are warped with the same combined matrix, as sketched below.
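A minimal sketch of the point-warping step (my own condensation of the corresponding lines in random_perspective below): points are lifted to homogeneous coordinates, multiplied by the combined 3x3 matrix M, and divided by the third coordinate only when a perspective component is present.

import numpy as np

def warp_points(pts, M, perspective=False):
    # pts: (n, 2) pixel coordinates; M: combined 3x3 transform (T @ S @ R @ P @ C)
    xy = np.ones((len(pts), 3))
    xy[:, :2] = pts
    xy = xy @ M.T                                                # homogeneous transform
    return xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]  # perspective rescale or affine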

def random_perspective(im,
                       targets=(),
                       segments=(),
                       degrees=10,
                       translate=.1,
                       scale=.1,
                       shear=10,
                       perspective=0.0,
                       border=(0, 0)):
    # torchvision.transforms.RandomAffine(degrees=(-10, 10), translate=(0.1, 0.1), scale=(0.9, 1.1), shear=(-10, 10))
    # targets = [cls, xyxy]
    # use targets = [cls, xyxy, landmarks] instead

    height = im.shape[0] + border[0] * 2  # shape(h,w,c)
    width = im.shape[1] + border[1] * 2

    # Center
    C = np.eye(3)
    C[0, 2] = -im.shape[1] / 2  # x translation (pixels)
    C[1, 2] = -im.shape[0] / 2  # y translation (pixels)

    # Perspective
    P = np.eye(3)
    P[2, 0] = random.uniform(-perspective, perspective)  # x perspective (about y)
    P[2, 1] = random.uniform(-perspective, perspective)  # y perspective (about x)

    # Rotation and Scale
    R = np.eye(3)
    a = random.uniform(-degrees, degrees)
    # a += random.choice([-180, -90, 0, 90])  # add 90deg rotations to small rotations
    s = random.uniform(1 - scale, 1 + scale)
    # s = 2 ** random.uniform(-scale, scale)
    R[:2] = cv2.getRotationMatrix2D(angle=a, center=(0, 0), scale=s)

    # Shear
    S = np.eye(3)
    S[0, 1] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # x shear (deg)
    S[1, 0] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # y shear (deg)

    # Translation
    T = np.eye(3)
    T[0, 2] = random.uniform(0.5 - translate, 0.5 + translate) * width  # x translation (pixels)
    T[1, 2] = random.uniform(0.5 - translate, 0.5 + translate) * height  # y translation (pixels)

    # Combined rotation matrix
    M = T @ S @ R @ P @ C  # order of operations (right to left) is IMPORTANT
    if (border[0] != 0) or (border[1] != 0) or (M != np.eye(3)).any():  # image changed
        if perspective:
            im = cv2.warpPerspective(im, M, dsize=(width, height), borderValue=(114, 114, 114))
        else:  # affine
            im = cv2.warpAffine(im, M[:2], dsize=(width, height), borderValue=(114, 114, 114))

    # Visualize
    # import matplotlib.pyplot as plt
    # ax = plt.subplots(1, 2, figsize=(12, 6))[1].ravel()
    # ax[0].imshow(im[:, :, ::-1])  # base
    # ax[1].imshow(im2[:, :, ::-1])  # warped

    # Transform label coordinates
    n = len(targets)
    if n:
        use_segments = any(x.any() for x in segments)
        new = np.zeros((n, 4))
        if use_segments:  # warp segments
            segments = resample_segments(segments)  # upsample
            for i, segment in enumerate(segments):
                xy = np.ones((len(segment), 3))
                xy[:, :2] = segment
                xy = xy @ M.T  # transform
                xy = xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]  # perspective rescale or affine

                # clip
                new[i] = segment2box(xy, width, height)

        else:  # warp boxes
            # xy = np.ones((n * 4, 3))
            # xy[:, :2] = targets[:, [1, 2, 3, 4, 1, 4, 3, 2]].reshape(n * 4, 2)  # x1y1, x2y2, x1y2, x2y1
            # xy = xy @ M.T  # transform
            # xy = (xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]).reshape(n, 8)  # perspective rescale or affine
            xy = np.ones((n * (4 + 5), 3))
            # x1y1, x2y2, x1y2, x2y1, kp1, kp2, kp3, kp4, kp5
            xy[:, :2] = (
                targets[:, [1, 2, 3, 4, 1, 4, 3, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]].reshape(n * 9, 2))
            xy = xy @ M.T  # transform
            xy = (xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]).reshape(n, 18)  # perspective rescale or affine

            # create new boxes
            x = xy[:, [0, 2, 4, 6]]
            y = xy[:, [1, 3, 5, 7]]

            landmarks = xy[:, [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]]
            mask_landmarks = np.array(targets[:, 5:15] > 0, dtype=np.int32)
            landmarks = landmarks * mask_landmarks
            landmarks = landmarks + mask_landmarks - 1

            landmarks = np.where(landmarks < 0, -1, landmarks)
            landmarks[:, [0, 2, 4, 6, 8]] = np.where(landmarks[:, [0, 2, 4, 6, 8]] > width, -1,
                                                     landmarks[:, [0, 2, 4, 6, 8]])
            landmarks[:, [1, 3, 5, 7, 9]] = np.where(landmarks[:, [1, 3, 5, 7, 9]] > height, -1,
                                                     landmarks[:, [1, 3, 5, 7, 9]])

            new = np.concatenate((x.min(1), y.min(1), x.max(1), y.max(1))).reshape(4, n).T
            new = np.concatenate((new, landmarks), axis=1)

            # clip
            new[:, [0, 2]] = new[:, [0, 2]].clip(0, width)
            new[:, [1, 3]] = new[:, [1, 3]].clip(0, height)

        # filter candidates
        # i = box_candidates(box1=targets[:, 1:5].T * s, box2=new.T, area_thr=0.01 if use_segments else 0.10)
        i = box_candidates(box1=targets[:, 1:5].T * s, box2=new[:, :4].T, area_thr=0.01 if use_segments else 0.10)
        targets = targets[i]
        # targets[:, 1:5] = new[i]
        targets[:, 1:(5 + 10)] = new[i]

    return im, targets

Letterbox

After letterbox padding, the box coordinates are rescaled, and the keypoint coordinates must be rescaled in the same way:

# Letterbox
shape = self.batch_shapes[self.batch[index]] if self.rect else self.img_size  # final letterboxed shape
img, ratio, pad = letterbox(img, shape, auto=False, scaleup=self.augment)
shapes = (h0, w0), ((h / h0, w / w0), pad)  # for COCO mAP rescaling

labels = self.labels[index].copy()
if labels.size:  # normalized xywh to pixel xyxy format
    # labels[:, 1:] = xywhn2xyxy(labels[:, 1:], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])
    labels[:, 1:5] = xywhn2xyxy(labels[:, 1:5], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])

    labels[:, 5::2] = np.where(labels[:, 5::2] < 0, -1, labels[:, 5::2] * w + pad[0])
    labels[:, 6::2] = np.where(labels[:, 6::2] < 0, -1, labels[:, 6::2] * h + pad[1])

Horizontal Flip

When flipping the image left-right, the left/right eyes and left/right mouth corners among the keypoints also have to be swapped:

# Flip left-right
if random.random() < hyp['fliplr']:
    img = np.fliplr(img)
    if nl:
        labels[:, 1] = 1 - labels[:, 1]

        labels[:, 5::2] = np.where(labels[:, 5::2] < 0, -1, 1 - labels[:, 5::2])

        # After mirroring, left/right eyes and left/right mouth corners would be
        # indistinguishable, so swap them to make the targets consistent for learning
        eye_left = np.copy(labels[:, [5, 6]])
        mouth_left = np.copy(labels[:, [11, 12]])
        labels[:, [5, 6]] = labels[:, [7, 8]]
        labels[:, [7, 8]] = eye_left
        labels[:, [11, 12]] = labels[:, [13, 14]]
        labels[:, [13, 14]] = mouth_left

Loss Function

The loss builds on the yolov5 loss with two additions: how the landmark coordinates are computed, and how the landmark loss is computed. Following the box formulation, the paper combines anchors and grid cells to compute the face keypoint coordinates:

\[ landmarks = (landmarks \cdot anchor_i + grid_i) \cdot stride_i \]

For the loss itself, the paper settles on Wing loss, which uses a logarithmic curve when the error between predicted and ground-truth coordinates is small and a linear one when the error is large; this limits the influence of outliers while keeping the loss sensitive to small errors.
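In formula form (with width parameter w and curvature \(\epsilon\), matching w=10 and e=2 in the implementation below):

\[ \mathrm{wing}(x) = \begin{cases} w \ln(1 + |x| / \epsilon), & |x| < w \\ |x| - C, & \text{otherwise} \end{cases} \qquad C = w - w \ln(1 + w / \epsilon) \]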

class WingLoss(nn.Module):
    def __init__(self, w=10, e=2):
        super(WingLoss, self).__init__()
        # https://arxiv.org/pdf/1711.06753v4.pdf Figure 5
        self.w = w
        self.e = e
        self.C = self.w - self.w * np.log(1 + self.w / self.e)

    def forward(self, x, t, sigma=1):
        weight = torch.ones_like(t)
        weight[torch.where(t == -1)] = 0
        diff = weight * (x - t)
        abs_diff = diff.abs()
        flag = (abs_diff.data < self.w).float()
        y = flag * self.w * torch.log(1 + abs_diff / self.e) + (1 - flag) * (abs_diff - self.C)
        return y.sum()


class LandmarksLoss(nn.Module):
    # BCEwithLogitLoss() with reduced missing label effects.
    def __init__(self, alpha=1.0):
        super(LandmarksLoss, self).__init__()
        self.loss_fcn = WingLoss()  # nn.SmoothL1Loss(reduction='sum')
        self.alpha = alpha

    def forward(self, pred, truel, mask):
        loss = self.loss_fcn(pred * mask, truel * mask)
        return loss / (torch.sum(mask) + 10e-14)
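A minimal usage sketch of the two classes above (dummy tensors; in the real training loss the mask is built from the target landmarks, where -1 marks a missing annotation):

import torch

landmarks_loss = LandmarksLoss(alpha=1.0)
pred = torch.randn(4, 10)      # predicted landmark offsets for 4 matched targets
target = torch.rand(4, 10)     # ground-truth landmark offsets
target[2] = -1                 # one face without landmark annotation
mask = (target > -1).float()   # 1 where a keypoint exists, 0 otherwise
loss = landmarks_loss(pred, target, mask)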

Post-processing

The post-processing function non_max_suppression filters predictions by a confidence threshold and then removes lower-confidence overlapping boxes of the same class by an IoU threshold. Face keypoints are not involved in either computation, but their columns must be carried along: whenever a box is kept, its keypoint coordinates must be kept with it, and whenever a box is filtered out, its keypoints are dropped too. A minimal sketch of the idea follows.
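This is not the project's non_max_suppression, just the landmark-preserving pattern, assuming each prediction row has already been decoded to [x1, y1, x2, y2, conf, cls, 10 landmark values]:

import torch
import torchvision

def nms_keep_landmarks(pred, conf_thres=0.25, iou_thres=0.45):
    # pred: (N, 16); the landmark columns ride along with whatever indices NMS keeps
    pred = pred[pred[:, 4] > conf_thres]                          # confidence filter
    keep = torchvision.ops.nms(pred[:, :4], pred[:, 4], iou_thres)  # IoU filter on boxes only
    return pred[keep]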

Summary

YOLO5Face is an excellent face detection framework: it inherits the strengths of the YOLO series and adds face landmark regression on top. Built on the mature YOLOv5 detector, it improves both the accuracy and the efficiency of face detection. As the YOLO series keeps evolving, we can expect further variants along the same lines, such as YOLO6Face, YOLO7Face, or YOLO8Face.