CenterNet: Objects as Points
This paper represents objects by a single point at their bounding box center. Object detection is then a standard keypoint estimation problem.
Other properties, such as object size, 3D extent, orientation, and pose, are then regressed directly from image features at the center location.
The input image is fed to a fully convolutional network that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object bounding box height and width. The model trains using standard dense supervised learning. Inference is a single network forward pass, with no non-maximum suppression for post-processing.
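At decode time, a 3×3 local-maximum check on the heatmap takes the place of non-maximum suppression. A minimal NumPy sketch of this peak extraction (a hypothetical helper, not the paper's implementation):

```python
import numpy as np

def topk_peaks(heatmap, k=5):
    """Find local maxima in a single-class center heatmap.

    A pixel counts as a peak if it equals the maximum of its 3x3
    neighborhood; this check replaces NMS in CenterNet-style decoding.
    Returns up to k (y, x, score) tuples sorted by score.
    """
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # 3x3 max filter built from shifted views of the padded map
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    local_max = neigh.max(axis=0)
    is_peak = heatmap >= local_max  # the center is included in its own window
    scores = np.where(is_peak, heatmap, -np.inf).ravel()
    idx = np.argsort(scores)[::-1][:k]
    ys, xs = np.unravel_index(idx, (H, W))
    return list(zip(ys.tolist(), xs.tolist(), heatmap[ys, xs].tolist()))
```

In practice the same effect is usually achieved with a stride-1 max-pooling layer so that decoding stays on the GPU.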
Other techniques used:
- a simple ResNet-18
- up-convolutional layers
- DLA-34 (deep layer aggregation, a keypoint detection network)
- Hourglass-104 (a keypoint estimation network)
- multi-scale testing
Loss function
For each ground-truth keypoint $p \in \mathbb{R}^2$ of class $c$:
- low-resolution equivalent: $\tilde{p} = \lfloor p/R \rfloor$, where $R$ is the output stride;
- predicted heatmap: $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$;
- ground-truth keypoints are splatted onto a heatmap $Y$ using a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is an object-size-adaptive standard deviation.
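A minimal NumPy sketch of splatting one Gaussian target onto a single-class heatmap (a hypothetical helper; in the paper $\sigma_p$ is object-size-adaptive, here it is passed in directly):

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat an unnormalized Gaussian peak onto `heatmap` at `center`.

    heatmap: (H, W) array for one class; center: (cx, cy) low-res coords.
    Overlapping Gaussians are merged with an element-wise maximum, as in
    the paper. Modifies `heatmap` in place and returns it.
    """
    H, W = heatmap.shape
    cx, cy = center
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap
```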
The training objective is a penalty-reduced pixel-wise logistic regression with focal loss:

$$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^\alpha \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^\beta (\hat{Y}_{xyc})^\alpha \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases}$$

where $\alpha$ and $\beta$ are hyperparameters of the focal loss and $N$ is the number of keypoints in the image. To recover the discretization error caused by the output stride, the network additionally predicts a local offset $\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$.
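The penalty-reduced focal loss can be sketched in NumPy as follows (illustrative only; $\alpha = 2$ and $\beta = 4$ are the values used in the paper):

```python
import numpy as np

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over predicted heatmaps.

    pred, gt: arrays of shape (H, W, C); gt holds the Gaussian-splatted
    targets, with exactly 1.0 at each object center. Pixels near a center
    are down-weighted by the (1 - gt)^beta term.
    """
    pred = np.clip(pred, eps, 1 - eps)  # numerical safety for the logs
    pos = (gt == 1.0)
    n = max(pos.sum(), 1)  # N = number of object centers
    pos_loss = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = (((1 - gt) ** beta) * (pred ** alpha)
                * np.log(1 - pred))[~pos].sum()
    return -(pos_loss + neg_loss) / n
```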
All classes $c$ share the same offset prediction. The offset is trained with an L1 loss:

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|$$

Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bounding box of object $k$.
Its center point is $p_k = \left( \frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2} \right)$. The network regresses to the object size $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ for each object $k$, using a single prediction $\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for all object categories.
An L1 loss is used at the center point, similar to the offset objective:
$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|$$

Instead of normalizing the scale, the paper uses the raw pixel coordinates directly and scales this loss by a constant $\lambda_{size}$. The overall training objective is:
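Putting the offset and size heads together, here is a hypothetical NumPy sketch of the regression targets for one ground-truth box, and the inverse decode step used at inference (helper names are my own, not from the paper):

```python
import numpy as np

def targets(box, R=4):
    """Offset and size targets for one ground-truth box (x1, y1, x2, y2).

    Returns the low-res center cell p_tilde, the sub-pixel offset target
    p/R - p_tilde, and the size target s_k = (w, h) in raw pixel units
    (the paper does not normalize the scale). R is the output stride.
    """
    x1, y1, x2, y2 = box
    p = np.array([(x1 + x2) / 2, (y1 + y2) / 2])  # center p_k
    p_tilde = np.floor(p / R)                     # low-resolution equivalent
    offset = p / R - p_tilde                      # target for the O-hat head
    size = np.array([x2 - x1, y2 - y1])           # target for the S-hat head
    return p_tilde.astype(int), offset, size

def decode(p_tilde, offset, size, R=4):
    """Invert `targets`: rebuild (x1, y1, x2, y2) from the three heads."""
    cx, cy = (p_tilde + offset) * R  # refined center in input-image pixels
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Because the decode is a direct inversion of the targets at each heatmap peak, inference needs no anchor matching and no NMS.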
$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$$

Pay particular attention to the diagram in this paper!

Reference
- Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850.