DETR3D and other DETR-style detectors struggle with inaccurate coordinate (reference-point) prediction and with complex feature-sampling pipelines. PETR sidesteps both by encoding 3D position information directly into the image features.

The PETR framework is shown below.

Steps:
1. Discretize the camera frustum space into meshgrid coordinates of size $(W_F, H_F, D)$.
2. Transform the meshgrid coordinates into 3D world space using the camera parameters.
3. Extract 3D position-aware features by combining the 2D image features and the 3D world-space coordinates in the 3D Position Encoder, which maps the coordinates to visual cues with an MLP to efficiently model 3D objects, 3D scenes, and 2D images.
4. Let object queries interact with the 3D position-aware features to predict 3D bounding boxes. The queries are obtained by transforming learnable anchor points in 3D coordinates with an MLP (the query generator).
The input is images from $N$ camera views.
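Step 2 above can be sketched with homogeneous coordinates; the `cam_to_world` matrix here is a hypothetical stand-in for the real inverse-projection matrix obtained from camera calibration:

```python
import torch

def frustum_to_world(u, v, d, cam_to_world):
    """Lift a frustum point (u*d, v*d, d, 1) into homogeneous world coordinates.

    u, v : pixel coordinates, d : depth along the ray,
    cam_to_world : 4x4 transform from camera frustum space to world space.
    """
    p_frustum = torch.tensor([u * d, v * d, d, 1.0])  # homogeneous frustum point
    return cam_to_world @ p_frustum                   # [4] world-space point

# Sanity check: with an identity transform the point is unchanged
p = frustum_to_world(0.5, 0.25, 10.0, torch.eye(4))
```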
Positional Embedding (3D)
import torch
import torch.nn as nn

class PositionEmbedding3D(nn.Module):
    def __init__(self, embed_dim=256, num_depth_bins=64):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_depth_bins = num_depth_bins
        # MLP to encode (X, Y, Z) into embedding space
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim // 2),
            nn.ReLU(),
            nn.Linear(embed_dim // 2, embed_dim),
        )
        # Precompute depth bins (can also be learned)
        self.register_buffer('depth_bins', torch.linspace(1, 100, num_depth_bins))

    def forward(self, H, W, device):
        # 1. Generate 2D pixel coordinates (u, v)
        u = torch.arange(W, dtype=torch.float32, device=device)  # [W]
        v = torch.arange(H, dtype=torch.float32, device=device)  # [H]
        u, v = torch.meshgrid(u, v, indexing='xy')  # each [H, W]
        # Normalize coordinates to [-1, 1]
        u = (u / W) * 2 - 1  # [H, W]
        v = (v / H) * 2 - 1  # [H, W]
        # 2. Generate 3D points for all depth bins -> [H, W, D, 3]
        X = u.unsqueeze(-1) * self.depth_bins  # [H, W, D]
        Y = v.unsqueeze(-1) * self.depth_bins  # [H, W, D]
        Z = self.depth_bins.view(1, 1, -1).expand(H, W, -1)  # [H, W, D]
        xyz = torch.stack([X, Y, Z], dim=-1)  # [H, W, D, 3]
        # 3. Project 3D points into embedding space
        embedding = self.mlp(xyz.flatten(0, 2))  # [H*W*D, embed_dim]
        return embedding.view(H, W, self.num_depth_bins, self.embed_dim)

Query Generator
class PETRQueryGenerator(nn.Module):
    def __init__(self, num_queries=900, embed_dim=256):
        super().__init__()
        # Learnable content queries
        self.content_queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        # 3D anchor points, initialized uniformly in normalized 3D space
        self.ref_points = nn.Parameter(torch.rand(num_queries, 3))  # [x, y, z]
        # Small MLP (two linear layers) mapping anchor points to query embeddings
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self):
        # Encode the 3D anchor points into positional query embeddings
        pos_embed = self.mlp(self.ref_points)  # [num_queries, embed_dim]
        return self.content_queries + pos_embed  # initial object queries

Detection Head and Loss in PETR
(a) Detection Head
The detection head in PETR consists of two parallel branches:
Classification Branch: Predicts the probability distribution over object classes (e.g., car, pedestrian, cyclist).
Regression Branch: Predicts the 7-DoF 3D bounding box parameters:
Center $(x, y, z)$ in world coordinates.
Dimensions $(w, h, l)$ (width, height, length).
Rotation $\theta$ around the vertical axis (yaw).
Each object query from the decoder outputs:
A class score vector (class probabilities).
A 7-DoF box regression vector (unbounded, real-valued).
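The two branches above can be sketched as a pair of small MLP heads applied to each decoder query; the layer sizes here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PETRDetectionHead(nn.Module):
    def __init__(self, embed_dim=256, num_classes=10):
        super().__init__()
        # Classification branch: per-query class logits
        self.cls_branch = nn.Linear(embed_dim, num_classes)
        # Regression branch: 7-DoF box (x, y, z, w, h, l, yaw)
        self.reg_branch = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 7),
        )

    def forward(self, queries):
        # queries: [num_queries, embed_dim] decoder output
        cls_scores = self.cls_branch(queries)  # [num_queries, num_classes]
        bbox_preds = self.reg_branch(queries)  # [num_queries, 7]
        return cls_scores, bbox_preds

head = PETRDetectionHead()
scores, boxes = head(torch.randn(900, 256))
```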
(b) Objective Function
PETR optimizes detection using a multi-task loss:
$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{reg}\,\mathcal{L}_{reg}$$
Classification Loss ($\mathcal{L}_{cls}$):
Focal Loss (a variant of cross-entropy that handles class imbalance):
$$\mathcal{L}_{cls} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
The framework components in more detail:
Backbone Network: a shared 2D backbone extracts multi-view image features $F^{2d}$.
Meshgrid: the camera frustum space is discretized into meshgrid coordinates of size $(W_F, H_F, D)$. Each point in the meshgrid is denoted $p_j^m = (u_j \times d_j, v_j \times d_j, d_j, 1)^T$, where $(u_j, v_j)$ is a pixel coordinate and $d_j$ the depth in camera coordinates; $j$ denotes the $j$-th meshgrid point.
3D Coordinates Generator: 3D world-space coordinates are generated by transforming the meshgrid points with the camera parameters, $p^{3d}_{i,j} = K_i^{-1} p_j^m$, where $K_i \in \mathbb{R}^{4 \times 4}$ is the transformation matrix of the $i$-th view; the meshgrid itself is shared across all views. Since there are $D$ depths per pixel in the meshgrid, the normalized coordinates form a tensor of size $(W_F, H_F, D \times 4)$, denoted $P^{3d}_i$.
3D Position Encoder: produces the 3D position-aware features, i.e., $F^{3d}_i = \psi(F^{2d}_i, P^{3d}_i)$.
Query Generator: a set of learnable anchor points in 3D space, initialized from a uniform distribution, is fed to a small MLP (two linear layers) to obtain the initial object queries $Q_0$; this anchor-point initialization is useful for convergence.
Decoder: standard transformer decoder layers in which the object queries attend to the 3D position-aware features before being passed to the detection head.
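The 3D position encoder $\psi$ above can be sketched as adding an embedding of the per-pixel 3D coordinates to the 2D features; the MLP-based fusion here is an illustrative simplification, not the paper's exact layer configuration:

```python
import torch
import torch.nn as nn

class Simple3DPositionEncoder(nn.Module):
    """Illustrative psi: fuse 2D features with embedded 3D coordinates."""
    def __init__(self, feat_dim=256, depth_bins=64):
        super().__init__()
        # Embed the (D * 4)-dim homogeneous coordinates at each pixel
        self.coord_mlp = nn.Sequential(
            nn.Linear(depth_bins * 4, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feat_2d, coords_3d):
        # feat_2d:   [H, W, C]      2D image features
        # coords_3d: [H, W, D * 4]  normalized 3D coordinates
        return feat_2d + self.coord_mlp(coords_3d)  # [H, W, C]

enc = Simple3DPositionEncoder()
out = enc(torch.randn(16, 44, 256), torch.rand(16, 44, 64 * 4))
```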
In the focal loss, $p_t$ is the predicted probability for the target class, and $\alpha_t$, $\gamma$ are hyperparameters that balance easy and hard samples.
Regression Loss ($\mathcal{L}_{reg}$):
L1 loss over the box coordinates:
$$\mathcal{L}_{reg} = \| \mathbf{b}_{pred} - \mathbf{b}_{gt} \|_1$$
$\mathbf{b}_{pred}$: predicted box parameters.
$\mathbf{b}_{gt}$: ground-truth box.
Optionally, Smooth L1 or a GIoU-style term can be used for rotation stability.
Balancing Weights ($\lambda_{cls}$, $\lambda_{reg}$): tuned empirically.
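A minimal sketch of the two loss terms, assuming one-hot classification targets and already-matched query/ground-truth pairs (the Hungarian matching step is omitted); the weights in the last line are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over class logits; targets are one-hot."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance term
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def l1_box_loss(pred_boxes, gt_boxes):
    """L1 loss between predicted and ground-truth 7-DoF boxes."""
    return (pred_boxes - gt_boxes).abs().mean()

logits = torch.randn(4, 10)
targets = F.one_hot(torch.tensor([1, 3, 0, 7]), num_classes=10).float()
cls_loss = focal_loss(logits, targets)
reg_loss = l1_box_loss(torch.randn(4, 7), torch.randn(4, 7))
total = 2.0 * cls_loss + 0.25 * reg_loss  # illustrative lambda weights
```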
Semere Gerezgiher Tesfay