DETR3D and other DETR-style detectors struggle with inaccurate coordinate (reference-point) prediction and with complex feature-sampling pipelines. PETR sidesteps both by encoding 3D position information directly into the image features.

The PETR framework is shown below.

Steps:
1. Discretize the camera frustum space into meshgrid coordinates of size $(W_F, H_F, D)$.
2. Transform the meshgrid coordinates into 3D world space using the camera parameters.
3. Extract 3D position-aware features by combining the 2D image features and the 3D world-space coordinates in the 3D Position Encoder, which maps the coordinates to visual cues with an MLP to efficiently model 3D objects, 3D scenes, and 2D images.
4. Let object queries interact with the 3D position-aware features to predict 3D bounding boxes. The queries are obtained by transforming learnable anchor points in 3D coordinates with an MLP (the query generator).
The input is images from $N$ camera views.
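Step 2 above can be sketched with homogeneous coordinates; the `cam_to_world` matrix here is a hypothetical stand-in for the real inverse-projection matrix obtained from camera calibration:

```python
import torch

def frustum_to_world(u, v, d, cam_to_world):
    """Lift a frustum point (u*d, v*d, d, 1) into homogeneous world coordinates.

    u, v : pixel coordinates, d : depth along the ray,
    cam_to_world : 4x4 transform from camera frustum space to world space.
    """
    p_frustum = torch.tensor([u * d, v * d, d, 1.0])  # homogeneous frustum point
    return cam_to_world @ p_frustum                   # [4] world-space point

# Sanity check: with an identity transform the point is unchanged
p = frustum_to_world(0.5, 0.25, 10.0, torch.eye(4))
```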
Positional Embedding (3D)
import torch
import torch.nn as nn

class PositionEmbedding3D(nn.Module):
    def __init__(self, embed_dim=256, num_depth_bins=64):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_depth_bins = num_depth_bins
        # MLP to encode (X, Y, Z) into embedding space
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim // 2),
            nn.ReLU(),
            nn.Linear(embed_dim // 2, embed_dim),
        )
        # Precompute depth bins (can also be learned)
        self.register_buffer('depth_bins', torch.linspace(1, 100, num_depth_bins))

    def forward(self, H, W, device):
        # 1. Generate 2D pixel coordinates (u, v)
        u = torch.arange(W, dtype=torch.float32, device=device)  # [W]
        v = torch.arange(H, dtype=torch.float32, device=device)  # [H]
        u, v = torch.meshgrid(u, v, indexing='xy')  # each [H, W]
        # Normalize coordinates to [-1, 1]
        u = (u / W) * 2 - 1  # [H, W]
        v = (v / H) * 2 - 1  # [H, W]
        # 2. Generate 3D points for all depth bins -> [H, W, D, 3]
        X = u.unsqueeze(-1) * self.depth_bins  # [H, W, D]
        Y = v.unsqueeze(-1) * self.depth_bins  # [H, W, D]
        Z = self.depth_bins.view(1, 1, -1).expand(H, W, -1)  # [H, W, D]
        xyz = torch.stack([X, Y, Z], dim=-1)  # [H, W, D, 3]
        # 3. Project 3D points into embedding space
        embedding = self.mlp(xyz.flatten(0, 2))  # [H*W*D, embed_dim]
        return embedding.view(H, W, self.num_depth_bins, self.embed_dim)

Query Generator
class PETRQueryGenerator(nn.Module):
    def __init__(self, num_queries=900, embed_dim=256):
        super().__init__()
        # Learnable content queries
        self.content_queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        # 3D anchor points, initialized uniformly in normalized 3D space
        self.ref_points = nn.Parameter(torch.rand(num_queries, 3))  # [x, y, z]
        # Small MLP (two linear layers) mapping anchor points to query embeddings
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self):
        # Encode the 3D anchor points into positional query embeddings
        pos_embed = self.mlp(self.ref_points)  # [num_queries, embed_dim]
        return self.content_queries + pos_embed  # initial object queries

Detection Head and Loss in PETR
(a) Detection Head
The detection head in PETR consists of two parallel branches:
Classification Branch: Predicts the probability distribution over object classes (e.g., car, pedestrian, cyclist).
Regression Branch: Predicts the 7-DoF 3D bounding box parameters:
Center $(x, y, z)$ in world coordinates.
Dimensions $(w, h, l)$ (width, height, length).
Rotation $\theta$ around the vertical axis (yaw).
Each object query from the decoder outputs:
A class score vector (class probabilities).
A 7-DoF box regression vector (unbounded, real-valued).
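The two branches above can be sketched as a pair of small MLP heads applied to each decoder query; the layer sizes here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PETRDetectionHead(nn.Module):
    def __init__(self, embed_dim=256, num_classes=10):
        super().__init__()
        # Classification branch: per-query class logits
        self.cls_branch = nn.Linear(embed_dim, num_classes)
        # Regression branch: 7-DoF box (x, y, z, w, h, l, yaw)
        self.reg_branch = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 7),
        )

    def forward(self, queries):
        # queries: [num_queries, embed_dim] decoder output
        cls_scores = self.cls_branch(queries)  # [num_queries, num_classes]
        bbox_preds = self.reg_branch(queries)  # [num_queries, 7]
        return cls_scores, bbox_preds

head = PETRDetectionHead()
scores, boxes = head(torch.randn(900, 256))
```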
(b) Objective Function
PETR optimizes detection using a multi-task loss:
$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{reg}\,\mathcal{L}_{reg}$$
Classification Loss ($\mathcal{L}_{cls}$):
Focal Loss (a variant of cross-entropy that handles class imbalance):
$$\mathcal{L}_{cls} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
The framework components in more detail:
Backbone Network: a shared 2D backbone extracts multi-view image features $F^{2d}$.
Meshgrid: the camera frustum space is discretized into meshgrid coordinates of size $(W_F, H_F, D)$. Each point in the meshgrid is denoted $p_j^m = (u_j \times d_j, v_j \times d_j, d_j, 1)^T$, where $(u_j, v_j)$ is a pixel coordinate and $d_j$ the depth in camera coordinates; $j$ denotes the $j$-th meshgrid point.
3D Coordinates Generator: 3D world-space coordinates are generated by transforming the meshgrid points with the camera parameters, $p^{3d}_{i,j} = K_i^{-1} p_j^m$, where $K_i \in \mathbb{R}^{4 \times 4}$ is the transformation matrix of the $i$-th view; the meshgrid itself is shared across all views. Since there are $D$ depths per pixel in the meshgrid, the normalized coordinates form a tensor of size $(W_F, H_F, D \times 4)$, denoted $P^{3d}_i$.
3D Position Encoder: produces the 3D position-aware features, i.e., $F^{3d}_i = \psi(F^{2d}_i, P^{3d}_i)$.
Query Generator: a set of learnable anchor points in 3D space, initialized from a uniform distribution, is fed to a small MLP (two linear layers) to obtain the initial object queries $Q_0$; this anchor-point initialization is useful for convergence.
Decoder: standard transformer decoder layers in which the object queries attend to the 3D position-aware features before being passed to the detection head.
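The 3D position encoder $\psi$ above can be sketched as adding an embedding of the per-pixel 3D coordinates to the 2D features; the MLP-based fusion here is an illustrative simplification, not the paper's exact layer configuration:

```python
import torch
import torch.nn as nn

class Simple3DPositionEncoder(nn.Module):
    """Illustrative psi: fuse 2D features with embedded 3D coordinates."""
    def __init__(self, feat_dim=256, depth_bins=64):
        super().__init__()
        # Embed the (D * 4)-dim homogeneous coordinates at each pixel
        self.coord_mlp = nn.Sequential(
            nn.Linear(depth_bins * 4, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feat_2d, coords_3d):
        # feat_2d:   [H, W, C]      2D image features
        # coords_3d: [H, W, D * 4]  normalized 3D coordinates
        return feat_2d + self.coord_mlp(coords_3d)  # [H, W, C]

enc = Simple3DPositionEncoder()
out = enc(torch.randn(16, 44, 256), torch.rand(16, 44, 64 * 4))
```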
In the focal loss, $p_t$ is the predicted probability for the target class, and $\alpha_t$, $\gamma$ are hyperparameters that balance easy and hard samples.
Regression Loss ($\mathcal{L}_{reg}$):
L1 loss over the box coordinates:
$$\mathcal{L}_{reg} = \| \mathbf{b}_{pred} - \mathbf{b}_{gt} \|_1$$
$\mathbf{b}_{pred}$: predicted box parameters.
$\mathbf{b}_{gt}$: ground-truth box.
Optionally, Smooth L1 or a GIoU-style term can be used for rotation stability.
Balancing Weights ($\lambda_{cls}$, $\lambda_{reg}$): tuned empirically.
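A minimal sketch of the two loss terms, assuming one-hot classification targets and already-matched query/ground-truth pairs (the Hungarian matching step is omitted); the weights in the last line are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over class logits; targets are one-hot."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance term
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def l1_box_loss(pred_boxes, gt_boxes):
    """L1 loss between predicted and ground-truth 7-DoF boxes."""
    return (pred_boxes - gt_boxes).abs().mean()

logits = torch.randn(4, 10)
targets = F.one_hot(torch.tensor([1, 3, 0, 7]), num_classes=10).float()
cls_loss = focal_loss(logits, targets)
reg_loss = l1_box_loss(torch.randn(4, 7), torch.randn(4, 7))
total = 2.0 * cls_loss + 0.25 * reg_loss  # illustrative lambda weights
```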
Semere Gerezgiher Tesfay