

By Semere Gerezgiher Tesfay
Autonomous driving relies heavily on robust 3D perception systems, which require large-scale labeled datasets for training. However, annotating 3D data (LiDAR, camera) is expensive and time-consuming. Self-supervised learning (SSL) offers a solution by leveraging unlabeled data, but existing methods often borrow ideas from 2D vision without addressing 3D-specific challenges like sparsity and irregularity in point clouds.
Enter UniPAD, a novel SSL framework that introduces 3D volumetric differentiable rendering for pre-training. Unlike contrastive or masked autoencoding (MAE) methods, UniPAD reconstructs continuous 3D shapes and their 2D projections through neural rendering, enabling better feature learning for both LiDAR and camera-based models.
First 3D Differentiable Rendering for SSL in Autonomous Driving
Uses neural rendering to implicitly encode 3D geometry and appearance.
Unified Pre-training for 2D & 3D Modalities
Works with LiDAR, cameras, or fused inputs.
Memory-Efficient Ray Sampling
Reduces computational costs while maintaining accuracy.
State-of-the-Art Performance
Achieves 73.2 NDS (detection) and 79.4 mIoU (segmentation) on nuScenes.
Existing SSL methods for 3D perception suffer from:
Contrastive Learning: Sensitive to positive/negative sample selection.
MAE-based Methods: Struggle with sparse, irregular point clouds.
Lack of Unified 2D-3D Pre-training: Most approaches specialize in one modality.
UniPAD addresses these by:
Reconstructing masked 3D scenes via rendering (no contrastive pairs needed).
Handling both LiDAR and camera data in a single framework.
UniPAD consists of:
Modal-Specific Encoder (processes LiDAR/camera inputs).
Unified 3D Volumetric Representation (combines modalities).
Neural Rendering Decoder (reconstructs masked regions).
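The three components above compose into a single pre-training pipeline. A hypothetical skeleton (the class and attribute names are mine, not the authors'):

```python
# Hypothetical skeleton (names are mine, not the authors' code) showing how
# the three UniPAD components compose into one pre-training forward pass.
class UniPADPretrainSketch:
    def __init__(self, encoder, to_voxels, render_decoder):
        self.encoder = encoder                # modal-specific (LiDAR or camera)
        self.to_voxels = to_voxels            # unified 3D volumetric representation
        self.render_decoder = render_decoder  # neural rendering head

    def forward(self, masked_input):
        feats = self.encoder(masked_input)
        volume = self.to_voxels(feats)
        # The decoder renders RGB/depth along sampled rays; the pre-training
        # loss compares these renderings against the unmasked targets.
        return self.render_decoder(volume)
```

The point of the sketch is that the encoder is the only modality-specific piece; everything after the voxel representation is shared.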

Input: Masked LiDAR point clouds P or multi-view images I.
Masking: Block-wise masking (size and ratio: 8 and 0.8 for LiDAR, 32 and 0.3 for images).
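Block-wise masking can be sketched as follows (a hypothetical NumPy helper, not the authors' implementation): pick non-overlapping blocks at random until the target ratio is reached, and zero them out.

```python
# Toy block-wise masking sketch (hypothetical helper, not the paper's code).
import numpy as np

def block_mask(shape, block_size, mask_ratio, rng):
    """Return a boolean mask (True = masked) over a 2D grid."""
    mask = np.zeros(shape, dtype=bool)
    n_blocks_h = shape[0] // block_size
    n_blocks_w = shape[1] // block_size
    n_total = n_blocks_h * n_blocks_w
    n_masked = int(round(mask_ratio * n_total))
    # Choose which blocks to mask without replacement.
    for idx in rng.choice(n_total, size=n_masked, replace=False):
        r, c = divmod(int(idx), n_blocks_w)
        mask[r * block_size:(r + 1) * block_size,
             c * block_size:(c + 1) * block_size] = True
    return mask

rng = np.random.default_rng(0)
# Image-style setting from the paper: block size 32, mask ratio 0.3.
m = block_mask((256, 256), 32, 0.3, rng)
```

The LiDAR case is analogous with block size 8 and ratio 0.8, applied over voxelized point coordinates rather than pixels.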
Encoders:
LiDAR: VoxelNet extracts hierarchical voxel features.
Camera: ConvNeXt extracts 2D features, which are lifted to 3D by projecting voxel centers into the images and sampling the feature maps.
Both modalities are transformed into a shared voxel grid V.
For images, 3D voxel coordinates p are projected to 2D pixel coordinates via p_img = T_c2i · T_l2c · p, where:
T_l2c = LiDAR-to-camera transform.
T_c2i = camera-to-image transform.
Voxel features are computed via bilinear (B) and trilinear (T) interpolation, where M(·) is a convolutional layer with softmax activation, while B and T retrieve the 2D features and the scaling factor, respectively.
A projection layer (3D convs) further refines V.
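The projection-and-sampling step can be illustrated with a small NumPy sketch (my own toy version under the assumptions above, not the authors' implementation): project each voxel center through a combined 3×4 LiDAR-to-image matrix, then bilinearly sample the 2D feature map at the resulting pixel.

```python
# Toy voxel-to-image feature lifting (illustrative, not the paper's code).
import numpy as np

def project_and_sample(feat2d, voxel_centers, cam):
    """feat2d: (H, W, C) feature map; voxel_centers: (N, 3) points in the
    LiDAR/ego frame; cam: (3, 4) combined LiDAR-to-image projection."""
    H, W, C = feat2d.shape
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    uvw = homog @ cam.T                      # (N, 3): (u*z, v*z, z)
    z = uvw[:, 2]
    valid = z > 1e-3                         # keep voxels in front of the camera
    u = np.where(valid, uvw[:, 0] / np.maximum(z, 1e-3), -1.0)
    v = np.where(valid, uvw[:, 1] / np.maximum(z, 1e-3), -1.0)
    out = np.zeros((len(voxel_centers), C))
    inside = valid & (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    u0 = np.floor(u[inside]).astype(int); v0 = np.floor(v[inside]).astype(int)
    u1 = np.minimum(u0 + 1, W - 1); v1 = np.minimum(v0 + 1, H - 1)
    du = u[inside] - u0; dv = v[inside] - v0
    # Bilinear interpolation of the four neighbouring feature vectors.
    out[inside] = (feat2d[v0, u0] * ((1 - du) * (1 - dv))[:, None]
                   + feat2d[v0, u1] * (du * (1 - dv))[:, None]
                   + feat2d[v1, u0] * ((1 - du) * dv)[:, None]
                   + feat2d[v1, u1] * (du * dv)[:, None])
    return out
```

Voxels that project outside the image (or behind the camera) simply receive zero features here; the paper's depth-distribution weighting (the softmax layer M) is omitted for brevity.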
Reconstructs masked regions using differentiable volume rendering:
Ray Sampling: For each pixel, cast a ray from the camera origin o along direction d.
Memory-Friendly Ray Sampling:
Dilation: Skips rays at fixed intervals.
Random: Selects random rays.
Depth-Aware: Prioritizes rays near objects (LiDAR-guided).
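The three strategies can be sketched as index selectors (a toy version with names and weightings of my own choosing, not the paper's code): each returns the (row, col) pixels for which rays will be cast.

```python
# Toy sketches of the three ray-selection strategies (hypothetical helpers).
import numpy as np

def dilation_sample(H, W, stride):
    """Keep one ray every `stride` pixels in each direction."""
    rows, cols = np.meshgrid(np.arange(0, H, stride),
                             np.arange(0, W, stride), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1)

def random_sample(H, W, n, rng):
    """Pick n distinct rays uniformly at random."""
    idx = rng.choice(H * W, size=n, replace=False)
    return np.stack(np.divmod(idx, W), axis=1)

def depth_aware_sample(H, W, n, depth_map, rng):
    """Bias sampling toward pixels with valid (projected LiDAR) depth."""
    weights = np.where(depth_map > 0, 1.0, 0.05).ravel()  # toy weighting
    weights /= weights.sum()
    idx = rng.choice(H * W, size=n, replace=False, p=weights)
    return np.stack(np.divmod(idx, W), axis=1)
```

The depth-aware variant is why LiDAR helps even when pre-training the camera branch: it concentrates the rendering budget on pixels where depth supervision exists.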

Loss Function:
Combines L1 losses for RGB and depth: L = λ₁ ‖Ŷᴿᴳᴮ − Yᴿᴳᴮ‖₁ + λ₂ ‖Ŷᵈᵉᵖᵗʰ − Yᵈᵉᵖᵗʰ‖₁.
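Assuming the weighted-L1 form above (with both weights set to 10, as in the paper's configuration), the loss is a one-liner:

```python
# Toy version of the combined rendering loss (assumed weighted-L1 form).
import numpy as np

def render_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
                lam_rgb=10.0, lam_depth=10.0):
    l_rgb = np.abs(pred_rgb - gt_rgb).mean()      # L1 on rendered color
    l_depth = np.abs(pred_depth - gt_depth).mean()  # L1 on rendered depth
    return lam_rgb * l_rgb + lam_depth * l_depth
```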
Experiments use MMDetection3D and train on 4× NVIDIA A100 GPUs.
Image resolution: 1600 × 900
Voxel size: [0.075, 0.075, 0.2]
Data augmentation: Random scaling, rotation, and masking (image mask: size 32, ratio 0.3; point mask: size 8, ratio 0.8).
Encoders:
Image: ConvNeXt-small
Point cloud: VoxelNet
Voxel representation:
Feature projection: a convolutional layer reduces voxel features to 32 dimensions.
Decoders:
SDF: 6-layer MLP
RGB: 4-layer MLP
Rendering: 512 rays/image, 96 points/ray.
Loss weights: λ₁ = 10, λ₂ = 10.
Optimizer: AdamW (LR: 2e−5 for points, 1e−4 for images).
Training schedule: 12 epochs of pre-training, then 12–20 epochs of fine-tuning on 20–50% of the data (point- and image-based models, respectively).
No CBGS or cut-and-paste augmentation is used.

LiDAR Detection: +9.1 NDS over baselines.
Camera Detection: +7.7 NDS.
Multi-Modal Fusion: 73.2 NDS, matching SOTA.
Segmentation: 79.4 mIoU (nuScenes val set).
Continuous 3D Supervision: Rendering enforces geometric consistency.
Modality-Agnostic: Shared voxel space bridges 2D/3D gaps.
Efficiency: Depth-aware ray sampling cuts compute costs.
UniPAD rethinks SSL for autonomous driving by merging neural rendering with multi-modal pre-training. Its flexibility and performance make it a promising backbone for future perception systems.
Future Work: Extending to dynamic scenes and higher-resolution voxels.
Fig 1: UniPAD architecture (encoder → voxelization → rendering).
Fig 2: Comparison of ray sampling strategies.
Fig 3: UniPAD experiment result sample visualizations.
Would you like a deeper dive into the rendering math or ablation studies? Let me know in the comments! 💨
Signed Distance Field (SDF) & Color Prediction:
Sample points along the ray, i.e., pⱼ = o + tⱼ·d for j = 1, …, M.
Predict Signed Distance Field values sⱼ and RGB values cⱼ via MLPs.
Rendering: Accumulate color (Ŷᵢᴿᴳᴮ) and depth (Ŷᵢᵈᵉᵖᵗʰ) via
Ŷᵢᴿᴳᴮ = Σⱼ wⱼ cⱼ and Ŷᵢᵈᵉᵖᵗʰ = Σⱼ wⱼ tⱼ,
where wⱼ = Tⱼ αⱼ is an unbiased and occlusion-aware weight, with transmittance Tⱼ = Πₖ₌₁ʲ⁻¹ (1 − αₖ) and opacity αⱼ.
The opacity value is determined by
αⱼ = max((Φ(sⱼ) − Φ(sⱼ₊₁)) / Φ(sⱼ), 0),
where Φ(·) denotes the sigmoid function.
where nⱼ and fⱼ denote the surface normal (the SDF gradient at pⱼ) and the geometry feature vector from the SDF network, both of which condition the color MLP.
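The accumulation along one ray can be sketched numerically. This is my simplified NumPy toy of the NeuS-style formulas described above (unbiased opacity from consecutive sigmoid-mapped SDF values), not the authors' code, and it omits the normal/feature conditioning of the color MLP:

```python
# Toy SDF-based volume rendering along a single ray (NeuS-style weights).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def render_ray(sdf, rgb, t, s=10.0):
    """sdf: (M,) SDF values at the samples; rgb: (M, 3) colors; t: (M,)
    sample depths; s: sigmoid sharpness. Returns (color, depth)."""
    phi = sigmoid(s * sdf)
    # Unbiased, occlusion-aware opacity from consecutive SDF values.
    alpha = np.clip((phi[:-1] - phi[1:]) / np.maximum(phi[:-1], 1e-6), 0.0, 1.0)
    # Transmittance: probability the ray reaches sample j unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = trans * alpha
    color = (w[:, None] * rgb[:-1]).sum(axis=0)
    depth = (w * t[:-1]).sum()
    return color, depth
```

With an SDF that crosses zero between two samples, almost all of the weight lands on the sample just before the surface, so the rendered depth tracks the zero crossing.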