SplatSLAM
End-to-End Indoor Scene Reconstruction from Monocular Video via MASt3R-SLAM and 3DGS.
An integrated pipeline for high-fidelity indoor scene reconstruction bridging transformer-based SLAM and 3D Gaussian Splatting.
Abstract
We present SplatSLAM, an integrated pipeline for high-fidelity indoor reconstruction from monocular RGB input. It uses the MASt3R-SLAM framework for robust trajectory estimation and dense geometric priors, bridging learning-based SLAM and 3D Gaussian Splatting (3DGS). A standardized geometric refinement step based on statistical outlier removal (SOR) denoises the reconstructed geometry and removes unphysical floating artifacts before 3DGS optimization. Evaluated on the TUM-RGBD benchmark and self-collected datasets, the system achieves sub-10 cm localization accuracy (ATE RMSE) and photo-realistic novel view synthesis.
Methodology
10 FPS Frame Extraction
Pose & Dense Mapping
Geometric Refinement
NVS Optimization
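The first stage, 10 FPS frame extraction, amounts to choosing which source frames to keep. The following is a minimal sketch of that selection, not the project's actual extraction code: the function name `select_frame_indices` and its parameters are hypothetical, and it assumes a constant source frame rate.

```python
from typing import List


def select_frame_indices(n_frames: int, src_fps: float,
                         target_fps: float = 10.0) -> List[int]:
    """Pick source-frame indices that approximate sampling at target_fps.

    Hypothetical helper: assumes the video has a constant frame rate.
    """
    if n_frames < 0 or src_fps <= 0 or target_fps <= 0:
        raise ValueError("n_frames must be >= 0 and frame rates must be > 0")

    # Never upsample: if the video is already at or below the target rate,
    # keep every frame.
    if target_fps >= src_fps:
        return list(range(n_frames))

    indices = []
    k = 0
    while True:
        # The k-th output frame sits at time k / target_fps seconds,
        # which corresponds to source frame floor(k * src_fps / target_fps).
        idx = int(k * src_fps / target_fps)
        if idx >= n_frames:
            break
        indices.append(idx)
        k += 1
    return indices
```

For a 30 FPS recording this keeps every third frame; the selected indices can then be passed to whatever video reader the pipeline uses.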
Experiments
SLAM Benchmarking (TUM-RGBD)
| Sequence | ATE RMSE (m) ↓ | ATE Mean (m) |
|---|---|---|
| fr1_room | 0.0987 | 0.0909 |
| fr1_360 | 0.0717 | 0.0667 |
Analysis: Both sequences reach sub-10 cm ATE RMSE (9.87 cm on fr1_room, 7.17 cm on fr1_360) without prior camera calibration.
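For reference, the two columns in the table are summary statistics over per-pose translation errors. The sketch below shows how they relate, under the assumption that the estimated and ground-truth trajectories are already time-associated and aligned (for monocular SLAM this usually means a similarity alignment, which is omitted here). The function name `ate_stats` is hypothetical and is not the benchmark's evaluation script.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ate_stats(estimated: Sequence[Point],
              ground_truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (rmse, mean) of per-pose translation errors.

    Assumes both trajectories are already associated and aligned;
    the result is in the same unit as the inputs (metres for TUM-RGBD).
    """
    if len(estimated) != len(ground_truth) or len(estimated) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")

    # Euclidean distance between each estimated position and its ground truth.
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]

    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    mean = sum(errors) / len(errors)
    return rmse, mean
```

Because RMSE weights large errors more heavily, it is never smaller than the mean error, which matches the ordering of the two columns above.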
3DGS Fidelity: Iteration & Denoising Analysis
We found that 7,000 iterations give the best visual balance; training for more iterations leads to overfitting artifacts. In addition, our SOR refinement, which removes 7.3% of points as outliers, eliminates "floaters" and yields cleaner geometric surfaces.
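To make the SOR step concrete, here is a minimal brute-force sketch of statistical outlier removal: a point is dropped when its mean distance to its `k` nearest neighbours exceeds the global mean by more than `std_ratio` standard deviations. The values of `k` and `std_ratio` are illustrative assumptions, since the settings used in the project are not stated, and the O(n²) search is only suitable for small clouds; a real pipeline would typically rely on a point-cloud library with a spatial index.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def statistical_outlier_removal(points: Sequence[Point], k: int = 8,
                                std_ratio: float = 2.0) -> List[int]:
    """Return indices of points kept by a statistical outlier filter.

    k and std_ratio are illustrative defaults, not the project's settings.
    """
    n = len(points)
    if k < 1 or n <= k:
        raise ValueError("need k >= 1 and more than k points")

    # Mean distance from each point to its k nearest neighbours (brute force).
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)

    # Global statistics of those mean distances.
    mu = sum(mean_knn) / n
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_knn) / n)
    threshold = mu + std_ratio * sigma

    # Keep points whose neighbourhood is not unusually sparse.
    return [i for i, d in enumerate(mean_knn) if d <= threshold]
```

Isolated points far from any surface have large neighbour distances and are filtered out, which is why this kind of filter suppresses floaters before 3DGS optimization.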
Hardware Constraints (8GB VRAM Stress Test)
Testing on a local RTX 4060 laptop GPU (8 GB VRAM) showed that the ViT-Large backbone is the bottleneck: VRAM overflow and the resulting memory swapping limit throughput to ~2 FPS.
Interactive Demo
This project was developed as a final project for the Computer Vision course at Southern University of Science and Technology (SUSTech).
Special thanks to Prof. Feng Zheng and Prof. Weiyu Wang for their professional guidance.