CGGS: Consistency-Augmented Geometric Gaussian Splatting for Ego-centric 3D Scene Generation

TIP 2026

Zhenyu Sun¹,², Xiaohan Zhang¹, Qi Liu¹†, Huan Wang²†

¹South China University of Technology, ²Westlake University

† Corresponding author.

(a) Generation results of CGGS for ego-centric multi-view priors, Gaussian point clouds, novel view synthesis, and depth maps. Our method generates harmonious, domain-free 3D scenes from ego-centric views that are highly aligned with complex textual descriptions. (b) Additional generation results from CGGS. Our work can generate richly detailed, high-fidelity scenes with considerable diversity while ensuring cohesive semantic content and a harmonized visual style that faithfully reflects even the most intricate textual descriptions. CGGS exhibits notable generalization ability,

Abstract

Challenges remain in ego-centric 3D scene generation due to limited view overlap and the dominant influence of individual perspectives on scene interpretation. These factors hinder the creation of viewpoint-consistent and semantically aligned visual content, as well as the construction of accurate geometric structures. In this paper, we propose CGGS, a text-to-3D framework aiming to enhance 3D-content-awareness and address geometric distortions in ego-centric scene generation. Firstly, the Ego-centric Generator is proposed by fine-tuning a Multi-View Latent Diffusion Model with consistency-augmented loss to generate consistent, high-fidelity 2D content aligned with textual descriptions. Then, Layout Decorator leverages optical flow and point-track correspondence to estimate depth, therefore producing dense point clouds as coarse layouts from the ego-centric 2D priors. Building on this initialization, Geometric Refiner is proposed to enhance 3D Gaussian reconstruction via an entropy-based Mutual Information Depth Loss (MID) combined with a hierarchical optimization scheme for improving visual quality and geometric structure. Comprehensive experiments demonstrate that CGGS outperforms previous methods in generating coherent and accurate text-driven 3D scenes.

Geometric Distortions in Panoramic Generation

Visualization of geometric distortions in panoramic generation (using DreamScene360 as an example). 1) Insufficient Text-Content Alignment: Significant textual details are omitted in the generation, such as the absence of "mismatched frames" on the gallery wall despite being explicitly specified in the prompt. 2) Polar Geometric Distortions: Due to the inherent nature of equirectangular projection, severe radial stretching and bending occur near the top and bottom boundaries (e.g., the warped ceiling and distorted sand), which violates perspective consistency. 3) Unreasonable Structural Artifacts: The model fails to maintain physical continuity, most notably evidenced by the severed and floating tree trunks in the beach scene, as well as incoherent horizon lines that hinder valid 3D reconstruction.

Algorithm Design and Model Architecture

While panoramic generation naturally ensures global continuity with a unified 360° field of view, the requisite equirectangular projection introduces severe geometric distortions—particularly near the poles—which fundamentally violate the pinhole camera assumption inherent in 3DGS and SfM pipelines. Conversely, multi-view generation synthesizes perspective images that are geometrically distortion-free and rich in local high-frequency details, offering a mathematically robust foundation for high-fidelity reconstruction. However, this paradigm inherently struggles with inter-view consistency due to the lack of a unified canvas. To tackle the aforementioned issues, we introduce CGGS, which unleashes the potential of latent diffusion models in text-image and image-image alignment, and learns the ego-centric 3D representation from the 2D images through a hierarchical 3D Gaussian optimization.
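The polar distortion mentioned above can be quantified with a small sketch. It relies only on the standard geometry of equirectangular images: every row covers the full 360° of longitude, while the circle of latitude it depicts shrinks with the cosine of the latitude. The function name `equirect_horizontal_stretch` is hypothetical and is not part of CGGS.

```python
import math


def equirect_horizontal_stretch(latitude_deg: float) -> float:
    """Horizontal stretch of an equirectangular image row at a given latitude.

    Each row spans 360 degrees of longitude, but the circle of latitude it
    represents has a circumference proportional to cos(latitude), so content
    is stretched by 1 / cos(latitude) relative to the equator.
    """
    if not -90.0 < latitude_deg < 90.0:
        raise ValueError("latitude must lie strictly between -90 and 90 degrees")
    return 1.0 / math.cos(math.radians(latitude_deg))


if __name__ == "__main__":
    # The stretch is 1 at the equator and grows without bound towards the poles.
    for lat in (0, 30, 60, 80, 89):
        stretch = equirect_horizontal_stretch(lat)
        print(f"latitude {lat:2d} deg: horizontal stretch x{stretch:.2f}")
```

The stretch already doubles at 60° latitude and keeps growing towards the poles, which a pinhole camera model cannot represent.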

To enhance semantic alignment and cross-view consistency, we add a consistency-augmented loss term as a regularizer to the LDM loss during the training of the CAA module. Building upon the synthesized views, a Flow-Depth Estimator is used to generate a dense point cloud as layout initialization. This approach can reconstruct a robust 3D structural layout of the scene from ego-centric 2D priors, whereas conventional Structure-from-Motion (SfM) methods typically struggle with such tasks. Based on the initial 3D layout, we further leverage the Mutual Information Depth Loss (MID) to refine the scene during 3D Gaussian optimization, combined with a hierarchical optimization strategy that aims to maintain rendering robustness.
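To show how a per-view depth map can become a point cloud, here is a minimal sketch of standard pinhole back-projection. It is not the Flow-Depth Estimator itself, whose internals are not described here. The function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for the example, and points stay in the camera frame (no camera-to-world pose is applied).

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Lift a depth map to camera-frame 3D points with the pinhole model.

    A pixel (u, v) with depth z maps to
        x = (u - cx) * z / fx,   y = (v - cy) * z / fy,   z = z.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # no valid depth estimate for this pixel
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


if __name__ == "__main__":
    # Toy 2x2 depth map with one invalid pixel (hypothetical values).
    toy_depth = [[2.0, 2.0], [0.0, 4.0]]
    for point in backproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5):
        print(point)
```

In a multi-view setting, each view's points would also be transformed by that view's camera pose before being merged into one cloud.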

Qualitative Comparison

Qualitative comparison between CGGS and other baselines. CGGS produces multi-view images with rich detail and superior semantic coherence, and it generalizes across domains. Our results follow the text descriptions more accurately and maintain unified 3D consistency. Specifically, DreamScene360 generates results with little major content in the horizontal field of view; Director3D is capable of depicting the content described in text prompts but is constrained by a limited field of view; LucidDreamer causes undesirable style transfer, incorrect stitching between concepts, and inconsistent content, as highlighted in the red box.

Quantitative Comparisons

Quantitative comparison between CGGS and other baselines. We benchmark our method against prior works across 24 scenes, covering indoor and outdoor environments. We report metrics that reflect both generation quality and reconstruction quality. The results indicate that CGGS achieves the best overall performance, generating semantically consistent 3D scenes with high visual quality and proper geometric structure.

Ablation Study on Consistency-Augmented Loss

Ablation study on consistency-augmented loss L_aug. Without L_aug, cross-view texture discrepancies become pronounced, with abrupt background artifacts (e.g., exposed ceilings in bedroom scenes) and physically implausible anomalies (e.g., floating, distorted trees on beaches) emerging.

We further report the training time of the Ego-centric Generator and the vanilla version without L_aug. The quantitative results illustrate that our proposed L_aug enhances the training of the diffusion process, thus improving the semantic alignment and perceptual quality.

Ablation Study on Layout Decorator Compared with SfM Methods

We compare Layout Decorator with a conventional SfM method, COLMAP, by directly substituting the block in CGGS while keeping the other modules the same. To ensure a fair comparison, we evaluate COLMAP both with and without the identical camera trajectories used in our method. Quantitative results demonstrate that our method provides a more reliable 3D structure for subsequent 3D Gaussian optimization. Under relatively sparse-view settings with little overlap across views, conventional SfM methods tend to diverge during optimization or yield a suboptimal spatial structure.

Ablation Study on Geometric Refiner

Ablation studies of Geometric-Refiner on MID loss and hierarchical optimization (HO). Here we show a qualitative comparison between the ground truth and different settings, including MID+HO, MID, PD+HO, PD, and w/o (MID+HO), where PD denotes the Pearson Depth loss. The comparison covers both indoor and outdoor scenes. Our design of Geometric-Refiner provides the most accurate texture recovery, with fewer blurred blocks than other settings.

We further explore the effectiveness of our proposed MID loss and hierarchical optimization, and compare our depth loss with the conventional Pearson Depth loss. The quantitative comparison indicates that relying solely on depth supervision slightly degrades rendering quality, due to the stricter structural constraints. In contrast, the combined application of hierarchical optimization and the MID loss yields substantial improvements in the geometric coherence of the 3D Gaussian primitives as well as in overall rendering performance. The results below support our design choices.
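As a side illustration of the two kinds of depth supervision compared in this ablation, the sketch below contrasts a Pearson-correlation depth loss with a generic histogram-based mutual information estimate. The exact MID formulation is not given here, so `mutual_information` is only an assumed, textbook estimator and not the loss used in CGGS. All function names and the bin count are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def pearson_depth_loss(rendered: Sequence[float], reference: Sequence[float]) -> float:
    """Return 1 - Pearson correlation between two flattened depth maps."""
    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_t = sum(reference) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(rendered, reference))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_t = sum((t - mean_t) ** 2 for t in reference)
    denom = math.sqrt(var_r * var_t)
    if denom == 0.0:
        return 1.0  # correlation is undefined for constant depth; treat as uncorrelated
    return 1.0 - cov / denom


def _bin_indices(values: Sequence[float], bins: int) -> List[int]:
    """Assign each value to one of `bins` equal-width histogram bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(
    rendered: Sequence[float], reference: Sequence[float], bins: int = 8
) -> float:
    """Histogram estimate (in nats) of the mutual information of two depth maps."""
    a = _bin_indices(rendered, bins)
    b = _bin_indices(reference, bins)
    n = len(a)
    joint, p_a, p_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(
        (c / n) * math.log((c / n) / ((p_a[i] / n) * (p_b[j] / n)))
        for (i, j), c in joint.items()
    )


if __name__ == "__main__":
    reference = [float(d) for d in range(1, 9)]
    affine = [2.0 * d + 3.0 for d in reference]  # scale/shift of the reference
    inverse = [1.0 / d for d in reference]       # monotonic but nonlinear relation
    print("Pearson loss (affine): ", pearson_depth_loss(affine, reference))
    print("Pearson loss (inverse):", pearson_depth_loss(inverse, reference))
    print("MI (affine): ", mutual_information(affine, reference))
    print("MI (inverse):", mutual_information(inverse, reference))
```

The Pearson loss is invariant to scale and shift but only rewards a linear relation, whereas mutual information measures general statistical dependence. A mutual-information term would typically be maximized, for example by minimizing its negative.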

Acknowledgment

We thank Zhenyu Sun, Xiaohan Zhang, Qi Liu, and Huan Wang for their contributions. We sincerely thank Qi Liu and Huan Wang for their invaluable guidance and support throughout this research. This work is supported by the Young Scientists Fund of the National Natural Science Foundation of China (NSFC) (No. 62506305), the Zhejiang Leading Innovative and Entrepreneur Team Introduction Program (No. 2024R01007), the Key Research and Development Program of Zhejiang Province (No. 2025C01026), the Scientific Research Project of Westlake University (No. WU2025WF003), and the Chinese Association for Artificial Intelligence (CAAI) & Ant Group Research Fund - AGI Track (No. 2025CAAI-ANT-13). It is also supported by the research funds of the National Talent Program and Hangzhou Municipal Talent Program, and in part by the GJYC program of Guangzhou under Grant 2024D01J0081, the ZJ program of Guangdong under Grant 2023QN10X455, and the Fundamental Research Funds for the Central Universities under Grant 2025ZYGXZR053.

BibTeX

@article{sun2026cggs,
  title   = {CGGS: Consistency-Augmented Geometric Gaussian Splatting for Ego-centric 3D Scene Generation},
  author  = {Zhenyu Sun and Xiaohan Zhang and Qi Liu and Huan Wang},
  journal = {IEEE Transactions on Image Processing},
  year    = {2026},
}