Meet EscherNet: A Multi-View Conditioned Diffusion Model for View Synthesis

Feb 24, 2024

Meet EscherNet: A Multi-View Conditioned Diffusion Model for View Synthesis

Feb 14, 2024

View synthesis is essential in both computer vision and graphics: it re-renders a scene from new viewpoints, much as a person can picture an object from another angle. The capability matters for everyday tasks and supports creative work, since it lets people envision and craft immersive objects with depth and perspective. Researchers at Dyson Robotics Lab aim to make view synthesis scalable, starting from two key observations.

Firstly, recent advances have focused on training speed and rendering efficiency, yet they still rely heavily on volumetric rendering and scene-specific encoding. The researchers instead propose learning general 3D representations based solely on scene colors and geometries, without requiring ground-truth 3D geometry or a specific coordinate system. Dropping the scene-specific encoding is what allows the approach to scale.

Secondly, view synthesis can be framed as a conditional generative modeling problem, akin to generative image in-painting, where the model should provide multiple plausible predictions based on sparse reference views. They argue for a more flexible generative formulation that accommodates varying levels of input information, gradually converging towards ground-truth representations as more data becomes available.

Building upon these insights, they introduce EscherNet, an image-to-image conditional diffusion model for view synthesis. EscherNet uses a transformer architecture with dot-product self-attention to capture both reference-to-target and target-to-target relationships between views. A key innovation is the Camera Positional Encoding (CaPE), which represents both 4 degree-of-freedom (DoF) and 6 DoF camera poses so that self-attention can be computed from relative camera transformations.
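To make the idea of attention that depends only on relative camera transformations concrete, here is a minimal NumPy sketch for the 6 DoF case. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`random_pose`, `apply_blockwise`, `encode_query`, `encode_key`, `attention_score`) are hypothetical, and the choice to transform queries by the inverse-transpose of the pose and keys by the pose itself is one construction that yields the property, assumed here for illustration.

```python
import numpy as np

def random_pose(rng):
    """Return a random 4x4 rigid transform (rotation plus translation)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:      # flip one axis so the result is a proper rotation
        q[:, 0] *= -1
    pose = np.eye(4)
    pose[:3, :3] = q
    pose[:3, 3] = rng.normal(size=3)
    return pose

def apply_blockwise(vec, mat):
    """Multiply every consecutive 4-dim chunk of `vec` by the 4x4 matrix `mat`."""
    return (vec.reshape(-1, 4) @ mat.T).reshape(-1)

def encode_query(q, pose):
    # Assumption: queries are transformed by the inverse-transpose of their view's pose.
    return apply_blockwise(q, np.linalg.inv(pose).T)

def encode_key(k, pose):
    # Assumption: keys are transformed by their view's pose.
    return apply_blockwise(k, pose)

def attention_score(q, k, pose_q, pose_k):
    """Dot product of the encoded vectors: sum over chunks of q_c^T (P_q^-1 P_k) k_c."""
    return float(encode_query(q, pose_q) @ encode_key(k, pose_k))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
pose_q, pose_k, world_shift = random_pose(rng), random_pose(rng), random_pose(rng)

# Moving both cameras by the same global transform leaves the score unchanged,
# because (G P_q)^-1 (G P_k) = P_q^-1 P_k.
score = attention_score(q, k, pose_q, pose_k)
shifted = attention_score(q, k, world_shift @ pose_q, world_shift @ pose_k)
print(np.isclose(score, shifted))
```

Because the score depends only on the product of one pose's inverse with the other pose, the choice of world coordinate system cancels out, which is the sense in which such an encoding is independent of any particular coordinate system.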

Three characteristics distinguish EscherNet in the field of view synthesis. Firstly, it is consistent: the Camera Positional Encoding (CaPE) ties every view to its camera pose, which fosters coherence between reference and target views. Secondly, it is scalable: it does not depend on a specific coordinate system and avoids costly 3D operations, making it adaptable to everyday 2D image data.

Lastly, its impressive generalization capabilities allow it to generate target views based on varying numbers of reference views, improving quality as more references are provided. These qualities collectively position EscherNet as a promising advancement in view synthesis and 3D vision research.

Comprehensive evaluations across view synthesis and 3D reconstruction benchmarks demonstrate EscherNet’s superior generation quality compared to existing models, particularly under limited view constraints. This underscores the effectiveness of their approach in advancing view synthesis and 3D vision.

The post Meet EscherNet: A Multi-View Conditioned Diffusion Model for View Synthesis appeared first on MarkTechPost.

[Source: AI Techpark]

View synthesis, integral to computer vision and graphics, re-renders a scene from diverse perspectives, much as human vision does. It supports tasks such as object manipulation and navigation and also fosters creativity. Most earlier methods, however, rely heavily on ground-truth 3D geometry, which limits them to small-scale synthetic 3D data.

Early works in neural 3D representation learning optimized 3D data directly, using voxels and point clouds for explicit representation learning. Other methods mapped 3D spatial coordinates to signed distance functions or occupancies for implicit representation learning. Both families relied heavily on ground-truth 3D geometry, which limited their applicability. Differentiable rendering functions improved scalability by allowing training from posed multi-view images. Direct training on 3D datasets using point clouds or neural fields improved efficiency but encountered computational challenges.

Researchers from Dyson Robotics Lab, Imperial College London, and The University of Hong Kong present EscherNet, a multi-view conditioned diffusion model that offers precise control over the camera transformation between reference and target views. It learns implicit 3D representations with a specialized camera positional encoding, offering exceptional generality and scalability in view synthesis. Despite being trained with a fixed number of reference views, EscherNet can generate over 100 consistent target views on a single GPU. It also unifies single- and multi-image 3D reconstruction tasks.

EscherNet integrates a 2D diffusion model and camera positional encoding to handle arbitrary numbers of views for view synthesis. It utilizes Stable Diffusion v1.5 as a backbone, modifying self-attention blocks to ensure target-to-target consistency across multiple views. By incorporating Camera Positional Encoding (CaPE), EscherNet accurately encodes camera poses for each view, facilitating relative camera transformation learning. It achieves high-quality results by efficiently encoding high-level semantics and low-level texture details from reference views.
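The reason an attention-based design can accept an arbitrary number of views can be sketched in a few lines of NumPy. This is a simplified illustration, not EscherNet's actual block: it assumes target tokens first attend to each other (target-to-target) and then to reference tokens (reference-to-target), it omits the learned projections, normalization, and camera encoding, and the names `attend` and `multi_view_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention; queries (Lq, d), keys and values (Lk, d)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

def multi_view_block(target_tokens, reference_tokens):
    """target_tokens: (M, T, d) for M target views; reference_tokens: (N, T, d)."""
    m, t, d = target_tokens.shape
    tgt = target_tokens.reshape(m * t, d)   # concatenate all target views into one sequence
    ref = reference_tokens.reshape(-1, d)   # concatenate all reference views likewise
    tgt = tgt + attend(tgt, tgt, tgt)       # target-to-target: targets see each other
    tgt = tgt + attend(tgt, ref, ref)       # reference-to-target: targets read the references
    return tgt.reshape(m, t, d)

rng = np.random.default_rng(0)
targets = rng.normal(size=(3, 16, 8))       # 3 target views, 16 tokens each, width 8
out_one = multi_view_block(targets, rng.normal(size=(1, 16, 8)))    # 1 reference view
out_five = multi_view_block(targets, rng.normal(size=(5, 16, 8)))   # 5 reference views
print(out_one.shape, out_five.shape)
```

Views only change the sequence length that attention runs over, so the same function handles one reference view or five without any change to its parameters.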

EscherNet demonstrates superior performance across various tasks in 3D vision. In novel view synthesis, it outperforms 3D diffusion models and neural rendering methods, achieving high-quality results with fewer reference views. Additionally, EscherNet excels in 3D generation, surpassing state-of-the-art models in reconstructing accurate and visually appealing 3D geometry. Its flexibility enables seamless integration into text-to-3D generation pipelines, producing consistent and realistic results from textual prompts.

To sum up, the researchers from Dyson Robotics Lab, Imperial College London, and The University of Hong Kong introduce EscherNet, a multi-view conditioned diffusion model for scalable view synthesis. By leveraging Stable Diffusion’s 2D architecture and innovative CaPE, EscherNet effectively learns implicit 3D representations from various reference views, enabling consistent 3D novel view synthesis. This approach demonstrates promising results for addressing challenges in view synthesis and offers potential for further advancements in scalable neural architectures for 3D vision.
