cross-posted from: https://lemmy.intai.tech/post/41706

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Authors: Minghua Liu*, Chao Xu*, Haian Jin*, Linghao Chen*, Mukund Varma T, Zexiang Xu, Hao Su

Summary: This paper presents a method to reconstruct a 3D textured mesh from a single image in an end-to-end manner, without time-consuming per-shape optimization. Their approach consists of three main parts (a rough code sketch of the full pipeline follows the list):

Multi-view synthesis: They leverage a view-conditioned 2D diffusion model, Zero123, to generate multi-view images of the input object.

Pose estimation: They estimate the camera elevation angle of the input image, which determines the camera poses of the generated multi-view images.

3D reconstruction: They employ a neural surface reconstruction method based on signed distance fields to reconstruct a 3D textured mesh from the multi-view images in a single feed-forward pass.
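
For a feel of how the three stages fit together, here is a minimal, hypothetical sketch in Python. The function names (`synthesize_views`, `estimate_elevation`, `reconstruct_sdf_mesh`), signatures, and the camera layout are illustrative placeholders, not the authors' actual code or API; in the real system, stage 1 wraps the Zero123 diffusion model and stage 3 is a generalizable SDF-based neural surface reconstruction network.

```python
import numpy as np

# Hypothetical pipeline sketch -- stage names and signatures are illustrative,
# not the paper's released code. Stage 1 would wrap Zero123, stage 3 an
# SDF-based neural surface reconstruction network run in a single forward pass.

def synthesize_views(image: np.ndarray, azimuths_deg, elevation_deg: float):
    """Stage 1: generate novel views of the object at the requested poses.
    Placeholder: returns copies of the input; Zero123 would be called here."""
    return [image.copy() for _ in azimuths_deg]

def estimate_elevation(image: np.ndarray) -> float:
    """Stage 2: estimate the input camera's elevation angle so that every
    synthesized view can be assigned a consistent world-space pose.
    Placeholder: returns a fixed guess."""
    return 30.0

def reconstruct_sdf_mesh(views, poses):
    """Stage 3: feed-forward SDF-based reconstruction from posed views.
    Placeholder: samples the SDF of a sphere on a coarse grid; a real
    implementation would predict the SDF with a network and extract a
    textured mesh (e.g. via marching cubes)."""
    grid = np.linspace(-1.0, 1.0, 32)
    x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
    sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # signed distance to a sphere
    return sdf  # stand-in for the extracted textured mesh

def image_to_mesh(image: np.ndarray):
    elevation = estimate_elevation(image)                 # pose of the input camera
    azimuths = np.linspace(0, 360, 8, endpoint=False)     # ring of target viewpoints
    views = synthesize_views(image, azimuths, elevation)  # view-conditioned diffusion
    poses = [(a, elevation) for a in azimuths]            # (azimuth, elevation) per view
    return reconstruct_sdf_mesh(views, poses)             # single feed-forward pass

if __name__ == "__main__":
    dummy_image = np.zeros((256, 256, 3), dtype=np.float32)
    print("SDF grid shape:", image_to_mesh(dummy_image).shape)
```

The point of the structure is that no stage is optimized per shape: the views are sampled once, posed once, and passed through the reconstruction network once, which is what keeps the whole process to roughly 45 seconds.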

Their key contributions are:

- Reconstruction in just 45 seconds without per-shape optimization
- Producing higher-quality geometry due to the use of an SDF representation
- Generating more 3D-consistent results thanks to the multi-view synthesis module
- Achieving better adherence to the input image compared to existing methods

They evaluate their approach on synthetic data and real images, demonstrating superior performance in terms of both mesh quality and runtime compared to existing zero-shot single-image 3D reconstruction approaches.

Evaluation: This approach has strong potential for applications in 3D content creation and augmented/virtual reality. The key benefits are:

Fast inference time of 45 seconds, which is orders of magnitude faster than optimization-based approaches. This makes it suitable for production environments with low latency requirements.

Ability to reconstruct 3D shapes from a single image of any object, not restricted to specific object categories. This enables a wide range of applications.

Good adherence to the input image, producing realistic 3D shapes that match the given input. This is important for applications where fidelity to the input is critical.

The ability to extend to text-to-3D tasks by chaining a text-to-image diffusion model in front of the pipeline, providing an unrestricted input domain (a rough sketch follows).
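
Here is a minimal sketch of that extension, assuming the `image_to_mesh` pipeline outlined earlier; `generate_image_from_text` is a placeholder for any off-the-shelf text-to-image diffusion model (e.g. Stable Diffusion), and both bodies here are stubs rather than the paper's implementation.

```python
import numpy as np

def generate_image_from_text(prompt: str) -> np.ndarray:
    """Placeholder for an off-the-shelf text-to-image diffusion model
    (e.g. Stable Diffusion). Returns a dummy RGB image here."""
    return np.zeros((256, 256, 3), dtype=np.float32)

def image_to_mesh(image: np.ndarray):
    """Stand-in for the single-image-to-3D pipeline sketched above."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), dtype=int)}

def text_to_mesh(prompt: str):
    # Unrestricted text input: text -> image -> 3D mesh, with no retraining
    # of the 3D reconstruction stage.
    image = generate_image_from_text(prompt)
    return image_to_mesh(image)

if __name__ == "__main__":
    mesh = text_to_mesh("a wooden rocking chair")
    print("vertices:", mesh["vertices"].shape)
```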

The main limitation is the dependence on the Zero123 diffusion model for multi-view synthesis, which occasionally produces inconsistent predictions that can impact reconstruction quality. However, the overall results demonstrate strong potential for real-world applications. With further improvements to the multi-view synthesis module and additional regularizations, this approach could enable a wide range of novel applications that require reconstructing realistic 3D shapes from a single image in near real-time.