Intel Labs and Blockade Labs have created a latent diffusion model that uses generative AI to produce lifelike 3D visual content. It is the first of its kind to generate a depth map through the diffusion process. The technology enables immersive 360-degree 3D views and has the potential to reshape content creation, metaverse applications, and digital experiences. Its impact spans sectors as diverse as entertainment, gaming, architecture, and design, promising a transformative future.
Intel hopes to democratize AI and break the limitations of closed ecosystems, which the company believes will pave the way for a more accessible and open AI environment and, as a result, sell more processors. While computer vision, particularly generative AI, has made significant progress, current models focus mainly on generating 2D images. In contrast, the Latent Diffusion Model for 3D (LDM3D) enables users to generate both an image and a depth map from a given text prompt. Using almost the same number of parameters as latent stable diffusion, LDM3D, says Intel, estimates the relative depth of each pixel more accurately than standard post-processing depth estimation methods.
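For readers who want a feel for the workflow, the sketch below shows how a single prompt could yield both an RGB image and a depth map. It assumes the Hugging Face diffusers library’s StableDiffusionLDM3DPipeline and the Intel/ldm3d-4c checkpoint; treat the checkpoint name and output attributes as assumptions drawn from Intel’s public release, not a definitive recipe.

```python
# Minimal sketch: one text prompt -> RGB image + depth map.
# Assumes the diffusers library and the Intel/ldm3d-4c checkpoint
# (both based on Intel's public release; verify before relying on them).
import torch
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-4c", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # omit for CPU-only inference

prompt = "a tranquil tropical beach at sunset, palm trees, soft waves"
output = pipe(prompt)

# A single diffusion pass yields both modalities.
rgb_image = output.rgb[0]      # PIL image
depth_image = output.depth[0]  # PIL image encoding per-pixel depth
rgb_image.save("beach_rgb.png")
depth_image.save("beach_depth.png")
```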
This research has the potential to revolutionize how people interact with digital content, letting users describe scenes with text prompts in unprecedented ways. With LDM3D’s generated images and depth maps, a simple text description of a tranquil tropical beach, a futuristic skyscraper, or a sci-fi universe can be transformed into a highly detailed 360-degree panorama. This ability to capture depth information instantaneously enhances realism and immersion. From entertainment, gaming, and interior design to real estate listings, virtual museums, and immersive VR experiences, LDM3D opens doors to a multitude of possibilities.
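To make the depth-to-3D step concrete, here is an illustrative sketch of how an equirectangular RGB panorama and its depth map can be lifted into a colored 3D point cloud. The spherical projection math is standard; the file names and metric depth scaling are hypothetical and not part of Intel’s published pipeline.

```python
# Illustrative sketch: equirectangular RGB-D panorama -> 3D point cloud.
# File names and depth units are hypothetical placeholders.
import numpy as np
from PIL import Image

rgb = np.asarray(Image.open("beach_rgb.png").convert("RGB"))
depth = np.asarray(Image.open("beach_depth.png").convert("F"))  # assumed meters

h, w = depth.shape
# Pixel grid -> spherical angles: longitude spans [-pi, pi],
# latitude spans [-pi/2, pi/2] across the panorama.
u, v = np.meshgrid(np.arange(w), np.arange(h))
lon = (u / w - 0.5) * 2.0 * np.pi
lat = (0.5 - v / h) * np.pi

# Spherical -> Cartesian, using depth as the radial distance per pixel.
x = depth * np.cos(lat) * np.sin(lon)
y = depth * np.sin(lat)
z = depth * np.cos(lat) * np.cos(lon)

points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
colors = rgb.reshape(-1, 3)
# points/colors can now feed a renderer or a .ply export for VR viewing.
```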
The LDM3D model was trained on an Intel AI supercomputer using Intel Xeon processors and Intel Habana Gaudi AI accelerators. This combination of hardware enabled the integration of generated RGB images and depth maps, culminating in 360-degree views for captivating immersive experiences.
To showcase the capabilities of LDM3D, researchers from Intel and Blockade Labs developed DepthFusion, an application that combines standard 2D RGB photos with depth maps. Leveraging TouchDesigner, a node-based visual programming language for real-time interactive multimedia content, DepthFusion transforms text prompts into interactive, immersive digital encounters. Because a single LDM3D model generates both the RGB image and its corresponding depth map, DepthFusion realizes significant savings in memory footprint and latency.
“Generative AI technology aims to further augment and enhance human creativity and save time. However, most of today’s generative AI models are limited to generating 2D images, and only very few can generate 3D images from text prompts. Unlike existing latent stable diffusion models, LDM3D allows users to generate an image and a depth map from a given text prompt using almost the same number of parameters. It provides more accurate relative depth for each pixel in an image, compared to standard post-processing methods for depth estimation, and saves developers significant time to develop scenes,” stated Vasudev Lal, AI/ML research scientist, Intel Labs.
What do we think?
The development of LDM3D and DepthFusion is a significant step in the advancement of multi-view generative AI and computer vision. Intel is making LDM3D available as open source through HuggingFace, allowing AI researchers to enhance and refine the system and customize it for specific applications, which should encourage collaborative and continuous improvement.
This development adds a new tool to digital content creators’ kits and could give the lagging VR segment a boost by letting creators produce rich vistas much faster and at lower cost.