Meta is releasing a new multimodal AI model, called ImageBind, as an open-source tool.
While still in the early stages, ImageBind acts as a framework for eventually creating complex scenes and environments from one or several inputs, such as a text or image prompt.
For example, if fed a picture of a beach, ImageBind could identify the sound of waves. Similarly, if given a photo of a tiger along with the sound of a waterfall, the system could produce a video of both.
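To make that cross-modal matching concrete, the sketch below follows the usage example in Meta's open-source ImageBind repository; the exact module paths, helper functions, and file names are assumptions based on that repository and may differ in practice. It embeds a beach photo and several candidate audio clips into the shared space, then ranks the clips by similarity to the image.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (weights are downloaded on first use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder inputs: one beach photo and a few candidate sound clips.
image_paths = ["beach.jpg"]
audio_paths = ["waves.wav", "traffic.wav", "birdsong.wav"]

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

# Each modality is embedded into the same joint space.
with torch.no_grad():
    embeddings = model(inputs)

# Similarity between the image embedding and each audio embedding;
# the waves clip should score highest for the beach photo.
scores = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1
)
print(scores)
```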
- The model currently works with six types of data: text, visual (image/video), audio, depth, thermal (temperature), and movement (IMU).
- Its approach is comparable to how humans gather information through multiple senses, and it can relate inputs across the different data modes.
- Meta says the model gives machines a "holistic understanding" that links objects in a photo to their corresponding sound, 3D structure, temperature, and motion.
- While Meta hasn't released it as a product, ImageBind's applications could include enhancing search functionality for photos and videos or creating mixed-reality environments. Meta plans to expand ImageBind's data modes to other senses in the future.
- Meta's research paper on ImageBind is available here.