Following the recent AI competition between OpenAI and Google, Meta’s AI researchers are preparing to enter the fray with their multimodal model.
Multimodal AI models are advanced versions of large language models, capable of processing various forms of media such as text, images, sound recordings, and videos.
For instance, OpenAI’s latest model, GPT-4o, can describe your surroundings when you point your camera at them and ask it to.
Chameleon: Meta’s Early-Fusion Approach to Multimodal AI
Meta, the parent company of Facebook, is aiming to launch a similar tool with its multimodal model named Chameleon.
Unlike the earlier late-fusion technique, which encodes each modality with a separate model and merges the results only at a late stage, Chameleon’s early-fusion architecture processes all data types together from the start, avoiding late fusion’s limits on integrating information across modalities.
According to TechXplore, the team has developed a system that integrates different types of data—such as images, text, and code—by converting them into a common set of tokens.
This method mirrors the way large language models break text into tokens, allowing the same transformer-based techniques to be applied to mixed input data. With a single unified vocabulary, the system can represent, reason over, and generate interleaved sequences of different data types within one model.
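To make the idea concrete, here is a minimal Python sketch of early fusion under stated assumptions: a toy text tokenizer and a toy discrete image tokenizer whose codes are offset into a single shared vocabulary, so one model can consume one mixed sequence. The vocabulary sizes, hashing trick, and byte-level image codes are illustrative placeholders, not Chameleon’s actual components.

```python
# Illustrative sketch only: early fusion maps every modality into one
# shared token vocabulary so a single transformer can read the sequence.
TEXT_VOCAB_SIZE = 65_536   # assumed size of the text sub-vocabulary
IMAGE_VOCAB_SIZE = 8_192   # assumed size of the image codebook

def tokenize_text(text: str) -> list[int]:
    """Stand-in for a real subword tokenizer (e.g. BPE)."""
    return [hash(word) % TEXT_VOCAB_SIZE for word in text.split()]

def tokenize_image(pixels: bytes) -> list[int]:
    """Stand-in for a discrete image tokenizer that maps image
    patches to codebook indices in [0, IMAGE_VOCAB_SIZE)."""
    return [b % IMAGE_VOCAB_SIZE for b in pixels]

def fuse(text: str, pixels: bytes) -> list[int]:
    """Early fusion: offset image codes past the text ids so both
    modalities share one flat vocabulary and one sequence."""
    text_ids = tokenize_text(text)
    image_ids = [TEXT_VOCAB_SIZE + code for code in tokenize_image(pixels)]
    return text_ids + image_ids

sequence = fuse("a photo of a cat", bytes([7, 42, 99, 200]))
print(sequence)  # mixed text/image ids drawn from the shared vocabulary
```

Because every token lives in the same id space, the model never needs to treat a position differently based on which modality it came from; that uniformity is what the late-fusion approach lacks.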
Meta’s Chameleon Outshines Larger Models in Multimodal AI Tasks
Unlike Google’s Gemini, which relies on a separate image decoder to produce images, Chameleon is an end-to-end model that handles the entire process, from input tokens to generated output, directly.
The researchers introduced novel training techniques, including a two-stage learning process and a massive dataset of roughly 4.4 trillion tokens drawn from text, image-text pairs, and interleaved text-image sequences.
Two versions of the system were trained, with 7 billion and 34 billion parameters, using more than 5 million hours of high-speed GPU compute. For comparison, OpenAI’s GPT-4 reportedly has about 1 trillion parameters.
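The article does not detail the two-stage schedule, but such recipes are typically structured as a long first run over the broad data mixture followed by a shorter second run over a re-weighted, higher-quality mix at a lower learning rate. The toy loop below sketches only that shape; the model, data, step counts, and learning rates are all illustrative assumptions rather than the paper’s actual recipe.

```python
import random

# Toy stand-in for a trainable model: one weight nudged toward batch means.
class ToyModel:
    def __init__(self):
        self.weight = 0.0

    def update(self, batch: list[float], lr: float) -> None:
        target = sum(batch) / len(batch)
        self.weight += lr * (target - self.weight)

def sample_batch(mean: float, size: int = 4) -> list[float]:
    """Stand-in for drawing a batch from a data mixture."""
    return [random.gauss(mean, 1.0) for _ in range(size)]

model = ToyModel()

# Stage 1: long run over the broad, large-scale mixture (assumed lr).
for _ in range(1000):
    model.update(sample_batch(mean=1.0), lr=1e-2)

# Stage 2: shorter run over a higher-quality mix at a lower lr (assumed).
for _ in range(100):
    model.update(sample_batch(mean=2.0), lr=1e-3)

print(f"final weight: {model.weight:.3f}")
```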
In a paper posted on the arXiv preprint server, the team shared promising results from testing.
The outcome is a multimodal model with impressive versatility, achieving state-of-the-art performance in image captioning tasks.
The researchers claim this model surpasses Llama-2 in text-only tasks and is competitive with models like Mixtral 8x7B and Gemini Pro.
It also performs sophisticated image generation within a single, unified framework. The team asserts that Chameleon matches or even outperforms larger models such as Gemini Pro and GPT-4 in certain evaluations.