The Ultimate Apple MM1 AI System: Unlocking New Data Interpretation Possibilities

Apple has developed a family of multimodal models called MM1, which can interpret different types of data, such as text and images, at the same time.
According to the researchers, MM1 offers in-context learning and multi-image reasoning when responding to prompts that combine text and images.

Implications for Apple Products:

The new AI system could benefit future Apple products, including iPhones, Macs, and the Siri voice assistant.
Apple is expected to unveil several new AI features at its developer conference in June.

Partnership with Google:

Apple is reportedly in talks with Google to license Google’s Gemini AI engine, which encompasses chatbots and various AI tools, and integrate it into future iPhones and features of iOS 18.
The partnership would catapult Apple into the growing AI arms race and could bring Gemini to nearly 2 billion Apple devices.
Google already pays Apple substantial sums to remain the default search engine in Apple’s Safari browser. That search agreement between the two tech giants, however, is currently under scrutiny by antitrust authorities.

MM1:

A group of Apple researchers has conducted a comprehensive study on building high-performing multimodal large language models (MLLMs), focusing on the significance of various architectural components and data choices in multimodal pre-training.
Through meticulous ablation studies, the research identifies key design lessons for effective model construction, demonstrating that a strategic combination of image-caption, interleaved image-text, and text-only data is vital for achieving state-of-the-art results across multiple benchmarks.
It highlights the substantial impact of the image encoder, image resolution, and image token count while noting the comparatively minor importance of vision-language connector design.
Scaling up these insights, the MM1 model family, which includes dense and mixture-of-experts variants of up to 64B parameters, shows competitive performance across established multimodal benchmarks after supervised fine-tuning.
The large-scale pre-training endows MM1 with enhanced in-context learning and multi-image reasoning capabilities, facilitating few-shot chain-of-thought prompting.
The work aims to offer enduring design principles for building MLLMs, transcending specific model components and data sources, with the hope of guiding future research in the field.
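
The data-mixture lesson above can be made concrete with a small sketch. The Python snippet below shows one common way to combine several data sources during pre-training: each training example is drawn from a source chosen at random according to fixed weights. The weights, the source names, and the helper functions `sample_sources` and `mixture_counts` are assumptions made for illustration; the article does not state the proportions or the sampling code that the MM1 researchers used.

```python
import random
from collections import Counter

# Placeholder mixture weights, for illustration only.
# The article does not give the proportions used for MM1.
MIXTURE_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}


def sample_sources(weights, num_samples, seed=0):
    """Pick a data source for each training example, proportionally to its weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_samples)


def mixture_counts(weights, num_samples, seed=0):
    """Count how many examples were drawn from each source."""
    return Counter(sample_sources(weights, num_samples, seed))


if __name__ == "__main__":
    total = 10_000
    counts = mixture_counts(MIXTURE_WEIGHTS, total)
    for name in MIXTURE_WEIGHTS:
        print(f"{name}: {counts[name] / total:.1%}")
```

With enough samples, the observed share of each source approaches its weight, so changing the weights is a simple way to run the kind of data ablation the study describes.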

**Source:** https://techxplore.com/news/2024-03-apple-mm1-multimodal-llm-capable.html

Figure:

Model ablations: what visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM.
Data ablations: Type of data and their mixture.
Credit: arXiv (2024). DOI: 10.48550/arxiv.2403.09611

The Video: https://www.youtube.com/watch?v=QB5cSqrESlE

The potential partnership is seen as a boon for both Apple and Google, with significant license fees and technology access involved. A deal would also be a considerable validation of Google’s generative AI strategy, given that Microsoft and OpenAI captured early market share by commercializing some of their products.