Rhymes AI has introduced Aria, an open-source, multimodal-native Mixture-of-Experts (MoE) model that processes text, images, video, and code. In benchmarking tests, Aria has outperformed other open models and demonstrated competitive performance against proprietary models such as GPT-4o and Gemini-1.5. Additionally, Rhymes AI has released a codebase that includes model weights and guidance for fine-tuning and development.
According to Rhymes AI, Aria's architecture was built from scratch using multimodal and language data, and the company reports state-of-the-art results across a variety of tasks. The architecture is a fine-grained mixture-of-experts model with 3.9 billion activated parameters per token, so each token is processed by only a subset of the model's total parameters, which keeps per-token compute low and improves parameter utilization.
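To make the idea of "activated parameters per token" concrete, the following is a minimal sketch of top-k expert routing, the general technique behind mixture-of-experts layers. It is not Aria's implementation: the function names, the number of experts, the gate logits, and the value of top_k are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token.

    Weights are the gate probabilities of the chosen experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, gate_logits, experts, top_k):
    """Run only the selected experts and combine their outputs by weight."""
    out = [0.0] * len(token)
    for idx, weight in route_token(gate_logits, top_k):
        expert_out = experts[idx](token)  # experts not selected are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Hypothetical toy setup: 4 "experts", each scaling its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 1.5]  # assumed router scores for a single token

routes = route_token(gate_logits, top_k=2)
output = moe_layer([1.0, -1.0], gate_logits, experts, top_k=2)
print(routes)
print(output)
```

In this toy case only two of the four experts run for the token, which is the sense in which an MoE model can hold many parameters while activating only some of them per token.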
Rashid Iqbal, a machine learning engineer, raised considerations regarding Aria’s architecture:
Impressive release! Aria’s Mixture-of-Experts architecture and novel multimodal training approach certainly set it apart. However, I am curious about the practical implications of using 25.3B parameters with only 3.9B active—does this lead to increased latency or inefficiency in certain applications?
Also, while beating giants like GPT-4o and Gemini-1.5 on benchmarks is fantastic, it is crucial to consider how it performs in real-world scenarios beyond controlled tests.
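The parameter counts in that question can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: it uses the 25.3B total and 3.9B active figures quoted above, assumes 2 bytes per parameter (16-bit weights) as an illustrative precision, and ignores activations, the KV cache, and any framework overhead. The constant and function names are hypothetical.

```python
TOTAL_PARAMS = 25.3e9   # total parameters, as quoted above
ACTIVE_PARAMS = 3.9e9   # activated parameters per token, as quoted above
BYTES_PER_PARAM = 2     # assumption: 16-bit weights


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Share of the model's parameters that contribute to each token's compute.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# All experts normally have to be resident in memory, even though
# only the active subset is used for any given token.
total_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
active_gb = weight_memory_gb(ACTIVE_PARAMS, BYTES_PER_PARAM)

print(f"active fraction: {active_fraction:.1%}")
print(f"weights to store: {total_gb:.1f} GB")
print(f"weights used per token: {active_gb:.1f} GB")
```

Under these assumptions, per-token compute scales with the active parameters, while the memory needed to hold the weights scales with the total, which is the trade-off the question points at.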
In benchmarking tests, Aria has outperformed other open models, such as Pixtral-12B and Llama3.2-11B, and performs competitively against proprietary models like GPT-4o and Gemini-1.5. The model excels in areas like document understanding, scene text recognition, chart reading, and video comprehension, underscoring its suitability for complex, multimodal tasks.
Source: https://huggingface.co/rhymes-ai/Aria
To support development, Rhymes AI has released a codebase for Aria that includes model weights, a technical report, and guidance for using and fine-tuning the model with various datasets. The codebase also includes best practices to streamline adoption for different applications, with support for frameworks like vLLM. All resources are made available under the Apache 2.0 license.
Aria’s efficiency extends to its hardware requirements. In response to a community question about the necessary GPU for inference, Leonardo Furia explained:
ARIA’s MoE architecture activates only 3.5B parameters during inference, allowing it to potentially run on a consumer-grade GPU like the NVIDIA RTX 4090. This makes it highly efficient and accessible for a wide range of applications.
Addressing a question from the community about whether there are plans to offer Aria via API, Rhymes AI confirmed that API support is on the roadmap for future models.
With Aria’s release, Rhymes AI encourages participation from researchers, developers, and organizations in exploring and developing practical applications for the model. This collaborative approach aims to further enhance Aria’s capabilities and explore new potential for multimodal AI integration across different fields.
For those interested in trying or training the model, it is available for free on Hugging Face.