Unveiling NExT-GPT: An Open Source Marvel Blending AI Mastery Across Audio, Visuals, and Text


In a field where multimodal offerings from giants like Google’s Gemini and OpenAI’s vision-enabled ChatGPT hold sway, a rising star known as NExT-GPT emerges as a formidable contender. This open-source multimodal large language model (LLM) is poised to make its mark in a landscape dominated by those two companies.

ChatGPT took the world by storm with its adeptness at comprehending natural language queries and crafting human-like responses. However, as AI has progressed, the demand for greater capabilities has soared. Sole reliance on text is becoming a thing of the past as the era of multimodal LLMs dawns.

Developed through a collaboration between the National University of Singapore (NUS) and Tsinghua University, NExT-GPT can process and generate combinations of text, images, audio, and video. This versatility gives it an edge over text-only models like the basic ChatGPT tool and allows for more natural, lifelike interactions.

The creators of NExT-GPT champion it as an “any-to-any” system, signifying its adaptability to accept inputs across diverse modalities and furnish responses in the most fitting format.

The potential for rapid evolution looms large. Because NExT-GPT is open source, users can customize it and tailor it to their specific requirements. This democratization of access empowers creators to shape the technology for maximal impact, much as the community has reshaped Stable Diffusion since its initial release.

But how does NExT-GPT operate, you might wonder? As explained in the model’s research paper, the system uses separate modules to convert inputs such as images and audio into text-like representations that the core language model can understand. A technique known as “modality-switching instruction tuning” is introduced to enhance its cross-modal reasoning capabilities, enabling the model to switch smoothly between modalities during a conversation.
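To make the idea of a “text-like representation” concrete, here is a minimal sketch of one way an encoder’s output could be mapped into a language model’s embedding space. It assumes the conversion is a simple learned linear map (y = Wx + b), which the article does not specify; the function name `project`, the toy dimensions, and the weight values are all hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Map an encoder feature vector into a (toy) LLM embedding space: y = W x + b."""
    if len(weights) != len(bias):
        raise ValueError("weights and bias must have the same number of rows")
    out = []
    for row, b in zip(weights, bias):
        if len(row) != len(features):
            raise ValueError("weight row width must match feature size")
        out.append(sum(w * x for w, x in zip(row, features)) + b)
    return out


# Hypothetical toy sizes: a 3-dimensional "image feature" projected
# into a 2-dimensional "LLM embedding". Real models use far larger sizes.
image_feature = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.5, -1.0]

llm_embedding = project(image_feature, W, b)
print(llm_embedding)
```

In a real system the weights would be learned during training rather than written by hand; the point here is only that a non-text input ends up as a vector of the same shape the language model already works with.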

To process inputs, NExT-GPT uses distinct tokens for each modality, with dedicated tokens for images, audio, and video. Each type of input is converted into embeddings that the language model can understand. The language model can then produce response text, accompanied by special signal tokens that trigger generation in other modalities.
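The input side described above can be sketched as follows: text is split into tokens, while each non-text input is represented by a run of embedding slots wrapped in modality-specific delimiter tokens. The delimiter strings (`<Img>`, `<Aud>`, `<Vid>`), the function `build_sequence`, and the placeholder slot names are assumptions made for this sketch, not the token names the model actually uses.

```python
from typing import List, Tuple, Union

# Hypothetical delimiter tokens; the article does not give the real strings.
DELIMITERS = {
    "image": ("<Img>", "</Img>"),
    "audio": ("<Aud>", "</Aud>"),
    "video": ("<Vid>", "</Vid>"),
}

Part = Tuple[str, Union[str, int]]


def build_sequence(parts: List[Part]) -> List[str]:
    """Interleave text tokens with delimited placeholder slots for other modalities.

    For "text" parts the payload is a string; for other modalities the payload
    is the number of embedding slots that input would occupy.
    """
    sequence: List[str] = []
    for kind, payload in parts:
        if kind == "text":
            sequence.extend(str(payload).split())
        elif kind in DELIMITERS:
            open_tok, close_tok = DELIMITERS[kind]
            sequence.append(open_tok)
            sequence.extend(f"[{kind}_emb_{i}]" for i in range(int(payload)))
            sequence.append(close_tok)
        else:
            raise ValueError(f"unknown modality: {kind}")
    return sequence


prompt = build_sequence([("text", "Describe this"), ("image", 2)])
print(prompt)
```

Each placeholder slot stands in for one embedding vector produced from the non-text input, so the language model sees a single sequence that mixes text and other modalities.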

For instance, a video signal token in the response instructs the video decoder to produce a corresponding video output. Using dedicated tokens for both input and output modalities is what makes the any-to-any conversion possible. Different decoders then generate the output for each modality: Stable Diffusion for image decoding, AudioLDM for audio decoding, and Zeroscope for video decoding. Additionally, Vicuna serves as the foundational LLM, while ImageBind encodes the inputs.
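The output routing described above can be illustrated with a small dispatcher that scans a response for signal tokens and reports which decoder each one would trigger. The decoder names come from the article; the signal token strings (`[IMG]`, `[AUD]`, `[VID]`) and the function `route` are hypothetical, and the sketch only names the decoder rather than calling any real model.

```python
import re
from typing import List, Tuple

# Hypothetical signal tokens mapped to the decoders named in the article.
SIGNAL_TO_DECODER = {
    "[IMG]": "Stable Diffusion",
    "[AUD]": "AudioLDM",
    "[VID]": "Zeroscope",
}
SIGNAL_PATTERN = re.compile(r"\[(?:IMG|AUD|VID)\]")


def route(response: str) -> Tuple[str, List[str]]:
    """Split a response into plain text and the decoders its signal tokens trigger."""
    decoders = [SIGNAL_TO_DECODER[token] for token in SIGNAL_PATTERN.findall(response)]
    # Remove the signal tokens, then normalize leftover whitespace.
    text = " ".join(SIGNAL_PATTERN.sub("", response).split())
    return text, decoders


text, decoders = route("Here is a clip of the waves. [VID]")
print(text)
print(decoders)
```

A response with no signal tokens simply comes back as text with an empty decoder list, which corresponds to an ordinary text-only reply.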

In essence, NExT-GPT combines the strengths of several specialized AI models into a single, all-purpose system that could reshape the AI landscape.

Notably, NExT-GPT achieves this flexible “any-to-any” conversion while training only about 1% of its total parameters. The remaining parameters belong to pretrained modules and stay frozen, a design the researchers highlight for its efficiency.
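The arithmetic behind a figure like “1% of parameters” is straightforward, as the sketch below shows. The parameter counts are invented round numbers in arbitrary units, not the model’s real sizes, and the assumption that the trainable part consists of small projection layers between frozen modules is ours rather than something the article states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Module:
    name: str
    params: int       # parameter count in arbitrary, made-up units
    trainable: bool   # False means the module stays frozen during training


def trainable_fraction(modules: List[Module]) -> float:
    """Return the share of all parameters that are updated during training."""
    total = sum(m.params for m in modules)
    if total == 0:
        raise ValueError("no parameters to count")
    trained = sum(m.params for m in modules if m.trainable)
    return trained / total


# Hypothetical breakdown chosen so the numbers are easy to check by hand.
modules = [
    Module("encoder (frozen)", 1_000, False),
    Module("input projection", 30, True),
    Module("LLM (frozen)", 7_000, False),
    Module("output projection", 70, True),
    Module("decoders (frozen)", 1_900, False),
]

fraction = trainable_fraction(modules)
print(f"{fraction:.1%} of parameters are trainable")
```

Here 100 of 10,000 units are trainable, which gives 1%. The practical appeal of such a design is that only the small trainable share needs gradient updates, while the large pretrained modules are reused as they are.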

While a demo site has been established to provide individuals with a taste of NExT-GPT’s capabilities, its availability remains intermittent.

At a time when tech giants like Google and OpenAI are rolling out their own multimodal AI offerings, NExT-GPT stands as an open-source alternative, inviting creators to embark on their own AI journeys. Multimodality is key to natural, seamless interactions, and by open-sourcing NExT-GPT, the researchers are offering a launchpad for the community to take AI to new heights.