Deyao Zhu∗ Jun Chen∗ Xiaoqian Shen Xiang Li Mohamed Elhoseiny
King Abdullah University of Science and Technology
{deyao.zhu, jun.chen, xiaoqian.shen, xiang.li.1, mohamed.elhoseiny}@kaust.edu.sa
Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such
as directly generating websites from handwritten text and identifying humorous
elements within images. These features are rarely observed in previous vision-language
models. We believe the primary reason for GPT-4's advanced multi-modal
generation capabilities lies in its use of a more advanced large language
model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns
a frozen visual encoder with a frozen LLM, Vicuna, using just one projection
layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to
those exhibited by GPT-4 like detailed image description generation and website
creation from hand-written drafts. Furthermore, we also observe other emerging
capabilities in MiniGPT-4, including writing stories and poems inspired by given
images, providing solutions to problems shown in images, teaching users how to
cook based on food photos, etc. In our experiments, we found that pretraining on raw
image-text pairs alone could produce unnatural language outputs that lack coherency,
including repetition and fragmented sentences. To address this problem, we curate a
high-quality, well-aligned dataset and use it in a second stage to finetune our model
with a conversational template. This step proved crucial for improving the model's
generation reliability and overall usability. Notably, our
model is highly computationally efficient, as we only train a projection layer utilizing
approximately 5 million aligned image-text pairs. Our code, pre-trained model,
and collected dataset are available at https://minigpt-4.github.io/.
1. Introduction
In recent years, large language models (LLMs) have experienced rapid advancements [21, 18, 4, 24,
32, 9, 14]. With exceptional language understanding capabilities, these models can perform a variety
of intricate linguistic tasks in a zero-shot manner. Notably, GPT-4 [19], a large-scale multimodal
model, has recently been introduced, demonstrating many impressive capabilities. For example,
GPT-4 can produce very detailed and accurate image descriptions, explain unusual visual phenomena,
and even construct websites based on handwritten text instructions.
Although GPT-4 has exhibited remarkable capabilities, the methods behind its exceptional abilities
are still a mystery [19]. We believe that these superior skills may stem from the utilization of a
more advanced large language model (LLM). LLMs have demonstrated various emergent abilities, as
evidenced in GPT-3’s few-shot prompting setup [4] and the findings of Wei et al. (2022) [34]. Such
emergent properties are hard to find in smaller-scale models. We conjecture that these emergent abilities also apply to multi-modal models, which could be the foundation of GPT-4's
impressive visual description capabilities.
To substantiate our hypothesis, we present a novel model named MiniGPT-4. It utilizes an advanced
large language model (LLM), Vicuna [8], which is built upon LLaMA [32] and reported to achieve
90% of ChatGPT’s quality as per GPT-4’s evaluation, as the language decoder. In terms of visual
perception, we employ the same pretrained vision component of BLIP-2 [16] that consists of a
ViT-G/14 from EVA-CLIP [13] and a Q-Former. MiniGPT-4 adds a single projection layer to align
the encoded visual features with the Vicuna language model and freezes all the other vision and
language components. MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4
A100 GPUs, leveraging a combined dataset that includes images from LAION [26], Conceptual
Captions [5, 27], and SBU [20] to align visual features with the Vicuna language model. However,
simply aligning the visual features with the LLM is insufficient to train a high-performing model
with chatbot-like visual conversation abilities, and the noise underlying the raw image-text pairs
may result in incoherent language output. Therefore, we collect another 3,500 high-quality aligned
image-text pairs to further fine-tune the model with a designed conversational template in order to
improve the naturalness of the generated language and its usability.
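To make this alignment design concrete, the following is a minimal, illustrative PyTorch-style sketch of the setup described above: a frozen vision stack (ViT plus Q-Former) and a frozen Vicuna decoder bridged by a single trainable linear projection. The class name, the identity stand-in modules, and the feature dimensions are our own assumptions for illustration, not the released implementation.

```python
# Minimal sketch (not the released code): a single trainable linear projection
# bridges a frozen vision stack and a frozen language model.
import torch
import torch.nn as nn


class MiniGPT4Sketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, qformer: nn.Module,
                 llm: nn.Module, qformer_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # frozen ViT-G/14 (EVA-CLIP)
        self.qformer = qformer                        # frozen Q-Former from BLIP-2
        self.llm = llm                                # frozen Vicuna decoder
        self.proj = nn.Linear(qformer_dim, llm_dim)   # the only trainable layer

        # Freeze every vision and language parameter; only self.proj is updated.
        for module in (self.vision_encoder, self.qformer, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        # Image -> patch features -> query tokens -> Vicuna's embedding space.
        with torch.no_grad():
            patch_features = self.vision_encoder(image)
            query_tokens = self.qformer(patch_features)
        return self.proj(query_tokens)


if __name__ == "__main__":
    # Identity stand-ins just exercise the wiring; real components would come
    # from the pretrained BLIP-2 vision stack and Vicuna.
    model = MiniGPT4Sketch(nn.Identity(), nn.Identity(), nn.Identity(),
                           qformer_dim=768, llm_dim=4096)
    fake_query_features = torch.randn(1, 32, 768)         # pretend Q-Former output
    print(model.encode_image(fake_query_features).shape)  # torch.Size([1, 32, 4096])
```

Under this kind of setup, only the projection layer would receive gradients during the first-stage pretraining, which is consistent with the modest compute budget reported below.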
In our experiments, we discovered that MiniGPT-4 possesses numerous capabilities similar to those
demonstrated by GPT-4. For instance, MiniGPT-4 can generate intricate image descriptions, create
websites based on handwritten text instructions, and explain unusual visual phenomena. Furthermore,
our findings revealed that MiniGPT-4 also has a variety of other intriguing abilities not showcased
in the GPT-4 demonstrations. For example, MiniGPT-4 can directly generate detailed recipes by
observing appetizing food photos, craft stories or rap songs inspired by images, write advertisements
for products in images, distinguish problems shown in photos and provide corresponding solutions,
and retrieve rich facts about people, movies, or art directly from images, among other capabilities.
These abilities are absent in previous vision-language models like Kosmos-1 [15] and BLIP-2 [16],
which do not use a stronger language model such as Vicuna. This contrast indicates that integrating
visual features with an advanced language model can yield emergent vision-language abilities.
We present a summary of our key findings:
• Our research reveals that by aligning visual features with the advanced large language model,
Vicuna, we can achieve emergent vision-language capabilities. We demonstrate that our
MiniGPT-4 exhibits abilities similar to those showcased in the GPT-4 demonstrations.
• By utilizing a pre-trained vision encoder and a large language model, MiniGPT-4 achieves
greater computational efficiency. Our findings suggest that training merely one projection
layer can effectively align the visual features with the large language model. Our MiniGPT-4
only requires training for approximately 10 hours on 4 A100 GPUs.
• We discovered that simply aligning visual features with large language models using raw
image-text pairs from public datasets is not sufficient for developing a well-performing
MiniGPT-4 model. It may produce unnatural language outputs that lack coherency, including
repetition and fragmented sentences. Addressing this limitation requires training with a
high-quality, well-aligned dataset, which significantly improves the model's usability (a sketch of
one possible conversational formatting step follows this list).
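As an illustration of the second-stage formatting mentioned above, here is a minimal Python sketch that wraps an aligned image-description pair in a simple conversational template. The delimiter tokens and prompt wording below are assumptions made for this example rather than the exact released format.

```python
# Hypothetical second-stage sample formatter: turns one aligned image-text pair
# into a single-turn conversation so finetuning teaches dialogue-style responses
# rather than raw caption completion. The tokens below are illustrative assumptions.
IMAGE_PLACEHOLDER = "<Img><ImageFeature></Img>"  # replaced by projected visual tokens at training time


def build_conversation(instruction: str, response: str) -> str:
    """Format one aligned image-text pair with a conversational template."""
    return (
        f"###Human: {IMAGE_PLACEHOLDER} {instruction}\n"
        f"###Assistant: {response}"
    )


if __name__ == "__main__":
    print(build_conversation(
        "Describe this image in detail.",
        "The photo shows a wooden table with a bowl of fresh strawberries ...",
    ))
```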
2. Related Work
3. Method
4. Demonstrations
Our MiniGPT-4 exhibits many capabilities similar to those demonstrated by GPT-4. These
include generating detailed image descriptions (Fig. 2), identifying amusing aspects within images
(Fig. 3), and uncovering unique content (Fig. 4). Additionally, the model can generate websites
from handwritten text (Fig. 5). We have also discovered that our MiniGPT-4 possesses other abilities, such as identifying problems in images and providing solutions (Fig. 6), creating poems or rap songs
inspired by images (Fig. 7), writing stories for images (Fig. 8), making advertisements for products
in images (Fig. 9), identifying individuals (Fig. 10), providing insightful image comments (Fig. 11),
retrieving facts related to images (Fig. 12), and teaching users how to cook based on given food photos (Fig. 13). These diverse examples showcase the strong capabilities of our MiniGPT-4.
5. Limitations
References