MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu∗ Jun Chen∗ Xiaoqian Shen Xiang Li Mohamed Elhoseiny
King Abdullah University of Science and Technology
{deyao.zhu, jun.chen, xiaoqian.shen, xiang.li.1, mohamed.elhoseiny}@kaust.edu.sa
Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such
as directly generating websites from handwritten text and identifying humorous
elements within images. These features are rarely observed in previous vision-language
models. We believe the primary reason for GPT-4’s advanced multi-modal
generation capabilities lies in utilizing a more advanced large language
model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns
a frozen visual encoder with a frozen LLM, Vicuna, using just one projection
layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to
those exhibited by GPT-4 like detailed image description generation and website
creation from hand-written drafts. Furthermore, we also observe other emerging
capabilities in MiniGPT-4, including writing stories and poems inspired by given
images, providing solutions to problems shown in images, teaching users how to
cook based on food photos, etc. In our experiments, we found that performing pretraining on raw
image-text pairs alone could produce unnatural language outputs that lack coherence, such as
repetition and fragmented sentences. To address
this problem, we curate a high-quality, well-aligned dataset in the second stage to
finetune our model using a conversational template. This step proved crucial for
augmenting the model’s generation reliability and overall usability. Notably, our
model is highly computationally efficient, as we only train a projection layer utilizing
approximately 5 million aligned image-text pairs. Our code, pre-trained model,
and collected dataset are available at https://minigpt-4.github.io/.
1. Introduction
In recent years, large language models (LLMs) have experienced rapid advancements [21, 18, 4, 24,
32, 9, 14]. With exceptional language understanding capabilities, these models can perform a variety
of intricate linguistic tasks in a zero-shot manner. Notably, GPT-4 [19], a large-scale multimodal
model, has recently been introduced, demonstrating many impressive capabilities. For example,
GPT-4 can produce very detailed and accurate image descriptions, explain unusual visual phenomena,
and even construct websites based on handwritten text instructions.
Although GPT-4 has exhibited remarkable capabilities, the methods behind its exceptional abilities
are still a mystery [19]. We believe that these superior skills may stem from the utilization of a
more advanced large language model (LLM). LLMs have demonstrated various emergent abilities, as
evidenced in GPT-3’s few-shot prompting setup [4] and the findings of Wei et al. (2022) [34]. Such
emergent properties are hard to find in smaller-scale models. It is conjectured that these emergent abilities are also applicable to multi-modal models, which could be the foundation of GPT-4’s
impressive visual description capabilities.
To substantiate our hypothesis, we present a novel model named MiniGPT-4. It utilizes an advanced
large language model (LLM), Vicuna [8], which is built upon LLaMA [32] and reported to achieve
90% of ChatGPT’s quality as per GPT-4’s evaluation, as the language decoder. In terms of visual
perception, we employ the same pretrained vision component of BLIP-2 [16] that consists of a
ViT-G/14 from EVA-CLIP [13] and a Q-Former. MiniGPT-4 adds a single projection layer to align
the encoded visual features with the Vicuna language model and freezes all the other vision and
language components. MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4
A100 GPUs, leveraging a combined dataset that includes images from LAION [26], Conceptual
Captions [5, 27], and SBU [20] to align visual features with the Vicuna language model. However,
simply aligning the visual features with the LLM is insufficient to train a high-performing model
with visual conversation abilities like a chatbot, and the noise underlying the raw image-text pairs
may result in incoherent language output. Therefore, we collect another 3,500 high-quality aligned
image-text pairs to further fine-tune the model with a designed conversational template in order to
improve the naturalness of the generated language and its usability.
In our experiments, we discovered that MiniGPT-4 possesses numerous capabilities similar to those
demonstrated by GPT-4. For instance, MiniGPT-4 can generate intricate image descriptions, create
websites based on handwritten text instructions, and explain unusual visual phenomena. Furthermore,
our findings revealed that MiniGPT-4 also has a variety of other intriguing abilities not showcased
in the GPT-4 demonstrations. For example, MiniGPT-4 can directly generate detailed recipes by
observing appetizing food photos, craft stories or rap songs inspired by images, write advertisements
for products in images, distinguish problems shown in photos and provide corresponding solutions,
and retrieve rich facts about people, movies, or art directly from images, among other capabilities.
These abilities are absent in previous vision-language models like Kosmos-1 [15] and BLIP-2 [16],
which do not apply a stronger language model such as Vicuna. This contrast validates that integrating
visual features with an advanced language model can yield emergent vision-language abilities.
We present a summary of our key findings:
• Our research reveals that by aligning visual features with the advanced large language model,
Vicuna, we can achieve emergent vision-language capabilities. We demonstrate that our
MiniGPT-4 exhibits abilities like those showcased in the GPT-4 demonstrations.
• By utilizing a pre-trained vision encoder and a large language model, MiniGPT-4 achieves
greater computational efficiency. Our findings suggest that training merely one projection
layer can effectively align the visual features with the large language model. Our MiniGPT-4
only requires training for approximately 10 hours on 4 A100 GPUs.
• We discovered that simply aligning visual features with large language models using raw
image-text pairs from public datasets is not sufficient for developing a well-performing
MiniGPT-4 model. It may produce unnatural language outputs that lack coherency, including
repetition and fragmented sentences. Addressing this limitation requires training with a
high-quality, well-aligned dataset, significantly improving its usability.
2. Related Works
Large language models have experienced tremendous success in recent years due to the scaling
up of training data and an increase in the number of parameters. Early models, such as BERT [11],
GPT-2 [22], and T5 [23], laid the foundation for this progress. Subsequently, GPT-3 [4], with a
massive scale of 175 billion parameters, was introduced, demonstrating significant breakthroughs
across numerous language benchmarks. This development inspired the creation of various other
large language models, including Megatron-Turing NLG [28], Chinchilla [14], PaLM [9], OPT [38],
BLOOM [25], and LLaMA [32], among others. Wei et al. [34] further discovered several emergent
abilities, which appear exclusively in large models. The emergence of these abilities underscores
the importance of scaling up in the development of large language models. Moreover, by aligning
the pre-trained large language model GPT-3 with human intent, instructions and human feedback,
InstructGPT [21] and ChatGPT [18] enable conversational interactions with humans and can answer
a wide range of diverse and complex questions. More recently, several open-sourced models, such
as Alpaca [30] and Vicuna [8], have been developed based on LLaMA [32] and also exhibit similar
performance.

Figure 1: The architecture of MiniGPT-4. It consists of a vision encoder with a pretrained ViT and
Q-Former, a single linear projection layer, and an advanced Vicuna large language model. MiniGPT-4
only requires training the linear projection layer to align the visual features with Vicuna.
Leveraging Pre-trained LLMs in Vision-Language Tasks.
In recent years, the trend of using autoregressive language models as decoders in vision-language
tasks has gained significant traction [6, 15, 36, 31, 2, 16, 17, 12]. This approach takes advantage of
cross-modal transfer, allowing knowledge to be shared between language and multimodal domains.
Pioneering studies like VisualGPT [6] and Frozen [33] have demonstrated the benefits of employing a
pre-trained language model as a
vision-language model decoder. Flamingo [2] was then developed to align a pre-trained vision
encoder and language model using gated cross-attention and was trained on billions of image-text
pairs, showcasing impressive in-context few-shot learning capabilities. Following that, BLIP-2 [16]
was introduced, employing a Flan-T5 [10] with a Q-Former to efficiently align visual features with the
language model. Most recently, PaLM-E [12], featuring 562 billion parameters, has been developed
to integrate real-world continuous sensor modalities into an LLM, thereby establishing a connection
between real-world perceptions and human languages. GPT-4 [19] has also been recently released,
showcasing more powerful visual understanding and reasoning abilities after pre-training on a vast
collection of aligned image-text data.
LLMs, such as ChatGPT, have proven to be powerful tools in enhancing the performance of vision-language
tasks by collaborating with other specialized models. For instance, Visual ChatGPT [35]
and MM-REACT [37] showcase how ChatGPT can act as a coordinator, integrating with diverse
visual foundation models and facilitating their collaboration to tackle more complex challenges.
ChatCaptioner [39] treats ChatGPT as a questioner, prompting diverse questions for BLIP-2 to
answer. Through multi-round conversations, ChatGPT extracts visual information from BLIP-2
and effectively summarizes the image content. Video ChatCaptioner [7] extends this approach,
applying it to video spatiotemporal understanding. ViperGPT [29] demonstrates the potential of
combining an LLM with different vision models to address complex visual queries programmatically.
In contrast, MiniGPT-4 directly aligns visual information with the language model to accomplish
diverse vision-language tasks without the usage of external vision models.
3. Method
MiniGPT-4 aims to align visual information from a pretrained vision encoder with an advanced large
language model (LLM). Specifically, we utilize Vicuna [8] as our language decoder, which is
constructed upon LLaMA [32] and can perform a wide range of complex linguistic tasks. For visual
perception, we employ the same visual encoder as used in BLIP-2 [16], a ViT backbone [13] coupled
with their pre-trained Q-Former. Both the language and vision models are open-sourced. We aim to
bridge the gap between the visual encoder and the LLM using a linear projection layer, with an overview
of our model displayed in Fig. 1.
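To make this design concrete, the following is a minimal PyTorch sketch of the architecture under our stated assumptions: the module names (vision_encoder, q_former, llm) and the feature dimensions are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class MiniGPT4Sketch(nn.Module):
    """Minimal sketch of the MiniGPT-4 design: a frozen ViT + Q-Former produces
    visual query tokens, a single trainable linear layer projects them into the
    LLM embedding space, and a frozen Vicuna decoder consumes them."""

    def __init__(self, vision_encoder, q_former, llm,
                 q_former_dim=768, llm_dim=4096):  # dimensions are illustrative
        super().__init__()
        self.vision_encoder = vision_encoder  # frozen ViT-G/14 (placeholder module)
        self.q_former = q_former              # frozen pretrained Q-Former (placeholder module)
        self.llm = llm                        # frozen Vicuna decoder (placeholder module)
        # The only trainable component: one linear projection layer.
        self.proj = nn.Linear(q_former_dim, llm_dim)

        # Freeze every parameter except those of the projection layer.
        for module in (self.vision_encoder, self.q_former, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        # [B, num_query_tokens, q_former_dim] visual query embeddings
        patch_feats = self.vision_encoder(image)
        query_feats = self.q_former(patch_feats)
        # Project into the LLM's input embedding space: [B, num_query_tokens, llm_dim]
        return self.proj(query_feats)
```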
To achieve an effective MiniGPT-4, we propose a two-stage training approach. The initial stage
involves pretraining the model on a large collection of aligned image-text pairs to acquire vision-language
knowledge. In the second stage, we fine-tune the pretrained model with a smaller but
high-quality image-text dataset with a designed conversational template to enhance the model’s
generation reliability and usability.
3.1 First pretraining stage
During the initial pretraining stage, the model is designed to acquire vision-language knowledge from
a large collection of aligned image-text pairs. We regard the output from the injected projection layer
as a soft prompt for the LLM, prompting it to generate the corresponding ground-truth texts.
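As an illustration of this soft-prompt setup, here is a hedged sketch of how the projected visual features could be prepended to the caption embeddings and supervised with a standard language-modeling loss. It assumes a Hugging Face-style causal LM interface (inputs_embeds, labels with -100 ignored) and reuses the encode_image sketch above, so it is not the exact training code.

```python
import torch

def caption_loss(model, image, caption_ids):
    """Prepend the projected visual features as a soft prompt and compute the
    autoregressive loss on the ground-truth caption tokens only."""
    img_embeds = model.encode_image(image)                       # [B, T_img, D]
    txt_embeds = model.llm.get_input_embeddings()(caption_ids)   # [B, T_txt, D]
    inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)

    # Positions holding the image soft prompt are masked out of the loss with -100.
    ignore = torch.full(img_embeds.shape[:2], -100,
                        dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([ignore, caption_ids], dim=1)

    return model.llm(inputs_embeds=inputs_embeds, labels=labels).loss
```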
Throughout the entire pretraining process, both the pretrained vision encoder and the LLM remain
frozen, with only the linear projection layer being pretrained. We use a combined dataset of Conceptual
Captions [5, 27], SBU [20], and LAION [26] to train our model. Our model undergoes 20,000
training steps with a batch size of 256, covering approximately 5 million image-text pairs. The entire
process takes about 10 hours to complete, utilizing 4 A100 (80GB) GPUs.
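A rough sketch of this stage is shown below, assuming the model and caption_loss sketches above and a dataset (image_text_dataset, a placeholder name) yielding (image, tokenized caption) pairs; the optimizer choice and learning rate are illustrative, not the paper's exact hyperparameters.

```python
import torch
from torch.utils.data import DataLoader

# Only the projection layer is optimized; the vision encoder and LLM stay frozen.
optimizer = torch.optim.AdamW(model.proj.parameters(), lr=1e-4)  # lr is illustrative
loader = DataLoader(image_text_dataset, batch_size=256, shuffle=True)

step = 0
while step < 20_000:              # ~20k steps x batch 256 covers ~5M image-text pairs
    for image, caption_ids in loader:
        loss = caption_loss(model, image, caption_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= 20_000:
            break
```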
Issues of the first pretraining stage. Following the first pretraining stage, our MiniGPT-4 demonstrates
a wealth of knowledge and offers reasonable responses to human
inquiries. However, we have observed instances where it struggles to produce coherent linguistic
output, such as generating repetitive words or sentences, fragmented sentences, or irrelevant content.
These issues hinder MiniGPT-4’s ability to engage in a fluent visual conversation with humans.
We have noticed that similar issues were also encountered by GPT-3. Despite being pretrained on an
extensive language dataset, GPT-3 could not directly generate language outputs that are in accordance
with the users’ intentions. Through a process of instruction fine-tuning and reinforcement learning
from human feedback, GPT-3 evolves into GPT-3.5 [21, 18] and becomes capable of producing more
human-friendly outputs. This phenomenon bears a resemblance to the current state of MiniGPT-4
following its initial pretraining stage. As such, it is not surprising that our model may struggle to
generate fluent and natural human language outputs at this stage.
3.2 Curating a high-quality alignment dataset for the vision-language domain
To achieve greater naturalness in the generated language and enhance the model’s usability, a second-stage
alignment process is essential. While in the realm of NLP, instruction fine-tuning datasets
[30] and conversations [1] are easily accessible, no equivalent datasets exist for the vision-language
domain. To address this deficiency, we carefully curated a high-quality image-text dataset, specifically
tailored for alignment purposes. This dataset is subsequently utilized to fine-tune our MiniGPT-4
during the second-stage alignment process.
Initial aligned image-text generation
In the initial phase, we employ the model derived from
the first pretraining stage to generate a comprehensive description of a given image. To enable our
model to produce more detailed image descriptions, we have designed a prompt that adheres to
the conversational format of the Vicuna [8] language model, as shown below:
###Human: <Img><ImageFeature></Img> Describe this image in detail. Give as many details as
possible. Say everything you see. ###Assistant:
In this prompt, <ImageFeature> represents the visual features produced by the linear projection
layer.
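One plausible way to splice the projected features into this template is sketched below; it assumes a Hugging Face-style tokenizer and the encode_image sketch above, and the helper name wrap_image_prompt is ours, not the released code.

```python
import torch

def wrap_image_prompt(model, tokenizer, img_embeds, instruction):
    """Embed the Vicuna-style conversation template around the projected image
    features, which take the place of the <ImageFeature> placeholder."""
    prompt = f"###Human: <Img><ImageFeature></Img> {instruction} ###Assistant: "
    before, after = prompt.split("<ImageFeature>")

    device = img_embeds.device
    embed = model.llm.get_input_embeddings()
    before_ids = tokenizer(before, return_tensors="pt").input_ids.to(device)
    after_ids = tokenizer(after, return_tensors="pt",
                          add_special_tokens=False).input_ids.to(device)

    # Concatenate text embeddings and the visual soft prompt along the sequence axis.
    return torch.cat([embed(before_ids), img_embeds, embed(after_ids)], dim=1)
```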
To identify incomplete sentences, we examine whether the generated sentence exceeds 80 tokens. If
it does not, we incorporate an additional prompt, ###Human: Continue ###Assistant: , prompting
our MiniGPT-4 to extend the generation. By concatenating the outputs from both steps, we can create
a more comprehensive image description. This approach enables us to generate more image-text
pairs with detailed and informative image descriptions. We randomly select 5,000 images from the
Conceptual Captions dataset [5, 27] and employ this approach to generate corresponding language
descriptions for each image.
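The procedure can be sketched roughly as follows; generate_description is a hypothetical wrapper around the model's generation step (for example, built on the prompt-wrapping sketch above, with a hypothetical history argument carrying the earlier turn), and the 80-token threshold is the one described in the text.

```python
def describe_in_detail(generate_description, tokenizer, image) -> str:
    """Generate a detailed caption; if the first pass is shorter than 80 tokens,
    prompt the model to continue and concatenate both outputs."""
    first = generate_description(
        image,
        "Describe this image in detail. Give as many details as possible. "
        "Say everything you see.",
    )
    if len(tokenizer(first).input_ids) < 80:
        # Ask the model to extend its answer via the "###Human: Continue" prompt.
        continuation = generate_description(image, "Continue", history=first)
        return (first + " " + continuation).strip()
    return first.strip()
```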
Data post-processing: The generated image descriptions still contain considerable noise and
errors, such as repetition of words or sentences and incoherent statements. To mitigate these issues,
we employ ChatGPT to refine the descriptions using the following prompt:
Fix the error in the given paragraph. Remove any repeating sentences, and meaningless characters, not
English sentences, and so on. Remove unnecessary repetition. Rewrite any incomplete sentences.
Return the results directly without explanation. Return the input paragraph directly if it is already
correct without explanation.
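A sketch of this refinement step using the OpenAI chat API is shown below; the choice of gpt-3.5-turbo and the 0.x-style openai client are our assumptions, since the paper does not specify the exact ChatGPT version or client.

```python
import openai  # assumes the 0.x openai client; exact ChatGPT setup not specified in the paper

FIX_PROMPT = (
    "Fix the error in the given paragraph. Remove any repeating sentences, and "
    "meaningless characters, not English sentences, and so on. Remove unnecessary "
    "repetition. Rewrite any incomplete sentences. Return the results directly "
    "without explanation. Return the input paragraph directly if it is already "
    "correct without explanation."
)

def refine_description(description: str) -> str:
    """Ask ChatGPT to clean up one generated image description."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model; not stated in the paper
        messages=[{"role": "user", "content": f"{FIX_PROMPT}\n\n{description}"}],
    )
    return response.choices[0].message["content"].strip()
```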
Upon completing the post-processing stage, we manually verify the correctness of each image
description to guarantee its high quality. Specifically, we check if each generated image description
follows our desired format and manually refine the generated captions by eliminating redundant
words or sentences that ChatGPT fails to detect. Finally, only approximately 3,500 out of 5,000
image-text pairs satisfy our requirement, and these pairs are subsequently utilized for the second-stage
alignment process.
3.3 Second-stage finetuning
During the second stage, we finetune our pretrained model with the curated high-quality image-text
pairs. For finetuning, we use the predefined prompts in the following template:
###Human: <Img><ImageFeature></Img> <Instruction> ###Assistant:
In this prompt, <Instruction> represents a randomly sampled instruction from our predefined instruction
set containing variant forms of instructions such as “Describe this image in detail” or “Could
you describe the contents of this image for me”. It is important to note that we do not calculate the
regression loss for this specific text-image prompt.
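Concretely, the instruction sampling and the loss masking can be sketched as follows, reusing the prompt-wrapping sketch above; the helper names and the label convention (-100 for positions excluded from the loss, as in Hugging Face causal LMs) are our assumptions rather than the released code.

```python
import random
import torch

INSTRUCTIONS = [
    "Describe this image in detail",
    "Could you describe the contents of this image for me",
    # ...further variants from the predefined instruction set
]

def build_finetune_inputs(model, tokenizer, img_embeds, answer_ids):
    """Sample an instruction, wrap it around the image features, and mask all
    prompt positions so the loss covers only the assistant's answer tokens."""
    instruction = random.choice(INSTRUCTIONS)
    prompt_embeds = wrap_image_prompt(model, tokenizer, img_embeds, instruction)
    # answer_ids: target response tokens (tokenized without special tokens).
    answer_embeds = model.llm.get_input_embeddings()(answer_ids)
    inputs_embeds = torch.cat([prompt_embeds, answer_embeds], dim=1)

    # -100 marks positions excluded from the loss (the whole image-text prompt).
    ignore = torch.full(prompt_embeds.shape[:2], -100,
                        dtype=torch.long, device=answer_ids.device)
    labels = torch.cat([ignore, answer_ids], dim=1)
    return inputs_embeds, labels
```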
As a result, MiniGPT-4 can now produce more natural and reliable responses. Furthermore,
we have observed that the model’s finetuning process is remarkably efficient, requiring only
400 training steps with a batch size of 12, which takes about 7 minutes to complete on a single A100
GPU.
4. Demonstrations
Our MiniGPT-4 exhibits many capabilities similar to those demonstrated by GPT-4. These
include generating detailed image descriptions (Fig. 2), identifying amusing aspects within images
(Fig. 3), and uncovering unique content (Fig. 4). Additionally, the model can generate websites
from handwritten text (Fig. 5). We have also discovered that our MiniGPT-4 possesses other abilities,
such as identifying problems in images and providing solutions (Fig. 6), creating poems or rap songs
inspired by images (Fig. 7), writing stories for images (Fig. 8), making advertisements for products
in images (Fig. 9), identifying individuals (Fig. 10), providing insightful image comments (Fig. 11),
retrieving facts related to images (Fig. 12), and teaching users to cook foods with given photos (Fig.
13). These diverse examples showcase the strong capabilities of our MiniGPT-4.
5. Limitations
Although MiniGPT-4 possesses numerous advanced vision-language capabilities, as displayed in our
demonstrations, it currently still faces several limitations.
Language hallucination. As MiniGPT-4 is built upon LLMs, it inherits the LLM’s limitations, such as
unreliable reasoning ability and hallucinating nonexistent knowledge. This issue might be alleviated
by training the model with more high-quality, aligned image-text pairs, or aligning with more
advanced LLMs in the future.

Figure 2: Detailed image descriptions
Inadequate perception capacities. MiniGPT-4’s visual perception remains limited. It may struggle
to recognize detailed textual information in images and to differentiate spatial localization. This
limitation may stem from several factors: 1) A lack of sufficient aligned image-text data containing
adequate information such as spatial localization and optical character annotations. This issue could
be alleviated by training on more well-aligned and rich data. 2) The frozen Q-Former used in the visual
encoder may lose some essential features, such as visual-spatial grounding. This could potentially be
improved by replacing it with a stronger visual perception model. 3) Training only one projection
layer might not provide enough capacity to learn extensive visual-text alignment.

Figure 3: Identifying amusing aspects within images

Figure 4: Discovering unusual content (Images are from WHOOPS dataset [3])

Figure 5: Generating website code from handwritten text and the rendered website

Figure 6: Identifying problems from photos and providing solutions

Figure 7: Rhyme generation

Figure 8: Story generation
References
[1] Sharegpt. https://github.com/domeccleston/sharegpt, 2023.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for
few-shot learning. In Advances in Neural Information Processing Systems, 2022.
[3] Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky,
and Roy Schwartz. Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and
compositional images. arXiv preprint arXiv:2303.07274, 2023.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901, 2020.
[5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale
image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
[6] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation
of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 18030–18040, 2022.
[7] Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. Video chatcaptioner:
Towards the enriched spatiotemporal descriptions. arXiv preprint arXiv:2304.04227, 2023.
[8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source
chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language
modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[10] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv
preprint arXiv:2210.11416, 2022.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan
Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language
model. arXiv preprint arXiv:2303.03378, 2023.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang,
and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint
arXiv:2211.07636, 2022.
[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal
large language models. arXiv preprint arXiv:2203.15556, 2022.
[15] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei
Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with
language models. arXiv preprint arXiv:2302.14045, 2023.
[16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training
with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[17] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation. In International Conference on Machine
Learning, pages 12888–12900. PMLR, 2022.
[18] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
[19] OpenAI. Gpt-4 technical report, 2023.
[20] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned
photographs. Advances in neural information processing systems, 24, 2011.
[21] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[22] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter
open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter
open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[26] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush
Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400
million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[27] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565,
2018.
[28] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper,
Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to
train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990,
2022.
[29] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for
reasoning. arXiv preprint arXiv:2303.08128, 2023.
[30] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.
com/tatsu-lab/stanford_alpaca, 2023.
[31] Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven CH Hoi. Plug-and-play vqa:
Zero-shot vqa by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773,
2022.
[32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation
language models. arXiv preprint arXiv:2302.13971, 2023.
[33] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal
few-shot learning with frozen language models. Advances in Neural Information Processing Systems,
34:200–212, 2021.
[34] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy
Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on
Machine Learning Research, 2022. Survey Certification.
[35] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt:
Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
[36] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question
answering via frozen bidirectional language models. arXiv preprint arXiv:2206.08155, 2022.
[37] Zhengyuan Yang*, Linjie Li*, Jianfeng Wang*, Kevin Lin*, Ehsan Azarnasab*, Faisal Ahmed*, Zicheng
Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and
action. 2023.
[38] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models.
arXiv preprint arXiv:2205.01068, 2022.
[39] Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elhoseiny.
Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint
arXiv:2303.06594, 2023.