Multimodal models are increasingly being used across industries, driven by significant advancements in language and vision model capabilities. However, LLMs cannot understand visual information, and Large Vision Models (LVMs) struggle with reasoning tasks. This complementary nature has led to the development of Multimodal Large Language Models (MLLMs), which combine the strengths of both LLMs and LVMs to handle and generate multimodal information effectively. Yet despite their advanced capabilities, MLLMs, like all LLMs, are prone to hallucination. Let's explore hallucinations in multimodal models.
Here is an example where GPT-4o hallucinated when asked, “Are you sure?” This behavior was explored in a recent paper by Anthropic.
Multimodality in LLMs refers to the ability of these models to handle and reason with multiple types of data, such as text, images, audio, and video. This integration allows MLLMs to perform tasks that require understanding and generating information across different modalities. For instance, models like CLIP project visual and textual data into a unified representation space, facilitating various downstream tasks. On the other hand, models like GPT-4o adopt a sequence-to-sequence approach to unify multimodal tasks.
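To make the shared embedding space concrete, here is a minimal sketch that scores an image against candidate captions using the Hugging Face transformers implementation of CLIP; the checkpoint name, image path, and captions are illustrative assumptions, not taken from this article.

```python
# Sketch: scoring an image against candidate captions in CLIP's shared
# embedding space, using the Hugging Face `transformers` implementation.
# The checkpoint name, image path, and captions are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")
captions = ["a photo of a bus", "a photo of a bicycle", "a photo of a red car"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```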
MLLMs, characterized by their large-scale parameters and new training paradigms, exhibit unique capabilities such as generating website code from images, understanding memes, and performing OCR-free math reasoning.
One example of such a model is NExT-GPT, which can perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
Multimodal models are transforming various sectors by integrating and interpreting diverse data types. Let's talk about some early applications in the industry.
GPT-4 on Apple Intelligence for everyday tasks: Integrated into iOS, iPadOS, and macOS, Apple Intelligence assists with tasks like writing text and creating images. It includes on-device and server-based models for efficient performance, though effectiveness varies based on task complexity.
Sora for short video generation: OpenAI’s Sora generates intricate scenes with multiple characters and specific motions, maintaining visual consistency. However, it struggles with simulating complex physics and understanding specific cause-and-effect relationships.
Video summarization: Twelve Labs' Generate API suite creates detailed video summaries and descriptions, improving engagement and accessibility. This is useful in media for creating previews and enriching metadata, though customization may need specific user prompts.
Suno AI for music creation: Suno AI enables music creation without instruments, making it accessible for all users. However, its effectiveness depends on the clarity of the user's musical ideas.
Runway AI for multimedia content generation: Runway AI's Gen-1 and Gen-2 models generate high-quality videos and images, benefiting industries like advertising and entertainment.
Multimodal models are driving innovation across numerous industries by facilitating the interpretation and generation of various modalities. Their applications range from enhancing everyday tasks to generating complex multimedia content, thereby significantly improving productivity and creativity in many fields.
Below is an example of how Runway's Gen-3 Alpha generates videos from text descriptions.
LVLMs (Large Vision Language Models) have gained significant traction for their ability to process visual and textual data simultaneously. However, similar to their text-only counterparts, LVLMs are prone to hallucinations—generating content that is not present or accurate based on the input data. Liu et al. (2024) explored Intrinsic Vision-Language Hallucination (IVL-Hallu) and proposed several novel tasks, including attribute, object, multimodal conflicting, and counter-common-sense hallucinations. They introduced a challenging benchmark dataset and conducted experiments on five LVLMs, revealing their limited effectiveness in addressing these tasks.
Object Hallucination
Object hallucination occurs when the model identifies objects that do not exist in the image. For example, in the top image, the question asks if there is a bike in the scene. The correct answer is "No," but a model might falsely report a bicycle because bicycles commonly appear in similar street scenes, even though none are present in this particular image.
Multi-Modal Conflicting Hallucination
This type of hallucination happens when the model generates responses that conflict with the visual information. In the second image, the question asks what the pedestrians are doing. The answer is "The pedestrians are walking," which might be incorrect if the visual information does not support this action, leading to a conflict between textual and visual modalities.
Counter-Common-Sense Hallucination
Counter-common-sense hallucination involves generating responses that defy common sense. In the third image, the question asks which one is bigger, and the answer is "The cat is bigger." This response is counter-intuitive as it contradicts common knowledge about the relative sizes of cats and other objects in the image.
Attribute Hallucination
Attribute hallucination occurs when the model incorrectly attributes certain features to objects. In the bottom image, the question asks if there is a red car in the scene. The correct answer is "No," but the model might confuse the red elements on the bus with a red car, leading to incorrect attribute assignment.
Researchers have been actively investigating methods to detect and mitigate hallucinations in LVLMs.
Quantifying Object Hallucinations
Researchers, including Li et al. (2023), discovered that visual instructions significantly influence hallucinations in LVLMs. They introduced POPE – a method for evaluating object hallucinations, offering improved stability and flexibility.
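To illustrate the polling idea behind this kind of evaluation, here is a minimal sketch of a POPE-style check: yes/no questions are built from ground-truth annotations, balanced between present and absent objects, and a skewed yes-ratio or low precision flags object hallucination. The `query_lvlm` callable, object pool, and metric bookkeeping are hypothetical simplifications, not the official POPE implementation.

```python
# Sketch of a POPE-style polling evaluation for object hallucination.
# `query_lvlm(image_path, question)` is a hypothetical stand-in for the model
# under test and should return "yes" or "no"; the object pool and metric
# bookkeeping are simplified for illustration.
import random

CANDIDATE_OBJECTS = ["person", "car", "bicycle", "dog", "bench", "traffic light"]

def build_questions(present_objects, n_negative=3):
    """Build balanced yes/no probes from ground-truth object annotations."""
    absent = [o for o in CANDIDATE_OBJECTS if o not in present_objects]
    negatives = random.sample(absent, min(n_negative, len(absent)))
    questions = [(f"Is there a {o} in the image?", "yes") for o in present_objects]
    questions += [(f"Is there a {o} in the image?", "no") for o in negatives]
    return questions

def evaluate(samples, query_lvlm):
    """samples: iterable of (image_path, present_objects) pairs."""
    tp = fp = tn = fn = yes_count = total = 0
    for image_path, present in samples:
        for question, label in build_questions(present):
            answer = query_lvlm(image_path, question).strip().lower()
            total += 1
            yes_count += answer == "yes"
            if label == "yes":
                tp += answer == "yes"
                fn += answer != "yes"
            else:
                fp += answer == "yes"
                tn += answer != "yes"
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / max(total, 1),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-9),
        "yes_ratio": yes_count / max(total, 1),  # strong skew signals a "yes" bias
    }
```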
Benchmarking Object Hallucination Evaluation
Lovenia et al. (2023) addressed the absence of standardized metrics for object hallucination assessment by creating NOPE, a benchmark for evaluating hallucinations via visual question answering, using a dataset of 29.5k synthetic instances.
Training-Free and API-Free Mitigation Approaches
Zhao et al. (2024) developed MARINE to reduce object hallucinations without expensive training. This method uses open-source vision models and guidance to incorporate object grounding features, and it has shown effectiveness across six LVLMs.
Human Error Detection for Mitigating Hallucinations
Yu et al. (2023) explored human error detection to mitigate hallucinations in MLLMs, successfully reducing hallucinations by 44.6% while maintaining competitive performance.
Chart Comprehension Benchmark
Xu et al. (2023) proposed ChartBench to evaluate MLLMs' chart comprehension, emphasizing the need for new metrics due to limited reasoning abilities with complex charts.
Fine-Grained Hallucination Detection Dataset
Gunjal et al. (2024) introduced M-HalDetect, a comprehensive multi-modal dataset for fine-grained hallucination detection, aiming to train LVLMs for more accurate outputs.
Comprehensive Visual Instruction Dataset
Liu et al. (2023) developed LRV-Instruction, a dataset of 400k visual instructions across 16 tasks, and GAVIE, a method for evaluating visual instruction tuning. They also introduced the LURE algorithm to correct object hallucinations by refining descriptions.
The development of LVLMs relies heavily on annotated benchmark datasets, which can exhibit domain bias and limit model generative capabilities.
Visual Instruction Dataset
Li et al. (2023) proposed a novel data collection approach that synthesizes images and dialogues synchronously for visual instruction tuning, yielding a large dataset of image-dialogue pairs and multi-image instances.
A Comprehensive Benchmark Dataset
Huang et al. (2024) introduced VHTest, a benchmark dataset with 1,200 diverse visual hallucination instances across eight modes.
Categorization for Visual Hallucinations
Rawte et al. (2024) categorized visual hallucinations into eight orientations and proposed three main categories of methods to mitigate hallucinations: data-driven approaches, training adjustments, and post-processing techniques.
In conclusion, while MLLMs and LVLMs have made significant strides in handling multimodal data, the issue of hallucinations remains a critical challenge. Ongoing research and innovative approaches are essential to enhance the accuracy and reliability of these models for enterprise adoption.
Large Video Models (LVMs) hold great potential for applications like video understanding and generation. Hallucinations occur when the model misinterprets video frames, leading to artificial or inaccurate visual data. Figure 5 illustrates instances of hallucinations observed in LVMs.
Video hallucinations can manifest in various forms, including inaccuracies in dense video captioning, video infilling, and prediction tasks. Dense video captioning involves creating descriptions for multiple events within a continuous video, which requires a deep understanding of the video content and contextual reasoning. Traditional methods often risk hallucinations by overlooking temporal dependencies, leading to inaccurate descriptions.
Video infilling and prediction tasks assess a model’s ability to comprehend and anticipate temporal dynamics within video sequences. Challenges arise from the lack of cross-frame information, leading to difficulties in propagating missing pixels across frames. Understanding scene affordances, which involve potential actions and interactions within a scene, is also crucial for comprehending videos accurately.
OpenAI’s Sora struggles to simulate the physics of a complex scene, and may not comprehend specific instances of cause and effect (for example: a cookie might not show a mark after a character bites it). The model may also confuse spatial details included in a prompt, such as discerning left from right, or struggle with precise descriptions of events that unfold over time, like specific camera trajectories.
Researchers have proposed various methods to detect and mitigate hallucinations in LVMs.
Weakly-Supervised Factuality Metric: FactVC
Liu and Wan (2023) proposed a weakly-supervised, model-based factuality metric called FactVC, which outperforms previous metrics in assessing the factuality of video captions. They also provided two annotated datasets to promote further research in this area.
Context-Aware Model for Improved Captioning
Wu and Gao (2023) developed a context-aware model that incorporates information from past and future events to influence the description of the current event conditionally. Their approach utilizes a robust pre-trained context encoder to encode information about surrounding context events, integrated into the captioning module using a gate-attention mechanism.
Streaming Model for Dense Video Captioning
Zhou et al. (2024) introduced a streaming model for dense video captioning, comprising a memory module for handling long videos and a streaming decoding algorithm enabling predictions before video completion. This approach notably boosts performance on prominent dense video captioning benchmarks.
Deficiency-Aware Masked Transformer for Video Inpainting
Yu et al. (2023) tackled video inpainting challenges by presenting a Deficiency-aware Masked Transformer (DMT), a dual-modality-compatible inpainting framework. This approach improves handling scenarios with incomplete information by pre-training an image inpainting model to serve as a prior for training the video model.
Curriculum Learning for Enhanced Video Captioning
Chuang and Fazli (2023) introduced CLearViD, a transformer-based model utilizing curriculum learning techniques to enhance performance. By adopting this approach, the model acquires more robust and generalizable features.
Benchmark evaluation plays a crucial role in developing and assessing LVMs.
YouCook2: Dataset for Procedure Learning
Zhou et al. (2018) assembled the YouCook2 dataset, an extensive set of cooking videos with temporally localized and described procedural segments, to facilitate procedure learning tasks.
VideoChat: Enhancing Video Understanding with LLMs
Li et al. (2023) introduced "VideoChat," a novel approach integrating video foundation models and LLMs through a learnable neural interface to enhance spatiotemporal reasoning, event localization, and causal relationship inference in video understanding. They constructed a video-centric instruction dataset with detailed descriptions and conversations, emphasizing spatiotemporal reasoning and causal relationships. To counteract model hallucination, they employed a multi-step process to condense video descriptions into coherent narratives using GPT-4 and refined them for clarity and coherence.
Scene Dataset for Realistic Video Models
To explore the challenge of deducing scene affordances, Kulal et al. (2023) curated a dataset of 2.4M video clips, showcasing a variety of plausible poses that align with the scene context. This dataset aids in understanding potential actions and interactions within a scene, contributing to more accurate and realistic video models.
Large audio models (LAMs) have emerged as powerful tools in audio processing and generation, finding applications in speech recognition, music analysis, audio synthesis, and captioning. Despite their impressive capabilities, these models are prone to hallucinations. These anomalies can manifest as unrealistic audio snippets, fabricated quotes or facts, and inaccuracies in capturing audio features like timbre, pitch, or background noise.
Nishimura et al. (2024) classified audio hallucinations in LAMs into three distinct types:
1. Involving hallucinations of both objects and actions
2. Featuring accurate objects but hallucinated actions
3. Displaying correct actions but hallucinated objects
Detecting and mitigating hallucinations in audio models is essential for enhancing their reliability in audio-related applications.
Reducing Hallucinations in Audio Captioning
In the context of audio captioning, where natural language descriptions are generated for audio clips, over-reliance on the visual modality during pre-training can introduce hallucinations. To address this, Xu et al. (2023) introduced an AudioSet tag-guided model called BLAT, which minimizes noise by avoiding the incorporation of video data. Experimental results across various tasks, including retrieval, generation, and classification, demonstrated BLAT's effectiveness in reducing hallucinations.
Addressing Hallucinations in Speech Emotion Recognition
Speech emotion recognition is another area where hallucinations can occur. Traditional categorization approaches often fail to capture the nuanced nature of emotions in speech. SECap (Xu et al., 2024) is a framework designed for speech emotion captioning, utilizing components like LLaMA as the text decoder, HuBERT as the audio encoder, and Q-Former as the Bridge-Net to generate coherent emotion captions based on speech features.
Mitigating Task-Specific Hallucinations in Audio-Language Models
Despite their capability for zero-shot inference, audio-language models can hallucinate task-specific details. To mitigate this, Elizalde et al. (2024) introduced the Contrastive Language-Audio Pretraining (CLAP) model. Pre-trained with 4.6 million diverse audio-text pairs, CLAP features a dual-encoder architecture that enhances representation learning for improved task generalization across sound, music, and speech domains.
Benchmarking is essential for evaluating the performance of LAMs and identifying hallucinations.
Introduction to Benchmarking in Music Captioning
Doh et al. (2023) addressed the scarcity of data in music captioning by introducing LP-MusicCaps, a comprehensive dataset comprising 0.5 million audio clips with approximately 2.2 million captions. They trained a transformer-based music captioning model with this dataset, demonstrating its superiority over supervised baseline models in zero-shot and transfer-learning scenarios.
Audio Hallucinations in Audio-Video Language Models
Nishimura et al. (2024) investigated audio hallucinations in large audio-video language models, where audio descriptions are generated primarily from visual information while the audio content is neglected. They categorized these hallucinations into three types and gathered 1,000 sentences by prompting the models for audio information, annotating them to identify auditory hallucinations.
Compositional Reasoning in Language-Audio Models
To assess compositional reasoning in LAMs, Ghosh et al. (2023) introduced CompA, consisting of two expert-annotated benchmarks focused on real-world audio samples. This benchmark was used to fine-tune CompA-CLAP with a novel learning approach, significantly improving its compositional reasoning skills over baseline models.
In conclusion, while LAMs have significantly progressed in audio processing and generation, hallucinations remain a critical challenge. Ongoing research and innovative approaches are essential to enhance the accuracy and reliability of these models, ensuring their effective application across various domains.
Hallucinations in LVLMs can stem from various factors throughout the model's pipeline. Understanding these causes is useful for developing effective mitigation strategies.
The quality of training data significantly impacts the performance and reliability of LVLMs. Several issues in existing training data can foster hallucinations:
Data Bias: Training data often suffer from distribution imbalances, such as a predominance of "Yes" answers in factual judgment QA pairs. This bias can lead LVLMs to consistently provide affirmative responses, even to incorrect or misleading prompts. Additionally, data homogeneity can impede the model's ability to understand diverse visual information and execute instructions in varied environments.
Annotation Errors: A significant portion of instruction data is synthesized from image-caption and detection data using LLMs. However, this approach can lead to annotation irrelevance, where generated instructions contain objects, attributes, and relationships that do not correspond to the fine-grained content depicted in the images. Training on such data can catalyze hallucinations.
The vision encoders used in LVLMs, often derived from CLIP, map visual and textual features to the same space through contrastive learning. Despite their excellent performance on various visual tasks, these encoders have limitations:
Limited Visual Resolution: Higher image resolution can enhance object recognition accuracy and perception of visual details, thereby reducing hallucinations. However, handling higher resolutions is computationally demanding, leading existing models to use smaller resolutions, such as 224×224 or 336×336 pixels.
Fine-grained Visual Semantics: CLIP primarily focuses on salient objects, often failing to capture fine-grained aspects of an image, such as background details, object counting, and object relations. This limitation can result in hallucinations when the model attempts to describe these aspects.
The connection module in LVLMs projects visual features into the LLM's word embedding space, aligning visual and textual modalities. Misalignment in this process can lead to hallucinations:
Connection Module Simplicity: Simple structures, such as linear layers, are commonly used as connection modules. While cost-efficient, these simple structures hinder comprehensive multimodal alignment, increasing the risk of hallucinations.
Limited Token Constraints: Modules like Q-Former encode a predetermined number of tokens into visual features aligned with text. The restricted number of tokens can prevent encoding all the information present in images, leading to information loss and hallucinations.
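A simplified sketch of this bottleneck, assuming a Q-Former-like resampler in which a fixed set of learnable query tokens cross-attends to a much longer patch sequence; the module, dimensions, and token counts below are illustrative, not the BLIP-2 Q-Former implementation.

```python
# Simplified sketch of a learnable-query resampler: a fixed number of query
# tokens cross-attend to a longer sequence of visual features, so only
# `num_queries` tokens reach the LLM and detail beyond their capacity is lost.
import torch
import torch.nn as nn

class VisualResampler(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, visual_features):            # (batch, num_patches, vis_dim)
        batch = visual_features.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, visual_features, visual_features)
        return self.proj(attended)                  # (batch, num_queries, llm_dim)

features = torch.randn(2, 576, 1024)                # e.g., a 24x24 ViT patch grid
print(VisualResampler()(features).shape)            # torch.Size([2, 32, 4096])
```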
LLMs play a pivotal role in LVLMs, enhancing their ability to process complex multimodal tasks. However, this integration also introduces inherent hallucination challenges:
Insufficient Context Attention: During the decoding process, the model may focus only on partial context information, such as over-focusing on the current segment of generated content while ignoring input visual information. This can result in fluent yet inaccurate content.
Stochastic Sampling Decoding: This decoding strategy introduces randomness to prevent generating low-quality text. However, the randomness can amplify the risk of hallucinations.
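A toy sketch of this effect: with temperature and nucleus (top-p) sampling, low-probability continuations, including ones not grounded in the image, can still be drawn, whereas greedy decoding always picks the top token. The vocabulary and logits below are invented purely for illustration.

```python
# Toy illustration of how sampling randomness can surface ungrounded tokens.
# The vocabulary and logits are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bus", "car", "bicycle", "dragon"]          # "dragon" is ungrounded
logits = np.array([3.2, 2.1, 1.0, 0.2])

def sample(logits, temperature=1.0, top_p=1.0):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]   # nucleus cutoff
    kept_probs = probs[keep] / probs[keep].sum()
    return vocab[rng.choice(keep, p=kept_probs)]

print("greedy:", vocab[int(np.argmax(logits))])               # always "bus"
samples = [sample(logits, temperature=1.5, top_p=0.95) for _ in range(20)]
print("sampled:", samples)                                    # occasionally off-distribution
```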
Capability Misalignment: The disparity between the model's inherent capabilities established during pre-training and the expanded requirements during instruction tuning can lead to responses beyond the model's knowledge limits, increasing the potential for hallucinations.
To address the identified causes of hallucinations in LVLMs, various mitigation strategies have been proposed.
Bias Mitigation: Addressing data bias involves generating balanced question-answer pairs and creating diverse datasets. For instance, CIEM uses off-the-shelf LLMs to generate contrastive QA pairs, while LRV-Instruction proposes a dataset with both positive and negative visual instructions. Ferret mines negative samples by replacing original categories, attributes, or quantity information with fake ones, enhancing model robustness.
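In the spirit of these contrastive and negative-instruction strategies, here is a minimal sketch of mining negative QA pairs by swapping in absent objects and wrong attributes; the annotation format and replacement pools are assumptions, not the format of any of the cited datasets.

```python
# Sketch: generating balanced positive/negative QA pairs from object
# annotations by swapping in absent objects and wrong attributes.
import random

OBJECT_POOL = ["bicycle", "fire hydrant", "umbrella", "dog", "traffic cone"]
COLOR_POOL = ["red", "green", "purple", "yellow"]

def make_qa_pairs(annotation):
    """annotation: {'objects': [{'name': str, 'color': str}, ...]}"""
    pairs = []
    present = {o["name"] for o in annotation["objects"]}
    for obj in annotation["objects"]:
        # Positive pair grounded in the annotation.
        pairs.append((f"Is there a {obj['name']} in the image?", "Yes"))
        # Negative pair: ask about an object that is not annotated.
        fake = random.choice([o for o in OBJECT_POOL if o not in present])
        pairs.append((f"Is there a {fake} in the image?", "No"))
        # Negative attribute pair: swap the color for a different one.
        wrong_color = random.choice([c for c in COLOR_POOL if c != obj["color"]])
        pairs.append((f"Is the {obj['name']} {wrong_color}?", "No"))
    return pairs

example = {"objects": [{"name": "bus", "color": "red"}]}
for question, answer in make_qa_pairs(example):
    print(question, "->", answer)
```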
Annotation Enrichment: Constructing richly annotated datasets helps supervise LVLMs to extract visual content accurately and align modalities comprehensively. Notable datasets include M-HalDetect, GRIT, and others that provide detailed annotations and instructions.
Scaling-up Vision Resolution: Gradually increasing image resolution can improve object recognition accuracy and perception of visual details. Approaches like MONKEY process high-resolution images by dividing them into patches, while InternVL scales up the vision encoder itself to handle larger input resolutions.
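A minimal sketch of the "global view plus local patches" idea behind such high-resolution approaches: the image is resized once for a coarse overview and also tiled into crops that each match the encoder's native input size. The patch size and tiling scheme are illustrative assumptions.

```python
# Sketch: producing a coarse global view plus local crops so a fixed-resolution
# encoder can still see fine detail in a high-resolution image.
from PIL import Image

def make_views(image_path, patch_size=448):
    image = Image.open(image_path).convert("RGB")
    global_view = image.resize((patch_size, patch_size))
    patches = []
    width, height = image.size
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            box = (left, top, min(left + patch_size, width), min(top + patch_size, height))
            patches.append(image.crop(box).resize((patch_size, patch_size)))
    # Each view is encoded separately and the resulting tokens are concatenated,
    # giving the LLM access to fine detail without a monolithic high-res encoder.
    return global_view, patches
```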
Perceptual Enhancement: Enhancing the object-level perceptual capability involves using additional perception modalities, such as segmentation or depth maps, as control inputs. Spatial awareness can be improved by introducing pre-trained models to acquire spatial position information and scene graph details.
Connection Module Enhancement: Enhancing the connection module involves upgrading from simple structures to more capable ones. For example, LLaVA-1.5 upgrades from a single linear layer to an MLP, while QLLaMA significantly outperforms Q-Former in aligning visual features with text.
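A sketch of this kind of upgrade, assuming a vision-to-LLM projector: a single linear layer versus a two-layer MLP with a GELU. Dimensions are illustrative defaults, not those of any specific released model.

```python
# Sketch: a linear connector versus a two-layer MLP connector that maps
# vision-encoder features into the LLM's embedding space.
import torch
import torch.nn as nn

class LinearConnector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):                          # x: (batch, num_patches, vis_dim)
        return self.proj(x)

class MLPConnector(nn.Module):
    """Two-layer projector with a GELU, in the spirit of the MLP upgrade."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        return self.proj(x)

visual_tokens = torch.randn(1, 576, 1024)
print(MLPConnector()(visual_tokens).shape)          # torch.Size([1, 576, 4096])
```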
Alignment Training Optimization: Strengthening the alignment training process can reduce hallucinations. This includes adding new learning objectives to bring visual and textual tokens closer and employing Reinforcement Learning from Human Feedback (RLHF) to align different modalities.
Decoding Optimization: Optimizing the decoding process can mitigate hallucinations by enabling the model to focus on proper contexts. Strategies like OPERA modify beam search with a weighted scoring system, while visual contrastive decoding contrasts outputs from original and altered visuals to correct over-reliance on language priors.
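A compact sketch of the visual-contrastive-decoding idea: next-token logits conditioned on the original image are contrasted with logits conditioned on a distorted copy, penalizing tokens driven purely by language priors. The `lvlm_logits` interface and the weighting factor are hypothetical.

```python
# Sketch of contrasting logits from a clean image and a distorted image so
# that prior-driven (hallucination-prone) tokens are suppressed.
import torch

def contrastive_next_token(lvlm_logits, image, distorted_image, prompt, alpha=1.0):
    logits_clean = lvlm_logits(image, prompt)            # (vocab_size,)
    logits_distorted = lvlm_logits(distorted_image, prompt)
    # Boost tokens supported by the real image, suppress prior-driven tokens.
    contrasted = (1 + alpha) * logits_clean - alpha * logits_distorted
    return int(torch.argmax(contrasted))
```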
Aligning with Human Preferences: Training LVLMs to align with human preferences can improve response quality. Methods like LLaVA-RLHF use RLHF to align models with human preferences, while Direct Preference Optimization (DPO) trains models directly from human preference data.
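A minimal sketch of the DPO objective mentioned here: the policy is pushed to prefer the human-chosen response over the rejected one relative to a frozen reference model. The log-probability tensors are placeholders; in practice they are summed token log-likelihoods of each response.

```python
# Sketch of the Direct Preference Optimization loss on preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss)
```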
Post-processing methods involve refining generated content to reduce hallucinations. Techniques like LURE and Woodpecker input visual data, user instructions, and LVLM responses, and output refined, hallucination-free responses. These methods leverage insights into causative factors and employ structured visual knowledge bases to correct inaccuracies.
Addressing hallucinations in LVLMs requires a multifaceted approach. By implementing these strategies, we can enhance the accuracy and reliability of LVLMs in handling multimodal tasks.
In conclusion, MLLMs frequently generate inaccurate content across text, images, audio, and video, posing significant challenges. This survey has systematically identified and categorized these hallucinations, delving into their underlying causes. We have reviewed effective mitigation strategies, emphasizing the necessity for ongoing research and innovative approaches. By confronting these hallucinations head-on, we can improve the accuracy and reliability of multimodal models. This advancement is crucial for their successful adoption in real-world scenarios, ultimately paving the way for more trustworthy GenAI solutions.
Galileo's GenAI analytics offers a transformative approach, providing unparalleled visibility into your hallucinations and simplifying evaluation to improve LLM performance. Try GenAI Studio for yourself today!
A Survey on Multimodal Large Language Models
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies
A Survey on Hallucination in Large Vision-Language Models
Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
Hallucination of Multimodal Large Language Models: A Survey
Visual Hallucination: Definition, Quantification, and Prescriptive Remediations