Created by Virginia Tech's NLP Lab.
Interleaved text-and-image generation has been an intriguing research direction, in which models are required to generate both images and text pieces in an arbitrary order. Despite emerging advances in interleaved generation, progress in its evaluation still lags significantly behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics, which fall short of assessing quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks covering diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o that delivers accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate existing models, and that InterleavedEval achieves a stronger correlation with human judgments than previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
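To make the aspect-wise, reference-free evaluation concrete, below is a minimal sketch of how a GPT-4o judge over the five aspects could be implemented. The prompt wording, the 0-5 scale, and the helper names (`score_aspect`, `encode_image`) are illustrative assumptions, not the exact InterleavedEval prompts.

```python
# A minimal sketch of an aspect-wise, reference-free GPT-4o judge.
# Prompt wording, the 0-5 scale, and helper names are assumptions for
# illustration; this is not the exact InterleavedEval implementation.
import base64
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ASPECTS = [
    "text quality",
    "perceptual quality",
    "image coherence",
    "text-image coherence",
    "helpfulness",
]

def encode_image(path: str) -> str:
    """Read a local image file and return a data URL that GPT-4o accepts."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

def score_aspect(instruction: str, output_text: str,
                 image_paths: list[str], aspect: str) -> int:
    """Ask GPT-4o to rate one aspect of an interleaved output on a 0-5 scale."""
    content = [{
        "type": "text",
        "text": (
            f"Task instruction: {instruction}\n"
            f"Generated text: {output_text}\n"
            "The attached images are the generated images, in order.\n"
            f"Rate the {aspect} of this interleaved text-and-image output on a "
            "scale of 0 (worst) to 5 (best). Reply with a single integer."
        ),
    }] + [
        {"type": "image_url", "image_url": {"url": encode_image(p)}}
        for p in image_paths
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    match = re.search(r"[0-5]", response.choices[0].message.content)
    return int(match.group()) if match else 0

# Usage: score all five aspects for one generated sample.
# scores = {a: score_aspect(instr, text, images, a) for a in ASPECTS}
```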
We introduce INTERLEAVEDBENCH, the first comprehensive benchmark meticulously constructed to evaluate text-and-image interleaved generation.
Our dataset includes two subsets:
context-based: a subset where each instance contains a multimodal context of interleaved text and images in the input (first row in the figure above)
context-free: a subset with text-only inputs (second row in the figure above). The context-free subset assesses whether a model can creatively generate interleaved content from a text-only instruction, while the context-based subset better benchmarks the coherence and consistency of the generated outputs.
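For illustration, one way a single benchmark instance could be represented is sketched below. The field names and examples are hypothetical assumptions, not the released data schema.

```python
# A hypothetical sketch of one InterleavedBench instance; field names and
# example values are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class InterleavedInstance:
    instruction: str                                          # detailed human-annotated task instruction
    context_text: list[str] = field(default_factory=list)     # interleaved text context; empty for context-free
    context_images: list[str] = field(default_factory=list)   # image paths in the input; empty for context-free
    # Gold interleaved output: a sequence of ("text", str) or ("image", path) segments.
    reference_output: list[tuple[str, str]] = field(default_factory=list)

# Context-based example: interleaved text-and-image input.
context_based = InterleavedInstance(
    instruction="Continue this travel blog with one more paragraph and a matching photo.",
    context_text=["Day 1: we arrived in Kyoto..."],
    context_images=["kyoto_day1.jpg"],
)

# Context-free example: text-only instruction.
context_free = InterleavedInstance(
    instruction="Write an illustrated three-step recipe for banana bread.",
)
```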
Dataset Name | Detailed Instruction | Image Input | Text Output | Image Output |
---|---|---|---|---|
MagicBrush | No | Single | No | Single |
DreamBench | No | Multiple | No | Single |
CustomDiffusion | No | Multiple | No | Single |
DreamEditBench | No | Multiple | No | Single |
Mantis-Eval | Yes | Multiple | Yes | No |
InterleavedBench (Ours) | Yes | Multiple | Yes | Multiple |
We highlight the following key differences and unique challenges introduced by our INTERLEAVEDBENCH compared with existing benchmarks:
(1) Output modality: our benchmark requires models to generate interleaved text and multiple images that can appear in an arbitrary order, whereas existing benchmarks only cover outputs in a single modality or with a single image;
(2) Requirement on coherence: given that both inputs and outputs in our benchmark can contain multiple pieces of text and images, our dataset can assess whether the outputs are coherent and consistent with the input instruction and context, as well as within the outputs themselves;
(3) Instruction following: each instance in our benchmark contains a detailed human-annotated instruction describing the task, so our dataset can evaluate models' instruction-following and generalization capabilities. We summarize the differences between our benchmark and existing datasets in the table above.
(1) MiniGPT-5 (Zheng et al., 2023a), which connects a large language model with a Stable Diffusion model via generative vokens, enabling description-free multimodal generation.
(2) GILL (Koh et al., 2023) which allows a pretrained large language model to generate multimodal responses by mapping the hidden states of text into the embedding space of an image generation model.
(3) EMU-2 (Sun et al., 2023a), which induces in-context learning capabilities of LMMs by scaling up the model size and the size of the pretraining dataset.
(4) EMU-2 Gen + Gold Text, where EMU-2 Gen is a pretrained EMU-2 model instruction-tuned on various controllable image generation tasks. Since EMU-2 Gen cannot generate text, we combine its image outputs with the ground-truth textual responses to form complete interleaved text-and-image content for evaluation.
(5) GPT-4o (OpenAI, 2024) + DALL·E 3 (Betker et al.), where GPT-4o is the state-of-the-art proprietary LMM that can comprehend interleaved text-and-image inputs but generates text-only responses. We leverage GPT-4o to generate text responses as well as captions for image responses in the desired positions. The captions are then fed into DALL·E 3 to generate images, and finally we combine the text responses with the generated images in their original order (see the sketch after this list).
(6) Gemini-1.5 (Anil et al., 2023) + SDXL (Podell et al., 2023): we build this baseline in a similar way as GPT-4o + DALL·E 3 but use Gemini-1.5 Pro as the LMM and Stable Diffusion XL Turbo as the image generation model.
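Baselines (5) and (6) share the same two-stage recipe: an LMM drafts the text and per-image captions, and a text-to-image model renders each caption. Below is a minimal sketch of that recipe using the OpenAI API for both stages; the `<image>...</image>` caption convention, prompt wording, and helper names are assumptions for illustration, not the exact prompts used in the paper.

```python
# A minimal sketch of the two-stage "LMM + text-to-image" baseline.
# The <image>...</image> caption convention and helper names are assumptions.
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_interleaved(instruction: str) -> list[tuple[str, str]]:
    """Return the output as a list of ("text", str) / ("image", url) segments."""
    # Stage 1: the LMM writes the textual response and a caption for every
    # image it wants to place, marked with <image>...</image> tags.
    plan = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"{instruction}\n"
                "Write the interleaved response. Wherever an image should appear, "
                "insert a short caption wrapped in <image>...</image> tags."
            ),
        }],
    ).choices[0].message.content

    # Stage 2: each caption is handed to the text-to-image model, and the
    # generated images are spliced back in at their original positions.
    segments: list[tuple[str, str]] = []
    for i, piece in enumerate(re.split(r"<image>(.*?)</image>", plan, flags=re.S)):
        if i % 2 == 0:              # even chunks are plain text
            if piece.strip():
                segments.append(("text", piece.strip()))
        else:                       # odd chunks are the captured captions
            image = client.images.generate(model="dall-e-3", prompt=piece.strip(), n=1)
            segments.append(("image", image.data[0].url))
    return segments
```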
Model | Text Quality | Perceptual Quality | Image Coherence | Text-Image Coherence | Helpfulness | Average |
---|---|---|---|---|---|---|
MiniGPT-5 | 1.22 | 2.45 | 1.62 | 2.03 | 1.77 | 1.82 |
GILL | 0.75 | 3.21 | 2.25 | 1.53 | 1.48 | 1.84 |
EMU-2 | 1.26 | 2.28 | 1.89 | 1.34 | 1.64 | 1.68 |
EMU-2 (Gold Text) | 1.56 | 3.35 | 2.89 | 1.43 | 2.10 | 2.27 |
Gemini-1.5 + SDXL | 4.40 | 3.99 | 3.64 | 4.13 | 3.62 | 3.96 |
GPT-4o + DALL·E 3 | 4.37 | 4.36 | 3.51 | 4.55 | 3.88 | 4.13 |
Model | Text Quality | Perceptual Quality | Image Coherence | Text-Image Coherence | Helpfulness | Average |
---|---|---|---|---|---|---|
GILL | 1.35 | 1.89 | 1.72 | 1.43 | 1.19 | 1.52 |
EMU-2 | 1.23 | 1.74 | 1.87 | 1.24 | 1.20 | 1.46 |
Gemini-1.5 + SDXL | 2.59 | 2.36 | 2.13 | 2.27 | 2.08 | 2.28 |
GPT-4o + DALL·E 3 | 2.49 | 2.51 | 2.02 | 2.31 | 2.13 | 2.29 |
If you use InterleavedEval in your research, please cite the following paper.
@article{liu_holistic_2024,
author = {Minqian Liu and
Zhiyang Xu and
Zihao Lin and
Trevor Ashby and
Joy Rimchala and
Jiaxin Zhang and
Lifu Huang},
title = {Holistic Evaluation for Interleaved Text-and-Image Generation},
journal = {CoRR},
volume = {abs/2406.14643},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2406.14643},
doi = {10.48550/ARXIV.2406.14643},
eprinttype = {arXiv},
eprint = {2406.14643},
timestamp = {Tue, 16 Jul 2024 16:17:50 +0200}
}
The InterleavedEval dataset is for research purposes only. Please carefully check the licenses of the original datasets before using InterleavedEval. We provide the URLs to the original datasets and their BibTeX entries on this page. The images and tasks may be taken down at any time upon request from the original dataset owners or the owners of the referenced images. If you would like any tasks or images to be taken down, please contact Minqian Liu and Lifu Huang at minqianliu@vt.edu and lifuh@cs.vt.edu.