Created by Virginia Tech's NLP Lab.
Interleaved text-and-image generation has been an intriguing research direction, in which models are required to generate both images and text pieces in an arbitrary order. Despite emerging advances in interleaved generation, progress in its evaluation still lags significantly behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics, which fall short of assessing quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks covering diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o that delivers accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate existing models, and that InterleavedEval achieves a stronger correlation with human judgments than previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
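To make the aspect-wise, reference-free evaluation concrete, below is a minimal sketch of how a GPT-4o judge over the five aspects could be implemented. The prompt wording, the 0-5 scale, and the helper names (`score_aspect`, `encode_image`) are illustrative assumptions, not the exact InterleavedEval prompts.

```python
# A minimal sketch of an aspect-wise, reference-free GPT-4o judge.
# Prompt wording, the 0-5 scale, and helper names are assumptions for
# illustration; this is not the exact InterleavedEval implementation.
import base64
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ASPECTS = [
    "text quality",
    "perceptual quality",
    "image coherence",
    "text-image coherence",
    "helpfulness",
]

def encode_image(path: str) -> str:
    """Read a local image file and return a data URL that GPT-4o accepts."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

def score_aspect(instruction: str, output_text: str,
                 image_paths: list[str], aspect: str) -> int:
    """Ask GPT-4o to rate one aspect of an interleaved output on a 0-5 scale."""
    content = [{
        "type": "text",
        "text": (
            f"Task instruction: {instruction}\n"
            f"Generated text: {output_text}\n"
            "The attached images are the generated images, in order.\n"
            f"Rate the {aspect} of this interleaved text-and-image output on a "
            "scale of 0 (worst) to 5 (best). Reply with a single integer."
        ),
    }] + [
        {"type": "image_url", "image_url": {"url": encode_image(p)}}
        for p in image_paths
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    match = re.search(r"[0-5]", response.choices[0].message.content)
    return int(match.group()) if match else 0

# Usage: score all five aspects for one generated sample.
# scores = {a: score_aspect(instr, text, images, a) for a in ASPECTS}
```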
We introduce INTERLEAVEDBENCH, the first comprehensive benchmark meticulously constructed to evaluate text-and-image interleaved generation.
Our dataset includes two subsets:
context-based: a subset where each instance contains a multimodal context of interleaved text and images in the input (first row in the figure above)
context-free: a subset with text-only inputs (second row in the figure above). The context-free subset assesses whether a model can creatively generate interleaved content from a text-only instruction, while the context-based subset better benchmarks the coherence and consistency of the generated outputs.
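For illustration, one way a single benchmark instance could be represented is sketched below. The field names and examples are hypothetical assumptions, not the released data schema.

```python
# A hypothetical sketch of one InterleavedBench instance; field names and
# example values are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class InterleavedInstance:
    instruction: str                                          # detailed human-annotated task instruction
    context_text: list[str] = field(default_factory=list)     # interleaved text context; empty for context-free
    context_images: list[str] = field(default_factory=list)   # image paths in the input; empty for context-free
    # Gold interleaved output: a sequence of ("text", str) or ("image", path) segments.
    reference_output: list[tuple[str, str]] = field(default_factory=list)

# Context-based example: interleaved text-and-image input.
context_based = InterleavedInstance(
    instruction="Continue this travel blog with one more paragraph and a matching photo.",
    context_text=["Day 1: we arrived in Kyoto..."],
    context_images=["kyoto_day1.jpg"],
)

# Context-free example: text-only instruction.
context_free = InterleavedInstance(
    instruction="Write an illustrated three-step recipe for banana bread.",
)
```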
Dataset Name | Detailed Instruction | Image Input | Text Output | Image Output |
---|---|---|---|---|
MagicBrush | No | Single | No | Single |
DreamBench | No | Multiple | No | Single |
CustomDiffusion | No | Multiple | No | Single |
DreamEditBench | No | Multiple | No | Single |
Mantis-Eval | Yes | Multiple | Yes | No |
InterleavedBench (Ours) | Yes | Multiple | Yes | Multiple |
We highlight the following key differences and unique challenges introduced by our INTERLEAVEDBENCH compared with existing benchmarks:
(1) Output modality: our benchmark requires models to generate interleaved text and multiple images that can appear in an arbitrary order, whereas existing benchmarks only cover outputs in a single modality or with a single image;
(2) Requirement on coherence: given that both inputs and outputs in our benchmark can contain multiple pieces of text and images, our dataset can assess whether the outputs are coherent and consistent with the input instruction and context, as well as within the outputs themselves;
(3) Instruction following: each instance in our benchmark contains a detailed human-annotated instruction describing the task, so our dataset can evaluate models' instruction-following and generalization capabilities. We summarize the differences between our benchmark and existing datasets in the table above.
(1) MiniGPT-5 (Zheng et al., 2023a), which connects a large language model with a Stable Diffusion model via generative vokens, enabling description-free multimodal generation.
(2) GILL (Koh et al., 2023) which allows a pretrained large language model to generate multimodal responses by mapping the hidden states of text into the embedding space of an image generation model.
(3) EMU-2 (Sun et al., 2023a), which induces in-context learning capabilities of LMMs by scaling up the model size and the size of the pretraining dataset.
(4) EMU-2 Gen + Gold Text, where EMU-2 Gen is a pretrained EMU-2 model instruction-tuned on various controllable image generation tasks. Since EMU-2 Gen cannot generate text, we combine its image outputs with the ground-truth textual responses to form complete interleaved text-and-image content for evaluation.
(5) GPT-4o (OpenAI, 2024) + DALL·E 3 (Betker et al.), where GPT-4o is the state-of-the-art proprietary LMM that can comprehend interleaved text-and-image inputs but generates text-only responses. We leverage GPT-4o to generate text responses as well as captions for image responses in the desired positions. The captions are then fed into DALL·E 3 to generate images, and finally we combine the text responses with the generated images in their original order (see the sketch after this list).
(6) Gemini-1.5 (Anil et al., 2023) + SDXL (Podell et al., 2023): we build this baseline in a similar way as GPT-4o + DALL·E 3 but use Gemini-1.5 Pro as the LMM and Stable Diffusion XL Turbo as the image generation model.
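Baselines (5) and (6) share the same two-stage recipe: an LMM drafts the text and per-image captions, and a text-to-image model renders each caption. Below is a minimal sketch of that recipe using the OpenAI API for both stages; the `<image>...</image>` caption convention, prompt wording, and helper names are assumptions for illustration, not the exact prompts used in the paper.

```python
# A minimal sketch of the two-stage "LMM + text-to-image" baseline.
# The <image>...</image> caption convention and helper names are assumptions.
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_interleaved(instruction: str) -> list[tuple[str, str]]:
    """Return the output as a list of ("text", str) / ("image", url) segments."""
    # Stage 1: the LMM writes the textual response and a caption for every
    # image it wants to place, marked with <image>...</image> tags.
    plan = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"{instruction}\n"
                "Write the interleaved response. Wherever an image should appear, "
                "insert a short caption wrapped in <image>...</image> tags."
            ),
        }],
    ).choices[0].message.content

    # Stage 2: each caption is handed to the text-to-image model, and the
    # generated images are spliced back in at their original positions.
    segments: list[tuple[str, str]] = []
    for i, piece in enumerate(re.split(r"<image>(.*?)</image>", plan, flags=re.S)):
        if i % 2 == 0:              # even chunks are plain text
            if piece.strip():
                segments.append(("text", piece.strip()))
        else:                       # odd chunks are the captured captions
            image = client.images.generate(model="dall-e-3", prompt=piece.strip(), n=1)
            segments.append(("image", image.data[0].url))
    return segments
```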
Model | Text Quality | Perceptual Quality | Image Coherence | Text-Image Coherence | Helpfulness | Average |
---|---|---|---|---|---|---|
MiniGPT-5 | 1.22 | 2.45 | 1.62 | 2.03 | 1.77 | 1.82 |
GILL | 0.75 | 3.21 | 2.25 | 1.53 | 1.48 | 1.84 |
EMU-2 | 1.26 | 2.28 | 1.89 | 1.34 | 1.64 | 1.68 |
EMU-2 (Gold Text) | 1.56 | 3.35 | 2.89 | 1.43 | 2.10 | 2.27 |
Gemini-1.5 + SDXL | 4.40 | 3.99 | 3.64 | 4.13 | 3.62 | 3.96 |
GPT-4o + DALL·E 3 | 4.37 | 4.36 | 3.51 | 4.55 | 3.88 | 4.13 |
Model | Text Quality | Perceptual Quality | Image Coherence | Text-Image Coherence | Helpfulness | Average |
---|---|---|---|---|---|---|
GILL | 1.35 | 1.89 | 1.72 | 1.43 | 1.19 | 1.52 |
EMU-2 | 1.23 | 1.74 | 1.87 | 1.24 | 1.20 | 1.46 |
Gemini-1.5 + SDXL | 2.59 | 2.36 | 2.13 | 2.27 | 2.08 | 2.28 |
GPT-4o + DALL·E 3 | 2.49 | 2.51 | 2.02 | 2.31 | 2.13 | 2.29 |
If you use InterleavedEval in your research, please cite the following paper.
@article{liu_holistic_2024,
author = {Minqian Liu and
Zhiyang Xu and
Zihao Lin and
Trevor Ashby and
Joy Rimchala and
Jiaxin Zhang and
Lifu Huang},
title = {Holistic Evaluation for Interleaved Text-and-Image Generation},
journal = {CoRR},
volume = {abs/2406.14643},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2406.14643},
doi = {10.48550/ARXIV.2406.14643},
eprinttype = {arXiv},
eprint = {2406.14643},
timestamp = {Tue, 16 Jul 2024 16:17:50 +0200}
}
The InterleavedEval dataset is for research purposes only. Please carefully check the licenses of the original datasets before using InterleavedEval. We provide the URLs to the original datasets and their BibTeX entries on this page. The images and tasks may be taken down at any time upon request from the original dataset owners or the owners of the referenced images. If you would like any tasks or images to be taken down, please contact Minqian Liu and Lifu Huang at minqianliu@vt.edu and lifuh@cs.vt.edu.