Nano-banana demonstrates outstanding capabilities in image generation and world knowledge, with particularly strong performance in identity consistency. In contrast, even OpenAI's GPT-4o and the most advanced open-source model, Qwen-Image, still fall noticeably short on consistency tasks. Here, the consistency task refers to preserving the same individual's facial identity across diverse editing scenarios, such as background changes, action modifications, or style shifts, and it has emerged as a critical capability for modern image generation models.
In our prior work (Echo-4o), we highlighted the advantages of leveraging GPT-4o as a source of high-quality synthetic data: compared with natural image datasets, it is better at generating scarce samples, achieving clean instruction alignment, and composing multi-reference image sets. Building on these insights, we present Nano-consistent-150K, the first Nano-banana-constructed dataset exceeding 150K high-quality samples, uniquely designed to preserve consistent human identity across diverse and complex editing scenarios.
A key feature is its remarkable identity consistency: for a single portrait, more than 35 distinct editing outputs are provided across diverse tasks and instructions. By anchoring on consistent human identities, the dataset enables the construction of interleaved data that seamlessly link multiple editing tasks, instructions, and modalities around the same individual.
We release Nano-consistent-150K openly to support the community's development of image generation and unified models. In addition, we are conducting lightweight fine-tuning on Qwen-Image for editing tasks, and the resulting model weights will be released later this month.
As shown in the figure, the Nano-consistent-150K dataset comprises 159,492 samples in total, including 120K single-image editing instances and 40K multi-reference generation samples, spanning eight distinct sub-tasks.
We began by downloading over 10K publicly available portrait images from Pixabay, followed by automated filtering with GPT-5-mini to ensure the presence of clear facial features. After filtering, approximately 4,000 portraits were retained as the base identity set for consistency tasks. In addition, we manually collected 500 anime-style character illustrations to further supplement the dataset.
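The exact filtering prompt and decision rule are not spelled out here; the snippet below is a minimal sketch of how such a screening pass could look with the OpenAI Python client, assuming a vision-capable gpt-5-mini endpoint and a simple KEEP/DISCARD prompt (the prompt wording, paths, and binary decision rule are illustrative, not the released pipeline).

```python
# Illustrative sketch of a GPT-5-mini face-quality filter.
# Prompt wording, paths, and the KEEP/DISCARD rule are assumptions.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = (
    "Does this portrait show a single person with clear, unoccluded facial "
    "features suitable for identity-preserving editing? Answer KEEP or DISCARD."
)

def encode_image(path: Path) -> str:
    return base64.b64encode(path.read_bytes()).decode("utf-8")

def keep_portrait(path: Path) -> bool:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": FILTER_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
            ],
        }],
    )
    return "KEEP" in response.choices[0].message.content.upper()

kept = [p for p in Path("raw_portraits").glob("*.jpg") if keep_portrait(p)]
print(f"{len(kept)} portraits retained")
```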
Based on these identities, we drafted initial task-specific text instructions and employed Nano-banana to generate synthetic images. Details of each sub-task are provided in the subsequent task descriptions. While Nano-banana exhibits strong generative capabilities, it can still suffer from instruction non-compliance or visible cut-and-paste artifacts. To address this, we applied GPT-5-mini again to filter the outputs.
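Access details for Nano-banana and the exact quality criteria are not covered here, so the loop below only sketches the generate-then-filter structure; `nano_banana_edit` and `passes_quality_check` are hypothetical placeholders rather than real API calls.

```python
# Sketch of the generate-then-filter loop. Both helpers are hypothetical:
# `nano_banana_edit` stands in for whatever endpoint serves Nano-banana, and
# `passes_quality_check` stands in for a GPT-5-mini screening of the output
# (instruction compliance, no visible cut-and-paste artifacts).
from pathlib import Path

def nano_banana_edit(identity_image: Path, instruction: str) -> Path:
    """Hypothetical wrapper around the Nano-banana editing endpoint."""
    raise NotImplementedError

def passes_quality_check(identity_image: Path, edited_image: Path, instruction: str) -> bool:
    """Hypothetical GPT-5-mini check for compliance and pasting artifacts."""
    raise NotImplementedError

def synthesize(identity_image: Path, instruction: str, max_retries: int = 3) -> Path | None:
    for _ in range(max_retries):
        edited = nano_banana_edit(identity_image, instruction)
        if passes_quality_check(identity_image, edited, instruction):
            return edited
    return None  # drop the sample if no attempt passes the filter
```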
To further diversify and refine the instructions, we performed instruction rewriting and optimization conditioned on the input images and generated results. Two instruction formats were designed: (1) training-oriented instructions, which provide detailed descriptions of the image content and the applied edits to facilitate robust model training; and (2) user-oriented instructions, which are concise single-sentence edit commands better aligned with practical user input scenarios.
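As a concrete illustration, a single sample could carry both formats side by side; the field names below are assumptions rather than the released schema.

```python
# Illustrative sample record with both instruction formats.
# Field names and file paths are assumptions; the released schema may differ.
sample = {
    "task": "background",
    "source_image": "identities/0421.jpg",
    "edited_image": "edits/0421_background_012.jpg",
    # Training-oriented: detailed description of the source content and the edit.
    "instruction_training": (
        "The input shows a young woman with shoulder-length black hair standing "
        "in front of a plain wall. Replace the background with an outdoor snowy "
        "mountain scene at dusk while keeping her face, pose, and clothing unchanged."
    ),
    # User-oriented: a concise single-sentence edit command.
    "instruction_user": "Put her on a snowy mountain at dusk.",
}
```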
Action Task: We randomly draft a set of action instructions that require the model to modify the subject's pose while preserving the original identity details and background. This enables the generation of diverse derivative actions. Examples include making a “Yes” gesture, crossing the arms, or introducing new props such as hats or sunglasses to create varied action expressions.
Background Task: We define approximately 250 different scene locations, covering landmarks, natural landscapes, and common indoor and outdoor environments. The task requires replacing the original background with a new setting while preserving the subject's identity. Examples include switching the background to an indoor photo studio, a snowy mountain outdoors, or various scenic landmarks.
Hairstyle Task: We further explore the task of hairstyle and hair color modification on portrait data, leveraging Nano-banana to edit a subject’s hair details. Examples include changing straight bangs to wavy curls or a bun, and altering black hair to blonde, red, or other colors.
Temporal Task: We place portrait data within different historical or temporal contexts, requiring that both clothing styles and background details align with the designated era. For example, a subject may be rendered in a 1905 daily-life setting or situated in the millennial environment of the year 2000.
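For the single-image tasks above (action, background, hairstyle, temporal), instructions can be drafted by sampling from predefined pools and filling simple templates; the pools and templates below are illustrative stand-ins built from the examples in the text, not the actual lists used.

```python
# Sketch of sampling task-specific draft instructions from predefined pools.
# Pools and templates are illustrative examples, not the released lists.
import random

TASK_POOLS = {
    "action": ["make a 'Yes' gesture", "cross the arms", "put on sunglasses"],
    "background": ["an indoor photo studio", "a snowy mountain outdoors", "a scenic landmark"],
    "hairstyle": ["wavy curls", "a bun", "blonde hair", "red hair"],
    "temporal": ["a 1905 daily-life setting", "the millennial environment of the year 2000"],
}

TEMPLATES = {
    "action": "Keep the person's identity and background unchanged, but have them {option}.",
    "background": "Keep the person unchanged and replace the background with {option}.",
    "hairstyle": "Keep the person's face unchanged and change the hair to {option}.",
    "temporal": "Place the person in {option}, adapting clothing and background to that era.",
}

def draft_instruction(task: str) -> str:
    option = random.choice(TASK_POOLS[task])
    return TEMPLATES[task].format(option=option)

print(draft_instruction("background"))
```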
Human Interaction Task: We randomly select 2–4 images from the base identity set and use GPT to generate interaction-oriented instructions. Rather than merely placing individuals side by side, the task emphasizes interpersonal actions and interactions. Examples include two people drinking coffee and having a conversation, or a group of four forming a band and performing together. These instructions are then used with Nano-banana to synthesize images that capture rich interactive semantics.
OOTD Task: We collect clothing items from online sources and randomly combine 2–6 garments with a portrait for outfit display. The generated samples are required to preserve facial identity consistency, while incorporating pose variations to better highlight the details and presentation of the clothing.
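For the two multi-reference tasks, the sampling counts follow the text (2–4 identities for interactions, one portrait plus 2–6 garments for OOTD); the helpers below are an illustrative sketch of how reference sets could be drawn before prompting GPT for an instruction and Nano-banana for the image.

```python
# Sketch of assembling multi-reference inputs for the Human Interaction and
# OOTD tasks. Counts follow the text; everything else is illustrative.
import random

def sample_interaction_refs(identity_pool: list[str]) -> list[str]:
    # 2-4 identities, later combined with a GPT-written interaction instruction.
    return random.sample(identity_pool, k=random.randint(2, 4))

def sample_ootd_refs(identity_pool: list[str], garment_pool: list[str]) -> list[str]:
    # One portrait plus 2-6 garments for outfit display.
    portrait = random.choice(identity_pool)
    garments = random.sample(garment_pool, k=random.randint(2, 6))
    return [portrait, *garments]
```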
We will release the Nano-consistent-150K dataset to foster future research on image generation and unified models. In addition, we will provide the Qwen-Image LoRA weights fine-tuned on this dataset in the near future, offering further support for advancing identity consistency and complex editing capabilities.
@article{ye2025echo4o,
  title   = {Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation},
  author  = {Junyan Ye and Dongzhi Jiang and Zihao Wang and Leqi Zhu and Zhenghao Hu and Zilong Huang and Jun He and Zhiyuan Yan and Jinghua Yu and Hongsheng Li and Conghui He and Weijia Li},
  journal = {arXiv preprint arXiv:2508.09987},
  year    = {2025},
}