Apple has released 'MGIE', an image-editing AI model that edits photos from plain-language instructions, and a demo is available, so I tried it out

Apple, in collaboration with the University of California, Santa Barbara, has released MGIE, an AI model that can edit photos from simple language instructions.
[2309.17102] Guiding Instruction-based Image Editing via Multimodal Large Language Models
https://arxiv.org/abs/2309.17102
apple/ml-mgie
https://github.com/apple/ml-mgie
MGIE, short for MLLM-Guided Image Editing, can perform a variety of image editing tasks, such as changing the shape of objects in an image or adjusting brightness. MGIE is built on a multimodal large language model (MLLM) that handles both images and natural language, so users only need to give instructions in plain language. From the user's input, MGIE generates an 'expressive instruction', and the model that actually performs the editing uses that instruction to carry out the appropriate edit.
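The two-stage flow described above can be sketched in Python. Note that `derive_expressive_instruction` and `edit_image` are hypothetical stand-ins for MGIE's MLLM and its editing model; the toy expansion table below only mimics, with fixed strings, the text the real model would generate.

```python
def derive_expressive_instruction(user_instruction: str, image_caption: str) -> str:
    """Stand-in for the MLLM step: expand a terse user instruction into
    an explicit editing description grounded in the image content."""
    # Toy expansion table; in MGIE, the MLLM generates this text itself.
    expansions = {
        "make it more healthy": "add vegetable toppings such as tomatoes and herbs",
    }
    detail = expansions.get(user_instruction, user_instruction)
    return f"{image_caption}: {detail}"


def edit_image(image_path: str, expressive_instruction: str) -> dict:
    """Stand-in for the editing model: it consumes the expressive
    instruction rather than the user's vague one."""
    return {"image": image_path, "applied": expressive_instruction}


# Vague user input -> expressive instruction -> edit
instruction = derive_expressive_instruction("make it more healthy", "a pizza")
result = edit_image("pizza.png", instruction)
print(result["applied"])  # "a pizza: add vegetable toppings such as tomatoes and herbs"
```

The point of the design is that the editing model never sees the vague instruction directly; it always receives the expanded, image-grounded version.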

Examples of edits produced by MGIE are shown below. In each pair, the original image is on the left and MGIE's output is on the right. In the pizza example at the top left, the vague instruction 'make it more healthy' was expanded into the detailed instruction 'The pizza includes vegetable toppings, such as tomatoes and herbs,' and vegetable toppings were added accordingly. In the example at the top right, the woman in the background was removed as instructed by 'remove the woman in the background.' MGIE can also brighten an image or change what is displayed on a PC screen in a photo.

Here's a comparison with the earlier methods 'InsPix2Pix' and 'LGIE.' The 'Input Image' on the left is the input, and the 'Ground Truth' on the right shows the correct result. MGIE clearly produces the most accurate edits, adding lightning to the sky and erasing the Christmas tree.

The MGIE model is distributed as delta weights on top of LLaVA under the CC-BY-NC license, which prohibits commercial use. To use the MGIE model, you must therefore also comply with the LLaVA license. And since LLaVA was trained using CLIP, LLaMA, Vicuna, and GPT-4, you must comply with their terms as well.
The link below provides a demo of MGIE, where you can actually try out image editing using MGIE.
MLLM-guided Image Editing (MGIE) - a Hugging Face Space by tsujuifu
https://huggingface.co/spaces/tsujuifu/ml-mgie

This time I'll try editing a photo of a roll cake.

Drag and drop the image into the 'Input Image' field, enter 'use strawberry as topping' in the Instructions field, and click 'Submit.'

If there are a lot of people using the service, you will have to wait in line. At the time of writing, there were about 50 people in line, and the estimated waiting time was 700 seconds.

After waiting for a while, the output appears as in the image below. The detailed instruction generated was: 'Place a round of cake on a plate with strawberries on top. Place the cake in the center of the plate, with the strawberries spread out around it.'


Apple has stated its commitment to generative AI, and during its Q1 2024 earnings call on February 1, the company emphasized that it is investing a huge amount of time and effort in AI.
Apple CEO Tim Cook says generative AI features will be detailed in the second half of 2024 during the company's Q1 2024 earnings call - GIGAZINE

Related Posts:
in AI, Software, Review, Web Application, Posted by log1d_ts