Today we will take a closer look at the image-related capabilities. Microsoft recently published a paper titled "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", which explores the capabilities of large multimodal models in the vision domain.
Paper Link: https://arxiv.org/abs/2309.17421
GPT-4V supports three major input modes:
- Plain text
- A single image-text pair (an image, optionally with a caption or question), covering tasks such as:
  - Image recognition
  - Object localization
  - Image captioning
  - Visual question answering
  - Visual dialogue
  - Dense captioning
- Interleaved text and images, which (see the sketch after this list):
  - Applies to a wide range of application scenarios
  - Can process multiple image inputs at once and extract the queried information
  - Effectively matches information between the images and the text
  - Supports in-context few-shot learning and other advanced prompting techniques
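To make the interleaved mode concrete, here is a minimal sketch of sending mixed text and images in a single request, assuming the OpenAI Python SDK. The model name `gpt-4-vision-preview` and the receipt file names are assumptions for illustration; the receipt-tax question mirrors one of the paper's interleaved examples.

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


# Interleaved input: text -> image -> text -> image -> final question,
# all inside one user message.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is the first receipt:"},
            {"type": "image_url", "image_url": {"url": encode_image("receipt1.jpg")}},
            {"type": "text", "text": "And here is the second receipt:"},
            {"type": "image_url", "image_url": {"url": encode_image("receipt2.jpg")}},
            {"type": "text", "text": "How much tax did I pay in total?"},
        ],
    }],
)
print(response.choices[0].message.content)
```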
Tips for prompting GPT-4V:
- Textual instructions
| Instruction | Response | Note |
|---|---|---|
| Count the number of apples in the image. | "An apple" | Incorrect count |
| Count the apples in the image row by row. | "First row: 4 apples ..." | Correct final answer, but the intermediate steps are wrong |
| You are an expert at counting things in images. Count the apples in the image row by row to make sure you have the right answer. | "First row: 4 apples ..." | Clear instruction, correct response |
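To make this concrete, here is a minimal sketch of sending the refined counting instruction together with an image. It reuses the `client` and `encode_image` helper from the earlier sketch; the model name and the file name `apples.jpg` are assumptions, not from the paper.

```python
# A more explicit instruction: assign a role and force row-by-row counting.
prompt = (
    "You are an expert at counting things in images. "
    "Count the apples in the image row by row, "
    "then add the rows together to make sure the total is correct."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": encode_image("apples.jpg")}},
        ],
    }],
)
print(response.choices[0].message.content)
```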
- To point GPT-4V at a specific object in an image, the paper compares six ways of marking the target (a rectangle-drawing sketch follows this list):
  - Coordinates
  - Cropping
  - Arrow
  - Rectangle
  - Oval
  - Hand-drawn outline
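Among these, the rectangle is the easiest to reproduce programmatically. Below is a minimal sketch, using Pillow, of drawing a rectangle on an image before sending it to the model; the file name and box coordinates are hypothetical.

```python
import base64
import io

from PIL import Image, ImageDraw  # pip install pillow


def mark_with_rectangle(path: str, box: tuple[int, int, int, int]) -> str:
    """Draw a red rectangle around the target object and return the
    marked image as a base64 data URL ready to send as an image_url."""
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline="red", width=4)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buffer.getvalue()).decode()


# Hypothetical file name and box coordinates (left, top, right, bottom).
marked_url = mark_with_rectangle("street.jpg", (120, 80, 340, 260))
# `marked_url` can then be sent as an image_url together with a prompt
# such as "Describe the object inside the red rectangle."
```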
- The generality and flexibility demonstrated by GPT-4V enable it to understand multimodal instructions in a nearly human-like manner, showcasing unprecedented adaptability.  
- Few-shot examples
With a zero-shot instruction, the model's answer may be incorrect. A one-shot instruction, which adds a single worked example, can still produce an incorrect result. But with a few-shot prompt containing several worked examples, the answer becomes completely accurate.
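Here is a minimal sketch of building such a few-shot prompt as interleaved messages, reusing `client` and `encode_image` from the earlier sketches. The speedometer-reading task echoes the paper's few-shot demonstration, but the file names and example answers below are made up for illustration.

```python
def few_shot_messages(task: str, examples, query_image: str) -> list:
    """Build a few-shot message list: each worked example is a user image
    followed by the assistant's correct answer; the query image comes last."""
    messages = [{"role": "system", "content": task}]
    for image_path, answer in examples:
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
        ]})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": encode_image(query_image)}},
    ]})
    return messages


# Hypothetical file names and answers for the worked examples.
messages = few_shot_messages(
    task="Read the speed shown on the speedometer in each image.",
    examples=[
        ("speedo_a.jpg", "The needle points to about 45 mph."),
        ("speedo_b.jpg", "The needle points to about 80 mph."),
    ],
    query_image="speedo_query.jpg",
)
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=messages,
)
print(response.choices[0].message.content)
```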
