Today we will take a closer look at the image-related capabilities. Microsoft recently published a paper titled "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", which explores the capabilities of large multimodal models in the vision domain.
Paper Link: https://arxiv.org/abs/2309.17421
GPT-4V supports three major input modes:
- Plain text
- A single image-text pair (an image, optionally with a caption or question), covering tasks such as:
  - Image recognition
  - Object localization
  - Image captioning
  - Visual question answering
  - Visual dialogue
  - Dense captioning
- Interleaved text and images, which (see the sketch after this list):
  - Applies to a wide range of application scenarios
  - Can process multiple image inputs at once and extract the queried information
  - Effectively matches information between the images and the text
  - Supports in-context few-shot learning and other advanced prompting techniques
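To make the interleaved mode concrete, here is a minimal sketch of sending mixed text and images in a single request, assuming the OpenAI Python SDK. The model name `gpt-4-vision-preview` and the receipt file names are assumptions for illustration; the receipt-tax question mirrors one of the paper's interleaved examples.

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


# Interleaved input: text -> image -> text -> image -> final question,
# all inside one user message.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is the first receipt:"},
            {"type": "image_url", "image_url": {"url": encode_image("receipt1.jpg")}},
            {"type": "text", "text": "And here is the second receipt:"},
            {"type": "image_url", "image_url": {"url": encode_image("receipt2.jpg")}},
            {"type": "text", "text": "How much tax did I pay in total?"},
        ],
    }],
)
print(response.choices[0].message.content)
```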
Tips for prompting GPT-4V:
- Textual instructions
| Instruction | Response | Note |
|---|---|---|
| Count the number of apples in the image. | "An apple" | Incorrect count |
| Count the apples in the image row by row. | "First row: 4 apples ..." | Correct final answer, but the intermediate steps are wrong |
| You are an expert at counting things in images. Count the apples in the image row by row to make sure you have the right answer. | "First row: 4 apples ..." | Clear instruction, correct response |
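To make this concrete, here is a minimal sketch of sending the refined counting instruction together with an image. It reuses the `client` and `encode_image` helper from the earlier sketch; the model name and the file name `apples.jpg` are assumptions, not from the paper.

```python
# A more explicit instruction: assign a role and force row-by-row counting.
prompt = (
    "You are an expert at counting things in images. "
    "Count the apples in the image row by row, "
    "then add the rows together to make sure the total is correct."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": encode_image("apples.jpg")}},
        ],
    }],
)
print(response.choices[0].message.content)
```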
- To point GPT-4V at a specific object in an image, the paper compares six ways of marking the target (a rectangle-drawing sketch follows this list):
  - Coordinates
  - Cropping
  - Arrow
  - Rectangle
  - Oval
  - Hand-drawn outline
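Among these, the rectangle is the easiest to reproduce programmatically. Below is a minimal sketch, using Pillow, of drawing a rectangle on an image before sending it to the model; the file name and box coordinates are hypothetical.

```python
import base64
import io

from PIL import Image, ImageDraw  # pip install pillow


def mark_with_rectangle(path: str, box: tuple[int, int, int, int]) -> str:
    """Draw a red rectangle around the target object and return the
    marked image as a base64 data URL ready to send as an image_url."""
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline="red", width=4)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buffer.getvalue()).decode()


# Hypothetical file name and box coordinates (left, top, right, bottom).
marked_url = mark_with_rectangle("street.jpg", (120, 80, 340, 260))
# `marked_url` can then be sent as an image_url together with a prompt
# such as "Describe the object inside the red rectangle."
```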
- The generality and flexibility demonstrated by GPT-4V enable it to understand multimodal instructions in a nearly human-like manner, showcasing unprecedented adaptability.  
- Few-shot examples
With a zero-shot instruction, the model's answer may be incorrect. A one-shot instruction, which adds a single worked example, can still produce an incorrect result. But with a few-shot prompt containing several worked examples, the answer becomes completely accurate.
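Here is a minimal sketch of building such a few-shot prompt as interleaved messages, reusing `client` and `encode_image` from the earlier sketches. The speedometer-reading task echoes the paper's few-shot demonstration, but the file names and example answers below are made up for illustration.

```python
def few_shot_messages(task: str, examples, query_image: str) -> list:
    """Build a few-shot message list: each worked example is a user image
    followed by the assistant's correct answer; the query image comes last."""
    messages = [{"role": "system", "content": task}]
    for image_path, answer in examples:
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
        ]})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": encode_image(query_image)}},
    ]})
    return messages


# Hypothetical file names and answers for the worked examples.
messages = few_shot_messages(
    task="Read the speed shown on the speedometer in each image.",
    examples=[
        ("speedo_a.jpg", "The needle points to about 45 mph."),
        ("speedo_b.jpg", "The needle points to about 80 mph."),
    ],
    query_image="speedo_query.jpg",
)
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=messages,
)
print(response.choices[0].message.content)
```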
