This week, Microsoft launched the Phi-3 series of open models, which are currently the most capable and cost-effective small language models available.
Source of Inspiration
Last year, Microsoft researcher Ronen Eldan was puzzling over machine learning problems by day and reading bedtime stories to his daughter at night when he began to wonder: "How did she learn that word? How does she know how to connect these words?" The question got the machine learning expert thinking about how much an AI model could learn using only words a four-year-old could understand. That idea eventually led to an innovative training method and a class of more capable small language models, making AI accessible to more people.
Application Scenarios
Phi-3-mini, with 3.8 billion parameters, has been released and outperforms models twice its size. Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) will soon be available in the Azure AI model catalog and other model hubs.
Small language models are designed to perform well on simpler tasks, which makes them easier for organizations with limited resources to adopt and easier to fine-tune for specific needs. They suit organizations that want to build applications running on local devices rather than in the cloud, and tasks that do not require extensive reasoning or that need quick responses. By keeping data on the device, users can "minimize latency and maximize privacy."
Training Data
The team set out to find extremely high-quality training data. They built a standalone vocabulary of 3,000 words containing roughly equal numbers of nouns, verbs, and adjectives, then asked a large language model to write a children's story using one noun, one verb, and one adjective from the list. The prompt was repeated millions of times over several days, generating millions of tiny children's stories. Microsoft named the resulting dataset "TinyStories" and used it to train a very small language model with about 10 million parameters. To the team's surprise, when prompted to create their own stories, small language models trained on TinyStories produced grammatically correct, fluent narratives.
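A minimal sketch of such a generation loop, with tiny placeholder word lists and a stubbed-out model call (the actual vocabulary, prompt wording, and model used by Microsoft are not detailed here), might look like this:

    import random

    NOUNS = ["dog", "ball", "tree"]          # placeholder; the real vocabulary has ~3,000 words in total
    VERBS = ["jump", "find", "share"]        # placeholder
    ADJECTIVES = ["happy", "small", "red"]   # placeholder

    def make_prompt() -> str:
        noun, verb, adjective = (random.choice(NOUNS), random.choice(VERBS), random.choice(ADJECTIVES))
        return (
            f"Write a short story a four-year-old could understand that uses the noun '{noun}', "
            f"the verb '{verb}', and the adjective '{adjective}'."
        )

    def generate_story(prompt: str) -> str:
        return ""  # placeholder: in practice this calls a large language model API

    # Repeated millions of times over several days in the actual pipeline.
    stories = [generate_story(make_prompt()) for _ in range(1000)]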
Next, a larger research team used carefully selected, publicly available data, filtered for educational value and content quality, to train Phi-1. After collecting the initial public data, the team used prompts and seed formulas inspired by TinyStories, but made them more elaborate so they would capture a wider range of material. To keep quality high, the team repeatedly filtered the generated content before feeding it back into the LLM for further synthesis. After weeks of this effort, the team had accumulated a corpus large enough to train a more capable SLM. The final dataset was named "CodeTextbook".
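A rough sketch of this filter-and-resynthesize loop, with placeholder scoring and synthesis functions standing in for the LLM-based components, could look like this:

    def quality_score(document: str) -> float:
        return 1.0  # placeholder: rate educational value and content quality, e.g. with an LLM or classifier

    def synthesize_more(seeds: list[str]) -> list[str]:
        return []   # placeholder: prompt an LLM with TinyStories-inspired seed formulas to produce new text

    corpus = ["<filtered, publicly available documents>"]  # placeholder starting data

    for _ in range(5):  # repeated over several weeks in the actual pipeline
        corpus = [doc for doc in corpus if quality_score(doc) > 0.8]  # keep only high-quality content
        corpus += synthesize_more(corpus)                             # feed it back for further synthesis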
Researchers further improved the dataset by selecting data the way a teacher might break complex concepts down for students. "Because it is reading from textbook-like material, from high-quality documents that explain things very clearly," Bubeck said, "you make the task of reading and understanding the material much easier for the language model."
Evaluation
In benchmarks that evaluate language, coding, and math abilities, the Phi-3 models outperform models of the same size and even larger models.
How to Use
Phi-3-mini, a 3.8-billion-parameter language model, is now available on Microsoft Azure AI Studio, Hugging Face, and Ollama.
Azure: https://aka.ms/phi3-azure-ai
Hugging Face: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
Ollama: https://ollama.com/library/phi3
Phi-3-mini comes in two context-length variants, 4K and 128K tokens. It is the first model in its class to support a context window of up to 128K tokens with minimal impact on quality. The model is instruction tuned, meaning it has been trained to follow the kinds of instructions people naturally give, so it works out of the box.
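As a minimal sketch (not Microsoft's official sample code), the instruct variant can be queried with the Hugging Face transformers library; the model id below is assumed to match the 4K variant in the collection linked above, and loading may require trust_remote_code:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"  # swap in the 128K variant if a longer context is needed
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    # Because the model is instruction tuned, format the input with its chat template.
    messages = [{"role": "user", "content": "Explain in one sentence what a small language model is."}]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

    output_ids = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))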
It is available on Azure AI with the deploy-evaluate-fine-tune toolchain, and on Ollama so developers can run it locally on their laptops.
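For local use, a short sketch with the ollama Python client, assuming Ollama is installed and the phi3 model from the library linked above has already been pulled:

    import ollama

    # Sends one chat turn to a locally running Phi-3-mini instance.
    response = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": "Give me three uses for a small language model."}],
    )
    print(response["message"]["content"])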
The model has been optimized for ONNX Runtime, supports Windows DirectML, and has cross-platform support, enabling it to run on graphics processing units (GPUs), central processing units (CPUs), and even mobile hardware.
It is also offered as an NVIDIA NIM microservice with a standard API that can be deployed anywhere, optimized for NVIDIA GPUs.
Weaknesses
Measured against the capabilities of large language models (LLMs), Phi-3-mini exhibits similar language understanding and reasoning abilities, but its size still imposes fundamental limits on certain tasks.
The model simply does not have the capacity to store much "factual knowledge," as its lower TriviaQA score shows. We believe, however, that this weakness can be addressed by augmenting the model with a search engine; one demonstration uses Hugging Face's default Chat-UI with phi-3-mini. Another capacity-related weakness is that Phi-3 is largely limited to English. Exploring the multilingual capabilities of small language models is an important next step, and adding more multilingual data has already produced promising initial results with phi-3-small.
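To make the idea concrete, here is a rough, hypothetical sketch of search-engine augmentation (web_search is a placeholder rather than a real API, and this is not the Chat-UI setup mentioned above): retrieve a few snippets, prepend them to the question, and pass the combined prompt to Phi-3-mini as in the usage example earlier.

    def web_search(query: str, k: int = 3) -> list[str]:
        # Placeholder: call any real search API and return the top-k text snippets.
        return ["<snippet 1>", "<snippet 2>", "<snippet 3>"][:k]

    def build_augmented_prompt(question: str) -> str:
        snippets = "\n".join(f"- {s}" for s in web_search(question))
        return (
            "Answer the question using only the search results below.\n\n"
            f"Search results:\n{snippets}\n\n"
            f"Question: {question}"
        )

    messages = [{"role": "user", "content": build_augmented_prompt("Who won the 2011 Cricket World Cup?")}]
    # `messages` can now be passed to tokenizer.apply_chat_template(...) as in the usage example above.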