Phi Turns One: Small AI Models Achieve Remarkable Breakthroughs

Microsoft has once again propelled the AI discussion forward, introducing three new models: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning.

Ushering in the Next Generation of AI

Twelve months ago, Microsoft made small language models (SLMs) more accessible by launching Phi-3 through Azure AI Foundry. This marked a shift towards creating more resource-efficient AI models and expanding customer choice for streamlined AI tools. 

Today marks the debut of Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—an innovation that establishes a new benchmark for small language models and demonstrates just how capable compact AI has become. 

Reasoning models: advancing intelligent AI applications

Reasoning models are specifically trained for inference-time scaling—enabling them to break down complex challenges into manageable steps and apply logical self-assessment. These models excel in fields like mathematics, and are fast becoming essential for advanced AI applications that require many-layered reasoning. Traditionally, this calibre of ability was reserved for large, resource-heavy models. Phi-reasoning models, however, inaugurate a new breed of compact language models that successfully marry modest resource usage with exceptional analytical strength. By applying distillation, reinforcement learning, and premium datasets, these models are small enough for low-latency deployment whilst remaining impressively capable—making it possible to carry out high-level reasoning on much leaner hardware.

Phi-4-reasoning and Phi-4-reasoning-plus 

Phi-4-reasoning is a 14-billion parameter open-weight reasoning model that challenges much larger alternatives on sophisticated problem solving. It is built by fine-tuning Phi-4 on carefully curated reasoning demonstrations, predominantly generated with OpenAI o3-mini, so that it produces thorough, logical solution paths when given more compute at inference time. This demonstrates how meticulous data curation and high-quality synthetic datasets allow smaller models to compete with models of far greater size.

Phi-4-reasoning-plus builds further by applying additional reinforcement learning and utilising roughly 1.5 times the inference-time tokens of Phi-4-reasoning, which improves accuracy on tasks that demand it.

In spite of their compact architecture, both models consistently outperform OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, including advanced mathematical reasoning and PhD-level science questions. Notably, they even surpass the full DeepSeek-R1 model (671B parameters) on AIME 2025, the qualifier for the USA Math Olympiad. Both models are available via Azure AI Foundry and Hugging Face (Phi-4-reasoning and Phi-4-reasoning-plus).
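To make the "more compute at inference" idea concrete, here is a minimal sketch of querying the Hugging Face checkpoint with the transformers library. The model id follows the Hugging Face listing mentioned above; the prompt, dtype, and generation settings are illustrative assumptions, so consult the model card for the recommended system prompt and sampling parameters before relying on it.

```python
# Minimal sketch: querying Phi-4-reasoning with Hugging Face transformers.
# Assumes the "microsoft/Phi-4-reasoning" checkpoint and enough GPU memory
# for a 14B model; generation settings here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "How many primes are there below 50? Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models trade inference-time tokens for accuracy, so allow a
# generous budget for the intermediate reasoning trace.
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note the deliberately large `max_new_tokens` budget: a reasoning model spends most of its output on the intermediate trace, and truncating it too early undermines the inference-time scaling the model was trained for.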

Figure 1. Comparison of Phi-4-reasoning performance against various benchmarks involving logical and scientific reasoning. The chart highlights improvements from reasoning-specific post-training, showing that both Phi-4-reasoning variants outstrip the Phi-4 base, surpass DeepSeek-R1-Distill-Llama-70B (five times their size), and are competitive with models as large as DeepSeek-R1.
Figure 2. Model accuracy measured over typical AI benchmarks: long-input Q&A (FlenQA), instruction compliance (IFEval), code generation (HumanEvalPlus), broader knowledge and language understanding (MMLUPro), safety analysis (ToxiGen), along with additional benchmarks such as ArenaHard and PhiBench.

Phi-4-reasoning represents a substantial step forward from the original Phi-4, outperforming much larger competitors in tasks spanning maths, programming, algorithmic problem solving, and system planning. For an in-depth exploration of these advances, refer to the full technical paper which presents detailed quantitative results across a vast array of benchmarks.

Phi-4-mini-reasoning

Phi-4-mini-reasoning is tailored for anyone in need of a concise model that excels at logical problem solving. This transformer language model delivers precise, step-by-step maths solutions even when running under latency or resource restrictions. Trained using synthetic datasets produced by DeepSeek-R1, it harmonises efficiency with sophisticated reasoning. It is especially well-suited for educational software, embedded tutoring functionalities, and compact mobile deployments, having been trained on over a million maths questions ranging from secondary school up to doctoral level. You can experiment with Phi-4-mini-reasoning via Azure AI Foundry or Hugging Face.
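Because the model is only 3.8B parameters, it also fits the simpler transformers pipeline API on a single consumer GPU. The sketch below is an assumption-laden example: the "microsoft/Phi-4-mini-reasoning" id follows the Hugging Face listing above, and the prompt and token budget are placeholders.

```python
# Minimal sketch: step-by-step maths with Phi-4-mini-reasoning via the
# transformers pipeline API. At 3.8B parameters the model should fit on a
# single consumer GPU; prompt and settings are illustrative placeholders.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-reasoning",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve 3x + 11 = 14. Show each step."}
]
result = generator(messages, max_new_tokens=1024, return_full_text=False)
print(result[0]["generated_text"])
```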

Figure 3. The chart compares several models on popular maths benchmarks that call for long-form generation. Phi-4-mini-reasoning consistently outperforms its base model and larger competitors such as OpenThinker-7B, Llama-3.2-3B-instruct, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, and Bespoke-Stratos-7B, and it matches or exceeds OpenAI o1-mini, particularly on Math-500 and GPQA Diamond. Impressively, at only 3.8 billion parameters, Phi-4-mini-reasoning surpasses rivals more than double its size.

How to Get Started with Reasoning Models in Azure AI Foundry

  • Step 1: Visit the Azure AI Foundry portal and sign in with your credentials.
  • Step 2: Search for the Phi-4-reasoning, Phi-4-reasoning-plus, or Phi-4-mini-reasoning models—choose the one that best matches your requirements.
  • Step 3: Create a project to experiment with the model. Adjust configurations to suit specific business or research needs.
  • Step 4: Deploy your project and monitor results. Use the detailed output and reasoning trace to help troubleshoot and fine-tune your AI solution; a minimal inference sketch follows this list.
  • Step 5: For issues related to model performance or access, consult the Azure AI documentation or community forums for troubleshooting guides.
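Once a deployment exists (Step 4), you can call it programmatically. Here is a minimal sketch using the azure-ai-inference Python SDK; the endpoint URL, API key, and deployment name are placeholders for the values shown in your own Azure AI Foundry project.

```python
# Minimal sketch: calling a deployed Phi-4-reasoning endpoint with the
# azure-ai-inference SDK (pip install azure-ai-inference). Endpoint, key,
# and deployment name below are placeholders for your project's values.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),                  # placeholder
)

response = client.complete(
    model="Phi-4-reasoning",  # your deployment name may differ
    messages=[
        SystemMessage(content="You are a careful step-by-step reasoner."),
        UserMessage(content="A train travels 120 km in 90 minutes. What is its average speed in km/h?"),
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```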

If you are seeking in-depth technical details about Phi-4-mini-reasoning, you can access the complete technical report, which offers comprehensive quantitative analyses.

Phi has seen notable advancements over the past year, continually improving the balance between quality and model size. The expanding Phi family now includes features designed to meet a wide variety of requirements. Thanks to this evolution, you can run these models locally on both CPUs and GPUs across the spectrum of Windows 11 computers.

As Windows paves the way for a new class of PC, Phi models are now embedded within Copilot+ PCs via the NPU-optimised Phi Silica edition. Specifically engineered to be memory-resident, this OS-managed variant delivers very fast time to first token and power-efficient inference, and it runs seamlessly alongside other applications on your device.

Phi Silica actively supports core Windows experiences such as Click to Do, enabling intelligent text suggestions for anything visible on your screen. Moreover, it provides developer APIs for straightforward integration into software—a capability already leveraged by productivity tools like Outlook for offline Copilot summaries. These compact-yet-powerful models are fully optimised and widely deployed across the Windows ecosystem. Both Phi-4-reasoning and Phi-4-mini-reasoning models feature streamlined low-bit variants for Phi Silica and will soon support on-device execution via Copilot+ PC NPUs.

Update (15 May 2025): We are excited to share that Phi-4-reasoning and Phi-4-mini-reasoning, optimised via ONNX, are now available on Snapdragon-powered Copilot+ PCs. Running these models directly on the Neural Processing Unit (NPU) significantly reduces power consumption during inference, enabling the Phi-4 family to deliver strong reasoning accuracy while operating more efficiently. To get started, install the AI Toolkit extension for Visual Studio Code.
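For local ONNX inference outside the AI Toolkit UI, the onnxruntime-genai Python package offers a simple generation loop. This is a sketch under stated assumptions: the model folder path is a placeholder for wherever you downloaded the ONNX build, and the API below follows recent onnxruntime-genai releases, which have shifted between versions.

```python
# Minimal sketch: running an ONNX-optimised Phi-4-mini-reasoning build with
# onnxruntime-genai (pip install onnxruntime-genai). The folder path is a
# placeholder; the API follows recent releases and may differ in older ones.
import onnxruntime_genai as og

model = og.Model("path/to/phi-4-mini-reasoning-onnx")  # placeholder path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Solve 2x + 5 = 17, step by step."))

# Stream tokens to the console as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```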

Safety and Microsoft’s Approach to Responsible AI

At Microsoft, the commitment to responsible AI practices is central to developing and releasing AI solutions, including the Phi models. Every Phi model is built following Microsoft's guiding principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness.

The Phi model lineup undergoes a thorough safety post-training regime that combines Supervised Fine-Tuning (SFT), Direct Preference Optimisation (DPO), and Reinforcement Learning from Human Feedback (RLHF). This draws on an assortment of publicly available datasets that prioritise helpfulness and safety, along with targeted question-and-answer sets for increased protection. While designed for broad effectiveness, limitations may still exist. For a complete overview of how these considerations are managed, review the accompanying model cards, which provide detailed guidance on responsible practices and approved usage.

Find Out More:

How to Troubleshoot Phi Model Deployment on Copilot+ and Windows 11

  • How to Run Phi Models Locally: First, ensure your device has the recommended CPU, GPU or NPU. Download the preferred Phi model and follow platform-specific instructions for your hardware configuration.
  • How to Optimise Performance: Use optimised builds like the ONNX versions for NPU compatibility. Check firmware and driver updates for best efficiency on Snapdragon and other Copilot+ machines.
  • How to Integrate Phi into Apps: Utilise the developer APIs provided. Example resources can be found on Microsoft’s official documentation for Phi Silica integration in C# and Python.
  • How to Interpret Benchmark Results: Compare both model size and accuracy for your use case. If long-form content creation is needed, consider models such as Phi-4-mini-reasoning, which offer high performance despite smaller parameter sizes.
  • How to Address Safety and Ethical Concerns: Reference Microsoft’s responsible AI guidelines to ensure your implementation is both ethical and GDPR-compliant. Regularly review model cards accompanying each release for updates.
  • How to Troubleshoot Common Issues:
    • If your model is slow to respond, confirm the NPU is active and drivers are current.
    • For integration problems, validate API keys and permissions as described in the developer portal.
    • If an application fails to summarise or comprehend text correctly, check for recent model updates and retrain if necessary.