Now in Foundry: IBM Granite 4.1, NVIDIA Nemotron Nano Omni, and Qwen3.6-35B-A3B
This week, Microsoft Foundry adds three model releases that extend its capabilities across specialised speech, vision, general-purpose coding, and long-context analysis. The **IBM Granite 4.1** family contributes ten models: six large language models (LLMs) at 3B, 8B, and 30B sizes, available in both full-precision and FP8 formats, plus a dedicated safety model, a vision-language model for document extraction, and a multilingual speech recognition model. **NVIDIA’s Nemotron-3-Nano-Omni-30B-A3B-Reasoning** brings multimodal functionality, accepting video, audio, image, and text through a 31B Mamba2-Transformer hybrid Mixture-of-Experts (MoE) architecture that activates only about 3B parameters per forward pass; the FP8 variant is spotlighted here. **Qwen3.6-35B-A3B** targets advanced agentic coding among open models, preserving reasoning context across conversation turns and offering a context window extensible to 1 million tokens.
IBM Granite 4.1: Model Specifications
- Parameters / size: 30B (the flagship of the Granite 4.1 family)
- Context length: 131,072 tokens
- Primary task: Text generation (multilingual instruction following, retrieval-augmented generation, tool calling, coding, summarisation)
What’s Compelling About This Release
- The Granite 4.1 range brings ten models to Microsoft Foundry. The LLM line-up comprises granite-4.1-3b-instruct, granite-4.1-8b-instruct, and granite-4.1-30b-instruct, each with an FP8 variant, joined by granite-guardian-4.1-8b for safety screening, granite-vision-4.1-4b for document and chart understanding, and granite-speech-4.1-2b for multilingual speech transcription. The full stack lets teams mix and match model sizes and modalities from a single provider.
- Notable instruction-following and reasoning performance at the 30B scale: granite-4.1-30b-instruct scores 80.16 on MMLU, 64.09 on MMLU-Pro, 83.74 on Big-Bench Hard, 77.80 on AGI Eval, 45.76 on GPQA (a graduate-level science reasoning benchmark), and an average of 89.65 on IFEval (instruction following). These results reflect supervised fine-tuning and reinforcement learning aimed specifically at instruction compliance, tool-calling accuracy, and long-context retention. (See the model cards for full benchmark data.)
- Improved tool calling and support for twelve languages: Granite 4.1 models handle structured function calls and cover twelve languages—Arabic, Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish—across dialogue, extraction, and summarisation tasks. A tool-calling request sketch follows this list.
- Safety and multimodal capabilities within the same family: granite-guardian-4.1-8b is a safety classifier designed to flag harmful content and prompt injections, granite-vision-4.1-4b is a vision-language model tailored for document extraction from PDFs, charts, and tables, and granite-speech-4.1-2b is a multilingual automatic speech recognition (ASR) model. Keeping safety, document parsing, and audio handling in one model family simplifies integration.
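To make the tool-calling pattern concrete, here is a minimal sketch, assuming the Foundry deployment exposes an OpenAI-compatible chat completions API; the endpoint environment variables and the `get_exchange_rate` function are hypothetical placeholders, not part of the release.

```python
# Minimal tool-calling sketch against a granite-4.1-30b-instruct deployment.
# Assumes an OpenAI-compatible endpoint; FOUNDRY_ENDPOINT / FOUNDRY_API_KEY
# and the get_exchange_rate tool are hypothetical placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["FOUNDRY_ENDPOINT"],
    api_key=os.environ["FOUNDRY_API_KEY"],
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_exchange_rate",
        "description": "Look up the current exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string", "description": "ISO code, e.g. EUR"},
                "quote": {"type": "string", "description": "ISO code, e.g. JPY"},
            },
            "required": ["base", "quote"],
        },
    },
}]

response = client.chat.completions.create(
    model="granite-4.1-30b-instruct",
    messages=[{"role": "user", "content": "What is 100 EUR in JPY right now?"}],
    tools=tools,
)
# The model should answer with a structured call to get_exchange_rate
# rather than free text.
print(response.choices[0].message.tool_calls)
```

The same pattern applies across the 3B, 8B, and 30B instruct models; only the model name changes.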
How to Implement It
| Use Case | Prompt Pattern |
| --- | --- |
| Multilingual Retrieval-Augmented Generation (RAG) | Provide retrieved document excerpts in any of the twelve supported languages; instruct the model to summarise and reference sources |
| Agentic Tool Calling | Detail function definitions alongside the user’s objectives; the model plans and performs tool calls in a structured format |
| Document Extraction (granite-vision-4.1-4b) | Upload a PDF page image; extract tables, key figures, or form fields as structured JSON |
| Safety Classification (granite-guardian-4.1-8b) | Pass user inputs or model outputs; receive a structured risk assessment before delivering the response |
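As a concrete instance of the first row, here is a small sketch of how the multilingual RAG prompt might be assembled; the passages and field names are invented for illustration, and the retrieval step itself is out of scope.

```python
# Build a citation-friendly RAG prompt from retrieved passages in any of
# the twelve supported languages. Passages and ids are illustrative only.
passages = [
    {"id": 1, "lang": "de", "text": "Die Richtlinie tritt am 1. März in Kraft ..."},
    {"id": 2, "lang": "en", "text": "Non-compliance may result in fines of up to ..."},
]

context = "\n".join(f"[{p['id']}] ({p['lang']}) {p['text']}" for p in passages)
prompt = (
    "Answer using only the passages below, and cite passage ids in brackets.\n\n"
    f"{context}\n\n"
    "Question: When does the directive take effect, and what penalties apply?"
)
```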
Here’s a sample prompt for deploying an enterprise document processing system:
Imagine you’re creating a multilingual document intelligence pipeline for a global financial organisation. Using the granite-4.1-30b-instruct model deployed in Microsoft Foundry, send each incoming policy or regulatory document with this command: “You are a compliance analysis assistant. Please review the document and extract: (1) all regulatory requirements, (2) the entities to which each requirement pertains, (3) any mentioned compliance deadlines, and (4) any penalties for non-compliance. Return the results as a structured JSON array, with each requirement as a separate entry.” For documents with scanned pages, first pass them through granite-vision-4.1-4b to extract text and table content, then send the result to the 30B model for compliance evaluation. Screen all user-facing outputs through granite-guardian-4.1-8b to filter out sensitive material before sharing results.
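A minimal sketch of that three-stage pipeline follows, assuming all three Granite deployments expose OpenAI-compatible chat endpoints in Foundry; the helper names, environment variables, and the guardian’s yes/no verdict convention are assumptions to verify against the model cards.

```python
# Three-stage document pipeline: vision extraction -> compliance analysis
# -> safety screening. Endpoint shape and guardian output format assumed.
import base64
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["FOUNDRY_ENDPOINT"],
                api_key=os.environ["FOUNDRY_API_KEY"])

def extract_text(page_png: bytes) -> str:
    """Stage 1: pull text and tables from a scanned page with granite-vision-4.1-4b."""
    b64 = base64.b64encode(page_png).decode()
    r = client.chat.completions.create(
        model="granite-vision-4.1-4b",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Extract all text and table content from this page."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return r.choices[0].message.content

def analyse_compliance(document_text: str) -> str:
    """Stage 2: run the compliance prompt from the article against the 30B model."""
    r = client.chat.completions.create(
        model="granite-4.1-30b-instruct",
        messages=[
            {"role": "system", "content": (
                "You are a compliance analysis assistant. Extract regulatory "
                "requirements, affected entities, deadlines, and penalties as "
                "a structured JSON array.")},
            {"role": "user", "content": document_text},
        ],
    )
    return r.choices[0].message.content

def is_safe(text: str) -> bool:
    """Stage 3: screen output with granite-guardian-4.1-8b.

    Assumes the guardian replies with a yes/no risk verdict; check the
    model card for its actual response format.
    """
    r = client.chat.completions.create(
        model="granite-guardian-4.1-8b",
        messages=[{"role": "user", "content": text}],
    )
    return not r.choices[0].message.content.strip().lower().startswith("yes")
```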
NVIDIA Nemotron-3-Nano-Omni-30B-A3B-Reasoning: Model Specifications
- Parameters / size: 31B total, ~3B active per forward pass (Mamba2-Transformer Hybrid Mixture-of-Experts)
- Context length: 256,000 tokens
- Primary task: Video-audio-image-text-to-text (Multimodal understanding, reasoning, tool calling)
What’s Unique About This Model
- Integrates multiple input types into a single efficient endpoint: The Nemotron-3-Nano-Omni-30B-A3B-Reasoning can handle video (up to 2 minutes), audio (up to 1 hour), images (RGB), and text—all via one model accessible in Microsoft Foundry. Three variations are available: full-precision BF16, FP8, and NVFP4. For more, visit the Nemotron Nano Omni technical report.
- Exceptional performance across vision and audio benchmarks: with reasoning mode active, the model scores 82.8 on MathVista-MINI (visual math reasoning), 67.04 on OCRBenchV2-EN (document OCR), 63.6 on Charxiv Reasoning (chart comprehension), 72.2 on Video MME (video question answering), 74.52 on Daily Omni (joint video and audio understanding), and 89.39 on VoiceBench (following spoken instructions). On OSWorld (a benchmark assessing autonomous computer use), it achieves 47.4, a strong result for a model with roughly 3B active parameters. (Consult the model cards for additional benchmark data.)
- Mamba2-Transformer Hybrid MoE for efficient context processing: This model alternates between Mamba2 state-space blocks (which allow for linear sequence processing rather than quadratic) and traditional Transformer attention blocks, combined with Mixture-of-Experts feedforward layers. Only around 3B parameters are engaged per token, despite having a total of 31B, enabling efficient use of the 256K context window without excessive computational costs.
- Word-level timestamps, JSON output, and tool calling for streamlined media workflows: the model emits word-level timestamps from audio, so transcripts align precisely with timecodes for review and indexing. Combined with JSON-structured output, chain-of-thought reasoning, and tool calling, it can take raw media (meeting audio, training videos) straight to structured results without separate transcription or OCR stages. A request sketch follows this list.
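Here is a minimal sketch of such a request, assuming the deployment accepts the OpenAI-style `input_audio` content part; the exact multimodal request shape should be verified against the model card.

```python
# Request a timestamped transcript plus structured JSON from a
# Nemotron-3-Nano-Omni deployment. The input_audio content part follows
# the OpenAI-compatible convention; whether this endpoint accepts that
# exact shape is an assumption.
import base64
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["FOUNDRY_ENDPOINT"],
                api_key=os.environ["FOUNDRY_API_KEY"])

with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

r = client.chat.completions.create(
    model="Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8",
    messages=[{"role": "user", "content": [
        {"type": "input_audio",
         "input_audio": {"data": audio_b64, "format": "wav"}},
        {"type": "text",
         "text": "Transcribe this meeting with word-level timestamps, then "
                 "return JSON with fields: transcript, action_items, decisions."},
    ]}],
    response_format={"type": "json_object"},
)
print(r.choices[0].message.content)
```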
Implementation Tips
| Use Case | Prompt Pattern |
| --- | --- |
| Meeting Intelligence | Submit an audio recording (up to 1 hour); generate a transcript with timestamps, action items, and decisions as structured JSON |
| Analysing Video Content | Provide a video clip (up to 2 minutes) with a query; receive a timestamped summary of important events or spoken content |
| Joint Document and Audio Analysis | Upload a scanned document image with corresponding audio explanation; extract and reconcile data from both formats |
| Multimodal Tool Calling | Supply tool definitions along with combined image/audio input; the model reasons through the material and executes structured tool calls |
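For the video-analysis row, a similarly hedged sketch: the `video_url` content part follows the convention used by some OpenAI-compatible multimodal servers (for example vLLM) and may differ on your endpoint.

```python
# Ask for a timestamped summary of a short clip. The video_url content
# part is a convention used by some OpenAI-compatible servers; verify it
# against the deployment's documentation before relying on it.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["FOUNDRY_ENDPOINT"],
                api_key=os.environ["FOUNDRY_API_KEY"])

r = client.chat.completions.create(
    model="Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8",
    messages=[{"role": "user", "content": [
        {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
        {"type": "text",
         "text": "Summarise the key events and spoken content in this clip, "
                 "with a timestamp for each item."},
    ]}],
)
print(r.choices[0].message.content)
```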
For a media compliance deployment, here’s a sample prompt:
You are designing a compliance review system for a media organisation. Using the Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 model in Microsoft Foundry, send each recorded video segment with this command: “Examine this video segment and create a compliance summary as a JSON object including: transcript (complete text with word-level timestamps), flagged_segments (array of objects, each containing start_time, end_time, content, and reason for flagging), speaker_count (estimated number of distinct speakers), and compliance_summary (overall evaluation). Highlight any content featuring unverified claims, restricted product categories, or incomplete regulatory disclosures.” Use the word-level timestamps from the compliance summary to route flagged segments to the editorial review queue with exact timecode references.
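The routing step at the end is plain data handling; here is a small sketch, assuming the JSON field names from the prompt above and a stand-in queue format.

```python
# Turn the model's compliance JSON into review-queue entries keyed by
# timecode. Field names match the prompt above; the queue dicts stand in
# for whatever ticketing system you use.
import json

def to_review_queue(compliance_json: str) -> list[dict]:
    report = json.loads(compliance_json)
    return [
        {
            "timecode": f"{seg['start_time']}-{seg['end_time']}",
            "reason": seg["reason"],
            "excerpt": seg["content"],
            "overall_assessment": report.get("compliance_summary", ""),
        }
        for seg in report.get("flagged_segments", [])
    ]
```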
Qwen3.6-35B-A3B: Model Specifications
- Parameters / size: 35B overall, 3B active (Mixture-of-Experts)
- Context length: 262,144 tokens natively; can be extended to 1,010,000 tokens
- Primary task: Image-text-to-text (agentic coding, reasoning, vision)
Why This Model Stands Out
- Gains in agentic coding over Qwen3.5-35B-A3B: Qwen3.6-35B-A3B scores 73.4 on SWE-bench Verified (up from 70.0 for Qwen3.5), 67.2 on SWE-bench Multilingual, and 49.5 on SWE-bench Pro, with 51.5 on Terminal-Bench 2.0. The update focuses on frontend workflows and repository-level reasoning, two gaps flagged in earlier versions. For details, see the blog post: Qwen3.6-35B-A3B.
- Hybrid architecture: Gated DeltaNet plus Mixture-of-Experts: the model stacks 40 layers alternating Gated DeltaNet blocks (a linear-attention variant that avoids the quadratic cost of standard self-attention) with Gated Attention blocks using Grouped Query Attention (16 query heads, 2 key-value heads), alongside Mixture-of-Experts (MoE) feedforward layers with 256 experts, of which 8 routed and 1 shared expert are active per token. Only about 3B parameters engage per forward pass, keeping inference costs near those of a 3B dense model while drawing on the knowledge and specialisation of a 35B model (a back-of-envelope illustration follows this list).
- Reasoning context preserved between conversation turns: Qwen3.6 retains analytical context from prior messages in multi-turn conversations. Earlier models discarded chain-of-thought traces between turns, forcing the model to re-derive context each time. With preservation enabled, iterative workflows such as debugging become more efficient because reasoning builds on itself across the interaction.
- Context extensible to 1 million tokens: the model’s native 262K-token context ranks among the largest for open models at this scale and can be extended to 1,010,000 tokens. On GPQA Diamond (science reasoning) it scores 86.0, ahead of both Gemma 4 31B (84.3) and Qwen3.5-27B (85.5), while matching Gemma 4 on MMLU Pro.
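To see why so few of the 35B parameters are touched per token, here is a back-of-envelope illustration; the per-expert and backbone sizes are invented purely for the arithmetic, and only the 256-expert, 8-routed-plus-1-shared split comes from the release notes.

```python
# Toy MoE arithmetic: with 256 experts but only 9 active per token, most
# of the parameter budget sits idle on any given forward pass. Expert and
# backbone sizes below are hypothetical, chosen only to make the point.
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 8 + 1        # 8 routed + 1 shared, per the architecture note
PARAMS_PER_EXPERT = 0.125e9   # hypothetical
BACKBONE_PARAMS = 2.0e9       # hypothetical always-active attention/embeddings

total = BACKBONE_PARAMS + TOTAL_EXPERTS * PARAMS_PER_EXPERT
active = BACKBONE_PARAMS + ACTIVE_EXPERTS * PARAMS_PER_EXPERT
print(f"total = {total / 1e9:.1f}B, active = {active / 1e9:.1f}B per token")
# total = 34.0B, active = 3.1B, roughly the 35B-total / 3B-active split
```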
How to Use
| Use Case | Prompt Pattern |
| --- | --- |
| Changes to Repository Code | Provide the repository structure along with task details; the model plans the necessary file edits and generates a unified diff |
| Iterative Debugging Over Multiple Turns | Activate thinking preservation; submit the failing test and related code across multiple turns; build up reasoning context |
| Frontend Code Generation | Share a design specification or screenshot with the context of the current codebase; generate the relevant component implementation |
| Reasoning Over Long Documents | Send the technical specification (up to 262K tokens); ask the model to identify ambiguities or gaps in implementation |
Here’s a sample prompt for a software engineering deployment:
You are developing an automated code review and implementation assistant for a platform engineering team. Using the Qwen3.6-35B-A3B model in Microsoft Foundry, enable thinking preservation for multi-turn sessions. First, provide the repository structure and a GitHub issue describing a necessary change to an API endpoint, and instruct the model: “Examine the repository structure and propose an implementation plan detailing what files require alteration and why.” On the second turn, send the relevant source files and prompt: “Based on your prior plan, execute the changes and generate a unified diff.” On the third turn, pass the test suite and prompt: “Create additional unit tests for the new endpoint, addressing edge cases uncovered in your reasoning.” Thinking preservation lets the model retain its grasp of the codebase across all turns without revisiting previous details.
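A minimal sketch of that three-turn session, assuming an OpenAI-compatible Qwen3.6-35B-A3B deployment; here thinking preservation is modelled simply by carrying the full message history forward, and the placeholder strings stand in for the real repository artefacts.

```python
# Three-turn code-review session that accumulates context across turns.
# Endpoint variables are placeholders; the <...> strings stand in for the
# actual repo tree, source files, and test suite.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["FOUNDRY_ENDPOINT"],
                api_key=os.environ["FOUNDRY_API_KEY"])
history = []

def turn(user_content: str) -> str:
    """Send one turn, keeping full history so reasoning can build on itself."""
    history.append({"role": "user", "content": user_content})
    r = client.chat.completions.create(model="Qwen3.6-35B-A3B", messages=history)
    reply = r.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

plan = turn("<repo tree + GitHub issue> Propose an implementation plan: "
            "which files change and why.")
diff = turn("<relevant source files> Based on your plan, make the changes "
            "and return a unified diff.")
tests = turn("<test suite> Add unit tests for the new endpoint, covering "
             "the edge cases you identified.")
```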
You can deploy open-source Hugging Face models effortlessly within Microsoft Foundry. Simply explore the Hugging Face collection in the Foundry model catalog and deploy to managed endpoints in just a few clicks. Alternatively, start from the Hugging Face Hub by choosing any supported model and selecting “Deploy on Microsoft Foundry”, guiding you directly into Azure with secure and scalable inference already set up. For step-by-step guidance on discovering models and deploying them, refer to the Microsoft Foundry documentation.
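For a programmatic route, here is a hedged sketch using the azure-ai-ml SDK’s `ServerlessEndpoint` entity. Note the caveats: some catalog models (including many Hugging Face models) deploy to managed online endpoints rather than serverless ones, and the registry path shown for `model_id` is an assumption; copy the real value from the Foundry catalog.

```python
# Deploy a catalog model to a serverless endpoint with the azure-ai-ml SDK.
# Subscription/workspace values and the model_id registry path are
# placeholders; copy the real model_id from the Foundry model catalog.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ServerlessEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<foundry-project>",
)

endpoint = ServerlessEndpoint(
    name="my-model-endpoint",  # endpoint name of your choosing
    model_id="azureml://registries/HuggingFace/models/<model-name>/labels/latest",
)
created = ml_client.serverless_endpoints.begin_create_or_update(endpoint).result()
print(created.scoring_uri)  # base URL for inference requests
```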