Building AI Platforms at Hyperscale
I am a technology and product leader working at the intersection of hyperscale infrastructure, accelerated compute, and applied artificial intelligence. My career has focused on building the platforms that allow AI systems to operate reliably, efficiently, and at global scale.
I currently serve at Microsoft Azure, where I lead platform strategy and execution across Linux-based systems that power large-scale GPU and accelerator fleets. My scope spans silicon enablement, operating system quality, and end-to-end readiness for modern AI workloads: training, inference, and emerging agentic systems.
My work centers on one core responsibility: ensuring that advanced AI models can run predictably on complex, distributed infrastructure.
At hyperscale, success is determined less by individual features and more by system behavior under pressure. I lead initiatives that set quality bars across millions of cores, define validation frameworks for new silicon generations, and align hardware, firmware, kernel, and orchestration layers into a coherent platform.
This includes deep engagement with GPU and accelerator ecosystems (NVIDIA and AMD), covering fleet bring-up, performance characterization, and kernel-level inference optimization. My teams focus on platform latency, throughput, memory efficiency, and failure modes that only appear at scale.
Beyond infrastructure, I work directly with AI workloads. I spend time evaluating large language models, measuring inference behavior, and understanding how system design influences token cost, response time, and reliability.
I actively experiment with agentic architectures, retrieval-augmented generation, and workflow-driven AI systems. My interest is not theoretical. I focus on how agents behave in production: how they fail, how they are evaluated, and how platforms must adapt to support long-running, stateful, tool-using AI.
This perspective allows me to bridge hardware capability with application reality. I help shape platforms that support not just model execution, but continuous evaluation, iteration, and operational control.
Before Microsoft, I held leadership roles at T-Mobile USA, where I managed national capacity planning and multi-billion-dollar infrastructure investments. I led data-driven forecasting initiatives and applied machine learning to real-world network performance problems, improving customer experience at nationwide scale.
Across roles, I have worked closely with silicon vendors, cloud engineering teams, open-source communities, and executive leadership. I hold multiple patents in machine learning and network quality, reflecting a long-standing focus on practical innovation.
I approach leadership through systems thinking and technical depth. I value clarity, repeatability, and evidence over optimism. Platforms succeed when incentives, tooling, and architecture align, and when teams are empowered to reason from first principles.
Outside of work, I am a mountaineer and ski instructor. The same habits apply: preparation, respect for constraints, and steady decision-making when conditions change.
If you are thinking about the future of AI platforms (GPUs, inference, agents, and the infrastructure beneath them), I am always interested in the conversation.