Site Reliability Engineer (AI)

  • Full Time
  • Anywhere

Madiff

This is a remote position.

We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.

The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.

The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.

Responsibilities

  • Build and maintain a central monitoring and alerting layer for AI applications and pipelines
  • Define and implement SLIs, alerts, and operational dashboards
  • Manage incidents including triage, coordination, root cause analysis, and prevention
  • Standardise telemetry across systems including latency, throughput, and failures
  • Optimise CI/CD pipelines and introduce quality gates for reliability
  • Work closely with engineering teams to reduce recurring issues and improve stability

Requirements

  • Minimum 5+ years of experience in SRE, Platform, or Production Engineering
  • Strong hands-on experience with Kubernetes and production environments
  • Experience with Azure and Azure DevOps
  • Experience with monitoring tools such as Datadog
  • Strong understanding of incident management and root cause analysis
  • Ability to build practical monitoring and alerting systems

Nice to have

  • Experience with AI or LLM pipelines
  • Experience building monitoring platforms across multiple systems
  • Experience with Grafana
  • Experience working in large-scale or distributed environments

Expectations

  • Strong ownership mindset and accountability for system stability
  • Proactive approach to identifying risks and improvements
  • Hands-on engineer who actively works with systems rather than only coordinating
  • Comfortable working in dynamic and evolving environments

Benefits

  • Solid, competitive salary
  • Work in a multinational environment on international projects
  • Comprehensive healthcare
  • Long-term B2B contract with a stable project pipeline
  • Work model: fully remote

Originally posted on Himalayas

To apply for this job please visit himalayas.app.
