Madiff
This is a remote position.
We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.
The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.
The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.
- Build and maintain a central monitoring and alerting layer for AI applications and pipelines
- Define and implement SLIs, alerts, and operational dashboards
- Manage incidents including triage, coordination, root cause analysis, and prevention
- Standardise telemetry across systems including latency, throughput, and failures
- Optimise CI/CD pipelines and introduce quality gates for reliability
- Work closely with engineering teams to reduce recurring issues and improve stability
Requirements
- Minimum of 5 years of experience in SRE, Platform, or Production Engineering
- Strong hands-on experience with Kubernetes and production environments
- Experience with Azure and Azure DevOps
- Experience with monitoring tools such as Datadog
- Strong understanding of incident management and root cause analysis
- Ability to build practical monitoring and alerting systems
- Experience with AI or LLM pipelines
- Experience building monitoring platforms across multiple systems
- Experience with Grafana
- Experience working in large-scale or distributed environments
- Strong ownership mindset and accountability for system stability
- Proactive approach to identifying risks and improvements
- Hands-on engineer who works directly with systems rather than only coordinating
- Comfortable working in dynamic and evolving environments
Benefits
- Solid, competitive salary
- Work in a multinational environment on international projects
- Comprehensive healthcare
- Long-term B2B contract with a stable project pipeline
- Work model: fully remote
Originally posted on Himalayas
To apply for this job, please visit himalayas.app.