CertifID
What You’ll Do
- Reliability & Platform Operations: Own and improve the reliability, availability, and performance of production systems while defining and operationalizing SLIs/SLOs and error budgets.
- AI Agent Enablement: Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems and applications. Build agents capable of consuming multi-source observability data (metrics, logs, traces, etc.).
- Incident Response: Participate in and help lead an on-call rotation, serving as an escalation point for major incidents and facilitating blameless postmortems.
- Automation & Infrastructure: Build automated workflows to eliminate manual work and design/maintain Infrastructure-as-Code with Terraform.
- Observability: Improve metrics, logs, traces, and alerting using tools like Datadog or Prometheus to reduce noise and increase signal.
- Collaboration & Mentorship: Partner with application teams to implement reliability best practices and mentor junior engineers to foster a culture of knowledge sharing.
Who You Are
- Strategic Architect: You look beyond the “what” to understand the “why,” providing insights that influence our go-to-market (GTM) and technical direction.
- Startup Veteran: You are comfortable moving fast and staying proactive in an environment where the playbook is still being written.
- Relatable & Adaptable: You can navigate different personalities across the organization, from high-energy sales teams to analytical engineering partners.
- Lifelong Learner: You have a thirst for learning, keeping up with emerging technologies and industry trends.
What We’re Looking For
- Experience: 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
- Cloud Expertise: Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
- Technical Stack: Strong Linux, networking, and distributed systems troubleshooting skills.
- Containers: Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
- IaC & Tooling: Expertise with Infrastructure-as-Code (Terraform strongly preferred).
- Programming: Strong scripting/programming skills in Python, Go, Bash, or C#/.NET.
- Observability: Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry.
What We Offer
- Flexible vacation
- 12 company-paid holidays
- 10 paid sick days
- No work on your birthday
- Health, dental, and vision insurance (including a $0 option)
- 401(k) with matching, and no waiting period
- Equity
- Life insurance
- Generous paid parental leave
- Wellness reimbursement of $300/year
- Remote worker reimbursement of $300/year
- Professional development reimbursement
- Competitive pay
- An award-winning culture
Originally posted on Himalayas
To apply for this job, please visit himalayas.app.