InvestorFlow
You Will
- Design and implement comprehensive monitoring strategies rather than owning observability platforms outright.
- Collaborate with DevOps and Engineering on shared observability platforms (Grafana, Prometheus/Loki, Azure Monitor/Application Insights).
- Define golden signals dashboards, measure SLOs/SLIs/error budgets, and help implement actionable alerting.
- Drive structured logging standards, distributed tracing patterns, and OpenTelemetry implementation standards for teams to deploy and SRE to validate.
- Monitor and audit production systems to ensure instrumentation is complete.
- Take ownership of production incident response, lead incident handling, and drive remediation.
- Conduct blameless post-incident reviews and ensure follow-through on action items.
- Continuously improve operational processes, reliability practices, and team readiness.
- Monitor system resource utilization and forecast future needs.
- Tune autoscaling configurations in partnership with Engineering teams.
- Evaluate capacity efficiency and support cost optimization strategies.
- Validate DR environments and test failover processes rather than build them.
- Ensure DR capabilities function as designed and are clearly documented.
- Define and lead regular DR drills in partnership with Engineering/Platform teams.
- Work with the Non-Functional Testing team on resilience and DR scenario simulations.
- Support chaos experiment planning and validation as a nice-to-have capability.
You Have
- 5+ years in Site Reliability Engineering, Production Engineering, or related operations roles.
- Strong knowledge of cloud-native systems, preferably Microsoft Azure.
- Experience with observability tooling (Grafana ecosystem, Prometheus/Loki, Azure Monitor, Application Insights).
- Understanding of DR concepts, failover validation, and operational readiness.
- Familiarity with chaos engineering practices (nice-to-have).
- Ability to read Terraform/HCL is a plus but not required.
- Strong grasp of SRE principles (SLOs/SLIs, error budgets, toil reduction, postmortems).
- Strong collaboration and communication skills.
Mindset We Value
- Treat observability as a foundational product feature — not an afterthought.
- Proactively break systems to strengthen them.
- Automate away repetitive pain and convert incidents into lasting defenses.
- Clearly articulate complex risks, trade-offs, and recovery approaches to both technical and non-technical stakeholders.
- Remain composed during incidents while staying relentlessly focused on prevention.
Originally posted on Himalayas
To apply for this job please visit himalayas.app.