
Site Reliability Engineer (SRE)

  • Full Time
  • Anywhere

Bright Vision Technologies

Bright Vision Technologies is seeking an experienced Site Reliability Engineer to join our dynamic team and contribute to our mission of transforming business processes through technology.

Responsibilities

  • Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services, and use those measures to drive concrete engineering and prioritization decisions.
  • Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed, and ensuring high-quality post-incident reviews that drive lasting improvements.
  • Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling so that operators have rich, actionable visibility into system behavior.
  • Build and maintain robust on-call processes, runbooks, and escalation paths that reduce mean time to detect and mean time to resolve while protecting the well-being of the engineers on rotation.
  • Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages, replacing manual workflows with reliable, auditable automation.
  • Architect and operate large-scale Kubernetes clusters and container-based workloads, including autoscaling, capacity planning, network policy, and integration with service meshes.
  • Design CI/CD pipelines that promote safe, frequent, and observable releases, supported by automated testing, canary deployments, feature flags, and progressive rollout strategies.
  • Lead capacity planning and performance engineering activities, building models that predict growth and stress, and validating those models through load testing and chaos experiments.
  • Partner closely with application development teams to embed reliability practices early in design, including failure-mode analyses, graceful degradation patterns, and dependency hardening.
  • Strengthen the platform’s resiliency through chaos engineering, fault injection, dependency isolation, retries, timeouts, circuit breakers, and well-tested failover paths.
  • Drive continuous improvement of security posture in collaboration with security teams, including patch management, vulnerability remediation, and secure-by-default platform configurations.
  • Contribute to the technical roadmap for reliability tooling, observability platforms, and developer-experience improvements that reduce friction and improve outcomes for engineering teams.
  • Mentor engineers across the organization on SRE practices and foster a strong, blameless culture of operational excellence.

Benefits

  • Competitive base salary commensurate with experience, plus benefits.
  • 401(k) matching.

Originally posted on Himalayas

To apply for this job, please visit himalayas.app.
