We’re hiring a Lead Site Reliability Engineer (SRE) to revitalize and lead the SRE function at Welcome to the Jungle, a ~300-person scale-up. You'll be stepping into a critical role: rather than starting from scratch, you'll build on and evolve existing foundations.
As the first new hire in this transition, you’ll be a hands-on technical leader responsible for continuing to define and implement our SRE strategy, ensuring the reliability, performance, and scalability of our platform.
You'll collaborate closely with engineering, product, and security teams to improve automation, observability, and infrastructure practices in a modern, cloud-native stack. The ideal candidate brings deep technical expertise, strong leadership, and the ability to both stabilize and grow a function in flux.
Key Responsibilities
Strategic & Technical Leadership
Define the vision, standards, and roadmap for Site Reliability Engineering at Welcome to the Jungle.
Lead the design and implementation of scalable, secure infrastructure in AWS using IaC (Terraform, Terragrunt).
Champion GitOps and CI/CD best practices via ArgoCD and CircleCI.
Own the development and enforcement of service-level objectives (SLOs) and service-level indicators (SLIs).
Drive observability across the stack using OpenTelemetry and Datadog to ensure proactive issue detection and resolution.
Establish disaster recovery strategies, high-availability design patterns, and cost-effective infrastructure choices.
Operational Ownership & Automation
Lead incident response processes, postmortems, and on-call rotation design.
Build and maintain operational documentation and automation to reduce manual toil.
Ensure robust alerting, logging, and telemetry across all environments.
Proactively identify and remove bottlenecks in infrastructure and deployment workflows.
Improve platform performance and reliability through rigorous monitoring, testing, and system design.
Cross-team Collaboration & Knowledge Sharing
Collaborate with development teams to ensure new services are production-ready and follow reliability best practices.
Partner with Security and DevOps to ensure infrastructure meets compliance and security standards.
Mentor developers and foster a reliability-focused engineering culture across the company.
Lead internal knowledge sharing and help scale the SRE mindset organization-wide.
Act as a trusted advisor to engineering leadership on system reliability, scalability, and tooling.