Since 2015, we’ve been growing our consulting business by delivering exceptional-quality work to our clients. We’re not afraid to take risks and always strive to find the best solution, not just the easiest one. Our highly skilled team of engineers is committed to using their expertise to tackle every challenge with passion and precision.
Teamwork is at the heart of everything we do. We believe in the power of collaboration, knowledge sharing, and mutual support. At Ambush, you’ll find a dynamic environment where you’re encouraged to grow, learn, and share your expertise with your colleagues. We offer various initiatives to help you enhance your skills and broaden your knowledge base.
If you’re a team player who’s driven to achieve great things and passionate about making a real impact, we want you on our team.
When you join us, you will:
- Design, implement, and maintain end-to-end observability solutions.
- Develop and maintain observability strategies encompassing logging, metrics, tracing, and alerting.
- Work closely with development and operations teams to ensure optimal performance and uptime.
- Implement synthetic monitoring solutions to proactively identify potential performance or reliability issues from an end-user perspective.
- Monitor production systems to identify performance bottlenecks.
- Diagnose and resolve system failures and complex infrastructure issues.
- Develop self-service tooling and runbooks to minimize manual tasks (toil).
- Conduct post-incident reviews to identify root causes and drive continuous improvement.
- Collaborate with engineering teams to implement resilient solutions and incident mitigation strategies.
- Define and manage escalation processes to ensure timely communication and resolution of critical incidents.
- Provide mentorship and guidance on observability and SRE best practices.
What we'd like to see in a candidate:
- Strong focus on observability.
- Experience with observability best practices and tools (Prometheus, Grafana, New Relic, Datadog, ELK Stack, OpenTelemetry).
- Knowledge of SLIs (Service Level Indicators) and SLOs (Service Level Objectives) to ensure high availability.
- Experience with Infrastructure as Code (IaC) implementation using Terraform.
- Ability to advocate for SRE principles (blameless postmortems, error budgets) across the organization.
- Proficiency in one or more programming/scripting languages (Go, Python, Bash).
- Hands-on experience with AWS.
- Familiarity with containerization and orchestration (Docker, Kubernetes).
- Experience with CI/CD pipelines and automation frameworks.
- Solid understanding of Linux/Unix systems and networking fundamentals.
- Excellent English communication skills.
- Strong analytical and problem-solving abilities.