Staff Site Reliability Engineer (m/f/d)
About the Role
As a Staff Site Reliability Engineer at ARX Robotics, your mission is to transform our central Cloud & IT services into highly reliable, observable, and automated products. You will take ownership of the critical infrastructure that our engineering teams depend on every day, including Vault/PKI, CI/CD systems, monitoring platforms, and other self-hosted tools.
This role is for you if you are driven by a deep need to automate, have strong opinions about backups because you’ve had to restore from them, and believe that doing a task manually more than once is a bug. You’ll be at the core of our engineering ecosystem, ensuring the systems that carry the company are robust, resilient, and always improving.
What You’ll Build
- Clear service ownership, SLOs, and incident response workflows for our shared platform services.
- A comprehensive observability practice with meaningful metrics, logs, alerts, and operational dashboards.
- Resilient and automated patterns for deployment, monitoring, backup, and recovery.
- Pragmatic automations that eliminate recurring operational work and unblock engineering teams.
- Highly available and secure shared services like Vault/PKI, build infrastructure, and CI/CD support systems.
- Actionable runbooks and operational documentation that empower teams to respond with confidence.
- Strong partnerships with engineering teams to establish clear ownership boundaries and improve service handoffs.
- A close collaboration with Backend Engineering to ensure new internal applications are operable from day one.
- A culture of reliability, built through participation in incident response, recovery drills, and blameless post-mortems.
What You Bring
- A deep-seated passion for reliability and automation, likely demonstrated by personal projects, a homelab, or a history of automating your own workflows.
- Proven experience in a Site Reliability, DevOps, or Platform Engineering role where you were responsible for production systems.
- Hands-on experience operating and improving shared services like CI/CD, secrets management, or monitoring platforms.
- An automation-first mindset, with the scripting skills (e.g., Python, Go, or shell) to back it up.
- A strong understanding of observability principles and experience building out monitoring for production services.
- The ability to write clear and concise documentation, especially for runbooks and incident procedures.
- A proactive, collaborative approach to problem-solving and a commitment to operational excellence.
Please note: You do not need to meet every single requirement to apply. We welcome motivated candidates who are eager to grow into the role and develop their expertise further.