Site Reliability Engineer (SRE)
Company Overview:
We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.
Job Purpose:
We’re looking for an experienced Site Reliability Engineer to help design, operate, and evolve our infrastructure stack. You’ll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable 24/7 — all while mentoring teammates, improving processes, and helping translate deep technical concepts for a wide range of collaborators.
Role Responsibilities:
Architect and operate resilient, scalable infrastructure supporting AI/HPC workloads
Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystems for performance
Configure, monitor, and manage bare-metal infrastructure using IPMI, Redfish, etc.
Support and evolve platform delivery components: Kubernetes, MaaS, CNIs
Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
Maintain and enhance Ori’s observability stack: Prometheus, Grafana, Mimir, and custom monitoring integrations
Operate and support services in 24x7 production environments, including on-call rotation
Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
Mentor junior engineers and act as a consultant on operational requirements for other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of “do, document, automate”
Cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
Essential Skills & Experience:
5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24x7 support model
Expert-level Linux administration, especially Ubuntu distributions
Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
Familiarity with out-of-band management tools (IPMI, Redfish, PXE, etc.)
Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
Deep understanding of observability principles and tools (Prometheus, Grafana)
Hands-on experience operating orchestration platforms (Kubernetes, MaaS, Tinkerbell)
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers
Bonus: Knowledge of HPC workloads and GPU-based infrastructure
Bonus: Experience with InfiniBand networks and HPC performance tuning
Preferred Qualifications:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience
LPIC Certifications
ITIL Foundation level qualification or equivalent experience
Certified Kubernetes Administrator (CKA)
This job description is not intended to be all-inclusive. Employees may be required to perform other duties as assigned.
Salary Range Information
Based on market data and other factors, the salary range for this position is $150,000 - $220,000 and will vary depending on the candidate's experience.
Equal Opportunity Employer
Ori is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
- Department: Engineering
- Locations: US, Remote Working
- Remote status: Fully Remote
- Yearly salary: $150,000 - $220,000