HPC Site Reliability Engineer (SRE)
Job Purpose
Ori is the AI Native GPU Cloud, and this role is pivotal in shaping our HPC infrastructure. As an HPC SRE, you will manage and optimise our high-performance computing environments and ensure their reliability. You will be the go-to expert for all technical aspects of our HPC infrastructure, including system architecture, optimisation, integrations, and networking. You will collaborate with cross-functional teams, driving innovations that align with business objectives and enhance user experiences. The role also includes 24/7 on-call support to maintain high availability and performance for HPC systems.
Key Responsibilities
Infrastructure Management
Maintain and optimise HPC infrastructure, ensuring reliability and performance of NVIDIA-based systems.
Set up HPC clusters on DGX or HGX platforms with GPUDirect, and optimise networking for server-to-storage and storage-to-storage connectivity, including multi-cloud and WAN HPC interconnectivity.
Configure, troubleshoot, and rapidly repair routing and switching (R&S) hardware from Cisco, Juniper, or other relevant vendors.
Automation and Efficiency
Write, execute, and debug Ansible playbooks for Cumulus Linux automation.
Utilise and maintain automated configuration management systems such as Ansible and Terraform.
Incident Management
Lead investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
Proactively monitor data centre health checks, licensing, and life-cycle management upgrades.
Provide 24/7 support through on-call rotations, ensuring continuous availability and rapid incident response.
Monitoring and Observability
Use observability tools such as Grafana Cloud, ELK, NVIDIA UFM, and NetQ, together with QoS metrics, to monitor system health and performance.
Develop and implement monitoring strategies to ensure high availability and performance of HPC systems.
Collaboration and Communication
Collaborate with HPC solution architects and engineers to drive innovation and optimisation.
Provide regular reports on P1/P2 incidents, RCAs, life-cycle upgrades, and change/incident management actions to senior management.
Maintain comprehensive documentation of infrastructure audits and policy changes.
Key Objectives and Goals
Reliability: Achieve and maintain high availability and uptime for HPC systems.
Performance: Continuously optimise the performance of NVIDIA-based and other HPC systems.
Scalability: Develop scalable HPC solutions to support ongoing business growth.
Automation: Increase the level of automation to enhance efficiency and reduce manual tasks.
Continuous Availability: Ensure 24/7 support through effective coverage and on-call practices.
Collaboration: Foster a collaborative environment within the SRE teams and with other departments.
Continuous Improvement: Promote a culture of ongoing learning and improvement.
Key Metrics
MTTR (Mean Time to Recovery): Measure and minimise the time taken to recover from incidents.
MTBF (Mean Time Between Failures): Monitor and maximise the time between system failures.
System Uptime: Track and maintain high levels of system availability.
Service Level Objectives (SLOs): Set and meet clear SLOs for reliability and performance.
Service Level Indicators (SLIs): Define and monitor SLIs to ensure service quality.
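For illustration only, here is a minimal Python sketch of how the metrics above are commonly derived from incident records. The `Incident` record, the function names, and any SLO target passed in are hypothetical and are not a description of Ori's tooling. It assumes outage windows do not overlap and fall entirely inside the observation window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """One outage window (hypothetical record format)."""
    start: datetime  # service became unavailable
    end: datetime    # service was restored


def reliability_metrics(incidents, window_start, window_end):
    """Return MTTR, MTBF, and availability for an observation window.

    Assumes incidents do not overlap and lie inside the window.
    """
    window = window_end - window_start
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = window - downtime
    n = len(incidents)
    return {
        # Mean time to recovery: average outage duration.
        "mttr": downtime / n if n else None,
        # Mean time between failures: operating time per failure.
        "mtbf": uptime / n if n else None,
        # Availability SLI: fraction of the window the service was up.
        "availability": uptime / window,
    }


def error_budget_remaining(availability, slo_target):
    """Fraction of the SLO error budget still unspent (negative if overspent)."""
    allowed = 1.0 - slo_target   # downtime fraction the SLO permits
    used = 1.0 - availability    # downtime fraction actually observed
    return 1.0 - used / allowed
```

Availability here plays the role of the SLI, and comparing it with a target such as 0.99 through `error_budget_remaining` is one common way to track an SLO.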
Required Qualifications
Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field.
6+ years of proven experience in networking and data centre operations, particularly with recent HPC architectures, NetDevOps workflows, NVIDIA Air, and GNS3 simulations.
3+ years of experience as a Site Reliability Engineer or in a similar role.
Expertise in networking technologies: TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VXLAN, SHARP, Segment Routing, MPLS, IS-IS, DWDM.
In-depth knowledge of network protocols such as RoCE, RDMA, and IBoE, and of network topologies such as Spine-Leaf, Link/Super Spine switching, and Fat-Tree.
Background in troubleshooting or testing server hardware/firmware, Linux OS, CLIs, and scripting.
Excellent problem-solving and on-demand decision-making skills.
Desired Skills
Certifications equivalent to CCIE, JNCIS, or InfiniBand NCP-IB.
Experience with automated configuration management systems like Ansible and Terraform.
Ability to handle high-pressure situations in HPC AI data centres.
Strong collaboration skills with HPC solution architects and engineers.
This job description is not intended to be all-inclusive. Employees may perform other duties as assigned.
Set the standard: Every single day, you spot opportunities to constructively shake things up.
Inspire the change: There’s no blueprint for the future. You’ll embrace challenges and change.
You’re real and you’re true to yourself: We cherish and celebrate diversity, so you’ll feel right at home whoever you are. Whoever you’re talking to, you treat everyone the same.
Equal Opportunity Employer
We are an Equal Opportunity Employer and do not discriminate on the basis of race, color, religion, sex, national origin, age, disability, veteran status, sexual orientation, gender identity, or any other characteristic protected by local, state, or federal law.
If you require a reasonable accommodation to apply for a job or to perform a job, please contact careers@ori.co.
- Department: Engineering
- Locations: US, Remote Working
- Remote status: Fully Remote
- Yearly salary: $200,000 - $220,000