HPC Site Reliability Engineer (SRE)
Company Overview:
Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.
Job Purpose:
Ori is the AI Native GPU Cloud - this role is pivotal in shaping our HPC infrastructure. As an HPC SRE, you will play a crucial role in managing, optimising, and ensuring the reliability of our high-performance computing environments. You will be the go-to expert for all technical aspects of our HPC infrastructure, including system architecture, optimization, integrations, and networking. You will collaborate with cross-functional teams, driving innovations that align with business objectives and enhance user experiences. This role also ensures 24/7 support, maintaining high availability and performance for HPC systems.
Role Responsibilities:
Infrastructure Management
Maintain and optimise HPC infrastructure, ensuring reliability and performance of Nvidia-based systems.
Set up HPC clusters with DGX or HGX platforms, GPU Direct, and establish network optimization for server-to-storage or storage-to-storage connectivity, including multi-cloud and WAN HPC interconnectivity.
Configure, troubleshoot, and quick-fix Networking R&S hardware from Cisco, Juniper, or relevant vendors.
Automation and Efficiency
Write, execute, and debug Ansible Playbooks for Cumulus Linux automation.
Utilise and maintain automated configuration management systems such as Ansible and Terraform.
Incident Management
Lead investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
Proactively monitor data centre health checks, licensing, and life-cycle management upgrades.
Provide 24/7 support through on-call rotations, ensuring continuous availability and rapid incident response.
Monitoring and Observability
Use observability metrics tools like Grafana Cloud, ELK, NVIDIA UFM, NetQ, and QoS metrics to monitor system health and performance.
Develop and implement monitoring strategies to ensure high availability and performance of HPC systems.
Collaboration and Communication
Collaborate with HPC solution architects and engineers to drive innovation and optimization.
Provide regular reports on P1/P2 incidents, RCAs, life-cycle upgrades, and change/incident management actions to senior management.
Maintain comprehensive documentation of infrastructure audits and policy changes.
Key Objectives and Goals
Reliability: Achieve and maintain high availability and uptime for HPC systems.
Performance: Continuously optimise the performance of Nvidia-based and other HPC systems.
Scalability: Develop scalable HPC solutions to support ongoing business growth.
Automation: Increase the level of automation to enhance efficiency and reduce manual tasks.
Continuous Availability: Ensure 24/7 support through effective coverage and on-call practices.
Collaboration: Foster a collaborative environment within the SRE teams and with other departments.
Continuous Improvement: Promote a culture of ongoing learning and improvement.
Key Metrics
MTTR (Mean Time to Recovery): Measure and minimise the time taken to recover from incidents.
MTBF (Mean Time Between Failures): Monitor and maximise the time between system failures.
System Uptime: Track and maintain high levels of system availability.
Service Level Objectives (SLOs): Set and meet clear SLOs for reliability and performance.
Service Level Indicators (SLIs): Define and monitor SLIs to ensure service quality.
Your Profile:
Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field.
6+ years of proven experience in networking and data centre operations, particularly with recent HPC architectures, NetDevOps workflows, NVIDIA Air, and GNS3 simulations.
3+ years of experience as a Site Reliability Engineer or in a similar role.
Expertise in networking technologies: TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VxLAN, SHARP, Segment Routing, BGP, MPLS, IS-IS, DWDM.
In-depth knowledge of network protocols such as RoCE, RDMA, IBoE, and network topologies like Spine Leaf, Link/Super Spine Switching, and Fat-Free topology.
Background in troubleshooting or testing server hardware/firmware, Linux OS, CLIs, and scripting.
Excellent problem-solving and on-demand decision-making skills.
Desired Skills:
Certifications equivalent to CCIE, JNCIS, or InfiniBand NCP-IB.
Experience with automated configuration management systems like Ansible and Terraform.
Ability to handle high-pressure situations in HPC AI data centres.
Strong collaboration skills with HPC solution architects and engineers.
- Department
- Engineering
- Locations
- US, Remote Working
- Remote status
- Fully Remote
- Yearly salary
- $200,000 - $220,000
HPC Site Reliability Engineer (SRE)
Loading application form