Platform Site Reliability Engineer
About Ori
Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.
Job Description
Role Responsibilities
Deploy and manage Kubernetes clusters at scale to support AI-centric workloads, across both our bare-metal clusters and trusted partner infrastructure
Develop Kubernetes Manifests and Operators: Facilitate application deployments and maintain Kubernetes-native services for networking, storage, security, identity and infrastructure management
Optimise Linux system configuration, including kernel, drivers, filesystems and services, to support workloads running via our orchestration layer
Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement
Maintain and enhance Ori's observability stack: Prometheus, Grafana, and custom monitoring integrations
Operate and support services in 24x7 production environments, including on-call rotation
Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
Mentor junior engineers and act as an operational requirements consultant to other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of: do, document, automate
Willingness to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
Willingness to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
Requirements
5+ years of proven experience in an SRE or equivalent role, in globally scaled, performance-intensive environments operating a 24/7 support model
3+ years of experience running, deploying and optimising orchestration platforms, with a strong emphasis on Kubernetes
Expert-level Linux administration, especially Ubuntu distributions
Proficiency in system tuning, disk I/O optimisation, and hardware-level performance tweaks
Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
Strong experience with API interrogation
Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
Deep understanding of observability principles and tools (Prometheus, Grafana preferred)
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers
Bonus: Knowledge of running AI workloads via orchestration platforms
Bonus Requirements
Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience
LPIC Certifications
ITIL Foundation level qualification or equivalent experience
Certified Kubernetes Administrator (CKA)
Qualities we look for:
You approach problems with a systems mindset - balancing practical execution with long-term scalability
You elevate the team, setting high standards for technical quality and engineering excellence.
You hold yourself and others accountable - giving direct feedback and expecting the same
You take initiative, owning challenges end-to-end and proactively driving solutions.
You invest in others, mentoring to build both capability and confidence.
Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.
Here are just some of the great things you can expect from us:
Remote work, flexible hours: we offer a fully remote work schedule, with flexible working hours and trust in your productivity. We stay in sync with your team's general locations and time zones to foster effective and seamless collaboration.
30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
Gympass subscription to a variety of gyms and wellbeing apps
Participation in the company shares program
Enhanced parental pay & leave
Diversity, Equity, Inclusion and Belonging
We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.
- Department: Engineering
- Locations: Gloucestershire
- Remote status: Hybrid
- Employment type: Full-time