KMC Solutions Inc

XTN-DAD3686 SITE RELIABILITY ENGINEER

Department
Engineering
Job Type / Location
Remote
Experience Required
3+ years

You will validate and test GPU clusters prior to production release, ensuring hardware integrity, system reliability, and optimal performance. The role involves provisioning clusters, executing performance benchmarks, maintaining automated validation frameworks, and troubleshooting Linux-based systems in high-performance compute environments. You will collaborate closely with engineering and operations teams to ensure seamless handovers and production readiness.

• Health Insurance/HMO

• Enjoy unlimited MadMax Coffee

• Diverse learning & growth opportunities

• Accessible Cloud HR platform (Sprout)

• Above standard leaves

Cluster Validation & Testing

Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release

Perform functional and reliability testing of GPUs, servers, and associated components

Verify network connectivity and performance, including InfiniBand where applicable

Orchestration & Benchmarking

Provision and configure GPU clusters using automated workflows

Execute and analyse performance and stability benchmarks orchestrated via Slurm

Validate results against expected performance and reliability thresholds

Test Framework & Automation

Maintain and extend the automated validation framework built using Python and Ansible

Integrate new test cases to support additional hardware platforms and GPU generations

Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

Diagnose and remediate unhealthy nodes through configuration changes or software fixes

Coordinate with on-site support and Smart Hands teams for hardware replacements when required

Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

Produce clear, accurate documentation of test results, hardware states, and remediation actions

Ensure smooth handovers to operations and engineering teams

Maintain up-to-date runbooks and validation procedures

Essential

• Strong hands-on experience administering and troubleshooting Linux systems (Prio)

• Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services

• Excellent written and verbal English communication skills

• High standards for system reliability, consistency, and documentation

Preferred / Desirable

• Experience working with GPU-based or high-performance compute environments

• Familiarity with Slurm or other workload schedulers

• Understanding of datacenter hardware lifecycle and server validation processes

• Exposure to InfiniBand or high-speed networking technologies

• Experience working with distributed or
