Site Reliability Engineer

Posted Aug 19

CIQ OVERVIEW

CIQ believes in helping people do great things. We do this by building strong communities for open-source software, innovating software infrastructure, and building the next generation of performance computing. Our software stack consists of Rocky Linux, the CentOS replacement; Apptainer, the container solution of choice for HPC; Warewulf, a provisioning and cluster management solution; and Fuzzball, our next-generation performance computing platform that is multi-cloud, multi-site, multi-cluster, and multi-node.

If you are interested in an environment built on ownership, diversity of thought, and pushing the limits of what is possible, then we would be interested in you.

POSITION SUMMARY

We are looking for a dedicated Site Reliability Engineer. You will be responsible for ensuring the optimal performance, reliability, and efficiency of our mission-critical systems and applications. You will work closely with our Customer Support, Software Engineering, and IT teams to design, implement, and maintain scalable, reliable, and efficient systems that meet our business needs and customer expectations. Additional responsibilities include but are not limited to:

  • Collaborating with cross-functional teams to identify, design, and implement infrastructure and software solutions that improve the reliability, scalability, and performance of our systems and applications, and leveraging DORA metrics to assess how we are performing against SLA and uptime requirements.
  • Monitoring, analyzing, and troubleshooting system performance and reliability issues, and providing proactive solutions to prevent future occurrences.
  • Developing and maintaining documentation on system architecture, processes, and best practices for site reliability.
  • Implementing and maintaining automated monitoring and alerting tools to ensure timely detection and resolution of system issues.
  • Participating in an on-call rotation to provide 24/7 support for critical systems and applications.
  • Developing and maintaining infrastructure-as-code and configuration management solutions to automate and standardize system provisioning, configuration, and deployment.
  • Conducting capacity planning and performance analysis to ensure systems can scale to meet demand.
  • Developing and implementing disaster recovery and business continuity plans to ensure system availability and data protection.
  • Continuously researching and staying up to date on industry trends, emerging technologies, and best practices to improve site reliability processes and tools.
  • Contributing to a culture of collaboration, continuous improvement, and knowledge sharing.

Job requirements

NEEDED TO SUCCEED

Successful candidates will have:

  • A strong understanding of Linux/Unix systems, networking, and cloud computing platforms (e.g., AWS, GCP, Azure, OCI).
  • Proficiency in scripting and programming languages (e.g., Bash, Python, Go, Node, Java, or similar).
  • Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack, or similar).
  • Familiarity with CI/CD tools and SDLC practices.
  • Strong analytical and problem-solving skills, with the ability to troubleshoot complex system issues.
  • Excellent communication and collaboration skills, with the ability to work effectively within a team and across departments.
  • A friendly, collaborative, humble, and honest attitude, always striving to be better.

EDUCATION AND EXPERIENCE

At least 5 years of experience as a Site Reliability Engineer, or similar experience as a DevOps or Software Engineer. Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Puppet, or Chef). Experience with containerization technologies and orchestration platforms (e.g., Docker, Kubernetes), and with service meshes in microservice architectures.