Staff SRE

Posted Jul 15

Virtasant is a leading cloud consulting services provider. We focus heavily on lift & shift, cloud-native development, cloud cost optimization, and migration services. As a consulting company, we often face the challenge of assembling an engineering team within a week or two. To do that, we have created a secondary support business that runs a talent network and provides staffing services.

We are seeking a highly skilled and experienced Staff Site Reliability Engineer (SRE) to join our dynamic team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our critical systems and services. As a Staff SRE, you will play a pivotal role in shaping infrastructure for our client and driving initiatives that improve the overall service quality.

Key Responsibilities:

System Design and Architecture:
Design, build, and maintain scalable and reliable infrastructure.
Collaborate with engineering teams to ensure systems are designed with reliability and scalability in mind.
Evaluate and integrate new technologies to enhance our infrastructure.
Monitoring and Incident Management:
Implement and maintain monitoring and alerting systems to detect and respond to issues promptly.
Lead incident response efforts, ensuring quick resolution and effective communication.
Conduct post-incident reviews and drive improvements based on findings.
Automation and Optimization - Reduce SRE Toil:
Architect and build innovative automation projects (preferably in Python or Go) from scratch to help reduce day-to-day SRE toil.
Create Bash scripts to automate mundane manual activities such as upgrades, status checks, and deployments.
Develop and maintain infrastructure as code (IaC) using tools such as Terraform, Ansible, or similar.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Collaboration and Mentorship:
Collaborate with cross-functional teams to deliver high-quality products and services.
Mentor and guide junior SREs and other team members.
Advocate for best practices in reliability engineering across the organization.
Continuous Improvement:
Drive initiatives to improve service reliability, capacity, and performance.
Participate in capacity planning and disaster recovery exercises.
Stay current with industry trends and emerging technologies.

Qualifications:

Education and Experience:

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Minimum of 8 years of industry experience as a Software Engineer, SRE, or Platform Engineer.
Minimum of 3 years of experience as a Platform Engineer or SRE.

Proven experience in managing large-scale, mission-critical infrastructure.
Technical Skills:

Deep understanding of Linux/Unix systems and networking.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Intermediate to expert-level skill in Bash scripting.
Experience with cloud platforms (AWS, Azure, GCP) and with containers and container orchestration (Docker, Kubernetes).
Strong knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Soft Skills:

Excellent problem-solving skills and a proactive attitude.
Strong communication and collaboration skills.
Ability to work independently and as part of a team.
Demonstrated leadership and mentoring abilities.

Candidates must be able to work 8am to 5pm Pacific Time and be open to joining an on-call rotation.

Recruitment process

Recruiter screen (30 min)
Technical interview (45 min)
Hiring manager interview (30 min)

We strive to move efficiently from step to step so the recruitment process can be as fast as possible.

What we offer

Fully remote, 40 hours/week.
Long-term contract.
Payment in USD.
PTO.
Training and certification opportunities on AWS, GCP, and/or Azure.
