Site Reliability Engineer

Posted Aug 20

Job Description

Arista Networks is looking for Site Reliability Engineers to play an active, high-impact role in the early rollout of both internal and customer-facing services: making key architecture decisions, and designing and implementing best practices that advance the Software Defined Networking revolution in the cloud. The Site Reliability Engineering (SRE) role combines software and systems engineering to build and run high-performance, massively distributed, robust systems. The role is key to optimizing our system capacity and performance at all times.

SRE roles at Arista are generally in one of two areas:

  • Internal Tools: Designing and operating our internal systems, including CI/CD pipelines, source repositories, and other internal tools.
  • External SaaS: Playing an active, high-impact role in a cloud-based public SaaS that spans all Arista teams.

Both roles have the freedom to push the envelope on quality and availability while designing, choosing, and building their own best practices and tools to make that happen.

Responsibilities

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.

Qualifications

  • Bachelor's degree in Computer Science, a related technical field involving software/systems engineering, or equivalent practical experience.
  • Experience programming in Go and Python.
  • Experience operating a cloud-based SaaS.
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Experience with Jenkins, Docker, and Kubernetes (K8s).
  • Ability to debug and optimize code and to automate routine tasks.
  • Understanding of Unix/Linux operating systems.

Additional Information

All your information will be kept confidential according to EEO guidelines.