Director of Site Reliability Engineering
Propelus is modernizing how professionals, their employers, regulators, and associations work better together. For over 20 years, Propelus solutions — CE Broker, EverCheck, and Immuware — have propelled the progress of millions of dedicated professionals in their career journey. Our market-leading workforce compliance management technology, full-lifecycle continuing education software, and vital data simplify total professional management for a happier workforce, better operations, and safer communities.
The primary responsibility of the Director of SRE is to lead the SRE functions for all product lines at Propelus. This individual will have very strong technical expertise in technologies such as AWS, Nginx, New Relic, and VMware, and strong expertise in supporting and monitoring database technologies such as Oracle, MySQL, MongoDB, and Aurora Postgres. This position requires solid experience running production operations that support business-critical applications. The Director will lead the SRE team and implement best-in-class SRE practices that improve the reliability of applications.
This person needs to be highly focused and a strong leader who can help execute on the vision. Site Reliability Engineering (SRE) is responsible for the big picture: determining how our systems relate to each other, using a wide array of tools, and building auto-healing solutions to improve reliability for customers. Practices such as limiting time spent on operations and proactively identifying potential outages factor into the iterative improvement that is key to both product quality and dynamic day-to-day work. SRE’s culture of diversity, intellectual curiosity, problem solving, and openness unlocks its success.
Responsibilities
- Lead the SRE functions for all product lines
- Own the observability platform and best practices so that future teams can track and support the health of their apps
- Lead the SRE team and implement best-in-class SRE practices that improve the reliability of applications
- Drive synthetic and real user data monitoring all the way from the edge to backend services
- Reduce the number of alerts by detecting failures and automating the recovery process to improve efficiency
- Set clear goals and performance expectations for the team members, providing regular feedback and conducting performance evaluations
- Manage and mentor a team of site reliability engineers, fostering a collaborative and high-performing environment
- Develop SRE team members into senior levels and leaders within the team
- Support a team on an on-call rotation that responds to incidents impacting availability, and drive service restoration within SLAs
- Provide governance over HA and DR capability management and reporting, and conduct quarterly DR exercises
- Proactively review daily health checks, monitoring, and reporting, and take timely action
- Set error budgets, SLIs, SLOs, and SLAs
- Publish daily, weekly and monthly incident reports with corrective actions across all environments
- Lead the release management function and establish strong operational readiness across teams
Qualifications
- Bachelor’s degree in computer science or a related field
- Deep understanding of New Relic
- 5+ years of SRE experience
- 5+ years of experience managing an SRE team
- Deep understanding of creating incident response plans with automation
- Passion for toil reduction
- 8+ years of IT experience across networking, applications, and infrastructure
- Knowledge of network architecture and internet security best practices
- vSphere/VMware experience is a must
- Ability to perform analytics on previous incidents and usage patterns to better predict issues and take proactive action
- Ability to lead and participate in performance tests to identify bottlenecks, opportunities for optimization, and capacity demands for product launches
- Experience working with cloud service platforms (AWS/Azure) and knowledge of best practices and methods for resolving issues in those settings
- 5+ years of experience managing a team with on-call support
- Infrastructure as code using Pulumi is a plus
Benefits and Perks for Propelus employees include but are not limited to:
- Named one of BuiltIn's 2023 Best Places to Work, and a Best Place to Work by Outside Magazine 7 years running!
- Professional development allowance to help you grow in the ways that mean the most to you.
- Flexibility for balancing work with the rest of life and ample PTO, including paid time off for volunteering and for becoming a new parent.
- 401(k) with company matching, as well as financial planning education and resources.
- Employees choose from HSA, FSA and traditional insurance options for medical, dental, and vision coverage for themselves and dependents.
- Wellness benefits - we’ll help you pay for fitness endeavors and organic produce delivery services.
- Check us out for yourself at our careers page or our Propelus culture Instagram accounts.
We are an equal opportunity employer and value diversity at Propelus. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Candidates from all backgrounds are encouraged to apply.
This position is scheduled for 40 hours per week, Monday through Friday, unless projects require otherwise. This job is open to candidates who are authorized to work in the US and located within US borders.