Lead Site Reliability Engineer
San Diego, California, United States; Remote, United States
Guild Mortgage Company has been closing loans and opening doors since 1960. As a mortgage banking firm, we are dedicated to serving the homeowner/buyer. Our goal is to provide affordable home financing for our customers, utilizing the best terms available while providing a level of professionalism and service unsurpassed in the lending industry.
Position Summary
The Lead Site Reliability Engineer is responsible for driving the organizational reliability strategy and conducting resiliency design reviews to ensure the reliability, scalability, and performance of our company's software systems and applications meet organizational service level objectives (SLOs) and error budgets. The role is responsible for leading a team of Site Reliability Engineers in designing, implementing, and maintaining the infrastructure and tools necessary to support our platforms, as well as improving our monitoring, automation, and deployment processes. This role involves strategic planning, technical leadership, and collaboration with various stakeholders including Guild’s Product Delivery, Data Services, DevOps, DataOps, Governance, and Infrastructure teams to support organizational goals.
Essential Functions
- Lead, mentor, and develop a team of Site Reliability engineers, fostering a collaborative and innovative work environment.
- Oversee an SRE team and drive the reliability strategy for the organization.
- Conduct resiliency design reviews and lead complex problem-solving efforts.
- Design, implement, and maintain monitoring systems to track the performance, availability, and reliability of services.
- Respond to incidents promptly, investigate root causes, and coordinate efforts to mitigate and resolve them.
- Analyze performance data, and plan for scalability and capacity requirements.
- Identify and optimize performance bottlenecks, both at the infrastructure and application levels.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Implement and enforce change management practices to ensure safe and controlled changes to the production environment.
- Design and implement fault-tolerant systems and practices to minimize downtime and ensure service availability.
- Collaborate with the GRC team to develop and maintain disaster recovery plans and procedures for the supported software, minimizing the impact of catastrophic failures.
- Work with the Incident Management and other teams to conduct a thorough analysis of incidents, document postmortem reports, and implement improvements based on lessons learned.
- Work closely with development, operations, and other teams to foster a culture of reliability, and provide feedback on system design and architecture for improved reliability.
- Perform other duties as assigned.
Qualifications
- Bachelor's degree directly related to the position, or equivalent, preferred. Bachelor's degree, 8+ years of demonstrated work experience, or an equivalent combination of related training and experience required, with at least three of those years spent in a leadership-level role.
- Minimum of eight years demonstrated work experience.
- Minimum three years supervisory or leadership experience. Proven leadership experience and ability to manage a team, required.
- Ability to create DR strategies and execute DR drills.
- Ability to collaborate with stakeholders to define RPO/RTO targets for Guild’s system footprint.
- Expert in Cloud-based redundancy, high availability, and reliability strategies.
- Expert in reliability, scalability, and performance optimization.
- Expert in Linux/Unix and Windows systems administration, including provisioning, configuring, monitoring, and troubleshooting web servers in a 24x7 customer-facing environment.
- Strong Linux and Windows Administration & scripting.
- Solid database administration skills (MySQL, MariaDB, RDS, SQL Server, and Azure Storage services).
- Deep knowledge of current methodologies in high performance operations and scalable multi-site implementations.
- Proven experience with large-scale software implementation (high transaction volume, high-availability concepts).
- Deep knowledge of software deployment, versioning (Git), and release management processes.
- Deep knowledge of infrastructure design, implementation, and support.
- Proficient at automated provisioning, automated configuration management, and containerization solutions and tools.
- Experienced in cloud-based hosting solutions and server environments (AWS, Azure, GCP).
- Experienced in Agile software development best practices utilizing Continuous Integration & Delivery Pipelines as well as agile tools such as Jira.
- Excellent written and verbal communication skills.
- Proficient in communicating to both technical and management levels.
- Ability to interact with external customers and staff members.
- Highly adaptable.
- Ability to work in a fast-paced, constantly expanding environment.
- Highly organized and detail-oriented; ability to work in a fast-paced, metrics-driven environment required.
- Proficiency in Microsoft Office Suite (Word, Excel), wikis, collaborative cloud-based programs, and third-party software applications required.
- Commitment to company values:
  - Customer Service - Proactive attention to each person
  - Integrity - Do and say what's right
  - Respect - Treat others with dignity
  - Collaboration - Listen and work together
  - Learning - Seek knowledge and strive for improvement
  - Excellence - Deliver the unexpected
Supervision
- Job Scope: Plays a key role in area by generating insights and ideas on policies, processes, procedures, and efficiency; contributes ideas to strategic and operational plans to ensure alignment.
- Complexity: Problems encountered are often complex and may involve significant resource coordination and availability, evaluating and resolving discrepancies with data, analyses, processes, etc. using own expertise and judgment.
- Impact: Decisions and actions have an impact on the smooth operation and timeframes of the department, programs/projects; impact on the broader organization is generally indirect.
- Interaction/Supervision: Acts as a mentor/guide to less experienced professional contributor staff in a similar role; works independently and only under general direction; guided by professional standards, desired outcomes, and project plan specifications.
Requirements
Physical: Work is primarily sedentary; mobility in an office setting.
Manual Dexterity: Ability to operate standard office equipment and keyboards.
Audio/Visual: Regularly required to accurately perceive, distinguish and interpret information received visually and through audio, e.g., words, numbers and other data broadcasted aloud/viewed on a screen, as well as print and other media.
Environmental: Office environment – moderate noise, no substantial exposure to adverse environmental conditions.
Mental: Learn new tasks, remember processes, maintain focus, complete tasks independently, and make timely decisions in the context of a workflow.
Schedules: Work is primarily performed during the business week, Monday - Friday; occasional night or weekend work may be necessary.
Guild offers a pleasant work environment, competitive compensation, and an excellent benefits package, including medical, dental, vision, life insurance, AD&D, LTD, and 401(k) with employer match.
Guild Mortgage Company is an Equal Opportunity Employer.
Targeted Salary Range: $127,000 - $173,000 annually
Compensation at Guild is influenced by a wide array of factors including but not limited to local and federal minimum wage requirements, education, level of experience, and applicant’s geographical location.
REQ#: LEADS016959