Senior Site Reliability Engineer
Silver Triangle Building, United States
Credit Acceptance is proud to be an award-winning company with local and national workplace recognition in multiple categories! Our world-class culture is shaped by dedicated Team Members who share a drive to succeed as professionals and together as a company. A great product, amazing people and our stable financial history have made us one of the largest used car finance companies nationally.
Our Engineering and Analytics Team Members utilize the latest technology to develop, monitor, and maintain complex practices that help optimize our success. Our Team Members value being challenged, are encouraged to express their ideas, and have the flexibility to enjoy work-life balance. We build intrinsic value by partnering with all functions of our business to support their success and make strategic business decisions. We focus on professional development and continuous improvement while enjoying a casual work environment and Great Place to Work culture!
We are seeking a talented and experienced Senior Site Reliability Engineer to join our dynamic and innovative team. As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the reliability, availability, and performance of our software systems. You will collaborate with cross-functional teams to design, implement, and maintain robust systems, monitoring tools, and processes. The ideal candidate will have a strong background in software development and system architecture, and a passion for creating reliable and scalable software solutions.
Outcomes and Activities:
- This position will work from home; occasional planned travel to an assigned Southfield, Michigan office location may be required. However, this position is permitted to work at a Southfield, Michigan office location if requested by the team member.
- System Architecture and Design:
- Collaborate with software engineers, architects, and operations teams to design highly reliable and scalable systems.
- Evaluate existing systems and propose improvements to enhance reliability, performance, and availability.
- Drive modernization initiatives, including implementing OpenTelemetry collectors and transitioning to structured logging for improved observability and cost efficiency.
- Implementation and Coding:
- Develop and implement code to automate operational processes and tasks to improve system reliability and performance.
- Create self-service tools, such as observability dashboards and automated incident analysis solutions, enabling teams to detect and resolve issues faster.
- Build and maintain scripts, pipelines, and tools for monitoring, logging, and alerting, aligned with Golden Path initiatives.
- Monitoring and Incident Response:
- Implement and manage monitoring solutions to proactively identify and address reliability issues.
- Participate in on-call rotations and respond promptly to incidents to minimize downtime and improve Mean Time to Restore (MTTR).
- Define and implement standardized logging schemas for improved debugging efficiency and cost optimization.
- Lead efforts to adopt OpenTelemetry (OTel) for distributed tracing, metrics, and logs, enabling better observability and scalability.
- Performance Analysis and Optimization:
- Conduct performance analysis to identify bottlenecks and optimize system performance.
- Partner with development teams to address performance issues in the codebase and ensure systems are resilient under load.
- Capacity Planning:
- Collaborate with capacity planning teams to ensure systems can handle anticipated growth and demand.
- Proactively identify capacity-related challenges and propose solutions.
- Documentation and Knowledge Sharing:
- Maintain comprehensive documentation for system configurations, processes, and procedures to ensure operational transparency.
- Contribute to knowledge sharing within the SRE team and across departments by creating best practice guides and conducting training sessions.
Competencies: The following items detail how you will be successful in this role.
- Development: Develops solutions using the standards and best practices of the application's language. Writes code that implements the design and is testable, extensible, efficient, and maintainable.
- Impact Analysis: Understands the rationale behind changes and how they impact the enterprise, its applications, and the wider technical ecosystem.
- Solution Design: Ability to translate high-level requirements into designs that meet the needs of the customer and are technically sound, maintainable, and cost-effective. Ability to identify missing or ambiguous requirements. Ability to design at both high and low levels of abstraction, understand complex requirements, and translate them into understandable solutions. Ability to estimate accurately based on requirements.
- Technical Domain: Understands the technical domain, including the architecture, design, and data of the application they support and the systems with which it interfaces.
- Facilitation Techniques: Organizes, supports, and/or conducts workshops, meetings, and presentations tailored to the objectives of each, the problem to be solved, and the needs of the audience.
Requirements:
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Site Reliability Engineer or similar role.
- Proficient in Java, Spring Boot, distributed systems, and modern observability practices (e.g., OpenTelemetry, Prometheus), with strong cross-functional collaboration and knowledge-sharing skills.
- In-depth knowledge of system architecture, distributed systems, and networking.
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with continuous integration and continuous deployment (CI/CD) practices.
- Excellent troubleshooting and problem-solving skills.
- Strong communication and collaboration skills.
- Certification in relevant areas (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator) is a plus.
- Expertise in designing and implementing resilience patterns for distributed systems and microservices architectures, such as Circuit Breakers and Retries. Proficient in applying modern resiliency frameworks to address diverse failure scenarios.
- Ability to identify and address gaps in observability, scalability, and fault tolerance prior to deployment, ensuring systems meet reliability and performance standards throughout the SDLC.
- Ability to develop efficient, testable, and maintainable Java solutions using industry best practices to enhance reliability and automate operational tasks.
- Ability to design resilient, scalable, and cost-effective systems while evaluating the broader impact of changes on the technical ecosystem.
Target Compensation: A competitive base salary ranging from $117,963 to $173,012. This position is eligible for an annual variable cash bonus of between 7.5% and 15%. Final compensation within the range is influenced by many factors, including role-specific skills, depth and level of experience, industry background, and relevant education and certifications.
Candidates who reside in the following major metropolitan areas may be eligible for a premium on top of the posted range based on their specific zone: San Francisco, Seattle, Boston, New York City, Los Angeles and San Diego.
Benefits
- Excellent benefits package that includes 401(k) match, adoption assistance, parental leave, tuition reimbursement, comprehensive medical/dental/vision, and many nonstandard benefits that make us a Great Place to Work.
Our Company Values:
To be successful in this role, Team Members need to be:
- Positive by maintaining resiliency and focusing on solutions
- Respectful by collaborating and actively listening
- Insightful by cultivating innovation, accumulating business and role specific knowledge, demonstrating self-awareness and making quality decisions
- Direct by effectively communicating and conveying courage
- Earnest by taking accountability, applying feedback and effectively planning and priority setting
Expectations:
- Remain compliant with our policies, processes, and legal guidelines
- All other duties as assigned
- Attendance as required by department
Advice!
We understand that your career search may look different than others. Our hiring team wants to make sure that this would be a fit not just for us, but for you long term. If you are actively looking or starting to explore new opportunities, send us your application!
P.S.
We have great details around our stats, success, history and more. We’re proud of our culture and are happy to share why – let’s talk!
Required degrees must have been earned at institutions of Higher Education which are accredited by the Council for Higher Education Accreditation or equivalent.
Credit Acceptance is dedicated to providing a safe and inclusive working environment for all. As part of our Culture of Compliance, we are proud to be an Equal Opportunity Employer and value our culturally diverse workforce. All qualified applicants will receive consideration for employment regardless of the person’s age, race, color, religion, sex, gender, sexual orientation, gender identity, national origin, veteran or disability status, criminal history, or any other legally protected characteristic.
California Residents: Please click here for the California Consumer Privacy Act (CCPA) notice regarding the personal information Credit Acceptance may collect from you.
Job Profile
- Occasional travel to Southfield, Michigan office required
- Position can work from Southfield office if requested
- Work from Home
Benefits/Perks
Adoption Assistance, Casual work environment, Comprehensive medical, Dental, Excellent benefits, Excellent benefits package, Nonstandard benefits, Parental leave, Professional development, Tuition reimbursement, Vision, Work From Home, Work-life balance
Tasks
- Capacity Planning
- Collaborate with teams
- Conduct performance analysis
- Develop automation tools
- Ensure system reliability
- Implement monitoring solutions
- Improve system reliability
- Training
Alerting, Analytics, Architecture, Automation, AWS, Azure, Best Practices, Capacity planning, CI/CD, Cloud, Coding, Collaboration, Communication, Compliance, Continuous Deployment, Continuous Improvement, Cross-functional Collaboration, Debugging, Design, DevOps, Distributed Systems, Documentation, Efficiency, Finance, GCP, Impact Analysis, Incident Response, Innovation, Kubernetes, Microservices, Monitoring, Monitoring tools, Observability, Open telemetry, Performance analysis, Problem-solving, SDLC, Site Reliability Engineering, Software Development, System architecture, Training, Troubleshooting
Education
Bachelor's, Business, Communication, Computer Science, Equivalent, Finance, Master's, Related Field
Timezones
America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9