Sr Software Engineer – Infrastructure, Telemetry and Site Reliability Engineer - *Remote*
Irvine, CA, United States
About the Role
We are seeking a skilled Sr Software Engineer – Infrastructure Telemetry and Site Reliability Engineer (SRE) to join our dynamic platform team. The ideal candidate will be responsible for ensuring the reliability, availability, and performance of our systems while leveraging telemetry data to enhance monitoring and observability. This role is critical in maintaining our high service standards and continuously improving our infrastructure.
Key Responsibilities
- Lead the design, development, and implementation of monitoring, logging, and alerting solutions to ensure system reliability and performance.
- Utilize telemetry data to identify and troubleshoot issues, optimize system performance, and enhance overall observability.
- Collaborate with development and operations teams to ensure seamless integration of monitoring and alerting tools.
- Write and maintain scripts for infrastructure management and automation (e.g., Python, PowerShell, Bash).
- Automate repetitive tasks to improve efficiency and reduce manual intervention.
- Automate deployment pipelines using CI/CD tools such as Jenkins, GitHub Actions, or Azure DevOps.
- Participate in on-call rotations and incident response, providing timely resolution to system outages and performance issues.
- Develop and maintain documentation for system architecture, processes, and procedures related to telemetry and site reliability.
- Design and implement cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform, AWS CloudFormation, or Azure Resource Manager.
- Collaborate with cross-functional teams to design and implement scalable and resilient infrastructure solutions.
- Conduct root cause analysis of incidents and implement corrective actions to prevent recurrence.
- Drive the adoption of best practices in site reliability engineering and telemetry within the organization.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 5+ years of experience in software engineering with a focus on site reliability engineering, DevOps, IaC, cloud infrastructure, or a related field.
- Strong knowledge of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, Splunk, New Relic).
- Proficiency in programming and scripting languages (e.g., Python, Go, Bash).
- Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
- Strong understanding of Linux/Unix systems and networking concepts.
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Experience with configuration management and automation tools (e.g., Terraform, Ansible, Puppet, Chef).
- Strong communication and collaboration skills, with the ability to work effectively in a team-oriented …
Job Profile
California, Montana, Oregon, Texas, Washington
Benefits/Perks
- Best-in-class benefits
- Collaboration
- Comprehensive benefits package
- Financial security
- Health care benefits
- Inclusive workplace
- Paid parental leave
- Well-being resources
Tasks
- Conduct root cause analysis
- Design cloud infrastructure
- Develop documentation
- Maintain documentation
- Participate in on-call rotations
Experience: 5 years
Education: Analytics, Bachelor's degree, Computer Science, Data Analytics, Design, Engineering, Equivalent experience, Insurance, Related Field
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9