Sr Software Engineer – Infrastructure, Telemetry and Site Reliability Engineer - Remote

Irvine, CA, United States

Providence

USD 91K+ Full Time Senior

Published 1 month ago

About the Role

We are seeking a skilled Sr Software Engineer – Infrastructure, Telemetry and Site Reliability Engineer (SRE) to join our dynamic platform team. The ideal candidate will be responsible for ensuring the reliability, availability, and performance of our systems while leveraging telemetry data to enhance monitoring and observability. This role is critical in maintaining our high service standards and continuously improving our infrastructure.

Key Responsibilities

Lead the design, development, and implementation of monitoring, logging, and alerting solutions to ensure system reliability and performance.
Utilize telemetry data to identify and troubleshoot issues, optimize system performance, and enhance overall observability.
Collaborate with development and operations teams to ensure seamless integration of monitoring and alerting tools.
Write and maintain scripts for infrastructure management and automation (e.g., Python, PowerShell, Bash).
Automate repetitive tasks to improve efficiency and reduce manual intervention.
Automate deployment pipelines using CI/CD tools such as Jenkins, GitHub Actions, or Azure DevOps.
Participate in on-call rotations and incident response, providing timely resolution to system outages and performance issues.
Develop and maintain documentation for system architecture, processes, and procedures related to telemetry and site reliability.
Design and implement cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform, AWS CloudFormation, or Azure Resource Manager.
Collaborate with cross-functional teams to design and implement scalable and resilient infrastructure solutions.
Conduct root cause analysis of incidents and implement corrective actions to prevent recurrence.
Drive the adoption of best practices in site reliability engineering and telemetry within the organization.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience in software engineering with a focus on site reliability engineering, DevOps, IaC, cloud infrastructure, or a related field.
Strong knowledge of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, Splunk, New Relic).
Proficiency in programming and scripting languages (e.g., Python, Go, Bash).
Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
Strong understanding of Linux/Unix systems and networking concepts.
Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
Experience with configuration management and automation tools (e.g., Terraform, Ansible, Puppet, Chef).
Strong communication and collaboration skills, with the ability to work effectively in a team-oriented …