Site Reliability Engineer
Remote
We are looking for a Site Reliability Engineer (SRE) to support our DMS application running on EKS, PostgreSQL, and other AWS services, and to configure and administer Linux and Windows servers and other open source technologies. You will perform day-to-day operational monitoring of over 3,500 systems, from performance metrics to alerts on critical infrastructure. You will be responsible for system management and for writing scripts or programs that automate maintenance and management tasks. Other responsibilities include system configuration, troubleshooting, security, and supporting multiple teams, from customer support to development and QA, while increasing the productivity of the team. (The LightspeedDMS LLC job title is Engineering Systems Administrator.)
Responsibilities:
- Implement systems that are highly available, scalable, and self-healing.
- Work closely with Application Dev. & Operations teams to provide fully automated deployment routines for Production (CI/CD).
- Monitor system activity and tune system parameters for optimal performance, configure communications with other platforms/networks, configure and manage system security, and maintain current release levels and patch revisions.
- Work across functional (development, testing, deployment, systems/infrastructure) and project teams to ensure continuous operation of all environments.
- Manage and maintain tools to automate operational processes.
- Work to continuously improve speed, efficiency and scalability of our systems and environments.
- Work directly with agile Application Development teams to provide daily support aligned with a model of Continuous Delivery.
- Build and maintain appropriate log gathering, system monitoring, and reporting infrastructures.
- Operate in a supporting role for the implementation and customer support groups, along with QA and Dev. This requires 24/7 availability at times to load software updates and projects, and to work during maintenance windows and in emergencies.
Minimum Qualifications:
- 4+ years in a Cloud/SRE/DevOps/System Administrator role or equivalent experience.
- 4+ years of experience with containerization/orchestration technologies such as Kubernetes, Docker, AWS EKS, AWS NLB/ALB, GCP GKE, etc. Must be able to configure and support Docker container deployments.
- 4+ years of experience with cloud concepts such as VPC, subnets, IAM, security groups, and S3, or equivalent experience.
- 6+ years of Linux and Windows administrator experience.
- Scripting ability (Bash/Shell, Python, JavaScript).
- Must have an understanding of building and managing large-scale systems and application architectures.
- Solid understanding of system performance and monitoring.
- Excellent project management skills and the ability to work in a fast-paced work environment.
- Must also have experience working in an Agile development environment that requires a lot of communication and collaboration.
- Demonstrate skills in priority setting, analysis, communication, time management, scheduling, and multitasking.
- Experience with config/provisioning tools like Terraform, CloudFormation, Cloud Init, or Salt/Chef/Puppet/Ansible in production environments with many nodes.
- Work well in a highly collaborative team environment.
- Familiar with Infrastructure as Code methodologies and tools.
Preferred Qualifications:
- Experience managing Enterprise production systems in at least one public cloud: AWS, GCP, Azure; AWS is preferred.
- Experience managing Kubernetes in a large Enterprise production environment.
- Communication Skills: The candidate will have exceptional communication skills (verbal, written, and presentation) as well as excellent interpersonal skills including the ability to work and communicate with individuals at all levels in the organization, and matrixed team members.
- Good working knowledge of build automation and continuous integration/delivery processes and tools: GitLab, Jenkins.
- Experience with messaging technologies such as AmazonMQ, RabbitMQ, Kafka, etc.
- Experience with monitoring solutions: Zabbix, Nagios, CloudWatch, Alertmanager, Prometheus, Grafana, Dynatrace, New Relic, or equivalent.
- Experience with various data technologies including relational and nonrelational databases.
- PostgreSQL database knowledge preferred.
- Experience supporting Enterprise WildFly/JBoss application servers and Java.
- Experience with VMware/vSphere virtualization or the private/public cloud is a plus.
- Experience with incident management and root cause analysis as part of postmortems.
- Familiar with the AWS Well-Architected Framework and its six pillars.
- Experience with a DevOps approach of managing infrastructure.
Lightspeed is committed to fair and equitable compensation practices. Compensation packages are based on several factors, including but not limited to skills, experience, certifications, and work location.
The total compensation package for this position may also include an annual performance bonus, benefits, and/or other applicable incentive compensation plans.
EEO Statement:
At Lightspeed, we believe inclusion and diversity are essential in inspiring meaningful connections to our people, customers, and communities. We are open, curious and encourage different views, so that everyone can be their best selves and make an impact. Lightspeed’s culture values and celebrates the uniqueness of individuals and the different perspectives they provide.
Lightspeed is an Equal Opportunity Employer committed to creating an inclusive workforce where everyone is valued. Qualified applicants will receive consideration for employment without regard to race, color, creed, ancestry, national origin, gender, sexual orientation, gender identity, gender expression, marital status, creed or religion, age, disability (including pregnancy), results of genetic testing, service in the military, veteran status or any other category protected by law.
Salary Description: $110,000 - $130,000