What You’ll Do
● Help build a Site Reliability Engineering culture by sharing your best practices, approaches, documentation, and code with other engineering teams
● Apply automation and software to any tasks or parts of the system that would benefit from it or are performed manually
● Troubleshoot complex OS, networking, and database issues in cloud-based SaaS and on-premises environments; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices
● Monitor application performance, take steps to improve overall performance and stability, and follow through with implementation
● Conduct system analysis and configuration management, and develop improvements to system software performance, availability, and reliability
● Work closely with software engineers and QA engineers to ensure the system meets non-functional requirements such as performance, security, and availability
● Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available to those who need it
● Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure
● Keep up to date with security developments and proactively identify, diagnose, and solve complex security issues
● Be part of an on-call rotation that supports the global platform and provides an excellent customer experience

What We’re Looking For
● Degree in Computer Science or equivalent combination of education and experience
● 7+ years of experience in a DevOps or SRE role
● 5+ years working with AWS, with AWS certification(s)
● Experience using infrastructure-as-code principles to build and maintain cloud infrastructure with Terraform or CloudFormation
● Experience with Docker and Kubernetes
● Experience in shell scripting languages
● Experience working with database and data store technologies such as RDS/MySQL, ElastiCache/Redis, or equivalent
● Knowledge of core server-side concepts and experience working with cloud networking, load balancers, HTTP or gRPC protocols, and large-scale microservice environments
● Experience with observability stacks, including instrumenting environments for logging and monitoring, and designing and building dashboards and alerts
● Knowledge of DevOps methodologies and the tools involved, such as CI/CD concepts, CI/CD tools (Jenkins, CodePipeline, etc.), and automation
● Ability to self-govern workload and show discipline around priority and time management, even while working remotely or in the absence of direct management for an extended period of time
● Ability and willingness to adapt to new application stacks and new technology concepts as the business evolves over time
● Excellent communication skills, both verbal and written
● Ability to present/lead technical discussions

What Will Set You Apart
● Experience managing high-scale web application platforms or SaaS platforms
● Strong Kubernetes, EKS or ECS/Fargate experience
● Understanding of security principles
● Experience with NoSQL databases such as Iceberg, HBase/Hive, or ScyllaDB
● Experience working with stream processing and big data technology stacks such as Spark, Kafka, Iceberg, and Trino
● Experience with AWS networking concepts such as VPCs, VPC peering, Transit Gateway
● Experience with multi-geography, multi-tenant applications
● Experience designing and performing disaster recovery
● Experience with cost management

Yelo

Senior Site Reliability Engineer
