Job title: Site Reliability Engineer – DevOps Requisition ID 6010
Company: Qualys
Job description: We are seeking a highly motivated and talented Lead Site Reliability Engineer to work on Qualys’ DevOps Toolchains. Working with a team of engineers and architects, you will combine software development and systems engineering skills to build and run scalable, distributed and fault-tolerant systems.
The ideal candidate will write software to optimize day-to-day work through better automation, monitoring, alerting, testing and deployment.
Responsibilities
- Communicate effectively with the DevOps managers on release milestones, sprints and roadmap activities
- Co-develop and participate in the full development lifecycle of DevOps tool chains, from inception and design through deployment, operation and improvement, by applying scientific principles.
- Increase the effectiveness, reliability and performance of DevOps tool chains by identifying and measuring key indicators, making changes to the production systems in an automated way and evaluating the results.
- Support the DevOps team before technologies are pushed for production release through activities such as system design, capacity planning, automation of key deployments, building a strategy for production monitoring and alerting, and participating in the testing/verification process.
- Ensure that the DevOps tool chains are maintained properly by measuring and monitoring availability, latency, performance and system health.
- Advise the DevOps team on improving the reliability of the systems in production and scaling them based on need.
- Participate in the development process by supporting new features, services and releases, and hold an ownership mindset for the DevOps tool chains
- Develop tools and automate the process for achieving large scale provisioning and deployment of cloud platform technologies
- Participate in the on-call rotation for DevOps tool chains. During incidents, lead incident response and contribute to detailed postmortem analysis reports that are brutally honest and blameless.
- Propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting and root cause analysis
Requirements
- 1+ years of relevant experience in running distributed systems at scale in production.
- Expertise in at least one of the following programming languages: Java, Python or Go.
- Proficient in writing bash scripts
- Good understanding of SQL and NoSQL systems
- Good understanding of systems programming (network stack, file system, OS services)
- Understanding of network elements such as firewalls, load balancers, DNS, NAT, TLS/SSL, VLANs, etc.
- Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and determining the root cause of incidents.
- Knowledge of JVM concepts like garbage collection, heap, stack, profiling, class loading, etc.
- Knowledge of best practices related to security, performance, high-availability, and disaster recovery.
- Proven record of handling production issues, planning escalation procedures, and conducting post-mortems, impact analyses, risk assessments and other related procedures.
- Able to drive results and set priorities independently
- BS/MS degree in Computer Science, Applied Math or related field
EEO Employer/Vet/Disabled
Expected salary:
Location: Pune, Maharashtra
Job date: Sat, 25 Jun 2022 02:11:44 GMT