Site Reliability Engineer – DevOps Requisition ID 6010

  WORK FROM HOME

Job title: Site Reliability Engineer – DevOps Requisition ID 6010

Company: Qualys

Job description: We are seeking a highly motivated and talented Lead Site Reliability Engineer to work on Qualys’ DevOps Toolchains. Working with a team of engineers and architects, you will combine software development and systems engineering skills to build and run scalable, distributed and fault-tolerant systems.

The ideal candidate will write software to optimize day-to-day work through better automation, monitoring, alerting, testing and deployment.

Responsibilities

  • Communicate effectively with the DevOps managers on release milestones, sprints and roadmap activities
  • Co-develop and participate in the full development lifecycle of DevOps tool chains, from inception and design through deployment, operation and improvement, by applying scientific principles.
  • Increase the effectiveness, reliability and performance of DevOps tool chains by identifying and measuring key indicators, making changes to the production systems in an automated way and evaluating the results.
  • Support the DevOps team before technologies are pushed for production release through activities such as system design, capacity planning, automation of key deployments, building a strategy for production monitoring and alerting, and participating in the testing/verification process.
  • Ensure that the DevOps tool chains are maintained properly by measuring and monitoring availability, latency, performance and system health.
  • Advise the DevOps team on improving the reliability of the systems in production and scaling them based on need.
  • Participate in the development process by supporting new features, services and releases, and maintain an ownership mindset for the DevOps tool chains
  • Develop tools and automate the process for achieving large scale provisioning and deployment of cloud platform technologies
  • Participate in the on-call rotation for DevOps tool chains. During incidents, lead the incident response and contribute to detailed, blameless postmortem reports that are brutally honest.
  • Propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting and root cause analysis

Requirements

  • 1+ years of relevant experience in running distributed systems at scale in production.
  • Expertise in one of the programming languages: Java, Python or Go.
  • Proficient in writing bash scripts
  • Good understanding of SQL and NoSQL systems
  • Good understanding of systems programming (network stack, file system, OS services)
  • Understanding of network elements such as firewalls, load balancers, DNS, NAT, TLS/SSL, VLANs, etc.
  • Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and determining the root cause of incidents.
  • Knowledge of JVM concepts like garbage collection, heap, stack, profiling, class loading, etc.
  • Knowledge of best practices related to security, performance, high-availability, and disaster recovery.
  • Proven record of handling production issues, planning escalation procedures, and conducting post-mortems, impact analyses, risk assessments and other related procedures.
  • Able to drive results and set priorities independently
  • BS/MS degree in Computer Science, Applied Math or related field

EEO Employer/Vet/Disabled

Expected salary:

Location: Pune, Maharashtra

Job date: Sat, 25 Jun 2022 02:11:44 GMT
