Key Responsibilities
- Design and execute large-scale performance and reliability tests to verify system stability.
- Monitor application and infrastructure performance, identifying bottlenecks and areas for optimization.
- Collaborate with cloud and infrastructure teams to develop and implement disaster recovery strategies.
- Analyze system behavior under failure conditions to improve resilience and reliability.
- Communicate with clients for requirements gathering, status updates, and detailed reporting.
- Drive performance optimization and reliability improvements across distributed systems.
Requirements
- Proficiency in chaos engineering and resilience testing for distributed systems.
- Familiarity with Azure Chaos Studio architecture and implementation.
- Strong knowledge of performance testing and performance engineering methodologies.
- Ability to configure and monitor networks and to plan cloud disaster recovery.
- Excellent communication, analytical, and problem-solving skills with a quality-focused mindset.