Key Responsibilities
- Design and optimize large-scale data processing pipelines using Apache Spark and Hadoop
- Develop highly available distributed systems that handle petabyte-scale datasets
- Collaborate with cross-functional teams to define data architecture and infrastructure requirements
- Implement data quality frameworks and monitoring for production systems
- Mentor junior engineers and lead technical discussions on big data best practices
- Optimize query performance and resource utilization in cloud-based data environments
Requirements
- 7+ years of experience with big data technologies, including Spark, Hadoop, and Kafka
- Strong proficiency in Python and SQL for data processing and analysis
- Experience with cloud platforms (AWS/GCP) and containerization technologies
- Proven track record of building scalable data pipelines and ETL processes
- Familiarity with data modeling, schema design, and performance tuning