Role Summary
We're looking for a Developer Experience Engineer to transform the development experience at Clara by building a self-service platform that automates and simplifies infrastructure, deployments, and operations. You'll be responsible for eliminating friction in the development lifecycle so that engineering teams can focus on delivering business value.
Key Responsibilities
Developer Experience & Self-Service Platform
- Design and maintain a developer portal (Backstage.io or similar) as the central hub for resource management
- Build abstractions and APIs that enable developers to provision resources without manual intervention
- Implement self-service workflows for environment creation, configuration, and permissions
- Create reusable templates and blueprints for services, repositories, and pipelines
CI/CD & Automation
- Design, implement, and optimize highly automated CI/CD pipelines
- Reduce build and deployment times through intelligent caching, parallelization, and optimizations
- Implement GitOps and continuous deployment with automated rollback capabilities
- Automate testing (unit, integration, e2e) in pipelines with clear reporting
- Implement advanced deployment strategies (blue-green, canary, feature flags)
Ephemeral Environments
- Design and implement ephemeral/preview environment solutions for each PR/branch
- Automate the complete lifecycle: creation, configuration, and cleanup
- Optimize costs through auto-scaling, scheduling, and garbage collection of unused resources
- Integrate ephemeral environments with code review and testing workflows
Observability & Alerting
- Implement intelligent alerting systems with noise reduction and event correlation
- Configure dashboards and SLI/SLO metrics for critical services
- Establish automated runbooks and auto-remediation for common incidents
- Integrate observability (logs, metrics, traces) into the developer portal
Infrastructure as Code & Security
- Maintain and evolve infrastructure as code (Terraform, CloudFormation, etc.)
- Implement automated security controls (policy as code, security scanning)
- Manage secrets, configurations, and access securely and with full auditability
- Apply least privilege and zero-trust principles across all systems
AI/ML Ops Integration
- Explore and implement AI tools for resource optimization and failure prediction
- Automate operational tasks using ML (anomaly detection, capacity planning, incident classification)
- Evaluate and adopt emerging AIOps tools
Technical Requirements
Must Have
- 5+ years of experience in DevOps/SRE/Platform Engineering
- Deep expertise with at least one major cloud provider (preferably AWS)
- Solid experience with Kubernetes and microservices architectures
- Expertise in CI/CD tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
- Proficiency in Infrastructure as Code (Terraform, Pulumi, CloudFormation)
- Experience with containers (Docker, Kubernetes, ECS/EKS)
- Advanced scripting skills (Python, Bash, Go)
- Knowledge of observability tools (Prometheus, Grafana, ELK, Datadog, New Relic)
Nice to Have
- Experience with Backstage.io or similar developer portal platforms
- FinOps knowledge and cloud cost optimization
- Experience in FinTech organizations or highly regulated environments
- Familiarity with AIOps platforms and ML-based monitoring
- Cloud certifications (AWS Solutions Architect, CKA, etc.)
- Experience with service mesh (Istio, Linkerd)
- Compliance and security knowledge (PCI DSS, SOC 2)
Key Skills
- Automation obsession: If something is done twice, it should be automated
- Product mindset: Treat the internal platform as a product with "customers" (developers)
- Ability to abstract complexity: Make the complex simple for end users
- Effective communication: Document clearly, create runbooks, and educate teams
- Problem solving: Systems thinking to solve problems at their root
- Continuous improvement mindset: Constantly seek ways to optimize and simplify
What You'll Build
- Self-Service Portal: Unified interface where developers can:
  - Create new services from templates
  - Provision ephemeral environments in seconds
  - Configure alerts and dashboards with clicks
  - Request access and permissions with automated approvals
- Intelligent Pipelines: CI/CD that:
  - Detects changes and runs only relevant tests
  - Auto-deploys to production with quality gates
  - Rolls back automatically on failures
  - Provides instant feedback to developers
- On-Demand Environments: System that:
  - Creates complete environments per PR in <5 minutes
  - Provides a sanitized copy of production data
  - Exposes unique URLs for testing and demos
  - Cleans up automatically when the PR is closed
- Proactive Observability: Platform that:
  - Alerts only when action is required
  - Automatically suggests root causes
  - Auto-remediates known issues
  - Offers customized dashboards per team/service