OKX

AI Application Architect

Department
Engineering
Job Type / Location
Onsite
Experience Required
8+ years

About The Team

The SRE team integrates large language models (LLMs), AI Agents, and engineering platform capabilities to build an intelligent application system for R&D, operations, stability, and business scenarios. By creating an AI application architecture that is observable, evaluable, governable, and continuously evolving, the team is driving the company's shift from "tool-assisted" work to "intelligent collaboration," improving R&D efficiency, system stability, fault diagnosis efficiency, and the quality of business decisions.

What You’ll Be Doing

  • Design and build AI Harness capabilities for SRE / DevOps scenarios, including fault detection, change analysis, capacity risk identification, automated inspection, drill evaluation, and recovery recommendations.
  • Drive the development of an automated RCA (Root Cause Analysis) system, combining logs, metrics, distributed tracing, events, changes, topology, and other data to achieve root cause analysis, impact scope assessment, and post-incident review support.
  • Build AIOps platform capabilities, including intelligent alert noise reduction, anomaly detection, event correlation, trend prediction, fault attribution, and automated closed-loop remediation.
  • Collaborate with R&D, SRE, platform, data, and business teams to embed AI capabilities into Code Review, CI/CD, GitOps, DevOps, incident response, and stability governance processes.

What We Look For In You

  • Bachelor's degree or above in Computer Science or a related field, with 8+ years of experience in R&D, architecture, or platform engineering; experience building AI applications, SRE, AIOps, or DevOps platforms is preferred.
  • Strong software architecture skills and familiarity with microservices architecture, distributed systems, high-availability design, service governance, observability, and platform engineering.
  • Familiar with LLM application development, with an understanding of core technologies such as LLMs, RAG, embeddings, vector databases, Agents, Function Calling / Tool Calling, and Prompt Engineering. Understands the production challenges of AI applications, including hallucination control, result evaluation, permission boundaries, data security, cost control, observability, and failure fallback mechanisms.
  • Experience delivering AI Agent or intelligent assistant products, with the ability to design complex task decomposition, multi-tool invocation, multi-turn reasoning, context management, and human-machine collaboration workflows.
  • Familiar with RCA or AIOps capability development, including log analysis, metric anomaly detection, distributed tracing, event correlation, alert noise reduction, topology analysis, and root cause localization.
  • Proficient in at least one mainstream development language, such as Java, Python, Go, or TypeScript, with strong engineering implementation and system design skills.
  • Familiar with cloud-native technology stacks and common middleware, such as Kubernetes, Docker, Kafka, Redis, MySQL, Elasticsearch, Prometheus, Grafana, and OpenTelemetry.
  • Strong analytical skills for complex problems and holistic architectural thinking, with the ability to drive problem-solving from business, platform, process, and organizational collaboration perspectives.
  • Ability to communicate in both Chinese and English is preferred, as the role requires collaborating with cross-region stakeholders.
