We’re looking for a Senior AI Infrastructure Engineer to help design, build, and scale our AI and data infrastructure. In this role, you’ll focus on architecting and maintaining cloud-based MLOps pipelines to enable scalable, reliable, and production-grade AI/ML workflows, working closely with AI engineers, data engineers, and platform teams. Your expertise in building and operating modern cloud-native infrastructure will help enable world-class AI capabilities across the organization.
If you are passionate about building robust AI infrastructure, enabling rapid experimentation, and supporting production-scale AI workloads, we’d love to talk to you.
Responsibilities
- Design, implement, and maintain cloud-native infrastructure to support AI and data workloads, with a focus on AI and data platforms such as Databricks and AWS Bedrock.
- Build and manage scalable data pipelines to ingest, transform, and serve data for ML and analytics.
- Develop infrastructure-as-code using tools like AWS CloudFormation and AWS CDK to ensure repeatable and secure deployments.
- Collaborate with AI engineers, data engineers, and platform teams to improve the performance, reliability, and cost-efficiency of AI models in production.
- Drive best practices for observability, including monitoring, alerting, and logging for AI platforms.
- Contribute to the design and evolution of our AI platform to support new ML frameworks, workflows, and data types.
- Stay current with new tools and technologies to recommend improvements to architecture and operations.
- Integrate AI models and large language models (LLMs) into production systems using architectures such as retrieval-augmented generation (RAG).
Mandatory Requirements
- 7+ years of professional experience in software engineering and infrastructure engineering.
- Extensive experience building and maintaining AI/ML infrastructure in production, including model deployment and lifecycle management.
- Strong knowledge of AWS and infrastructure-as-code frameworks, ideally with CDK.
- Expert-level coding skills in TypeScript and Python, with experience building robust APIs and backend services.
- Production-level experience with Databricks MLflow, including model registration, versioning, asset bundles, and model serving workflows.
- Expert-level understanding of containerization (Docker) and hands-on experience with CI/CD pipelines; experience with orchestration tools (e.g., ECS) is a plus.
- Proven ability to design reliable, secure, and scalable infrastructure for both real-time and batch ML workloads.
- Ability to articulate ideas clearly, present findings persuasively, and build rapport with clients and team members.
- Strong collaboration skills and the ability to partner effectively with cross-functional teams.
- Familiarity with emerging LLM frameworks such as DSPy for advanced prompt orchestration and programmatic LLM pipelines.
- Understanding of LLM cost monitoring, latency optimization, and usage analytics in production environments.
- Knowledge of vector databases / embeddings stores (e.g., OpenSearch) to support semantic search and RAG.