We are looking for a Data Engineer to build an AI-powered data-mapping recommendation platform that accelerates the integration and validation of complex datasets. The system will automate data extraction, mapping, and validation workflows that currently demand extensive manual effort because of inconsistencies in source data, reliance on domain-specific code mappings, and heuristic-based validation.
Responsibilities
- Build and maintain scalable data pipelines with Databricks, Spark, and PySpark.
- Manage data governance, security, and credentials using Unity Catalog and Secret Scopes.
- Develop and deploy ML models with MLflow; work with LLMs and embedding-based vector search.
- Apply ML/DL techniques (classification, regression, clustering, transformers) and evaluate using industry metrics.
- Design data models and warehouses leveraging dbt, Delta Lake, and Medallion architecture.
- Work with healthcare data standards and medical terminology mapping.
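The embedding-based retrieval work mentioned above can be sketched with a minimal cosine-similarity search over toy vectors. This is a pure-Python stand-in for a real vector database or the Databricks Vector Search service; the field names and embedding values are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Return the k index entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Toy "embedding index": source-field name -> hypothetical embedding.
index = {
    "patient_dob":   [0.9, 0.1, 0.0],
    "date_of_birth": [0.8, 0.2, 0.1],
    "zip_code":      [0.0, 0.1, 0.9],
}

# A query embedding for an unmapped field; the nearest neighbours
# become candidate mapping recommendations.
matches = top_k([0.85, 0.15, 0.05], index)
```

In the actual platform the embeddings would come from an LLM or embedding model and the index would live in a managed vector store, but the ranking logic is the same.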
Mandatory Requirements
- Databricks Expertise - Candidates must demonstrate strong hands-on experience with the Databricks platform, including:
- Unity Catalog: Managing data governance, access control, and auditing across workspaces.
- Secret Scopes: Secure handling of credentials and sensitive configurations.
- Apache Spark / PySpark: Writing performant, scalable distributed data pipelines.
- MLflow: Managing ML lifecycle including experiment tracking, model registry, and deployment.
- Vector Search: Working with vector databases or search APIs to build embedding-based retrieval systems.
- LLMs (Large Language Models): Familiarity with using or fine-tuning LLMs in Databricks or similar environments.
- Data Engineering Skills - Experience designing and maintaining robust data pipelines, including:
- Data Modeling & Warehousing: Dimensional modeling, star/snowflake schemas, SCD (Slowly Changing Dimensions).
- Modern Data Stack: Familiarity with dbt, Delta Lake, and the Medallion architecture (Bronze, Silver, Gold layers).
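As a minimal illustration of the SCD requirement above, a Type 2 update (close the current row, append a new versioned row) can be sketched in plain Python. In practice this would be a Delta Lake `MERGE` or a dbt snapshot; the column names here are illustrative:

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_attrs, as_of):
    """Type 2 slowly-changing-dimension upsert: expire the current row
    for `key` (if its attributes changed) and append a new current row."""
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dim_rows  # no change, keep history as-is
            row["is_current"] = False
            row["valid_to"] = as_of
    dim_rows.append({
        "key": key,
        "attrs": new_attrs,
        "valid_from": as_of,
        "valid_to": None,
        "is_current": True,
    })
    return dim_rows

# One customer row, then an attribute change that triggers versioning.
dim = [{"key": "C1", "attrs": {"city": "Boston"},
        "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, "C1", {"city": "Austin"}, date(2024, 6, 1))
```

The same close-and-append pattern maps directly onto a Silver-to-Gold step in a Medallion pipeline.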
Optional Requirements
- Machine Learning Knowledge (Nice to Have) - A strong foundation in machine learning is a plus, including:
- Traditional Machine Learning Techniques: Classification, regression, clustering, etc.
- Model Evaluation & Metrics: Precision, recall, F1-score, ROC-AUC, etc.
- Deep Learning (DL): Understanding of neural networks and relevant frameworks.
- Transformers & Attention Mechanisms: Knowledge of modern NLP architectures and their applications.
- Preferred Domain Knowledge (Nice to Have)
- Experience with healthcare data standards and medical code systems such as eCQM, VSAC, RxNorm, LOINC, and SNOMED.
- Understanding of medical terminology and how to map or normalize disparate coding systems.
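The terminology-normalization work described above can be sketched as a canonicalize-then-lookup step against a code crosswalk. This is a simplified illustration: the crosswalk entries below are placeholders, not real LOINC or RxNorm codes, and a production mapper would layer fuzzy or embedding-based matching on top of the exact lookup:

```python
def normalize_term(term):
    """Canonicalize a raw source term before lookup:
    lowercase, trim, and collapse internal whitespace."""
    return " ".join(term.lower().strip().split())

# Hypothetical crosswalk from local lab names to a standard code system;
# the codes below are placeholders, not real LOINC entries.
CROSSWALK = {
    "hemoglobin a1c": "LOINC:XXXX-0",
    "serum glucose":  "LOINC:YYYY-1",
}

def map_term(raw_term, crosswalk=CROSSWALK):
    """Return (code, matched) for a raw source term, or (None, False)."""
    code = crosswalk.get(normalize_term(raw_term))
    return (code, code is not None)

code, ok = map_term("  Hemoglobin  A1C ")   # messy source spelling
miss, found = map_term("unknown test")      # unmapped term
```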
Tech Stack
- Platforms & Tools: Databricks, Unity Catalog, Secret Scopes, MLflow
- Languages & Frameworks: Python, PySpark, Apache Spark
- Machine Learning & AI: Traditional ML techniques, Deep Learning, Transformers, Attention Mechanisms, LLMs
- Search & Retrieval: Vector databases, embedding-based vector search
- Data Engineering & Modeling: dbt, Delta Lake, Medallion architecture (Bronze/Silver/Gold), Dimensional modeling, Star/Snowflake schemas
- Domain (Optional): Healthcare data standards (eCQM, VSAC, RxNorm, LOINC, SNOMED)
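As a quick illustration of the evaluation metrics listed under the ML requirements, precision, recall, and F1 can be computed directly from prediction pairs. This is a minimal sketch for a single positive class; in practice scikit-learn or MLflow's evaluation tooling would be used:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels vs. predictions: 2 true positives, 1 false positive, 1 false negative.
p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```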