Job Description
Location
Remote (India)
Experience
15+ years in data engineering, 5+ years in data architecture for ML/AI
Engagement
Full-Time, Permanent
ABOUT THE ROLE
The TDD's recommended architecture implements a Hybrid Graph ML + Graph RAG knowledge representation layer, the most architecturally complex layer in the system. It requires a senior data architect who will design the unified context layer combining GNN-produced features with vector-retrieval-augmented context, architect the golden dataset schema with versioning and pool separation, design the data migration strategy for integrating with the ~10B-record knowledge graph, and validate Graph RAG retrieval quality. The 2.00 man-month allocation reflects a focused, high-impact engagement during the foundational phases.
Project Context: You will architect the data layer for the Tradecraft Evaluation Platform, designing the Hybrid Graph ML + Graph RAG knowledge representation system that captures both structural patterns (entity embeddings, anomaly scores) and procedural tradecraft (reasoning chains, confidence rubrics) from a 10-billion-record knowledge graph. You will define the golden dataset schema, data pool separation strategy, and data migration approach that all downstream evaluation work depends on.
KEY RESPONSIBILITIES
A. STANDARD RESPONSIBILITIES:
- Design data architectures that balance performance, scalability, and maintainability for complex analytical workloads
- Define data modeling standards, schema versioning strategies, and data quality frameworks
- Evaluate and select data storage technologies based on workload characteristics and access patterns
- Review and approve data pipeline designs produced by other engineers
B. PROJECT-SPECIFIC RESPONSIBILITIES:
- Architect the Hybrid Graph ML + Graph RAG data layer, defining how GNN-produced entity embeddings (from PyTorch Geometric) and vector-retrieval results (from Weaviate) converge in the unified context layer
- Design the golden dataset schema in PostgreSQL with versioning, pool separation (eval/training/holdout using physically separate schemas), and full lineage tracking from source artifact through extraction to validation
- Design the data migration strategy for integrating with the knowledge graph (~10B records, ~700 sources) via read-only API access, defining subgraph extraction patterns for Chinese corporate network investigation scenarios
- Architect the vector database (Weaviate) schema for tradecraft corpus indexing, defining chunking strategies, embedding models, metadata schemas, and retrieval/re-ranking pipelines
- Define the Graph ML data pipeline for lightweight GNN training in Phase 4, specifying feature engineering for node/edge attributes from corporate registration and trade data
- Validate Graph RAG retrieval quality by designing retrieval precision/recall benchmarks against expert-curated test queries
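For candidates wondering what the retrieval benchmark in the last responsibility might look like in practice, here is a minimal sketch, assuming each expert-curated test query comes with a set of relevant chunk IDs and the retriever returns a ranked list of chunk IDs. The function names, the `benchmark`/`run` dictionaries, and the choice of precision@k and recall@k as metrics are illustrative assumptions, not part of the project specification.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(
    retrieved: List[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k and recall@k for one query (hypothetical helper)."""
    # Deduplicate while preserving rank order, then keep the top k results.
    top_k = list(dict.fromkeys(retrieved))[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(
    benchmark: Dict[str, Set[str]], run: Dict[str, List[str]], k: int
) -> Tuple[float, float]:
    """Mean precision@k and recall@k over all benchmark queries.

    benchmark: query ID -> expert-curated set of relevant chunk IDs (assumed format)
    run:       query ID -> ranked chunk IDs returned by the retriever (assumed format)
    """
    if not benchmark:
        return 0.0, 0.0
    precisions, recalls = [], []
    for query_id, relevant in benchmark.items():
        # A query with no retriever output counts as an empty result list.
        p, r = precision_recall_at_k(run.get(query_id, []), relevant, k)
        precisions.append(p)
        recalls.append(r)
    n = len(benchmark)
    return sum(precisions) / n, sum(recalls) / n


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the project.
    benchmark = {"q1": {"a", "b"}, "q2": {"c"}}
    run = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
    mean_p, mean_r = evaluate(benchmark, run, k=3)
    print(f"precision@3={mean_p:.2f} recall@3={mean_r:.2f}")
```

A real benchmark would likely also track ranking-aware metrics and break results down by query type, but the averaging structure would be similar.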
REQUIRED SKILLS & EXPERIENCE
- [STANDARD] 15+ years of experience in data engineering/architecture with at least 5 years designing data platforms for ML/AI workloads
- [STANDARD] Expert-level proficiency in PostgreSQL, including advanced features (partitioning, triggers, materialized views, row-level security)
- [PROJECT-SPECIFIC] Hands-on experience designing and deploying vector database systems (Weaviate, Pinecone, Qdrant, or Milvus) for RAG pipelines at production scale
- [PROJECT-SPECIFIC] Experience with graph data models and graph databases, including query optimization for large-scale knowledge graphs
- [PROJECT-SPECIFIC] Experience designing data architectures with physical data separation for compliance (separate schemas, separate storage buckets, infrastructure-level access controls)
- [STANDARD] Expert-level Python proficiency for data pipeline development
- [STANDARD] Experience with cloud-managed data services (AWS RDS, S3, or equivalents on Azure/GCP)
Experience Requirements
- YEARS OF EXPERIENCE: 15+ years in data engineering, 5+ years in data architecture for ML/AI
- SENIORITY LEVEL: Staff / Principal
- TYPICAL BACKGROUND: Senior data architect at an AI platform company; principal data engineer at a government analytics firm; data platform lead at a knowledge graph company; chief data architect at a risk/compliance technology company
- COMPLEXITY INDICATORS: Has designed data architectures integrating 3+ heterogeneous data stores (relational, graph, vector); has worked with datasets at billion-record scale; has designed data separation architectures for compliance
- LEADERSHIP / OWNERSHIP EXPECTATIONS: Owns all data architecture decisions; reviews and approves data pipeline designs from the IC3 Data Engineer; presents the data architecture to the client's Principal Architect (James) and SVP Engineering (Phillip)
- SUCCESS INDICATORS:
  - Has designed and deployed a production vector database for RAG with measurable retrieval precision >70%
  - Has architected a data platform integrating graph database features with vector retrieval for LLM augmentation
  - Has designed golden dataset or evaluation dataset management systems with versioning and lineage
  - Has implemented physical data separation for compliance in a regulated environment
- RED FLAGS:
  - Vector database experience limited to tutorials or proof-of-concepts; no production deployment
  - No experience with graph data models; treats all data as relational tables
  - Cannot articulate chunking strategy trade-offs for RAG systems (semantic vs. fixed-size vs. structural)
Project-Specific Skills and Domain Knowledge
Must-Have:
- Experience designing vector database schemas for RAG systems, including chunking strategy selection (semantic vs. structural), embedding model selection, and metadata-filtered retrieval
- Experience with PyTorch Geometric or DGL for graph neural network feature engineering and data pipeline design
- Experience designing data versioning systems for ML evaluation datasets with full lineage tracking
- Experience with TimescaleDB or equivalent time-series extensions for metrics and cost tracking data
PREFERRED QUALIFICATIONS
- Experience architecting data layers for government or FedRAMP-compatible systems
- Experience with append-only data stores for immutable audit logging (Amazon QLDB or equivalent)
- AWS Data Analytics Specialty or equivalent certification
- Experience with Weaviate specifically (self-hosted on Kubernetes)
- Prior work with trade data (bills of lading) or corporate registration data schemas
Project-Specific Skills and Domain Knowledge
Strongly Preferred:
- Experience with knowledge graph systems at billion-record scale
- Experience with entity resolution data models (strong/weak identifier classification)
- Familiarity with HELM benchmark data formats and evaluation dataset structures
Tradecraft Experience: A Significant Plus
Candidates with backgrounds in intelligence analysis, signals intelligence, law enforcement data fusion, or related tradecraft disciplines are strongly encouraged to apply. An understanding of link analysis, entity disambiguation under adversarial conditions, the handling of classified or compartmentalised data, and mission-driven product constraints will set you apart.
Job Classification
Industry: Internet
Functional Area / Department: Data Science & Analytics
Role Category: Data Science & Machine Learning
Role: Data Engineer
Employment Type: Full time
Contact Details:
Company: Tanisha Systems
Location(s): Bengaluru
Keyskills:
Vector Database
Chunking & Retrieval Pipelines
Graph RAG
Artificial Intelligence
Graph Databases
Semantic/Hybrid
Machine Learning