About the role

As a Staff Software Engineer at Lattice, you will design AI evaluation frameworks and architecture, lead technical projects, and improve AI quality and reliability across the platform.

Responsibilities

Architect and scale the infrastructure that powers AI quality, reliability, and reuse across Lattice.
Design and scale an end-to-end AI evaluation framework spanning offline evals, production tracing, and human feedback loops.
Define meaningful performance metrics (task completion, hallucination, response quality, engagement, business impact) and build the datasets and automated scoring systems that prevent regressions.
Identify and quantify the drivers of agent quality improvement and set methodological standards for evaluation across the organization.
Architect reusable agent infrastructure (multi-turn workflows, LLM DAGs, recommendation systems, standardized topologies) using LangGraph or comparable frameworks.
Build and scale RAG pipelines, vector retrieval systems, and production-grade AI infrastructure with strong reliability, observability, and performance.
Make principled build-vs-buy decisions across LLM providers, agent frameworks, and evaluation tooling, balancing capability, cost, latency, and risk.
Engineer AI systems as reusable internal platforms that multiply product engineering velocity at Lattice.
Own projects end-to-end: scope, design, execution, and delivery.
Set technical direction for agent quality and evaluation strategy across Lattice engineering teams.
Lead rigorous discussions on AI system design and evaluation methodology.
Raise the AI engineering bar through mentorship, code review, and clear technical communication across engineering and leadership.

Qualifications

8+ years of professional experience writing and maintaining production-level code, with 5+ years in designing, delivering, and operating AI/ML systems in production.
Deep production experience with LLM systems (prompting, RAG, agent orchestration, evaluation frameworks, fine-tuning).
Experience building and operating agentic systems (multi-step workflows, multi-agent topologies) and managing their failure modes.
Strong command of AI evaluation methodology and statistical experimentation.
Strong system design judgment across scalability, latency, accuracy, reliability, and cost.
Production-grade Python (clean, maintainable, testable systems).
Experience with LangGraph and LLM observability/evaluation tooling (e.g., LangSmith).
Vector databases and retrieval system design (Pinecone or similar).
Experience operating AI systems in AWS or comparable cloud environments, including CI/CD, monitoring, and deployment workflows.
Familiarity with TypeScript is a plus.
Actively engaged in AI research and industry trends.
Experience with RLHF, LoRA, or other model adaptation techniques is nice to have.
Background in traditional ML and judgment in selecting ML vs. LLM approaches is nice to have.
Experience with MLOps tooling (MLflow, Datadog) is nice to have.
Published research, talks, or open-source contributions in AI/ML is nice to have.
Experience in HR tech or other trust-sensitive domains is nice to have.