Engineering | Remote (US) | Full-time

AI Engineer (LLMs / RAG over customer data)

About the role

The core of Merrily is reading messy, real-world customer signal (Slack threads, email, meeting notes, support tickets, contracts) and turning it into structured, trustworthy judgments about the health of an account. As an AI engineer you will own the LLM systems that do this reading: extraction, summarization, retrieval over a customer's own history, and the scoring logic that ties it all together.

This is applied AI where the bar is reliability, not novelty. A health score that is confidently wrong is worse than no score at all, so much of the craft is in evaluation, grounding, and knowing when the model should abstain.

What you'll do

Design and ship LLM pipelines that read customer conversations and product events and extract the signals that move a health score.
Build retrieval-augmented generation over each account's own history, so the model reasons from real evidence and can cite it.
Own the evaluation harness: golden sets, offline scoring, and regression tracking, so we know when a prompt or model change actually helps.
Tune prompts, context assembly, and model selection against cost, latency, and accuracy.
Build guardrails for grounding and abstention, so the system says "not enough signal" instead of guessing.
Work closely with the customer success team to encode what a healthy versus at-risk account actually looks like.

What we're looking for

3+ years in software or ML engineering, including real production experience with LLMs.
Hands-on experience with prompt engineering, RAG architectures, and vector search.
A rigorous approach to evaluation: you measure quality rather than eyeballing outputs.
Strong general software engineering skills (we work in TypeScript and Python) and comfort owning a feature in production.
Sound judgment about where LLMs help, where they hurt, and how to keep them grounded.

Bonus points

You have built information-extraction or classification systems over unstructured text.
You have worked with pgvector or another vector store in a production retrieval setting.
You have shipped applied AI in a domain where being wrong has real consequences.
You have built evaluation tooling or LLM-as-judge pipelines that teams actually trusted.

Why Merrily

Most "AI for customer success" is a chatbot bolted onto a CRM. We are building the opposite: a system that quietly reads everything and tells a team where to look, with the evidence to back it up. If grounded, well-evaluated applied AI is the work you want to do, come build it with us.

Apply

Tell us a little about yourself. We read every application.