Building an AI Assistant as an Apollo Federation Subgraph

Most AI assistants live at /api/chat — outside your architecture, outside your graph, outside your auth model.

In this system, the AI is a first-class Apollo Federation subgraph, sitting alongside users, movies, reviews, and search. The frontend doesn’t know there’s an AI service. It calls sendMessage the same as any other mutation.

The platform is a movie catalog with five subgraphs behind an Apollo Router gateway:

graph TD
    FE["BROWSER
    React + Apollo Client · :5173"]

    FE -->|"GraphQL HTTP"| GW["GATEWAY · :4000
    Apollo Router
    Query planning · Federation v2"]

    GW -->|"Federation Subgraphs"| U & M & R & S & AI

    U["users · :4001
    Auth · JWT · bcrypt
    User · AuthPayload
    SQLite"]
    M["movies · :4002
    Catalog · CRUD
    Movie · Genre
    SQLite"]
    R["reviews · :4003
    Ratings · reviews
    Review · Rating
    SQLite"]
    S["search · :4004
    FTS5 · trending
    SearchResult
    SQLite"]
    AI["ai · :4005
    LangChain · Groq
    Conversation · Message
    SQLite"]

The gateway composes all five into a single supergraph. From its point of view, the AI subgraph is just another service.


Why Federation Instead of a Standalone Endpoint?

You could wire up a /api/chat REST endpoint and call it from the frontend. It works. But folding the AI into the federation gives you three things without writing extra code.

Shared auth context. The gateway forwards the JWT to every subgraph. The AI service gets ctx.userId and ctx.token from the gateway context automatically, and passes those tokens straight through when the agent calls other subgraphs on the user’s behalf.

Schema composability. Conversations and messages become real types in the supergraph. Other services could reference Conversation by ID via federation’s entity resolution if needed. The frontend queries chat history the same way it queries movies.

One URL. The frontend talks to a single GraphQL endpoint. From its side, sendMessage is just another mutation in the supergraph.
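On the composability point, a hypothetical illustration: if another subgraph ever needed to attach data to a conversation, it could reference the entity by ID without resolving any of its fields. This stub does not exist in the project; it is just the Federation v2 shape such a reference would take:

```graphql
# Hypothetical stub in another subgraph: reference Conversation by key
# only. resolvable: false tells the router this subgraph cannot
# resolve the entity itself, it can merely point at it.
type Conversation @key(fields: "id", resolvable: false) {
  id: ID!
}
```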


Data Flow

A regular query (e.g. fetching the movie list) goes through the gateway and straight to one subgraph:

sequenceDiagram
    autonumber
    participant FE as Browser
    participant GW as Apollo Router :4000
    participant MV as Movies :4002

    FE->>GW: query { movies { id title } }
    GW->>MV: forward to movies subgraph
    MV-->>GW: movie list
    GW-->>FE: response

An AI message takes a different path. The gateway hands off to the AI subgraph, the agent decides which tools to call, and those tools talk directly to the target subgraphs — bypassing the gateway entirely (direct calls avoid an extra hop when the agent chains 3–5 tools per message):

sequenceDiagram
    autonumber
    participant FE as Browser
    participant GW as Apollo Router :4000
    participant AI as AI :4005
    participant LLM as Groq LLM
    participant MV as Movies :4002
    participant RV as Reviews :4003

    FE->>GW: mutation sendMessage(content: "Tell me about Dune")
    GW->>AI: forward to ai subgraph
    AI->>AI: check rate limit
    AI->>AI: load conversation history
    AI->>LLM: run agent (system prompt + history + message)
    LLM-->>AI: call tool: get_movie_details(movieId)
    par direct subgraph calls
        AI->>MV: query movie details
        AI->>RV: query reviews for movie
    end
    MV-->>AI: movie data
    RV-->>AI: review data
    AI->>LLM: tool result -> continue
    LLM-->>AI: final assistant reply
    AI->>AI: persist message to SQLite
    AI-->>GW: SendMessagePayload
    GW-->>FE: response

The Schema

type Conversation @key(fields: "id") {
  id: ID!
  title: String!
  createdAt: String!
  updatedAt: String!
  messages: [Message!]!
}

type Message {
  id: ID!
  role: MessageRole!
  content: String!
  createdAt: String!
}

enum MessageRole {
  USER
  ASSISTANT
}

type SendMessagePayload {
  conversation: Conversation!
  message: Message!
}

input SendMessageInput {
  conversationId: ID
  content: String!
}

type Query {
  conversation(id: ID!): Conversation
  myConversations: [Conversation!]!
}

type Mutation {
  sendMessage(input: SendMessageInput!): SendMessagePayload!
}

@key(fields: "id") on Conversation makes it a federation entity. Other subgraphs can reference it by ID. The only other federation-specific line in the whole service is:

export const schema = buildSubgraphSchema([{ typeDefs, resolvers }]);

That’s it.
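From the client's perspective, sendMessage is an ordinary supergraph operation. A field selection like this (illustrative, not copied from the frontend) is valid against the schema above:

```graphql
mutation SendMessage($input: SendMessageInput!) {
  sendMessage(input: $input) {
    conversation { id title updatedAt }
    message { id role content createdAt }
  }
}
```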


Wiring It Into the Supergraph

Apollo Router uses a composed supergraph schema. You generate it with Rover CLI before building Docker images:

federation_version: =2.3.0
subgraphs:
  ai:
    routing_url: http://ai:4005/graphql
    schema:
      file: ./services/ai/schema.graphql

Then compose:

rover supergraph compose --config supergraph.yaml

Rover reads the SDL files directly. No services need to be running at compose time.


The Agent

The agent is a tool-calling LLM. Rather than answering directly, it decides when to call GraphQL-backed tools to read or write data. It uses LangChain with Groq, and it picks the model based on what the user is asking:

function chooseModel(input: string): string {
  const lower = input.toLowerCase();
  const needsStrong =
    lower.includes('add') ||
    lower.includes('review') ||
    lower.includes('create') ||
    lower.includes('graphql');
  if (!needsStrong && input.length < 120) return 'llama-3.1-8b-instant';
  return 'llama-3.3-70b-versatile';
}

Short browsing queries use the faster 8B model. Anything involving writes, explicit GraphQL, or longer inputs gets routed to the 70B model. The tradeoff is cost vs. capability: the 8B is fast and cheap, but it’s more likely to hallucinate arguments for mutations.
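Reproducing chooseModel from the snippet above, the routing can be checked directly:

```typescript
// chooseModel as defined above, repeated so this example is
// self-contained and runnable.
function chooseModel(input: string): string {
  const lower = input.toLowerCase();
  const needsStrong =
    lower.includes('add') ||
    lower.includes('review') ||
    lower.includes('create') ||
    lower.includes('graphql');
  if (!needsStrong && input.length < 120) return 'llama-3.1-8b-instant';
  return 'llama-3.3-70b-versatile';
}

// Short browsing query -> fast model.
console.log(chooseModel('What sci-fi movies do you have?')); // llama-3.1-8b-instant
// Write intent -> strong model, regardless of length.
console.log(chooseModel('Add a 5-star review for Dune')); // llama-3.3-70b-versatile
```

Note the keyword check is substring-based, so a message like "Madden" would also trip the 'add' branch; that is a deliberate bias toward the stronger model.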

const llm = new ChatGroq({
  model: chooseModel(userMessage),
  temperature: 0.3,
  maxTokens: 512,
  apiKey: process.env.GROQ_API_KEY,
});

const agent = createToolCallingAgent({ llm, tools, prompt });
const executor = new AgentExecutor({ agent, tools, maxIterations: 5 });

Dynamic schema injection

On first use, the service introspects the gateway, builds the schema in memory, and formats all available Query and Mutation operations into a compact string. It strips federation internals like _service and _entities so the model only sees operations it can actually call. The result is cached indefinitely:

let cachedSDL: string | null = null;

export async function fetchSchemaSDL(): Promise<string> {
  if (cachedSDL) return cachedSDL;
  // introspect, format, cache...
}

export function refreshSchemaSDL(): void {
  cachedSDL = null;
}

A TTL makes no sense here. The schema only changes on redeploy, and a redeploy restarts the service, which clears the in-process cache. refreshSchemaSDL() exists for manual use if you ever need it.
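The stripping-and-formatting step can be sketched as a pure function. The signature map below is a stand-in of my own; the real service derives field names and signatures from the introspection result:

```typescript
// Federation plumbing the model should never see.
const FEDERATION_INTERNALS = new Set(['_service', '_entities']);

// Hypothetical helper: render one root type's operations as SDL-like
// text, dropping federation-internal fields.
function formatOperations(
  kind: 'Query' | 'Mutation',
  fields: Record<string, string> // name -> "(args): ReturnType"
): string {
  const lines = Object.entries(fields)
    .filter(([name]) => !FEDERATION_INTERNALS.has(name))
    .map(([name, signature]) => `  ${name}${signature}`);
  return `type ${kind} {\n${lines.join('\n')}\n}`;
}
```

Given { movies: '(limit: Int): [Movie!]!', _service: ': _Service!' }, only the movies line survives into the prompt.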

The formatted operations get appended to the system prompt. The base prompt itself is structured as a numbered priority list, not free text, so the model has an explicit decision order for picking tools:

Tool priority (use the first tool that fits):
1. list_movies — browsing the catalog, asking what movies exist
2. search_movies — finding a movie by title or keyword (never invent IDs)
3. get_movie_details — full info + reviews for a known movieId
4. add_movie — ONLY when user explicitly asks to add a movie
5. add_review — ONLY when user explicitly asks to submit a review
6. execute_graphql — last resort for operations not covered above

There’s also an explicit rules section: always call a tool for live data, never answer from memory, never invent IDs, and if a tool returns an error, fix the arguments and retry once. The schema SDL then follows:

function buildSystemPrompt(schemaSDL: string): string {
  // `base` is the priority-list-plus-rules prompt described above
  if (!schemaSDL) return base;
  return `${base}\n\nAvailable GraphQL operations (for use with execute_graphql):\n${schemaSDL}`;
}

Conversation history

Before each call, the resolver loads the last 20 messages from SQLite and passes them as chat history:

// up to MAX_HISTORY prior messages; the newest user message is
// excluded here because it is passed separately as the agent input
const history = allMessages.slice(-MAX_HISTORY - 1, -1);
const reply = await runAgent(input.content, history, ctx.token);
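LangChain's prompt templates accept [role, content] tuples as chat history, so the row-to-history mapping can be as small as this sketch (toChatHistory is a hypothetical helper, not the project's actual code):

```typescript
// Shape of a row as stored in the messages table.
type StoredMessage = { role: 'USER' | 'ASSISTANT'; content: string };

// Map DB rows to the tuple form LangChain accepts as chat history.
function toChatHistory(rows: StoredMessage[]): Array<['human' | 'ai', string]> {
  return rows.map((m): ['human' | 'ai', string] => [
    m.role === 'USER' ? 'human' : 'ai',
    m.content,
  ]);
}
```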

Tools That Talk to Other Subgraphs

Each tool makes a direct GraphQL request to its target service over Docker’s internal network, not through the gateway. All of them share one helper:

async function gqlFetch(
  url: string,
  query: string,
  variables: Record<string, unknown>,
  token?: string
): Promise<unknown> {
  const res = await fetch(`${url}/graphql`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...(token ? { Authorization: `Bearer ${token}` } : {}),
    },
    body: JSON.stringify({ query, variables }),
  });
  const json = await res.json();
  if (json.errors?.length) throw new Error(json.errors[0].message);
  return json.data;
}

The user’s token travels with every request. When the agent calls add_movie, the movies service receives a real authenticated request and enforces its own auth rules. The agent has no elevated permissions.

Six tools in total:

| Tool | Target | Auth required |
|---|---|---|
| list_movies | movies (direct) | No |
| search_movies | search (direct) | No |
| get_movie_details | movies + reviews (direct) | No |
| add_movie | movies (direct) | Yes |
| add_review | reviews (direct) | Yes |
| execute_graphql | gateway | Optional |

The first five go direct to their subgraph. execute_graphql is the fallback for anything not covered by the specialized tools. It routes through the gateway and accepts any valid query:

const executeGraphQL = new DynamicStructuredTool({
  name: 'execute_graphql',
  description:
    'Execute any GraphQL query against the API. Use this for operations not covered by the specialized tools above.',
  schema: z.object({
    query: z.string().describe('Valid GraphQL query or mutation string'),
    variables: z.record(z.unknown()).optional(),
  }),
  func: async ({ query, variables }) => {
    const data = await gqlFetch(GATEWAY_URL, query, variables ?? {}, token);
    return JSON.stringify(data, null, 2);
  },
});

Because the model already has the formatted schema injected into its system prompt (covered above in “Dynamic schema injection”), it can construct valid queries without guessing. The specialized tools handle the common cases; execute_graphql covers the rest.

get_movie_details is worth calling out specifically. It fans out to two services in parallel and returns a combined result:

const [movieData, reviewData] = await Promise.all([
  gqlFetch(MOVIES_URL, movieQuery, { id: movieId }),
  gqlFetch(REVIEWS_URL, reviewsQuery, { movieId }),
]);

The model gets both the movie and its reviews in one tool call.


Why Tools Call Subgraphs Directly

Routing through the gateway would work, but the agent may call 3 to 5 tools per message. Each extra hop adds latency, and that compounds. Direct calls also keep auth enforcement inside each service where it belongs.

The tradeoff is that once you bypass the gateway, you also give up some centralized behavior. Observability, query controls, and policy enforcement are no longer automatic at the gateway layer, so you need to handle those concerns deliberately in the services themselves.


Rate Limiting

The sendMessage resolver checks the rate limit before it does anything else:

const HOURLY_LIMIT = 20;
const msgCount = countUserMessagesInLastHour(ctx.userId);
if (msgCount >= HOURLY_LIMIT) {
  throw new Error(`Rate limit exceeded: ${HOURLY_LIMIT} messages per hour`);
}

LLM calls cost money. Rejecting over-limit requests before building tools, fetching the schema, or running the agent means you pay nothing for them. The count is a single SQLite query, no external service required.
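In SQL the count is roughly SELECT COUNT(*) scoped to the user and the last hour. The equivalent logic in plain TypeScript, for illustration (countInLastHour is my stand-in, assuming created_at holds ISO-8601 strings):

```typescript
// Count timestamps that fall within the last hour relative to `now`.
// `now` is injectable so the logic is testable without a real clock.
function countInLastHour(createdAts: string[], now: Date = new Date()): number {
  const cutoff = now.getTime() - 60 * 60 * 1000;
  return createdAts.filter((t) => Date.parse(t) > cutoff).length;
}
```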


Conversation History in SQLite

Every service in this stack uses bun:sqlite, Bun’s built-in SQLite driver. No extra packages, no separate database process.

const dbPath = process.env.DB_PATH ?? 'ai.db';
const db = new Database(dbPath, { create: true });

db.run(`
  CREATE TABLE IF NOT EXISTS conversations (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    title TEXT NOT NULL,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
  );
  CREATE TABLE IF NOT EXISTS messages (
    id TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TEXT NOT NULL
  );
`);

For a single-instance Docker container storing only conversation history, SQLite is fine. It’s fast, it’s local, and it eliminates an entire category of operational overhead.


What I’d Change

Streaming. sendMessage blocks until the agent finishes. GraphQL subscriptions or chunked HTTP responses would make the chat feel less like submitting a form.

Tool error handling. When a tool throws, LangChain catches it and the model sometimes tries to work around it in ways that produce confusing output. Returning structured error objects instead of throwing would give the model cleaner signal.

Separate agent modes for reads vs. writes. The same agent instance handles both browsing and mutations. A read-only mode with a tighter prompt would reduce the risk of the model calling add_movie when the user just wanted a recommendation.


Deployment

The run order matters. Compose the supergraph schema first, then build and start the containers:

bash compose-supergraph.sh
docker compose build
docker compose up

The AI service needs GROQ_API_KEY, MOVIES_SERVICE_URL, REVIEWS_SERVICE_URL, GATEWAY_URL, and JWT_SECRET. Everything else, including the user's identity and token, arrives through the gateway context.
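In docker-compose terms that might look like the following. Service names, ports, and the build path are assumptions, not the project's actual file:

```yaml
# Illustrative compose entry for the AI service.
ai:
  build: ./services/ai
  ports:
    - "4005:4005"
  environment:
    GROQ_API_KEY: ${GROQ_API_KEY}
    MOVIES_SERVICE_URL: http://movies:4002
    REVIEWS_SERVICE_URL: http://reviews:4003
    GATEWAY_URL: http://gateway:4000
    JWT_SECRET: ${JWT_SECRET}
```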


Treating the AI as a subgraph instead of a sidecar costs almost nothing extra, and you get shared auth, typed schema, and a single API surface for free. If you’re already running Federation, the AI belongs in the graph.

The full code is on GitHub: mrSamDev/GraphQL-Federation

Live demo: movie.mrsamdev.xyz