Building a Thesis-Specific AI Agent (That Tries Not to Hallucinate)
Hello there! In this post, I will write a bit about AI agents, with a focus on Patrine, my PhD thesis agent. You can find more details in the repo.

A glimpse of the Streamlit app
Large Language Models (LLMs) are great at sounding confident, sometimes too confident.
It has been a while since I finished my PhD, and I realized something ironic: even I do not remember every detail of my own thesis anymore.
So I built Patrine, a thesis-specific AI agent that answers questions only using the content of my PhD thesis, runs fully locally, and tries very hard not to hallucinate.
This post explains how it works, why I designed it this way, and what still goes wrong.
Why a thesis-specific agent?
I did not want:
- a generic PDF chatbot
- a model trained on the entire internet
- something that confidently invents explanations
I wanted:
- an assistant grounded only in my thesis
- one that can say “this is not discussed in the thesis”
- something I could run locally, without paid APIs or data uploads
This immediately rules out fine-tuning and points to a Retrieval-Augmented Generation (RAG) approach.
High-level architecture
The system has two clearly separated phases:
(1) Preprocessing (run once): Thesis PDF → chunks → embeddings → saved index
(2) Inference (interactive): Question → retrieve relevant chunks → LLM answer → optional verification
This separation is crucial:
- the thesis becomes a fixed knowledge base
- the agent cannot “learn” anything else at runtime
Step 1: One-time preprocessing
The thesis is processed once and saved to disk.
Text extraction and chunking
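The raw text first has to come out of the PDF. A minimal extraction sketch, using pypdf as one possible library (the exact extractor is not the important part):

# Text extraction sketch; pypdf is used here only as an example library.
from pypdf import PdfReader

def extract_text(pdf_path):
    reader = PdfReader(pdf_path)
    # extract_text() can return None for image-only pages, hence the "or ''".
    return " ".join(page.extract_text() or "" for page in reader.pages)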
I use a sliding window over words to preserve context:
def sliding_window_chunks(text, window=600, stride=300, min_words=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), stride):
        chunk = " ".join(words[i:i + window])
        if len(chunk.split()) >= min_words:
            chunks.append(chunk)
    return chunks
This avoids tiny fragments while keeping overlap between chunks!
Embeddings
Each chunk is embedded using a local sentence transformer:
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, convert_to_numpy=True, show_progress_bar=True,
                             normalize_embeddings=True).astype(np.float32)
The resulting files:
- chunks.json
- embeddings.npy
- config.json
Together, they form the thesis index.
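Saving them is plain JSON plus NumPy. A minimal sketch (the config keys below are only illustrative):

# Persisting the thesis index to disk; the config keys shown are illustrative.
import json
import numpy as np

with open("chunks.json", "w") as f:
    json.dump(chunks, f)
np.save("embeddings.npy", embeddings)
with open("config.json", "w") as f:
    json.dump({"embedding_model": "all-MiniLM-L6-v2", "window": 600, "stride": 300}, f)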
Step 2: Retrieval at inference time
When the user asks a question, I embed it and retrieve the most relevant chunks:
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_chunks(question, chunks, embeddings, embedder, top_k=5):
    q_emb = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
    sims = cosine_similarity(q_emb, embeddings)[0]
    idx = sims.argsort()[-top_k:][::-1]
    return [chunks[i] for i in idx]
This ensures the LLM sees only thesis excerpts.
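At inference time, the saved index is simply loaded back from disk. A minimal sketch (the question is just an example):

# Loading the saved index and running a retrieval (illustrative question).
import json
import numpy as np
from sentence_transformers import SentenceTransformer

with open("chunks.json") as f:
    chunks = json.load(f)
embeddings = np.load("embeddings.npy")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

top_chunks = retrieve_chunks("What is the main result of the thesis?", chunks, embeddings, embedder)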
Step 3: A locked agent prompt
The most important design choice is locking the agent identity.
The model is explicitly told:
- it is trained exclusively on my thesis
- it must not use outside knowledge
- it must refuse if the answer is not present
SYSTEM_PROMPT = (
    "You are an expert assistant trained exclusively on Natalí Soler Matubaro de Santi's PhD thesis. "
    "Use ONLY the provided thesis excerpts. "
    "Do NOT use external knowledge or speculation. "
    "If the answer is not present, say: 'This is not discussed in the thesis.'"
)
This does not eliminate hallucinations, but it reduces them a lot.
Step 4: Local LLM inference with Ollama
Everything runs locally using Ollama:
from ollama import Client

ollama_client = Client(host="http://localhost:11434")

def generate_answer(question, context, model="llama3"):
    prompt = (SYSTEM_PROMPT
              + f"\n\nThesis excerpts:\n{context}"
              + f"\n\nQuestion: {question}\nAnswer:")
    response = ollama_client.generate(model=model, prompt=prompt, stream=False,
                                      options={"temperature": 0.1})
    return response["response"].strip()
Lower temperature helps reduce creative gap-filling.
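Tying retrieval and generation together, a single question flows through roughly like this:

# End-to-end sketch: retrieve thesis excerpts, then ask the local model (illustrative question).
question = "Which datasets does the thesis use?"
context = "\n\n".join(retrieve_chunks(question, chunks, embeddings, embedder, top_k=5))
answer = generate_answer(question, context, model="llama3")
print(answer)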
Step 5: Answer verification (hallucination detection)
Even with all of the above, hallucinations still happen — especially with:
- abbreviations
- implicit assumptions
- vague phrasing
To detect this, I added a verification pass that checks each sentence against the retrieved excerpts.
Conceptually:
Answer → split into sentences → verify each sentence → flag unsupported ones
This does not fix hallucinations: it surfaces them.
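A minimal sketch of that verification pass, reusing the same embedder and an arbitrary similarity threshold:

# Sentence-level verification sketch; the 0.6 threshold is arbitrary.
import re
from sklearn.metrics.pairwise import cosine_similarity

def verify_answer(answer, context_chunks, embedder, threshold=0.6):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    chunk_emb = embedder.encode(context_chunks, convert_to_numpy=True, normalize_embeddings=True)
    flags = []
    for sentence in sentences:
        s_emb = embedder.encode([sentence], convert_to_numpy=True, normalize_embeddings=True)
        # A sentence counts as supported if it is close enough to at least one excerpt.
        support = cosine_similarity(s_emb, chunk_emb)[0].max()
        flags.append((sentence, support >= threshold))
    return flags

Unsupported sentences can then be flagged in the UI instead of being shown silently.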
What still goes wrong (honestly)
Even with verification:
- small models try to “help” by filling gaps
- paraphrases can drift slightly from the text
- abbreviations with multiple meanings are a common failure mode
The advantage here is that the thesis is mine, so I can immediately tell when something is off.
This project reinforced an important lesson:
Hallucination is not just a model problem: it is a retrieval and grounding problem.
Why keep it local?
This app cannot be deployed as-is, because it requires:
- a local Ollama runtime
- local models
That is intentional.
It gives me:
- reproducibility
- zero API dependencies
Not every AI system needs to scale!
Final thoughts
Patrine is not meant to replace reading the thesis. It is a navigation and memory aid, nothing more.
But as a small end-of-year project, it was a fun way to combine:
- ML engineering
- research rigor
- and something genuinely useful to me
Once again, you can find the full code here: https://github.com/natalidesanti/patrine_thesis_agent
