How I built a persona-based LLM chatbot with self-evaluation.
This is a chatbot I built with Gradio in Python and hosted on Hugging Face. It acts as me, answering questions about my skills and work experience. Code for the project can be seen here.
Read more below for a step-by-step guide on how it was built.
1. Load profile data (PDF + summary)
I read in my LinkedIn export and a short summary to give the model authentic, up-to-date context about me.
from pypdf import PdfReader

# Pull the raw text out of my LinkedIn PDF export, page by page
reader = PdfReader("me/linkedin.pdf")
linkedin = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        linkedin += text

# Load the short hand-written summary
with open("me/summary.txt", "r", encoding="utf-8") as f:
    summary = f.read()
Why: Grounding the model with my real experience reduces hallucinations and keeps answers consistent with my background.
2. Define the persona and system prompt
I set the chatbot to “act as” me and embedded both data sources directly into the system message.
name = "Helena Hook"
system_prompt = (
f"You are acting as {name}..."
f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
f"With this context, please chat with the user, always staying in character as {name}."
)
Why: A strong system prompt combined with my own materials gives the model guardrails (tone, audience, scope) and concrete facts.
3. Connect to the primary LLM (OpenAI)
This model produces the user-facing reply.
from openai import OpenAI
openai = OpenAI() # reads OPENAI_API_KEY from env
Why: Keep the main chat model simple and fast (I used gpt-4o-mini).
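As a quick sanity check before wiring anything else up, the client can be exercised directly. This smoke-test snippet is my own illustration (the question is made up), not part of the deployed app:
# Illustrative smoke test of the primary model, using the client created above
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What are your main technical skills?"},
    ],
)
print(response.choices[0].message.content)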
4. Add an evaluator (Gemini)
I use a separate model to critique the first model’s reply before showing it to users.
from pydantic import BaseModel
import os

# Gemini, reached through its OpenAI-compatible endpoint
gemini = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# Structured schema the evaluator must fill in
class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str
I share the same context with the evaluator and ask it to judge the chatbot's latest response in the ongoing conversation:
def evaluator_user_prompt(reply, message, history):
    user_prompt = (
        f"Here's the conversation... \n\n{history}\n\n"
        f"Here's the latest message from the User: \n\n{message}\n\n"
        f"Here's the latest response from the Agent: \n\n{reply}\n\n"
        "Please evaluate the response, replying with whether it is acceptable and your feedback."
    )
    return user_prompt
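The evaluate function below also references an evaluator_system_prompt, which I haven't reproduced verbatim here. A minimal sketch, sharing the same profile context (the exact wording is an assumption):
# Minimal sketch of the evaluator's instructions; the exact wording in the
# deployed app differs, but it shares the same summary/LinkedIn context.
evaluator_system_prompt = (
    f"You are an evaluator that decides whether a response to a question is acceptable. "
    f"You are judging an Agent playing the role of {name}. The Agent has this context:\n\n"
    f"## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
    f"Reply with whether the Agent's latest response is acceptable and your feedback."
)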
Then I parse the evaluator’s output into the Evaluation schema:
def evaluate(reply, message, history) -> Evaluation:
    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": evaluator_user_prompt(reply, message, history)}
    ]
    response = gemini.beta.chat.completions.parse(
        model="gemini-2.0-flash",
        messages=messages,
        response_format=Evaluation
    )
    return response.choices[0].message.parsed
Why: Using a second model to check tone, accuracy, and professionalism catches weak answers before users see them. Pydantic keeps the evaluator’s output structured and reliable.
5. If rejected, help the main model improve and try again
When the evaluator says the answer isn’t good enough, I update the main model’s instructions with:
- The bad answer, and
- The evaluator’s reason for rejection
def rerun(reply, message, history, feedback):
    updated_system_prompt = (
        system_prompt
        + "\n\n## Previous answer rejected\nYou just tried to reply, but the quality control rejected your reply\n"
        + f"## Your attempted answer:\n{reply}\n\n"
        + f"## Reason for rejection:\n{feedback}\n\n"
    )
    messages = [{"role": "system", "content": updated_system_prompt}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content
Why: This creates a tight feedback loop where the main model learns what to fix and tries again.
6. Add an example of an enforced style rule (keyword trigger)
If the user’s message contains the word "patent", I force the reply to be in Pig Latin (just to demonstrate hard constraints).
def chat(message, history):
    # Keyword trigger (case-sensitive): any message mentioning "patent"
    if "patent" in message:
        system = system_prompt + "\n\nEverything in your reply needs to be in pig latin ..."
    else:
        system = system_prompt
    ...
Why: Shows how to conditionally tighten style/format policies at runtime.
7. Full message flow per user turn
- Build the system prompt (Pig Latin variant if triggered).
- Call the primary LLM to get a draft reply.
- Send draft, user message, and conversation history to the evaluator.
- If evaluation.is_acceptable, return the draft to the user.
- Else, call rerun(...) with evaluator feedback and return the improved reply.
# Inside chat(message, history): `messages` is the chosen system prompt plus
# the conversation history plus the new user message, as in rerun()
response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
reply = response.choices[0].message.content
evaluation = evaluate(reply, message, history)
if not evaluation.is_acceptable:
    reply = rerun(reply, message, history, evaluation.feedback)
return reply
Why: This keeps latency reasonable (usually one pass) but upgrades quality automatically when needed.
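Putting steps 6 and 7 together, the complete chat function looks roughly like this, assembled from the snippets above (I've filled in the messages construction to mirror rerun):
def chat(message, history):
    # Step 6: conditionally tighten the style policy
    if "patent" in message:
        system = system_prompt + "\n\nEverything in your reply needs to be in pig latin ..."
    else:
        system = system_prompt

    # Step 7: draft, evaluate, and retry once if rejected
    messages = [{"role": "system", "content": system}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = response.choices[0].message.content

    evaluation = evaluate(reply, message, history)
    if not evaluation.is_acceptable:
        reply = rerun(reply, message, history, evaluation.feedback)
    return reply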
8. Wrap it in a simple UI (Gradio)
I expose the chat loop as a web app with Gradio’s ChatInterface.
import gradio as gr

gr.ChatInterface(
    chat,
    type="messages",
    title="Chatbot",
    theme=gr.themes.Soft(),
    fill_height=True
).launch()
Why: Instant local demo and easy deployment to a small server.
Extensions I’d Add Next
- Retrieval: Embed and index the PDF/summary for better grounding than one giant system prompt (see the sketch after this list).
- Memory: Store common Q&A and let the model cite sources.
- Analytics: Log evaluator feedback to see recurring failure modes.
- Tests: Scripted prompts that must pass the evaluator before deploys.
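On the retrieval point, here is a minimal sketch of what that could look like, using OpenAI embeddings and cosine similarity. The chunk size, embedding model, and retrieve helper are all assumptions for illustration; none of this is in the deployed app yet.
import numpy as np

def embed(texts):
    # Embed a list of strings with OpenAI's embeddings endpoint
    response = openai.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data])

# Naive fixed-size chunking of the LinkedIn text, embedded once at startup
chunks = [linkedin[i:i + 1000] for i in range(0, len(linkedin), 1000)]
chunk_vectors = embed(chunks)

def retrieve(question, k=3):
    # Rank chunks by cosine similarity to the question, return the top k
    q = embed([question])[0]
    scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
The top-k chunks would then replace the full LinkedIn dump in the system prompt, keeping the context short and relevant.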