Improving LLM Chatbots: Context, Memory & Evaluation

Welcome back!

In Part 2, you built a working product inquiry chatbot that could understand user queries, extract product information, and respond through a simple web interface.

It worked — but it wasn’t perfect. It couldn't handle follow-up questions well. In this concluding part, we'll address this issue and explore how to evaluate our chatbot's performance.

The complete code for this project is available on GitHub, so you can experiment as you read along.

🧠 Giving the Chatbot a Simple Memory

Our chatbot's biggest flaw was its inability to remember the context of a conversation. To fix this, we need to introduce a simple form of memory. We can do this by keeping a history of the conversation and providing it to the Large Language Model (LLM) with each new query.

Here’s how we modify our process_user_message() function in chatbot.py:

What Changed?

The magic lies in passing all_messages — the conversation history along with product information— back into the model each time.
This lets the LLM “remember” past interactions within the current session, allowing it to respond naturally to follow-up questions.

Now, a dialogue looks like this:

A small change — but it makes our chatbot feel intelligent.

📊 Evaluating Your Chatbot’s Performance

In part one of this series, I mentioned that our LLM does not always accurately extract product and category information from the user’s query. Evaluating a model’s accuracy and quality of response is critical — in production scenarios inaccurate response could lead to not just lack of trust, but also financial and reputation harm. But how do you get started?

The first step is to look at your overall workflow and break it down into testable parts based on LLM touchpoints.

🔍 1. Evaluating Product and Category Extraction

The first step in our chatbot's logic is to extract products and categories. To test this, I created a labeled dataset of user queries to test the function find_category_and_product_only() and compare predicted outputs against expected results, measuring metrics like accuracy.

As a Product Manager who has built DevOps products, I strongly appreciate the importance of testing automation in software development life-cycle.

Instead of running each test manually, I asked Gemini CLI to generate an automated script (test_scripts/test_utils.py) with 100 test cases — covering scenarios like exact matches, hallucinated products, typos, vague questions, and multiple products in one query. Here’s a code snippet.

        # Case 5: Vague or general questions
        vague_prompts = ["What do you sell?", "I need a new gadget", "Help me choose a gift", "What are your best products?"]
        for i in range(5):
            test_cases.append({
                "prompt": random.choice(vague_prompts),
                "expected": []
            })

        # Case 6: Prompts with typos
        for i in range(5):
            product_name = random.choice(self.product_names)
            product_with_typo = product_name[:-2] + "xx" # Introduce a typo
            test_cases.append({
                "prompt": f"I'm looking for the {product_with_typo}",
                "expected": []
            })

You can fine tune this script to specify additional test cases or change what you think should be the expected result. I just stuck with what Gemini generated. Below is a screenshot of a sample run:

While an accuracy score of 46% may sound alarming, this could be improved in multiple ways:

Reevaluate the test script to specify whether the expected result is correct. For example, if a user query just includes the brand name, but not the actual product, what is your expected result? Do you expect the LLM to return a category or a blank list? (see image below - Gemini tagged it as an failed test)
Use a better model (Llama 3.2 is good, but not the best).
Refine the prompt further to reduce ambiguity.

What is your expected result for this user query?

💬 2. Evaluating the Final Response

Evaluating conversational quality is more subjective but equally important. Two useful methods are:

Human evaluation: Have users rate responses on accuracy, helpfulness, and tone.
LLM-as-a-Judge: Use a more capable LLM to review responses based on predefined criteria.

I adopted the first approach and used Gemini CLI to create a second test script (test_scripts/test_eval_final_response.py) that selects 30 random cases from the first test’s output and evaluates the chatbot’s final responses. Each result includes the original prompt, extracted entities, and the chatbot’s reply — making manual review easy.

Below is a screenshot from a test run. As a tester, I have to manually evaluate the responses generated by the LLM.

--- Test Case 63 --- 
Prompt: Do you have the GameSphere Y in the Gaming Consoles and Accessories section?
Used Category/Product: [{'category': 'Gaming Consoles and Accessories'}, {'products': ['GameSphere Y']}]
Final Response: Yes, we do have the GameSphere Y in our Gaming Consoles and Accessories section. It's currently priced at $399.99. Would you like to know more about its features or is there something specific you'd like to ask about it?

This lightweight framework provides a practical starting point for evaluating consistency and overall reliability.

📈 3. Other Useful Evaluation Techniques

For more depth, you can:

Run user satisfaction surveys for real feedback.
Use BLEU or ROUGE scores to compare model responses to reference answers.
Test prompt consistency by rerunning the same question multiple times.

🚀 Wrapping Up: What You’ve Built

Congratulations — you’ve just completed the entire 3-part journey of building a product inquiry chatbot! 🎉

Here’s what we’ve accomplished:

Learned how to use an LLM to extract structured data from user queries.
Built a simple chat interface using Panel.
Enhanced your chatbot with context awareness and memory.
Learned how to evaluate your chatbot’s accuracy and quality.

You now have a solid foundation to build from. Want to keep going? Here are some next steps you could explore:

Integrate external APIs for real-time product data.
Add embedding-based retrieval for smarter context.
Deploy your chatbot on the web.

🌟 Concluding Thoughts

Over the past year, I’ve used GenAI tools for a variety of purposes — asking questions, refining my learning plans on AI, and even getting unstuck while coding. I’ve also used ChatGPT to refine this blog (though the first draft is always mine!) and have been an pilot user of internal AI agents at my workplace.

I’ll be honest — my experience has been a mix of fascination and frustration. These tools aren’t perfect — but I do see the benefits. Answers to complex git scenarios are now at my fingertips. When I hit an obscure error, Gemini or Claude often saves the day.

Does anyone remember those rigid menu-based chatbots from just a few years back? Today’s AI chatbots, while still evolving, offer a far more natural, helpful user experience and are faster to build.

My goal with this project was simple: to explore what Large Language Models can and can’t do — beyond just using them as a consumer. This journey reminded me that by starting small, focusing on clarity, structure, and thoughtful prompting, we can transform simple ideas into functional, safe, and intelligent assistants.

Keep experimenting, keep improving, and most importantly — keep chatting.

Building a Basic Product Inquiry Chatbot with an LLM — Part 3: Improving Context and Evaluating Performance