SHEBEEB S

With a strong foundation in software engineering, I discovered my passion for data-driven decision making and intelligent systems. This curiosity led me to transition into Data Science, exploring the art of data and solving real-world problems through Machine Learning and storytelling.

Project Overview & Tools Used

I built this chatbot mainly out of curiosity (and a bit for bragging rights): I wanted to see how well an LLM could answer questions about news articles or other web content when it had direct access to those sources. My goal was a news content summarizer/Q&A bot: a simple Streamlit app where you paste a few URLs and ask a question, and the bot answers using those articles as its knowledge base.

To do this I stitched together a few key tools:

  • Streamlit (Python) – for building a quick web interface without much boilerplate. It lets me make input fields for URLs and questions, and easily display the answer and source links.

  • LangChain – an orchestration framework for LLM apps. LangChain provided the building blocks (like document loaders and text splitters) to glue together all the components in a neat workflow.

  • ChromaDB – an open-source vector database for storing embeddings. I used Chroma locally to hold all the chunk embeddings so I could run similarity search. It’s lightweight and worked great on my laptop.

  • HuggingFace Sentence-Transformers – specifically the “all-MiniLM-L6-v2” model. This maps text into 384-dimensional vectors and is fast and accurate enough for semantic search. (In code I used Chroma’s SentenceTransformerEmbeddingFunction under the hood.)

  • Groq’s LLM (Llama-3.3 via ChatGroq) – I used LangChain’s ChatGroq wrapper to call a Llama-3 model hosted by Groq. This acts like the ChatGPT/GPT-4 part of the pipeline. You could swap in OpenAI’s ChatGPT or any other LLM here, but I went with Groq’s offering.

  • Other: I also used python-dotenv to manage API keys (so my Groq API key stays in a .env file), uuid/pathlib for file paths, and LangChain’s UnstructuredURLLoader (from langchain_community) to scrape web pages. The Unstructured loader turned each URL into raw text documents I could work with.

In short, the stack was: Python + Streamlit for UI, LangChain to orchestrate, Chroma for vectors, a Sentence-Transformer for embeddings, and a hosted LLM for answers.
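
To give a flavour of how little setup that stack needs, here’s roughly what the configuration side looks like. This is a sketch rather than my exact code: the specific Groq model id is an assumption on my part.

```python
# Rough setup sketch – the exact Groq model id is assumed, not copied from my repo.
from dotenv import load_dotenv       # keeps the Groq API key in a .env file
from langchain_groq import ChatGroq  # LangChain wrapper around Groq's hosted Llama

load_dotenv()  # loads GROQ_API_KEY from .env into the environment

llm = ChatGroq(
    model="llama-3.3-70b-versatile",  # a Llama-3.3 model hosted by Groq
    temperature=0,
    max_tokens=500,
)
```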

Introduction to RAG

In simple terms, RAG means giving a language model an “open book” to look things up in while it answers a question. First you ingest or index a set of documents (e.g. news articles, PDFs, your own knowledge base), converting them into embeddings and storing them in a vector database. Then at query time, when a user asks something, the system retrieves the most relevant documents (based on vector similarity) and passes that context along with the question to the LLM to generate an answer. In practice this often means taking a few top-matching text chunks and prepending them to the prompt sent to the model. The result is an answer that’s grounded in real data rather than purely the model’s internal “imagination.”

Researchers describe RAG as a way to “ground” LLM outputs in actual facts by fetching relevant information from an external knowledge base. For example, instead of trusting the model’s internal knowledge (which might be outdated or incomplete), we let it see fresh content. RAG can reduce hallucinations because the model answers “with the most current, reliable facts” instead of making things up.
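
To make that concrete, the whole pattern fits in a few lines. This snippet is purely illustrative (it assumes a LangChain-style vector store with a similarity_search() method; my actual app talks to ChromaDB directly, as described below):

```python
# Illustrative only: the retrieve-then-generate loop at the heart of RAG.
def rag_answer(question: str, vector_store, llm, k: int = 5) -> str:
    # 1. Retrieve: find the k chunks whose embeddings sit closest to the question
    chunks = vector_store.similarity_search(question, k=k)
    # 2. Augment: prepend the retrieved text to the prompt as context
    context = "\n\n".join(chunk.page_content for chunk in chunks)
    prompt = (
        "Use the following information to answer the query:\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    # 3. Generate: the LLM answers grounded in that context, not its memory alone
    return llm.invoke(prompt).content
```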

How I Built It

Here’s the step-by-step pipeline I coded (mostly in a file rag.py, with a Streamlit front end in main.py):

  1. Loading and splitting documents: When you click “Process URLs,” the app takes each URL you entered and uses LangChain’s UnstructuredURLLoader to fetch and parse the text from the web pages. (This handles PDFs or news sites transparently if Unstructured can read them.) I then split each long document into smaller chunks (about 512 characters each) using LangChain’s RecursiveCharacterTextSplitter with sensible separators. This chunking is important because our embedding model and LLM have input size limits. (Steps 1–3 roughly make up my process_urls() function; a code sketch follows after this list.)

  2. Generating embeddings: For every text chunk, I used the Sentence-Transformer model (all-MiniLM-L6-v2) to convert it into a 384-dimensional vector. In code this is done via Chroma’s SentenceTransformerEmbeddingFunction, but conceptually it just means each chunk is now a point in high-dimensional space capturing its meaning.

  3. Vector store with ChromaDB: I created a ChromaDB collection named e.g. “real_estate” (or whatever topic I was indexing). Chroma is an open-source vector database that stores these embeddings and lets you query them efficiently. I added each embedding to Chroma along with metadata: specifically, I stored the original URL as source metadata for each chunk. (This turned out to be super useful: it lets us later show which URL a piece of text came from.) In fact, I made sure to delete any old data in the collection before adding new docs, so that running “Process URLs” starts fresh each time. At the end of this step, ChromaDB has N documents (chunks) indexed with their vectors.

  4. Retrieval at query time: Once the documents are loaded, the app waits for you to type a question. When you submit a query, I take that query string and ask ChromaDB to find the top 5 most similar chunks (using collection.query(...) with n_results=5). This returns the text of those chunks plus their metadata. If no documents exist (e.g. you forgot to click “Process URLs”), the app warns you to load some data first.

  5. LLM prompt and answer generation: I then format a prompt for the LLM. I join the retrieved chunks into one big context block and append the user’s question. For example:

    “Use the following information to answer the query:

    <chunk 1>

    <chunk 2>
    …
    <chunk 5>

    Question: <user query>”

    This prompt is sent to ChatGroq.invoke(), which calls the Groq-hosted Llama model. The model returns an answer string. In code I handled cases where the API returned a dict or an object, and always extracted result["content"] as the answer text. (A sketch of this generate_answer() flow appears below.)

  6. Displaying the result: The answer text is returned to Streamlit, where it’s shown under an Answer: header. Below that, I list the real source URLs (from the metadata) as clickable links, so the user can see exactly where the information came from. After processing, using the app boils down to two steps: type a question, then read the answer and the “Sources:” list of URLs beneath it.
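
Steps 1–3 above roughly make up my process_urls() function. Here’s a trimmed-down sketch of that flow; the chunk overlap, the uuid-based ids, and the bare-bones error handling are simplifications and assumptions rather than the verbatim code:

```python
import uuid
import chromadb
from chromadb.utils import embedding_functions
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# all-MiniLM-L6-v2 maps each chunk to a 384-dimensional vector
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="resources/vectorstore")

def process_urls(urls: list[str]) -> None:
    # 1. Load: fetch and parse raw text from each URL
    docs = UnstructuredURLLoader(urls=urls).load()

    # 2. Split: ~512-character chunks so nothing overflows the embedder or LLM
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
    chunks = splitter.split_documents(docs)

    # 3. Store: start fresh, then index every chunk with its source URL as metadata
    try:
        client.delete_collection("real_estate")
    except Exception:
        pass  # nothing to delete on the first run
    collection = client.create_collection("real_estate", embedding_function=ef)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in chunks],  # any unique string works as an id
        documents=[c.page_content for c in chunks],
        metadatas=[{"source": c.metadata.get("source", "")} for c in chunks],
    )
```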

Altogether, this pipeline (process_urls() and generate_answer()) ties together document loading, embedding, vector retrieval, and LLM generation. LangChain made this glue pretty painless: I just plugged in components (loader, splitter, retriever chain) instead of writing all the low-level code. The main.py Streamlit script is quite simple – it basically collects inputs, calls these functions, and renders the output – so most of the logic lived in rag.py.
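
And for completeness, here’s the generate_answer() side in the same trimmed-down style. Again a sketch: the prompt wording follows step 5 above, the Groq model id is assumed, and I’ve left out the “no documents yet” warning.

```python
import chromadb
from chromadb.utils import embedding_functions
from langchain_groq import ChatGroq

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="resources/vectorstore")
llm = ChatGroq(model="llama-3.3-70b-versatile", max_tokens=500)

def generate_answer(query: str) -> tuple[str, list[str]]:
    collection = client.get_collection("real_estate", embedding_function=ef)

    # 4. Retrieve the five chunks most similar to the query, plus their metadata
    results = collection.query(query_texts=[query], n_results=5)
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]

    # 5. Build the prompt and call the Groq-hosted Llama model
    prompt = (
        "Use the following information to answer the query:\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    result = llm.invoke(prompt)
    # The wrapper may hand back a dict or an AIMessage-style object
    answer = result["content"] if isinstance(result, dict) else result.content
    return answer, sources
```

main.py then simply renders the returned answer under the Answer: header and lists the sources beneath it.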

Roadblocks and Learnings

Building this was a fun learning experience, and I definitely ran into a few bumps:

  • Vector DB Persistence (Streamlit Cloud issues): I discovered that ChromaDB by default uses a local SQLite file to store the vectors. This became a headache when I tried to deploy on Streamlit’s free cloud: the app kept losing its knowledge base between sessions. Someone on the Streamlit forums described the exact issue: “My RAG Chatbot works perfectly in localhost but loses all knowledge when deployed to Streamlit Cloud.” In practice, this meant each time the cloud app started it had an empty database. I also saw other users warning about SQLite version errors (“unsupported sqlite3”) unless you fiddle with pysqlite3 in requirements. All this made deployment tricky. The easy workaround (for now) is to run it locally so the resources/vectorstore directory persists; that way the scraped data and embeddings stay between runs. For a future version, I might swap to a managed DB (Pinecone, Weaviate, etc.) or explore LangChain’s persist_directory options more deeply. (I’ve sketched both workarounds right after this list.)

  • Metadata and Source Tracking: Early on I was just storing chunks with dummy metadata (like “Doc 1, Doc 2”), but that was useless for the user. I realized it’s better to store the actual URL as metadata ("source": the_url) for every chunk. That small change meant the bot could list real sources under each answer. It was a simple fix in code, but it made the final output so much more transparent and trustworthy.

  • Chunking and Context Size: Deciding how to split text was important. I initially tried bigger chunks, but the LLM (especially with max tokens ~500) struggled if too much context was fed. The default 512-character chunks (with recursive splitting on paragraphs) struck a good balance: it gave enough detail in each chunk without flooding the model. I also found that if the documents were very short (or empty), the pipeline would warn “No content extracted” or “no docs to split,” so I had to handle those edge cases.

  • Model Prompting Quirks: Working with ChatGroq/Llama-3 taught me a few things too. For one, I had to set os.environ["TOKENIZERS_PARALLELISM"] = "false" to avoid a HuggingFace warning about parallelism. Another surprise: the raw response from llm.invoke() could come back as a dict or an object, depending on the API response format. I ended up writing a small bit of code to extract result["content"] or result.content. It was a reminder that different LLM wrappers can behave slightly differently. But once I cleaned up those details, the answers started flowing consistently.

  • No Conversation Memory Yet: One thing I haven’t added yet (but plan to) is conversational memory. Right now, every question is treated independently – the model has no recollection of the previous Q&A. LangChain actually has memory modules if I want a truly chatty experience, but for this MVP it was simpler to do single-turn Q&A. If I expand this bot, I’d look at LangChain’s ConversationBufferMemory or similar so it can remember the chat history.
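
For anyone hitting the same Streamlit Cloud wall, here are the two workarounds I mentioned, sketched out. The pysqlite3 shim is a community trick for Chroma’s SQLite version requirement, not something my current code ships with:

```python
# Workaround 1 (Streamlit Cloud): shadow the stdlib sqlite3 with pysqlite3-binary
# so Chroma gets a new-enough SQLite. Add `pysqlite3-binary` to requirements.txt
# and run these lines before importing chromadb.
__import__("pysqlite3")
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

# Workaround 2 (local runs): point Chroma at a directory that survives restarts,
# so the data from "Process URLs" persists between sessions.
import chromadb
client = chromadb.PersistentClient(path="resources/vectorstore")
collection = client.get_or_create_collection("real_estate")
```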

Overall, I was pleasantly surprised by how well it worked. Even with a relatively small embedding model and a 70B-parameter Llama, the answers were quite coherent and on-topic (as long as the relevant info was in the indexed articles). And seeing it cite real URLs felt like a big win for trust.

Conclusion

Building this RAG chatbot was an eye-opening experience. It felt like giving the LLM a pair of glasses so it could actually read the sources instead of hallucinating. I learned a lot about the plumbing behind RAG – from text loaders and embedding models to vector similarity search and prompt engineering. The surprising thing was how quickly I could go from nothing to a working app: once the pieces were in place, asking it a question and seeing a coherent, sourced answer was very satisfying.

Going forward, I’d love to improve the UI (maybe add conversation tabs, or allow more dynamic content ingestion) and solve the deployment issues. I might also play with different embedding models or a lighter LLM to see how the answer quality changes. But even as is, this little project has become a useful tool for me (and a great demonstration of RAG on my portfolio).

If you’re curious about RAG, I encourage you to try something similar! It’s amazing how the combination of ChromaDB (or any vector store), a sentence embedder, and a modern LLM can turn into a powerful Q&A system. Hopefully my walkthrough inspires you to build and experiment with RAG on your own data.