How I Built It
Here’s the step-by-step pipeline I coded (mostly in a single file, rag.py, with a Streamlit front end in main.py):
- Loading and splitting documents: When you click “Process URLs,” the app takes each URL you entered and uses LangChain’s UnstructuredURLLoader to fetch and parse the text from the web pages. (This handles PDFs or news sites transparently if Unstructured can read them.) I then split each long document into smaller chunks (about 512 characters each) using LangChain’s RecursiveCharacterTextSplitter with sensible separators. This chunking is important because our embedding model and LLM have input size limits. (A condensed code sketch of this ingestion path follows the list.)
- Generating embeddings: For every text chunk, I used the Sentence-Transformer model (all-MiniLM-L6-v2) to convert it into a 384-dimensional vector. In code this is done via Chroma’s SentenceTransformerEmbeddingFunction, but conceptually it just means each chunk is now a point in high-dimensional space capturing its meaning.
- Vector store with ChromaDB: I created a ChromaDB collection (named “real_estate”, or whatever topic I was indexing). Chroma is an open-source vector database that stores these embeddings and lets you query them efficiently. I added each embedding to Chroma along with metadata: specifically, I stored the original URL as source metadata for each chunk. (This turned out to be super useful: it lets us later show which URL a piece of text came from.) I also made sure to delete any old data in the collection before adding new docs, so that running “Process URLs” starts fresh each time. At the end of this step, ChromaDB has N documents (chunks) indexed with their vectors.
- Retrieval at query time: Once the documents are loaded, the app waits for you to type a question. When you submit a query, I take that query string and ask ChromaDB to find the top 5 most similar chunks (using collection.query(...) with n_results=5). This returns the text of those chunks plus their metadata. If no documents exist (e.g. you forgot to click “Process URLs”), the app warns you to load some data first.
- LLM prompt and answer generation: I then format a prompt for the LLM. I join the retrieved chunks into one big context block and append the user’s question. For example: “Use the following information to answer the query: <chunk 1> <chunk 2> … <chunk 5>\n\nQuestion: <user query>”. This prompt is sent to ChatGroq.invoke(), which calls the Groq-hosted Llama model. The model returns an answer string. In code I handled cases where the API returned a dict or an object, and always extracted result["content"] as the answer text. (A sketch of this query-time path also follows the list.)
- Displaying the result: The answer text is returned to Streamlit, where it’s shown under an Answer: header. Below that, I list the real source URLs (from the metadata) as clickable links, so the user can see exactly where the information came from. Once the URLs are processed, using the app boils down to two steps: type a question, then read the answer and its “Sources:” list of URLs. (See the main.py sketch below.)
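To make that concrete, here is a condensed sketch of what the ingestion side (roughly process_urls()) looks like. Treat it as a sketch, not a copy of rag.py: import paths differ between LangChain releases, and the chunk overlap, ID scheme, and persist path are illustrative choices on my part.

```python
import os
import chromadb
from chromadb.utils import embedding_functions
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # silences a HuggingFace tokenizers warning

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

def process_urls(urls, collection_name="real_estate"):
    # 1. Fetch and parse the raw text behind each URL.
    docs = UnstructuredURLLoader(urls=urls).load()
    if not docs:
        raise ValueError("No content extracted from the given URLs")

    # 2. Split long documents into ~512-character chunks.
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ".", " "], chunk_size=512, chunk_overlap=50
    )
    chunks = splitter.split_documents(docs)

    # 3. Recreate the collection so every "Process URLs" run starts fresh.
    client = chromadb.PersistentClient(path="resources/vectorstore")
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass  # collection did not exist yet
    collection = client.create_collection(collection_name, embedding_function=embed_fn)

    # 4. Index every chunk, keeping the original URL as "source" metadata.
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=[c.page_content for c in chunks],
        metadatas=[{"source": c.metadata.get("source", "")} for c in chunks],
    )
    return collection
```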
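And the query side, roughly what generate_answer() does. The Groq model name and max_tokens value here are assumptions, and the dict-vs-object handling mirrors the quirk described under Roadblocks below:

```python
from langchain_groq import ChatGroq

# Model name and token limit are illustrative; GROQ_API_KEY must be set in the environment.
llm = ChatGroq(model="llama3-70b-8192", max_tokens=500)

def generate_answer(query, collection, k=5):
    # 1. Retrieve the k most similar chunks plus their source metadata.
    results = collection.query(query_texts=[query], n_results=k)
    chunks = results["documents"][0]
    sources = sorted({m["source"] for m in results["metadatas"][0]})

    # 2. Stuff the retrieved chunks and the question into one prompt.
    prompt = (
        "Use the following information to answer the query:\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )

    # 3. Call the Groq-hosted Llama model and normalize the response shape.
    result = llm.invoke(prompt)
    answer = result["content"] if isinstance(result, dict) else result.content
    return answer, sources
```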
Altogether, this pipeline (process_urls() and generate_answer()) ties together document loading, embedding, vector retrieval, and LLM generation. LangChain made this glue pretty painless: I just plugged in components (loader, splitter, retriever chain) instead of writing all the low-level code. The main.py Streamlit script is quite simple – it basically collects inputs, calls these functions, and renders the output – so most of the logic lived in rag.py.
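For completeness, main.py is not much more than the following sketch (the widget layout and exact function signatures are my guesses at the shape, reusing the hypothetical helpers above, not a copy of the real script):

```python
import streamlit as st
from rag import process_urls, generate_answer  # signatures assumed as in the sketches above

st.title("RAG News Chatbot")

# Sidebar: collect URLs and build the knowledge base on demand.
raw = st.sidebar.text_area("Enter URLs (one per line)")
urls = [u.strip() for u in raw.splitlines() if u.strip()]
if st.sidebar.button("Process URLs") and urls:
    st.session_state["collection"] = process_urls(urls)
    st.sidebar.success(f"Indexed {len(urls)} URLs")

# Main pane: single-turn Q&A over the indexed chunks.
query = st.text_input("Ask a question about the indexed articles")
if query:
    if "collection" not in st.session_state:
        st.warning("Please click 'Process URLs' to load some data first.")
    else:
        answer, sources = generate_answer(query, st.session_state["collection"])
        st.header("Answer:")
        st.write(answer)
        st.subheader("Sources:")
        for url in sources:
            st.markdown(f"- [{url}]({url})")
```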
Roadblocks and Learnings
Building this was a fun learning experience, and I definitely ran into a few bumps:
- Vector DB Persistence (Streamlit Cloud issues): I discovered that ChromaDB by default uses a local SQLite file to store the vectors. This became a headache when I tried to deploy on Streamlit’s free cloud: the cloud app kept losing the knowledge base between sessions! Someone on the Streamlit forums described the exact issue: “My RAG Chatbot works perfectly in localhost but loses all knowledge when deployed to Streamlit Cloud.” In practice, this meant each time the cloud app started it had an empty database. I also saw other users warning about SQLite version errors (“unsupported sqlite3”) unless you fiddle with pysqlite3 in requirements. All this made deployment tricky. The easy workaround (for now) is to run it locally so your resources/vectorstore directory persists; that way the scraped data and embeddings stay between runs. For a future version, I might swap to a managed DB (Pinecone, Weaviate, etc.) or explore LangChain’s persist_directory options more deeply. (A sketch of the commonly cited sqlite workaround follows this list.)
- Metadata and Source Tracking: Early on I was just storing chunks with dummy metadata (like “Doc 1, Doc 2”), but that was useless for the user. I realized it’s better to store the actual URL as metadata ("source": the_url) for every chunk. That small change meant the bot could list real sources under each answer. It was a simple fix in code, but it made the final output so much more transparent and trustworthy.
- Chunking and Context Size: Deciding how to split text was important. I initially tried bigger chunks, but the LLM (especially with max tokens ~500) struggled when fed too much context. The default 512-character chunks (with recursive splitting on paragraphs) struck a good balance: they gave enough detail per chunk without flooding the model. I also found that if the documents were very short (or empty), the pipeline would warn “No content extracted” or “no docs to split,” so I had to handle those edge cases.
- Model Prompting Quirks: Working with ChatGroq/Llama-3 introduced some learnings too. For one, I had to set os.environ["TOKENIZERS_PARALLELISM"] = "false" to avoid a HuggingFace warning about parallelism. Another surprise: the raw response from llm.invoke() could come back as a dict or an object, depending on the API response format. I ended up writing a small bit of code to extract result["content"] or result.content. It was a reminder that different LLM wrappers can behave slightly differently. But once I cleaned up those details, the answers started flowing consistently.
- No Conversation Memory Yet: One thing I haven’t added yet (but plan to) is conversational memory. Right now, every question is treated independently – the model has no recollection of the previous Q&A. LangChain actually has memory modules if I want a truly chatty experience, but for this MVP it was simpler to do single-turn Q&A. If I expand this bot, I’d look at LangChain’s ConversationBufferMemory or similar so it can remember the chat history. (A rough sketch of how that might bolt on appears after this list.)
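For reference, the workaround most often pointed to for the “unsupported sqlite3” error is to add pysqlite3-binary to requirements.txt and swap it in before chromadb is imported, plus an explicit persist path. A minimal sketch of that approach (I haven’t wired this into a deployed version yet, so treat it as untested notes):

```python
# Must run before chromadb is imported anywhere; requires pysqlite3-binary in requirements.txt.
__import__("pysqlite3")
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import chromadb

# An explicit path keeps the embeddings on disk between local runs.
# On Streamlit Cloud the filesystem is ephemeral, so true persistence
# still needs a managed vector DB (Pinecone, Weaviate, etc.).
client = chromadb.PersistentClient(path="resources/vectorstore")
```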
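And if I do bolt on memory, the rough shape with LangChain’s ConversationBufferMemory might be something like this (an untested sketch; the helper names are hypothetical and exact imports depend on the LangChain version):

```python
from langchain.memory import ConversationBufferMemory

# Keeps prior turns as plain text that can be prepended to the RAG prompt.
memory = ConversationBufferMemory()

def remember_turn(question: str, answer: str) -> None:
    memory.save_context({"input": question}, {"output": answer})

def history_block() -> str:
    # Returns something like "Human: ...\nAI: ..." to prefix the next prompt with.
    return memory.load_memory_variables({})["history"]

# After each Q&A round:
remember_turn("What did the article say about mortgage rates?", "It said rates are expected to fall.")
print(history_block())
```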
Overall, I was pleasantly surprised by how well it worked. Even with a relatively small embedding model and a 70B-parameter Llama, the answers were quite coherent and on-topic (as long as the relevant info was in the indexed articles). And seeing it cite real URLs felt like a big win for trust.
Conclusion
Building this RAG chatbot was an eye-opening experience. It felt like giving the LLM a pair of glasses so it could actually read the sources instead of hallucinating. I learned a lot about the plumbing behind RAG – from text loaders and embedding models to vector similarity search and prompt engineering. The surprising thing was how quickly I could go from nothing to a working app: once the pieces were in place, asking it a question and seeing a coherent, sourced answer was very satisfying.
Going forward, I’d love to improve the UI (maybe add conversation tabs, or allow more dynamic content ingestion) and solve the deployment issues. I might also play with different embedding models or a lighter LLM to see how the answer quality changes. But even as is, this little project has become a useful tool for me (and a great demonstration of RAG on my portfolio).
If you’re curious about RAG, I encourage you to try something similar! It’s amazing how the combination of ChromaDB (or any vector store), a sentence embedder, and a modern LLM can turn into a powerful Q&A system. Hopefully my walkthrough inspires you to build and experiment with RAG on your own data.