Hands-on RAG Tutorial: Building an App with LlamaIndex, Gemini, and Pinecone
Harness the power of retrieval-augmented generation (RAG) to bridge your private data with state-of-the-art LLMs. This hands-on guide walks you through building a Python app using LlamaIndex, Google Gemini, and Pinecone, covering end-to-end setup, ingestion, and querying.
Introduction to RAG with LlamaIndex, Gemini, and Pinecone
In modern AI applications, connecting a large language model like Google Gemini to your proprietary datasets unlocks new levels of utility—whether you’re automating customer support, mining insights from internal wikis, or powering domain-specific research assistants. Retrieval-Augmented Generation (RAG) makes this possible by retrieving relevant passages from your private content at query time and injecting them as context into the LLM’s prompt. In our workflow, LlamaIndex orchestrates the pipeline: ingesting text from blog posts (or any document source), chunking and indexing it, and serving it through the Pinecone vector store. By the end of this tutorial, you will understand how to scrape HTML content with BeautifulSoup, generate vector embeddings via GeminiEmbedding, and interface with Pinecone—all integrated for context-rich conversational AI.
How RAG Powers Smarter LLMs
Rather than relying exclusively on pre-trained model weights, RAG retrieves and injects real-time context from a specialized vector database. Here’s how a typical RAG flow works:
- Data Ingestion: Load documents (e.g., PDFs, webpages) into the pipeline.
- Chunking: Split text into manageable passages based on sentence boundaries or token counts.
- Embedding: Use an embedding model (GeminiEmbedding) to convert each chunk into a high-dimensional vector.
- Storage: Upsert these vectors into Pinecone under a named index.
- Retrieval: At query time, compute the query embedding and fetch the top-K most similar vectors.
- Generation: Supply the retrieved text as context to the LLM (e.g., Gemini) for response generation.
“RAG stands for Retrieval-Augmented Generation and it’s a technique or an architecture that’s used to enhance large language models.”
This pattern ensures you can serve accurate answers grounded in your own data, rather than piecing together vague or outdated knowledge.
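To make the flow concrete before wiring up any frameworks, here is a minimal sketch of the retrieve-then-generate loop in plain Python. The names `embed`, `vector_db`, and `llm` are placeholders for whatever embedding model, vector store, and LLM you use; the rest of this tutorial implements each piece with Gemini and Pinecone.

```python
# Minimal RAG loop (illustrative sketch; embed, vector_db, and llm are placeholders)
def answer(question: str, vector_db, llm, embed, top_k: int = 5) -> str:
    # 1. Embed the user question into the same vector space as the documents
    query_vector = embed(question)

    # 2. Retrieve the top-K most similar chunks from the vector database
    chunks = vector_db.search(query_vector, top_k=top_k)

    # 3. Inject the retrieved text into the prompt as grounding context
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 4. Generate a grounded response with the LLM
    return llm.generate(prompt)
```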
Environment and API Setup
Before writing any code, prepare your cloud and local environments:
- Pinecone Setup
  - Sign up or log in at https://www.pinecone.io.
  - Create a new index named `demo` with dimension `768` and `cosine` similarity (or create it from code; see the sketch after this list).
  - Navigate to “API Keys”, generate a key labeled `demo`, and copy the value.
- Google Gemini API
  - Access https://ai.google.dev (AI Studio).
  - Select your Google Cloud project or create one.
  - Under “Credentials”, generate an API key and store it securely (a quick key check is included in the sketch after this list).
- Local Virtual Environment

  python -m venv venv
  # Activate on macOS/Linux
  source venv/bin/activate
  # Activate on Windows
  venv\Scripts\activate.bat

  Isolating dependencies prevents version conflicts and ensures reproducible installs across machines. Remember to add `venv/` to your `.gitignore`.
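If you prefer to script these setup steps instead of clicking through the dashboards, here is a possible sketch: it creates the `demo` index with the Pinecone v3+ client and smoke-tests the Gemini key with the `google-generativeai` package (a dependency of the LlamaIndex Gemini integration). It assumes both keys are already exported as environment variables, and the serverless `cloud`/`region` values are assumptions to adjust for your account.

```python
import os
from pinecone import Pinecone, ServerlessSpec
import google.generativeai as genai

# Create the "demo" index programmatically (assumes a serverless Pinecone project;
# dimension 768 matches Gemini's embedding model, metric is cosine similarity).
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if "demo" not in pc.list_indexes().names():
    pc.create_index(
        name="demo",
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # adjust for your account
    )

# Smoke-test the Gemini API key: list the models your key can access.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
for model in genai.list_models():
    print(model.name, model.supported_generation_methods)
```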
Installing Dependencies
Install the essential Python packages using pip (quote the version specifier so your shell doesn’t interpret `>=`; pin exact versions in production for stability). With llama-index 0.10+, the Gemini and Pinecone integrations ship as separate plugin packages:
pip install "llama-index>=0.10" llama-index-llms-gemini llama-index-embeddings-gemini llama-index-vector-stores-pinecone llama-index-readers-web pinecone-client beautifulsoup4 python-dotenv
Optionally, create a `requirements.txt` by running:
pip freeze > requirements.txt
- `llama-index` (plus the `llama-index-*` plugin packages): Orchestrates ingestion, embedding, and querying.
- `pinecone-client`: Official Pinecone SDK for index management.
- `beautifulsoup4`: Parses and extracts HTML content.
- `python-dotenv`: Loads `.env` configuration into environment variables.
Keep these packages up to date, especially after major llama-index releases, which introduce new integrations for LLMs, node parsers, and vector stores.
Configuring Your Application
Store your API credentials securely and initialize clients in your `main.py`:
- Create a `.env` file:
PINECONE_API_KEY=your_pinecone_api_key
GOOGLE_API_KEY=your_google_gemini_api_key
PINECONE_ENV=us-west1-gcp # optional: only needed for legacy pod-based indexes
- Load and configure:
import os
from dotenv import load_dotenv
load_dotenv()
pinecone_key = os.getenv("PINECONE_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
# Initialize the Pinecone client (v3-style; older pinecone-client releases used pinecone.init with an environment)
from pinecone import Pinecone
pc = Pinecone(api_key=pinecone_key)
index_name = "demo"
# Configure LlamaIndex globals (llama-index >= 0.10 Settings API)
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.core import Settings
Settings.llm = Gemini(api_key=google_api_key)  # defaults to a Gemini chat model
Settings.embed_model = GeminiEmbedding(api_key=google_api_key)  # defaults to models/embedding-001 (768 dimensions, matching the index)
Settings.chunk_size = 1024  # tune for better retrieval granularity
- Validation: Confirm the Pinecone index exists:
print("Available indexes:", pinecone.list_indexes())
# Expect to see ['demo']
Handle client exceptions (for example, a missing index or an invalid API key) to catch misconfigurations early.
Ingesting and Indexing Data
Turn raw HTML into indexed vectors ready for semantic search:
- Web Scraping:
from llama_index.readers.web import BeautifulSoupWebReader
loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=["https://your-site.com/blog-post"])
This strips tags, retains text segments, and converts them into LlamaIndex `Document` objects.
- Pipeline Construction:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.pinecone import PineconeVectorStore
# Create PineconeVectorStore wrapper around the existing "demo" index
pinecone_index = pc.Index(index_name)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
# Build pipeline: split into sentence-based chunks, embed them, and upsert into Pinecone.
# You can tune chunk_size/chunk_overlap or swap in other splitter settings as needed.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=50),
        Settings.embed_model,  # embeddings must be computed before vectors are upserted
    ],
    vector_store=vector_store,
)
By default, `SentenceSplitter` chunks text while trying to keep whole sentences together. Adjust `chunk_overlap` for context continuity across chunks, or swap in a token-based node parser such as `TokenTextSplitter`.
- Execution:
pipeline.run(documents=documents)
print("Upsert complete. Check Pinecone dashboard for stored vectors.")
You can attach metadata, such as URL or title, to each document before ingestion; it is stored in Pinecone’s `metadata` field, enabling filtered queries later (see the sketch below).
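As a sketch of that workflow, you might tag each scraped document with its source URL before running the pipeline, then restrict retrieval to that source with LlamaIndex’s metadata filters. The `source_url` key is an arbitrary choice for this example.

```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Tag each document before ingestion; the metadata travels with every chunk into Pinecone.
for doc in documents:
    doc.metadata["source_url"] = "https://your-site.com/blog-post"

pipeline.run(documents=documents)  # ingest as shown above

# Later, restrict retrieval to chunks that came from that URL
# (vs_index is the VectorStoreIndex built in the next section).
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="source_url", value="https://your-site.com/blog-post")]
)
retriever = vs_index.as_retriever(similarity_top_k=5, filters=filters)
```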
Querying with RAG
Now, retrieve and serve context-enriched answers:
from llama_index.core import VectorStoreIndex
# Wrap the existing vector store into LlamaIndex’s VectorStoreIndex
vs_index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = vs_index.as_query_engine(similarity_top_k=5)
# Send a question and print the context-aware response
question = "Why should you choose LlamaIndex?"
response = query_engine.query(question)
print("Answer:", response)
– similarity_top_k: Number of nearest neighbors to fetch. Increase for broader context, decrease for precision.
– Inspect Sources: Use `response.source_nodes` to see which chunks fed the model.
– Custom Prompts: Pass a `PromptTemplate` via `text_qa_template=` to `as_query_engine()` to control how the retrieved context and question are formatted (see the sketch after this list).
– Error Handling: Implement retries for network timeouts or API rate limits, with exponential backoff.
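For the last two bullets, here is a possible sketch: a custom question-answering prompt passed via `text_qa_template`, plus a small exponential-backoff wrapper around the query call. The prompt wording and the retry parameters are illustrative choices, not part of the library.

```python
import time
from llama_index.core import PromptTemplate

# Custom QA prompt: {context_str} and {query_str} are filled in by the query engine.
qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the question using only the context above: {query_str}\n"
)
query_engine = vs_index.as_query_engine(similarity_top_k=5, text_qa_template=qa_template)

def query_with_retries(engine, question, max_attempts=4, base_delay=1.0):
    """Retry transient failures (timeouts, rate limits) with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return engine.query(question)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

response = query_with_retries(query_engine, "Why should you choose LlamaIndex?")
print("Answer:", response)
```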
Conclusion
You’ve successfully built a complete RAG pipeline that scrapes web content, ingests and indexes it into Pinecone, and queries the Google Gemini model with context-driven prompts. This architecture unlocks powerful, accurate responses tailored to your data.
What to explore next? Experiment with PDF ingestion, chain multiple data sources, or deploy your app as an API service with FastAPI or Flask.
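As a starting point for the deployment idea, here is a minimal FastAPI sketch that exposes the query engine over HTTP. It assumes `fastapi` and `uvicorn` are installed and that the configuration and ingestion code above lives in an importable module; the `rag_app` module name and `/ask` route are illustrative.

```python
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

# Assumes your existing setup (Settings, vector store, query engine) is importable,
# e.g. from a module named rag_app that exposes `query_engine`.
from rag_app import query_engine

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question):
    # Run the RAG query and return the grounded answer as JSON.
    response = query_engine.query(question.text)
    return {"answer": str(response)}

# Run locally with: uvicorn api:app --reload  (assuming this file is named api.py)
```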
- Leverage RAG to integrate custom data into your LLM workflows using LlamaIndex, Gemini, and Pinecone.