Hands-on RAG Tutorial: Building an App with LlamaIndex, Gemini, and Pinecone
Harness the power of retrieval-augmented generation (RAG) to bridge your private data with state-of-the-art LLMs. This hands-on guide walks you through building a Python app using LlamaIndex, Google Gemini, and Pinecone, covering end-to-end setup, ingestion, and querying.
Introduction to RAG with LlamaIndex, Gemini, and Pinecone
In modern AI applications, connecting a large language model like Google Gemini to your proprietary datasets unlocks new levels of utility—whether you’re automating customer support, mining insights from internal wikis, or powering domain-specific research assistants. Retrieval-Augmented Generation (RAG) makes this possible by retrieving relevant passages from your private content at query time and injecting them as context into the LLM’s prompt. In our workflow, LlamaIndex orchestrates the pipeline: ingesting text from blog posts (or any document source), chunking and indexing it, and serving it through the Pinecone vector store. By the end of this tutorial, you will understand how to scrape HTML content with BeautifulSoup, generate vector embeddings via GeminiEmbedding, and interface with Pinecone—all integrated for context-rich conversational AI.
How RAG Powers Smarter LLMs
Rather than relying exclusively on pre-trained model weights, RAG retrieves and injects real-time context from a specialized vector database. Here’s how a typical RAG flow works:
- Data Ingestion: Load documents (e.g., PDFs, webpages) into the pipeline.
- Chunking: Split text into manageable passages based on sentence boundaries or token counts.
- Embedding: Use an embedding model (GeminiEmbedding) to convert each chunk into a high-dimensional vector.
- Storage: Upsert these vectors into Pinecone under a named index.
- Retrieval: At query time, compute the query embedding and fetch the top-K most similar vectors.
- Generation: Supply the retrieved text as context to the LLM (e.g., Gemini) for response generation.
“RAG stands for Retrieval-Augmented Generation and it’s a technique or an architecture that’s used to enhance large language models.”
This pattern ensures you can serve accurate answers grounded in your own data, rather than piecing together vague or outdated knowledge.
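To make the flow concrete before wiring up any frameworks, here is a minimal sketch of the retrieve-then-generate loop in plain Python. The names `embed`, `vector_db`, and `llm` are placeholders for whatever embedding model, vector store, and LLM you use; the rest of this tutorial implements each piece with Gemini and Pinecone.

```python
# Minimal RAG loop (illustrative sketch; embed, vector_db, and llm are placeholders)
def answer(question: str, vector_db, llm, embed, top_k: int = 5) -> str:
    # 1. Embed the user question into the same vector space as the documents
    query_vector = embed(question)

    # 2. Retrieve the top-K most similar chunks from the vector database
    chunks = vector_db.search(query_vector, top_k=top_k)

    # 3. Inject the retrieved text into the prompt as grounding context
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 4. Generate a grounded response with the LLM
    return llm.generate(prompt)
```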
Environment and API Setup
Before writing any code, prepare your cloud and local environments:
- Pinecone Setup
  - Sign up or log in at https://www.pinecone.io.
  - Create a new index named `demo` with dimension `768` and `cosine` similarity (or create it from code; see the sketch after this list).
  - Navigate to “API Keys”, generate a key labeled `demo`, and copy the value.
- Google Gemini API
  - Access https://ai.google.dev (AI Studio).
  - Select your Google Cloud project or create one.
  - Under “Credentials”, generate an API key and store it securely (a quick key check is included in the sketch after this list).
- Local Virtual Environment

  python -m venv venv
  # Activate on macOS/Linux
  source venv/bin/activate
  # Activate on Windows
  venv\Scripts\activate.bat

  Isolating dependencies prevents version conflicts and ensures reproducible installs across machines. Remember to add `venv/` to your `.gitignore`.
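If you prefer to script these setup steps instead of clicking through the dashboards, here is a possible sketch: it creates the `demo` index with the Pinecone v3+ client and smoke-tests the Gemini key with the `google-generativeai` package (a dependency of the LlamaIndex Gemini integration). It assumes both keys are already exported as environment variables, and the serverless `cloud`/`region` values are assumptions to adjust for your account.

```python
import os
from pinecone import Pinecone, ServerlessSpec
import google.generativeai as genai

# Create the "demo" index programmatically (assumes a serverless Pinecone project;
# dimension 768 matches Gemini's embedding model, metric is cosine similarity).
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if "demo" not in pc.list_indexes().names():
    pc.create_index(
        name="demo",
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # adjust for your account
    )

# Smoke-test the Gemini API key: list the models your key can access.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
for model in genai.list_models():
    print(model.name, model.supported_generation_methods)
```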
Installing Dependencies
Install the essential Python packages using pip (quote the version specifier so your shell doesn’t interpret `>=`; pin exact versions in production for stability). With llama-index 0.10+, the Gemini and Pinecone integrations ship as separate plugin packages:
pip install "llama-index>=0.10" llama-index-llms-gemini llama-index-embeddings-gemini llama-index-vector-stores-pinecone llama-index-readers-web pinecone-client beautifulsoup4 python-dotenv
Optionally, create a `requirements.txt` by running:
pip freeze > requirements.txt
- `llama-index` (plus the `llama-index-*` plugin packages): Orchestrates ingestion, embedding, and querying.
- `pinecone-client`: Official Pinecone SDK for index management.
- `beautifulsoup4`: Parses and extracts HTML content.
- `python-dotenv`: Loads `.env` configuration into environment variables.
Keep these packages up to date, especially after major llama-index releases, which introduce new integrations for LLMs, node parsers, and vector stores.
Configuring Your Application
Store your API credentials securely and initialize clients in your `main.py`:
- Create a `.env` file:
PINECONE_API_KEY=your_pinecone_api_key
GOOGLE_API_KEY=your_google_gemini_api_key
PINECONE_ENV=us-west1-gcp # optional: only needed for legacy pod-based indexes
- Load and configure:
import os
from dotenv import load_dotenv
load_dotenv()
pinecone_key = os.getenv("PINECONE_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
# Initialize the Pinecone client (v3-style; older pinecone-client releases used pinecone.init with an environment)
from pinecone import Pinecone
pc = Pinecone(api_key=pinecone_key)
index_name = "demo"
# Configure LlamaIndex globals (llama-index >= 0.10 Settings API)
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.core import Settings
Settings.llm = Gemini(api_key=google_api_key)  # defaults to a Gemini chat model
Settings.embed_model = GeminiEmbedding(api_key=google_api_key)  # defaults to models/embedding-001 (768 dimensions, matching the index)
Settings.chunk_size = 1024  # tune for better retrieval granularity
- Validation: Confirm the Pinecone index exists:
print("Available indexes:", pinecone.list_indexes())
# Expect to see ['demo']
Handle client exceptions (for example, a missing index or an invalid API key) to catch misconfigurations early.
Ingesting and Indexing Data
Turn raw HTML into indexed vectors ready for semantic search:
- Web Scraping:
from llama_index.readers.web import BeautifulSoupWebReader
loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=["https://your-site.com/blog-post"])
This strips tags, retains text segments, and converts them into LlamaIndex `Document` objects.
- Pipeline Construction:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.pinecone import PineconeVectorStore
# Create PineconeVectorStore wrapper around the existing "demo" index
pinecone_index = pc.Index(index_name)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
# Build pipeline: split into sentence-based chunks, embed them, and upsert into Pinecone.
# You can tune chunk_size/chunk_overlap or swap in other splitter settings as needed.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=50),
        Settings.embed_model,  # embeddings must be computed before vectors are upserted
    ],
    vector_store=vector_store,
)
By default, `SentenceSplitter` chunks text while trying to keep whole sentences together. Adjust `chunk_overlap` for context continuity across chunks, or swap in a token-based node parser such as `TokenTextSplitter`.
- Execution:
pipeline.run(documents=documents)
print("Upsert complete. Check Pinecone dashboard for stored vectors.")
You can attach metadata, such as URL or title, to each document before ingestion; it is stored in Pinecone’s `metadata` field, enabling filtered queries later (see the sketch below).
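As a sketch of that workflow, you might tag each scraped document with its source URL before running the pipeline, then restrict retrieval to that source with LlamaIndex’s metadata filters. The `source_url` key is an arbitrary choice for this example.

```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Tag each document before ingestion; the metadata travels with every chunk into Pinecone.
for doc in documents:
    doc.metadata["source_url"] = "https://your-site.com/blog-post"

pipeline.run(documents=documents)  # ingest as shown above

# Later, restrict retrieval to chunks that came from that URL
# (vs_index is the VectorStoreIndex built in the next section).
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="source_url", value="https://your-site.com/blog-post")]
)
retriever = vs_index.as_retriever(similarity_top_k=5, filters=filters)
```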
Querying with RAG
Now, retrieve and serve context-enriched answers:
from llama_index.core import VectorStoreIndex
# Wrap the existing vector store into LlamaIndex’s VectorStoreIndex
vs_index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = vs_index.as_query_engine(similarity_top_k=5)
# Send a question and print the context-aware response
question = "Why should you choose LlamaIndex?"
response = query_engine.query(question)
print("Answer:", response)
– similarity_top_k: Number of nearest neighbors to fetch. Increase for broader context, decrease for precision.
– Inspect Sources: Use `response.source_nodes` to see which chunks fed the model.
– Custom Prompts: Pass a `PromptTemplate` via `text_qa_template=` to `as_query_engine()` to control how the retrieved context and question are formatted (see the sketch after this list).
– Error Handling: Implement retries for network timeouts or API rate limits, with exponential backoff.
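For the last two bullets, here is a possible sketch: a custom question-answering prompt passed via `text_qa_template`, plus a small exponential-backoff wrapper around the query call. The prompt wording and the retry parameters are illustrative choices, not part of the library.

```python
import time
from llama_index.core import PromptTemplate

# Custom QA prompt: {context_str} and {query_str} are filled in by the query engine.
qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the question using only the context above: {query_str}\n"
)
query_engine = vs_index.as_query_engine(similarity_top_k=5, text_qa_template=qa_template)

def query_with_retries(engine, question, max_attempts=4, base_delay=1.0):
    """Retry transient failures (timeouts, rate limits) with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return engine.query(question)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

response = query_with_retries(query_engine, "Why should you choose LlamaIndex?")
print("Answer:", response)
```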
Conclusion
You’ve successfully built a complete RAG pipeline that scrapes web content, ingests and indexes it into Pinecone, and queries the Google Gemini model with context-driven prompts. This architecture unlocks powerful, accurate responses tailored to your data.
What to explore next? Experiment with PDF ingestion, chain multiple data sources, or deploy your app as an API service with FastAPI or Flask.
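As a starting point for the deployment idea, here is a minimal FastAPI sketch that exposes the query engine over HTTP. It assumes `fastapi` and `uvicorn` are installed and that the configuration and ingestion code above lives in an importable module; the `rag_app` module name and `/ask` route are illustrative.

```python
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

# Assumes your existing setup (Settings, vector store, query engine) is importable,
# e.g. from a module named rag_app that exposes `query_engine`.
from rag_app import query_engine

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question):
    # Run the RAG query and return the grounded answer as JSON.
    response = query_engine.query(question.text)
    return {"answer": str(response)}

# Run locally with: uvicorn api:app --reload  (assuming this file is named api.py)
```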
- Leverage RAG to integrate custom data into your LLM workflows using LlamaIndex, Gemini, and Pinecone.