Document Search with Streamlit

Lab Overview

Learn how to implement document search with Streamlit and LangChain. This lab builds a conversational document search system: users upload PDFs, the app indexes their content, and users then query that content in natural language.

Lab Materials

View Lab Notebook

Key Components

Document Processing:
- PDF loading with PyPDFLoader
- Text splitting with RecursiveCharacterTextSplitter
- Chunk size: 1500 characters
- Chunk overlap: 200 characters
Embedding and Search:
- HuggingFace embeddings (all-MiniLM-L6-v2 model)
- DocArrayInMemorySearch vector store
- MMR (Maximal Marginal Relevance) search
- Configurable search parameters (k=5, fetch_k=10)
Chat Interface:
- Streaming responses
- Conversation memory
- Context retrieval display
- Message history management
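To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-window splitting with the values listed above (1500 and 200 characters). It is a simplification, not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs and sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Hypothetical sample text: 4000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(4000))
chunks = split_with_overlap(sample)
chunk_lengths = [len(c) for c in chunks]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.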

Technical Implementation

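# Core components
# Note: these are the older LangChain import paths; newer releases move
# several of them into the langchain_community package.
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import DocArrayInMemorySearch
# Needed for the text splitting step listed under Key Components
from langchain.text_splitter import RecursiveCharacterTextSplitter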

Document Processing Pipeline

File Upload:
- Accept multiple PDF files
- Create temporary directory for processing
- Load documents using PyPDFLoader
Text Processing:
- Split documents into manageable chunks
- Create embeddings using HuggingFace model
- Store in DocArray vector database
Retrieval System:
- Configure MMR search for diversity in results
- Set up conversation memory
- Initialize ChatOpenAI model
- Create conversational retrieval chain
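The MMR step in the retrieval system can be illustrated with a small pure-Python sketch. It first keeps the `fetch_k` candidates most similar to the query, then picks `k` of them one at a time, trading relevance against similarity to what is already selected. The function names, the toy two-dimensional vectors, and the `lambda_mult` weighting of 0.5 are assumptions for illustration; the lab itself relies on the vector store's built-in MMR search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k=5, fetch_k=10, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    # Step 1: shortlist the fetch_k candidates most similar to the query.
    by_relevance = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    pool = by_relevance[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/redundancy trade-off.
    while pool and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
diverse = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favour of the less similar candidate, while `lambda_mult=1.0` reduces to plain similarity ranking.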

Features

Document upload and processing
Full-text search capabilities
Result highlighting
Interactive search interface
Real-time streaming responses
Conversation history
Context-aware responses
Multiple document support

Implementation Steps

Environment Setup:

pip install langchain streamlit openai pypdf sentence_transformers docarray

Configure Streamlit Interface:
- Set up page configuration
- Create file upload widget
- Implement chat interface
- Add API key management
Document Processing:
- Implement document loading
- Configure text splitting
- Set up vector store
- Initialize retrieval system
Chat System:
- Set up conversation memory
- Configure LLM
- Implement streaming responses
- Add context retrieval display
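The streaming step can be sketched as a small handler that accumulates tokens and re-renders the partial answer each time. LangChain callback handlers expose a method named `on_llm_new_token`; the class below only mimics that shape and does not subclass anything from LangChain. The `render` callable is an assumption: in a Streamlit app it could be the `markdown` method of a placeholder created with `st.empty()`, while here it is a plain list append so the example runs without any dependencies.

```python
from typing import Callable, List


class StreamHandler:
    """Accumulates streamed tokens and re-renders the growing answer."""

    def __init__(self, render: Callable[[str], None], initial_text: str = "") -> None:
        self.render = render
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Append the new token and show the full text so far.
        self.text += token
        self.render(self.text)


# Stand-in for a UI placeholder: record every rendered state.
rendered: List[str] = []
handler = StreamHandler(render=rendered.append)
for token in ["Doc", "ument ", "search"]:
    handler.on_llm_new_token(token)
```

Rendering the full accumulated text on every token, rather than only the new token, lets the UI overwrite a single placeholder instead of appending fragments.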

Advanced Features

Streaming response handler for real-time updates
Context retrieval display for transparency
Conversation buffer memory for contextual chat
MMR search for diverse result retrieval
Temperature control for consistent responses
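As a rough illustration of what conversation buffer memory does, the sketch below keeps every turn verbatim and replays the full history as text for the next prompt. It is a simplified stand-in, not the ConversationBufferMemory class itself; the class name `ChatBuffer`, its methods, and the sample dialogue are hypothetical.

```python
from typing import List, Tuple


class ChatBuffer:
    """Keeps every turn verbatim, in order, with no summarizing or truncation."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save_turn(self, question: str, answer: str) -> None:
        # Store the user question and the model answer as two entries.
        self.turns.append(("Human", question))
        self.turns.append(("AI", answer))

    def as_text(self) -> str:
        # Flatten the history into the text that would accompany the next question.
        return "\n".join(f"{role}: {content}" for role, content in self.turns)

    def clear(self) -> None:
        self.turns.clear()


memory = ChatBuffer()
memory.save_turn("What is the report about?", "Quarterly revenue.")
memory.save_turn("Which quarter?", "Q3.")
history = memory.as_text()
```

Because the buffer grows with every turn, long sessions eventually need clearing or trimming to stay within the model's context window.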

Prerequisites

Google Colab account
OpenAI API key
Basic Python knowledge
Understanding of:
- Document processing
- Search concepts
- Web interfaces
- Vector embeddings

Best Practices

Use streaming for better user experience
Implement proper error handling
Cache resource-intensive operations
Manage conversation context effectively
Display search context for transparency
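The caching advice can be illustrated with the standard library. Streamlit reruns the script on each interaction, so expensive setup such as loading an embedding model should happen once; in a Streamlit app the `st.cache_resource` decorator serves that purpose. The sketch below uses `functools.lru_cache` as a dependency-free analogue, and the loader function and its returned dictionary are hypothetical placeholders for a real model object.

```python
from functools import lru_cache

load_calls = []  # records each time the expensive loader body actually runs


@lru_cache(maxsize=1)
def load_embedding_model(model_name: str) -> dict:
    """Pretend to load a model; the body runs only on a cache miss."""
    load_calls.append(model_name)
    return {"model_name": model_name}


first = load_embedding_model("all-MiniLM-L6-v2")
second = load_embedding_model("all-MiniLM-L6-v2")  # served from the cache
```

Both calls return the same object, and the loader body runs only once.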
