A lightweight hybrid search and retrieval system for document understanding, semantic search, and text information retrieval.
TinySearch provides an end-to-end solution for converting documents into searchable vector embeddings, with support for hybrid multi-retriever search (Vector + BM25 + Substring). It focuses on simplicity, flexibility, and efficiency, and is suited to building RAG (Retrieval-Augmented Generation) systems, semantic document search, knowledge bases, and similar applications.
- 🧩 Modular Design: Plug-and-play components for data processing, text splitting, embedding generation, and vector retrieval
- 🔍 Semantic Search: Find contextually relevant information beyond keyword matching
- 🔀 Hybrid Retrieval: Combine Vector, BM25, and Substring retrievers with configurable fusion strategies (RRF, Weighted)
- 📝 BM25 Keyword Search: Fast keyword-based retrieval with jieba Chinese tokenization support
- 🔤 Substring/Regex Search: Ctrl+F style exact match and regex pattern search
- 🏆 Cross-Encoder Reranking: Optional reranking with BGE Reranker for improved relevance
- ⚙️ Highly Configurable: Simple YAML configuration to control all aspects of the system
- 🔌 Multiple Input Formats: Support for TXT, PDF, CSV, Markdown, JSON, and custom adapters
- 🤖 Embedding Models: Integration with HuggingFace models like Qwen-Embedding and more
- 🚀 Fast Search: FAISS vector indexing with configurable metrics and index types
- 💾 Incremental Updates: Add new documents without rebuilding the entire index
- 🔄 Hot-Update: Real-time index updates when source documents change
- 🌐 API Interface: FastAPI endpoint for easy integration with other services
- 🔐 API Security: Authentication with API keys and rate limiting protection
- 🧠 Context Management: Optimize content for LLM processing with token counting and window sizing
- 📊 Response Formatting: Format results in multiple formats (Text, Markdown, JSON, HTML)
- 🛡️ Data Validation: Comprehensive utilities for ensuring data integrity
- 🖥️ Web UI: Simple web interface for search and index management
- 🧪 Extensible: Easy to add new components or customize existing ones
flowchart TB
subgraph Input
Documents["Documents<br/>(PDF, Text, CSV, JSON, etc.)"]
end
subgraph DataProcessing["Data Processing"]
DataAdapter["DataAdapter<br/>Extract text from data source"]
TextSplitter["TextSplitter<br/>Chunk text into segments"]
end
subgraph RetrieverLayer["Retriever Layer"]
VectorRetriever["VectorRetriever<br/>Embedder + FAISS"]
BM25Retriever["BM25Retriever<br/>bm25s + jieba"]
SubstringRetriever["SubstringRetriever<br/>Regex / exact match"]
end
subgraph FusionLayer["Fusion & Reranking"]
FusionStrategy["FusionStrategy<br/>(RRF / Weighted)"]
Reranker["Reranker (optional)<br/>Cross-Encoder BGE"]
end
subgraph QueryLayer["Query Layer"]
UserQuery["User Query"]
QueryEngine["QueryEngine<br/>Template / Hybrid"]
SearchResults["Search Results<br/>Ranked by relevance"]
end
subgraph FlowControl["Flow Control"]
Config["Configuration"]
FlowController["FlowController<br/>Orchestrate data flow"]
end
subgraph API["API Layer"]
CLI["Command Line Interface"]
FastAPI["FastAPI Web Service"]
end
%% Data Flow - Indexing
Documents --> DataAdapter
DataAdapter --> TextSplitter
TextSplitter --> VectorRetriever
TextSplitter --> BM25Retriever
TextSplitter --> SubstringRetriever
%% Data Flow - Querying
UserQuery --> QueryEngine
QueryEngine --> VectorRetriever
QueryEngine --> BM25Retriever
QueryEngine --> SubstringRetriever
VectorRetriever --> FusionStrategy
BM25Retriever --> FusionStrategy
SubstringRetriever --> FusionStrategy
FusionStrategy --> Reranker
Reranker --> SearchResults
%% Control Flow
Config --> FlowController
FlowController --> DataAdapter
FlowController --> TextSplitter
FlowController --> QueryEngine
%% API Flow
CLI --> FlowController
FastAPI --> FlowController
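To make the TextSplitter stage of the diagram more concrete, here is a minimal sketch of character-based chunking with overlap. It is not TinySearch's implementation: the function name `split_by_characters` is hypothetical, and the defaults simply mirror the `chunk_size: 300` and `chunk_overlap: 50` values used in the sample configuration.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not part of the TinySearch API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means the last character of each chunk reappears at the start of the next, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.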
TinySearch includes comprehensive documentation to help you get started quickly and make the most of its features:
- 📘 User Guide (Chinese)
  - Installation instructions
  - Core concepts
  - Configuration overview
  - CLI usage examples
  - API integration
  - Advanced features & patterns
  - Troubleshooting common issues
- 🧰 API Reference (Chinese)
  - Component interfaces
  - Method signatures
  - Parameter descriptions
  - Return types
  - Usage examples
  - Extension patterns
- 🏗️ Architecture Documentation (Chinese)
  - System design overview
  - Component descriptions
  - Data flow diagrams
  - Extension points
  - Design decisions
-
  - Configuration schema
  - Advanced options
  - Performance tuning
  - Component-specific settings
  - Configuration examples
- ✨ Feature Summary (Chinese)
  - Data validation utilities
  - Context window management
  - Response formatting utilities
  - Hot-update capabilities
  - Usage examples
TinySearch provides a RESTful API for integrating with other applications:
# Query the index
import requests

response = requests.post(
    "http://localhost:8000/query",
    headers={"X-API-Key": "your-api-key"},
    json={"query": "Your search query", "top_k": 5},
)
results = response.json()

API features include:
- Full text search with vector embeddings
- Index management (build, upload, stats, clear)
- Authentication with API keys
- Rate limiting protection
- Web UI for easy exploration
For API documentation, see API Guide and API Authentication Guide (Chinese).
# Basic installation (vector search only)
pip install tinysearch
# With hybrid search (BM25 + Chinese tokenization)
pip install tinysearch bm25s jieba
# With cross-encoder reranking
pip install tinysearch FlagEmbedding
# With API support
pip install tinysearch[api]
# With embedding models support
pip install tinysearch[embedders]
# With all adapters
pip install tinysearch[adapters]
# With all features
pip install tinysearch[full]

Create a config.yaml file:
adapter:
  type: text
  params:
    encoding: utf-8
splitter:
  type: character
  chunk_size: 300
  chunk_overlap: 50
embedder:
  type: huggingface
  model: Qwen/Qwen-Embedding
  device: cuda # or cpu
  normalize: true
indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine
query_engine:
  method: template
  template: "Please find information about: {query}"
  top_k: 5

Build the index:
tinysearch index --data ./your_documents --config config.yaml

Run a query:
tinysearch query --q "Your search query" --config config.yaml

Start the API server:
tinysearch-api

Then visit http://localhost:8000 in your browser to use the web interface, or send API requests programmatically:
curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query": "Your search query", "top_k": 5}'

TinySearch supports hybrid retrieval that combines multiple search strategies for better recall and precision. You can mix Vector (semantic), BM25 (keyword), and Substring (exact match) retrievers, then fuse their results using RRF or Weighted fusion.
To enable hybrid search, set query_engine.method to "hybrid" and configure the retrievers list:
# Embedding + Vector index (required for vector retriever)
embedder:
  model: Qwen/Qwen3-Embedding-0.6B
  device: cuda
indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine
# Multi-retriever configuration
retrievers:
  - type: vector # Semantic search (uses embedder + indexer above)
  - type: bm25 # Keyword search
    tokenizer: jieba # Chinese tokenization (optional, fallback: whitespace)
  - type: substring # Exact match / regex
    is_regex: false
# Fusion strategy
fusion:
  strategy: weighted # "weighted" or "rrf"
  weights: [0.5, 0.4, 0.1] # Weights for each retriever (vector, bm25, substring)
  min_score: 0.1 # Drop results below this fused score
# Optional: Cross-encoder reranking
reranker:
  enabled: false
  model: BAAI/bge-reranker-v2-m3
# Query engine
query_engine:
  method: hybrid # "template" (vector only) or "hybrid"
  top_k: 20
  recall_multiplier: 2 # Each retriever recalls top_k * 2 candidates before fusion

| Strategy | Description | Best For |
|---|---|---|
| weighted | Min-max normalize scores, then weighted sum | When you know relative retriever importance |
| rrf | Reciprocal Rank Fusion: score = Σ 1/(rank + k) | When score distributions differ (more robust) |
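The two strategies in the table can be sketched in a few lines of plain Python. This is an illustrative sketch, not TinySearch's own code: the function names `rrf_fuse` and `weighted_fuse` are hypothetical, ranks are assumed to start at 1, and `k = 60` is a commonly used RRF constant rather than a documented TinySearch default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Ranked = List[Tuple[str, float]]  # (doc_id, score) pairs, best first


def rrf_fuse(result_lists: Sequence[Ranked], k: int = 60) -> Ranked:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1 / (rank + k)."""
    fused: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (rank + k)  # raw scores are ignored, only ranks matter
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def weighted_fuse(result_lists: Sequence[Ranked], weights: Sequence[float],
                  min_score: float = 0.0) -> Ranked:
    """Min-max normalize each retriever's scores, then take a weighted sum."""
    fused: Dict[str, float] = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        if not results:
            continue
        scores = [score for _, score in results]
        low, span = min(scores), max(scores) - min(scores)
        for doc_id, score in results:
            # If all scores are equal, treat every hit as a full match.
            normalized = (score - low) / span if span > 0 else 1.0
            fused[doc_id] += weight * normalized
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score >= min_score]


# Toy results from two retrievers with very different score scales.
vector_hits = [("a", 0.92), ("b", 0.80), ("c", 0.10)]
bm25_hits = [("b", 12.0), ("c", 5.0), ("a", 3.0)]

rrf_ranking = rrf_fuse([vector_hits, bm25_hits])
weighted_ranking = weighted_fuse([vector_hits, bm25_hits], weights=[0.5, 0.5], min_score=0.2)
print([doc_id for doc_id, _ in rrf_ranking])       # ['b', 'a', 'c']
print([doc_id for doc_id, _ in weighted_ranking])  # ['b', 'a']
```

RRF never looks at raw scores, which is why it copes well with retrievers whose score ranges differ (cosine similarity versus BM25). Weighted fusion keeps score magnitude information but depends on the normalization step and on sensible weights.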
from tinysearch.base import TextChunk
from tinysearch.retrievers import VectorRetriever, BM25Retriever, SubstringRetriever
from tinysearch.fusion import WeightedFusion, ReciprocalRankFusion
from tinysearch.query import HybridQueryEngine

# Build retrievers
bm25 = BM25Retriever()
bm25.build(chunks)  # chunks: List[TextChunk], prepared beforehand
substr = SubstringRetriever(is_regex=False)
substr.build(chunks)

# Create hybrid engine
engine = HybridQueryEngine(
    retrievers=[bm25, substr],
    fusion_strategy=WeightedFusion(weights=[0.7, 0.3]),
    recall_multiplier=2,
)
results = engine.retrieve("搜索关键词", top_k=10)  # query text means "search keywords"
# Each result: {"text", "metadata", "score", "retrieval_method": "hybrid", "scores": {...}}

Hybrid search results include per-retriever scores for transparency:
{
    "text": "...",                 # Chunk text
    "metadata": {...},             # Source metadata
    "score": 0.85,                 # Final fused score [0, 1]
    "retrieval_method": "hybrid",
    "scores": {                    # Per-retriever original scores
        "vector": 0.92,
        "bm25": 0.75,
    }
}

| Package | Purpose | Required? |
|---|---|---|
| bm25s | BM25 retrieval engine | Only for BM25Retriever |
| jieba | Chinese tokenization | Optional (fallback: whitespace split) |
| FlagEmbedding | Cross-encoder reranking | Optional (only for reranker) |
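The "optional with whitespace fallback" behavior listed for jieba can be pictured with the sketch below. It is an assumption about how such a fallback might be written, not TinySearch's actual tokenizer, and the function name `tokenize` is hypothetical. The only jieba call used is `jieba.lcut`, which returns a list of tokens.

```python
from typing import List

try:
    import jieba  # optional dependency: pip install jieba
except ImportError:
    jieba = None  # fall back to whitespace splitting below


def tokenize(text: str) -> List[str]:
    """Tokenize with jieba when it is installed, otherwise split on whitespace.

    Hypothetical helper for illustration; not part of the TinySearch API.
    """
    if jieba is not None:
        # jieba.lcut also returns whitespace tokens, so filter them out.
        return [token for token in jieba.lcut(text) if token.strip()]
    return text.split()


print(tokenize("hello world"))  # ['hello', 'world'] with or without jieba
```

Whitespace splitting is adequate for space-delimited languages but treats an unsegmented Chinese sentence as a single token, which is why jieba is worth installing for Chinese corpora.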
Install with:
pip install bm25s jieba # For BM25 + Chinese support
pip install FlagEmbedding # For cross-encoder reranking

If you don't configure retrievers or keep query_engine.method: "template", TinySearch behaves exactly as before: pure vector search with no additional dependencies.
TinySearch includes various example scripts in the examples/ directory to demonstrate different features:
- simple_example.py: Basic usage of TinySearch
- flow_example.py: Demonstrates the flow controller
- faiss_gpu_demo.py: Shows how to use GPU acceleration with FAISS
- api_auth_demo.py: Demonstrates API authentication
- web_ui_demo.py: Shows how to use the web UI
The advanced_features_demo.py script showcases the full capabilities of TinySearch with maximum customization:
- Data Validation Utilities
  - Path validation
  - File extension validation
  - Configuration validation
  - Embedding dimension validation
- Context Window Management
  - Token counting
  - Window sizing
  - Context generation for queries
- Response Formatting
  - Plain text formatting
  - Markdown formatting
  - JSON formatting
  - HTML formatting
  - Custom formatting with term highlighting
- Hot-Update Capabilities
  - Real-time monitoring of document changes
  - Automatic index updates
  - Custom update callbacks
- Custom Components
  - Multi-format data adapter
  - Custom response formatter
  - Custom reranking function
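As a rough illustration of the token counting and window sizing items above, the following sketch greedily packs ranked chunks into a fixed token budget. It does not reproduce the demo script: `count_tokens` and `fit_to_window` are hypothetical names, and counting whitespace-separated words is a deliberately crude stand-in for a real tokenizer (it undercounts for languages such as Chinese that are written without spaces).

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words (an assumption)."""
    return len(text.split())


def fit_to_window(chunks: List[str], max_tokens: int,
                  counter: Callable[[str], int] = count_tokens) -> List[str]:
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = counter(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a b c", "d e", "f g h i"]
print(fit_to_window(ranked_chunks, max_tokens=5))  # ['a b c', 'd e']
```

Passing a different `counter` (for example, one backed by the target LLM's tokenizer) changes the budget accounting without touching the packing logic.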
cd /path/to/TinySearch
python examples/advanced_features_demo.py

The demo requires additional dependencies:
pip install torch sentence-transformers faiss-cpu watchdog

- 🔎 Semantic document search: Find contextually relevant information beyond simple keyword matching
- 📚 Knowledge base retrieval: Build intelligent Q&A systems based on your documents
- 🤖 RAG systems: Enhance LLM outputs with relevant document context
- 📊 Recommendation systems: Find similar items based on their content
- 🧠 Content organization: Automatically categorize and group related documents
The TinySearch Web UI provides an intuitive interface for interacting with the system:
- Search Interface: Easily search through your documents with customizable parameters
- Index Management: Upload documents, build indexes, and manage your data
- Statistics Dashboard: View insights about your index and processed documents
- Authentication Management: Manage API keys and security settings
You can create custom data adapters by implementing the DataAdapter interface.
Each adapter handles extraction from a single file; directory traversal is
handled automatically by the framework:
from tinysearch.base import DataAdapter

class MyAdapter(DataAdapter):
    def extract(self, filepath):
        # Extract text from a single file and return it as a list of strings
        # (here, illustratively, the whole file as one text)
        with open(filepath, encoding="utf-8") as f:
            return [f.read()]

Then configure it in your config.yaml:
adapter:
  type: custom
  params:
    module: my_module
    class: MyAdapter
    init:
      param1: value1

When building an index from a directory, TinySearch uses a centralized file
discovery mechanism (iter_input_files()). Each adapter type has default
file extensions:
| Adapter | Extensions |
|---|---|
| text | .txt, .text, .md, .py, .js, .html, .css, .json |
| pdf | .pdf |
| csv | .csv |
| markdown | .md, .markdown, .mdown, .mkdn |
| json | .json |
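The table can be read as a lookup from adapter type to allowed suffixes, applied during a recursive directory walk. The sketch below shows that idea with the standard library only. It is not the source of `iter_input_files()`: the function name `discover_files` and the `DEFAULT_EXTENSIONS` constant are hypothetical, and the real function's matching rules and ordering may differ.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Optional, Sequence, Union

# Default extensions per adapter type, copied from the table above.
DEFAULT_EXTENSIONS: Dict[str, List[str]] = {
    "text": [".txt", ".text", ".md", ".py", ".js", ".html", ".css", ".json"],
    "pdf": [".pdf"],
    "csv": [".csv"],
    "markdown": [".md", ".markdown", ".mdown", ".mkdn"],
    "json": [".json"],
}


def discover_files(root: Union[str, Path], adapter_type: str,
                   extensions: Optional[Sequence[str]] = None) -> Iterator[Path]:
    """Yield files under root whose suffix is allowed for the given adapter type.

    Hypothetical stand-in for illustration; not TinySearch's iter_input_files().
    """
    allowed = {ext.lower() for ext in (extensions or DEFAULT_EXTENSIONS[adapter_type])}
    root_path = Path(root)
    if root_path.is_file():
        # A single file is accepted only if its extension matches.
        if root_path.suffix.lower() in allowed:
            yield root_path
        return
    # Sort for a deterministic order across platforms.
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix.lower() in allowed:
            yield path
```

Calling `discover_files("./data", "text", extensions=[".txt", ".log"])` would mirror the effect of overriding `extensions` in the configuration: the explicit list replaces the adapter's defaults rather than extending them (again, an assumption about the intended semantics).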
You can override extensions in config.yaml:
adapter:
  type: text
  params:
    extensions: [".txt", ".log"]

Or use iter_input_files() programmatically:
from tinysearch.utils.file_discovery import iter_input_files

# `adapter` is an already-constructed DataAdapter instance
for file_path in iter_input_files("./data", adapter_type="text"):
    texts = adapter.extract(file_path)

TinySearch features a modern logging system powered by loguru, with colorful output and flexible configuration options.
- 🎨 Colorful Output: Modern, emoji-enhanced logging with syntax highlighting
- ⚙️ Flexible Configuration: Easily customize log levels, formats, and output destinations
- 📁 File Logging: Optional file output with automatic rotation and compression
- 🎯 Multiple Formats: Choose from modern, simple, or detailed output formats
- 🔧 Runtime Configuration: Configure logging through YAML/JSON config files
Configure logging in your config.yaml:
logging:
  # Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: "INFO"
  # Format style: modern, simple, detailed
  format: "modern"
  # Whether to show timestamps and file locations
  show_time: true
  show_location: false
  # Whether to use colored output
  colorize: true
  # Optional file logging
  file: "logs/tinysearch.log"
  file_level: "DEBUG"

Modern Format (default):
15:30:45 | INFO | 🚀 Building index from data/documents
15:30:46 | INFO | 📄 Extracted 150 documents
15:30:47 | INFO | ✂️ Created 450 text chunks
Simple Format:
15:30:45 | INFO | Building index from data/documents
15:30:46 | INFO | Extracted 150 documents
Detailed Format:
2024-01-15 15:30:45 | INFO | cli:build_index:172 | Building index from data/documents
2024-01-15 15:30:46 | INFO | cli:build_index:183 | Extracted 150 documents
from tinysearch.logger import get_logger, configure_logger, log_success

# Configure logging
configure_logger({
    "logging": {
        "level": "INFO",
        "format": "modern",
        "colorize": True
    }
})

# Get a logger
logger = get_logger("my_component")

# Use convenience functions
log_success("Operation completed!")

To set up the development environment:
# Clone the repository
git clone https://github.com/yourusername/tinysearch.git
cd tinysearch
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest

This project is licensed under the MIT License - see the LICENSE file for details.
vector search, semantic search, hybrid search, BM25, document retrieval, embeddings, FAISS, RAG, information retrieval, text search, vector database, document understanding, NLP, natural language processing, AI search, reranking, fusion