CodePothunter/TinySearch


TinySearch

A lightweight hybrid search and retrieval system for document understanding, semantic search, and text information retrieval.

TinySearch provides an end-to-end pipeline for converting documents into searchable vector embeddings, with support for hybrid multi-retriever search (Vector + BM25 + Substring). It focuses on simplicity, flexibility, and efficiency, and is suited to building RAG (Retrieval-Augmented Generation) systems, semantic document search, and knowledge bases.

Key Features

  • 🧩 Modular Design: Plug-and-play components for data processing, text splitting, embedding generation, and vector retrieval
  • 🔍 Semantic Search: Find contextually relevant information beyond keyword matching
  • 🔀 Hybrid Retrieval: Combine Vector, BM25, and Substring retrievers with configurable fusion strategies (RRF, Weighted)
  • 📝 BM25 Keyword Search: Fast keyword-based retrieval with jieba Chinese tokenization support
  • 🔤 Substring/Regex Search: Ctrl+F style exact match and regex pattern search
  • 🏆 Cross-Encoder Reranking: Optional reranking with BGE Reranker for improved relevance
  • ⚙️ Highly Configurable: Simple YAML configuration to control all aspects of the system
  • 🔌 Multiple Input Formats: Support for TXT, PDF, CSV, Markdown, JSON, and custom adapters
  • 🤖 Embedding Models: Integration with HuggingFace models like Qwen-Embedding and more
  • 🚀 Fast Search: FAISS vector indexing with configurable metrics and index types
  • 💾 Incremental Updates: Add new documents without rebuilding the entire index
  • 🔄 Hot-Update: Real-time index updates when source documents change
  • 🌐 API Interface: FastAPI endpoint for easy integration with other services
  • 🔐 API Security: Authentication with API keys and rate limiting protection
  • 🧠 Context Management: Optimize content for LLM processing with token counting and window sizing
  • 📊 Response Formatting: Format results in multiple formats (Text, Markdown, JSON, HTML)
  • 🛡️ Data Validation: Comprehensive utilities for ensuring data integrity
  • 🖥️ Web UI: Simple web interface for search and index management
  • 🧪 Extensible: Easy to add new components or customize existing ones

System Architecture

flowchart TB
    subgraph Input
        Documents["Documents<br/>(PDF, Text, CSV, JSON, etc.)"]
    end

    subgraph DataProcessing["Data Processing"]
        DataAdapter["DataAdapter<br/>Extract text from data source"]
        TextSplitter["TextSplitter<br/>Chunk text into segments"]
    end

    subgraph RetrieverLayer["Retriever Layer"]
        VectorRetriever["VectorRetriever<br/>Embedder + FAISS"]
        BM25Retriever["BM25Retriever<br/>bm25s + jieba"]
        SubstringRetriever["SubstringRetriever<br/>Regex / exact match"]
    end

    subgraph FusionLayer["Fusion & Reranking"]
        FusionStrategy["FusionStrategy<br/>(RRF / Weighted)"]
        Reranker["Reranker (optional)<br/>Cross-Encoder BGE"]
    end

    subgraph QueryLayer["Query Layer"]
        UserQuery["User Query"]
        QueryEngine["QueryEngine<br/>Template / Hybrid"]
        SearchResults["Search Results<br/>Ranked by relevance"]
    end

    subgraph FlowControl["Flow Control"]
        Config["Configuration"]
        FlowController["FlowController<br/>Orchestrate data flow"]
    end

    subgraph API["API Layer"]
        CLI["Command Line Interface"]
        FastAPI["FastAPI Web Service"]
    end

    %% Data Flow - Indexing
    Documents --> DataAdapter
    DataAdapter --> TextSplitter
    TextSplitter --> VectorRetriever
    TextSplitter --> BM25Retriever
    TextSplitter --> SubstringRetriever

    %% Data Flow - Querying
    UserQuery --> QueryEngine
    QueryEngine --> VectorRetriever
    QueryEngine --> BM25Retriever
    QueryEngine --> SubstringRetriever
    VectorRetriever --> FusionStrategy
    BM25Retriever --> FusionStrategy
    SubstringRetriever --> FusionStrategy
    FusionStrategy --> Reranker
    Reranker --> SearchResults

    %% Control Flow
    Config --> FlowController
    FlowController --> DataAdapter
    FlowController --> TextSplitter
    FlowController --> QueryEngine

    %% API Flow
    CLI --> FlowController
    FastAPI --> FlowController

Documentation

TinySearch includes comprehensive documentation to help you get started quickly and make the most of its features:

  • 📘 User Guide (Chinese)

    • Installation instructions
    • Core concepts
    • Configuration overview
    • CLI usage examples
    • API integration
    • Advanced features & patterns
    • Troubleshooting common issues
  • 🧰 API Reference (Chinese)

    • Component interfaces
    • Method signatures
    • Parameter descriptions
    • Return types
    • Usage examples
    • Extension patterns
  • 🏗️ Architecture Documentation (Chinese)

    • System design overview
    • Component descriptions
    • Data flow diagrams
    • Extension points
    • Design decisions
  • ⚙️ Configuration Guide (Chinese)

    • Configuration schema
    • Advanced options
    • Performance tuning
    • Component-specific settings
    • Configuration examples
  • ✨ Feature Summary (Chinese)

    • Data validation utilities
    • Context window management
    • Response formatting utilities
    • Hot-update capabilities
    • Usage examples

API

TinySearch provides a RESTful API for integrating with other applications:

# Query the index
import requests
response = requests.post(
    "http://localhost:8000/query",
    headers={"X-API-Key": "your-api-key"},
    json={"query": "Your search query", "top_k": 5}
)
results = response.json()

API features include:

  • Full text search with vector embeddings
  • Index management (build, upload, stats, clear)
  • Authentication with API keys
  • Rate limiting protection
  • Web UI for easy exploration

For API documentation, see the API Guide and the API Authentication Guide (Chinese).

Installation

# Basic installation (vector search only)
pip install tinysearch

# With hybrid search (BM25 + Chinese tokenization)
pip install tinysearch bm25s jieba

# With cross-encoder reranking
pip install tinysearch FlagEmbedding

# With API support (quotes keep shells such as zsh from expanding the brackets)
pip install "tinysearch[api]"

# With embedding models support
pip install "tinysearch[embedders]"

# With all adapters
pip install "tinysearch[adapters]"

# With all features
pip install "tinysearch[full]"

Quick Start

1. Create a configuration file

Create a config.yaml file:

adapter:
  type: text
  params:
    encoding: utf-8

splitter:
  type: character
  chunk_size: 300
  chunk_overlap: 50

embedder:
  type: huggingface
  model: Qwen/Qwen3-Embedding-0.6B
  device: cuda  # or cpu
  normalize: true

indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine

query_engine:
  method: template
  template: "Please find information about: {query}"
  top_k: 5
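
To make the splitter settings concrete, here is a minimal sketch of how a character splitter with chunk_size and chunk_overlap could cut text into overlapping windows. The function name split_by_characters is hypothetical and this is not TinySearch's own splitter; it only illustrates what the two parameters mean.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(700))
chunks = split_by_characters(sample)
print([len(c) for c in chunks])
```

With the values from the configuration above, each chunk shares its last 50 characters with the start of the next one, which reduces the chance that a sentence is cut off at a chunk boundary without context.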

2. Build an index

tinysearch index --data ./your_documents --config config.yaml

3. Query the index

tinysearch query --q "Your search query" --config config.yaml

4. Start the API server with web UI

tinysearch-api

Then visit http://localhost:8000 in your browser to use the web interface, or send API requests programmatically:

curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query": "Your search query", "top_k": 5}'
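
The same request can be built from Python without extra dependencies. The sketch below uses only the standard library and reuses the /query endpoint and X-API-Key header shown elsewhere in this README; the helper name build_query_request is hypothetical, and whether the key is required depends on how authentication is configured on your server.

```python
import json
import urllib.request
from typing import Optional


def build_query_request(
    query: str,
    top_k: int = 5,
    api_key: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the /query endpoint; the API key header is optional."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/query", data=body, headers=headers, method="POST")


request = build_query_request("Your search query", top_k=5, api_key="your-api-key")

# Sending the request needs a running tinysearch-api server:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```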

Hybrid Search

TinySearch supports hybrid retrieval that combines multiple search strategies for better recall and precision. You can mix Vector (semantic), BM25 (keyword), and Substring (exact match) retrievers, then fuse their results using RRF or Weighted fusion.

Hybrid Search Configuration

To enable hybrid search, set query_engine.method to "hybrid" and configure the retrievers list:

# Embedding + Vector index (required for vector retriever)
embedder:
  model: Qwen/Qwen3-Embedding-0.6B
  device: cuda
indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine

# Multi-retriever configuration
retrievers:
  - type: vector          # Semantic search (uses embedder + indexer above)
  - type: bm25            # Keyword search
    tokenizer: jieba       # Chinese tokenization (optional, fallback: whitespace)
  - type: substring        # Exact match / regex
    is_regex: false

# Fusion strategy
fusion:
  strategy: weighted       # "weighted" or "rrf"
  weights: [0.5, 0.4, 0.1]  # Weights for each retriever (vector, bm25, substring)
  min_score: 0.1           # Drop results below this fused score

# Optional: Cross-encoder reranking
reranker:
  enabled: false
  model: BAAI/bge-reranker-v2-m3

# Query engine
query_engine:
  method: hybrid           # "template" (vector only) or "hybrid"
  top_k: 20
  recall_multiplier: 2     # Each retriever recalls top_k * 2 candidates before fusion
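
The interaction between top_k, recall_multiplier, and min_score can be sketched as follows. Both helper functions are hypothetical and only mirror the comments in the configuration above; they are not TinySearch's query engine.

```python
from typing import Dict, List, Tuple


def candidates_per_retriever(top_k: int, recall_multiplier: int) -> int:
    """Number of candidates each retriever is asked for before fusion."""
    return top_k * recall_multiplier


def finalize(fused: Dict[str, float], top_k: int, min_score: float) -> List[Tuple[str, float]]:
    """Drop results below min_score, then keep the top_k highest fused scores."""
    kept = [(doc, score) for doc, score in fused.items() if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


print(candidates_per_retriever(top_k=20, recall_multiplier=2))
print(finalize({"a": 0.9, "b": 0.05, "c": 0.4}, top_k=20, min_score=0.1))
```

Recalling more candidates than top_k gives the fusion step room to promote documents that only one retriever ranked highly.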

Fusion Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| weighted | Min-max normalize scores, then weighted sum | When you know relative retriever importance |
| rrf | Reciprocal Rank Fusion: score = Σ 1/(rank + k) | When score distributions differ (more robust) |
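
The two strategies can be illustrated with a short standalone sketch. It operates on plain dicts that map a document id to a score, and the function names are hypothetical rather than the tinysearch.fusion API. The default k = 60 is a common convention for RRF and an assumption here; this README does not state the value TinySearch uses.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def weighted_fusion(runs: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Weighted sum of min-max normalized scores; a missing document contributes 0."""
    fused: Dict[str, float] = {}
    for run, weight in zip(runs, weights):
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    return fused


def rrf(runs: List[Dict[str, float]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum of 1 / (rank + k), with ranks starting at 1."""
    fused: Dict[str, float] = {}
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (rank + k)
    return fused


# Toy scores on different scales: cosine similarity vs. raw BM25.
vector_scores = {"a": 0.92, "b": 0.80, "c": 0.40}
bm25_scores = {"b": 12.0, "c": 7.0, "d": 2.0}

weighted = weighted_fusion([vector_scores, bm25_scores], weights=[0.6, 0.4])
ranked_rrf = rrf([vector_scores, bm25_scores])
print(sorted(weighted, key=weighted.get, reverse=True))
print(sorted(ranked_rrf, key=ranked_rrf.get, reverse=True))
```

Weighted fusion depends on the normalized score values, while RRF only looks at rank positions, which is why it tolerates retrievers whose raw scores are not comparable.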

Programmatic Usage

from tinysearch.base import TextChunk
from tinysearch.retrievers import VectorRetriever, BM25Retriever, SubstringRetriever
from tinysearch.fusion import WeightedFusion, ReciprocalRankFusion
from tinysearch.query import HybridQueryEngine

# Build retrievers
# `chunks` must be a List[TextChunk], for example the output of your text splitter
bm25 = BM25Retriever()
bm25.build(chunks)

substr = SubstringRetriever(is_regex=False)
substr.build(chunks)

# Create hybrid engine
engine = HybridQueryEngine(
    retrievers=[bm25, substr],
    fusion_strategy=WeightedFusion(weights=[0.7, 0.3]),
    recall_multiplier=2,
)

results = engine.retrieve("搜索关键词", top_k=10)  # the query means "search keywords"
# Each result: {"text", "metadata", "score", "retrieval_method": "hybrid", "scores": {...}}

Result Format

Hybrid search results include per-retriever scores for transparency:

{
    "text": "...",               # Chunk text
    "metadata": {...},           # Source metadata
    "score": 0.85,               # Final fused score [0, 1]
    "retrieval_method": "hybrid",
    "scores": {                  # Per-retriever original scores
        "vector": 0.92,
        "bm25": 0.75,
    }
}
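
As a small usage sketch, the per-retriever scores can be rendered next to the fused score when inspecting why a chunk was returned. The summarize helper is hypothetical; it relies only on the keys shown in the result format above, and the example values are made up for illustration.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any], preview_chars: int = 60) -> str:
    """Render one hybrid result as a single line with its per-retriever scores."""
    parts = ", ".join(
        f"{name}={score:.2f}" for name, score in sorted(result["scores"].items())
    )
    preview = result["text"][:preview_chars].replace("\n", " ")
    return f"{result['score']:.2f} [{parts}] {preview}"


example = {
    "text": "FAISS builds the vector index.",
    "metadata": {"source": "docs/index.md"},
    "score": 0.85,
    "retrieval_method": "hybrid",
    "scores": {"vector": 0.92, "bm25": 0.75},
}
print(summarize(example))  # prints: 0.85 [bm25=0.75, vector=0.92] FAISS builds the vector index.
```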

Optional Dependencies for Hybrid Search

| Package | Purpose | Required? |
|---------|---------|-----------|
| bm25s | BM25 retrieval engine | Only for BM25Retriever |
| jieba | Chinese tokenization | Optional (fallback: whitespace split) |
| FlagEmbedding | Cross-encoder reranking | Optional (only for reranker) |

Install with:

pip install bm25s jieba              # For BM25 + Chinese support
pip install FlagEmbedding            # For cross-encoder reranking
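
The optional-dependency pattern behind the jieba fallback can be sketched like this. The make_tokenizer helper is hypothetical and shows only the general idea of falling back to a whitespace split when jieba is not installed; TinySearch's BM25Retriever may handle this differently.

```python
from typing import Callable, List


def make_tokenizer() -> Callable[[str], List[str]]:
    """Return a jieba-based tokenizer when jieba is installed, else a whitespace split."""
    try:
        import jieba  # optional dependency
    except ImportError:
        return lambda text: text.split()
    # jieba.lcut keeps whitespace as separate tokens, so filter those out.
    return lambda text: [token for token in jieba.lcut(text) if token.strip()]


tokenize = make_tokenizer()
print(tokenize("hybrid search engine"))
```

Whitespace splitting works for space-delimited languages but leaves a Chinese sentence as a single token, which is why jieba is recommended for Chinese corpora.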

Backward Compatibility

If you omit the retrievers section or keep query_engine.method set to "template", TinySearch behaves exactly as before: pure vector search with no additional dependencies.

Examples

TinySearch includes various example scripts in the examples/ directory to demonstrate different features:

Basic Examples

  • simple_example.py: Basic usage of TinySearch
  • flow_example.py: Demonstrates the flow controller
  • faiss_gpu_demo.py: Shows how to use GPU acceleration with FAISS
  • api_auth_demo.py: Demonstrates API authentication
  • web_ui_demo.py: Shows how to use the web UI

Advanced Features Demo

The advanced_features_demo.py script showcases the full capabilities of TinySearch with maximum customization:

Features Demonstrated

  1. Data Validation Utilities

    • Path validation
    • File extension validation
    • Configuration validation
    • Embedding dimension validation
  2. Context Window Management

    • Token counting
    • Window sizing
    • Context generation for queries
  3. Response Formatting

    • Plain text formatting
    • Markdown formatting
    • JSON formatting
    • HTML formatting
    • Custom formatting with term highlighting
  4. Hot-Update Capabilities

    • Real-time monitoring of document changes
    • Automatic index updates
    • Custom update callbacks
  5. Custom Components

    • Multi-format data adapter
    • Custom response formatter
    • Custom reranking function
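
To illustrate the context window management item above, the following sketch packs ranked chunks into a fixed token budget. The fit_to_window function is hypothetical, and counting tokens by whitespace split is a rough stand-in for a real tokenizer; the demo script's own utilities may work differently.

```python
from typing import Callable, List


def fit_to_window(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_to_window(ranked_chunks, max_tokens=5))
```

Passing a different count_tokens callable lets the same loop work with a model-specific tokenizer.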

Running the Demo

cd /path/to/TinySearch
python examples/advanced_features_demo.py

The demo requires additional dependencies:

pip install torch sentence-transformers faiss-cpu watchdog

Use Cases

  • 🔎 Semantic document search: Find contextually relevant information beyond simple keyword matching
  • 📚 Knowledge base retrieval: Build intelligent Q&A systems based on your documents
  • 🤖 RAG systems: Enhance LLM outputs with relevant document context
  • 📊 Recommendation systems: Find similar items based on their content
  • 🧠 Content organization: Automatically categorize and group related documents

Features of the Web UI

The TinySearch Web UI provides an intuitive interface for interacting with the system:

  • Search Interface: Easily search through your documents with customizable parameters
  • Index Management: Upload documents, build indexes, and manage your data
  • Statistics Dashboard: View insights about your index and processed documents
  • Authentication Management: Manage API keys and security settings

Custom Data Adapters

You can create custom data adapters by implementing the DataAdapter interface. Each adapter extracts text from a single file; the framework handles directory traversal automatically:

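from tinysearch.base import DataAdapter

class MyAdapter(DataAdapter):
    def extract(self, filepath):
        # Extract text from a single file and return it as a list of strings.
        # Here the whole file becomes one text; split it further if needed.
        with open(filepath, encoding="utf-8") as f:
            return [f.read()]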

Then configure it in your config.yaml:

adapter:
  type: custom
  params:
    module: my_module
    class: MyAdapter
    init:
      param1: value1
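
A configuration of this shape is typically resolved with a dynamic import. The sketch below shows that general pattern with importlib; the load_custom_component function is hypothetical and is not taken from TinySearch's source. It is demonstrated with a standard-library class so that it runs without my_module.

```python
import importlib
from typing import Any, Dict


def load_custom_component(params: Dict[str, Any]) -> Any:
    """Import params["module"], look up params["class"], and call it with params["init"]."""
    module = importlib.import_module(params["module"])
    cls = getattr(module, params["class"])
    return cls(**params.get("init", {}))


# Stand-in for {"module": "my_module", "class": "MyAdapter", "init": {"param1": "value1"}}
component = load_custom_component(
    {"module": "fractions", "class": "Fraction", "init": {"numerator": 3, "denominator": 4}}
)
print(component)  # prints: 3/4
```

With this pattern, the keys under init become keyword arguments of the class constructor, so a custom adapter would need an __init__ that accepts them.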

Directory Indexing

When building an index from a directory, TinySearch uses a centralized file discovery mechanism (iter_input_files()). Each adapter type has default file extensions:

| Adapter | Extensions |
|---------|------------|
| text | .txt, .text, .md, .py, .js, .html, .css, .json |
| pdf | .pdf |
| csv | .csv |
| markdown | .md, .markdown, .mdown, .mkdn |
| json | .json |

You can override extensions in config.yaml:

adapter:
  type: text
  params:
    extensions: [".txt", ".log"]

Or use iter_input_files() programmatically:

from tinysearch.utils.file_discovery import iter_input_files

# `adapter` is any DataAdapter instance, such as MyAdapter from the Custom Data Adapters section
for file_path in iter_input_files("./data", adapter_type="text"):
    texts = adapter.extract(file_path)
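
For readers who want to see what extension-based discovery involves, here is a standalone sketch built on pathlib. The discover_files function is hypothetical and is not the implementation of iter_input_files(); the default extensions are copied from the table above, and details such as recursion and sort order are assumptions.

```python
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Optional, Union

# Default extensions per adapter type, copied from the table above.
DEFAULT_EXTENSIONS: Dict[str, List[str]] = {
    "text": [".txt", ".text", ".md", ".py", ".js", ".html", ".css", ".json"],
    "pdf": [".pdf"],
    "csv": [".csv"],
    "markdown": [".md", ".markdown", ".mdown", ".mkdn"],
    "json": [".json"],
}


def discover_files(
    root: Union[str, Path],
    adapter_type: str = "text",
    extensions: Optional[Iterable[str]] = None,
) -> Iterator[Path]:
    """Yield files under root whose suffix matches the adapter's extensions."""
    allowed = {ext.lower() for ext in (extensions or DEFAULT_EXTENSIONS[adapter_type])}
    root = Path(root)
    if root.is_file():
        # A single file is accepted only if its extension matches.
        if root.suffix.lower() in allowed:
            yield root
        return
    # Walk the directory tree in a stable, sorted order.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in allowed:
            yield path
```

Passing extensions=[".txt", ".log"] overrides the defaults in the same way as the extensions key in config.yaml.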

Modern Logging System

TinySearch logs through loguru, with colorized output and configurable levels, formats, and output destinations.

Logging Features

  • 🎨 Colorful Output: Modern, emoji-enhanced logging with syntax highlighting
  • ⚙️ Flexible Configuration: Easily customize log levels, formats, and output destinations
  • 📁 File Logging: Optional file output with automatic rotation and compression
  • 🎯 Multiple Formats: Choose from modern, simple, or detailed output formats
  • 🔧 Runtime Configuration: Configure logging through YAML/JSON config files

Logging Configuration

Configure logging in your config.yaml:

logging:
  # Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: "INFO"

  # Format style: modern, simple, detailed
  format: "modern"

  # Whether to show timestamps and file locations
  show_time: true
  show_location: false

  # Whether to use colored output
  colorize: true

  # Optional file logging
  file: "logs/tinysearch.log"
  file_level: "DEBUG"

Format Examples

Modern Format (default):

15:30:45 | INFO  | 🚀 Building index from data/documents
15:30:46 | INFO  | 📄 Extracted 150 documents
15:30:47 | INFO  | ✂️  Created 450 text chunks

Simple Format:

15:30:45 | INFO  | Building index from data/documents
15:30:46 | INFO  | Extracted 150 documents

Detailed Format:

2024-01-15 15:30:45 | INFO     | cli:build_index:172 | Building index from data/documents
2024-01-15 15:30:46 | INFO     | cli:build_index:183 | Extracted 150 documents

Programmatic Usage

from tinysearch.logger import get_logger, configure_logger, log_success

# Configure logging
configure_logger({
    "logging": {
        "level": "INFO",
        "format": "modern",
        "colorize": True
    }
})

# Get a logger and write a message
logger = get_logger("my_component")
logger.info("Component initialized")

# Use convenience functions
log_success("Operation completed!")

Development

To set up the development environment:

# Clone the repository
git clone https://github.com/CodePothunter/TinySearch.git
cd TinySearch

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Keywords

vector search, semantic search, hybrid search, BM25, document retrieval, embeddings, FAISS, RAG, information retrieval, text search, vector database, document understanding, NLP, natural language processing, AI search, reranking, fusion
