222 changes: 193 additions & 29 deletions README.md
@@ -1,13 +1,17 @@
# TinySearch

**A lightweight vector search and retrieval system for document understanding, semantic search, and text information retrieval.**
**A lightweight hybrid search and retrieval system for document understanding, semantic search, and text information retrieval.**

TinySearch provides an end-to-end solution for converting documents into searchable vector embeddings, with a focus on simplicity, flexibility, and efficiency. Perfect for building RAG (Retrieval-Augmented Generation) systems, semantic document search, knowledge bases, and more.
TinySearch provides an end-to-end solution for converting documents into searchable vector embeddings, with support for hybrid multi-retriever search (Vector + BM25 + Substring). It focuses on simplicity, flexibility, and efficiency, and is well suited to building RAG (Retrieval-Augmented Generation) systems, semantic document search, knowledge bases, and more.

## Key Features

- 🧩 **Modular Design**: Plug-and-play components for data processing, text splitting, embedding generation, and vector retrieval
- 🔍 **Semantic Search**: Find contextually relevant information beyond keyword matching
- 🔀 **Hybrid Retrieval**: Combine Vector, BM25, and Substring retrievers with configurable fusion strategies (RRF, Weighted)
- 📝 **BM25 Keyword Search**: Fast keyword-based retrieval with jieba Chinese tokenization support
- 🔤 **Substring/Regex Search**: Ctrl+F style exact match and regex pattern search
- 🏆 **Cross-Encoder Reranking**: Optional reranking with BGE Reranker for improved relevance
- ⚙️ **Highly Configurable**: Simple YAML configuration to control all aspects of the system
- 🔌 **Multiple Input Formats**: Support for TXT, PDF, CSV, Markdown, JSON, and custom adapters
- 🤖 **Embedding Models**: Integration with HuggingFace models like Qwen-Embedding and more
@@ -29,55 +33,63 @@ flowchart TB
subgraph Input
Documents["Documents<br/>(PDF, Text, CSV, JSON, etc.)"]
end

subgraph DataProcessing["Data Processing"]
DataAdapter["DataAdapter<br/>Extract text from data source"]
TextSplitter["TextSplitter<br/>Chunk text into segments"]
Embedder["Embedder<br/>Generate vector embeddings"]
end

subgraph IndexLayer["Index Layer"]
VectorIndexer["VectorIndexer<br/>Build and maintain FAISS index"]
IndexStorage["Index Storage<br/>(FAISS Index + Original Text)"]

subgraph RetrieverLayer["Retriever Layer"]
VectorRetriever["VectorRetriever<br/>Embedder + FAISS"]
BM25Retriever["BM25Retriever<br/>bm25s + jieba"]
SubstringRetriever["SubstringRetriever<br/>Regex / exact match"]
end


subgraph FusionLayer["Fusion & Reranking"]
FusionStrategy["FusionStrategy<br/>(RRF / Weighted)"]
Reranker["Reranker (optional)<br/>Cross-Encoder BGE"]
end

subgraph QueryLayer["Query Layer"]
UserQuery["User Query"]
QueryEngine["QueryEngine<br/>Process and reformat query"]
QueryEngine["QueryEngine<br/>Template / Hybrid"]
SearchResults["Search Results<br/>Ranked by relevance"]
end

subgraph FlowControl["Flow Control"]
Config["Configuration"]
FlowController["FlowController<br/>Orchestrate data flow"]
end

subgraph API["API Layer"]
CLI["Command Line Interface"]
FastAPI["FastAPI Web Service"]
end

%% Data Flow - Indexing
Documents --> DataAdapter
DataAdapter --> TextSplitter
TextSplitter --> Embedder
Embedder --> VectorIndexer
VectorIndexer --> IndexStorage
TextSplitter --> VectorRetriever
TextSplitter --> BM25Retriever
TextSplitter --> SubstringRetriever

%% Data Flow - Querying
UserQuery --> QueryEngine
QueryEngine --> Embedder
Embedder --> VectorIndexer
VectorIndexer --> SearchResults

QueryEngine --> VectorRetriever
QueryEngine --> BM25Retriever
QueryEngine --> SubstringRetriever
VectorRetriever --> FusionStrategy
BM25Retriever --> FusionStrategy
SubstringRetriever --> FusionStrategy
FusionStrategy --> Reranker
Reranker --> SearchResults

%% Control Flow
Config --> FlowController
FlowController --> DataAdapter
FlowController --> TextSplitter
FlowController --> Embedder
FlowController --> VectorIndexer
FlowController --> QueryEngine

%% API Flow
CLI --> FlowController
FastAPI --> FlowController
@@ -152,9 +164,15 @@ For API documentation, see [API Guide](docs/api.md) and [API Authentication Guid
## Installation

```bash
# Basic installation
# Basic installation (vector search only)
pip install tinysearch

# With hybrid search (BM25 + Chinese tokenization)
pip install tinysearch bm25s jieba

# With cross-encoder reranking
pip install tinysearch FlagEmbedding

# With API support
pip install tinysearch[api]

@@ -195,7 +213,7 @@ indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine

query_engine:
  method: template
  template: "Please find information about: {query}"
@@ -226,6 +244,118 @@ Then visit http://localhost:8000 in your browser to use the web interface, or se
curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query": "Your search query", "top_k": 5}'
```
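
The same request can be issued from Python. The following is a minimal sketch using only the standard library: the `/query` path and the `query` and `top_k` fields come from the curl example above, while the helper names `build_query_request` and `send_query` are illustrative, and the assumption that the service replies with a JSON body is not confirmed here.

```python
import json
import urllib.request


def build_query_request(query, top_k=5, base_url="http://localhost:8000"):
    """Build the POST request for the /query endpoint from the curl example."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query, top_k=5, base_url="http://localhost:8000"):
    """Send the request and decode the body (assumes the reply is JSON)."""
    request = build_query_request(query, top_k, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect the payload without a running server.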

## Hybrid Search

TinySearch supports hybrid retrieval that combines multiple search strategies for better recall and precision. You can mix Vector (semantic), BM25 (keyword), and Substring (exact match) retrievers, then fuse their results using RRF or Weighted fusion.

### Hybrid Search Configuration

To enable hybrid search, set `query_engine.method` to `"hybrid"` and configure the `retrievers` list:

```yaml
# Embedding + Vector index (required for vector retriever)
embedder:
  model: Qwen/Qwen3-Embedding-0.6B
  device: cuda
indexer:
  type: faiss
  index_path: .cache/index.faiss
  metric: cosine

# Multi-retriever configuration
retrievers:
  - type: vector        # Semantic search (uses embedder + indexer above)
  - type: bm25          # Keyword search
    tokenizer: jieba    # Chinese tokenization (optional, fallback: whitespace)
  - type: substring     # Exact match / regex
    is_regex: false

# Fusion strategy
fusion:
  strategy: weighted        # "weighted" or "rrf"
  weights: [0.5, 0.4, 0.1]  # Weights for each retriever (vector, bm25, substring)
  min_score: 0.1            # Drop results below this fused score

# Optional: Cross-encoder reranking
reranker:
  enabled: false
  model: BAAI/bge-reranker-v2-m3

# Query engine
query_engine:
  method: hybrid        # "template" (vector only) or "hybrid"
  top_k: 20
  recall_multiplier: 2  # Each retriever recalls top_k * 2 candidates before fusion
```
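
Two details of the configuration above are easy to get wrong: `weights` is given one entry per retriever, and each retriever recalls `top_k * recall_multiplier` candidates before fusion. The sketch below checks both on a plain dict that mirrors the YAML. The functions `check_hybrid_config` and `candidates_per_retriever` are hypothetical helpers, not part of TinySearch, and whether TinySearch itself enforces the weight count is an assumption.

```python
def check_hybrid_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    retrievers = config.get("retrievers", [])
    fusion = config.get("fusion", {})
    method = config.get("query_engine", {}).get("method")

    if method == "hybrid" and not retrievers:
        problems.append("query_engine.method is 'hybrid' but no retrievers are listed")

    # Assumption: weighted fusion expects exactly one weight per retriever.
    if fusion.get("strategy") == "weighted":
        weights = fusion.get("weights", [])
        if len(weights) != len(retrievers):
            problems.append(
                f"fusion.weights has {len(weights)} entries for {len(retrievers)} retrievers"
            )
    return problems


def candidates_per_retriever(config):
    """Candidates each retriever recalls before fusion: top_k * recall_multiplier."""
    engine = config["query_engine"]
    return engine["top_k"] * engine["recall_multiplier"]


# Dict form of the YAML example above (only the keys the checks read).
config = {
    "retrievers": [{"type": "vector"}, {"type": "bm25"}, {"type": "substring"}],
    "fusion": {"strategy": "weighted", "weights": [0.5, 0.4, 0.1]},
    "query_engine": {"method": "hybrid", "top_k": 20, "recall_multiplier": 2},
}
print(check_hybrid_config(config))       # no problems for the example config
print(candidates_per_retriever(config))  # 20 * 2 candidates per retriever
```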

### Fusion Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| **weighted** | Min-max normalize scores, then weighted sum | When you know relative retriever importance |
| **rrf** | Reciprocal Rank Fusion: `score = Σ 1/(rank + k)` | When score distributions differ (more robust) |
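
The two strategies in the table can be sketched in a few lines of plain Python. This is an illustration of the formulas, not TinySearch's implementation: the function names are hypothetical, ranks are assumed to start at 1, `k=60` is a commonly used constant rather than a documented TinySearch default, and giving every document a normalized score of 1.0 when all raw scores are equal is one possible convention.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over ranked id lists (best first): sum of 1/(rank + k)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    return dict(scores)


def weighted_fuse(score_maps, weights):
    """Min-max normalize each retriever's {id: raw_score}, then take a weighted sum."""
    fused = defaultdict(float)
    for scores, weight in zip(score_maps, weights):
        if not scores:
            continue  # this retriever returned nothing
        low, high = min(scores.values()), max(scores.values())
        span = high - low
        for doc_id, raw in scores.items():
            normalized = (raw - low) / span if span > 0 else 1.0
            fused[doc_id] += weight * normalized
    return dict(fused)


# "b" appears in both rankings, so RRF places it first.
rrf_scores = rrf_fuse([["a", "b"], ["b", "c"]])
print(sorted(rrf_scores, key=rrf_scores.get, reverse=True))

# Raw scores on very different scales become comparable after normalization.
print(weighted_fuse([{"a": 10.0, "b": 0.0}, {"a": 0.2, "b": 0.8}], [0.7, 0.3]))
```

RRF uses only rank positions, which is why it tolerates retrievers whose raw scores live on different scales; weighted fusion relies on normalization to achieve the same thing.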

### Programmatic Usage

```python
from tinysearch.base import TextChunk
from tinysearch.retrievers import VectorRetriever, BM25Retriever, SubstringRetriever
from tinysearch.fusion import WeightedFusion, ReciprocalRankFusion
from tinysearch.query import HybridQueryEngine

# Build retrievers
bm25 = BM25Retriever()
bm25.build(chunks) # chunks: List[TextChunk]

substr = SubstringRetriever(is_regex=False)
substr.build(chunks)

# Create hybrid engine
engine = HybridQueryEngine(
    retrievers=[bm25, substr],
    fusion_strategy=WeightedFusion(weights=[0.7, 0.3]),
    recall_multiplier=2,
)

results = engine.retrieve("搜索关键词", top_k=10)
# Each result: {"text", "metadata", "score", "retrieval_method": "hybrid", "scores": {...}}
```

### Result Format

Hybrid search results include per-retriever scores for transparency:

```python
{
    "text": "...",                # Chunk text
    "metadata": {...},            # Source metadata
    "score": 0.85,                # Final fused score [0, 1]
    "retrieval_method": "hybrid",
    "scores": {                   # Per-retriever original scores
        "vector": 0.92,
        "bm25": 0.75,
    }
}
```
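
The per-retriever `scores` field makes it straightforward to see why a chunk ranked where it did. The helpers below are a hypothetical sketch that relies only on the keys shown in the result format above; treating a retriever that is absent from `scores` as one that did not return the chunk is an assumption based on the example, which lists only `vector` and `bm25`.

```python
def explain_result(result):
    """One-line summary: fused score, per-retriever scores, start of the text."""
    parts = [f"{name}={score:.2f}" for name, score in sorted(result["scores"].items())]
    return f"{result['score']:.2f} [{', '.join(parts)}] {result['text'][:40]}"


def missing_retrievers(result, expected=("vector", "bm25", "substring")):
    """Names from `expected` with no entry in this result's per-retriever scores."""
    return [name for name in expected if name not in result["scores"]]


sample = {
    "text": "some text",
    "metadata": {},
    "score": 0.85,
    "retrieval_method": "hybrid",
    "scores": {"vector": 0.92, "bm25": 0.75},
}
print(explain_result(sample))
print(missing_retrievers(sample))
```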

### Optional Dependencies for Hybrid Search

| Package | Purpose | Required? |
|---------|---------|-----------|
| `bm25s` | BM25 retrieval engine | Only for BM25Retriever |
| `jieba` | Chinese tokenization | Optional (fallback: whitespace split) |
| `FlagEmbedding` | Cross-encoder reranking | Optional (only for reranker) |

Install with:
```bash
pip install bm25s jieba # For BM25 + Chinese support
pip install FlagEmbedding # For cross-encoder reranking
```
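
The "fallback: whitespace split" behaviour in the table can be pictured as follows. This is a sketch of the general pattern, not TinySearch's actual tokenizer: the function name `tokenize` is illustrative, and dropping whitespace-only tokens from the jieba output is a choice made here so that both branches agree on plain space-separated input.

```python
def tokenize(text):
    """Tokenize with jieba when it is installed, otherwise split on whitespace."""
    try:
        import jieba  # optional dependency
    except ImportError:
        return text.split()
    # jieba.lcut returns a list of tokens, including whitespace runs; drop those.
    return [token for token in jieba.lcut(text) if token.strip()]


print(tokenize("hello world"))
```

Whitespace splitting treats an unsegmented Chinese sentence as a single token, which is why `jieba` matters for Chinese corpora.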

### Backward Compatibility

If you omit `retrievers`, or leave `query_engine.method` set to `"template"`, TinySearch behaves exactly as before: pure vector search with no additional dependencies.

## Examples

TinySearch includes various example scripts in the `examples/` directory to demonstrate different features:
@@ -303,14 +433,16 @@ The TinySearch Web UI provides an intuitive interface for interacting with the s

## Custom Data Adapters

You can create custom data adapters by implementing the `DataAdapter` interface:
You can create custom data adapters by implementing the `DataAdapter` interface.
Each adapter handles **single file** extraction — directory traversal is handled
automatically by the framework:

```python
from tinysearch.base import DataAdapter

class MyAdapter(DataAdapter):
    def extract(self, filepath):
        # Your code to extract text from the file
        # Extract text from a single file
        return [text1, text2, ...]
```

@@ -326,6 +458,38 @@ adapter:
param1: value1
```

## Directory Indexing

When building an index from a directory, TinySearch uses a centralized file
discovery mechanism (`iter_input_files()`). Each adapter type has default
file extensions:

| Adapter | Extensions |
|---------|-----------|
| text | `.txt`, `.text`, `.md`, `.py`, `.js`, `.html`, `.css`, `.json` |
| pdf | `.pdf` |
| csv | `.csv` |
| markdown | `.md`, `.markdown`, `.mdown`, `.mkdn` |
| json | `.json` |

You can override extensions in `config.yaml`:

```yaml
adapter:
  type: text
  params:
    extensions: [".txt", ".log"]
```

Or use `iter_input_files()` programmatically:

```python
from tinysearch.utils.file_discovery import iter_input_files

for file_path in iter_input_files("./data", adapter_type="text"):
    texts = adapter.extract(file_path)
```

## Modern Logging System

TinySearch features a modern logging system powered by [loguru](https://github.com/Delgan/loguru) with beautiful, colorful output and flexible configuration options.
@@ -426,4 +590,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail

## Keywords

vector search, semantic search, document retrieval, embeddings, FAISS, RAG, information retrieval, text search, vector database, document understanding, NLP, natural language processing, AI search
vector search, semantic search, hybrid search, BM25, document retrieval, embeddings, FAISS, RAG, information retrieval, text search, vector database, document understanding, NLP, natural language processing, AI search, reranking, fusion
9 changes: 8 additions & 1 deletion setup.py
@@ -42,6 +42,13 @@
    "indexers": [
        "faiss-cpu>=1.7.0",
    ],
    "hybrid": [
        "bm25s>=0.1.0",
        "jieba>=0.42.0",
    ],
    "reranker": [
        "FlagEmbedding>=1.0.0",
    ],
    "dev": [
        "pytest>=6.0.0",
        "black>=21.5b2",
@@ -64,7 +71,7 @@
    version=version.get("__version__", "0.1.0"),
    author="TinySearch Team",
    author_email="tinysearch@example.com",
    description="A lightweight vector retrieval system",
    description="A lightweight hybrid search and retrieval system (Vector + BM25 + Substring)",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/yourusername/tinysearch",