# Video Chat System

A simple Streamlit app that:

- Uploads a video
- Extracts frames every N seconds (default: 2 s)
- Sends each frame to a FREE vision model (BLIP-2) for descriptions
- Stores timestamped frame captions
- Lets users chat: questions are answered by a small FREE LLM using the collected descriptions
## Requirements

- Python 3.10+
- Streamlit (web UI)
- OpenCV (video processing)
- Hugging Face Transformers (BLIP-2 + small LLM)
- PyTorch (CPU by default)
## Project structure

```
video-chat-system/
├── app.py               # Streamlit UI
├── video_processor.py   # Frame extraction (OpenCV)
├── vision_analyzer.py   # BLIP-2 image captioning
├── chat_handler.py      # Q&A over frame descriptions
├── requirements.txt
├── .env                 # Optional environment variables
└── README.md
```
## Setup

- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # macOS/Linux
  # on Windows: .venv\Scripts\activate
  ```

- Install dependencies (CPU by default):

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- (Optional) Configure environment. Create `.env` if needed. Example options:

  ```bash
  # Cache directories (optional)
  HF_HOME=.cache/huggingface
  TRANSFORMERS_CACHE=.cache/transformers

  # Use an alternative HF endpoint if needed
  # HF_ENDPOINT=https://huggingface.co
  ```

- Run the app:

  ```bash
  streamlit run app.py
  ```

  Then open the provided local URL in your browser.
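A possible `requirements.txt` matching the dependency list above (unpinned; `pillow` and `python-dotenv` are assumptions for image handling and `.env` loading — pin versions as needed):

```
streamlit
opencv-python
transformers
torch
pillow          # image handling for captioning (assumed)
python-dotenv   # optional .env loading (assumed)
```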
## Models

- Vision: BLIP-2 (Salesforce) via `transformers` pipelines or model classes
- LLM for chat: start with `google/flan-t5-base` (free, CPU-friendly). You can swap in other small open-source models later.
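A minimal sketch of how these models could be wired up with `transformers` pipelines. The function names and the prompt format are illustrative assumptions, not the project's actual API; `transformers` is imported lazily because the model weights are large (BLIP-2 is several GB):

```python
def build_captioner():
    # Assumption: the image-to-text pipeline with BLIP-2 weights.
    # Imported lazily so the lightweight helpers below work without
    # downloading anything.
    from transformers import pipeline
    return pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")

def build_chat_llm():
    # Small, CPU-friendly instruction-tuned model named in the README.
    from transformers import pipeline
    return pipeline("text2text-generation", model="google/flan-t5-base")

def answer_question(llm, captions, question):
    """captions: list of (timestamp_seconds, caption_text) pairs."""
    # Assumed prompt format: inline the timestamped captions as context.
    context = "\n".join(f"[{t:.0f}s] {c}" for t, c in captions)
    prompt = (
        "Answer the question using only these video frame descriptions:\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt, max_new_tokens=64)[0]["generated_text"]
```

Both pipelines return a list of dicts with a `generated_text` key, so the same `answer_question` helper works with any drop-in replacement LLM.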
## Next steps

- Project scaffold
- Requirements
- Implement frame extraction in `video_processor.py`
- Add BLIP-2 captioning in `vision_analyzer.py`
- Implement chat handler in `chat_handler.py`
- Build Streamlit UI in `app.py`
## Notes

This project uses only FREE and open-source components. Verify model licenses before distribution.