HT-VideoGraph is the first Hierarchical Temporal Knowledge Graph (HTKG) for long video understanding. It addresses the fundamental limitations of current video RAG methods by modeling videos at multiple temporal granularities with multi-modal feature preservation.
- Hierarchical Temporal Structure: Models video at 4 native temporal levels:
  - L0: Frame level (1 FPS sampling)
  - L1: Shot level (detected boundaries)
  - L2: Scene level (semantic clustering)
  - L3: Narrative level (topic segmentation)
- Query-Adaptive Hierarchical Traversal (QAHT): Dynamically determines the appropriate retrieval granularity based on query analysis
- Multi-Modal Feature Preservation: Nodes carry visual (CLIP), audio (Wav2Vec2/CLAP), and textual (BGE-M3) features
- Training-Free: Composed entirely of pre-trained components; no fine-tuning required
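The four levels and the per-node modality features can be pictured as a small tree of segment nodes. The sketch below is only an illustration under assumed names (`Level`, `SegmentNode`, `find_at` are hypothetical and not taken from the package), and the feature vector is a shortened placeholder rather than a real CLIP embedding.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional


class Level(IntEnum):
    """The four temporal levels listed above."""
    FRAME = 0      # L0: frames sampled at 1 FPS
    SHOT = 1       # L1: detected shot boundaries
    SCENE = 2      # L2: semantically clustered shots
    NARRATIVE = 3  # L3: topic segments


@dataclass
class SegmentNode:
    """Hypothetical node: a time span plus optional per-modality features."""
    level: Level
    start: float  # seconds
    end: float    # seconds
    features: Dict[str, List[float]] = field(default_factory=dict)
    children: List["SegmentNode"] = field(default_factory=list)

    def add_child(self, child: "SegmentNode") -> None:
        # A child must sit exactly one level below its parent and inside its span.
        if child.level != self.level - 1:
            raise ValueError("child must be one level below parent")
        if child.start < self.start or child.end > self.end:
            raise ValueError("child span must lie inside parent span")
        self.children.append(child)


def find_at(node: SegmentNode, t: float, level: Level) -> Optional[SegmentNode]:
    """Return the node at `level` whose span [start, end) contains time t."""
    if not (node.start <= t < node.end):
        return None
    if node.level == level:
        return node
    for child in node.children:
        hit = find_at(child, t, level)
        if hit is not None:
            return hit
    return None


# Tiny example: one narrative segment > one scene > one shot > two frames.
narrative = SegmentNode(Level.NARRATIVE, 0.0, 120.0)
scene = SegmentNode(Level.SCENE, 0.0, 60.0)
shot = SegmentNode(Level.SHOT, 0.0, 2.0, features={"visual": [0.1, 0.2]})
for second in (0, 1):
    shot.add_child(SegmentNode(Level.FRAME, float(second), float(second + 1)))
scene.add_child(shot)
narrative.add_child(scene)
```

Calling `find_at(narrative, 1.5, Level.FRAME)` walks down the tree to the frame covering second 1, which is the kind of lookup a timestamp-anchored question would need.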
┌─────────────────────────────────────────────────────────────────────────────┐
│ HT-VideoGraph Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ OFFLINE PHASE (Graph Construction): │
│ Video → Multi-Scale Segmentation → Entity Extraction → Edge Construction │
│ ↓ │
│ Hierarchical Temporal Knowledge Graph │
│ │
│ ONLINE PHASE (Query Processing): │
│ Query → QAHT Retrieval → Retrieved Subgraph → LVLM Generation → Answer │
└─────────────────────────────────────────────────────────────────────────────┘
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- FFmpeg for video processing
git clone https://github.com/ht-videograph/ht-videograph.git
cd ht-videograph
pip install -e .
python scripts/download_models.py
from ht_videograph import OfflinePipeline, Config
# Load configuration
config = Config.from_yaml("config/default_config.yaml")
# Initialize pipeline
pipeline = OfflinePipeline(config)
# Build graph from video
graph = pipeline.run("path/to/video.mp4")
# Save graph
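graph.save("output/graph.pkl")
from ht_videograph import Config, OnlinePipeline, load_graph
# Load the same configuration used to build the graph
config = Config.from_yaml("config/default_config.yaml")
# Load graph
graph = load_graph("output/graph.pkl")
# Initialize pipeline
pipeline = OnlinePipeline(config, graph)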
# Process query
query = "What experiment was conducted at around 10 minutes?"
answer = pipeline.run(query)
print(f"Answer: {answer}")
python scripts/run_benchmarks.py --config config/default_config.yaml --output results/
HT-VideoGraph/
├── ht_videograph/ # Main package
│ ├── models/ # Model loading and management
│ ├── encoders/ # Feature extraction (visual, audio, text)
│ ├── graph/ # Graph construction modules
│ ├── retrieval/ # QAHT retrieval algorithms
│ ├── generation/ # LVLM-based answer generation
│ ├── pipeline/ # End-to-end pipelines
│ ├── evaluation/ # Benchmark runners and metrics
│ └── utils/ # Utility functions
├── scripts/ # Executable scripts
├── config/ # Configuration files
├── tests/ # Unit tests
└── docs/ # Documentation
Multi-Scale Segmentation constructs the hierarchical temporal backbone at 4 granularity levels using PySceneDetect (shot boundaries), agglomerative clustering (scenes), and BERTopic (narrative topics).
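To make the shot-to-scene step concrete, here is a minimal sketch of bottom-up merging restricted to temporally adjacent shots. It is a simplified stand-in, not the package's implementation: the function names, the centroid-based cosine similarity, and the `0.8` stopping threshold are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def group_shots(embeddings: List[Vector], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Merge adjacent shots bottom-up; return scenes as (first_shot, last_shot)."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        centroids = [mean([embeddings[i] for i in c]) for c in clusters]
        sims = [cosine(centroids[k], centroids[k + 1]) for k in range(len(clusters) - 1)]
        # Merge the most similar adjacent pair, unless even that pair is too dissimilar.
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
    return [(c[0], c[-1]) for c in clusters]


# Four placeholder shot embeddings: two similar shots, then two different ones.
shot_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scenes = group_shots(shot_embeddings)
```

Only neighbouring clusters are ever compared, so every resulting scene is a contiguous run of shots, which keeps the hierarchy temporally ordered.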
Entity Extraction pulls entities from the visual (GroundingDINO), audio (PANNs), and text (Whisper + spaCy NER) modalities and links them across modalities.
Edge Construction creates four edge types: containment, temporal sibling, entity propagation, and semantic association.
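Two of these edge types follow directly from segment time spans, so they are easy to sketch. The example below assumes a hypothetical flat list of `(level, start, end)` tuples and derives only containment and temporal sibling edges. Entity propagation and semantic association edges depend on extracted entities and embeddings and are left out. The edge labels are illustrative, not the package's own.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, float, float]  # (level, start_sec, end_sec)
Edge = Tuple[str, int, int]         # (edge_type, source_index, target_index)


def build_edges(segments: List[Segment]) -> List[Edge]:
    """Derive containment and temporal-sibling edges from time spans alone."""
    edges: List[Edge] = []
    by_level: Dict[int, List[int]] = {}
    for idx, (level, _, _) in enumerate(segments):
        by_level.setdefault(level, []).append(idx)

    for level, members in by_level.items():
        # Temporal sibling: consecutive segments on the same level, in time order.
        ordered = sorted(members, key=lambda i: segments[i][1])
        for left, right in zip(ordered, ordered[1:]):
            edges.append(("temporal_sibling", left, right))

        # Containment: the first segment one level up whose span covers the child.
        for child in members:
            _, c_start, c_end = segments[child]
            for parent in by_level.get(level + 1, []):
                _, p_start, p_end = segments[parent]
                if p_start <= c_start and c_end <= p_end:
                    edges.append(("contains", parent, child))
                    break
    return edges


# One scene (level 2), two shots (level 1), one frame (level 0).
segments = [(2, 0.0, 10.0), (1, 0.0, 4.0), (1, 4.0, 10.0), (0, 0.0, 1.0)]
edges = build_edges(segments)
```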
QAHT Retrieval is the core retrieval algorithm; it combines query analysis, multi-scale retrieval, and hierarchical refinement.
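As a rough illustration of the idea, the sketch below picks a starting level from surface cues in the query and then keeps the best-scoring nodes that fit a token budget. The cue patterns, the level mapping, and both function names are assumptions; this README does not describe the actual query analysis, and the real algorithm may work quite differently.

```python
import re
from typing import List, Tuple

# Assumed cue patterns, chosen only for this example.
TIME_CUE = re.compile(
    r"\b(\d+\s*(seconds?|minutes?|hours?)|\d{1,2}:\d{2})\b", re.IGNORECASE
)
GLOBAL_CUE = re.compile(
    r"\b(overall|summary|summarize|main (topic|theme)|whole video)\b", re.IGNORECASE
)


def pick_level(query: str) -> int:
    """Map a query to a starting level: 1 = shot, 2 = scene, 3 = narrative."""
    if TIME_CUE.search(query):
        return 1  # timestamp cues point at a short, localized span
    if GLOBAL_CUE.search(query):
        return 3  # summary-style questions need the coarsest level
    return 2      # otherwise start at scene level


def pack_budget(candidates: List[Tuple[str, float, int]], budget: int) -> List[str]:
    """Greedily keep the highest-scoring (node_id, score, tokens) items within budget."""
    kept: List[str] = []
    used = 0
    for node_id, _, tokens in sorted(candidates, key=lambda c: c[1], reverse=True):
        if used + tokens <= budget:
            kept.append(node_id)
            used += tokens
    return kept


level = pick_level("What experiment was conducted at around 10 minutes?")
selected = pack_budget([("a", 0.9, 3000), ("b", 0.8, 2000), ("c", 0.5, 900)], 4000)
```

The budget of 4000 mirrors the `token_budget` value in the example configuration. A greedy pass is the simplest policy that respects it, though it can skip a mid-ranked node that a smarter packing would fit.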
LVLM Generation formats the retrieved graph context for answer generation by the LVLM (Qwen2.5-VL).
- Video-MME: Multi-choice QA on videos from 11 seconds to 1 hour
- MLVU: Multi-task Long Video Understanding
- LongerVideos: Cross-video open-ended QA
- EgoSchema: Egocentric video understanding
- ActivityNet-QA: Activity-focused QA
Configuration is managed via YAML files. See config/default_config.yaml for all available options.
# Example configuration
video:
  fps: 1
  max_resolution: 720
graph:
  hierarchy_levels: 4
retrieval:
  token_budget: 4000
  faiss_index: "IVF-PQ"

If you use HT-VideoGraph in your research, please cite:
@article{htvideograph2024,
  title={HT-VideoGraph: Hierarchical Temporal Knowledge Graph for Long Video Understanding},
  author={HT-VideoGraph Team},
  journal={arXiv preprint},
  year={2024}
}

This project is licensed under the MIT License; see the LICENSE file for details.
This project uses the following pre-trained models: