HT-VideoGraph: Hierarchical Temporal Knowledge Graph for Long Video Understanding

HT-VideoGraph is the first Hierarchical Temporal Knowledge Graph (HTKG) for long video understanding. It addresses the fundamental limitations of current video RAG methods by modeling videos at multiple temporal granularities with multi-modal feature preservation.

Key Features

Hierarchical Temporal Structure: Models video at 4 native temporal levels:
- L0: Frame level (1 FPS sampling)
- L1: Shot level (detected boundaries)
- L2: Scene level (semantic clustering)
- L3: Narrative level (topic segmentation)
Query-Adaptive Hierarchical Traversal (QAHT): Dynamically determines appropriate retrieval granularity based on query analysis
Multi-Modal Feature Preservation: Nodes carry visual (CLIP), audio (Wav2Vec2/CLAP), and textual (BGE-M3) features
Training-Free: Entirely composed of pre-trained components - no fine-tuning required
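To make the four-level hierarchy concrete, here is a minimal sketch of what a node might look like. The `TemporalNode` class, its field names, and the `frame_nodes` helper are hypothetical and chosen for illustration only; they are not the classes defined in `ht_videograph`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Level numbering follows the feature list above (L0..L3).
LEVEL_NAMES = {0: "frame", 1: "shot", 2: "scene", 3: "narrative"}


@dataclass
class TemporalNode:
    """Hypothetical node type; not the project's actual class."""
    node_id: str
    level: int                      # 0 = frame ... 3 = narrative
    start: float                    # start time in seconds
    end: float                      # end time in seconds
    parent: Optional[str] = None    # id of the containing node one level up
    # Per-modality feature vectors, e.g. {"visual": [...], "text": [...]}
    features: Dict[str, List[float]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.level not in LEVEL_NAMES:
            raise ValueError(f"unknown level: {self.level}")
        if self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def duration(self) -> float:
        return self.end - self.start


def frame_nodes(duration_s: float, fps: float = 1.0) -> List[TemporalNode]:
    """Create L0 nodes by sampling the video at a fixed rate (assumed 1 FPS)."""
    step = 1.0 / fps
    count = int(duration_s * fps)
    return [
        TemporalNode(f"L0_{i}", 0, i * step, min((i + 1) * step, duration_s))
        for i in range(count)
    ]


frames = frame_nodes(5.0)
print(len(frames), LEVEL_NAMES[frames[0].level], frames[-1].end)
```

Keeping the features in a per-modality dictionary is one way to let a node carry visual, audio, and textual vectors of different dimensions side by side.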

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        HT-VideoGraph Pipeline                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  OFFLINE PHASE (Graph Construction):                                        │
│  Video → Multi-Scale Segmentation → Entity Extraction → Edge Construction   │
│                                    ↓                                         │
│                    Hierarchical Temporal Knowledge Graph                     │
│                                                                             │
│  ONLINE PHASE (Query Processing):                                           │
│  Query → QAHT Retrieval → Retrieved Subgraph → LVLM Generation → Answer    │
└─────────────────────────────────────────────────────────────────────────────┘

Installation

Prerequisites

Python 3.10 or higher
CUDA-capable GPU (recommended)
FFmpeg for video processing

Install from Source

git clone https://github.com/ht-videograph/ht-videograph.git
cd ht-videograph
pip install -e .

Download Models

python scripts/download_models.py

Quick Start

Building a Graph

from ht_videograph import OfflinePipeline, Config

# Load configuration
config = Config.from_yaml("config/default_config.yaml")

# Initialize pipeline
pipeline = OfflinePipeline(config)

# Build graph from video
graph = pipeline.run("path/to/video.mp4")

# Save graph
graph.save("output/graph.pkl")

Querying the Graph

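from ht_videograph import Config, OnlinePipeline, load_graph

# Load the same configuration that was used to build the graph
config = Config.from_yaml("config/default_config.yaml")

# Load graph
graph = load_graph("output/graph.pkl")

# Initialize pipeline
pipeline = OnlinePipeline(config, graph)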

# Process query
query = "What experiment was conducted at around 10 minutes?"
answer = pipeline.run(query)
print(f"Answer: {answer}")

Running Benchmarks

python scripts/run_benchmarks.py --config config/default_config.yaml --output results/

Project Structure

HT-VideoGraph/
├── ht_videograph/           # Main package
│   ├── models/              # Model loading and management
│   ├── encoders/            # Feature extraction (visual, audio, text)
│   ├── graph/               # Graph construction modules
│   ├── retrieval/           # QAHT retrieval algorithms
│   ├── generation/          # LVLM-based answer generation
│   ├── pipeline/            # End-to-end pipelines
│   ├── evaluation/          # Benchmark runners and metrics
│   └── utils/               # Utility functions
├── scripts/                 # Executable scripts
├── config/                  # Configuration files
├── tests/                   # Unit tests
└── docs/                    # Documentation

Modules

Module 1: Multi-Scale Temporal Segmentation

Constructs the hierarchical temporal backbone at 4 granularity levels using PySceneDetect, agglomerative clustering, and BERTopic.
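The scene level groups shots by semantic similarity. As a rough illustration of agglomerative clustering under a temporal-adjacency constraint, the sketch below repeatedly merges the most similar pair of neighbouring segments until no pair exceeds a threshold. The function names, the cosine-similarity criterion, and the `threshold` value are assumptions made for this example; the module's actual linkage and stopping rule may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vec(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def merge_adjacent_shots(
    shot_embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Group consecutive shots into scenes; returns lists of shot indices."""
    clusters = [[i] for i in range(len(shot_embeddings))]
    centroids = [list(e) for e in shot_embeddings]
    while len(clusters) > 1:
        # Only neighbouring clusters may merge, so scenes stay contiguous in time.
        sims = [
            cosine(centroids[i], centroids[i + 1])
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break
        merged = clusters[best] + clusters[best + 1]
        clusters[best:best + 2] = [merged]
        centroids[best:best + 2] = [mean_vec([shot_embeddings[i] for i in merged])]
    return clusters


shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(merge_adjacent_shots(shots))  # prints [[0, 1], [2, 3]]
```

Restricting merges to neighbours is what distinguishes this from unconstrained clustering: two visually similar shots that are far apart in the video never end up in the same scene.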

Module 2: Multi-Modal Entity Extraction

Extracts entities from visual (GroundingDINO), audio (PANNs), and text (Whisper + spaCy NER) modalities with cross-modal linking.

Module 3: Cross-Granularity Edge Construction

Creates containment, temporal sibling, entity propagation, and semantic association edges.
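Two of these edge types, containment and temporal sibling, can be derived from segment boundaries alone. The sketch below shows one way to do that; the tuple layout, the edge-type labels, and the midpoint rule for assigning a parent are assumptions made for illustration. Entity propagation and semantic association edges depend on extracted entities and embeddings and are not covered here.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, float, float]   # (node_id, level, start_s, end_s)
Edge = Tuple[str, str, str]               # (source_id, target_id, edge_type)


def build_structural_edges(segments: List[Segment]) -> List[Edge]:
    """Derive temporal-sibling and containment edges from time intervals."""
    by_level: Dict[int, List[Segment]] = {}
    for seg in segments:
        by_level.setdefault(seg[1], []).append(seg)
    for level_segments in by_level.values():
        level_segments.sort(key=lambda s: s[2])  # order by start time

    edges: List[Edge] = []

    # Temporal sibling edges: link each segment to its successor on the same level.
    for level in sorted(by_level):
        level_segments = by_level[level]
        for prev, nxt in zip(level_segments, level_segments[1:]):
            edges.append((prev[0], nxt[0], "temporal_next"))

    # Containment edges: a child's parent is the segment one level up
    # whose interval covers the child's midpoint.
    for level in sorted(by_level):
        parents = by_level.get(level + 1, [])
        for child in by_level[level]:
            midpoint = (child[2] + child[3]) / 2
            for parent in parents:
                if parent[2] <= midpoint < parent[3]:
                    edges.append((parent[0], child[0], "contains"))
                    break
    return edges


segments = [
    ("scene0", 2, 0.0, 10.0),
    ("shot0", 1, 0.0, 4.0),
    ("shot1", 1, 4.0, 10.0),
]
for edge in build_structural_edges(segments):
    print(edge)
```

Using the midpoint rather than requiring full enclosure keeps the assignment well defined when boundaries detected at different levels do not align exactly.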

Module 4: Query-Adaptive Hierarchical Traversal (QAHT)

Core retrieval algorithm with query analysis, multi-scale retrieval, and hierarchical refinement.
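The query-analysis step decides which level of the hierarchy to search first. The sketch below is a deliberately simple rule-based stand-in for that step, not the QAHT algorithm itself: the cue words, the regular expression, and the mapping from cues to levels are all assumptions made for this example.

```python
import re
from typing import Optional, Tuple

_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}
_TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(second|minute|hour)s?\b")
# Hypothetical cue phrases that suggest a whole-video question.
_GLOBAL_CUES = ("summarize", "summary", "overall", "main topic", "whole video")


def analyze_query(query: str) -> Tuple[int, Optional[float]]:
    """Return (starting level, time hint in seconds or None)."""
    text = query.lower()
    match = _TIME_RE.search(text)
    time_hint = (
        float(match.group(1)) * _UNIT_SECONDS[match.group(2)] if match else None
    )
    if any(cue in text for cue in _GLOBAL_CUES):
        return 3, time_hint   # narrative level for whole-video questions
    if time_hint is not None:
        return 1, time_hint   # shot level when a timestamp is mentioned
    return 2, time_hint       # scene level as the default


print(analyze_query("What experiment was conducted at around 10 minutes?"))
# prints (1, 600.0)
```

A real analyzer would more likely use an embedding model or an LLM to classify the query; the point here is only that the output of this step is a starting granularity plus optional temporal constraints that the later retrieval stages can refine.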

Module 5: Multi-Modal Context Integration

Formats retrieved graph context for LVLM (Qwen2.5-VL) generation.
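One plausible way to fit retrieved context into a fixed prompt budget is greedy packing by relevance score, followed by re-ordering in time so the model reads events chronologically. The sketch below assumes a hypothetical tuple layout for retrieved nodes and estimates tokens by whitespace splitting, which is only a rough proxy for a real tokenizer.

```python
from typing import List, Tuple

Retrieved = Tuple[float, float, float, str]  # (score, start_s, end_s, description)


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_context(nodes: List[Retrieved], token_budget: int = 4000) -> str:
    """Pack the highest-scoring nodes into the budget, then sort by time."""
    chosen = []
    used = 0
    for score, start, end, description in sorted(
        nodes, key=lambda n: n[0], reverse=True
    ):
        line = f"[{format_timestamp(start)}-{format_timestamp(end)}] {description}"
        cost = len(line.split())  # crude whitespace-based token estimate
        if used + cost > token_budget:
            continue              # skip items that do not fit, keep trying smaller ones
        chosen.append((start, line))
        used += cost
    chosen.sort(key=lambda item: item[0])  # chronological order for the prompt
    return "\n".join(line for _, line in chosen)


nodes = [
    (0.9, 600.0, 630.0, "A researcher pours liquid into a beaker"),
    (0.5, 30.0, 45.0, "Title slide"),
    (0.7, 300.0, 320.0, "Speaker introduces the setup"),
]
print(build_context(nodes, token_budget=12))
```

With a budget of 12 estimated tokens, the highest-scoring item is kept, the second no longer fits, and the shorter third item fills the remaining space.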

Supported Benchmarks

Video-MME: Multi-choice QA on videos from 11 seconds to 1 hour
MLVU: Multi-task Long Video Understanding
LongerVideos: Cross-video open-ended QA
EgoSchema: Egocentric video understanding
ActivityNet-QA: Activity-focused QA

Configuration

Configuration is managed via YAML files. See config/default_config.yaml for all available options.

# Example configuration
video:
  fps: 1
  max_resolution: 720

graph:
  hierarchy_levels: 4

retrieval:
  token_budget: 4000
  faiss_index: "IVF-PQ"

Citation

If you use HT-VideoGraph in your research, please cite:

@article{htvideograph2024,
  title={HT-VideoGraph: Hierarchical Temporal Knowledge Graph for Long Video Understanding},
  author={HT-VideoGraph Team},
  journal={arXiv preprint},
  year={2024}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project uses the following pre-trained models:

CLIP by OpenAI
Wav2Vec2 by Facebook AI
CLAP by LAION
GroundingDINO by IDEA Research
BGE-M3 by BAAI
Whisper by OpenAI
BERTopic by Maarten Grootendorst
