NatBrian/HT-VideoGraph

HT-VideoGraph: Hierarchical Temporal Knowledge Graph for Long Video Understanding

Python 3.10+ | License: MIT

HT-VideoGraph is the first Hierarchical Temporal Knowledge Graph (HTKG) for long video understanding. It addresses the fundamental limitations of current video RAG methods by modeling videos at multiple temporal granularities with multi-modal feature preservation.

Key Features

  • Hierarchical Temporal Structure: Models video at 4 native temporal levels:

    • L0: Frame level (1 FPS sampling)
    • L1: Shot level (detected boundaries)
    • L2: Scene level (semantic clustering)
    • L3: Narrative level (topic segmentation)
  • Query-Adaptive Hierarchical Traversal (QAHT): Dynamically determines appropriate retrieval granularity based on query analysis

  • Multi-Modal Feature Preservation: Nodes carry visual (CLIP), audio (Wav2Vec2/CLAP), and textual (BGE-M3) features

  • Training-Free: Entirely composed of pre-trained components - no fine-tuning required
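
The four levels and the per-modality features can be pictured as a small tree of time-stamped nodes. The following is a minimal sketch only: `Level`, `Node`, and `frame_nodes` are hypothetical names chosen for illustration and are not taken from the `ht_videograph` package.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Level(IntEnum):
    """The four temporal levels listed above."""
    FRAME = 0      # L0
    SHOT = 1       # L1
    SCENE = 2      # L2
    NARRATIVE = 3  # L3


@dataclass
class Node:
    """Hypothetical graph node: a time span plus per-modality feature vectors."""
    node_id: str
    level: Level
    start: float  # seconds
    end: float    # seconds
    # e.g. {"visual": [...], "audio": [...], "text": [...]}
    features: dict[str, list[float]] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


def frame_nodes(duration_s: float, fps: float = 1.0) -> list[Node]:
    """Create one L0 node per sampled frame at the given sampling rate."""
    step = 1.0 / fps
    count = int(duration_s * fps)
    return [
        Node(f"f{i}", Level.FRAME, i * step, min((i + 1) * step, duration_s))
        for i in range(count)
    ]


# A 5-second clip sampled at 1 FPS yields five frame nodes under one shot.
frames = frame_nodes(duration_s=5.0)
shot = Node("s0", Level.SHOT, 0.0, 5.0, children=frames)
print(len(shot.children), shot.children[-1].start)
```

Keeping features in a dictionary keyed by modality lets a node carry any subset of visual, audio, and text vectors without changing the node type.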

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        HT-VideoGraph Pipeline                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  OFFLINE PHASE (Graph Construction):                                        │
│  Video → Multi-Scale Segmentation → Entity Extraction → Edge Construction   │
│                                    ↓                                         │
│                    Hierarchical Temporal Knowledge Graph                     │
│                                                                             │
│  ONLINE PHASE (Query Processing):                                           │
│  Query → QAHT Retrieval → Retrieved Subgraph → LVLM Generation → Answer    │
└─────────────────────────────────────────────────────────────────────────────┘

Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended)
  • FFmpeg for video processing

Install from Source

git clone https://github.com/NatBrian/HT-VideoGraph.git
cd HT-VideoGraph
pip install -e .

Download Models

python scripts/download_models.py

Quick Start

Building a Graph

from ht_videograph import OfflinePipeline, Config

# Load configuration
config = Config.from_yaml("config/default_config.yaml")

# Initialize pipeline
pipeline = OfflinePipeline(config)

# Build graph from video
graph = pipeline.run("path/to/video.mp4")

# Save graph
graph.save("output/graph.pkl")

Querying the Graph

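from ht_videograph import Config, OnlinePipeline, load_graph

# Load the same configuration used to build the graph
config = Config.from_yaml("config/default_config.yaml")

# Load graph
graph = load_graph("output/graph.pkl")

# Initialize pipeline
pipeline = OnlinePipeline(config, graph)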

# Process query
query = "What experiment was conducted at around 10 minutes?"
answer = pipeline.run(query)
print(f"Answer: {answer}")

Running Benchmarks

python scripts/run_benchmarks.py --config config/default_config.yaml --output results/

Project Structure

HT-VideoGraph/
├── ht_videograph/           # Main package
│   ├── models/              # Model loading and management
│   ├── encoders/            # Feature extraction (visual, audio, text)
│   ├── graph/               # Graph construction modules
│   ├── retrieval/           # QAHT retrieval algorithms
│   ├── generation/          # LVLM-based answer generation
│   ├── pipeline/            # End-to-end pipelines
│   ├── evaluation/          # Benchmark runners and metrics
│   └── utils/               # Utility functions
├── scripts/                 # Executable scripts
├── config/                  # Configuration files
├── tests/                   # Unit tests
└── docs/                    # Documentation

Modules

Module 1: Multi-Scale Temporal Segmentation

Constructs the hierarchical temporal backbone at 4 granularity levels using PySceneDetect, agglomerative clustering, and BERTopic.
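
To illustrate the scene-level step, here is a simplified pure-Python sketch of agglomerative clustering restricted to temporally adjacent shots. It assumes each shot already has an embedding vector. The function names, the centroid linkage, and the `threshold` value are assumptions made for this example and may differ from what the package does.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_shots(embeddings, threshold=0.8):
    """Merge the most similar pair of adjacent groups until no pair
    reaches `threshold`. Returns groups of shot indices in temporal order."""
    groups = [[i] for i in range(len(embeddings))]
    while len(groups) > 1:
        centroids = [mean([embeddings[i] for i in g]) for g in groups]
        sims = [cosine(centroids[k], centroids[k + 1])
                for k in range(len(groups) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break
        # Replace the two adjacent groups with their union.
        groups[best:best + 2] = [groups[best] + groups[best + 1]]
    return groups


# Four toy shot embeddings: the first two and the last two look alike.
shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cluster_shots(shots))  # [[0, 1], [2, 3]]
```

Merging only neighbouring groups keeps every scene a contiguous time span, which a hierarchy built on containment requires.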

Module 2: Multi-Modal Entity Extraction

Extracts entities from visual (GroundingDINO), audio (PANNs), and text (Whisper + spaCy NER) modalities with cross-modal linking.

Module 3: Cross-Granularity Edge Construction

Creates containment, temporal sibling, entity propagation, and semantic association edges.
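
Two of these edge types follow directly from segment timestamps. The sketch below derives containment and temporal sibling edges. It assumes that a child belongs to the parent whose span covers the child's midpoint. `Segment` and the edge labels are hypothetical, and the entity propagation and semantic association edges are left out because they depend on extracted entities and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str
    level: int    # 0 = frame, 1 = shot, 2 = scene, 3 = narrative
    start: float  # seconds
    end: float    # seconds


def containment_edges(parents, children):
    """Link each child to the parent whose span contains the child's midpoint."""
    edges = []
    for child in children:
        mid = (child.start + child.end) / 2
        for parent in parents:
            if parent.start <= mid < parent.end:
                edges.append((parent.seg_id, child.seg_id, "contains"))
                break
    return edges


def sibling_edges(segments):
    """Link each segment to the next one at the same level, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return [(a.seg_id, b.seg_id, "next") for a, b in zip(ordered, ordered[1:])]


scenes = [Segment("scene0", 2, 0, 10), Segment("scene1", 2, 10, 20)]
shots = [Segment("shot0", 1, 0, 4), Segment("shot1", 1, 4, 10),
         Segment("shot2", 1, 10, 20)]

for edge in containment_edges(scenes, shots) + sibling_edges(shots):
    print(edge)
```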

Module 4: Query-Adaptive Hierarchical Traversal (QAHT)

Core retrieval algorithm with query analysis, multi-scale retrieval, and hierarchical refinement.
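
As a rough illustration of the query-analysis step only, the sketch below pulls a time hint out of a query and picks a starting hierarchy level from keyword cues. The keyword lists, the default level, and both function names are assumptions invented for this example. The actual QAHT analysis is not documented here and may work quite differently.

```python
import re

# Matches expressions such as "10 minutes", "90 seconds", "1.5 hours".
TIME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(second|minute|hour)s?\b",
                          re.IGNORECASE)
UNIT_SECONDS = {"second": 1.0, "minute": 60.0, "hour": 3600.0}

# Hypothetical keyword cues per level (3 = narrative, 2 = scene, 0 = frame).
LEVEL_KEYWORDS = {
    3: ("overall", "summary", "summarize", "main topic"),
    2: ("scene", "segment", "section"),
    0: ("frame", "exact moment", "color"),
}


def extract_time_hint(query):
    """Return the first time expression in the query as seconds, or None."""
    match = TIME_PATTERN.search(query)
    if match is None:
        return None
    return float(match.group(1)) * UNIT_SECONDS[match.group(2).lower()]


def choose_level(query):
    """Pick a starting level from keyword cues, defaulting to shot level (1)."""
    text = query.lower()
    for level, keywords in LEVEL_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            return level
    return 1


query = "What experiment was conducted at around 10 minutes?"
print(extract_time_hint(query), choose_level(query))  # 600.0 1
```

A retriever could use the time hint to narrow candidates to nodes near that timestamp, then move up or down the hierarchy from the chosen level.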

Module 5: Multi-Modal Context Integration

Formats retrieved graph context for LVLM (Qwen2.5-VL) generation.
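
One way to picture this step is as packing retrieved node descriptions into a text prompt under a token budget. The sketch below is an assumption-laden example: the node dictionary keys, the greedy highest-score-first selection, and the whitespace word count used as a stand-in for a real tokenizer are all hypothetical.

```python
def fmt(seconds):
    """Format seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_context(nodes, token_budget=4000):
    """Greedily keep the highest-scoring nodes that fit the budget,
    then emit them in chronological order, one line per node."""
    chosen, used = [], 0
    for node in sorted(nodes, key=lambda n: n["score"], reverse=True):
        line = f"[{fmt(node['start'])}-{fmt(node['end'])}] {node['text']}"
        cost = len(line.split())  # crude proxy for a token count
        if used + cost > token_budget:
            continue
        chosen.append((node["start"], line))
        used += cost
    return "\n".join(line for _, line in sorted(chosen))


retrieved = [
    {"start": 600, "end": 630, "text": "A researcher mixes two liquids.", "score": 0.9},
    {"start": 0, "end": 20, "text": "Title card and introduction.", "score": 0.4},
    {"start": 300, "end": 320, "text": "Unrelated lab tour footage shown at length.", "score": 0.1},
]
print(format_context(retrieved, token_budget=12))
```

With a budget of 12, the lowest-scoring node is dropped and the remaining two lines are printed in time order.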

Supported Benchmarks

  • Video-MME: Multiple-choice QA on videos from 11 seconds to 1 hour
  • MLVU: Multi-task Long Video Understanding
  • LongerVideos: Cross-video open-ended QA
  • EgoSchema: Egocentric video understanding
  • ActivityNet-QA: Activity-focused QA

Configuration

Configuration is managed via YAML files. See config/default_config.yaml for all available options.

# Example configuration
video:
  fps: 1
  max_resolution: 720

graph:
  hierarchy_levels: 4

retrieval:
  token_budget: 4000
  faiss_index: "IVF-PQ"

Citation

If you use HT-VideoGraph in your research, please cite:

@article{htvideograph2024,
  title={HT-VideoGraph: Hierarchical Temporal Knowledge Graph for Long Video Understanding},
  author={HT-VideoGraph Team},
  journal={arXiv preprint},
  year={2024}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project uses pre-trained models, including CLIP, Wav2Vec2, CLAP, BGE-M3, GroundingDINO, PANNs, Whisper, and Qwen2.5-VL.
