The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (Oral)

## Overview

The 360+x dataset introduces a unique panoptic perspective to scene
understanding. It differs from traditional datasets by offering multiple
viewpoints and modalities, captured across a variety of scenes.

### Key Features:

-   **Multi-viewpoint Captures:** Includes 360° panoramic video,
    third-person front view video, egocentric monocular video, and
    egocentric binocular video.
-   **Rich Audio Modalities:** Features normal audio and directional
    binaural delay.
-   **2,152 multi-modal videos** captured by 360 cameras and Spectacles
    cameras (8,579k frames in total), recorded in 17 cities across 5
    countries and covering 28 scenes ranging from Artistic Spaces to
    Natural Landscapes.
-   **Action Temporal Segmentation:** Provides labels for 38 action
    instances for each video pair.

## Dataset Details

### Dataset Structure Stored in RDS

-   360x_dataset_original_resolution: folder containing the
    original-resolution version of the dataset.
-   index.json: file containing the information for each video.
-   360x_dataset_lower_resolution: folder containing the
    lower-resolution version of the dataset.
-   TAL_annotations: folder containing the temporal action
    localization annotations.
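
After downloading, it can be useful to confirm that the top-level entries listed above are present. The following is a minimal sketch using only the Python standard library. It assumes that all four entries sit directly under a single root directory, which the list above does not state explicitly; the helper name `missing_entries` is hypothetical and not part of any released tooling.

```python
from pathlib import Path

# Top-level entries named in the structure list above.
# Assumption: all of them live directly under one root directory.
EXPECTED_DIRS = (
    "360x_dataset_original_resolution",
    "360x_dataset_lower_resolution",
    "TAL_annotations",
)
EXPECTED_FILES = ("index.json",)


def missing_entries(root):
    """Return the expected top-level entries that are absent under `root`."""
    root = Path(root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_layout.py /path/to/dataset_root
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    absent = missing_entries(root)
    if absent:
        print("Missing entries:", ", ".join(absent))
    else:
        print("All expected entries found.")
```

An empty result only shows that the entries exist; it does not validate their contents.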

### Project Description

-   **Developed by:** Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini,
    Xiaohan Hong, Jianbo Jiao
-   **Funded by:** The Ramsay Research Fund and the Royal Society Short
    Industry Fellowship
-   **License:** Creative Commons Attribution-NonCommercial-ShareAlike
    4.0

### Sources

-   **Repository:** Coming Soon
-   **Paper:** https://arxiv.org/abs/2404.00989

## Dataset Statistics

-   **Total Videos:** 2,152, split between 464 videos captured using 360
    cameras and 1,688 with Spectacles cameras.
-   **Scenes:** 15 indoor and 13 outdoor, totaling 28 scene categories.
-   **Short Clips:** The videos have been segmented into 1,380 shorter
    clips, each approximately 10 seconds long, totaling around 67.78
    hours.
-   **Frames:** 8,579k frames across all clips.

## Dataset Structure

The dataset comprises panoramic, binocular, and third-person videos,
with annotations accompanying each pair of videos. It also includes
features extracted using I3D, VGGish, and ResNet-18. Because the videos
are high resolution (5760x2880 for panoramic and binocular videos,
1920x1080 for third-person front view videos), the full dataset is
large. To accommodate different research needs and computational
budgets, we also provide a lower-resolution version (640x320 for
panoramic and binocular videos, 569x320 for third-person front view
videos) for download.
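
The lower-resolution sizes are consistent with scaling each video to a height of 320 pixels while preserving its aspect ratio. This is an inference from the numbers quoted here, not a documented description of how the lower-resolution version was produced. A small sketch of that arithmetic (the function name `scale_to_height` is hypothetical):

```python
def scale_to_height(width, height, target_height=320):
    """Scale (width, height) to `target_height`, preserving aspect ratio.

    The new width is rounded to the nearest integer.
    """
    new_width = round(width * target_height / height)
    return new_width, target_height


# Panoramic and binocular videos: 5760x2880 -> 640x320
print(scale_to_height(5760, 2880))
# Third-person front view videos: 1920x1080 -> 569x320 (568.89 rounded)
print(scale_to_height(1920, 1080))
```

The panoramic and binocular videos shrink by a factor of 9 in each dimension, and the third-person videos by a factor of about 3.4.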

This repo provides the full version of the dataset. To access only the lower-resolution version, please visit our Hugging Face page: https://huggingface.co/datasets/quchenyuan/360x_dataset

## BibTeX

    @inproceedings{chen2024x360,
      title={360+x: A Panoptic Multi-modal Scene Understanding Dataset},
      author={Chen, Hao and Hou, Yuqi and Qu, Chenyuan and Testini, Irene and Hong, Xiaohan and Jiao, Jianbo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2024}
    }