[Invited Talk] Refining action segmentation with hierarchical video representations

Presenter: Hyemin Ahn, German Aerospace Center; Time: 4:00pm, Friday (2021/08/13); Location: Online meeting on Zoom (ID: 568 097 6074, PW: 243175)

Abstract

In this talk, we introduce the Hierarchical Action Segmentation Refiner (HASR), which refines temporal action segmentation results from various models by understanding the overall context of a given video in a hierarchical way. When a backbone model for action segmentation estimates how the given video should be segmented, our model extracts segment-level representations from the frame-level features, and then a video-level representation from those segment-level representations. With these hierarchical representations, our model can take the overall context of the entire video into account and predict how segment labels that are out of context should be corrected. HASR can be plugged into various action segmentation models (MS-TCN, SSTDA, ASRF) and improves the performance of these state-of-the-art models on three challenging datasets (GTEA, 50Salads, and Breakfast). For example, on the 50Salads dataset, the segmental edit score improves from 67.9% to 77.4% (MS-TCN), from 75.8% to 77.3% (SSTDA), and from 79.3% to 81.0% (ASRF). In addition, our model can refine segmentation results from an unseen backbone model, i.e., one that was not used when training HASR. This generalization ability makes HASR an effective tool for boosting existing approaches to temporal action segmentation.
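
To make the hierarchical refinement idea more concrete, the sketch below shows one possible way to pool backbone frame features into segment-level representations, pool those into a video-level representation, and re-predict each segment label with access to the video-level context. This is a minimal illustrative sketch in PyTorch, not the authors' HASR implementation: the GRU encoders, layer sizes, the HierarchicalRefiner class, and its interface are all assumptions made for this example.

import torch
import torch.nn as nn
from typing import List, Tuple


class HierarchicalRefiner(nn.Module):
    """Hypothetical sketch of a hierarchical segment-label refiner (not the official HASR)."""

    def __init__(self, frame_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # Segment-level encoder: summarizes the frame features inside each segment.
        self.segment_encoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        # Video-level encoder: summarizes the sequence of segment representations.
        self.video_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Classifier that re-predicts each segment label from its own representation
        # concatenated with the video-level context.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frame_features: torch.Tensor, segments: List[Tuple[int, int]]):
        """
        frame_features: (T, frame_dim) backbone features for one video.
        segments: list of (start, end) frame indices from a backbone's initial segmentation.
        Returns refined class logits of shape (num_segments, num_classes).
        """
        seg_reprs = []
        for start, end in segments:
            seg_frames = frame_features[start:end].unsqueeze(0)  # (1, length, frame_dim)
            _, h = self.segment_encoder(seg_frames)              # h: (1, 1, hidden_dim)
            seg_reprs.append(h.squeeze(0).squeeze(0))
        seg_reprs = torch.stack(seg_reprs)                        # (S, hidden_dim)

        # Video-level representation built from the segment-level representations.
        _, video_h = self.video_encoder(seg_reprs.unsqueeze(0))   # (1, 1, hidden_dim)
        video_repr = video_h.squeeze(0).squeeze(0)                # (hidden_dim,)

        # Refine each segment label with the overall video context attached.
        context = video_repr.expand_as(seg_reprs)                 # (S, hidden_dim)
        return self.classifier(torch.cat([seg_reprs, context], dim=-1))


if __name__ == "__main__":
    # Random stand-in data; the dimensions and segment boundaries are arbitrary placeholders.
    refiner = HierarchicalRefiner(frame_dim=2048, hidden_dim=256, num_classes=19)
    frames = torch.randn(300, 2048)                  # 300 frames of backbone features
    initial_segments = [(0, 120), (120, 210), (210, 300)]
    logits = refiner(frames, initial_segments)       # (3, 19) refined segment logits
    print(logits.shape)

In this sketch the refined logits would replace the backbone's segment labels; the actual HASR additionally generalizes to backbones not seen during its training, as described in the abstract.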

Biography

Hyemin Ahn received her Ph.D. degree from the Department of Electrical and Computer Engineering at Seoul National University, Seoul, Korea, in 2020, and her B.S. degree from the Department of Electrical and Electronics Engineering at Seoul National University in 2014. She is currently a Postdoctoral Researcher at the Robotics and Mechatronics Center of the German Aerospace Center. Her research interests include human-centered assistive robotics and deep-learning-based perception.