Title: Real-Time Visual Tracking by Convolutional Neural Networks: a History and the Future
Abstract: Recent tracking algorithms based on convolutional neural networks (CNNs) achieve impressive accuracy, and MDNet is one of the seminal works among them. However, many issues remain, including the efficiency of the algorithms and how to handle challenging situations. In this talk, I will first present several real-time visual tracking algorithms based on CNNs and their characteristics in terms of methodology and performance on recent benchmarks. I will then discuss the existing limitations of visual tracking algorithms and potential research directions to address these challenges.
Biography: Bohyung Han is currently an Associate Professor in the Department of Electrical and Computer Engineering at Seoul National University, Korea. Prior to his current position, he was an Associate Professor in the Department of Computer Science and Engineering at POSTECH and a visiting research scientist in the Machine Intelligence Group at Google, Venice, CA, USA. He received the B.S. and M.S. degrees from the Department of Computer Engineering at Seoul National University, Korea, in 1997 and 2000, respectively, and the Ph.D. degree from the Department of Computer Science at the University of Maryland, College Park, MD, USA, in 2005. He has served, or will serve, as an Area Chair or Senior Program Committee member of numerous major conferences in computer vision and machine learning, as a Tutorial Chair of ICCV 2019, and as a Demo Chair of ECCV 2022. He also serves as an Associate Editor of TPAMI and MVA, and as an Area Editor of CVIU. He is interested in a variety of problems in computer vision and machine learning, with an emphasis on deep learning. His research group won the Visual Object Tracking (VOT) Challenge in 2015 and 2016.
Title: Learning to track and segment objects in videos
Abstract: In this talk, I will present our recent results on visual tracking and video object segmentation.
The tracking-by-detection framework typically consists of two stages: drawing samples around the target object in the first stage, and classifying each sample as the target object or as background in the second stage. The performance of existing trackers using deep classification networks is limited in two respects. First, the positive samples in each frame are highly overlapped spatially, so they fail to capture rich appearance variations. Second, there is extreme class imbalance between positive and negative samples. Our VITAL algorithm addresses these two problems via adversarial learning. To augment positive samples, we use a generative network to randomly generate masks, which are applied to adaptively drop out input features so as to capture a variety of appearance changes. Through adversarial learning, our network identifies the mask that maintains the most robust features of the target object over a long temporal span. In addition, to handle the class-imbalance issue, we propose a high-order cost-sensitive loss that decreases the effect of easy negative samples, facilitating the training of the classification network. Extensive experiments on benchmark datasets demonstrate that the proposed tracker performs favorably against state-of-the-art approaches.
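The high-order cost-sensitive loss described above can be illustrated with a focal-loss-style modulating factor that down-weights easy examples. This is a minimal sketch, not the exact loss used in VITAL; the function name, the NumPy formulation, and the exponent `gamma` are illustrative assumptions.

```python
import numpy as np

def cost_sensitive_loss(probs, labels, gamma=2.0):
    """Focal-loss-style high-order cost-sensitive binary loss (a sketch).

    Multiplying the cross-entropy term by (1 - p_t)**gamma, where p_t is
    the predicted probability of the true class, makes abundant easy
    negatives contribute little to training, mitigating class imbalance.

    probs  : predicted probability of the positive class, shape (N,)
    labels : 1 for target, 0 for background, shape (N,)
    """
    p_t = np.where(labels == 1, probs, 1.0 - probs)  # prob. of the true class
    eps = 1e-12                                      # numerical stability
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))

# An easy negative (positive-class probability 0.01) is down-weighted far
# more than a hard negative (positive-class probability 0.9).
easy = cost_sensitive_loss(np.array([0.01]), np.array([0]))
hard = cost_sensitive_loss(np.array([0.9]), np.array([0]))
```

With `gamma=0` the expression reduces to plain cross-entropy; increasing `gamma` sharpens the focus on hard samples.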
Online video object segmentation is a challenging task, as it requires processing the image sequence both promptly and accurately. To segment a target object throughout a video, numerous CNN-based methods heavily fine-tune on the object mask given in the first frame, which is too time-consuming for online applications. In the second part of the talk, I will present a fast and accurate video object segmentation algorithm that can start segmenting immediately upon receiving the images. We first utilize a part-based tracking method to handle challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of the parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function refines these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in terms of accuracy on the DAVIS benchmark dataset, while achieving much faster runtimes.
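The similarity-based refinement step above can be sketched as a cosine-similarity comparison between each tracked part and reference features from the first annotated frame. The feature extraction is abstracted away here, and the function names and the threshold are illustrative assumptions rather than the method's actual scoring function.

```python
import numpy as np

def part_score(part_feat, ref_feats):
    """Score a tracked part by its best cosine similarity to reference
    part features extracted from the first (annotated) frame.

    part_feat : (D,) feature vector of the current part
    ref_feats : (K, D) feature vectors of the K reference parts
    """
    part = part_feat / (np.linalg.norm(part_feat) + 1e-12)
    refs = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-12)
    return float(np.max(refs @ part))  # in [-1, 1]; higher = more target-like

def refine_parts(part_feats, ref_feats, threshold=0.5):
    """Keep only the parts that remain similar enough to the first frame,
    discarding drifted or occluded parts before assembling the final mask."""
    return [i for i, f in enumerate(part_feats)
            if part_score(f, ref_feats) >= threshold]
```

A part whose appearance still matches the first frame scores near 1 and is kept; a part that has drifted onto the background scores low and is dropped.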
Biography: Ming-Hsuan Yang is a research scientist at Google and a professor of Electrical Engineering and Computer Science at the University of California, Merced. He received the PhD degree in Computer Science from the University of Illinois at Urbana-Champaign in 2000. He has served as an area chair for several conferences, including the IEEE Conference on Computer Vision and Pattern Recognition, the IEEE International Conference on Computer Vision, the European Conference on Computer Vision, the Asian Conference on Computer Vision, and the AAAI National Conference on Artificial Intelligence. He served as a program co-chair for the IEEE International Conference on Computer Vision in 2019 as well as the Asian Conference on Computer Vision in 2014, and as a general co-chair for the Asian Conference on Computer Vision in 2016. He has served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (2007 to 2011), the International Journal of Computer Vision, Computer Vision and Image Understanding, Image and Vision Computing, and the Journal of Artificial Intelligence Research. He received the Google Faculty Award in 2009, the Distinguished Early Career Research Award from the UC Merced Senate in 2011, the Faculty Early Career Development (CAREER) Award from the National Science Foundation in 2012, and the Distinguished Research Award from the UC Merced Senate in 2015. He is an IEEE Fellow.