People like to watch videos. Video conveys information in a way no other medium (still photos, audio clips) can match, and billions of videos are watched every day. Can we harness this free-viewing habit for video annotation and for tracking objects in videos? I worked on a research project to try to answer this question.

After some literature review, I found that the work done in this area is very limited [1][2]. I had access to a Tobii EyeX controller, which records eye movements as a person looks at a screen, so I decided to build on that. Eye movements fall into three categories: saccades, smooth pursuit, and fixations [3]. After inspecting the data from the tracker, I decided to consider only the fixations in my research, since a fixation implies that the viewer was attending to an object of interest in those frames.
I gathered gaze data from a free-viewing task on an episode of Scrubs. The user was asked to watch the video normally and not to look for anything specific. The data is given per frame; I kept only the fixation samples and averaged the position (x, y) over the duration of each fixation to get one point (the tracker reports a fixation as multiple points over f consecutive frames). At this point, I have fixation points K_1 to K_n, where K_i is a fixation point and i is the frame number.
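To make that concrete, here is a minimal sketch of the averaging step. It assumes the tracker exports per-frame samples tagged with an event type and a fixation id; those field names are my own placeholders, not the actual Tobii export format.

```python
import numpy as np

def collapse_fixations(samples):
    """Collapse raw per-frame gaze samples into one point per fixation.

    `samples` is assumed to be a list of dicts like
    {"frame": int, "x": float, "y": float,
     "event": "fixation" | "saccade" | ..., "fixation_id": int}
    -- hypothetical field names, not the real Tobii export format.
    Returns a list of (frame, x, y) tuples, one per fixation, where the
    frame is the midpoint of the fixation and (x, y) is the mean position.
    """
    fixations = {}
    for s in samples:
        if s["event"] != "fixation":
            continue                      # keep fixations only, drop saccades etc.
        fixations.setdefault(s["fixation_id"], []).append(s)

    points = []
    for fid, group in sorted(fixations.items()):
        frames = np.array([g["frame"] for g in group])
        xs = np.array([g["x"] for g in group])
        ys = np.array([g["y"] for g in group])
        points.append((int(frames.mean()), float(xs.mean()), float(ys.mean())))
    return points
```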
As a separate task, I ran an upper-body detector [4] and the Viola-Jones face detector [5] on the video. The results were extremely noisy, so I ran non-maximal suppression on the output. The results improved noticeably, but there were still a lot of false positives.
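For reference, the non-maximal suppression here is the standard greedy, IoU-based pass; a sketch of it (box format and overlap threshold are assumptions, not the exact settings I used) looks like this:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximal suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) detection confidences
    Returns the indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]   # drop heavily overlapping boxes
    return keep
```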

OK, at this point, I have two separate sets of data: fixation points from watching the video, and detected boxes from the detection algorithms. I decided to start by playing with the fixation points. First, I added time to the descriptors, so each fixation point now has (x, y, t). I wanted to cluster the fixation points together, since (after adding time) they will probably group into objects of interest. The problem is that I didn't know the number of clusters in advance, so I used self-tuning spectral clustering [6] to estimate the number of clusters, and then ran k-means on the data. So now I have k clusters, with k centroids, as you can see here:

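To give an idea of the clustering step, here is a simplified sketch: it estimates k from the eigengap of a normalized graph Laplacian as a stand-in for the full self-tuning method of [6], then hands that k to k-means. The affinity bandwidth is an assumed parameter, not a value from this project.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_fixations(points, sigma=50.0):
    """Cluster (x, y, t) fixation points without knowing k in advance.

    A simplified stand-in for self-tuning spectral clustering [6]:
    estimate k from the eigengap of a normalized graph Laplacian built
    on a Gaussian affinity, then run k-means on the raw points.
    `sigma` (affinity bandwidth) is an assumed parameter.
    """
    X = np.asarray(points, dtype=float)           # rows are (x, y, t)
    # Pairwise squared distances and Gaussian affinity
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    L = np.eye(len(X)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))
    # Pick k at the largest gap among the smallest eigenvalues
    k = int(np.argmax(np.diff(eigvals[:10])) + 1)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    return km.labels_, km.cluster_centers_
```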
So, we now have k clusters, each a possible spatio-temporal region where an object of interest was present. With this data, I decided to filter out the detected boxes (remember those from before?) that don't lie within any cluster's territory. Here is the approach:
- For cluster k_i, active from t1 to t2, and for all detected boxes from t1 to t2, get the mean distance (in x, y, and t) of those boxes from the center of the cluster.
- Filter out any box whose distance from the center is greater than that mean distance (see the sketch below).
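Here is a rough sketch of that filtering step. For simplicity it treats each detection as its box center; the real boxes of course also carry width and height.

```python
import numpy as np

def filter_boxes_by_cluster(boxes, centroid, t_start, t_end):
    """Keep only detections that stay close to a cluster's centroid.

    boxes:    list of (frame, x, y) box centers (simplified representation)
    centroid: (x, y, t) center of the fixation cluster
    Keeps the boxes whose (x, y, t) distance from the centroid is below
    the mean distance of all boxes active in [t_start, t_end].
    """
    active = [(f, x, y) for (f, x, y) in boxes if t_start <= f <= t_end]
    if not active:
        return []
    pts = np.array([[x, y, f] for (f, x, y) in active], dtype=float)
    cx, cy, ct = centroid
    dists = np.linalg.norm(pts - np.array([cx, cy, ct]), axis=1)
    return [b for b, d in zip(active, dists) if d <= dists.mean()]
```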
Now we have filtered the boxes based on their distances from the cluster centers, which represent the roaming areas of the objects of interest. The results were as follows:
Note that there was still noise in the filtered data. So I decided to see what would happen if we ran a tracking algorithm on both sets of data, since detection here was done per frame and not as a sequence. I used the tracking algorithm from [7], which builds an appearance model for the tracked objects. The results were amazing: a 30-point boost in precision (from 40% to 70%) and an 18-point boost in accuracy (from 15% to 33%).
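For context, a common way to score detections like these is IoU-based precision against ground-truth boxes; the sketch below is a generic illustration of that kind of metric, not the exact evaluation protocol from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def precision(detections, ground_truth, iou_thresh=0.5):
    """Fraction of detections that overlap some ground-truth box enough."""
    hits = sum(1 for d in detections
               if any(iou(d, g) >= iou_thresh for g in ground_truth))
    return hits / len(detections) if detections else 0.0
```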

A note worth mentioning: this research was done using only one viewer and only one forward pass over the data. I believe this can be crowd-sourced (multiple viewers watching videos normally), especially since eye trackers are getting cheaper and cheaper. Also, running a backward pass over the data would boost the accuracy further.

P.S.: A paper about this research was accepted at the IEEE International Conference on Image Processing Theory, Tools and Applications (IPTA) 2016.
References
[1] S. Karthikeyan, T. Ngo, M. Eckstein, and B. S. Manjunath, "Eye tracking assisted extraction of attentionally important objects from videos," in CVPR, Boston, MA, Jun. 2015.
[2] S. Vrochidis, I. Patras, and I. Kompatsiaris, "Exploiting gaze movements for automatic video annotation," in WIAMIS, 2012, pp. 1–4.
[3] Chapter 8D – Control of Eye Movements.
[4] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive search space reduction for human pose estimation," in CVPR, 2008.
[5] P. Viola and M. Jones, "Robust real-time object detection," IJCV, 2001.
[6] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Advances in Neural Information Processing Systems 17, pp. 1601–1608, MIT Press, 2005.
[7] S.-H. Bae and K.-J. Yoon, "Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning," in CVPR, 2014.