Our paper has been accepted to the 4th Workshop on Web-scale Vision and Social Media (VSM), held in conjunction with ECCV 2016!

This paper is about retrieving videos with text queries and text with video queries. Like many existing approaches, ours uses an LSTM to encode text, but we observed that the LSTM tends to lose details of the text (for example, it mixes up “typing the keyboard” and “playing the keyboard”). The main contribution of this paper is to fuse the text representation with web images retrieved using the text as a query, which can disambiguate the text.
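
To make the fusion idea concrete, here is a minimal toy sketch. It is not the model from the paper. It assumes that the sentence vector and the web image feature vectors have already been projected into a shared space of the same dimension, and it swaps in a simple weighted average for whatever fusion the paper actually learns. The function names, the `alpha` weight, and all the numbers are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fuse(text_vec, image_vecs, alpha=0.5):
    """Blend a sentence vector with the mean of its retrieved image vectors.

    Assumption: a plain weighted average of unit vectors, used here only
    for illustration; alpha is a hypothetical mixing weight.
    """
    text_unit = l2_normalize(text_vec)
    image_unit = l2_normalize(mean_pool(image_vecs))
    mixed = [alpha * t + (1 - alpha) * i for t, i in zip(text_unit, image_unit)]
    return l2_normalize(mixed)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Made-up vectors: the two sentences look almost identical to the text
# encoder, but the web images retrieved for each one look very different.
text_typing = [1.0, 0.10]
text_playing = [1.0, 0.12]
images_typing = [[1.0, 0.0], [0.9, 0.1]]
images_playing = [[0.0, 1.0], [0.1, 0.9]]

fused_typing = fuse(text_typing, images_typing)
fused_playing = fuse(text_playing, images_playing)

print("text-only similarity:", cosine(text_typing, text_playing))
print("fused similarity:    ", cosine(fused_typing, fused_playing))
```

In this toy setup the fused vectors of the two sentences end up further apart than the text-only vectors, which is the disambiguation effect described above.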

Update:

An arXiv preprint is now available. Please find it at: arXiv:1608.02367