Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Noroozi, Mehdi; Favaro, Paolo (2016). Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In: Leibe, B.; Matas, J.; Sebe, N.; Welling, M. (eds.) European Conference on Computer Vision (ECCV). Lecture Notes in Computer Science: Vol. 9910 (pp. 69-84). Springer 10.1007/978-3-319-46466-4_5

Text
ECCV2016_jigsaw.pdf - Submitted Version
Restricted to registered users only
Available under License Publisher holds Copyright.

In this paper we study the problem of image representation learning without human annotation. Following the principles of self-supervision, we build a convolutional neural network (CNN) that can be trained to solve jigsaw puzzles as a pretext task, which requires no manual labeling, and then later repurposed to solve object classification and detection. To maintain compatibility across tasks we introduce the context-free network (CFN), a siamese-ennead CNN. The CFN takes image tiles as input and explicitly limits the receptive field (or context) of its early processing units to one tile at a time. We show that the CFN is a more compact version of AlexNet, but with the same semantic learning capabilities. By training the CFN to solve jigsaw puzzles, we learn both a feature mapping of object parts and their correct spatial arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. After training our CFN features to solve jigsaw puzzles on the training set of the ILSVRC 2012 dataset [9], we transfer them via fine-tuning on the combined training and validation set of Pascal VOC 2007 for object detection (via Fast R-CNN) and classification. The performance of the CFN features is 51.8% for detection and 68.6% for classification, which is the highest among features obtained via unsupervised learning and narrows the gap with features obtained via supervised learning (56.5% and 78.2%, respectively). In object classification the CFN features achieve 38.1% on the ILSVRC 2012 validation set, after fine-tuning only the fully connected layers on the training set.
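The pretext task described in the abstract can be illustrated with a minimal sketch of how one jigsaw training sample might be built: an image is cut into nine tiles (one per branch of the siamese-ennead network), the tiles are reordered by a permutation drawn from a fixed set, and the index of that permutation serves as the label. The abstract does not say how puzzles are generated, so the 3x3 grid, the 225-pixel image size, the set of 100 randomly sampled permutations, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import random

import numpy as np


def split_into_tiles(image, grid=3):
    """Cut an H x W x C image into grid*grid equal tiles, in row-major order."""
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    return [
        image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(grid)
        for c in range(grid)
    ]


def sample_permutations(n_perms, n_tiles, rng):
    """Draw n_perms distinct tile orderings at random (assumed selection rule)."""
    perms = set()
    while len(perms) < n_perms:
        perms.add(tuple(rng.sample(range(n_tiles), n_tiles)))
    return sorted(perms)


def make_puzzle(image, permutations, rng):
    """Return (shuffled_tiles, label); label is the index of the permutation used."""
    tiles = split_into_tiles(image)
    label = rng.randrange(len(permutations))
    order = permutations[label]
    # Position k of the puzzle holds the original tile order[k].
    shuffled = np.stack([tiles[i] for i in order])
    return shuffled, label


rng = random.Random(0)
permutations = sample_permutations(100, 9, rng)  # hypothetical set size

# Stand-in for a real photograph: random pixels, 225 x 225 RGB.
image = np.random.default_rng(0).integers(0, 256, size=(225, 225, 3), dtype=np.uint8)

tiles, label = make_puzzle(image, permutations, rng)
print(tiles.shape, label)
```

Under these assumptions, each of the nine tiles would be fed to its own weight-sharing branch, and a classifier over the branch outputs would be trained to predict `label`, so that no manual annotation is needed.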

Item Type:	Conference or Workshop Item (Paper)
Division/Institute:	08 Faculty of Science > Institute of Computer Science (INF) > Computer Vision Group (CVG); 08 Faculty of Science > Institute of Computer Science (INF)
UniBE Contributor:	Noroozi, Mehdi; Favaro, Paolo
Subjects:	000 Computer science, knowledge & systems; 500 Science > 510 Mathematics
Series:	Lecture Notes in Computer Science
Publisher:	Springer
Language:	English
Submitter:	Xiaochen Wang
Date Deposited:	12 Jun 2017 16:49
Last Modified:	05 Dec 2022 15:04
Publisher DOI:	10.1007/978-3-319-46466-4_5
BORIS DOI:	10.7892/boris.98318
URI:	https://boris.unibe.ch/id/eprint/98318
