Research:Prototypes of Image Classifiers Trained on Commons Categories

This page documents a completed research project.


In this project, we develop prototypes to evaluate the feasibility of building simple in-house computer vision tools to support future platform evolution. While existing computer vision-based image classifiers are trained on general web images or on large-scale image classification repositories such as ImageNet, here we try to exploit the richness of the category structure of Wikimedia Commons to annotate a large number of images, and then train classifiers using the annotated data. In this report, we highlight the major milestones of this project, as well as specific areas for improvement.

Methods

The overall workflow to build image classifiers is as follows. First, we need to collect labeled images that we will use to train classifiers. To do so, we first identify a list of objects or concepts (e.g. "boat", "person") we want the classifier to distinguish between, and then associate each concept with a set of semantically related Wikimedia Commons categories (e.g. commons:Category:People_with_boats). Next, for each concept, we collect images from related Commons categories and label them with the corresponding concept. We then use this data to train a classifier based on a CNN, and evaluate its accuracy and efficiency.
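
As a minimal illustration of the intermediate data this workflow produces, the sketch below shows a concept-to-category mapping and the labeling loop in Python; the category titles reuse the examples above, and the files_in_category helper is hypothetical.

  # Illustrative sketch of the concept -> Commons categories mapping and the labeling loop.
  # The two entries reuse the examples above; the real mapping covers many more concepts.
  concept_to_categories = {
      "boat": ["Category:People_with_boats"],
      "person": ["Category:People"],
  }

  def label_images(concept_to_categories, files_in_category):
      # files_in_category is a hypothetical callable returning the file titles in a category.
      labeled = []
      for concept, categories in concept_to_categories.items():
          for category in categories:
              for file_title in files_in_category(category):
                  labeled.append((file_title, concept))
      return labeled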

Data Labeling

Large scale image classifiers need lots of data to train. While there exist large image datasets such as ImageNet, which are constructed from generic web images, when we started this project there was no publicly available dataset for image classification based on Wikimedia Commons. However, for this project, we really wanted to focus on Wikimedia specific data, to both evaluate the richness of our image repositories, and to pave the way for fully integrated image classifiers which are customisable by communities based on already existing structures (such as Commons Categories). So, we built our own dataset. This was the bit of work which required the largest research effort.

Category-based Image Annotations

We created a dataset of images labeled with general concepts in a semi-automated way.

Using Raw Commons Categories: Issues and Observations

We really wanted to leverage the semantic associations between Commons images and categories. However, we quickly realized that there were several issues:

  1. First, there are 6,780,411 categories on Commons. How do we find the ones that are relevant for identifying concepts in images? From several conversations with experienced Commons users, we learned that the category network of Commons is not a reliable way to identify important topics, as the hierarchy of categories does not imply semantic dependency.
  2. Moreover, the categories assigned to an image don't necessarily reflect what is represented in the image. Examples of such categories include license-related categories (e.g. CC-BY-SA-4.0), maintenance categories (e.g. Artworks_without_Wikidata_item), and tool-related categories (e.g. Uploaded_with_pattypan). Therefore, aside from all the issues related to parameter tuning, automatically identifying categories or groups of categories with some form of unsupervised learning would result in the selection of non-visual concepts which are not useful for this task.
  3. Finally, while the average number of images per category is 52, the median number of images per category is 4, and only 0.2% of all categories (12,125) have more than 1,000 images (a sketch of how such statistics can be computed follows this list). However, selecting the top X categories as the list of concepts to be classified is not a good strategy: the categories with the largest number of images are generally non-semantic ones. For example, the biggest category is Self-published_work with 25,652,000 images, while the second largest is Pages_with_maps with 14,153,745 images.
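
As a minimal sketch of how the category-size statistics above can be derived, assuming a mapping from category title to image count is already available (the toy counts below are illustrative, except the two large categories quoted in the text):

  from statistics import mean, median

  # category_image_counts: Commons category title -> number of images,
  # e.g. obtained from a database dump. Only a toy sample is shown here.
  category_image_counts = {
      "Self-published_work": 25_652_000,
      "Pages_with_maps": 14_153_745,
      "People_with_boats": 1_200,    # illustrative count
      "CC-BY-SA-4.0": 5_000_000,     # illustrative count
  }

  counts = list(category_image_counts.values())
  print("average images per category:", mean(counts))
  print("median images per category:", median(counts))
  print("share with >1000 images:", sum(c > 1000 for c in counts) / len(counts))

  # Ranking by size surfaces mostly non-semantic categories (licenses, maintenance, tools),
  # which is why simply taking the top-X categories does not work.
  print(sorted(category_image_counts.items(), key=lambda kv: kv[1], reverse=True)[:2])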

Semi-Automated Labeling Process

To overcome the issues above, we opted for a semi-automated labeling process. We proceeded as follows:

  1. We compiled a list of generic concepts which would be useful to detect in Commons pictures. Since we could not fill this list based on Commons categories (see above), we resorted to taxonomies from external datasets. Among the various visual recognition datasets available, the COCO-stuff dataset provides the most generic set of categories. These concepts include people, animals, and things which exist in the visual world, and are organized in a coherent taxonomy. We added to the original list a set of concepts which are very prominent in Commons, e.g. "Fossils" or "Paintings", and compiled an initial list of concepts.
  2. We downloaded the list of ~7M categories in Commons, ranked by the associated number of images per category. Statistics can be found above.
  3. Next, we needed to associate COCO concepts with Commons categories. A manual association would have been too costly, so we went for a semi-automated process.
    1. Computing Word Vectors - We computed FastText vectors for both COCO concept labels and Commons categories. For categories labeled with more than one word, we averaged the corresponding word vectors.
    2. Matching Concepts with Commons Categories - Next, we matched COCO concepts with Commons categories by computing the cosine distance between the corresponding vectors. For each COCO concept, we dumped all Commons categories with a distance below 0.1, i.e. the most semantically similar categories (see the sketch after this list).
    3. Manual Cleanup - However, word vectors are not necessarily the best solution for this problem, and while this approach helped reduce the search space, we had to do a lot of cleaning up of the resulting COCO-Commons matches, either removing irrelevant Commons categories or manually searching for additional ones.
Manual cleanup was by far the most time-consuming task of this research. While for this specific project a single person went through the whole process by hand, next time we might want to set up a collaborative task or a crowdsourcing experiment to solve this matching problem.
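
A hedged sketch of the matching step, assuming a pretrained FastText model (e.g. cc.en.300.bin) is available locally; the model path, helper names, and the toy inputs are assumptions, while the 0.1 distance cut-off follows the description above:

  import fasttext
  import numpy as np
  from scipy.spatial.distance import cosine

  model = fasttext.load_model("cc.en.300.bin")  # path to a pretrained model (assumption)

  def label_vector(label):
      # Average the word vectors of a (possibly multi-word) label, e.g. "People_with_boats".
      words = label.replace("_", " ").split()
      return np.mean([model.get_word_vector(w) for w in words], axis=0)

  def candidate_categories(coco_concept, commons_categories, max_distance=0.1):
      # Keep Commons categories whose cosine distance to the concept is below the cut-off.
      concept_vec = label_vector(coco_concept)
      return [
          category
          for category in commons_categories
          if cosine(concept_vec, label_vector(category)) < max_distance
      ]

  # The output still needs manual cleanup, as described above.
  print(candidate_categories("boat", ["People_with_boats", "Uploaded_with_pattypan"]))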

Results

The resulting data is available as two files:

  • The list of 160 COCO concepts for which we have matches in the set of Commons categories, with the corresponding expected total number of images: https://phab.wmfusercontent.org/file/download/kdvhjlqvkuqgxg3reiik/PHID-FILE-erv4xhg5fkfjqmr3dnzd/final_category_counts.tsv
  • The raw list of Commons categories associated with each COCO concept: https://phab.wmfusercontent.org/file/download/vacwwcyubaqfwwhelt4d/PHID-FILE-hvadn45ovgqckluy5jxt/final_category_list.tsv

 
[Figure: Number of images downloaded for each concept]

Data Downloading

Using the CommonsDownloader tool for Python, we downloaded all images from each category associated with a COCO concept, and labeled each image with the corresponding concept.
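
The project used the CommonsDownloader tool; as a hedged illustration of the equivalent operation, the sketch below fetches 600px-wide thumbnails for one category through the public MediaWiki API with requests (the output directory, batch size, and the lack of continuation handling are simplifications and assumptions):

  import os
  import requests

  API = "https://commons.wikimedia.org/w/api.php"

  def download_category(category, concept, out_dir="images", width=600):
      # Download up to 50 thumbnails from one Commons category (continuation omitted).
      os.makedirs(os.path.join(out_dir, concept), exist_ok=True)
      params = {
          "action": "query",
          "format": "json",
          "generator": "categorymembers",
          "gcmtitle": category,      # e.g. "Category:People_with_boats"
          "gcmtype": "file",
          "gcmlimit": "50",
          "prop": "imageinfo",
          "iiprop": "url",
          "iiurlwidth": width,       # request a 600px-wide thumbnail
      }
      pages = requests.get(API, params=params).json().get("query", {}).get("pages", {})
      for page in pages.values():
          info = page.get("imageinfo", [{}])[0]
          url = info.get("thumburl") or info.get("url")
          if not url:
              continue
          path = os.path.join(out_dir, concept, url.rsplit("/", 1)[-1])
          with open(path, "wb") as f:
              f.write(requests.get(url).content)

  download_category("Category:People_with_boats", "boat")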

Results

In total, we downloaded 715,151 images spanning 160 categories, at a resolution of 600px. It took about 2 days to download all these images using 4 parallel processes. The resulting number of images for each concept can be found in the bar plot. The average number of images per concept is as follows:

  n     Average number of images for the top-n concepts
  100   6,987.6
  50    12,560.6
  10    39,997.5
The data downloaded is less than expected for some concepts. In the future, more effort should be put into the data labeling process, so that we can train more accurate classifiers.


Model Training

The annotation effort has given limited results. Therefore, for the first round of efforts around evaluating the feasibility of our own in-house image classifiers, we opt for a "light" model training through transfer learning. We fine-tune an existing model based on an Inception v3 architecture trained on ImageNet. Fine-tuning means that, instead of training a model from scratch, we replace the last layer of the network with a number of output neurons equal to the number of concepts we want to classify, and adjust the weights of the last 2 layers only. We keep cross-entropy as the loss function, but, given the heavy class imbalance, we compute average per-class accuracy (instead of overall accuracy) as the metric to track progress.
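
A hedged sketch of this fine-tuning setup, written with tf.keras; the framework, image size, optimizer, and the exact layers unfrozen are assumptions, since the report does not specify them:

  import tensorflow as tf

  NUM_CONCEPTS = 100  # e.g. the top-100 concepts by number of images

  # Inception v3 pretrained on ImageNet, without its original classification head.
  base = tf.keras.applications.InceptionV3(
      weights="imagenet", include_top=False, input_shape=(299, 299, 3)
  )
  base.trainable = False
  for layer in base.layers[-2:]:
      layer.trainable = True  # adjust only the last couple of pretrained layers

  # New output layer with one neuron per concept.
  x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
  outputs = tf.keras.layers.Dense(NUM_CONCEPTS, activation="softmax")(x)
  model = tf.keras.Model(base.input, outputs)

  model.compile(
      optimizer="adam",
      loss="categorical_crossentropy",  # cross-entropy, as in the text
      metrics=["accuracy"],
  )
  # Average per-class accuracy is then computed separately on the test set,
  # since it is more informative than overall accuracy under heavy class imbalance.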

This model works with JPEG images only; any image which is not a JPEG has to be converted, which is a limitation and might introduce noise.
We also did some initial work on the infrastructure to train a simple model from scratch (i.e. without a pretrained model). We built the architecture, the loss function, and the input/output workflows. Early results seem encouraging, though accuracy looked lower than with the pre-trained model, probably due to hyperparameter choices and/or the simplicity of the architecture. More work in this direction is needed.
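
As a hedged illustration of what such a from-scratch model could look like (the actual in-house architecture is not documented in this report, so the layer sizes below are assumptions for illustration only):

  import tensorflow as tf

  def simple_cnn(num_concepts, input_shape=(299, 299, 3)):
      # A small convolutional architecture trained from scratch, for illustration only.
      return tf.keras.Sequential([
          tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(64, 3, activation="relu"),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.GlobalAveragePooling2D(),
          tf.keras.layers.Dense(num_concepts, activation="softmax"),
      ])

  model = simple_cnn(num_concepts=100)
  model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])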

Setup

We trained several versions of the model, varying:

  • Number of categories: the total number of concepts that the classifier should recognize, taken from the concepts with the highest number of images.
  • Number of iterations: the total number of times the system iterates over the training data

And computed two performance metrics:

  • Average accuracy on the test set
  • Training time using the GPU on stat1005

Results

We report quantitative results in the table below. We see that accuracy and training time increase with the number of iterations, as expected.

  • Training time is unusually high for the setup with 50 categories - this might be due to other processes that were launched on the GPU in parallel to this training.
  • We see that the accuracy is overall good, and much higher than a random baseline. This suggests that the data collected, even if small, is of relatively high quality.
 
[Figure: Classification accuracy per concept detected by the pre-trained image classifiers]

Some qualitative examples of the labels output by the 100-class classifier and by the 50-class classifier are available. Scores below each image represent the likelihood that the image belongs to the corresponding category.

Which threshold should we use to consider an object as detected? More work is needed on the overall distribution of the output predictions to define a cut-off threshold.

Per-class accuracy can be found in the figure above. It is obtained from the model classifying 100 classes with 10,000 iterations (the least accurate). Accuracy is computed by retaining all scores from an image which are above a threshold of 0.2.
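
A hedged sketch of this computation; the exact definition used in the project is not spelled out, so the code below assumes a class counts as detected in an image when its score exceeds the threshold:

  from collections import defaultdict
  import numpy as np

  def per_class_accuracy(scores, true_labels, threshold=0.2):
      # scores: (n_images, n_classes) array of model outputs; true_labels: class index per image.
      hits, totals = defaultdict(int), defaultdict(int)
      for image_scores, label in zip(scores, true_labels):
          totals[label] += 1
          detected = np.where(np.asarray(image_scores) > threshold)[0]
          if label in detected:
              hits[label] += 1
      return {label: hits[label] / totals[label] for label in totals}

  # Example with toy scores for a 3-class model:
  print(per_class_accuracy(np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]), [1, 0]))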
