Grants:Project/AICAT
Project idea
editWhat is the problem you're trying to solve?
editWhat problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project.
Remember to review the tutorial for tips on how to answer this question.
In a nutshell, we want to investigate the application of and create a test implementation for the state of the art artificial intelligence based image classification algorithms to the task of categorizing graphical data on the Wikimedia Commons.
The precise problem we are trying to solve is lack of use of context-sensitive image categorization algorithms for classifying all visual data on the Wikimedia Commons. This lack of such algorithms results in greater reliance on contributors’ manual work when categorizing millions of images (and growing at a fast rate) uploaded[1], and hence lower efficiency in categorization, less overall categorized data, and losses to improvement on the other parts of Wikimedia Commons in terms of alternative costs. Inefficient categorization of graphical data is an important problem, as uncategorized data is much more difficult to find, less accessible for people with visual impairment and self-perpetuating as inability to find old data increases the likelihood of resubmission of the same data.
This is particularly an issue, as thousands of human hours are at stake in the newly-funded “Structured Data on Commons”, which intends to establish data structure standards and subsequently rely on contributors’ works to categorize data, including graphical data, in a more efficient way. The project was launched and funded on a premise that data on the Wikimedia Commons lacks sufficient categorization, as well as structure for such categorization. The “Structured Data on Commons” project has been working on implementing effective structures for all of the data on Commons. However, once the tools for structuring data are implemented, contributors' labor will be used for curating such content, and the long tail of the content curation will be a tremendous work that may take “up to 10 years"[2]. We think that it is rather important to employ new technologies to ensure efficient use of contributors’ time in solving this problem.
What is your solution to this problem?
editFor the problem you identified in the previous section, briefly describe your how you would like to address this problem.
We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing.
Remember to review the tutorial for tips on how to answer this question.
Since image labelling and classification is a problem that can be tackled using machine learning techniques, the aim of our project is to pave the way for implementing ML-based category suggestions for images in the Wikimedia Commons database.
Our brains are remarkably good at recognising the contents of an image but for computers this task is not an easy one. However, recent efforts using state-of-the-art Machine Learning techniques have achieved substantial progress in this area. The problem of multi-label classification - where a model tries to classify images into a set of classes (e.g. "cat" or "refrigerator") - has been attempted by many research groups, including models such as AlexNet[3], VGG[4], ResNet[5], QuocNet[6] and several iterations of the Inception (GoogLeNet) model [7][8][9]. The success of these models is typically evaluated based on how often the model fails to predict the correct answer as one of their top 5 guesses ("top-5 error rate"). Inception-v3 [10] achieves a remarkable 3.46%. This model was trained to classify ImageNet images to 1000 classes [11].
Further, Google has released the Inception-v3 model to the community [12], allowing anyone to retrain the model or to use transfer learning techniques to adapt the model's knowledge to a distinct task. With this and other top models for multi-label classification available (e.g. VGG16, VGG19, ResNet50, Inception-v3, and Xception are all implemented in Keras [13]), it stands to reason that the field has matured enough for using these models in production tasks.
However, these developments have not been applied to the task of image categorization on Wikimedia Commons, thus possibly making it reliant on human work to an unnecessarily high extent. We aim to correct that.
By doing this, we hope to feed into the Wikimedia Commons: Structured Data project, and enable contributors to choose images better. Our tool would work in two main ways:
- Suggesting categories during manual processing by contributors. Accepting image data as input, processing it, and returning a classification of categories. To the maximum extent possible, these categories would match particular existing Wikidata entities. This is what such classification could look like: https://karpathy.github.io/assets/ilsvrc2.png
- Auto-categorizing uncategorized content. Looping through all image data currently existing on Wikimedia Commons and performing the same procedure but on an automatic basis, auto-applying the new tags, which could then be later verified by users.
Project goals
editWhat are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually.
Remember to review the tutorial for tips on how to answer this question.
We envision the deliverables to broadly fall into three stages.
- [10% of the work] An assessment of the current best practices and model implementations specific to the problem of multi-label image classification. This will be coupled with an assessment of the Wikimedia Commons image database and "Structured Data on Commons" specifications in order to inform best strategies for incorporating the right machine learning techniques into the appropriate parts of the data pipeline.
- [80% of the work] Test implementations of machine learning algorithms to label images in the Wikimedia Commons database. Next to testing existing image classification frameworks on the Wikimedia data set, our main focus will be training Commons-specific categorization models: i.e. models that can assign existing Commons categories to images. To achieve this, we will perform data collection and cleaning. We also intend to evaluate the proposed models, both by validating on external well-labeled datasets, as well as collecting feedback from the community in a systematic way (e.g. by creating a tool allowing community to rate the categorization resulting from our models). If we should find the community feedback insufficient for effective evaluation, we may also scale up the process by using an on-demand workforce service such as the Mechanical Turk in order to classify images and provide a validation set. We will provide APIs that integrate with the Wikimedia data format specifications for automatic submission of new Wikimedia uploads to our models for classification.
- [10% of the work] We will produce a strategy recommendation for further integration of machine learning techniques for image classification into the Wikimedia Commons image labelling pipeline based on the initial test results.
Project impact
editHow will you know if you have met your goals?
editFor each of your goals, we’d like you to answer the following questions:
- During your project, what will you do to achieve this goal? (These are your outputs.)
- Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents).
Remember to review the tutorial for tips on how to answer this question.
We will know that we have met our goals if the core technologies for image classification mentioned above are investigated, and a working test implementation of the software is delivered. Once our project is complete, we hope it will be integrated into the image categorization workflows as envisioned by the Structured Data on Commons project and save thousands of hours of contributors' time.
Do you have any goals around participation or content?
editAre any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.
Our tool would help around seven thousand monthly active editors on Wikimedia Commons[14]. Moreover, these users would be able to more efficiently edit up to 45 million files [15], impacting millions of users who search for and view those files.
Project plan
editActivities
editTell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?
The project activities will be closely structured after the deliverables in this project:
- research into the current best practices and model implementations pertinent to multi-label image classification;
- assessment of the Wikimedia Commons image database and image categorization specifications;
- testing multiple implementations of different machine learning algorithms to label images in the Wikimedia Commons Database;
- packaging the best implementations into software usable on the Wikimedia Commons;
- producing written documentation and recommendations for further integration of machine learning techniques.
Budget
editHow you will use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!
- Software development - two developers working part-time, 20 hours per week each, plus one senior machine learning specialist working in an advisory and supervisory capacity for 5 hours per week (45 hrs weekly in total), €30 per hour (lower bound of the hourly salary of a software developer working on machine learning) - 43,200 €, 8 months total
- Dissemination through participation of at least one of us in two international conferences related to machine learning (we intend to register for conferences such as the Neural Information Processing Systems Foundation 2018 - NIPS or the International Conference on Machine Learning - ICML but the exact conferences would ultimately depend on availability) in order to learn about alternative approaches used, present the project we are working on and network with relevant stakeholders: ~3,600 €
- GPU power for processing (backup option, in case we do not succeed in obtaining free alternatives). We intend to apply for credits for academic research available at AWS or Nvidia, or otherwise attempt to obtain free access to the GPUs needed for our processing, but, as an alternative, we would resort to in-house computing or purchase remote computational resources. Our estimated processing costs: ~2,000 €
- Third party model validation (backup option, in case we do not get desired community help). We intend to evaluate our models by having the Wikimedia community use a separate tool to assess and validate our categorization, but, in case we are not able to get sufficient community aid, we would resort to third party alternatives like the Mechanical Turk. Our estimated cost for this is about 50 hours of full time image categorization, amounting for up to: ~500 €
Community engagement
editHow will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.
We hope to integrate within the editing tools built as part of the Structured Data on Commons project. We have already made contact with the people responsible for the project and received preliminary validation for the need for this and willingness to cooperate by providing us specifications.
We also hope to cooperate with the Commons mobile app who have expressed interest in the project, and had identified the need some years ago.
Finally, we aspire to gain insights, participation and help from more people working in machine learning through participation in relevant conferences.
Get involved
editParticipants
editPlease use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
- Alexey Morgunov - is currently finishing his PhD in computational molecular biology at the University of Cambridge where his work involves applied machine learning techniques (CNNs and residual networks for protein structure prediction from sequence ensembles). Outside his academic work, Alexey developed the back-end of Cooljugator in Python, and is teaching courses in bioinformatics and mathematical biology at the University of Cambridge.
- Linas Vastakas - an alumnus at the University of Cambridge (2015) with a deep interest in machine learning who currently does development and software engineering in Python, Javascript and Elixir for education-related projects (Interlinear Books, Cooljugator). More information available on his personal website.
- Dr Karolis Misiunas is a post-doctoral researcher at the University of Cambridge, where his research uses neural-networks for single-molecule detection and interpretation. Prior to that, he developed image processing tools for tracking and interpreting microscopy images. As result of his research, he has extensive experience with CNN and classic machine learning methods. His publicly accessible work is available on his GitHub page.
Community notification
editYou are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc.--> Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?
- The Wikimedia Research mailing list at wiki-research-l@lists.wikimedia.org.
- The Wikimedia Commons IRC channel #wikimedia-commons on Freenode.
- The Wikimedia Research IRC channel #wikimedia-research on Freenode.
- The Wikimedia United Kingdom IRC channel #wikimedia-uk on Freenode.
- The Idea Lab at the Village Pump.
- The Commons: Structured Data project team.
Endorsements
editDo you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
- We (the Commons mobile app) would be interested in adding AI-based suggestions in addition to the category suggestions we give already. Actually we identified this need two years ago. Syced (talk) 14:15, 6 February 2018 (UTC)
- This API would be incredibly useful! Misaochan (talk) 15:42, 6 February 2018 (UTC)
- This a great project, something Wikimedia dearly needs and I know the guys personally - I think they’d be a great fit for this project. Epantaleo (talk) 14:46, 10 March 2018 (UTC)
References
edit- ↑ https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm
- ↑ Structured Data on Commons: A Proposal from the Wikimedia Foundation - October 12, 2016, p. 4
- ↑ https://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
- ↑ https://arxiv.org/abs/1409.1556
- ↑ https://arxiv.org/pdf/1512.03385.pdf
- ↑ https://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf
- ↑ https://arxiv.org/abs/1409.4842
- ↑ https://arxiv.org/abs/1502.03167
- ↑ https://arxiv.org/abs/1512.00567
- ↑ https://arxiv.org/abs/1512.00567
- ↑ http://image-net.org/challenges/LSVRC/2014/browse-synsets
- ↑ https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
- ↑ https://github.com/fchollet/keras/tree/master/keras/applications
- ↑ https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaCOMMONS.htm
- ↑ https://commons.wikimedia.org/wiki/Special:Statistics