User:Zache/Wikimedia Hackathon 2024
Improving Wikimedia Commons image hashing
The project idea is to calculate perceptual hashes for Wikimedia Commons images so that it is possible to reliably detect whether a photo is already in Wikimedia Commons and to match photos to photos in other image repositories (Finna, Europeana, Flickr, ...). This will make it possible to update image metadata and image files, and it will help prevent uploads of duplicate images.
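To illustrate what a perceptual hash does, here is a minimal sketch of a difference hash (dHash) with a Hamming-distance comparison. Which hash algorithm the project actually uses is not stated here, so dHash is an assumption chosen for brevity, and the tiny pixel grids stand in for images that have already been downscaled and converted to grayscale.

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as rows of brightness values.

    A bit is 1 when a pixel is brighter than its right-hand neighbour,
    so each row of N pixels contributes N - 1 bits.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Hypothetical 3x4 "image" (already downscaled and grayscale).
original = [
    [10, 20, 15, 40],
    [50, 45, 60, 30],
    [5, 5, 90, 80],
]
# The same picture, uniformly brightened, as a re-encoded copy might be.
brightened = [[value + 3 for value in row] for row in original]
# A different picture (each row mirrored).
other = [
    [40, 15, 20, 10],
    [30, 60, 45, 50],
    [80, 90, 5, 5],
]

h_original = dhash(original)
h_brightened = dhash(brightened)
h_other = dhash(other)

print(hamming_distance(h_original, h_brightened))  # 0: same relative brightness pattern
print(hamming_distance(h_original, h_other))       # 8 of 9 bits differ
```

Because the hash records only which neighbour is brighter, a uniform brightness change leaves it unchanged, while a different picture yields a large distance. In practice, two images are treated as likely duplicates when the distance falls below some chosen threshold.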
Speed improvement
Before the hackathon, the indexing speed was 15,000 images per hour, i.e., about 10 million per month. At that speed, indexing all 100 million Wikimedia Commons photos would take the better part of a year. So, in this hackathon, I moved the indexing code from Toolforge to a virtual server in wmlabs, which tripled the indexing speed to 30M+ photos per month. Indexing is expected to be finished in the summer.
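The figures above can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes a 30-day month and uninterrupted indexing, and the tripled rate of 45,000 images per hour is derived from the numbers above rather than measured.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, indexing never pauses


def months_to_index(total_images, images_per_hour):
    """Months needed to index total_images at a constant hourly rate."""
    images_per_month = images_per_hour * HOURS_PER_MONTH
    return total_images / images_per_month


TOTAL_IMAGES = 100_000_000

before = months_to_index(TOTAL_IMAGES, 15_000)      # rate before the hackathon
after = months_to_index(TOTAL_IMAGES, 3 * 15_000)   # tripled rate

print(f"before: {before:.1f} months")  # before: 9.3 months
print(f"after:  {after:.1f} months")   # after:  3.1 months
```

Real-world downtime and rate limits would stretch both estimates, which is consistent with the rough "year" and "summer" figures given above.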
Ontop SPARQL
We also installed the Ontop server for querying hashes and duplicate images using SPARQL. This work is still ongoing, but we can currently query hashes stored in a PostgreSQL database using SPARQL.
- List hashes using a federated query
- Find a duplicate image pair using hashes
- Merge image hashes into a Commons Query Service query
STATUS: There are still missing pieces in our Ontop SPARQL -> SQL translation configuration, and the setup is still far from practical.
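Independently of the Ontop setup, the duplicate-pair lookup listed above can be sketched in plain Python. This is a minimal illustration, not the project's query: the file names and hash values are hypothetical, the rows stand in for whatever the PostgreSQL table returns, and only exact hash matches are reported (near-duplicates would need a Hamming-distance threshold instead).

```python
from collections import defaultdict
from itertools import combinations


def duplicate_pairs(rows):
    """Return sorted pairs of file names that share an identical hash.

    rows is an iterable of (file_name, image_hash) tuples.
    """
    by_hash = defaultdict(list)
    for file_name, image_hash in rows:
        by_hash[image_hash].append(file_name)

    pairs = []
    for names in by_hash.values():
        # Every two files in the same bucket form one candidate pair.
        pairs.extend(combinations(sorted(names), 2))
    return sorted(pairs)


# Hypothetical rows, as they might come back from the hash table.
rows = [
    ("File:A.jpg", 169),
    ("File:B.jpg", 338),
    ("File:A_copy.jpg", 169),
]

pairs = duplicate_pairs(rows)
print(pairs)  # [('File:A.jpg', 'File:A_copy.jpg')]
```

Grouping by hash first keeps the work roughly linear in the number of rows, rather than comparing every image against every other.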