Grants:Project/MSIG/EpicPupper/Fortuna

status: not funded

Fortuna

Solicit community input and discussion in order to develop an extension and API for identifying file copyright violations.

target: All Wikimedia wikis

start date: June 1, 2022

end date: September 1, 2022

budget (local currency): CAD ~10,455

budget (USD): 8,100

grant type: individual

grantee: EpicPupper

contact(s): EpicPupper

Project Goal

What will be the outputs of your project and how will those outputs contribute to advancing a specific Movement Strategy Initiative?

What specific Movement Strategy Initiative does your project focus on and why? Please select one of the initiatives described here

Fortuna is aligned with six Movement Strategy Initiatives. It focuses on implementing the initiative "Global Approach for Local Skill Development", a prioritized initiative. Although the Wikimedia movement has grown in size in recent years, there has not been a parallel growth in contributors helping with copyright violation patrolling and files in general. A variety of barriers keep editors from assisting, and this project intends to gather data on what these blockers are. Once that research is complete, Fortuna will help Wikimedians develop their skills in copyright patrol through interactive, engaging learning activities. Clear documentation will be written, allowing newcomers to easily use the tool.

Secondly, "Identify Wikimedia's Impact", another prioritized initiative, is of note to this project. One recommendation in this initiative is to understand how our projects can be misused or abused by detecting threats with significant potential for harm. Copyright violations are one of the most impactful ways that the projects can be abused. The tool will assist the patrolling community to identify, evaluate and fix violations, to ensure that all content is freely distributable, and to avoid legal ramifications associated with using unlicensed content, similar to the existing tools CopyPatrol and Earwig's Copyvio Detector.

Thirdly, the MSI "Resources for Newcomers" may be considered while assessing this proposal. The Initiative calls for easy-to-find and easy-to-understand resources for newcomers, including onboarding media and guiding interfaces helping them independently learn and navigate. As described above, Fortuna will be exceptionally friendly to newcomers, with an onboarding system and engaging, interactive learning activities for new copyright patrollers to learn the ropes.

Fourthly, the merits of Fortuna are appropriate to analyze through the lens of the initiative "Bridging Content Gaps". This initiative tracks progress on finding gaps in content through means such as artificial intelligence, relevant to this proposal. By searching for images that do not truly meet the requirements of freely-licensed projects, contributors can pinpoint which subjects need files and work towards uploading them.

Fifthly, another aligned initiative is "Systematic approach to improve satisfaction and productivity". The initiative recommends "assessing the needs of groups and volunteers, taking into account their local contexts for effective support and recognition of efforts". I believe this project helps to advance this initiative, as I've already reached out to some contributors working in copyright and will continue to do so, developing a tool that is focused on the needs of the community. Another suggestion is "continuously engaging and supporting publicly diverse types of online and offline contributors". I think this is relevant to my project, as contributors working in copyright are often underrepresented in terms of technology to support them, and there are definitely fewer tools for copyright than for other areas.

Finally, the initiative "Increased Wikimedia Awareness" is also relevant to this proposal. The initiative calls on the movement to secure the attention, trust, and interest of knowledge consumers, and is a prioritized initiative as trust is crucial to maintaining healthy, widely-used projects. I think that my tool would help remove many copyright violations and increase trust in the licensing of the Movement's content, ensuring compliance with the 5th founding principle, Free licensing of content.

Project Background

When do you intend to begin this project and when will it be completed?

I intend to begin this project on June 1, 2022 and expect it to be completed on September 1, 2022.

Where will your project activities be happening?

I will be completing research, development, technical writing and translation at my own home.

Are you collaborating with other communities or affiliates on this project? Please provide details of how partners intend to work together to achieve the project goal.

I intend to solicit feedback at every phase of development from associated communities and WikiProjects, such as the English Wikipedia's WikiProject Copyright Cleanup, Wikimedia Commons' Village Pump, the Wikimedia Community Discord server, and on IRC. I will also be requesting community internationalization (through TranslateWiki) and localization of the tool in the last month of development (see the timeline under Project Activities). During development, I will continually seek assistance from community members as needed. When development is finished, I'll publish a segment detailing the features of the tool in The Signpost's Technology Report column.

What specific challenge will your project be aiming to solve? And what opportunities do you plan to take advantage of to solve the problem?

Wikimedia Commons and many other wikis currently have many copyright violations that go undetected. Copyright patrollers are often underrepresented in terms of the number of tools supporting them, and existing solutions such as OgreBot's new user upload log have been deprecated or removed. I hope that my tool can relieve the immense burden on community volunteers who ensure compliance with copyright, and encourage more contributors to assist in the field through an easier and more efficient patrolling process.

Does this project aim to apply one of the examples shared in the call for grants and if so which one?

This project is aligned with all three of the examples shared; "Skill Development Needs Assessment" for the research portion, "Skill Development Translation" for translation of the documentation, and "Skill Development Activity" for creating an engaging and interactive resource to get started with copyright patrolling.

Project Activities

What specific activities will be carried out during this project? Please describe the specific activities that will be carried out during this project.

During the first month, I will be researching and consulting with local communities on their needs with copyright patrolling. In the second, I intend to rapidly develop the core feature of the project, focusing on making the identification of copyright violations an easier and more efficient process. The third month of development will include expanding the tool further based on suggestions from tool users, consulting different communities on ways to support their wikis, providing localization and internationalization options inside the tool, and identifying parts that can be made faster, more efficient, and built upon. I will be translating the tool into Simplified Chinese and French during this time, and providing localization for the English Wikipedia, Wikimedia Commons and other wikis on request.

Architecture

Based on initial community consultations, a draft architecture has been prepared. It is subject to change based on further discussion and research. Fortuna is implemented as a MediaWiki extension. Its design is inspired by the MachineVision extension, written by Wikimedia Foundation developers and deployed on Wikimedia production.

Overview

When a new image is uploaded to a wiki, the Fortuna extension triggers a delayed job that checks that the image is still present (i.e., not deleted) and, if so, requests and stores web matches for the image from one or more API providers. These matches are then filtered and served to reviewers on the Special:SuspiciousFiles page. Files with accepted web matches are tagged for speedy deletion, nominated for a deletion process, or deleted directly (by administrators).

In addition to new uploads, lists of image file page titles may be passed to the maintenance script fetchMatches.php to have web matches retrieved and stored on demand.

When web matches are received for an image, an Echo event is fired to notify the uploader and patroller (if applicable) that web matches are available for review, according to the uploader's and patroller's notification preferences.

The extension is designed to support arbitrary API providers (including issuing requests to multiple providers simultaneously). Initial providers planned for implementation are the Google Cloud Vision API, the Bing Visual API, the TinEye API and Pixsy.

Concepts

Image

Images are stored by their SHA1 hash in the fortuna_image table. This means that if an image file is uploaded that is identical to one for which a record exists in the database, it is the same image for the Fortuna extension's purposes, and matches will not be requested again.
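
The deduplication rule above can be sketched as follows. This is an illustrative Python sketch rather than the extension's PHP code; `ImageRegistry` and `needs_matches` are hypothetical names, and an in-memory set stands in for the fortuna_image table.

```python
import hashlib


def image_key(file_bytes):
    """Return the hex SHA-1 digest that identifies an image's content."""
    return hashlib.sha1(file_bytes).hexdigest()


class ImageRegistry:
    """In-memory stand-in for the fortuna_image table (hypothetical)."""

    def __init__(self):
        self._seen = set()

    def needs_matches(self, file_bytes):
        """True the first time a given content hash is seen, False afterwards."""
        key = image_key(file_bytes)
        if key in self._seen:
            return False  # identical content already recorded: do not re-request
        self._seen.add(key)
        return True


registry = ImageRegistry()
first = registry.needs_matches(b"example image bytes")
again = registry.needs_matches(b"example image bytes")   # byte-identical re-upload
other = registry.needs_matches(b"different image bytes")
```

Because the key depends only on file content, a byte-identical re-upload under a different title would be treated as the same image, which matches the behaviour described above.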

The extension only handles bitmap and vector images and disregards all other file types. Bitmap thumbnails are requested for vector files when encountered.

Match

A match is stored in the fortuna_match table as a URL string and an integer match percentage. Human-readable domains associated with matches are fetched at the point of presentation to the end-user. A match will be associated with an image no more than once, even if the match is subsequently suggested by a different API provider.
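
A minimal sketch of the "no more than once" rule, in Python for illustration only. `MatchStore` and its methods are hypothetical, and a dictionary stands in for the fortuna_match table; keeping the first stored percentage when a second provider reports the same URL is an assumption, not something the draft specifies.

```python
class MatchStore:
    """In-memory stand-in for the fortuna_match table (hypothetical)."""

    def __init__(self):
        # (image SHA-1, match URL) -> integer match percentage
        self._matches = {}

    def add(self, image_sha1, url, percent):
        """Store a match once per (image, URL); return False for repeats."""
        key = (image_sha1, url)
        if key in self._matches:
            return False  # already associated, e.g. reported by another provider
        self._matches[key] = int(percent)
        return True

    def for_image(self, image_sha1):
        """Return the stored (URL, percentage) pairs for one image."""
        return [(url, pct) for (sha, url), pct in self._matches.items()
                if sha == image_sha1]


store = MatchStore()
added_first = store.add("abc123", "https://example.org/photo.jpg", 92)
added_repeat = store.add("abc123", "https://example.org/photo.jpg", 88)
stored = store.for_image("abc123")
```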

Waiting period

A waiting period is enforced between upload time and the submission of an image to an API provider for web matches. This is to reduce the likelihood of making a search request for an image that is soon to be deleted. The waiting period is planned to be 48 hours by default. This value is configured in $wgFortunaNewUploadLabelingJobDelay.
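
The timing rule can be illustrated with a short Python sketch. The function names are hypothetical, and the constant simply mirrors the planned 48-hour default of $wgFortunaNewUploadLabelingJobDelay.

```python
from datetime import datetime, timedelta, timezone

# Planned default from the text: 48 hours.
NEW_UPLOAD_JOB_DELAY = timedelta(hours=48)


def job_release_time(uploaded_at, delay=NEW_UPLOAD_JOB_DELAY):
    """Earliest time at which the match-fetching job may run."""
    return uploaded_at + delay


def should_request_matches(now, uploaded_at, file_exists,
                           delay=NEW_UPLOAD_JOB_DELAY):
    """Request matches only after the waiting period, and only if the
    file has not been deleted in the meantime."""
    return file_exists and now >= job_release_time(uploaded_at, delay)


uploaded = datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc)
release = job_release_time(uploaded)
```

A file deleted during the waiting period never reaches an API provider, which is the cost-saving purpose of the delay.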

Review state

Review state is a critical concept in the Fortuna extension, because it governs which images are presented on Special:SuspiciousFiles, and to which audiences. The review states are represented as integers, with a default state of 0 (unreviewed). Possible states include the following:

Unreviewed (0): The default match review state. The match may be presented in either the "popular" or "user uploads" tab on Special:SuspiciousFiles.
Accepted (1): The match was accepted by a contributor. A deletion request, tag or direct deletion should have been performed, and afterwards it should no longer appear on Special:SuspiciousFiles.
Rejected (-1): The match was rejected by a contributor. It should no longer appear on Special:SuspiciousFiles.
Withhold from "popular" (-2): The initial review state for a match which is unreviewed but should be withheld from the "popular" tab and only shown on manual searches in the "user uploads" tab. A match may receive this review state based on the SafeSearch ratings of the image to which it pertains.
Withhold from all (-3): The review state for a match pertaining to an image which should be withheld completely from Special:SuspiciousFiles. Files can be put into this review state through a maintenance script, withholdFiles.php.
Not displayed (-4): A special review state assigned to matches when an attempt to display them fails because a match could not be found. This results in the match no longer being shown on Special:SuspiciousFiles.
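
The states listed above map naturally onto an integer enumeration. The Python sketch below is illustrative only: the integer values come from the list, WITHHOLD_ALL is the name used later in this document, and the other member names and the `visible_tabs` helper are assumptions.

```python
from enum import IntEnum


class ReviewState(IntEnum):
    UNREVIEWED = 0
    ACCEPTED = 1
    REJECTED = -1
    WITHHOLD_POPULAR = -2
    WITHHOLD_ALL = -3
    NOT_DISPLAYED = -4


def visible_tabs(state):
    """Return the Special:SuspiciousFiles tabs in which a match may appear."""
    if state == ReviewState.UNREVIEWED:
        return {"popular", "user uploads"}
    if state == ReviewState.WITHHOLD_POPULAR:
        return {"user uploads"}
    # Accepted, rejected, fully withheld, and undisplayable matches are hidden.
    return set()


default_state = ReviewState(0)
tabs_for_withheld = visible_tabs(ReviewState.WITHHOLD_POPULAR)
```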

Feed filtering

Not to be confused with system-wide filters (see below)

Filters can be applied in SuspiciousFiles. Planned filters include uploader edit count, uploader rights, whether an image is patrolled, and uploader username. More filters will be added with community consultation.

Action API

Fortuna exposes several MediaWiki Action API endpoints for external tool use, including query+reverseimagesearch, query+unreviewedimagematches, query+unreviewedmatchcount, and reviewimagematches.

Rate limiting

Rate limiting and maximum checks per day will be applied to manual checks through the "manual input" tab.
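
One possible shape for the per-day cap is sketched below in Python. The draft does not specify limits or a mechanism, so the class name, the per-user granularity, and the limit of 2 used in the example are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


class DailyCheckLimiter:
    """Caps manual checks per user per calendar day (hypothetical design)."""

    def __init__(self, max_checks_per_day):
        self.max_checks_per_day = max_checks_per_day
        self._counts = defaultdict(int)  # (user, day) -> checks used so far

    def try_check(self, user, today):
        """Record one check and return True, or return False if over the cap."""
        key = (user, today)
        if self._counts[key] >= self.max_checks_per_day:
            return False
        self._counts[key] += 1
        return True


limiter = DailyCheckLimiter(max_checks_per_day=2)
day_one = date(2022, 6, 1)
results = [limiter.try_check("ExampleUser", day_one) for _ in range(3)]
next_day = limiter.try_check("ExampleUser", date(2022, 6, 2))
```

Keying the counter on the calendar day means the allowance resets automatically without a cleanup step, at the cost of keeping old counters in memory in this simplified form.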

Match lifecycle

New uploads

In a handler for the UploadComplete hook, the Fortuna extension checks whether the uploaded file is a bitmap or vector image. If so, and if the extension is configured to request matches for new uploads, the extension creates a new FetchGoogleCloudVisionMatchesJob and enqueues it on the job queue. If a waiting period is configured, the job is created with a jobReleaseTimestamp value of the current time plus the configured waiting period. When the job is executed, if the file still exists (i.e., has not been deleted), a request for matches is created and sent to Google Cloud Vision via GoogleCloudVisionClient. SafeSearch annotations are requested from an internal WMF service utilizing Google Cloud Vision.

When a response is received, system-wide filters are applied (see "Image and system-wide filtering" below). If any matches remain after filtering, they are stored in the database, and an Echo event is fired to trigger a notification to the uploader and patroller that matches are available for review. Matches are eventually served on Special:SuspiciousFiles and updated with their votes by reviewers.

If no matches are found from Cloud Vision, the extension queues additional providers for requesting. The order of requests is:

Google Cloud Vision (through GoogleCloudVisionClient)
Bing Visual Search
Pixsy
TinEye API (through TinEyeAPI.php)

This order is from least expensive to most expensive, in order to minimize costs.
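
The fallback order can be sketched as a simple loop over providers. This Python sketch is illustrative: the provider callables are stubs rather than real API clients, and stopping at the first provider that returns any match is an assumption drawn from the sentence "If no matches are found".

```python
def fetch_with_fallback(image_sha1, providers):
    """Try providers in order; return (provider name, matches) for the first
    provider that returns any match, or (None, []) if none do."""
    for name, fetch in providers:
        matches = fetch(image_sha1)
        if matches:
            return name, matches
    return None, []


calls = []  # records which stub providers were actually queried


def make_stub(name, result):
    """Build a stand-in provider that logs the call and returns a fixed result."""
    def fetch(image_sha1):
        calls.append(name)
        return result
    return fetch


providers = [
    ("Google Cloud Vision", make_stub("Google Cloud Vision", [])),
    ("Bing Visual Search", make_stub("Bing Visual Search", [])),
    ("Pixsy", make_stub("Pixsy", [("https://example.org/photo.jpg", 92)])),
    ("TinEye", make_stub("TinEye", [("https://example.net/copy.jpg", 80)])),
]

provider_used, found = fetch_with_fallback("abc123", providers)
```

In this run the last provider in the list is never queried, which is how ordering the list by cost would keep spending down.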

Custom image lists

The match lifecycle for matches fetched through fetchMatches.php for custom image lists is similar to that for new uploads. The main difference is that instead of scheduling match fetching jobs, fetchMatches.php directly invokes GoogleCloudVisionClient::fetchMatches for each image on the list.

Image and system-wide filtering

Image matches have multiple filters applied in GoogleCloudVisionClient before storage, and each operates differently from the others.

The first filtering pass, based on $wgFortunaWithholdImageList, is intended to withhold images completely from being shown on Special:SuspiciousFiles. If a match in $wgFortunaWithholdImageList is among the matches returned for an image, the initial review state for all matches is set to WITHHOLD_ALL, which has the effect of excluding it completely. The image is not shown in either the "popular" or "user uploads" tab on Special:SuspiciousFiles. The matches are, however, retained in the database.

The second filtering pass, based on $wgFortunaGoogleSafeSearchLimits, conditionally withholds images from the "popular" tab. If an image receives a SafeSearch rating that exceeds the allowed value on any of the configured dimensions, it is withheld from the "popular" tab but still available in the "user uploads" tab on Special:SuspiciousFiles. All matches are retained in the database.

The third and final pass, based on $wgFortunaMatchUrlBlocklist, is intended to discard specific matches judged not to be useful to the projects. Matches corresponding to entries in $wgFortunaMatchUrlBlocklist are simply discarded before the remaining web matches are stored. An example use-case of this is the Wikimedia projects themselves and their mirrors.
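
The three passes can be sketched together as one function. This Python sketch is illustrative and makes several assumptions not stated in the draft: the withhold list is compared against exact match URLs, the blocklist is compared against hostnames, SafeSearch ratings are integers where higher means more likely, and the parameter names are hypothetical stand-ins for the three configuration variables.

```python
from urllib.parse import urlparse

UNREVIEWED, WITHHOLD_POPULAR, WITHHOLD_ALL = 0, -2, -3


def filter_matches(match_urls, safe_search, withhold_list,
                   safe_search_limits, url_blocklist):
    """Return (initial review state, match URLs to store)."""
    state = UNREVIEWED

    # Pass 1: any match on the withhold list hides the image from every tab.
    if any(url in withhold_list for url in match_urls):
        state = WITHHOLD_ALL
    # Pass 2: a SafeSearch rating above its limit hides it from "popular" only.
    elif any(safe_search.get(dimension, 0) > limit
             for dimension, limit in safe_search_limits.items()):
        state = WITHHOLD_POPULAR

    # Pass 3: drop matches whose host is blocklisted (e.g. Wikimedia mirrors).
    kept = [url for url in match_urls
            if urlparse(url).hostname not in url_blocklist]
    return state, kept


urls = ["https://example.org/a.jpg", "https://commons.wikimedia.org/wiki/File:A.jpg"]
blocklist = {"commons.wikimedia.org"}

state, kept = filter_matches(urls, {"adult": 1}, set(), {"adult": 3}, blocklist)
```

Note that the first two passes only change the initial review state, while the third is the only one that removes matches before storage.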

Redirects and deletions

Fortuna follows file redirects and skips deleted files.

How do you intend to keep communities updated on the progress and outcomes of the project? Please add the names or usernames of these individuals responsible for updating the community

I, EpicPupper, will be responsible for updating the community on project progress. I will be posting on project pages, including "village pumps" and the selected WikiProjects mentioned above, and maintaining a list of interested users who will receive mass-message updates on-wiki.

Who will be responsible for delivering on this project and what are their roles and responsibilities?

I will be responsible for delivering on this project, including research, development, translation, technical writing and support. Volunteers will be solicited (see below) for assisting in localization and internationalization efforts.

Additional information

If your activities include community discussions, what is your plan for ensuring that the conversations are productive? Provide a link to a Friendly Space Policy or UCoC that will be implemented to support these discussions.

The Universal Code of Conduct will be enforced in community discussions, as well as the Wikimedia Foundation's friendly space policy. Discussions occurring within the Grants namespace on Meta will have the friendly space expectations for grants applied. Consultations in technical areas like Phabricator or Wikitech will also be governed by MediaWiki's Code of Conduct for technical spaces. Activities on IRC will have the IRC guidelines and Libera network policies applied, and on Discord the Wikimedia Community Discord server guidelines and the Discord Community Guidelines will be enforced.

If your activities include the use of paid online tools, please describe what tools these are and how you intend to use them.

The project will utilize the Google Cloud Vision API, the TinEye APIs, Pixsy and the Bing Visual Search API for its functionality. Please see the budget breakdown below for a detailed explanation of how funds will be used.

Do your activities include the translation of materials, and if so, in what languages will the translation be done? Please include details of those responsible for making the translations.

This project includes the translation of Fortuna's user interface and documentation into Simplified Chinese and French. I, EpicPupper, will be responsible for making these translations.

Are there any other details you would like to share? Consider providing rationale, research or community discussion outputs, and any other similar information, that will give more context on your proposed project.

A tool for identifying file copyright violations has been requested since 2015, as a Community Wishlist Survey item in 2015 and 2022, and as a Phabricator task.

I'd also like to use this space to invite any readers to sign up as a volunteer in the "Volunteer" section below! This project would greatly benefit from translators and contributors from diverse wikis helping to localize and internationalize the tool and documentation.

Outcomes

After your activities are complete, we would like to understand the draft implementation plan for your community. You will be required to prepare a document detailing this plan around a movement strategy initiative. This report can be prepared through Meta-wiki using the Share your results button on this page. The report can be prepared in your language, and is not required to be written in English.

In this report, you will be asked to:

Provide a link to the draft implementation plan document or Wikimedia page
Describe what activities supported the development of the plan
Describe how and where you have communicated your plan to relevant communities.
Report on how your funding was spent

Your draft implementation plan document should address the following questions clearly:

What movement strategy initiative or goal are you addressing?
What activities will you be doing to address that initiative?
What do you expect will happen as a result of your activities? How do those outcomes address the movement strategy initiative?
How will you measure or evaluate your activities? What tools or methods will you use to evaluate your activities?

To create a draft implementation plan, we recommend the use of a logic model, which will help you and your team think about goals, activities, outcomes, and other factors in an organized way. Please refer to the following resources to develop a logic model:

Overview of logic models on Meta-wiki
Example logic models for reference for other movement activities (such as partnerships and edit-a-thons)
Blank logic model template on Google Drive

Please confirm below that you will be able to prepare a draft implementation plan document by the end of your grant:

I will be able to prepare a draft implementation plan document.

Optionally, you are welcome to include other information you'd like to share around participation and representation in your activities. Please include any additional outcomes you would like to report on below:

I am planning to report on basic usage statistics during the beta period of Fortuna.
Goals for this project include:
- Creating a best-practices based tool that helps contributors work more efficiently with files and copyright violations
- Creating a modular, optimized tool
- Inviting more editors to contribute to copyright and file cleanup through interactive activities and easy-to-use tools
- Helping in decreasing the amount of file copyright violations
The expected outcome is a MediaWiki extension.

Budget

How will you use the funds you are requesting? List bullet points for each expense. Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too.

Research, documentation, translation and development
- (time needed to review, perform analysis, or investigate any information needed to support implementation ideas or planning) and
- (document preparation time, time spent documenting of discussion, post-meeting work) and
- (translation costs for briefs and global materials) and
- (development time):
  - USD 3000 for 3 months of development
    - According to salary.com, the minimum average wage for software developers is USD 30 per hour
    - Multiplied by 50%, this gives USD 15 per hour as a stipend
    - I expect to work on this project 25-35 hours a week; I will calculate the stipend based on the lower end of 25 hours
    - USD 15 * 25 hours per week * 4 weeks * 3 months = USD 4,500; of this, USD 3,000 is requested
    - This is comparable to previous tool-related grant requests: 1, 2, 3 (these were two-month requests for USD 2000)
Online tools or services (subscription services for online meeting platforms, social media promotion): USD 5,067.5; see breakdown below.

Google Cloud Vision

Estimate from: 75 monthly users * 50 files per day per feature * 30 days = 112,500 files (about 112,000). Pricing taken from the pricing calculator.

Web Detection: USD 388.5

Monthly total: USD 388.5

6-month total: USD 2331

Bing Visual API

Estimate from: 75 monthly users * 5 files per day per feature^[1] * 30 days = 11,250 files. Pricing taken from this page.

Visual Search: USD 33.75

Monthly total: USD 33.75

6-month total: USD 202.5

Pixsy

Estimate from: 75 users * 100 files uploaded per user = 7,500 files. Pricing taken from the Pixsy website.

Monitor (Pro plan): USD 89 per month for 100,000 images monitored

Monthly total: USD 89

6-month total: USD 534

TinEye APIs

Estimate from: 75 monthly users * 5 files per day per feature * 30 days = 11,250 files (67,500 for 6 months). Pricing taken from the TinEye website.

Web detection (TinEye API): USD 2,000 for 100,000 file searches at USD 0.02 per search

One-time bundle purchase: USD 2,000 (covers the 67,500 searches estimated for 6 months)

6-month total: USD 2,000

TOTAL AMOUNT REQUESTED USD: USD 8100; rounded up from USD 8067.5
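
The totals above can be re-derived with a short arithmetic check. This Python sketch uses only figures stated in the budget; the variable names are illustrative. Note that the stipend factors as listed (USD 15, 25 hours, 4 weeks, 3 months) multiply out to USD 4,500, while the stipend line item requested is USD 3,000; the sketch uses the requested figure for the grand total.

```python
MONTHS = 6

# Monthly costs stated in the breakdown (USD).
monthly_costs = {
    "Google Cloud Vision": 388.5,
    "Bing Visual Search": 33.75,
    "Pixsy": 89.0,
}
tineye_bundle = 2000.0  # one bundle of 100,000 searches at USD 0.02 each

# File-volume estimates from the stated factors.
google_files = 75 * 50 * 30   # users * files per day * days
bing_files = 75 * 5 * 30

six_month = {name: cost * MONTHS for name, cost in monthly_costs.items()}
tools_total = sum(six_month.values()) + tineye_bundle

stipend_requested = 3000.0
stipend_from_factors = 15 * 25 * 4 * 3  # product of the listed factors

grand_total = stipend_requested + tools_total
```

The tools subtotal and grand total agree with the stated USD 5,067.5 and USD 8,067.5, which the proposal rounds up to USD 8,100.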

Completing your application

Once you have completed the application, please do the following:

Change the application status from status=draft to status=proposed in the {{Probox}} template.
Contact strategy2030@wikimedia.org to confirm your submission, as well as to request any support around your application.

Endorsements

An endorsement from community members (especially from outside your community) will be part of the considerations when reviewing your application. Community members are encouraged to endorse your project request here!

This project has the opportunity to dramatically improve volunteers' ability to handle copyright issues, increasing the efficacy of volunteer time both in established and developing communities. Best, Vermont 🐿️ (talk) 12:34, 5 May 2022 (UTC)
Yes please. Image copyright is a byzantine nightmare and I welcome all things that will make it easier to understand and manage. Vami IV (talk) 01:46, 6 May 2022 (UTC)
It's always annoying to come across images (and other files) uploaded willy-nilly from the internet onto Commons and other projects. I hope this tool will help resolve these issues, cut down on backlogs, give us a little more time for other stuff, and maybe even help new contributors who are unfamiliar with copyright policies. I am in full support! MSG17 (talk) 01:54, 6 May 2022 (UTC)
Dealing with copyright is a grotesquely underappreciated part of editing on Wikimedia projects, and it's badly underfunded (with a lack of tools that causes vast amounts of volunteer time to be wasted). A dollar spent on fixing this situation is about a hundred dollars of time saved. I'm all for it. JPxG (talk) 07:32, 8 May 2022 (UTC)
Seems like a good idea and would likely make dealing with copyright violations easier :D Justiyaya (talk) 07:50, 8 May 2022 (UTC)
As the wikimedia project grows we need better tools to improve our identification and handling of copyright issues. We rely too much on the expertise of a very small minority of users. This project seems like a good step in the right direction. Ixtal (talk) 10:00, 8 May 2022 (UTC)
Support. Image copyright is hard. Dealing with copyright infringement of online text-based sources would be difficult, but possible, without something like Earwig's copyvio detector. With images, the problem is much, much harder. Software that would improve the ability of users to deal with image copyright issues would significantly help in reducing the number of copyright violations that the Wikimedia Foundation could potentially be held liable for. As such, it seems like a good use of WMF-fundraised money. Mhawk10 (talk) 07:32, 13 May 2022 (UTC)
I also support this proposal. Image copyright is just as much of an issue as text copyright, and this seems like it would save the time of editors who are trying to remedy image-copyright issues. epicgenius (talk) 18:02, 16 May 2022 (UTC)

Concerns

I am not sure where comments like these go (apologies!), so I'll put them here until and unless told to move them. I have some pretty serious doubts about the feasibility and practicality of this project. I feel it worth stating upfront that I have no doubt about the requestor's good intentions and enthusiasm.

The proposal as-is contains virtually no detail on how the project will be developed - e.g. specific requirements specifications or architecture documents, languages and frameworks to be used. I can infer that it will involve the use of reverse-image-search APIs to attempt to detect copyright violations, but beyond that there is virtually no detail.
Furthermore, the proposal goes on to say "I will be developing other features of the project as requested by community members, such as detecting duplicate images, searching images by color, and visually exploring the Wikimedia project" - this seems like scope creep before the project has even begun.

Reverse-image-search tools are... often not that great, and using them "as-is" without any kind of post-processing is unlikely to generate a particularly useful result.

The inferred complexity of the project is large, and would be a challenge for even an experienced software engineer working alone. I have doubts about the requestor's ability to follow through given the proposal does not mention any relevant experience and I've not seen them demonstrate such by building other tools or scripts.

Personally I would recommend that bringing back the mentioned OgreBot reports would be a better starting point here - orders of magnitude less complex and expensive, and a decent first-order approximation to a solution. Complexity can then grow from there as needed. Apologies if this seems negative, but I'd hate to see time and effort wasted on a project that may not actually go anywhere. Happy to be corrected on any of the points above. firefly ( t · c ) 14:59, 8 May 2022 (UTC)

Hello Firefly, and thank you for the comments. I have explained my reasoning for this project below. If you have any questions, please do not hesitate to reply.

A member of the WMF MSIG team informed me that generally MSIG proposals focus on the research and consultation phase of projects, so I wrote my application through that lens. I fully understand the desire for a transparent architecture plan, however, so I have added it above. Fortuna will be written as a PHP MediaWiki extension.
I have decreased the scope of the project to only copyright violations. Thank you for catching this.
I would appreciate elaboration on this point; in my experience reverse image search tools are excellent for file copyright patrolling.
I have decreased the scope of this project, and strongly believe that it can be completed within this timeframe. In the event that it cannot, I am willing to return all funds provided. I have experience in full-stack development and a working proficiency in PHP.

The newbie-uploads tool can replace many of the features of OgreBot’s new upload log. Thank you for your advocacy. I will update some more aspects of the application shortly; it is difficult to do it in one edit on mobile. Cheers, EpicPupper (talk) 12:24, 9 May 2022 (UTC)

To add, much of the logic and code can be reused from the MachineVision extension, another reason why I believe this timeline is appropriate. EpicPupper (talk) 13:15, 9 May 2022 (UTC)

Postscript 2: Although many file copyright violations are uploads by new users, certainly not all are, demonstrated through the vast number of CCIs open regarding users with 10K+ edits. Thanks, EpicPupper (talk) 08:31, 11 May 2022 (UTC)

Question: you've budgeted for 6 months of API access in your grant proposal. How do you plan to continue paying for this once the grant money runs out? Spicy (talk) 13:59, 9 May 2022 (UTC)

First, I plan to evaluate the alternative providers on usefulness, and modify the budget as needed (i.e. through removing a provider). Once I have a budget that I believe is financially sound, I plan to request a General Support Fund grant to sustain the project for a period of 1-3 years. It should be noted that the Google provider may be able to be utilized through the WMF's existing account and payment plans, and that other providers may offer nonprofit discounts (I am awaiting email responses). EpicPupper (talk) 14:12, 9 May 2022 (UTC)

[1] This is reduced from the usual 50 files per day estimate as the Bing Visual API only allows "internet search experiences", which does not include automated patrolling (e.g. a "stream" of flagged files). This means that users have to manually check files for the Bing Visual API to be used.