Talk:CopyPatrol

NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers please ping a member of Community Tech.

Bug

For whatever reason, the text comparison was showing {{Unruto=yes and was missing any proper comparisons. The link to the task is here. Thanks-CAPTCHA Security check for adding an external link? I'm offended, Pabsoluterince (talk) 08:09, 12 May 2022 (UTC)

Feedback requested from CopyPatrol users

Hello CopyPatrol users! Turnitin, the company behind iThenticate, which powers CopyPatrol, has asked us to collect feedback from our users. This is to help build our partnership with them and ensure the long-term stability of CopyPatrol, so your input is very much appreciated! The questions may seem a bit broad, but if you are able to elaborate on any of them please do. For all intents and purposes, "iThenticate" in this context can be viewed as CopyPatrol, since the reports you see surfaced there come from iThenticate. Some of you use the "iThenticate report" link as well; if you do, please describe your workflow in Q3. The questions are as follows:

  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free?
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?)
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches?
  4. How does iThenticate help you prevent copyright violations?

Thank you for taking the time to answer, and also for the time and energy you spend helping keep Wikipedia clean of copyright violations! I am pinging a few of our English-speaking power users, but anyone should feel free to respond: @Diannaa, @DanCherek, @L3X1, @Sphilbrick, @Ymblanter. Warm regards, MusikAnimal (WMF) (talk) 20:25, 24 August 2022 (UTC)

Responses from DanCherek
Hi MusikAnimal (WMF), thanks for reaching out (and for your assistance with keeping CopyPatrol up and running)! Here are my thoughts:
  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free? Having the ability to automatically scan large edits for copyright violations is incredibly helpful. There is a relatively small group of editors working on copyright cleanup, and with the hundreds of thousands of edits that are made every day, there's no way that they could all be manually scanned for copyright issues, let alone accurately identifying the sources of copied text. iThenticate makes this task much more feasible by identifying potential violations and flagging them for human review, distilling this enormous (and important) task into a much more manageable one. By having this system where edits are automatically reviewed, we're able to detect and deal with copyright issues even on articles that aren't actively being watched by other editors, and quickly handle them as they come up (making removal easier, compared to cases in which a copyright violation isn't discovered until years after the fact).
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?) There are a few features that I think are particularly valuable. One is that it is really good at matching text to sources that may not otherwise be readily accessible or findable. Paywalled sources (such as journal articles), or historical versions of websites that have since been modified or taken down, are frequent sources of copying that we see on Wikipedia, but you wouldn't necessarily be able to find these matches from a Google search. So the ability to have a better sense of where some copied text comes from is really helpful. Another feature that I like is the way that the overlapping text is highlighted in the iThenticate interface, including the use of different colors for different sources. It makes it easy to tell, from a glance, which specific phrases in an edit were clearly copied from somewhere, and which parts of the edit might warrant further investigation.
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches? For each article that appears in CopyPatrol, I will open two new tabs initially: the diff of the edit, and the iThenticate report. I take a look at the iThenticate report to try to get a sense of what we're dealing with. For example, if the potential sources are all Wikipedia or Wikipedia mirrors, it may be a case of copying within Wikipedia, in which case I would look at the actual edit and see if that's the case, and whether proper attribution was given. If it looks like text was copied from a copyrighted source that is not compatibly licensed, I will look at the potential matches, try to identify the actual source copied from, and then remove the text from the article. Because we're dealing with recent edits (typically made in the past day or two), we don't have as many issues with reverse copying as we would if we were investigating edits from, say, a decade ago. I will also look at other recent edits to the article, particularly if the editor in question has made a series of edits, to see if there are other copyright issues that weren't flagged by CopyPatrol, and I will also look at edits that the user has made to other articles. Often when there is a fundamental misunderstanding of the copyright policy, the copyright issues are not confined to a single article. I mentioned above that I found the highlighting of overlapping text useful. If an edit clearly looks like it's been copied from somewhere, but the iThenticate-identified sources are all offline or not promising, I use the highlighted text to determine which phrases to enter into Google as I look a little harder for the original source.
  4. How does iThenticate help you prevent copyright violations? Besides the slight overlap with Q1, I think this question really gets at one of the most important things about CopyPatrol, which is that it helps us identify copyright issues hopefully early on in a Wikipedia user's editing history and to educate them about the copyright policy before they make too many edits. Wikipedia's contributor copyright investigation (CCI) project is severely backlogged with cases in which copyright issues were discovered after someone had already made tens or hundreds of thousands of edits. That still happens, but hopefully being able to communicate with the editor earlier on can create a better situation for everyone involved. It also lets us create a paper trail in case the copyright issues persist and further action is needed, and it can help inform when a new CCI may be needed.
Hope this helps. Let me know if I should elaborate on anything else. DanCherek (talk) 23:38, 24 August 2022 (UTC)
Responses from Diannaa
  1. Wikipedia has grown to the point where we receive thousands of edits every hour, which makes it impossible to monitor recent changes without automated tools. The assistance of iThenticate is invaluable to us because it provides a vital and reliable service that can check for copyright problems without the need for our volunteers to be involved in maintenance of the service.
  2. There are some huge benefits. Our previous detection system, CorenSearchBot, checked only new page creations. The iThenticate service checks all additions over a certain size, and thus provides a lot more coverage. The reports are added to our queue almost immediately and we clear all the open reports within 24 to 36 hours, which means people who add copyright material are notified quickly as to what they did wrong and what our expectations are. Quickly notifying people of problems means that we have fewer new editors who think it's okay to add copyright content, and means cleanup is less onerous in the long run. Unlike social media and other sites where people can contribute content, we take copyright very seriously, as failure to do so would have a negative impact on our efforts to be taken seriously as a valid scholarly resource. And the Turnitin system can see behind many paywalls so that we can assess and remove content that we could otherwise not even detect. I am pretty sure CorenSearchBot was not sophisticated enough to do that. CorenSearchBot was retired in June 2016, when we got our CopyPatrol interface perfected.
  3. Assessing reports: First I look at what type of article has been flagged (biography, places, science, or current events for example) as each has specific types of common issues. Next, I look at the URL that iThenticate has flagged, and if it's a journal article, I will immediately click on the iThenticate link, because the Turnitin system can see behind many paywalls to view content that would otherwise be inaccessible without a subscription. I check to see if the source webpage is compatibly licensed, and if it's not, I remove the copyright content from the Wikipedia article. Sometimes it becomes obvious that the entire article needs checking, or the editor's entire edit history needs checking. So one iThenticate report can expand into a larger cleanup effort! Then I perform revision deletion if appropriate and notify the editor with either a template or a hand-written note.
  4. The way iThenticate helps prevent problems is through the opportunity to educate users as to our expectations. An editor (whether a newcomer or a veteran) is a lot less likely to add copyright material to Wikipedia if they know there's an automated detection service in place. Diannaa (talk) 14:15, 26 August 2022 (UTC)
Responses from Sennecaster
  1. I am usually busy in other parts of copyright cleanup, but when I do go on CopyPatrol, and from what I see at the "second line" of defense at en:Wikipedia:Copyright problems, it really decreases the amount of manual reviewing and source hunting that we have to do. I find that iThenticate helps filter down what we should look at too, and it can crawl behind paywalls that we as volunteers sometimes cannot. Earwig's copyvio detector also has an iThenticate option, but it rarely works there unfortunately--when it does, I sometimes find issues that were not previously exposed with the other options.
  2. iThenticate is pretty accurate! Sometimes it bugs out and won't let me preview the comparison, but it gives me a source, instead of making me hunt down one. I find that dealing with CopyPatrol reports is extremely fast, even on some of the more tricky ones, so while it may take me a few days to go through one set of Copyright problems listings, I can completely handle a similar amount of CopyPatrol reports (outside of admin tool stuff like revision deletion) in a quicker fashion.
  3. I check the diff and article history first to see if the edit was already reverted by a Recent Changes patroller or if more content was added that also needs to be checked. I then look at the source url, and if I can't discern whether or not it could be a special case, I open the source. In most cases I remove the violation, or attribute it if it was copied within Wikipedia, and then request revision deletion if necessary. I then warn the person. I don't find myself using the iThenticate report that much, since the sources that CopyPatrol flags do not appear in the report when I check.
  4. I think that iThenticate itself doesn't prevent copyright violations, but rather gives us the means to help new people who (understandably so) do not understand copyright on Wikipedia understand before it becomes a problem for everyone. We can find them and give them the guidance needed, kind of like recent changes patrollers can help identify new people who may not know of a certain policy but are here in good faith. It also helps us, at times, find users and pages that have serious issues and need a referral to our other processes, like Copyright problems or Contributor Copyright Investigations. It gives us ways to prevent long-term copyright violations, but I'm not so sure that it does anything to reduce the total number of reports we handle or will see at our other processes. Sennecaster (talk) 22:43, 26 August 2022 (UTC)
Responses from Moneytrees

The above responses have basically covered anything of substance I would say, so I will provide briefer answers:

  1. The iThenticate reports are able to detect close paraphrasing better than other community tools and have access to a wider array of sources that are otherwise difficult to verify copying from. The reports are probably the most valuable tool there is when it comes to patrolling for plagiarizing edits.
  2. The main benefits I find are the access to difficult-to-access sources and accuracy. Given the availability and price of some sources that are copied from, copyright violations can stay in articles for several years before being removed. The iThenticate reports help prevent this by showing comparisons to these sources.
  3. DanCherek has summarized the process the majority of reviewers go through; I have nothing to really add.
  4. It has helped us become much better at catching copyright violations early in editors' careers, helping prevent future ones, and has helped keep track of editors who have repeatedly violated copyright. Moneytrees (talk) 23:01, 3 September 2022 (UTC)

I don't have much to add to the above - except that wider coverage is still needed. False negatives due to unavailable sources are still too frequent. MER-C 18:04, 13 September 2022 (UTC)

Response from L3X1

Without the tools made by iThenticate it would be functionally impossible for me to do anything about plagiarism. Having a program that detects possible violations and formats them in a queue that I can easily interact with, and that delivers the information I need right at my fingertips, is irreplaceable. enL3X1 ¡‹delayed reaction›¡ 22:16, 18 September 2022 (UTC)

@MER-C, Moneytrees, Sennecaster, DanCherek, Diannaa, and L3X1: A belated but sincere THANK YOU for your well-articulated and thorough replies! :) I realized I failed to mention that this was for a case study. The hope is to publish a blog post (authored by Turnitin) on Wikimedia Diff. We saw the draft today and they are using direct quotes from some of you and linking to your user pages. I wanted to make sure you were okay with this. I assume so, since your words are already in the public eye here. I'm not sure when the post will go live, but I will certainly let you know. Thanks for helping us build up our partnership with Turnitin! Best, MusikAnimal (WMF) (talk) 02:09, 2 December 2022 (UTC)
No issues for me. Thanks, DanCherek (talk) 02:19, 2 December 2022 (UTC)
I am okay with this and would be pleased to see the resulting blog post. Diannaa (talk) 03:21, 2 December 2022 (UTC)
Fine with me :) Sennecaster (talk) 12:55, 2 December 2022 (UTC)
That's cool, I am fine if my words are used. Moneytrees (talk) 20:16, 4 December 2022 (UTC)
Yes, I am fine with being quoted and/or linked to. Thanks for reaching out. enL3X1 ¡‹delayed reaction›¡ 20:44, 5 December 2022 (UTC)
@Moneytrees, DanCherek, Diannaa, and Sennecaster: The blog post was published and I guess I wasn't notified, but anyway here it is should you want to read it. Those of you I didn't ping were not mentioned. Thanks again to all of you for your feedback and participation in this PR push! Thanks to you, we should soon hopefully have enough credits secured for CopyPatrol to last for many more years. Warm regards, MusikAnimal (WMF) (talk) 02:13, 18 January 2023 (UTC)

What spaces are included?

Hi! I don't immediately see anything about which project spaces are covered, and which (if any?) are not. The reason I ask is that a fairly serious copyvio problem in Portal space has come to light on en.wp, and that made me wonder how it got past this tool and the heroes who monitor it. Thanks, Justlettersandnumbers (talk) 20:31, 15 September 2022 (UTC)

Only the main and draft namespaces are covered. — JJMC89(T·C) 07:19, 17 September 2022 (UTC)

Loading issue

Starting to see this crop up a lot: "No text could be found in the given URL (note that only HTML and plain text pages are supported, and content generated by JavaScript or found inside iframes is ignored)". Is this to be expected? enL3X1 ¡‹delayed reaction›¡ 04:10, 22 October 2022 (UTC)

@L3X1 Sorry for the late reply! Is this still happening, and if so, could you link to an example? I'm assuming you're talking about the "Compare" button, which relies on toolforge:copyvios, so the issue may actually be there. MusikAnimal (WMF) (talk) 01:51, 2 December 2022 (UTC)
Yes, it appeared when I clicked the compare button. I will keep an eye out for its return. enL3X1 ¡‹delayed reaction›¡ 20:57, 5 December 2022 (UTC)
@L3X1 I just realized, this was probably because the external URL is behind a paywall. This is why we also give you the "iThenticate report" link, which will show the original text from the website. If that isn't the issue, then it's something else with toolforge:copyvios. You can contact The Earwig to report those issues, but we're happy to pass along the message for you if you'd rather post here :) MusikAnimal (WMF) (talk) 00:41, 6 December 2022 (UTC)
Hello @User:MusikAnimal (WMF) I found a recent instance: https://copypatrol.toolforge.org/en/?id=93197888 . It appears that the external URL is a PDF of a scan of a book. I tried to do an in-browser page search for a portion of the new text from the wiki diff, but the PDF is so large my browser cannot search it. Hope this helps, thanks. enL3X1 ¡‹delayed reaction›¡ 22:29, 6 December 2022 (UTC)

Diannaa stepping back, new features

@MusikAnimal (WMF) (Also ping @DanCherek, MER-C, and Diannaa; please add anything here that you think might also be useful) If you haven't seen, Diannaa is going to be doing less work at copypatrol moving forward. This is as good a time as any to address some long-running issues around how work is structured. It's not efficient, healthy, or fair for two or three people to be doing the lion's share of the work. We need to make patrolling and dealing with copyright violations like recent changes patrolling, in that the majority of editors have a baseline knowledge of what to do. We need to update the processes around dealing with copyright violations to account for this, and I have two ideas in particular:

  • We should have a feature that allows you to search through all the times a specific editor has been flagged at copypatrol.
  • We should have a feature that allows you to look through all the reviews someone has done.

If for whatever reason these cannot be used by the general editing group at copypatrol, would it be possible to add an "admin" role at copypatrol that could do this? If this isn't the correct venue to request these features, what would be? Thank you, Moneytrees (talk) 21:48, 28 January 2023 (UTC)

Hi Moneytrees, I never got a ping notification for this discussion, and noticed only by chance. You might like to notify the others of its existence via some other method in case they never got pinged either. Thanks. Diannaa (talk) 15:11, 2 February 2023 (UTC)
Hey @Moneytrees! I didn't get this ping either, but I did get your message on my talk page (at the time I was on holiday). I'm sad to hear the mighty @Diannaa will be taking a break! She indeed has done the heavy lifting for some years now.
CommTech and I are happy to look into streamlining CopyPatrol however you think it will help. In my opinion though, the main issue is the lack of interested patrollers. I think running some sort of campaign to get more folks involved on enwiki is probably going to yield the best results.
Now, looking at your two specific suggestions:
  • a feature that allows you to search through all the times a specific editor has been flagged at copypatrol
    Partially doable, and would be very slow. We don't store any data about the editor in the CopyPatrol database, only the revision ID. While we can do a query to find all revisions by a given editor that exist in CopyPatrol, this wouldn't work if the revisions no longer exist. For instance, see the reviewed cases and you'll notice that now-deleted pages don't have editor info (example). We could change CopyPatrol to start storing user data, but I'm afraid this would be expensive for the benefit it provides.
  • a feature that allows you to look through all the reviews someone has done
    This is doable and quite easily at that! If you can file a Phabricator task with the CopyPatrol tag, we'll get it triaged in the next meeting. Or I can write a task when I find the time.
Best, MusikAnimal (WMF) (talk) 21:55, 28 February 2023 (UTC)
@MusikAnimal (WMF) I'm planning on writing a sort of "guide to copypatrol" and encouraging some increased community activity when I have the time. I've created a task on Phab related to the searching reviews feature; let me know if I did it wrong. Moneytrees (talk) 04:00, 5 March 2023 (UTC)
Hi, @MusikAnimal (WMF)! I saw the task open up at Phabricator and wanted to take a stab at it since I had free time. I forgot to read this thread and didn't notice that you had plans to get it done. The PR can be found here; please feel free to close if it's an overreach. Chlod (say hi!) 06:35, 5 March 2023 (UTC)
Not an overreach at all! We had not started working on this. Thank you very much for creating a PR :) I'll get to reviewing it soon. MusikAnimal (WMF) (talk) 18:52, 6 March 2023 (UTC)
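
For anyone following the first suggestion in this thread, here is a minimal sketch of why a per-editor lookup is only "partially doable" when the reports store nothing but a revision ID. Everything in it is an assumption made for illustration: the table names `reports` and `wiki_revisions`, their columns, the user name, and the use of an in-memory SQLite database are all hypothetical, and the real CopyPatrol schema is not described anywhere on this page.

```python
import sqlite3


def flagged_revisions_by_editor(conn, editor):
    """Return flagged revision IDs whose editor can still be resolved.

    The (hypothetical) reports table has no editor column, so the editor
    has to be looked up by joining against the wiki's revision data.
    """
    rows = conn.execute(
        """
        SELECT r.rev_id
        FROM reports AS r
        JOIN wiki_revisions AS w ON w.rev_id = r.rev_id
        WHERE w.editor = ?
        ORDER BY r.rev_id
        """,
        (editor,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical data: three flagged revisions, one of which has since been
# deleted on the wiki and therefore has no row in wiki_revisions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reports (rev_id INTEGER PRIMARY KEY);
    CREATE TABLE wiki_revisions (rev_id INTEGER PRIMARY KEY, editor TEXT);
    INSERT INTO reports VALUES (101), (102), (103);
    INSERT INTO wiki_revisions VALUES (101, 'ExampleUser'), (102, 'ExampleUser');
    """
)

print(flagged_revisions_by_editor(conn, "ExampleUser"))  # [101, 102]
```

Revision 103 was flagged but drops out of the result because its editor can no longer be resolved, which is the gap described above for now-deleted pages. Storing the editor alongside each report would close that gap, at the cost MusikAnimal mentions.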