Talk:CopyPatrol

NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers please ping a member of Community Tech.


Support for Portuguese

Hi, the meta page says it supports Portuguese Wikipedia, but I don't see the language in the CopyPatrol Tool website. There is some interest to deploy it on ptwiki over at w:pt:Wikipédia:Esplanada/propostas/Trazer a ferramenta "CopyVio" para Wikipédia Lusófona (25mar2021). What needs to be done for it to become usable? GoEThe (talk) 11:31, 25 March 2021 (UTC)

Solved in phab:T278464. GoEThe (talk) 10:39, 26 March 2021 (UTC)

CopyPatrol for French Wikipedia

Hello, it seems that there have been no new cases for frwiki since 8 April. Maybe there is no more copyright infringement (I don't really believe that). Is the CopyPatrol bot down or something? So I think it's time to check old unverified entries ^^ -- Shawn (talk) 13:32, 19 April 2021 (UTC)

Hello @Shawn:,
I think this could be related to 1. Beyond that, I don't know any more; I have only just discovered the tool. RG067 (talk) 16:41, 4 May 2021 (UTC)

Bug

For whatever reason, the text comparison was showing {{Unruto=yes and was missing any proper comparison. The link to the task is here. Thanks-CAPTCHA Security check for adding an external link? I'm offended, Pabsoluterince (talk) 08:09, 12 May 2022 (UTC)

Feedback requested from CopyPatrol users

Hello CopyPatrol users! Turnitin, the company behind iThenticate, which powers CopyPatrol, has asked us to collect feedback from our users. This is to help build our partnership with them and ensure the long-term stability of CopyPatrol, so your input is very much appreciated! The questions may seem a bit broad, but if you are able to elaborate on any of them please do. For all intents and purposes, "iThenticate" in this context can be viewed as CopyPatrol, since the reports you see surfaced there come from iThenticate. Some of you use the "iThenticate report" link as well; if you do, please describe your workflow in Q3. The questions are as follows:

  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free?
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?)
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches?
  4. How does iThenticate help you prevent copyright violations?

Thank you for taking the time to answer, and also your time and energy spent helping keep Wikipedia clean of copyright violations! I am pinging a few of our English-speaking power users, but anyone should feel free to respond: @Diannaa, @DanCherek, @L3X1, @Sphilbrick, @Ymblanter. Warm regards, MusikAnimal (WMF) (talk) 20:25, 24 August 2022 (UTC)

Responses from DanCherek
Hi MusikAnimal (WMF), thanks for reaching out (and for your assistance with keeping CopyPatrol up and running)! Here are my thoughts:
  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free? Having the ability to automatically scan large edits for copyright violations is incredibly helpful. There is a relatively small group of editors working on copyright cleanup, and with the hundreds of thousands of edits that are made every day, there's no way that they could all be manually scanned for copyright issues, let alone accurately identifying the sources of copied text. iThenticate makes this task much more feasible by identifying potential violations and flagging them for human review, distilling this enormous (and important) task into a much more manageable one. By having this system where edits are automatically reviewed, we're able to detect and deal with copyright issues even on articles that aren't actively being watched by other editors, and quickly handle them as they come up (making removal easier, compared to cases in which a copyright violation isn't discovered until years after the fact).
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?) There are a few features that I think are particularly valuable. One is that it is really good at matching text to sources that may not otherwise be readily accessible or findable. Paywalled sources (such as journal articles), or historical versions of websites that have since been modified or taken down, are frequent sources of copying that we see on Wikipedia, but you wouldn't necessarily be able to find these matches from a Google search. So the ability to have a better sense of where some copied text comes from is really helpful. Another feature that I like is the way that the overlapping text is highlighted in the iThenticate interface, including the use of different colors for different sources. It makes it easy to tell, from a glance, which specific phrases in an edit were clearly copied from somewhere, and which parts of the edit might warrant further investigation.
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches? For each article that appears in CopyPatrol, I will open two new tabs initially: the diff of the edit, and the iThenticate report. I take a look at the iThenticate report to try to get a sense of what we're dealing with. For example, if the potential sources are all Wikipedia or Wikipedia mirrors, it may be a case of copying within Wikipedia, in which case I would look at the actual edit and see if that's the case, and whether proper attribution was given. If it looks like text was copied from a copyrighted source that is not compatibly licensed, I will look at the potential matches, try to identify the actual source copied from, and then remove the text from the article. Because we're dealing with recent edits (typically made in the past day or two), we don't have as many issues with reverse copying as we would if we were investigating edits from, say, a decade ago. I will also look at other recent edits to the article, particularly if the editor in question has made a series of edits, to see if there are other copyright issues that weren't flagged by CopyPatrol, and I will also look at edits that the user has made to other articles. Often when there is a fundamental misunderstanding of the copyright policy, the copyright issues are not confined to a single article. I mentioned above that I found the highlighting of overlapping text useful. If an edit clearly looks like it's been copied from somewhere, but the iThenticate-identified sources are all offline or not promising, I use the highlighted text to determine which phrases to enter into Google as I look a little harder for the original source.
  4. How does iThenticate help you prevent copyright violations? Besides the slight overlap with Q1, I think this question really gets at one of the most important things about CopyPatrol, which is that it helps us identify copyright issues hopefully early on in a Wikipedia user's editing history and to educate them about the copyright policy before they make too many edits. Wikipedia's contributor copyright investigation (CCI) project is severely backlogged with cases in which copyright issues were discovered after someone had already made tens or hundreds of thousands of edits. That still happens, but hopefully being able to communicate with the editor earlier on can create a better situation for everyone involved. It also lets us create a paper trail in case the copyright issues persist and further action is needed, and it can help inform when a new CCI may be needed.
Hope this helps. Let me know if I should elaborate on anything else. DanCherek (talk) 23:38, 24 August 2022 (UTC)
Responses from Diannaa
  1. Wikipedia has grown to the point where we receive thousands of edits every hour, which makes it impossible to monitor recent changes without automated tools. The assistance of iThenticate is invaluable to us because it provides a vital and reliable service that can check for copyright problems without the need for our volunteers to be involved in maintenance of the service.
  2. There are some huge benefits. Our previous detection system, CorenSearchBot, checked only new page creations. The iThenticate service checks all additions over a certain size, and thus provides a lot more coverage. The reports are added to our queue almost immediately and we clear all the open reports within 24 to 36 hours, which means people who add copyright material are notified quickly as to what they did wrong and what our expectations are. Quickly notifying people of problems means that we have fewer new editors who think it's okay to add copyright content, and means cleanup is less onerous in the long run. Unlike social media and other sites where people can contribute content, we take copyright very seriously, as failure to do so would have a negative impact on our efforts to be taken seriously as a valid scholarly resource. And the Turnitin system can see behind many paywalls so that we can assess and remove content that we could otherwise not even detect. I am pretty sure CorenSearchBot was not sophisticated enough to do that. CorenSearchBot was retired in June 2016, when we got our CopyPatrol interface perfected.
  3. Assessing reports: First I look at what type of article has been flagged (biography, places, science, or current events for example) as each has specific types of common issues. Next, I look at the url that iThenticate has flagged, and if it's a journal article, I will immediately click on the iThenticate link, because the Turnitin system can see behind many paywalls to view content that would otherwise be inaccessible without a subscription. I check to see if the source webpage is compatibly licensed, and if it's not, I remove the copyright content from the Wikipedia article. Sometimes it becomes obvious that the entire article needs checking, or the editor's entire edit history needs checking. So one iThenticate report can expand into a larger cleanup effort! Then I perform revision deletion if appropriate and notify the editor with either a template or a hand-written note.
  4. The way iThenticate helps prevent problems is through the opportunity to educate users as to our expectations. An editor (whether a newcomer or a veteran) is a lot less likely to add copyright material to Wikipedia if they know there's an automated detection service in place. Diannaa (talk) 14:15, 26 August 2022 (UTC)
Responses from Sennecaster
  1. I am usually busy in other parts of copyright cleanup, but when I do go on CopyPatrol, and from what I see at the "second line" of defense at en:Wikipedia:Copyright problems, it really decreases the amount of manual reviewing and source hunting that we have to do. I find that iThenticate helps filter down what we should look at too, and it can crawl behind paywalls that we as volunteers sometimes cannot. Earwig's copyvio detector also has an iThenticate option, but it rarely works there unfortunately--when it does, I sometimes find issues that were not previously exposed with the other options.
  2. iThenticate is pretty accurate! Sometimes it bugs out and won't let me preview the comparison, but it gives me a source, instead of making me hunt down one. I find that dealing with CopyPatrol reports is extremely fast, even on some of the more tricky ones, so while it may take me a few days to go through one set of Copyright problems listings, I can completely handle a similar amount of CopyPatrol reports (outside of admin tool stuff like revision deletion) in a quicker fashion.
  3. I check the diff and article history first to see if the edit was already reverted by a Recent Changes patroller or if more content was added that also needs to be checked. I then look at the source url, and if I can't discern whether or not it could be a special case, I open the source. In most cases I remove the violation, or attribute it if it was copied within Wikipedia, and then request revision deletion if necessary. I then warn the person. I don't find myself using the iThenticate report that much, since the sources that CopyPatrol flags do not appear in the report when I check.
  4. I think that iThenticate itself doesn't prevent copyright violations, but rather gives us the means to help new people who (understandably so) do not understand copyright on Wikipedia understand before it becomes a problem for everyone. We can find them and give them the guidance needed, kind of like recent changes patrollers can help identify new people who may not know of a certain policy but are here in good faith. It also helps us, at times, find users and pages who have serious issues, and need a referral to our other processes, like Copyright problems or Contributor Copyright Investigations. It gives us ways to prevent long-term copyright violations, but I'm not so sure that it does anything to prevent the total amount of reports we handle or will see at our other processes. Sennecaster (talk) 22:43, 26 August 2022 (UTC)
Responses from Moneytrees

The above responses have basically covered anything of substance I would say, so I will provide briefer answers:

  1. The iThenticate reports are able to detect close paraphrasing better than other community tools and have access to a wider array of sources that are otherwise difficult to verify copying from. The reports are probably the most valuable tool there is when it comes to patrolling for plagiarizing edits.
  2. The main benefits I find are the access to difficult to access sources and accuracy. Given the availability and price of some sources that are copied from, copyright violations can stay in articles for several years before being removed. The iThenticate reports help prevent this by showing comparisons to these sources.
  3. DanCherek has summarized the process the majority of reviewers go through; I have nothing to really add.
  4. It has helped us become much better at catching copyright violations early in editors' careers, helping prevent future ones, and has helped keep track of editors who have repeatedly violated copyright. Moneytrees (talk) 23:01, 3 September 2022 (UTC)

I don't have much to add to the above - except that wider coverage is still needed. False negatives due to unavailable sources are still too frequent. MER-C 18:04, 13 September 2022 (UTC)

Response from L3X1

Without the tools made by iThenticate it would be functionally impossible for me to do anything about plagiarism. Having a program that detects possible violations and formats them in a queue that I can easily interact with, and that delivers the information I need right at my fingertips, is irreplaceable. enL3X1 ¡‹delayed reaction›¡ 22:16, 18 September 2022 (UTC)

What spaces are included?

Hi! I don't immediately see anything about which project spaces are covered, and which (if any?) are not. The reason I ask is that a fairly serious copyvio problem in Portal space has come to light on en.wp, and that made me wonder how it got past this tool and the heroes who monitor it. Thanks, Justlettersandnumbers (talk) 20:31, 15 September 2022 (UTC)

Only the main and draft namespaces are covered. — JJMC89(T·C) 07:19, 17 September 2022 (UTC)