Talk:CopyPatrol

NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers, please ping a member of Community Tech.

Can the tool access paywalled full texts?

Curious whether this tool would detect violations like this from 2015, which copied from this source (you'll need to log in)? If not, have you considered whether the tool can be linked up with The Wikipedia Library to access full texts? Smartse (talk) 10:59, 19 December 2023 (UTC)

@Smartse I tried it by copying that old version to Draft:Sandbox. CopyPatrol picked up the edit [1]. The iThenticate report shows that source as a 13% match. Nobody (talk) 13:16, 19 December 2023 (UTC)
@1AmNobody24: Thanks for that - I see that percentage at 9% for link.springer.com, but looking at https://www.ithenticate.com/ I see that they do indeed have the full texts for many paywalled articles. Good to see that we should catch edits like this today, but I wonder how many we missed! Smartse (talk) 12:29, 21 December 2023 (UTC)

Question about marking edits

When I encounter an edit that somebody else has already fixed (by removing content and adding copyvio-revdel tags, or by tagging for G12), should I mark the edit as "Page fixed" or as "No action needed"? I've been marking these sorts of things as "Page fixed", since it was a true copyvio and the page was fixed, but the use of "you" in 'If you fixed the problem, tagged the page for revision deletion, or tagged the page for deletion as a copyright violation, mark it as "Page fixed"' is now giving me a bit of pause. — Red-tailed hawk (nest) 02:54, 21 December 2023 (UTC)

@Red-tailed hawk I also mark those as "Page fixed". Do you think something like 'If the problem is fixed, the page tagged for revision deletion, or tagged for deletion as a copyright violation, mark it as "Page fixed"' could be better? Nobody (talk) 06:27, 21 December 2023 (UTC)
I think the proposed text would work well, yes. — Red-tailed hawk (nest) 16:38, 22 December 2023 (UTC)

User whitelist

Is that list still working? Cause this came in. Nobody (talk) 09:52, 22 December 2023 (UTC)

Sometimes things slip through; I don't know why. Diannaa (talk) 15:35, 26 December 2023 (UTC)

Is copy patrol down?

only 4 cases going back quite some hours enL3X1 ¡‹delayed reaction›¡ 21:34, 25 December 2023 (UTC)Reply

I'm not seeing any significant gaps, just a general slowdown. I guess people had something else to do on Christmas Day. Diannaa (talk) 15:34, 26 December 2023 (UTC)

New CopyPatrol is live

I'm thrilled to announce the new version of CopyPatrol is now live at https://copypatrol.wmcloud.org. All existing links should redirect to the right place. Please join me in thanking @JJMC89 for his tremendous help in this effort. He probably deserves most of the credit here, but certainly all of it for the backend that he completely rewrote from scratch. The new backend should be much more resilient, with the sporadic downtime that we occasionally see hopefully being a thing of the past. In addition, the new frontend offers a number of new features:

  • Significant performance improvements
  • Edit summaries, change tags, and diff sizes
  • "Undo" or "revdel" links for users who have the requisite permissions

One notable change you might see is that the iThenticate reports no longer include the crawl date. Unfortunately this is outside our control. The Turnitin product team has been made aware of this feature request, so we hope it will eventually be reinstated.

Please let me or JJMC89 know of any issues you see. At the time of writing, the backfill script is still running, so many older reports are missing. They should all be restored in due time. Additionally, we're still ironing out integration with mw:Extension:PageTriage. We'll mark phab:T333724 as resolved once all of the aforementioned has been completed.

This release also marks the conclusion of a formal agreement with Turnitin. This has been in the works since at least May 2022. Turnitin has been kind enough to give us free credits when we need them, but from a legal standpoint nothing solidified our relationship in the past. Now it is set in stone, and we have the reassurance that CopyPatrol is here to thrive for years to come. They were gracious enough to give us quite a bit of credits exceeding our current consumption, so we will soon be exploring adding more languages to CopyPatrol. On the front of negotiations with Turnitin, I'd like to thank @Ocaasi who started the conversations, and more recently my colleagues @SSpalding (WMF) from Legal, @JVargas (WMF) from Partnerships, my manager @KSiebert (WMF), and our new Lead Community Tech Manager @JWheeler-WMF.

Above all, allow me to thank all of you – our users – who are doing the actual work of helping cleanse the wikis of copyright violations. Your tireless efforts are what drove us to reaching this milestone.

Warm regards, MusikAnimal (WMF) (talk) 21:42, 9 April 2024 (UTC)

Feedback

  Fixed = code updated and confirmed would not show up if rechecked
Wow, I can actually feel everything loading faster (imagine my shock on discovering that marking the status of reports is now near-instant). The new features are great. Could I share a little bit of feedback?
  • The undo button is really useful, but its location next to the diff button has led me to click it unintentionally several times (maybe it could be moved down)
Other than that, everything looks good. The leaderboard seems a bit funky, but I imagine that will be fixed with the backfill script. Isochrone (talk) 22:06, 9 April 2024 (UTC)
It's so awesome to see how this technology and this partnership have evolved and matured. Congrats to everyone who has pushed it so much further!! Ocaasi (talk) 00:13, 10 April 2024 (UTC)
The new version has many positive changes, such as the quick loading time and the expected reduction in outages. However, on the down side, I see that there are already 212 cases posted for April 10 and there are still three hours to go, so a projected 240 cases to assess in the 24-hour period. Given that most days we only have two people working the queue, this needs to be cut in half if that's possible. It's unrealistic and unsustainable to expect our tiny crew to keep up with the volume otherwise. (I can typically only clear about 20 cases per hour and can only commit to working on this for 3-4 hours per day.) Diannaa (talk) 21:20, 10 April 2024 (UTC)
Yes, many thanks for the improvements! Very grateful. I agree with Diannaa that we may need some tweaks in terms of what the bot flags as a potential copyright violation, as the threshold seems to have been lowered compared to before (one example I mentioned on her talk page was that it now flags cases where someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph). Not sure we'll be able to handle the reports otherwise. DanCherek (talk) 22:34, 10 April 2024 (UTC)
@Diannaa @DanCherek Thanks for all of the feedback! Can you link to specific example(s)? "someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph" – wouldn't that still usually be a copyright violation, or do you mean the source is a backwards copy (in which case it's not a copyvio at all)?
Assuming the cases are still valid, my opinion is that it's perfectly fine to have a backlog. While it's admirable to aim for completeness, you can only volunteer so much time. If, however, you're seeing a lot of noise, with backwards copies, or otherwise too many cases that are right on the "borderline", etc., we certainly can work to improve that. MusikAnimal (WMF) (talk) 22:45, 10 April 2024 (UTC)
I'm seeing a lot of cases like [2], where someone copyedits a paragraph and then it matches the rest of the unchanged text to a backwards copy. We still had to deal with backwards copies in the old CopyPatrol, of course, but so far it feels like a lot more after the update. DanCherek (talk) 22:50, 10 April 2024 (UTC)
  Fixed — JJMC89(T·C) 17:25, 11 April 2024 (UTC)
This report flags an edit that just cleaned up references with no real new text added. -- Whpq (talk) 22:56, 10 April 2024 (UTC)
  Fixed — JJMC89(T·C) 06:44, 11 April 2024 (UTC) modified 16:19, 11 April 2024 (UTC)
Due to the large number of Wikipedia mirrors, we will always have false positives. We can waste a lot of valuable time on those cases, attempting to determine who had it first. We do have a whitelist of Wikipedia mirrors, but people who don't know regex are warned not to edit it. Here are a few more false positives of various kinds. I don't know if these are useful examples or not:
  • Here's one where an editor removed multiple occurrences of the word "current" from a list. The list itself is public domain of course.
  • Here's one where an editor moved a paragraph that was reflected in a Wikipedia mirror. The material they added in the same edit is okay to keep.
  • In this one, an editor actually removes text but since IMDb has copied our plot summary at some point, the item gets listed.
  • Here's one that illustrates DanCherek's point: only a few words are added. Purported source: an obvious Wikipedia mirror.
Another suggestion: Perhaps we can somehow teach the system to only show us the most likely cases? Maybe there's a way to adjust the threshold for inclusion, regarding the size of the edit or the amount of the overlap? It's not a question of having a backlog; if we don't reduce the fire hose of incoming cases there will be many that never get assessed at all. Diannaa (talk) 23:25, 10 April 2024 (UTC)
  Fixed all (originally "first and fourth"). The second link is the same as the first. — JJMC89(T·C) 06:44, 11 April 2024 (UTC) modified 17:25, 11 April 2024 (UTC)
Sorry about the duplicate link; I am not going to bother to look for the missing example. New comments:
  • Community Tech bot used to remove listings of pages that were already deleted. This doesn't seem to be happening so far: deleted article, deleted draft
  • Cases so far at the halfway point of April 11 are a much more manageable 40, so if tweaks are underway, they're working.
Diannaa (talk) 12:11, 11 April 2024 (UTC)
Unfortunately I had to revert one of the fixes due to poor performance causing the bot to build up a large backlog that hasn't been processed yet. — JJMC89(T·C) 16:19, 11 April 2024 (UTC)
One thing I've noticed is that I keep getting logged out every time I close my browser. Is there a cookie persistence issue? I had no such issues with the old backend. Isochrone (talk) 13:38, 11 April 2024 (UTC)
I will look into this. It seems this happens to every new Symfony app that I create (phab:T224382). I managed to fix it before, so I'll attempt it again for CopyPatrol (the old CopyPatrol did not run on Symfony, FYI). MusikAnimal (WMF) (talk) 19:32, 11 April 2024 (UTC)
I just noticed that I can't view the iThenticate reports unless I am logged in to CopyPatrol. So that might be a feature rather than a bug. Diannaa (talk) 23:14, 11 April 2024 (UTC)
Logging in is required since each user must agree to the EULA to see the reports. The short login session should get worked on. — JJMC89(T·C) 22:25, 12 April 2024 (UTC)
Tracked in Phabricator:
Task T362457

New feedback: Some users are incorrectly being shown with redlinked user talk pages. Here, here, here, for example. It appears this might be because they don't have a talk page on Meta, but that's immaterial; I would prefer to be able to see at a glance whether or not a user talk page exists at en.wiki for that username. Diannaa (talk) 21:44, 12 April 2024 (UTC)

Moving ignore lists to the CopyPatrol UI

In the above discussion, it was noted how tedious it is to maintain User:CopyPatrolBot/UrlIgnoreList as it requires knowledge of regular expressions. I had an idea that we could get rid of the on-wiki lists and instead have a button "Ignore URLs like this" directly in the CopyPatrol UI. We could do the same for users, too, so you don't have to edit User:CopyPatrolBot/UserIgnoreList. This is also nice because the new system has the ignore lists centralized on Meta, where not everyone is necessarily able to edit (the page could be semi-protected).

The only issue I foresee with this idea is the potential for abuse. For that, I was thinking we'd restrict the ability to ignore URLs and users to "privileged" users – say those with at least 1,000 edits, or even only sysops. Another option is to go ahead and shield all of CopyPatrol from newbies, as proposed at phab:T178700.

Thoughts? MusikAnimal (WMF) (talk) 19:43, 11 April 2024 (UTC)

I can't imagine any issues with this for URLs. With users, making it too easy to exclude them, even for admins (who are human), may lead to unintentional removals of users who should be flagged, or to people being too liberal with the ignore button.
When there are errors on the wikitext list, they can simply be rectified by another user. Would there be a way to "un-ignore" users in case of errors? Isochrone (talk) 20:28, 11 April 2024 (UTC)
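Illustrative note (unsigned, not part of CopyPatrol's code): to make the "Ignore URLs like this" idea above more concrete, here is a minimal Python sketch of how a clicked URL might be turned into a regex entry so that nobody has to write the pattern by hand. The function names, the one-pattern-per-host policy, and the assumption that the ignore list is a plain list of regex strings are all hypothetical; the actual list format and matching rules may differ.

```python
import re
from urllib.parse import urlsplit


def ignore_pattern_for(url: str) -> str:
    """Build a regex matching any URL on the same host (or a subdomain).

    Hypothetical helper: the name and the per-host policy are assumptions
    made for illustration, not CopyPatrol behaviour.
    """
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"no hostname in URL: {url!r}")
    # Drop a leading "www." so both forms of the host are covered.
    if host.startswith("www."):
        host = host[4:]
    # re.escape stops the dots in the hostname from acting as wildcards.
    return r"^https?://([^/]+\.)?" + re.escape(host) + r"(/|$)"


def is_ignored(url: str, patterns: list[str]) -> bool:
    """Return True if `url` matches any pattern on the ignore list."""
    return any(re.search(p, url, flags=re.IGNORECASE) for p in patterns)
```

For example, clicking the button on https://www.example-mirror.org/wiki/Foo would add one pattern covering example-mirror.org and its subdomains. Un-ignoring, as asked about above, would then just mean removing that entry from the stored list.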