Talk:CopyPatrol

This page is for discussions related to the CopyPatrol page.

Please remember to:

Sign posts using the four tildes (~~~~)
Remain civil and polite during discussions.
Place new text under old text (start a new post).

NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers, please ping a member of Community Tech.

Archives

2016 · 2017 · 2019 · 2020 · 2021 · 2022 · 2023

New CopyPatrol is live

I'm thrilled to announce the new version of CopyPatrol is now live at https://copypatrol.wmcloud.org. All existing links should redirect to the right place. Please join me in thanking @JJMC89 for his tremendous help in this effort. He probably deserves most of the credit here, but certainly all of it for the backend that he completely rewrote from scratch. The new backend should be much more resilient, with the sporadic downtime that we occasionally see hopefully being a thing of the past. In addition, the new frontend offers a number of new features:

Significant performance improvements
Edit summaries, change tags, and diff sizes
"Undo" or "revdel" links for users who have the requisite permissions

One notable change you might see is that the iThenticate reports no longer include the crawl date. Unfortunately this is outside our control. The Turnitin product team has been made aware of this feature request, so we hope it will eventually be reinstated.

Please let me or JJMC89 know of any issues you see. At the time of writing, the backfill script is still running, so many older reports are missing. They should all be restored in due time. Additionally, we're still ironing out integration with mw:Extension:PageTriage. We'll mark phab:T333724 as resolved once all of the aforementioned has been completed.

This release also marks the conclusion of a formal agreement with Turnitin. This has been in the works since at least May 2022. Turnitin has been kind enough to give us free credits when we need them, but from a legal standpoint nothing solidified our relationship in the past. Now it is set in stone, and we have the reassurance that CopyPatrol is here to thrive for years to come. They were gracious enough to give us quite a bit of credits exceeding our current consumption, so we will soon be exploring adding more languages to CopyPatrol. On the front of negotiations with Turnitin, I'd like to thank @Ocaasi who started the conversations, and more recently my colleagues @SSpalding (WMF) from Legal, @JVargas (WMF) from Partnerships, my manager @KSiebert (WMF), and our new Lead Community Tech Manager @JWheeler-WMF.

Above all, allow me to thank all of you – our users – who are doing the actual work of helping cleanse the wikis of copyright violations. Your tireless efforts are what drove us to reaching this milestone.

Warm regards, MusikAnimal (WMF) (talk) 21:42, 9 April 2024 (UTC)

Feedback

Wow, I can actually feel everything loading faster (imagine my shock on discovering that marking the status of reports is now near-instant). The new features are great, could I share a little bit of feedback?

The undo button is really useful, but its location next to the diff button has led to me now clicking it unintentionally multiple times (maybe it could be moved down)

Other than that, everything looks good. The leaderboard seems a bit funky, but I imagine that will be fixed with the backfill script. Isochrone (talk) 22:06, 9 April 2024 (UTC)

It's so awesome to see how this technology and this partnership has evolved and matured. Congrats to everyone who has pushed it so much further!! Ocaasi (talk) 00:13, 10 April 2024 (UTC)

Amazing! Edit summaries are so helpful! Thanks to all who made this work. enL3X1 ¡‹delayed reaction›¡ 13:39, 10 April 2024 (UTC)

The new version has many positive changes, such as the quick loading time and the expected reduction in outages. However, on the down side, I see that there are already 212 cases posted for April 10 and there are still three hours to go, so a projected 240 cases to assess in the 24-hour period. Given that most days we only have two people working the queue, this needs to be cut in half if that's possible. It's unrealistic and unsustainable to expect our tiny crew to keep up with the volume otherwise. (I can typically only clear about 20 cases per hour and can only commit to working on this for 3-4 hours per day.) Diannaa (talk) 21:20, 10 April 2024 (UTC)
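The arithmetic behind the comment above can be sketched as follows. The input figures are the ones quoted in the comment; the assumption that both patrollers clear cases at the same rate and for the same hours is added here for illustration and is not stated in the discussion.

```python
# Figures quoted in the comment above.
cases_so_far = 212       # cases posted so far for the day
hours_elapsed = 21       # "still three hours to go" in a 24-hour day
patrollers = 2           # "most days we only have two people working the queue"
cases_per_hour = 20      # "about 20 cases per hour"
hours_per_day = (3, 4)   # "3-4 hours per day"

# Linear projection of the day's total from the cases seen so far.
projected = cases_so_far / hours_elapsed * 24

# Assumption (not from the discussion): both patrollers work at the
# same rate and for the same number of hours.
capacity_low = patrollers * cases_per_hour * hours_per_day[0]
capacity_high = patrollers * cases_per_hour * hours_per_day[1]

print(f"projected cases per day: {projected:.0f}")
print(f"daily review capacity: {capacity_low}-{capacity_high}")
```

Under these assumptions the projected intake is roughly twice the lower end of the review capacity, which is consistent with the request to cut the volume in half.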

Yes, many thanks for the improvements! Very grateful. I agree with Diannaa that we may need some tweaks in terms of what the bot flags as a potential copyright violation as the threshold seems to have been lowered compared to before (one example I mentioned on her talk page was that it now flags cases where someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph). Not sure we'll be able to handle the reports otherwise. DanCherek (talk) 22:34, 10 April 2024 (UTC)

@Diannaa @DanCherek Thanks for all of the feedback! Can you link to specific example(s)? Regarding "someone changes one or two words in a paragraph because it detects a match for the remaining text in the paragraph": wouldn't that still usually be a copyright violation, or do you mean the source is a backwards copy (in which case it's not a copyvio at all)?

Assuming the cases are still valid, my opinion is that it's perfectly fine to have a backlog. While it's admirable to aim for completeness, you can only volunteer so much time. If, however, you're seeing a lot of noise, with backwards copies, or otherwise too many cases that are right on the "borderline", etc., we certainly can work to improve that. MusikAnimal (WMF) (talk) 22:45, 10 April 2024 (UTC)

I'm seeing a lot of cases like [1], where someone copyedits a paragraph and then it matches the rest of the unchanged text to a backwards copy. We still had to deal with backwards copies in the old CopyPatrol, of course, but so far it feels like a lot more after the update. DanCherek (talk) 22:50, 10 April 2024 (UTC)

Fixed — JJMC89 (T·C) 17:25, 11 April 2024 (UTC)

This report flags an edit that just cleaned up references with no real new text added. -- Whpq (talk) 22:56, 10 April 2024 (UTC)

~~Fixed~~ — JJMC89 (T·C) 06:44, 11 April 2024 (UTC) modified 16:19, 11 April 2024 (UTC)

Due to the large number of Wikipedia mirrors, we will always have false positives. We can waste a lot of valuable time on those cases, attempting to determine who had it first. We do have a whitelist of Wikipedia mirrors, but people who don't know regex are warned not to edit it. Here are a few more false positives of various kinds. I don't know if these are useful examples or not:

Here's one where an editor removed multiple occurrences of the word "current" from a list. The list itself is public domain of course.
Here's one where an editor moved a paragraph that was reflected in a Wikipedia mirror. The material they added in the same edit is okay to keep.
In this one, an editor actually removes text but since IMDb has copied our plot summary at some point, the item gets listed.
Here's one that illustrates DanCherek's point: only a few words are added. Purported source: an obvious Wikipedia mirror.

Another suggestion: Perhaps we can somehow teach the system to only show us the most likely cases? Maybe there's a way to reduce the threshold for inclusion, regarding the size of the edit or the amount of the overlap? It's not a question of having a backlog; if we don't reduce the fire hose of incoming cases there will be many that never get assessed at all. Diannaa (talk) 23:25, 10 April 2024 (UTC)

Fixed ~~first and fourth~~ all. The second link is the same as the first. — JJMC89 (T·C) 06:44, 11 April 2024 (UTC) modified 17:25, 11 April 2024 (UTC)

Sorry about the duplicate link; I am not going to bother to look for the missing example. New comments:

Community Tech bot used to remove listings of pages that were already deleted. This doesn't seem to be happening so far: deleted article, deleted draft
Cases so far at the halfway point of April 11 are a much more manageable 40, so if tweaks are underway, they're working.

Diannaa (talk) 12:11, 11 April 2024 (UTC)

Unfortunately I had to revert one of the fixes due to poor performance causing the bot to build up a large backlog that hasn't been processed yet. — JJMC89 (T·C) 16:19, 11 April 2024 (UTC)

One thing I've noticed is that I keep getting logged out every time I close my browser. Is there a cookie persistence issue? I had no such issues with the old backend. Isochrone (talk) 13:38, 11 April 2024 (UTC)

I will look into this. It seems this happens to every new Symfony app that I create (phab:T224382). I managed to fix it before, so I'll attempt it again for CopyPatrol (the old CopyPatrol did not run on Symfony, FYI). MusikAnimal (WMF) (talk) 19:32, 11 April 2024 (UTC)

I just noticed that I can't view the iThenticate reports unless I am logged in to CopyPatrol. So that might be a feature rather than a bug. Diannaa (talk) 23:14, 11 April 2024 (UTC)

Logging in is required since each user must agree to the EULA to see the reports. The short login session should get worked on. — JJMC89 (T·C) 22:25, 12 April 2024 (UTC)

Tracked in Phabricator:
Task T362457 resolved

New feedback: Some users are incorrectly being shown with redlinked user talk pages. Here, here, here, for example. It appears this might be because they don't have a talk page on Meta, but that's immaterial; I would prefer to be able to see at a glance whether or not a user talk page exists at en.wiki for that username. Diannaa (talk) 21:44, 12 April 2024 (UTC)

Fixed MusikAnimal (WMF) (talk) 19:17, 14 April 2024 (UTC)

Moving ignore lists to the CopyPatrol UI

In the above discussion, it was noted how tedious it is to maintain User:CopyPatrolBot/UrlIgnoreList as it requires knowledge of regular expressions. I had an idea that we could get rid of the on-wiki lists and instead have a button "Ignore URLs like this" directly in the CopyPatrol UI. We could do the same for users, too, so you don't have to edit User:CopyPatrolBot/UserIgnoreList. This is also nice because the new system has the ignore lists centralized on Meta, where not everyone is necessarily able to edit (the page could be semi-protected).

The only issue I foresee with this idea is the potential for abuse. For that, I was thinking we'd restrict the ability to ignore URLs and users to "privileged" users, say those with at least 1,000 edits, or even restrict it to sysops. Another option is to go ahead and shield all of CopyPatrol from newbies, as proposed at phab:T178700.

Thoughts? MusikAnimal (WMF) (talk) 19:43, 11 April 2024 (UTC)

I can't imagine any issues with this for URLs. With users, making it too easy, even for admins (who are humans), to exclude users may lead to unintentional removals of users who should be flagged, or people being too liberal with the ignore button.
When there are errors on the wikitext list, they can simply be rectified by another user. Would there be a way to "un-ignore" users in case of errors? Isochrone (talk) 20:28, 11 April 2024 (UTC)

I think it makes sense to have an interface to manage the ignored URLs and users. MusikAnimal (WMF) (talk) 22:43, 14 April 2024 (UTC)
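For readers unfamiliar with why a regex-based URL ignore list is tedious to maintain by hand, here is a minimal sketch of how such a list might be checked and how an "Ignore URLs like this" button could generate an entry. The patterns, the domain example-mirror.com, and the function names is_ignored and pattern_for_host are all hypothetical; none of them come from User:CopyPatrolBot/UrlIgnoreList or from CopyPatrol's actual code.

```python
import re
from urllib.parse import urlsplit

# Hypothetical ignore-list entries, for illustration only.
# Dots must be escaped and the host anchored, otherwise a pattern
# such as "wikipedia.org" would also match unrelated domains.
IGNORE_PATTERNS = [
    r"^https?://([a-z0-9-]+\.)*wikipedia\.org(/|$)",
    r"^https?://([a-z0-9-]+\.)*example-mirror\.com(/|$)",
]

COMPILED = [re.compile(p, re.IGNORECASE) for p in IGNORE_PATTERNS]


def is_ignored(url: str) -> bool:
    """Return True if the URL matches any ignore-list pattern."""
    return any(p.search(url) for p in COMPILED)


def pattern_for_host(url: str) -> str:
    """Build an ignore pattern covering the URL's host and its subdomains.

    This is the kind of pattern a UI button could generate so that
    editors never have to write the regex themselves.
    """
    host = urlsplit(url).hostname or ""
    return r"^https?://([a-z0-9-]+\.)*" + re.escape(host) + r"(/|$)"


if __name__ == "__main__":
    for candidate in (
        "https://en.wikipedia.org/wiki/Foo",
        "https://notwikipedia.org/page",
        "https://archive.org/details/x",
    ):
        print(candidate, is_ignored(candidate))
```

Generating the pattern from the hostname with re.escape keeps the escaping and anchoring rules out of editors' hands, which is the main appeal of moving the list into the UI.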

CopyPatrol has stopped, but..

CopyPatrol has stopped, because Turnitin is down for maintenance. Check https://turnitin.statuspage.io/ for updates. Diannaa (talk) 19:36, 20 April 2024 (UTC)

I keep getting logged out

Maybe I'm losing my mind, but I find myself logged out of CopyPatrol numerous times each week, even though I don't close the window or log out of OAuth at all. I swear the only time I had to log in to CopyPatrol on the old system was when I rebooted. Is there a setting somewhere I can change to keep me logged in, or is this the new normal? Thanks enL3X1 ¡‹delayed reaction›¡ 00:53, 28 April 2024 (UTC)

@L3X1 This was brought up in the feedback above. I have just deployed a change that I hope will help. Please let me know if it does (same for @Diannaa and everyone else :). The larger issue is rather a mystery, I'm afraid. I hope to investigate it more soon. You can follow phab:T224382 for updates. Best, MusikAnimal (WMF) (talk) 05:14, 28 April 2024 (UTC)

Deborah Morris and John Franklin

I saw this in my watchlist...

Potential copyright violation log b 22:10 CopyPatrolBot talk contribs marked revision 1221979003 on Deborah Morris and John Franklin as a potential copyright violation ‎ Tag: PageTriage

but the only thing that I am finding is the duplication of a long title in the Bibliography: The Morris family of Philadelphia, descendants of Anthony Morris, born 1654-1721 died. It seems that I can post a comment somewhere related to this... but I forgot where. Where can I provide a comment on this? Thanks so much! CaroleHenson (talk) 05:29, 3 May 2024 (UTC)

I found the log here: https://copypatrol.wmcloud.org/en. It identifies a source I did not use, but it has content that is in the article in a quote from another source: "https://archive.org/details/havilandgenealog00fros/page/210/mode/1up?q=%22no+longer+able+to+hear%22". It's a quote - and the source is from 1914, so not a copyright violation.

The source that I used was an 1893 newspaper article: "Old-Time New-York Friends: Services of the 'Plain People' in Revolutionary Days". The New York Times. November 11, 1893. p. 16. Retrieved May 2, 2024.

I couldn't figure out how to add a comment that it's not a copyright violation. CaroleHenson (talk) 05:51, 3 May 2024 (UTC)

@CaroleHenson When using quotes it's normally enough to add the reference. If you use public domain content outside of quotes, then you need to add the attribution template. I've marked the report as no action needed since you already added the reference. Nobody (talk) 08:15, 3 May 2024 (UTC)

Great, thanks! CaroleHenson (talk) 19:44, 3 May 2024 (UTC)

Bugged

Got told that this edit was a "potential copyright violation". Creating new league tables. No links added, no tables copied from elsewhere. Usually you can guess how/why a bot has flagged something incorrectly, but if that edit is tripping it, it can only be bugged to hell. Kingsif (talk) 03:59, 2 July 2024 (UTC)

This case was marked as a false positive. The headers on the tables caused the report to be generated. Sorry for the inconvenience. Diannaa (talk) 12:09, 2 July 2024 (UTC)
