Talk:CopyPatrol/Archives/2018

A third option?

On a positive note, the tool's been working well for me for some time (on rare occasions, there is a delay loading).

However, there is an issue that has been in the back of my mind for some time and having run across a couple examples in the last couple days I thought I'd raise it as an issue.

At present, after reviewing an identified issue, we have exactly two possible responses:

  1. Page fixed
  2. No action needed

I submit there is a third option. I'm not sure of the best title for it so let me explain the situation and see if you think I am making sense.

In some cases, we will see that some other editor has already tagged the article for deletion. In some cases, it might be tagged as a G12 using the exact source identified by this tool, and I would have no problem choosing the "page fixed" option.

However, it might be a G12 with a different source identified, or it could be any number of other CSD options, or it might be a prod or AfD.

My current practice is to continue looking into the copyright issue, and if I see a clear indication of a copyright problem, I will: A. add the template (in the case of a prod) or B. change the template to a Db–multiple. However, there are situations in which it is going to take a little time to investigate the copyright issue. Sometimes the identified source is a .doc, sometimes the identified source isn't the real source and you have to poke around the site to find the real page, sometimes the linked source doesn't load and I resort to doing a Google search for selected text, as well as other issues. Perhaps the best example of such an issue is if the source is paywalled. We can use the Resource Request to get access to the original source but that's more work than can be justified when the article is likely to go away in a few days for other reasons.

However, if I choose either of the existing options, the item will be removed from the review. If I choose neither, then the next editor working on this project will have to go through the same issues.

My ideal option would be a third option which I tentatively label "on hold". In an ideal world, the tool would remove the item from the list but monitor it, and either remove it from the list if it ends up being deleted or restore it to the list if the rationale for the deletion is reviewed and rejected. I suspect that's asking too much, so my second best option would be that choosing the "on hold" button would remove it from the list temporarily and it would magically reappear seven days later, at which time the AFD, prod or CSD has almost certainly been reviewed and acted upon. If the article has been deleted it will be an easy item to dispose of, but if it hasn't been deleted then we will undertake the task of determining whether there is a copyright violation.

@Diannaa, L3X1, Crow, and MER-C: Alternatively, it may simply be that I need to change my approach. If any of the regulars have some thoughts on how they address these situations I will be happy to learn from them.--Sphilbrick (talk) 14:57, 18 September 2017 (UTC)

When I have come across a situation like that I usually marked it No Action Needed, and I never thought about it afterward. I think the "On hold" idea is a good one, perhaps with a 255 character comment box, so that others will know why it was originally put on hold. d.g.L3X1 delayed reaction 15:33, 18 September 2017 (UTC)
My main concern is removal and revision deletion of the copyright violation, so I will do those steps if the article is nominated for deletion as a prod or is at AFD. If it's (for example) nominated for deletion as an A7, I will either delete the article myself or change the deletion nomination to a Db–multiple and will watch-list to make sure that the article is deleted and will remove the copyvio if it is not. Cases that are already tagged as G12 speedy deletion I will usually delete myself or will watch-list and perform revision deletion if the person assessing it is able to save the article and the copyvio is still visible in the edit history. Content behind a paywall is usually visible via the iThenticate link or in an archived version via the Wayback machine. Diannaa (talk) 20:12, 18 September 2017 (UTC)
  • I typically handle it much as SP does. Being a civvy, all I can do is tag for G12, request RD1, etc, so when I do that (or a page is already G12 tagged), I call it "Fixed" but I then also Watchlist the page as I do for G12 noms that I make. Tags often get removed for GF or BF reasons, so I keep an eye on it until it goes red or I see the "Remove un-needed tag" edit summary. As far as AfD or Prod tags, those discussions do not (to me, correct me if I'm wrong) preclude a G12 speedy if they are in fact irrecoverable copyvios. I've seen many times where a company's web page is pasted in, and a NPP'er tags it for Advert when it is also a Copyvio. That said, a third option with a comment field would be nice to avoid repeating too much work if a reviewer is unsure... they may have thought of something another reviewer may not, so it adds a small layer of collaboration to the tool. Crow (talk) 23:14, 18 September 2017 (UTC)
This is an excellent example showing why it would be nice to have the ability to add something to one's watchlist with an expiration date. My watchlist is over 7000 entries and that's net of spending quite a bit of time trying to prune it. In almost all cases, an article I review as part of copypatrol is not one I want to leave on my watchlist long-term, but like Crow I might want to monitor it for a few days. Having the ability to add it to a watchlist with a 30 day expiration would be ideal.--Sphilbrick (talk) 13:36, 4 October 2017 (UTC)
If you're going to suggest the idea, I will heartily endorse it. I had to change my Twinkle settings because it was filling my watchlist with all sorts of pages that I really only needed to watch for a few days or weeks. enL3X1 ¡‹delayed reaction›¡ 14:01, 5 October 2017 (UTC)
That's an interesting idea. I recall that when CopyPatrol was first launched we did have three response buttons but I can't remember what the third one was for. I like the idea of "On hold". Replying inline a bit -
My ideal option would be a third option which I tentatively label "on hold". In an ideal world, the tool would remove the item from the list but monitor it, and either remove it from the list if it ends up being deleted or restore it to the list if the rationale for the deletion is reviewed and rejected. - While I love the idea, I wish it was simple to implement. While the tool can easily remove the record from the list if it's deleted, it's a whole other level of challenge to figure out if the deletion is rejected.
I suspect that's asking too much, so my second best option would be that choosing the "on hold" button would remove it from the list temporarily and it would magically reappear seven days later, at which time the AFD, prod or CSD has almost certainly been reviewed and acted upon. If the article has been deleted it will be an easy item to dispose of, but if it hasn't been deleted then we will undertake the task of determining whether there is a copyright violation. - I have another idea. What if when you put a record "On hold", it appears in a new tab (say "Cases on hold", like All cases, Open cases, Reviewed cases) and gets removed from the "Open cases" tab. So you can go look at that tab once in a while and sort out those records into "Page fixed"/"No action needed". When you mark it as "On hold", it gets a timestamp underneath it, so you can skip looking at records that were put on hold yesterday or so and focus on older records. This would be much easier and quicker to do code-wise and would sufficiently serve the purpose as far as I can tell. Thoughts? -- NKohli (WMF) (talk) 22:29, 10 October 2017 (UTC)
I fully understand the distinction between identifying a third option "on hold" which simply flags the item, and the ideal situation where the software might detect that the held item can be closed. That ideal situation sounds challenging, and I'd support it if someone said it turns out to be easy, but that doesn't sound likely. I would be perfectly happy with the result that it simply saves these items somewhere for future review. That does, of course, mean the reviewers will have to modify their own processes and remember to take a look at those items on occasion. I haven't thought through the right way to do that, but we can sort that out.--Sphilbrick (talk) 14:45, 20 October 2017 (UTC)
@NKohli (WMF): While I sort of responded, I didn't really comment on your proposal. In short, I fully concur. I agree that simply moving the cases to an “On-hold” tab (presumably a sixth radio button after “Drafts only”) would achieve a useful purpose. Adding the time stamp would be helpful. Today is a perfect example of why it would be helpful. I found four articles where the source material is CC licensed, but under CC 4.0, not CC 3.0. My understanding is that this is not compatible with Wikipedia, so I nominated them for CSD. There's a chance someone will decline my CSD, thinking that CC licensed material is fine. While in theory I could watchlist the page, my watchlist is too long, and I may miss seeing the declination. Having the “on-hold” option means I can look back in a few days, and if the article has been deleted, I can change the status. Additionally, I think the Foundation is working on the incompatibility between Wikipedia and CC 4.0, and if they happen to have resolved it, then the proper action is to decline my CSD. However, I am not likely to know this if I had marked the page as fixed.
In summary, strong support for an “On hold” option as you described.--Sphilbrick (talk) 14:56, 28 November 2017 (UTC)
Sounds good. Are you on Phabricator, by any chance? If so, filing a ticket requesting this (under the CopyPatrol project) would be very helpful. If not, I can take care of it. Thanks. -- NKohli (WMF) (talk) 15:18, 28 November 2017 (UTC)
@NKohli (WMF): Yes, I do have Phabricator access, but I don't feel I know my way around it very well. I took a quick glance at your link, which seems to list the backlog as well as items found, and I did not see it in the list. I'm happy to add it but obviously don't want to if you did and I somehow missed it.--Sphilbrick (talk) 16:58, 1 February 2018 (UTC)
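As a rough illustration of the "Cases on hold" tab proposed above, here is a minimal Python sketch. It is not CopyPatrol's code: the `Case` class, the status names, and both helper functions are hypothetical, chosen only to show a held record carrying a timestamp and a short comment so that reviewers can skip recent holds and focus on older ones. The 255-character limit and the seven-day interval are the figures suggested in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Hypothetical status names; the tool's real schema is not shown in this thread.
OPEN, FIXED, NO_ACTION, ON_HOLD = "open", "fixed", "no_action", "on_hold"


@dataclass
class Case:
    page: str
    status: str = OPEN
    held_at: Optional[datetime] = None  # set when the case is put on hold
    comment: str = ""                   # why it was put on hold


def put_on_hold(case: Case, now: datetime, comment: str = "") -> None:
    """Move a case to the on-hold tab, stamping it with the time and a short note."""
    case.status = ON_HOLD
    case.held_at = now
    case.comment = comment[:255]  # 255-character comment box, as suggested above


def held_longer_than(cases: List[Case], now: datetime,
                     min_age: timedelta = timedelta(days=7)) -> List[Case]:
    """Return on-hold cases old enough that the AfD, prod or CSD has likely been decided."""
    return [
        c for c in cases
        if c.status == ON_HOLD and c.held_at is not None and now - c.held_at >= min_age
    ]
```

A reviewer visiting the tab would then call `held_longer_than(cases, datetime.now(timezone.utc))` and sort the returned records into "Page fixed" or "No action needed".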

Bot is down

Tracked in Phabricator:
Task T185163 resolved

(Cross posting) The copypatrol page has no new listings for more than eight hours. I suspect the bot is down. Could you take a look please when you get a chance? Thank you. Diannaa (talk) 00:44, 28 December 2017 (UTC)

I reloaded the page and it is empty. enL3X1 ¡‹delayed reaction›¡ 15:57, 18 January 2018 (UTC)
We are looking into this (Diannaa's report from last month was an unrelated incident, I believe). The issue now is we've hit our limit with iThenticate, which is the service used to detect plagiarism. You can follow phab:T185163 for updates, but we'll comment here once we know more. Thanks and apologies for the downtime! MusikAnimal (WMF) (talk) 22:30, 18 January 2018 (UTC)

CopyPatrol not collecting records currently

Hi all, as you might have noticed, CopyPatrol hasn't had any new records in the last day. CopyPatrol makes use of the data Eranbot collects for displaying records. Eranbot in turn makes use of a service called Turnitin for finding copyright violations in edits. We have a quota with this service, which has apparently been exceeded. We're working on getting this fixed and will keep you posted. You can follow updates on this on task T185163. -- NKohli (WMF) (talk) 22:59, 18 January 2018 (UTC)

This has been resolved now. Thanks everyone! -- NKohli (WMF) (talk) 19:01, 19 January 2018 (UTC)

Which wiki has the worst amount of copyvios?

I've been taking a look at the Spanish and French copyPatrols and was wondering if there are any metrics as to which of the 4 wikis is afflicted with the most copyright violations? thanks, enL3X1 ¡‹delayed reaction›¡ 02:27, 14 February 2018 (UTC)

Interesting question! Those metrics are a bit hard to compute because we didn't start patrolling all the wikis at the same time. We started off with English as a pilot and opened up to more wikis, if they wanted to use it. I think we can safely say that of the wikis CopyPatrol currently runs on, English has had the greatest number of copyvios come up. -- NKohli (WMF) (talk) 02:49, 14 February 2018 (UTC)

Bot down?

Hasn't refreshed in over 90 minutes, only displays one edit. enL3X1 ¡‹delayed reaction›¡ 02:15, 16 March 2018 (UTC)

NVM enL3X1 ¡‹delayed reaction›¡ 12:48, 16 March 2018 (UTC)

An issue triggered by copying within Wikipedia (but failing to note the copy in the edit summary)

Many editors, even experienced ones, are often unaware that our attribution requirements create a need, in the case an editor copies material from another Wikipedia page, to identify the source of the material. This is typically done with an edit summary and we have outlined best practices at: https://en.wikipedia.org/wiki/Wikipedia:Copying_within_Wikipedia

Not surprisingly, an editor copying from one article to another will trigger an entry at CopyPatrol, not because CopyPatrol is looking for edits which match (or closely resemble) material within Wikipedia, but because that material may be picked up by another site. In theory, that other site should note the source of the material and provide attribution, but we all know that many sites copy Wikipedia material (acceptable), fail to attribute it (not acceptable) and may even assert full copyright over it (also unacceptable).

Obviously, in the cases where the copied material is properly attributed, volunteers working on CopyPatrol are likely to note the attribution, and not revert as a copyvio. However, it is much better if editors follow the best practices, as I am certain that all editors working in CopyPatrol will look at the edit summary, and the notation that it is an internal copy will forestall the need for a revdel.

So what's the point of this missive?

Would it be possible for the CopyPatrol software, when identifying an edit to article X, to check Wikipedia to see if there is similar language at articles OTHER than article X? Including that information may help editors identify the edit as an improper copy-within-Wikipedia edit. It is still a copyright problem, but we handle such situations differently.

And, as this exchange demonstrates, editors who are unaware of the (admittedly not well-known) requirements sometimes take umbrage. I don't enjoy such interactions, and I am sure the other editor is less than a happy camper. Identifying such situations in advance would be much better for everyone.--Sphilbrick (talk) 22:58, 8 May 2018 (UTC)

I'd like to try again to see if I can get some feedback on this proposal. Multiple hours have been wasted because an editor copied material from another article, was unaware of the need to include a note in the edit summary, and the material also showed up in an external site which mentioned Wikipedia but then went on to claim full copyright over the content, resulting in a reversion by me, and an understandably angry editor (incident here). I think things have been smoothed over, but it has necessitated hours of work by too many people. If CopyPatrol had identified that the material also existed in an existing Wikipedia article other than the one in which the edit is being made, all of this would've been avoided. It sounds easy to me to do this check. Am I missing something?--Sphilbrick (talk) 17:50, 12 June 2018 (UTC)
Hi Sphilbrick. Sorry for the late reply to this. I missed your post earlier. As you know, the data in CopyPatrol comes from Eranbot, which (if memory serves me right) maintains a blacklist of sites and mirrors that it filters out when detecting copyright violations. It's also possible that Turnitin itself does this. I'd like to ask for User:ערן's input on this. -- NKohli (WMF) (talk) 18:23, 12 June 2018 (UTC)
This is correct - whenever we encounter a new Wikipedia mirror it is good practice to add it to User:EranBot/Copyright/Blacklist so we can skip text from those sites the next time. The bot doesn't do an internal search in Wikipedia itself (yet) for the added text, as it is somewhat "heavy" (it would probably need a few search queries for sample sentences from the diff). eranroz (talk) 21:20, 12 June 2018 (UTC)
I understand the need to add true mirrors to the list so they don't generate false positives. I trust it is clear that a single article, or portion of an article copied into a blog post does not belong in the mirror list. I think it is such examples that are causing the problem. I'm motivated to post today because two such examples occurred today, one of which was rather contentious, and the other not, perhaps because it was noticed before too much reaction.
In both cases, it looks like a legitimate copyright issue to the reviewer (me) and to the deleting admin @RHaworth:, and the editor is perplexed because they know nothing about the site for which there is matching text. Obviously, this is a nonproblem if editors know to mention the copy in the edit summary, but many editors are unaware of this requirement, and it isn't obvious how to identify such editors; it could be anyone. That's why I would be happy if CopyPatrol did a search of Wikipedia, and gave us a heads up if there is a close match.--Sphilbrick (talk) 21:48, 9 August 2018 (UTC)
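To make the "few search queries for sample sentences from the diff" idea above concrete, here is a minimal Python sketch. It is not EranBot's code. It assumes the check would use the public MediaWiki search API (`action=query&list=search`) with a quoted phrase; the function names, the choice of the longest sentences as samples, and the thresholds are all hypothetical. The sketch only builds the request URLs and filters an already-parsed response, so it makes no network calls.

```python
import re
from typing import Dict, List
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def sample_sentences(added_text: str, count: int = 3, min_words: int = 8) -> List[str]:
    """Pick up to `count` of the longest sentences from the text added by an edit."""
    sentences = re.split(r"(?<=[.!?])\s+", added_text.strip())
    long_enough = [s for s in sentences if len(s.split()) >= min_words]
    long_enough.sort(key=len, reverse=True)  # longer sentences are more distinctive
    return long_enough[:count]


def search_url(sentence: str) -> str:
    """Build a MediaWiki API URL that searches articles for the exact phrase."""
    phrase = sentence.replace('"', " ").rstrip(".!?")
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'"{phrase}"',
        "srnamespace": 0,  # article namespace only
        "srlimit": 5,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def other_articles(response: Dict, edited_title: str) -> List[str]:
    """From a parsed search response, list matching titles OTHER than the edited article."""
    hits = response.get("query", {}).get("search", [])
    return [h["title"] for h in hits if h["title"] != edited_title]
```

If `other_articles` returns anything for a flagged edit, the tool could show those titles next to the external match as a heads-up that the text may be an unattributed copy within Wikipedia.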

Welcome!

Nice to see some new faces in CopyPatrol! Welcome! enL3X1 ¡‹delayed reaction›¡ 02:44, 10 July 2018 (UTC) enL3X1 ¡‹delayed reaction›¡ 01:07, 16 September 2018 (UTC)

Diffs not directing properly

Noticed this today, many but not all of the diffs in the CPatrol page go to invalid locations. Compare the actual edit [1] with the reported one [2] Crow (talk) 18:42, 12 August 2018 (UTC)

Chiming in with the same observation. I saw this popping up a couple weeks ago, hoped it was a momentary glitch, but it doesn't seem to be going away. On the one hand it isn't a showstopper: one can click on the view history, go back and check the time and date of the edit, find the edit, and then track down the diff. But when one is trying to handle dozens each day, adding several extra steps is annoying. Here are three examples I came across in the last hour:
[3]
[4]
[5]--Sphilbrick (talk) 14:22, 20 August 2018 (UTC)
This is phab:T201218. It should at least be showing the content of the right revision, but the "change visibility" link (if you are an admin) is not available because MediaWiki thinks it's an invalid revision. I'm not sure if we should wait for a fix or change our links to point to oldid=123 instead of diff=123. For revisions that aren't the first to the page, using diff=123 is preferred so you can easily see the content added with the edit in question. MusikAnimal (WMF) (talk) 15:17, 20 August 2018 (UTC)
Thanks for the prompt response. Happy to see that it is being tracked and investigated. I do agree that it seems to be showing the right content.--Sphilbrick (talk) 17:33, 20 August 2018 (UTC)
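The oldid-versus-diff choice discussed above can be sketched as follows. This is a hypothetical helper, not the tool's actual link code. It assumes the affected links are those for a page's first revision (which the thread implies but does not state outright), and that a parent revision ID of 0 marks such a revision.

```python
from urllib.parse import urlencode

INDEX = "https://en.wikipedia.org/w/index.php"


def revision_link(rev_id: int, parent_id: int) -> str:
    """Link to a revision: use oldid for a page's first revision, diff otherwise."""
    # Assumption: parent_id == 0 means there is no earlier revision to diff against.
    key = "oldid" if parent_id == 0 else "diff"
    return f"{INDEX}?{urlencode({key: rev_id})}"
```

For example, `revision_link(123, 0)` yields a link ending in `?oldid=123`, while `revision_link(123, 100)` yields one ending in `?diff=123`, which shows the content added by the edit in question.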

Small queue today

Less than a dozen tickets the 4 or 5 times I checked today. Is something broken? enL3X1 ¡‹delayed reaction›¡ 01:49, 2 October 2018 (UTC)

I haven't heard, noticed the same thing, but it seems back in action.--Sphilbrick (talk) 13:38, 2 October 2018 (UTC)
Yup, back to full capacity on my end again, thanks all. enL3X1 ¡‹delayed reaction›¡ 14:39, 2 October 2018 (UTC)
There was a hiccup with our account on Turnitin expiring but thanks to eranroz and Doc James, it was quickly restored. :) -- NKohli (WMF) (talk) 15:45, 2 October 2018 (UTC)
Yes, will work to stay on top of this. iThenticate / Turnitin has been an amazing partner and has been very generous in giving us access to a large number of queries on their API. Doc James (talk · contribs · email) 17:36, 2 October 2018 (UTC)

Mirrors

  • en.m.wikipedia.nom.si
  • www.wikizero.com
  • wiki2.org

These domains should probably be whitelisted as mirrors of Wikipedia. TheDragonFire (talk) 13:00, 15 December 2018 (UTC)

TheDragonFire Hi! The blacklist is maintained here - User:EranBot/Copyright/Blacklist. Feel free to add it. :) -- NKohli (WMF) (talk) 05:36, 16 December 2018 (UTC)
  Done. Thanks. :) TheDragonFire (talk) 06:29, 16 December 2018 (UTC)
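For illustration only, skipping matches from known mirrors could look something like the Python sketch below. The actual format of User:EranBot/Copyright/Blacklist and the way the bot applies it are not described in this thread, so the plain set of domains and the `is_mirror` helper are assumptions; the three domains are the ones listed above.

```python
from urllib.parse import urlsplit

# The three domains reported above; the real list lives on-wiki and may use another format.
MIRRORS = {"en.m.wikipedia.nom.si", "www.wikizero.com", "wiki2.org"}


def is_mirror(url: str, mirrors: set = MIRRORS) -> bool:
    """True if the URL's host is a listed mirror domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == m or host.endswith("." + m) for m in mirrors)
```

A match whose source URL satisfies `is_mirror` would be dropped before it reaches reviewers, since the text came from Wikipedia in the first place.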