Community Wishlist Survey 2019/Citations/Automatic web archive

The survey has concluded.

Automatic web archive

Problem: Many people forget to archive web pages when they cite them as sources. That makes verification harder later, if and when a page has changed or the URL no longer works.

Who would benefit: Everyone who takes care of broken web links, and all readers, since the cited information stays documented.

Proposed solution: Since we probably will not get our own archive, every external web link should automatically be archived in the web archive [1]

More comments:

Phabricator tickets: phab:T236252

Proposer: ZellmerLP (talk) 11:38, 30 October 2018 (UTC)

Discussion

@ZellmerLP: This sounds related to m:InternetArchiveBot? --AKlapper (WMF) (talk) 12:48, 30 October 2018 (UTC)

@AKlapper (WMF) and ZellmerLP: IABot just finds the archived copies; the archival of sources is done separately by a script that looks through recent changes on certain WMF wikis. I believe the relevant task for this request would be phab:T199193, since archival is already performed on most new URLs anyway. (As I noted there, a significant portion of the URLs currently not being archived correctly are likely unarchivable because of limitations of the Internet Archive, and not because the bot can't find the URLs fast enough.) Jc86035 (talk) 13:11, 30 October 2018 (UTC)

As noted at task T199193, instant archiving of new references is something we're already looking to work with InternetArchive to accomplish as part of the Knowledge Integrity program :) Samwalton9 (WMF) (talk) 13:15, 30 October 2018 (UTC)

@Samwalton9 (WMF): Is the WMF still planning to use the original idea of archiving sources instantaneously? I think it could be valuable, but it would be a little disappointing if it's simply decided that any page with a robots.txt or with dynamic content doesn't need to be properly archived, and to me it seems quite odd to ignore them in a plan called "Knowledge Integrity". That these are perhaps inherent limitations of the Internet Archive's software doesn't mean that it couldn't be done differently. (As stated in the task, I would think most URLs don't disappear between their addition to Wikipedia and their archival by IA a day or so later.) Jc86035 (talk) 13:43, 30 October 2018 (UTC)

Honestly we haven't really got into the details on this task yet - it's dependent on the citation event stream which is being worked on first. Your comment on that task is a really great overview of the limitations, however, and we'll make sure to take that into consideration when moving ahead with this. Samwalton9 (WMF) (talk) 13:50, 30 October 2018 (UTC)

@ZellmerLP: It appears to me that InternetArchive already does this as part of their work on IABot. Cyberpower678 might be able to provide more insights. -- NKohli (WMF) (talk) 21:47, 30 October 2018 (UTC)

I think this is the same as or very similar to Community Wishlist Survey 2016/Categories/Bots and gadgets#Automatic links to Internet Archive, which is about automatic archiving at the time the link is saved. IABot still does a fantastic job of doing this after the fact, so I wonder if that is sufficient. MusikAnimal (WMF) (talk) 22:42, 30 October 2018 (UTC)

Might Perma.cc be relevant here? Apparently it's run by some libraries and archives on request by anyone; it's specifically designed to prevent linkrot in academic texts. HLHJ (talk) 02:36, 31 October 2018 (UTC)

@HLHJ: I personally think it's nice but not large enough to have much of a noticeable effect. The Internet Archive is basically doing the same thing already, but on an industrial scale (the Wayback Machine has about five and a half orders of magnitude more captures than perma.cc). Jc86035 (talk) 13:14, 31 October 2018 (UTC)

We have a ticket open to do this in real time with mw:citoid at task T115224 (it was assigned to me with high priority), but it is currently stalled because the patch increased response time dramatically. This could potentially be revisited using the IABot service, which may be fast enough, though I haven't tried that. Citoid development is currently frozen due to deployment issues, but hopefully it will be unfrozen soon. Mvolz (WMF) (talk) 12:12, 6 November 2018 (UTC)
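
For illustration only, the lookup discussed above (finding an existing archived copy of a cited URL) could be sketched in Python roughly as follows. This is not how IABot or the archival script actually works. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available and the JSON shape used in the sample below; the function names are hypothetical. No network request is made here: the parser works on response text that has already been fetched.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the lookup URL for a cited source (hypothetical helper)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived copy's URL from a lookup response, or None.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    and that an empty "archived_snapshots" object means no copy exists.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Sample response text, written by hand for this sketch.
sample = json.dumps({
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20181030000000/http://example.com/",
            "timestamp": "20181030000000",
        }
    },
})

print(availability_query("example.com"))
print(closest_snapshot(sample))
```

A URL with no usable snapshot simply yields None, which is the case where a separate archival step (as described by Jc86035) would still be needed.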

Voting

Support I try to do this manually (archive the web page, then add archiveurl etc. to my citation) because linkrot is such a major problem for us. I also use IABot, but it often doesn't seem to add archiveurl etc. for PDF files on the web (I have no idea whether this is a bug or intentional), so I am hesitant to rely on it. Also, as a VE user, the fact that someone saw fit to remove the archiveurl/archivedate/deadurl fields from the default configuration of the cite-news template just makes the job even harder. So, yes, please, give me a one-touch archive-this-and-update-my-citation-accordingly button on a per-citation basis, and also as a tool in the "whole of article" menu to do it for the whole article. Kerry Raymond (talk) 23:04, 16 November 2018 (UTC)
Support Rs chen 7754 02:40, 17 November 2018 (UTC)
Support Liuxinyu970226 (talk) 03:48, 17 November 2018 (UTC)
Support Jo-Jo Eumerus (talk, contributions) 10:08, 17 November 2018 (UTC)
Support Libcub (talk) 10:31, 17 November 2018 (UTC)
Support Nimrodbr (talk) 10:37, 17 November 2018 (UTC)
Support --Alaa :)..! 10:41, 17 November 2018 (UTC)
Oppose, as WMF is already planning to do this as part of Knowledge Integrity. Jc86035 (talk) 10:43, 17 November 2018 (UTC)
Support ديفيد عادل وهبة خليل 2 (talk) 10:56, 17 November 2018 (UTC)
Support Blue Rasberry (talk) 15:32, 17 November 2018 (UTC)
Support Tremendo (talk) 16:49, 17 November 2018 (UTC)
Support Cabayi (talk) 17:27, 17 November 2018 (UTC)
Support Theklan (talk) 18:10, 17 November 2018 (UTC)
Support Vwanweb (talk) 19:18, 17 November 2018 (UTC)
Support Nk (talk) 19:56, 17 November 2018 (UTC)
Support HLHJ (talk) 22:49, 17 November 2018 (UTC)
Support Tim Landscheidt (talk) 00:26, 18 November 2018 (UTC)
Support Wunkt2 (talk) 02:53, 18 November 2018 (UTC)
Support Temp3600 (talk) 05:50, 18 November 2018 (UTC)
Support AHeneen (talk) 06:22, 18 November 2018 (UTC)
Support NMaia (talk) 10:21, 18 November 2018 (UTC)
Support It should not be limited to just WebArchive; WebCite and Archive.li should be used as well. The more archived copies of sources, the better. WebArchive can't cover everything on its own: many websites reject crawlers through their robots.txt files, and in those cases InternetArchiveBot will be useless. I'd like to see a bot that automatically creates archived copies of sources on the aforementioned websites. Arbeite19 (talk) 10:56, 18 November 2018 (UTC)
Support Eddie891 (talk) 12:39, 18 November 2018 (UTC)
Oppose Many spamlinks are added to wikis every day and these should not be archived. Dvorapa (talk) 17:26, 18 November 2018 (UTC)
Support But take Dvorapa's comment above into account. — Draceane (talk · contrib.) 17:57, 18 November 2018 (UTC)
Support Ninovolador (talk) 20:08, 18 November 2018 (UTC)
Support Wesalius (talk) 21:17, 18 November 2018 (UTC)
Support -- Whats new? (talk) 22:24, 18 November 2018 (UTC)
Support Waddie96 (talk) 07:46, 19 November 2018 (UTC)
Support I'm hitting support, but isn't this already done / kind of done? ·addshore· (talk to me!) 10:02, 19 November 2018 (UTC)
Support β₁₆ - (talk) 10:46, 19 November 2018 (UTC)
Support Sadads (talk) 17:56, 19 November 2018 (UTC)
Support StringRay (talk) 22:01, 19 November 2018 (UTC)
Support Iridescent (talk) 10:21, 20 November 2018 (UTC)
Support Gareth (talk) 11:33, 20 November 2018 (UTC)
Support Vulp here 12:46, 20 November 2018 (UTC)
Support Philk84 (talk) 14:01, 20 November 2018 (UTC)
Support Lostinlodos (talk) 19:25, 20 November 2018 (UTC)
Support CAPTAIN RAJU (T) 22:37, 20 November 2018 (UTC)
Support Novak Watchmen (talk) 00:10, 21 November 2018 (UTC)
Support MYMMMC (talk) 05:33, 21 November 2018 (UTC)
Support Laboramus (talk) 07:26, 21 November 2018 (UTC)
Support Penegal (talk) 08:42, 21 November 2018 (UTC)
Support Ayoub Fajraoui (talk) 09:32, 21 November 2018 (UTC)
Support Skyman gozilla (talk) 19:59, 21 November 2018 (UTC)
Support In fact, this should be made another Wikimedia project. Igel B TyMaHe (talk) 19:27, 22 November 2018 (UTC)
Support Richard Nevell (talk) 22:29, 22 November 2018 (UTC)
Support Siddhant (talk) 01:22, 23 November 2018 (UTC)
Support SarahSV (talk) 05:46, 23 November 2018 (UTC)
Support Kagaoua (talk) 06:58, 23 November 2018 (UTC)
Support mik@ni 12:22, 23 November 2018 (UTC)
Support Dvorapa makes a good point. Maybe put it on a one week timer or something? Mbrickn (talk) 21:25, 23 November 2018 (UTC)
Support Hmxhmx 10:23, 24 November 2018 (UTC)
Support — AfroThundr (u · t · c) 02:00, 26 November 2018 (UTC)
Support HouseGecko (talk) 13:32, 26 November 2018 (UTC)
Support Jklamo (talk) 15:05, 26 November 2018 (UTC)
Support PMG (talk) 17:27, 26 November 2018 (UTC)
Support Dick Bos (talk) 20:39, 26 November 2018 (UTC)
Support This would be extremely useful! Nannochloropsis (talk) 21:24, 26 November 2018 (UTC)
Support Izno (talk) 01:00, 27 November 2018 (UTC)
Support YFdyh000 (talk) 17:11, 27 November 2018 (UTC)
Support Tiven2240 (talk) 12:00, 29 November 2018 (UTC)
Support Cymru.lass (talk) 20:19, 29 November 2018 (UTC)
Support Olea (talk) 21:17, 29 November 2018 (UTC)
Support This is really crucial, since links become unavailable too fast. Movses (talk) 14:17, 30 November 2018 (UTC)
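
Two ideas from the votes above, the "archive this and update my citation" step and the one-week delay suggested as a guard against spam links, could be sketched in Python along these lines. This is a rough sketch under stated assumptions, not an existing tool: the field names archiveurl, archivedate and deadurl are taken from Kerry Raymond's comment, the 14-digit timestamp format is assumed to be the one the Wayback Machine uses in snapshot URLs, and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def wayback_timestamp_to_date(timestamp: str) -> str:
    """Convert an assumed 14-digit timestamp (YYYYMMDDhhmmss) to YYYY-MM-DD."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime("%Y-%m-%d")


def add_archive_fields(citation: dict, archive_url: str, timestamp: str) -> dict:
    """Return a copy of the citation with archive fields filled in.

    A citation that already has an archiveurl is returned unchanged, so a
    manually chosen archive copy is never overwritten.
    """
    updated = dict(citation)
    if "archiveurl" in updated:
        return updated
    updated["archiveurl"] = archive_url
    updated["archivedate"] = wayback_timestamp_to_date(timestamp)
    updated["deadurl"] = "no"  # the original link is assumed to still work
    return updated


def ready_to_archive(added_at: datetime, now: datetime,
                     delay: timedelta = timedelta(days=7)) -> bool:
    """True once a link has survived on the wiki for the whole delay.

    The one-week default follows the "one week timer" suggestion; a link
    reverted as spam before then would never be queued.
    """
    return now - added_at >= delay


citation = {"url": "http://example.com/", "title": "Example"}
updated = add_archive_fields(
    citation,
    "http://web.archive.org/web/20181117000000/http://example.com/",
    "20181117000000",
)
added = datetime(2018, 11, 18, tzinfo=timezone.utc)
print(updated)
print(ready_to_archive(added, added + timedelta(days=7)))
```

Returning a new dictionary rather than modifying the input keeps the original citation available for comparison, which seems useful if an editor is asked to confirm the change.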