Community Tech/Migrate dead external links to archives/Notes

Email, Dec 16

User:Cyberpower678 is developing Cyberbot II to do this; it's been approved by the EN community. Phab task: T112881.

Our team will review the Cyberbot code, and provide support to get it to completion. Issues that we could help with:

Working effectively in multiple languages and multiple projects (Wikisource, Wikibooks)
Making the code for the bot maintainable and scalable across communities and wikis (what about archiving Wikidata or Commons urls and identifier sources for example)
Making sure that the capabilities are built into Citoid for automatic archiving (which is an API pull issue in part of Mediawiki software)
Making sure that we are preparing for structured citation data with Librarybase (and/or another Wikibase holding citation data, like Wikidata).

Internet Archive is excited about working with us, and has offered resources -- API support, code review.

There are a couple ways to approach the problem: Reactive (going through existing links and matching to the archive) and Proactive (when a link gets added, send it to IA and store the metadata, to use when the link dies). The proactive approach could be built into Citoid.

James Hare has proposed Librarybase, a structured database with all of our citation data. phab:T111066

Kaldari suggestion: use a parameter in the cite templates called deadlink, which we could populate with metadata.

Assessment meeting, Dec 17

Right now, Cyberbot is using the Internet Archive's API to tell the machine to archive pages and citations that it discovers on en.wp. If we want more features, or more languages, or if we want different data fed into our machine that's calling on the archives, IA would be interested in doing it, and putting energy behind it.

Ryan will talk to the api devs. We have a lot of use cases that they should know about, to add extra features to their apis.

Commons is another interesting use case. They'll need to archive URLs that are different from the regular source page. We want to keep track of the images that are there, maybe tag them in a different way.

Asaf is a good resource, he's very connected to IA, runs a library project and is close with volunteers.

What's lacking most right now is documentation on Cyberbot. That's going to be very important if devs from our team are interacting with his code and the logic he wants the bot to follow.

We need to go through the code and document its logic, then reflect back to Cyberpower and check that we're right. Is this a logic that can be applied to other langs? etc.

One great thing about cyberbot is that it responds to the social aspects as well. It posts on talk pages, leaves good edit summaries. We can look at other communities and see what the etiquette is, then see if there's a gap in logic.

Discussion with Cyberpower and Mark G. from Internet Archive, Dec 22

Cyberpower had a working single-thread bot that was slow, but functional, and was reliably replacing dead links every day.

Concerned about the speed, Cyberpower created a new multi-thread version that has a couple problems:

There's a memory leak that causes the bot to crash every few days. After a crash, it starts over again from the top, so it's not making a lot of progress.

The multi-thread bot is pushing so many API requests to the IA servers that it's overloading them; the requests need to be throttled back.
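One way to picture the throttling: a small rate limiter that every worker thread has to pass through before it calls the API. This is only a sketch in Python (Cyberbot itself is PHP); the `Throttle` class, its parameters and the rate used in the usage note are hypothetical, not anything from Cyberbot or IA.

```python
import threading
import time


class Throttle:
    """Enforce a minimum interval between requests across all threads."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = 0.0
        self._lock = threading.Lock()

    def wait(self):
        """Block until the next request is allowed; return seconds waited."""
        with self._lock:  # holding the lock while sleeping serializes callers
            now = self._clock()
            delay = self._next_allowed - now
            if delay > 0:
                self._sleep(delay)
                now += delay
            self._next_allowed = now + self.min_interval
            return max(delay, 0.0)
```

Each worker would call `wait()` on one shared `Throttle` right before sending a request, so the combined request rate stays under the limit no matter how many threads are running. The right limit is something to agree with IA.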

Current goal:

On English Wikipedia, there's a Dead link template that adds a page with a dead link to Category:All articles with dead external links. That category currently has 134,000 articles. So far the bot has processed about 6,000, although people keep adding the template to more pages. We can track progress by focusing on getting that number down.

When we've processed and fixed that set of 134,000 (to the extent that we can), then we can work on the bigger job of monitoring every EN article, and then moving on to other languages and other projects.

Things to work on:

Use the older single-thread code, get an instance running reliably, backport the CDX API functionality from the multi-threaded version.

At the same time, work on the multi-threaded version -- fixing the memory leak, throttling, work with the API to get a good balance of speed and performance.

It would help to abstract some of the threading functionality -- anything that Cyberpower can do to make that configurable will help to manage running several instances.

To help with crashing and starting over: set up some type of log (either file or DB) with the last article that it fixed and the timestamp so that it can restart from there.
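A minimal sketch of the file-based version of that log, in Python for illustration. The file layout and the function names are assumptions, not Cyberbot's actual design.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def save_checkpoint(path, last_title):
    """Record the last article fixed and the time, replacing any old record."""
    record = {
        "last_title": last_title,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    # Rename over the old file so a crash never leaves a half-written log.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved record, or None if nothing has been saved yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


def remaining_titles(titles, checkpoint):
    """Skip everything up to and including the checkpointed article."""
    if checkpoint is None or checkpoint["last_title"] not in titles:
        return list(titles)
    index = titles.index(checkpoint["last_title"])
    return list(titles[index + 1:])
```

On startup the bot would call `load_checkpoint`, pass the result to `remaining_titles`, and call `save_checkpoint` after each article it finishes. This assumes the bot walks the articles in a stable order between runs.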

Discussions with Cyberpower, Jan 4

Cyberpower is making good progress with the multi-threaded bot. Memory leak is fixed. It has thorough memory management, and can be restarted after a crash.

Cyberpower is moving Cyberbot to InternetArchiveBot. It'll be a shared account, and Ryan will be a listed operator.

Cyberpower suggested creating an algorithm to determine if a page is dead -- phab:T122659 now in Community Tech sprint.

Community Tech meeting, Jan 5

Cyberbot doesn't automatically test URLs; it just relies on the Dead link template.

Internet Archive concerns: Flooding the API, no good reporting interface (no idea what the bot is doing). They're interested in hosting a bot on their servers.

We need a central logging system, probably on Tool Labs.

First: go through the code and organize it, with generic variable names. Build tests.

To support other wikis: We need template data, to plug in. VE team can tell us about template data.

Discussion with Cyberpower + Alex, Jan 12

For stats on IAbot progress: Tool Labs hashtag search -- iabot

Cyberpower says we need the pthread library installed on Tool Labs for PHP, so that we don't overload IA's API. Ryan will talk to the Labs admins about installing pthread.

Community Tech meeting, Jan 19

Danny will meet with Mark from Internet Archive to define end goals for the project. We need to understand the entire scope, and then make a plan for how to handle it.

Frances will take the investigation ticket: T120850. We need to understand Cyberbot, as well as the Spanish bot (Elvisor) and the French bot (??). Do they all work the same way? How do they scale?

We may need to think about this as an ecosystem of approaches. There may not be one single solution to the deadlinks problem. Automated detection will obviously be part of it, but we also want to involve humans if it's scalable.

Legoktm wrote up a proposal on mw.org: User:Legoktm/archive.txt.

One answer could be a gadget that gives contributors the ability to assess all the links on a page that they're working on. The gadget could surface the dead links and help people find the best replacement. Or the human could make the decision to switch the archive parameter to yes.

Niharika is working on T122659: an algorithm to determine if a web page is dead. Right now, it's very limited and conservative, just looking for links that give 400 or 500 errors. There are other situations where a link might be considered dead. We need to look into how Pywikibot's weblinkchecker handles this.
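The conservative rule can be sketched in a few lines of Python. This is an illustration, not the actual T122659 script; it assumes "400 or 500 errors" means any 4xx or 5xx response, and the function names are made up.

```python
def is_dead_status(status_code):
    """Conservative check: any 4xx or 5xx response counts as dead."""
    return 400 <= status_code <= 599


def dead_links(results):
    """Filter (url, status_code) pairs down to the URLs considered dead."""
    return [url for url, status in results if is_dead_status(status)]
```

Anything that answers 200 passes this check, including pages whose content is gone, which is why the other situations need separate handling.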

Questions to answer:

Q1: Comparison of current solutions -- Cyberbot, Elvisor, French bot
Q2: Automation by bot vs Lua module? Can bots scale?
Q3: Fixing links -- automated vs human decisions?
Q4: Using deadlink templates to find dead links vs automatic scanning?
Q5: How to scale for other languages? Including template and policy differences.

Ryan/Danny, Jan 21

Strategy

The goal: 100% coverage on all languages forever. Zero dead links today, zero dead links tomorrow.

That means: Every dead link on every language, now and forever, using the local community's cite templates, able to be double-checked by humans, but not requiring it for every link.

Scaling to that level means it's a job for more than one person / one team / one bot.

Cyberpower will graduate, Community Tech will move on to new projects. We need to build a system that other people can support and maintain.

There are four kinds of links that we need to consider:

Dead links that we know about
Dead links that we don't know about
Aging links that will die tomorrow
New links

Right now, Cyberbot is working on the first group.

The detection script that Niharika wrote is for identifying links in the second group.

The aging links are an interesting problem.

New links are being addressed by Citoid (although we need to cover new wikitext links too). This is entirely prophylactic; the new links are very unlikely to be dead on save.

So this isn't just one problem. It's a cluster of several problems, which may need different solutions. It's possible that we'll need (a version of) Cyberbot for dead links that we know about, a gadget for active contributors to help fill in the gaps, a change to Citoid, maybe a different bot that runs through all the existing pages.

We'll probably also need a centralized logging API to keep track of which pages have been scanned, and which haven't (or how long it's been since the scan). This might be a database on Tool Labs.

An important note: Even if we build tools that cover every problem on Enwiki, they won't scale to other languages. (They have different cite templates, with different parameters.)

What we need to do is develop a generalized, well-documented library of modules that other developers can use. For example: a module for testing dead links, a module for retrieving URLs, a module for plugging in values for cite templates. We make those available, with complete documentation, so that people can build bots for their own wikis. This could end up being a manageable task for a volunteer developer -- figuring out how to make the module is hard, but on their end, making the bot edit the right templates is what these devs already know how to do.

One strategy could be -- get some modularized bots working on EN -- then see how to adapt it to a couple other languages -- then prep it for global use.

Threads

Cyberbot currently runs on Cyberpower's server, and he plans to install it on IA's servers. We could use Tool Labs instead -- it's a server environment specifically made for running bots.

Tool Labs doesn't support multithreads. But it does support HHVM (compiled version of PHP that WP and Facebook run on, 10x faster than regular PHP). We could get the speed advantage of multithread by using HHVM and hosting it on Tool Labs.

Otherwise, we have to set up an environment that supports multithreads, and we have to make sure that the code is thread-safe. That's extra complexity in the code that makes it hard to modularize.

Internet Archive, Jan 21

Kenji has a new bulk availability API. They would like us/Cyberpower to use this API. Ryan is reviewing it and making suggestions.

IA has been scraping external URLs for the past two years, so coverage on those links is good.

4% of saved links are dead on arrival -- either the editor typed the link wrong, or the site isn't accessible because of robots.txt.

Greg says: IA is currently crawling all WP pages and can pull all the external links. This is relatively quick and easy. (Q: Is this every existing page, or it updates on edit? Confirm if I have this right.)

IA concern: When does the template get "flipped" from showing the dead link first to showing the archive link first? (parameter=true) We should review this.

Greg brought up a problem -- fake 404s. It's a page that's saying 200 but the content isn't there anymore (or only 5% of the content is there). One example is when a blogger moves to another host, they remove all their content from the original site, but leave a redirect to the main page that makes it look like a 200. A "stretch goal" is to eliminate the fake 404s. (Not sure how.)

Another pipe dream from Greg: creating an interface for Bookfinder citation that connects to the correct page in IA's "lending library" of stored books. This includes modern post-1923 books. They have an API call that tests whether a book is available. (Could this be written up as a suggestion for volunteer devs?)

Another IA concern: Some citation templates will say "from Wayback Machine", some don't. They would prefer to get credit in all cases. Mark's example: Cisco Adler#References. That's happening depending on whether Cyberbot is fixing a citation template or adding one to a reference that doesn't have a template. If there's no cite template, Cyberbot uses a template that mentions the Wayback Machine. Alex explained this in email, it's very difficult.

Meeting, Jan 26

Ryan is talking to Kenji about the API. It's doing a batch request, submitting 10 dead link URLs and retrieving 10 archive URLs at once.
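On the bot side, batching mostly means cutting the list of dead link URLs into groups of 10 before each request. A sketch in Python; the helper name is hypothetical and the actual API call is left out, since the request format isn't described in these notes.

```python
def batches(urls, size=10):
    """Yield successive groups of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

For 25 dead links this gives three requests (10, 10 and 5 URLs) instead of 25 separate ones.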

Frances wants to work on building the generalized library of modules in Python. Cyberpower is working in PHP and on his own software. (Is Python easier for the volunteer devs to understand and use?)

Ryan/Danny, Jan 28

Things Community Tech can work on now

Work with Kenji on batch request API. This is the current limiting factor on how many links can be fixed.
Investigate advanced deadlink detection -- T125181
Create documentation for Cyberbot II/IAbot -- T125183
Set up centralized logging interface on Tool Labs, to keep track of which links have been fixed. (phab) (waiting on disc w/ Cyberpower)
Collect info on Fixing dead links Meta page -- how the existing tools on EN, FR, DE, ES find dead links, how they fix them, where they post notifications, what templates they use/touch.
Start discussion about global Cyberbot on Phabricator (phab)
Write unit tests for Cyberbot (stalled until Cyberpower posts code which is split into classes on GitHub)

Strategy

Dead links in pages tagged with template

Cyberbot II's area. Currently tailored for English WP, needs tests and documentation. Doesn't have an API, so we can't build a gadget based on it. Cyberpower's currently working on making it more modular and testable, configurable for other projects.

On the question of single vs multiple bots: Cyberpower is currently focused on Cyberbot's speed. Kenji is working on IA's batch requests API -- that's the current limiting factor. Once that's done, we'll see if a single bot is the new limiting factor.

Dead link detection

Niharika wrote a basic script for Cyberpower; he's merged it into the Cyberbot code. We'll continue development, making the system more robust and intelligent.

It'll be helpful to have a centralized logging API, so that any tool that's testing links can check when a page/link was last checked.

User-directed dead link fixing

Medium-term possibility: Build a gadget that contributors can use to check a page they're currently working on. A couple options for how it could work:

Gadget calls Cyberbot, and recommends that article for checking
Gadget has an API with Cyberbot's functionality, but operates as a parser rather than a bot. From the edit window, it would take the content, run it through a parser, find dead links, retrieve the archive url, return the wikitext, and the user clicks save. It's saved in the user's name. (Talk about this with Cyberpower, see if he's interested.)

New links

Citoid team plans to do automated archive-url lookup for VE's cite tool. phab:T115224. This will only work for VE edits, may be built into new wikitext editor when that's completed.

Global

Things that need to be localized for each wiki:

Whether it fixes the links or posts notices
Which citation templates to interact with, what parameters to use
Where to post notices
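Those three settings could live in one small per-wiki config object that the shared code reads. A sketch in Python; the `WikiConfig` fields and all the template and parameter names in the usage note are placeholders, not the real names on any wiki.

```python
from dataclasses import dataclass, field


@dataclass
class WikiConfig:
    """Per-wiki settings. Field names are illustrative only."""
    fix_links: bool        # True: edit the article; False: only post a notice
    notice_location: str   # where notices go, e.g. "talk page"
    # Maps a local citation template name to its archive parameter names.
    template_params: dict = field(default_factory=dict)


def plan_for(config, template_name):
    """Decide what to do with a dead link found inside a given template."""
    if template_name not in config.template_params:
        return ("skip", None)
    action = "fix" if config.fix_links else "notify"
    return (action, config.template_params[template_name])
```

For example, `plan_for(WikiConfig(True, "talk page", {"Example cite": {"archive_url": "archive-param"}}), "Example cite")` returns `("fix", {"archive_url": "archive-param"})`, while an unknown template returns `("skip", None)`.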

Possibilities for how to include more languages:

Build Cyberbot an on-wiki configuration system, a MW page where you can set variables in wikitext.

Create modules (prob in Python): test if the link is dead, retrieve urls from IA, ping the centralized logging API. Local wiki can customize how it interacts with local templates, notification, parameters, possibly connect it to an existing tool like autowikibrowser.

Creating modules is probably best for mid-sized wikis -- big enough to have a technical community who can use the modules, but not so big that they get special attention from Cyberpower/CT.

Ryan, Feb 2

One of the tasks we'll be working on with Cyberpower678 is a centralized logging interface for tracking and reporting dead link fixes. The log will be kept on Tool Labs. This will help keep track of which pages have had their dead links fixed, when, and by what agent/bot. This will facilitate 3 things:

If a bot dies, it can pick up where it left off
It will help prevent bots from doing redundant work
It will provide a centralized (and hopefully comprehensive) reporting interface for the Internet Archive and other archive providers

The tracking ticket for the logging interface is T125610.

Story grooming, Feb 9

Some good progress.

Centralized logging interface on Tool Labs for all bots -- This can be used to track what pages have been checked. It'll be useful for individual bots so they don't go over the same page multiple times, and especially useful if there are multiple bots running on the same wiki. Cyberpower agrees this will be helpful; we're moving ahead. The log will include name of the wiki, the archive source, # of links fixed, or notifications posted. (This should accommodate bots from DE, FR and others.) Several tickets: Create a centralized logging API (T126363), Create a web interface (T126364), Documentation (T126365).
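As a rough picture of what one log record might hold, here is a Python sketch based on the fields listed above. The class, the field names and the `last_checked` helper are assumptions for illustration; the real schema belongs to T126363.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class FixLogEntry:
    """One row in the shared log. All field names are illustrative."""
    wiki: str
    page_title: str
    bot: str
    archive_source: str
    links_fixed: int
    notifications_posted: int
    checked_at: str  # ISO 8601 timestamp in UTC


def make_entry(wiki, page_title, bot, archive_source,
               links_fixed=0, notifications_posted=0, now=None):
    """Build an entry, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return FixLogEntry(wiki, page_title, bot, archive_source,
                       links_fixed, notifications_posted, now.isoformat())


def last_checked(entries, wiki, page_title):
    """Return the most recent timestamp for a page, or None if never checked."""
    stamps = [e.checked_at for e in entries
              if e.wiki == wiki and e.page_title == page_title]
    return max(stamps) if stamps else None
```

A bot would call something like `last_checked` before touching a page and skip it if another bot got there recently; `asdict(entry)` gives a plain dict that could be sent to the logging API as JSON.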

Investigate advanced deadlink detection -- Will be done this sprint. This should be useful for several different bots. (T125181)

Documentation for Cyberbot -- should be done this sprint. (T125183)

Modular/Global -- Cyberpower published new code, and it's more modular -- much easier to provide tools that people can use to make their own bots on various wikis.

Fixing dead links Meta page -- We're in touch with people working on Giftbot (DE) and Wikiwix (FR); setting up time to talk about if/how we can help them.

Wikiwix conversations, Feb 16

We met with Pascal from Wikiwix, which archives every new external link on French, English, Hungarian and Romanian WP. They have a default gadget on French which presents a link to the Wikiwix archive on every external link. Pascal was concerned that we had set up an official partnership between Wikimedia Foundation and Internet Archive, and that we were paying IA for some of this work.

I explained that the suggestion to back up external links with the Wayback Machine came from the community-written proposal, and that we're supporting an existing working relationship on English WP between Cyberpower and Internet Archive. There isn't any money involved, it's organizations working together with common aims. Our ultimate goal with this project is to provide code and documentation that can be used by bot-writers on lots of language WPs to create their own bots, using the templates, policies, approach and preferred archive service. Wikiwix has a long-standing, positive relationship with French WP, and we're not trying to split them up. :) We're going to continue talking with Wikiwix.

Investigation, Feb 17

Niharika's investigation, from T125181:

Investigation summary

Are there other HTTP response codes that we should consider dead links?

Did not find any obvious ones that we're missing. It will, of course, depend on whether we count redirects etc. as dead links.

Is it possible to detect "soft 404s" (for example http://www.unhcr.org/home/RSDCOI/3ae6a9f444.html)? If so, how?

My research online tells me this is mostly only possible by search engines (interesting article at http://www.seobythesea.com/2009/06/how-search-engines-might-identify-and-handle-soft-404s-and-login-required-pages/). They do this based on the page "freshness" - how often it's updated, how much useful information it has etc. Some other links suggested scanning page text to do this, but I think doing it is not worth the amount of effort for this project.

What about URLs that redirect to the domain root (like http://findarticles.com/p/articles/mi_m0FCL/is_4_30/ai_66760539/)? Should those be considered dead?

In my opinion - yes. Lately a lot of websites redirect to domain root if the page is non-existent.
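A sketch of how that check might look, in Python. It assumes the HTTP client has already followed the redirects and can report the final URL; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def redirected_to_root(original_url, final_url):
    """True if a deep link ended up on a bare domain root after redirects."""
    original = urlsplit(original_url)
    final = urlsplit(final_url)
    # The original must have pointed at something more specific than the root.
    original_was_deep = original.path not in ("", "/") or bool(original.query)
    final_is_root = final.path in ("", "/") and not final.query
    return original_was_deep and final_is_root
```

Note that this also flags redirects that land on the root of a different domain, which seems right for sites that have moved hosts, but it's a judgment call.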

Is it possible to detect pages that have been replaced with link farms?

There are several algorithms that exist to solve this problem. You'll find them on a simple "link farm detection" google search. One such paper can be seen here. However I did not find any implementation which we can borrow code from or make use of. Google has an implementation of it which it uses in its PageRank algorithm to give less weight to link farm pages. That's not open source, needless to say.

How thorough is weblinkchecker's dead link checking? Should we try to use that instead (or port some of its logic to the Cyberbot code)?

Relevant Code: https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/scripts/weblinkchecker.py;4f634d2989f4b786592747648ba9db242b534d5d$271

Docs: https://www.mediawiki.org/wiki/Manual:Pywikibot/weblinkchecker.py

Could not determine whether it currently works or not. Even if it does work, it will need a lot of tweaking to fit our use case here. We're better off trying to see if we can borrow some of its code. But before we try to do this, there are a few other sources that we have a better chance at. Listed below.

Similar tools

1. Dispenser's Checklinks:

Docs: https://en.wikipedia.org/wiki/User:Dispenser/Checklinks
Working tool: http://dispenser.homenet.org/~dispenser/view/Checklinks (Dispenser's own server)
Status: Code not open source. Dispenser aims to rewrite the tool starting in March and then open source it. 98% claimed accuracy.
Note: Dispenser has said that he wants to set up his own archive on WMF servers. This may entail legal liability for WMF -- we'd get DMCA takedown requests -- so it's unlikely that we'll go in that direction.

2. LinkChecker:

Link: http://wummel.github.io/linkchecker/
Code: https://github.com/wummel/linkchecker

3. Some outdated tools:

Link: http://wummel.github.io/linkchecker/other.html
Not actively updated; I don't think it's worth pursuing these.

4. IA's link checking:

From an email exchange with Greg, he mentioned that this is a problem IA already tackles and they are internally discussing the option to provide this as a service. If this happens, it's our best bet at detecting dead links.

Check-in with IA, Feb 23

Talked to Mark and Bill at IA. There are a couple outstanding issues.

Cyberbot crashes and starts over; it happened again late last week. I talked to them about the new logging system, which should help.
Cyberbot is currently not using the bulk availability API that IA made available. Ryan says this requires some significant code refactoring. We could take it on if Max would like help with it.

They're interested in getting another bot running in another language.

Team meeting, Feb 23

It's unclear whether we're going to get advanced dead link detection help from Internet Archive.

We've got one dead link detection ticket -- T127749.

I asked about the advanced dead link detection; Mark said he'd talk to Greg about it.

Update, Feb 29

Current link for logging interface: http://tools.wmflabs.org/deadlinks/

Team meeting, March 3

Logging interface is in review. We plan to show the interface to IA/Cyberpower next week, for their feedback.

We should find out if there's old data that could be backported to the new logs.

Talk page message, March 14

User:Green Cardamom wrote on Danny's talk page:

"Wanted to let you know about WaybackMedic. It's currently focused on the set of articles edited by Cyberbot (~90k). Unfortunately there was a bug in the IA API which was returning false positives about 10-20% of the time so there are now ~10s of thousands of bad wayback links that are actually 404/403s. WaybackMedic will seek them out, replace them with working IA links, or delete them. Example edit. It also fixes article text formatting bugs introduced by Cyberbot during development. Example. It's running cleanup. The framework is now in place the bot could do other things. Will open source."

This is obviously awesome; we should think about whether there's a useful way to connect the two bots.

Team meeting, March 22

The centralized logging API is complete. Next step is actually integrating the logging function in Cyberbot. T130035 This will be the first time our team is directly editing Cyberbot code. :)

After this, the plans for Cyberbot are: globalization, and adding dead links detection (so it doesn't just have to follow the deadlink template).