InternetArchiveBot

InternetArchiveBot: 2019 Coolest Tool Award Winner in the category Impact

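What is InternetArchiveBot (IABot)?

InternetArchiveBot is a powerful, framework-independent OAuth bot written in PHP, built primarily for use on the Wikimedia Foundation's wikis. It is a global bot that operates at the request of communities and was written by Cyberpower678. It runs on each wiki by applying that wiki's specific rules through an abstract class. It is highly flexible: its behavior can be adjusted through on-wiki and off-wiki configuration by the operator or the community. Its purpose is to address the many aspects of link rot. On large sites it can be set to run with a specified number of worker threads so that jobs finish faster. Each worker analyzes pages and finally reports its results back to the operator as statistics.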

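What can it do?

IABot has a suite of functions for analyzing pages. Its goal is to address link rot as thoroughly as possible, so it analyzes links in many ways.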

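  1. Looks for URLs on the page itself rather than in the database. This tells the bot how each URL is used, for example whether it sits in a citation template, in a reference, or is a bare link. The bot can then handle sources in many formats intelligently, much as a human would;
  2. Checks whether a link has already been archived and, if not, archives it in the Wayback Machine;
  3. Retrieves a working copy of a dead link from the Wayback Machine, or uses an archive already in use on Wikipedia;
  4. Checks whether untagged links are dead, with an error rate of 0.1%;
  5. Automatically parses URLs that are part of citation (cite) templates and works from there. The same applies to access dates that are part of a template;
  6. Saves all of this information to a database, which allows interfaces to make use of it and allows the bot to learn and improve its service;
  7. Converts existing archive URLs to their long form, if this feature is enabled;
  8. Fixes improper use of archive templates and incorrectly formatted URLs.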
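As a rough illustration of the archive lookup in the list above, the sketch below builds a query for the Internet Archive's public Wayback Machine availability endpoint and extracts the closest snapshot from a response. It is written in Python for brevity and is not taken from IABot's PHP source; the function names and the sample response are assumptions made for this example, and this page does not say which endpoints IABot itself calls.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(url, timestamp=None):
    """Return the availability query for `url`, optionally near `timestamp`.

    `timestamp` uses the Wayback format YYYYMMDDhhmmss (a prefix is enough).
    """
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest available snapshot URL from a response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Hand-written sample in the shape the endpoint returns (illustrative only).
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(build_availability_query("http://example.com/", "20060101"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

The sketch performs no network request; the response text would come from whatever HTTP client the caller prefers.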

How it works

IABot's functions are grouped into several classes according to what they do. Communication-related functions and wiki configuration values are stored in the API class, database-related functions in the DB class, miscellaneous core functions in a static Core class, dead link checking functions in a CheckIfDead class, the thread engine in a Thread class, and the global and wiki-specific parsing functions in an abstract Parser class. All but the last of these run uniformly on all wikis; the Parser class, being abstract, requires a class extension. Each class extension contains the functions that allow the bot to operate properly on a given wiki under that wiki's rules. When the bot starts up, it attempts to load the proper extension of the Parser class and initializes it as its parsing class.
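The abstract-class arrangement described above might look roughly like the following. This is a minimal Python sketch of the pattern only, not IABot's PHP code: the method names, the EnwikiParser class, the tag text, and the PARSERS registry are all hypothetical.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Shared parsing logic; wiki-specific rules are left abstract."""

    @abstractmethod
    def dead_link_tag(self):
        """Return the wikitext this wiki uses to mark a dead link."""

    def tag_dead(self, link_wikitext):
        # Generic step: append whatever tag the concrete wiki defines.
        return link_wikitext + self.dead_link_tag()


class EnwikiParser(Parser):
    # Hypothetical extension; the tag text is illustrative only.
    def dead_link_tag(self):
        return "{{dead link}}"


# Hypothetical registry mapping a wiki name to its Parser extension.
PARSERS = {"enwiki": EnwikiParser}


def load_parser(wiki):
    """Instantiate the Parser extension for `wiki`, as done once at start-up."""
    try:
        return PARSERS[wiki]()
    except KeyError:
        raise ValueError(f"no Parser extension registered for {wiki!r}")


parser = load_parser("enwiki")
print(parser.tag_dead("[http://example.com]"))
```

Because Parser is abstract, it cannot be instantiated directly; only a wiki-specific extension can, which mirrors the requirement stated above.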

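Configuration

Since version 2.0, the on-wiki configuration pages for IABot are no longer used. The bot is now configured through the IABot Management Interface. All of the global keywords are still in use.

If you are running InternetArchiveBot yourself, you can configure it through the wiki configuration page and by creating a new deadlink.config.local.inc.php file in the same directory. If someone else is running InternetArchiveBot and you only need to configure it for a particular wiki, you can create a subpage named "Dead-links.js" under the bot's user page and configure it there, for example User:InternetArchiveBot/Dead-links.js. The configuration values are explained below: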

  • link_scan – Determines what to scan for when analyzing a page. Set to 0 to handle every external URL on the article. Set to 1 to only scan URLs that are inside reference tags.
  • page_scan – Determines what pages to scan when doing its run. Set to 0 to scan all of the main space. Set to 1 to only scan for pages that have dead link tags.
  • dead_only – Determines what URLs it can touch and/or modify. Set to 0 to allow the bot to modify all links. Set to 1 to only allow the bot to modify URLs tagged as dead. Set to 2 to allow the bot to modify all URLs tagged as dead and all dead URLs that are not tagged.
  • tag_override – Tells the bot to override its own judgement regarding URLs. If a human tags a URL as dead when the bot determines it alive, setting this to 1 will allow the tag to override the bot's judgement. Set to 0 to disable.
  • archive_by_accessdate – Setting this to 1 will instruct the bot to provide archive snapshots as close to the URL's original access date as possible. Setting this to 0 will have the bot simply find the newest working archive. Exceptions to this are the archive snapshots already found and stored in the DB for already scanned URLs.
  • touch_archive – This setting determines whether or not the bot is allowed to touch a URL that already has an archive snapshot associated with it. Setting this to 1 enables this feature. Setting this to 0 disables this feature. In the event of invalid archives being present or detectable mis-formatting of archive URLs, the bot will ignore this setting and touch those respective URLs.
  • notify_on_talk – This setting instructs the bot to leave a message of what changes it made to a page on its respective talk page. When editing the main page, the talk page message is only left when new archives are added to URLs or existing archives are changed. When only leaving a talk page message without editing the main page, the message is left if a URL is detected to be dead, or archive snapshots were found for given URLs. Setting this to 1 enables this feature. Setting this to 0 disables it.
  • notify_error_on_talk – This instructs the bot to leave messages about problematic sources not being archived on respective talk pages. Setting to 1 enables this feature.
  • talk_message_header – Set the section header of the talk page message it leaves behind, when notify_on_talk is set to 1.
    See the #Magic Word Globals subsection for usable magic words.
  • talk_message – The main body of the talk page message left when notify_on_talk is set to 1.
    See the #Magic Word Globals subsection for usable magic words.
  • talk_message_header_talk_only – Set the section header of the talk page message it leaves behind when the bot doesn't edit the main article.
    See the #Magic Word Globals subsection for usable magic words.
  • talk_message_talk_only – The main body of the talk page message left when the bot doesn't edit the main article.
    See the #Magic Word Globals subsection for usable magic words.
  • talk_error_message_header – Set the section header of the talk page error message left behind, when notify_error_on_talk is set to 1.
  • talk_error_message – The main body of the talk page error message left when notify_error_on_talk is set to 1.
    Supports the following magic words:
    1. {problematiclinks}: A bullet generated list of errors encountered during the archiving process.
  • deadlink_tags – A collection of dead link tags to seek out. Automatically resolves the redirects, so redirects are not required. Format the template as you would on an article, without parameters.
  • citation_tags – A collection of citation tags to seek out, that support URLs. Automatically resolves the redirects, so redirects are not required. Format the template as you would on an article, without parameters.
  • archive#_tags – A collection of general archive tags to seek out, that support the archiving services IABot uses. Automatically resolves the redirects, so redirects are not required. Format the template as you would on an article, without parameters. The "#" is a number. Multiple categories can be implemented to handle different unique archiving templates. This is dependent on how the bot is designed to handle these on a given wiki and is wiki specific.
  • talk_only_tags – A collection of IABot tags to seek out, that signal the bot to only leave a talk page message. These tags override the active configuration.
  • no_talk_tags – A collection of IABot tags to seek out, that signal the bot to not leave a talk page message. These tags override the active configuration.
  • ignore_tags – A collection of bot specific tags to seek out. These tags instruct the bot to ignore the source the tag is attached to. Automatically resolves the redirects, so redirects are not required. Format the template as you would on an article, without parameters.
  • verify_dead – Activate the dead link checker algorithm. The bot will check all untagged and not yet flagged as dead URLs and act on that information. Set to 1 to enable. Set to 0 to disable.
  • archive_alive – Submit live URLs not yet in the Wayback Machine for archiving into the Wayback Machine. Set to 1 to enable. Requires permission from the developers of the Wayback Machine.
  • notify_on_talk_only – Disable editing of the main article and leave a message on the talk page only. This overrides notify_on_talk. Set to 1 to enable.
  • convert_archives – This option instructs the bot to convert all recognized archives to HTTPS when possible, and forces the long-form snapshot URLs, when possible, to include a decodable timestamp and original URL.
  • convert_to_cites – This option instructs the bot to convert plain links inside references with no title to citation templates. Set to 0 to disable.
  • mladdarchive – Part of the {modifiedlinks} magic word, this is used to describe the addition of an archive to a URL.
    Supports the following magic words:
    1. {link}: The original URL.
    2. {newarchive}: The new archive of the original URL.
  • mlmodifyarchive – Part of the {modifiedlinks} magic word, this is used to describe the modification of an archive URL for the original URL.
    Supports the following magic words:
    1. {link}: The original URL.
    2. {oldarchive}: The old archive of the original URL.
    3. {newarchive}: The new archive of the original URL.
  • mlfix – Part of the {modifiedlinks} magic word, this is used to describe the formatting changes and/or corrections made to a URL.
    Supports the following magic words:
    1. {link}: The original URL.
  • mltagged – Part of the {modifiedlinks} magic word, this is used to describe that the original URL has been tagged as dead.
    Supports the following magic words:
    1. {link}: The original URL.
  • mltagremoved – Part of the {modifiedlinks} magic word, this is used to describe that the original URL has been untagged as dead.
    Supports the following magic words:
    1. {link}: The original URL.
  • mldefault – Part of the {modifiedlinks} magic word, this is used as the default text in the event of an internal error when generating the {modifiedlinks} magic word.
    Supports the following magic words:
    1. {link}: The original URL.
  • mladdarchivetalkonly – Part of the {modifiedlinks} magic word, this is used to describe the recommended addition of an archive to a URL. This is used when the main article hasn't been edited.
    Supports the following magic words:
    1. {link}: The original URL.
    2. {newarchive}: The new archive of the original URL.
  • mltaggedtalkonly – Part of the {modifiedlinks} magic word, this is used to describe that the original URL has been found to be dead and should be tagged. This is used when the main article hasn't been edited.
    Supports the following magic words:
    1. {link}: The original URL.
  • mltagremovedtalkonly – Part of the {modifiedlinks} magic word, this is used to describe that the original URL has been tagged as dead, but found to be alive and recommends the removal of the tag. This is used when the main article hasn't been edited.
    Supports the following magic words:
    1. {link}: The original URL.
  • plerror – Part of the {problematiclinks} magic word, this is used to describe the problem the Wayback Machine encountered during archiving.
    Supports the following magic words:
    1. {problem}: The problem URL.
    2. {error}: The error that was encountered for the URL during the archiving process.
  • maineditsummary – This sets the edit summary the bot will use when editing the main article.
    See the #Magic Word Globals subsection for usable magic words. (Items 11, 12, and 13 are not supported)
  • errortalkeditsummary – This sets the edit summary the bot will use when posting the error message on the article's talk page.
  • talkeditsummary – This sets the edit summary the bot will use when posting the analysis information on the article's talk page.
    See the #Magic Word Globals subsection for usable magic words.
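
To make two of the options above concrete, the following Python sketch encodes the dead_only rule and the archive_by_accessdate choice as plain functions. It is a reading of the descriptions on this page, not IABot's code: the function names and arguments are hypothetical, every listed snapshot is assumed to be a working one, and the exception for snapshots already stored in the DB is not modeled.

```python
from datetime import datetime


def may_modify(dead_only, tagged_dead, detected_dead):
    """Apply the dead_only rule to one URL."""
    if dead_only == 0:
        return True                          # all links may be modified
    if dead_only == 1:
        return tagged_dead                   # only URLs tagged as dead
    if dead_only == 2:
        return tagged_dead or detected_dead  # tagged, plus untagged dead URLs
    raise ValueError(f"unsupported dead_only value: {dead_only!r}")


def pick_snapshot(timestamps, access_date, archive_by_accessdate):
    """Choose one snapshot timestamp (YYYYMMDDhhmmss strings), or None."""
    if not timestamps:
        return None
    if archive_by_accessdate == 1 and access_date is not None:
        def distance(ts):
            # Absolute time difference between the snapshot and the access date.
            return abs(datetime.strptime(ts, "%Y%m%d%H%M%S") - access_date)
        return min(timestamps, key=distance)
    # Fixed-width digit strings sort chronologically, so max() is the newest.
    return max(timestamps)


snapshots = ["20100101000000", "20150601000000", "20200101000000"]
print(may_modify(2, tagged_dead=False, detected_dead=True))
print(pick_snapshot(snapshots, datetime(2015, 1, 1), 1))
print(pick_snapshot(snapshots, datetime(2015, 1, 1), 0))
```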

Magic Word Globals

These magic words are available when mentioned in the respective configuration options above.

  1. {namespacepage}: The page name of the main article that was analyzed.
  2. {linksmodified}: The number of links that were either tagged or rescued on the main article.
  3. {linksrescued}: The number of links that were rescued on the main article.
  4. {linksnotrescued}: The number of links that were unable to be rescued on the main article.
  5. {linkstagged}: The number of links that were tagged dead on the main article.
  6. {linksarchived}: The number of links that were archived into the Wayback Machine on the main article.
  7. {linksanalayzed}: The number of links that were overall analyzed on the main article.
  8. {pageid}: The page ID of the main article that was analyzed.
  9. {title}: The URL encoded variant of the name of the main article that was analyzed.
  10. {logstatus}: Returns "fixed" when the bot is set to edit the main article. Returns "posted" when the bot is set to only leave a message on the talk page.
  11. {revid}: The revision ID of the edit to the main article. Empty if there is no edit to the main article.
  12. {diff}: The URL of the revision comparison page of the edit to main article. Empty if there is no edit to the main article.
  13. {modifiedlinks}: A bullet generated list of actions performed/to be performed on the main article using the custom defined text in the other variables.
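
Expanding magic words amounts to placeholder substitution. The Python sketch below shows one way this could work; the helper name, the example message texts, and the choice to leave unknown placeholders untouched are assumptions for illustration, not IABot's documented behavior.

```python
import re


def expand_magic_words(template, values):
    """Replace each {name} in `template` with values[name]; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return re.sub(r"\{([a-z]+)\}", substitute, template)


# Hypothetical message texts and values, for illustration only.
header = "External links modified on {namespacepage}"
summary = "Rescued {linksrescued} of {linksmodified} links ({logstatus})"
values = {
    "namespacepage": "Example",
    "linksrescued": 2,
    "linksmodified": 3,
    "logstatus": "fixed",
}

print(expand_magic_words(header, values))
print(expand_magic_words(summary, values))
```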

Source code

InternetArchiveBot's current source code is available at https://github.com/Internetarchive/internetarchivebot.

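Roadmap

Phase 1 (completed) - Have InternetArchiveBot process the articles in Category:Articles with dead external links on the English Wikipedia, replacing as many dead links as possible with archived copies from the Wayback Machine.

Phase 2 (completed) - Have InternetArchiveBot process all pages on the English Wikipedia to find dead links that have not yet been tagged, and replace them with archived links.

Phase 3 (in progress) - Deploy InternetArchiveBot to wikis other than the English Wikipedia, once community consensus has been obtained.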

Management

Many aspects of InternetArchiveBot can be managed at https://iabot.toolforge.org/, including reporting false positives for dead links and directing the bot to fix a single page.