Grants:IdeaLab/Open source and improve Beastie Bot, the tool for threatened species list creation

Open source and improve Beastie Bot, the tool for threatened species list creation
Make lists of threatened species easier for humans to browse and maintain by open sourcing and improving Beastie Bot, which automates their creation.
idea creator
Pengo
created on 00:15, 28 January 2016 (UTC)

Project idea

What is the problem you're trying to solve?

I've been working on an automated tool (Beastie Bot) which creates better lists of threatened species. Such lists have always been problematic to create and maintain on Wikipedia.

I hope I can play a part in building tools that make it easier to create better content, from which others can learn about the world's biosphere.

Probably the simplest way to illustrate the "problem I'm trying to solve" is to show a previous attempt to solve it. English Wikipedia has a generated list of endangered animals, but it's not very human-readable. It only uses Latin names (e.g. "Aves" instead of "Birds", and "Myrmecobius fasciatus" instead of "numbat"). Some taxonomic groups are spread throughout the list (e.g. "fish" are not grouped together, nor are marsupials, lemurs, or microbats). It's outdated and unmaintained, too long, and resembles a database dump. While it's good for preserving the data found on IUCN's website, it's not so great for the Wikipedia audience.

Beastie Bot creates better lists of species. It doesn't create too many subheading levels. It uses English common names while still preserving the scientific names. It adds introductory text and occasional statistics. It can add levels where needed, for example, to group the various families of New World monkeys together (in the old list, the families of New World monkeys, lemurs, great apes, and other primates are mixed together).
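The idea of adding heading levels only where they are needed can be sketched roughly as follows. This is an illustration in Python, not Beastie Bot's actual C# code: the `Taxon` class, the `outline` function, and the `max_size` and `max_level` thresholds are all hypothetical names chosen for this example, and the species shown are just sample entries.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """Hypothetical node in a taxonomic tree (not Beastie Bot's real data model)."""
    name: str
    species: list = field(default_factory=list)   # species listed directly under this taxon
    children: list = field(default_factory=list)  # child Taxon nodes

    def all_species(self):
        """Collect species from this taxon and all of its descendants."""
        found = list(self.species)
        for child in self.children:
            found.extend(child.all_species())
        return found


def outline(taxon, max_size=3, level=2, max_level=4):
    """Return (heading_level, heading, species) sections for one taxon.

    A taxon is split into sub-headings only if it holds more than
    max_size species, has children, and the heading depth is still
    below max_level. Otherwise everything is listed flat under it.
    """
    everything = taxon.all_species()
    if len(everything) > max_size and taxon.children and level < max_level:
        sections = [(level, taxon.name, sorted(taxon.species))]
        for child in taxon.children:
            sections.extend(outline(child, max_size, level + 1, max_level))
        return sections
    return [(level, taxon.name, sorted(everything))]


primates = Taxon("Primates", children=[
    Taxon("Lemurs", species=["Ring-tailed lemur", "Indri", "Aye-aye"]),
    Taxon("Great apes", species=["Chimpanzee", "Bonobo"]),
])

split = outline(primates, max_size=3)   # large group: gets sub-headings
flat = outline(primates, max_size=10)   # small group: stays as one section
```

With a low threshold the primates are split into "Lemurs" and "Great apes" sub-sections; with a high threshold they stay in a single section, which avoids having too many headings on short pages.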

There are some flow-on benefits from Beastie Bot, such as the ability to create a report of "false synonyms", identifying redirects for different species which point to the same article. For example, the scientific names "Sooglossus sechellensis" and "Tachycnemis seychellensis" refer to two different frog species belonging to different frog families, but both link to the same article: Seychelles treefrog. In developing Beastie Bot, I have identified this issue and made a list of hundreds of other possible false synonyms to aid Wikipedians.
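A report like this could be produced along the following lines. This is a minimal Python sketch rather than Beastie Bot's own C# code; the function name `find_shared_targets` and the input format (pairs of scientific name and article title) are assumptions made for the example. The results are only candidates, since genuine synonyms also redirect to a shared article and each pair still needs a human check.

```python
from collections import defaultdict


def find_shared_targets(redirects):
    """Group scientific names by the article they lead to.

    redirects: iterable of (scientific_name, article_title) pairs.
    Returns only the articles reached from two or more different names.
    """
    by_target = defaultdict(set)
    for scientific_name, article in redirects:
        by_target[article].add(scientific_name)
    return {
        article: sorted(names)
        for article, names in by_target.items()
        if len(names) > 1
    }


redirects = [
    ("Sooglossus sechellensis", "Seychelles treefrog"),
    ("Tachycnemis seychellensis", "Seychelles treefrog"),
    ("Myrmecobius fasciatus", "Numbat"),
]
candidates = find_shared_targets(redirects)
```

Here `candidates` contains only the Seychelles treefrog article, with both scientific names listed against it.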

I've also noticed there are many ambiguous common names for species which are lacking disambiguation on Wikipedia, often only linking to a single species. For example, I created the disambiguation pages Dwarf mountain pine, Moon cactus, and African common toad based on the output of Beastie Bot. I could further automate the identification of similar missing disambiguation pages, and possibly partially automate the creation of the pages too. While it is not the core task of Beastie Bot to identify incorrect redirects or find missing disambiguation pages, these tasks alone are quite significant, and I hope more people will join in the required cleanup effort.
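Identifying such names could look something like the sketch below. It is written in Python for brevity and is not taken from Beastie Bot; the function name, the input format, and the species names ("Fictus alpha" and so on) are invented placeholders, apart from the numbat mentioned earlier.

```python
from collections import defaultdict


def missing_disambiguations(name_pairs, existing_disambig_titles):
    """Find common names shared by several species with no disambiguation page.

    name_pairs: iterable of (common_name, scientific_name) pairs.
    existing_disambig_titles: titles that already have a disambiguation page.
    Names are compared case-insensitively.
    """
    species_by_name = defaultdict(set)
    for common_name, scientific_name in name_pairs:
        species_by_name[common_name.casefold()].add(scientific_name)
    existing = {title.casefold() for title in existing_disambig_titles}
    return {
        name: sorted(species)
        for name, species in species_by_name.items()
        if len(species) > 1 and name not in existing
    }


# Placeholder data: the "Fictus" species are invented for illustration.
pairs = [
    ("Example toad", "Fictus alpha"),
    ("example toad", "Fictus beta"),
    ("Sample pine", "Fictus gamma"),
    ("Sample pine", "Fictus delta"),
    ("Numbat", "Myrmecobius fasciatus"),
]
report = missing_disambiguations(pairs, {"Sample pine"})
```

In this example only "example toad" is reported: "Sample pine" is ambiguous but already has a disambiguation page, and "Numbat" refers to a single species.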

What is your solution?

I have begun creating a tool called Beastie Bot to create human-friendly lists based on IUCN Red List data.

It is still a work in progress, but compare for example:

Although the lists generated by Beastie Bot may seem plain and straightforward, it has been a large undertaking to get them to the point they are at now.

Some of the features, or technical hurdles which have been largely solved:

  • Beastie Bot can group and split taxa to allow reasonable sized, logical sections so there aren't too many or too few headings on the page.
  • Common names are automatically taken from Wikipedia article titles (but only after automated checks that the article is relevant, e.g. by checking the taxobox).
  • Common names are also taken from the IUCN's list of common names, but first they are checked for uniqueness, and recapitalized to match Wikipedia's article casing (not a trivial task).
  • Plural and lowercase forms of common names sometimes need to be discovered.
  • English sentences are created to summarize larger sections, e.g. "There are 422 ray-finned fish species and four ray-finned fish subspecies assessed as critically endangered". They could be improved but are in a reasonable state.
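As a rough illustration of the last point, a summary sentence of the kind quoted above could be assembled as follows. This is a Python sketch, not Beastie Bot's C# implementation; the names `spell` and `summary_sentence` are hypothetical, and the rule of spelling out numbers below ten is an assumption inferred from the example sentence.

```python
SMALL_NUMBERS = ["zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine"]


def spell(count):
    """Spell out single-digit counts; use digits for everything else (assumed style rule)."""
    return SMALL_NUMBERS[count] if 0 <= count < 10 else str(count)


def summary_sentence(group, species, subspecies, status):
    """Build one summary sentence for a section of the list."""
    parts = [f"{spell(species)} {group} species"]
    if subspecies > 0:
        parts.append(f"{spell(subspecies)} {group} subspecies")
    # "is" only for exactly one species and no subspecies
    verb = "is" if species == 1 and subspecies == 0 else "are"
    return f"There {verb} {' and '.join(parts)} assessed as {status}."


sentence = summary_sentence("ray-finned fish", 422, 4, "critically endangered")
```

Note that "species" and "subspecies" are the same in singular and plural, which keeps this case simple; group names such as "ray-finned fish" are where the plural and lowercase forms mentioned above have to be discovered.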

Some hurdles are still on the to-do list, such as creating links between the generated pages and automatically uploading them. There are also many tasks that could be added if the project were open sourced, such as making the "exceptions list" editable on a wiki page rather than hardcoded in source code (e.g. to add a common name for a species when it can't be determined automatically). A longer to-do list is at: w:User:Beastie_Bot#To_do.

Open sourcing the project may also allow others to add internationalization so it can be used on other language Wikipedias. Another eventual goal is integration with Wikidata, to use as a source of input and to improve its data sets.

I'd eventually like to increase the scope of the project to not only creating large lists, but also smaller lists as found on a huge number of taxonomic pages, which include information about the threatened status of species within that group of plants or animals but do not always maintain the information. Beastie Bot may also be extended in future to help automate the creation and updating of such taxonomic information. Eventually I'd like to create a web interface to give various options to help Wikipedians find the best fit for content to suit an article, such as a list, tree, or a table.

Beastie Bot currently resides on my own hard drive and is tangled with other source code and projects I have worked on. I'd like to separate it out, open source it, and move it to Wikimedia Labs. This is not a trivial task, and I'm looking for support and interest before I dedicate myself full time to it. Some examples of the changes and work involved: Currently it uses a Xowa-formatted SQLite dump of Wikipedia. To move its hosting to Wikimedia Labs it would make sense to change this to use their MariaDB mirrors instead. Also the interface should change from command line to web.

The previous incarnation of Beastie Bot ran in 2006. It had a somewhat related task of updating the IUCN status (with citations) in the "Taxobox" of species articles. The source code was never made available and I never found time to maintain it after its initial successful run. I don't wish to repeat that, and I hope I can find some support for this project.

The current Beastie Bot is written in C#. Before anyone mentions any concerns about C# being a proprietary language: it's not. The code can run on an open source stack through Mono, as already found on Wikimedia Labs servers (I have successfully compiled and run a C# web project on Wikimedia Labs). Microsoft have also released an open source C# compiler on GitHub ("Roslyn") and open sourced .NET Core. Some wiki-related libraries already exist, such as DotNetWikiBot Framework and the WikiFunctions .NET library. The code is robust, well commented and relatively easy to maintain.

If there's no interest from the community or WMF, then open sourcing this project will just be me wasting a lot of time in pursuit of some vague hope that someone might pick it up after my death; time better spent just working on the tool in its current, closed form. However, I would of course be happier if I could develop the project within the context of a community. Feedback is appreciated.

If there are any coders, conservationists, systematists, people with natural language generation expertise, or other interested parties who would like to get in touch and discuss the project, please feel free to contact me or join the conversation. Leave feedback here or add yourself as a participant/supporter, write a message on the Beastie Bot talk page, or send me a message. Pengo (talk) 00:15, 28 January 2016 (UTC)

Goals

  • Create "List of endangered reptiles" on Wikipedia
  • Replace the word "endangered" with "critically endangered", "vulnerable", "near threatened", and "least concern"
  • Replace the word "reptiles" with mammals, birds, fishes, amphibians, reptiles, molluscs, invertebrates, insects, chromista, fungi, and plants (possibly further splitting these lists when it makes sense; for example, passerines make up a large number of birds).
  • All combinations of the above.
  • Generate reports of missing disambiguation pages and redirects which need checking relating to the 79,000 species in the IUCN Red List.
  • Open source the tool for creating these lists, and move it to the WMF Labs server.
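The first four goals above amount to one list per combination of status and group. A small Python sketch of the resulting page titles follows; the "List of {status} {group}" title pattern is extrapolated from the first goal and is an assumption, as are the names `STATUSES`, `GROUPS`, and `list_titles`.

```python
from itertools import product

STATUSES = ["critically endangered", "endangered", "vulnerable",
            "near threatened", "least concern"]
GROUPS = ["mammals", "birds", "fishes", "amphibians", "reptiles", "molluscs",
          "invertebrates", "insects", "chromista", "fungi", "plants"]


def list_titles(statuses=STATUSES, groups=GROUPS):
    """Return one candidate page title per (status, group) combination."""
    return [f"List of {status} {group}"
            for status, group in product(statuses, groups)]


titles = list_titles()
```

With the five statuses and eleven groups named in the goals, this gives 55 candidate lists before any further splitting of large groups.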

Get Involved

Participants

Endorsements

  • Well-presented output. Pelagic (talk) 18:46, 26 February 2016 (UTC)
  • Output is nicely formatted and easy to parse, and if I understand correctly, the bot will also make the continued maintenance of such lists a lot easier - and that is very valuable indeed. Don't think I can help in any way on the code side, but would be happy to assist with other related tasks. Elmidae (talk) 16:57, 10 April 2016 (UTC)
