WikiFAIR
This page is under construction. Please help review and edit this page.
WikiFAIR is a set of ideas, instructions and helpful examples for achieving the FAIR principles in research projects using Wikimedia systems and technologies. More specifically, it is about integrating Wikimedia platforms into research data management to promote free knowledge while reducing the hurdles of building publishing infrastructure.
FAIR principles
The FAIR data principles are guidelines designed to improve the Findability, Accessibility, Interoperability, and Reusability of digital assets. These principles emphasize the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
Full implementation of every aspect of the FAIR principles is difficult for most research projects, especially smaller ones. Depending on whether a new digital presentation platform needs to be built or research data needs to be published, the scale of possible hurdles and potential costs is significant. But most researchers want to, and often need to, follow these guidelines to achieve the best scientific outcomes.
Integrating research projects with Wikimedia's ecosystem allows research teams to benefit from adhering to FAIR data standards and to make their research data consumable in many different formats. The specific results depend on the chosen integration model, which can range from simply linking the data sets to fully using multiple Wikimedia projects to host structured data, media files and other formats about the research.
Wikimedia and FAIR Principles
The Wikimedia Movement's goal is to become the essential infrastructure of the ecosystem of free knowledge, and accordingly implementing FAIR principles across all projects is a core goal. As WikiFAIR mostly focuses on the semantic knowledge base Wikidata and the free media repository (Wikimedia) Commons, we will explore them in particular. Most of the details about Wikidata also apply to Wikibase, the open source software powering it, which can also be run independently in a federated system.
Putting these projects in the FAIR definitions of GO FAIR:
Findable
All Wikimedia projects use unique and persistent identifiers (QIDs for Wikidata, filenames for Commons) as required in F1, with descriptive metadata for F2. Every entry includes its identifier (F3) and is indexed by multiple search engines (F4).
Accessible
Wikimedia sites offer free content ranging from CC BY-SA (often found on Commons) to Public Domain (all of Wikidata), and all of it is retrievable by its identifier with standardized, open protocols (A1, A1.1). There is also a Single User Login system spanning multiple projects (A1.2).
Metadata about deleted/removed objects can be retrieved via deletion logs, as required by A2.
Interoperable
Structured data on Wikidata is presented through a completely multilingual user interface and is available in the most common and FAIR (I2) export formats such as RDF, JSON, TTL and more (I1). The dataset is interlinked with many other authority databases (I3).
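As a sketch of these export formats in practice, every item can be retrieved in any of the supported serializations through the Special:EntityData endpoint (Q42 is used here purely as an example item):

```python
# Sketch: building export URLs for Wikidata's Special:EntityData endpoint,
# which serves any item in the serialization named by the file extension
# (json, ttl, rdf, nt and more).

def entity_data_url(qid: str, fmt: str = "json") -> str:
    """Return the export URL for an item, e.g. entity_data_url("Q42", "ttl")."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.{fmt}"

# Fetching entity_data_url("Q42", "json") returns the complete item as JSON,
# nested under data["entities"]["Q42"]; the .ttl variant returns Turtle/RDF.
```

The same URL pattern also works for properties (PIDs), which makes machine access to the data a one-line affair.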
Reusable
The complete Wikidata space features well over 100 million items and around 1.5 billion triples, created from a pool of over 10,000 properties (R1). All data is released under clear and open licensing (see #Accessible, R1.1), and a referencing system is applied across all Wikimedia projects, requiring citations and provenance information in most circumstances (R1.2).
The data model implements or approximates different community standards by linking properties to their relevant equivalents in other standards (R1.3).
Additional Features of Integration
Aside from helping with FAIR requirements, integrating data with the broader Wikimedia community can help with different aspects of maintaining, visualizing or simply hosting your data set.
Community
The community can help with long-term quality management through crowdsourced improvements to the data. There are thousands of volunteers and bots going through the data each day, patrolling changes, adding or removing statements, and performing other helpful tasks. This can mean that a newly created Wikidata item from your project receives a lifetime of maintenance as part of the wider dataset.
The community can also help with questions of data modeling, writing SPARQL queries or improving related properties. On Commons, volunteers can categorize uploaded media and add new depicts statements, further enriching the data.
Allowing the community access can also help with fulfilling the expanded set of criteria of the CARE Principles for Indigenous Data Governance, which require allowing access to a diverse set of related groups, some of which are already represented in the Wikimedia Movement.
5-Star Open Data
A different concept from FAIR, with similar intentions, is the 5-Star Open Data system, proposed by Tim Berners-Lee. It categorizes data by qualities of accessibility and machine readability. With the addition of Structured Data on Commons, both it and Wikidata offer the highest level in this scheme, and by linking or integrating data with them research projects can do the same with little effort.
Software
From a software point of view, both MediaWiki and Wikibase are well-tested, open source tools that offer full versioning, multi-user editing and caching for scalability. They can accommodate teams of editors of most sizes and bring a well-developed extension system to customize them even further.
External Tools
Data on Wikimedia projects can be processed with a plethora of diverse external tools, allowing for more possibilities than many other platforms. For an overview see Wikidata:Tools and Commons:Tools or visit Toolhub.
Possible Caveats
Before continuing further, please be advised that there could be downsides to some parts of this process.
(Partial) loss of control
Wikimedia projects are generally open to everyone to edit, which can be very appealing. But it's also possible to get your edits reverted, or to have your data vandalized by malicious editors.
If something like this happens, adhere to community guidelines or ask for help if anything remains unclear. Depending on your research goals and methods, you may want to monitor your data after upload. See WikiFAIR#Monitoring Uploaded Data for more information.
Alternatively it could be helpful to run your own Wikibase instance to better control editing access to the data set; see #Creating a Knowledge Graph with Wikibase.
It is of utmost importance to keep a second copy of the research data, preferably published independently of any Wikimedia project. There are no guarantees that, for example, the community won't request deletion or change the contents; such decisions can always be appealed, but could lead to disruptions. For publishing bulk research data, there are multiple popular platforms, e.g. Zenodo. (A publication on such a bulk hosting platform also makes a great reference; see the next paragraph.)
Sourced statements requirement
Citing sources is required for your data to be accepted into any Wikimedia project; more information on that here. This can be a problem for original research or hard-to-source facts. If the source is a paper your team is going to write, it can also be cited, but the problem remains for some special cases. Asking in the Project Chat for help with questions regarding sources can be helpful.
Integrating Structured Data in Wikidata
As stated on the Main Page of Wikidata:
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
This already suggests some ways to proceed, either by adding the data by hand or by using tools to modify Wikidata on a bigger scale.
Inclusion Criteria
Most kinds of structured data fit in Wikidata, which allows for great freedom in how you want to model your contribution. There are also (deliberately broad) rules about inclusion in the scope of the project; for researchers, the second criterion is the most important:
[The item] refers to an instance of a clearly identifiable conceptual or material entity that can be described using serious and publicly available references.
Having publicly available sources for the added data is key to the future of Wikidata as a quality source, and quite easy to implement, see #Technical Process.
Also, please apply a measure of diligence to how you contribute to the dataset. If your data only adds some statements to various items, there shouldn't be any problem, but be careful when creating new items if you are not completely sure they contribute relevant knowledge.
Technical Process
To learn about the process of adding and editing existing items, follow a tour on Wikidata:Tours. There are many different routes to upload data with external tools; some popular options are OpenRefine, bots, and QuickStatements.
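For example, QuickStatements (v1) batches are just tab-separated text, so a small helper like the following sketch can generate them from your data set (all item and property IDs below are placeholders):

```python
# Sketch: generating QuickStatements (v1) commands. Each line is
# item TAB property TAB value, optionally followed by reference pairs
# whose property IDs use an "S" prefix instead of "P".

def qs_statement(item, prop, value, references=()):
    """Build one QuickStatements line, e.g. qs_statement("Q42", "P31", "Q5")."""
    parts = [item, prop, value]
    for ref_prop, ref_value in references:
        parts += [ref_prop.replace("P", "S", 1), ref_value]
    return "\t".join(parts)

# A statement sourced via stated in (P248) and reference URL (P854);
# Q99999999 and the URL are placeholders for your project's publication.
line = qs_statement("Q42", "P31", "Q5",
                    references=[("P248", "Q99999999"),
                                ("P854", '"https://example.org/dataset"')])
```

Pasting such lines into the QuickStatements tool executes them as batch edits; OpenRefine offers a similar, more interactive reconciliation workflow.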
Attribution
To collect and show the contributions made by a project, it can be useful to create a project user account. Editing with the project account lets you identify contributions as belonging to the project. Linking and referencing Wikidata statements back to the project database or publications can also be helpful, especially with the properties
- described by source (P1343)
- stated in (P248) {project item}
- reference URL (P854)
Monitoring Uploaded Data
You can monitor uploaded data via watchlists or maintenance queries. See for example the Wikiproject Chess Maintenance Queries.
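A minimal sketch of such a maintenance query, run against the public Wikidata Query Service (P9999 stands in for a hypothetical project-specific property):

```python
# Sketch: a maintenance query for the Wikidata Query Service (WDQS).
# P9999 is a placeholder for a project-specific property; the query lists
# items using it that are still missing an instance of (P31) statement.

MAINTENANCE_QUERY = """
SELECT ?item WHERE {
  ?item wdt:P9999 ?value .
  FILTER NOT EXISTS { ?item wdt:P31 ?class . }
}
LIMIT 100
"""

def wdqs_request(query: str):
    """Endpoint URL and parameters for a JSON result from WDQS."""
    return ("https://query.wikidata.org/sparql",
            {"query": query, "format": "json"})
```

Running such a query on a schedule, or saving it on a WikiProject page, surfaces vandalism and modeling drift early.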
Creating a Knowledge Graph with Wikibase
If you prefer to host the data partially or completely outside of Wikidata, but still want a similar feature set, you might want to use Wikibase. It is the same open source software Wikidata uses, but completely under your control. With it you can model the data how you see fit while still using some of the ecosystem and tools described on this page.
Technical Process
Wikibase can be set up in two major ways: as a hosted service on Wikibase.cloud or as an on-premises install using Docker with Wikibase Suite. Here is a guide to help you choose.
Free Media on Wikimedia Commons
Wikimedia Commons, sometimes referred to simply as "Commons", is a media repository of free-to-use images, sounds, videos and other media.
Inclusion Criteria
Many types of files can be hosted on Wikimedia Commons if they have the applicable licensing, CC BY-SA or better; see Commons:Licensing.
Not every kind of data is allowed on Commons, even if the licensing is correct. The uploaded data should be "educational" (normally not a problem for research data) and fit into the project scope of Wikimedia Commons.
If you are unsure whether your media is suitable for upload to Commons, you can ask here. There are also other projects that can host freely licensed media, e.g. Archive.org.
Technical Process
For uploading images to Wikimedia Commons, follow this tutorial. There is also this list showing the supported file types; nearly every relevant open format is supported.
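Uploads can also be scripted through the MediaWiki action API. The following sketch only assembles the request parameters; a real upload additionally needs an authenticated session and a CSRF token, and the file itself is sent as multipart form data. The filename and comment are placeholders:

```python
# Sketch: parameters for an upload to Wikimedia Commons via the
# MediaWiki action API (action=upload). Authentication and the CSRF
# token (from action=query&meta=tokens) are deliberately left out.

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def upload_params(filename: str, comment: str, token: str) -> dict:
    """Form fields for action=upload; the file goes in a multipart field."""
    return {
        "action": "upload",
        "filename": filename,   # target name on Commons (placeholder)
        "comment": comment,     # upload summary (placeholder)
        "token": token,         # CSRF token from meta=tokens
        "format": "json",
    }
```

For larger batches, dedicated tools such as Pattypan or OpenRefine are usually more convenient than raw API calls.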
Attribution
There are multiple ways to attribute the research project or organization, depending on the platform. Creating a user account can help with attribution too.
Media files on Commons can be labeled (via Structured Data on Commons or traditionally) with a creator property. Alternatively, if the files are not a creation of the researchers, templates indicating the origin and context of the upload can be added to the file page; the user name in the file history or the source statement is always a way to reference the project account.
Monitoring Uploaded Data
You can monitor uploaded data via watchlists or maintenance queries. See for example the Wikiproject Chess Maintenance Queries.
Other Wikimedia Projects
In some cases an integration with other Wikimedia projects could also be beneficial. Please note that most of those projects are not multilingual like Commons or Wikidata, so you should use the appropriate language version.
Wikipedia
Citing your own research in Wikipedia can be allowed in some cases, but overdoing it should be avoided. Following this guide for academics and researchers is generally a good start.
Wikisource
Adding new source texts or improving the translation of an existing one are possible ways to use this Wikimedia project. Using Wikisource as a platform to publish historical or non-copyrighted texts allows access to features like:
- Different export formats (EPUB and MOBI for eReaders, PDF, RTF and more)
- Configurable reader
- The Wikisource Community for proofreading of translations and transliterations
- A correlated Wikidata item to the text
Wikispecies
A Wikimedia project dedicated to tracking and collecting species, and a sensible place to put data about them.
Wiktionary
For lexicographers, adding new dictionaries as sources, or new senses, could be a worthwhile endeavor. Additionally, Wikidata has supported Lexemes since 2018 and is an easy platform to (re)use, since its licensing is public domain.
Integration Models
Depending on your requirements, the integration into the Wikimedia ecosystem can vary drastically, from simple links or statements to most of the research data being hosted there. More effort is required to achieve deeper levels of integration.
2: Linking to Wikidata
Connect your research database to Wikidata by matching Wikidata items to entries in your database and adding links to one or both. This link could be modeled through specific Wikidata statements, e.g. by using described at URL (P973) or proposing a project-related property.
Adding a property allows for more long-term quality control through tools like the constraint report system or pregenerated queries for quality violations.
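Such links are also directly queryable; a sketch of a SPARQL query listing every item that points back to a project database through described at URL (P973), where the base URL is a placeholder:

```python
# Sketch: find all Wikidata items linked to a project database via
# described at URL (P973). The base URL below is a placeholder.

def linking_query(base_url: str) -> str:
    """SPARQL for items whose P973 value starts with base_url."""
    return f"""
SELECT ?item ?url WHERE {{
  ?item wdt:P973 ?url .
  FILTER(STRSTARTS(STR(?url), "{base_url}"))
}}
"""

query = linking_query("https://example.org/db/")
```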
Integration Results:
- Improved visibility of the original database
- Ability to query your own data with the additional data from Wikidata
Example
Deckenmalerei.eu links most entries to Wikidata.
3: Linking to your own MediaWiki
Especially interesting are projects that use MediaWiki as their base software, since it allows for useful crossovers. MediaWiki can be used as a simple text and media base for people to edit, while more complex templates pull data from Wikidata to display in local infoboxes.
Integration Results:
- Easy to use interface for editing multimedia content with text and images
- Possibility to use the extensive Wikipedia template ecosystem
- Showing live metadata from Wikidata in infoboxes using Wikibase Client
Example
WIP
4: Linking to your own Wikibase
Alternatively, following the vision of a Linked Open Data future, Wikibase could be run as a completely independent instance, or hosted on Wikibase Cloud, while federating with Wikidata via a related property.
Integration Results:
- Use a semantic database, but with complete control over every aspect
- Model data in different ways than Wikidata
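Federation can also be exercised at query time: a query service attached to your Wikibase can pull live data from Wikidata through a SERVICE clause, assuming your query service allows outgoing federation. In this sketch, P2 is a placeholder for a local property whose value is the matched Wikidata entity:

```python
# Sketch: a federated SPARQL query run on your own Wikibase's query
# service. P2 is a placeholder local property whose value is the
# matched Wikidata entity IRI; labels are fetched live from Wikidata.

FEDERATED_QUERY = """
SELECT ?localItem ?wikidataLabel WHERE {
  ?localItem wdt:P2 ?wikidataItem .
  SERVICE <https://query.wikidata.org/sparql> {
    ?wikidataItem rdfs:label ?wikidataLabel .
    FILTER(LANG(?wikidataLabel) = "en")
  }
}
LIMIT 10
"""
```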
Example
FactGrid has its own property in Wikidata while being completely independent of Wikimedia as a whole, and the two are linked via that property.
5: Enriching data with your research
After matching with your data, the next step could be to enrich Wikimedia projects by adding data from your project, while citing your research as a source.
If your project doesn't provide an online database, you could simply add the data and use your publication as a reference: instead of hosting your own database, publish your research table as a file and reference it.
To mass-upload metadata or media files there are multiple routes (e.g. OpenRefine).
6: Integrating Media on Commons
Add images, videos, sounds and PDFs to Wikimedia Commons, including structured data for those media files. This can be more troublesome, as copyright compatibility needs to be checked, but it removes the complexity of hosting those files yourself.
Because the project is integrated with Wikipedia, its capabilities and uptime are production-ready, and it is well integrated with other large hubs for media data; for example, it is completely indexed by popular search engines.
Additionally, structured data on Commons allows for interesting uses of metadata tagging.