Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine

status: selected
Structured Data on Wikimedia Commons functionalities in OpenRefine
summary: Structured Data on Wikimedia Commons (SDC) editing and file upload functionalities for OpenRefine
target: Wikimedia Commons (and consequently also Wikidata and Wikipedias)
amount: $99,411.76 (USD)
type of applicant: organization
nonprofit: Yes, 501(c)3
advisor: Pintoch, FRomeo (WMF)
contact: SFauconnier • hi{@}codeforsociety.org
volunteer: Ecritures, Datamuse, Attitude
created on: 09:47, 24 February 2021 (UTC)


Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

Since 2019 it has been possible to add structured data to files on Wikimedia Commons (SDC = Structured Data on Commons). By describing Wikimedia Commons files with structured data from Wikidata, the files’ descriptions become multilingual and machine-readable.

Structured data makes Commons files much more discoverable across languages on various Wikimedia projects, including Wikipedia. Since early 2021, the default media search engine in VisualEditor on Wikipedias has been Special:MediaSearch, which is powered by SDC. In practice, this means that Wikipedia editors who use VisualEditor in any language will get similar image suggestions via structured data while inserting images in Wikipedia articles.

Special:MediaSearch will also become the default search engine for Wikimedia Commons in the upcoming months.

In the Wikimedia ecosystem, diverse and large-scale sets of media files are often contributed to Wikimedia Commons as part of GLAM-Wiki projects. Cultural institutions or GLAMs (Galleries, Libraries, Archives and Museums) regularly work together with local Wikimedia affiliates and individual Wikimedians to contribute many thousands of files (images, videos, audio files) from their collections to Wikimedia Commons. In most cases, files from cultural institutions have quite rich descriptions (metadata). As research from 2017 demonstrates, it is important for partner organizations that this metadata will also be well represented on Wikimedia Commons.

‘Batches’ of files contributed to Wikimedia Commons through GLAM-Wiki projects are usually pretty diverse. Typical examples are:

  • A set of media files with a variety of photographs, digitized postcards and prints from a city archive, which may contain photographs by many different photographers and of many diverse places and historical events in that city.
  • A set of media files with diverse digital photographs of artworks from a museum: digitized paintings and prints, digital photos of sculptures... many of them created by different artists in different periods.

Batch editing SDC: Wikimedia users without coding skills can’t batch edit large, diverse sets of files yet.

Since the deployment of structured data on Commons (SDC), many Wikimedia Commons volunteers have already added (some) structured data to millions of files.

Several tools (all of them community-built) do some batch SDC editing at this moment:

  • The ISA Tool is meant for refined addition of (only) ‘Depicts’ statements and file captions to smaller, targeted sets of files. The ISA Tool tends to work well for sets of up to 10,000 files, to which the structured data is added individually (file by file) via crowdsourcing.
  • AC/DC (“Add to Commons, Descriptive Claims”) is a Wikimedia Commons gadget that helps with batch SDC editing. AC/DC works on smaller numbers of files, for instance files from a single Commons category, and is designed to add the same structured data to all files in a small set at once.
  • QuickStatements, which is designed and heavily used for batch editing Wikidata items, also supports SDC to some extent. However, it’s not yet straightforward to do this: quite a few workarounds are needed, including using an external tool, Minefield, to convert file names to unique identifiers (M-ids, equivalent of Wikidata Q-ids).
  • The SDC user script allows users to add a limited number of identical structured data statements to files in a single Wikimedia Commons category.

Up until now, however, bots (operated by a small number of Commons editors with coding skills) have been responsible for the largest share of batch additions of structured data to thousands of files at once. Wikimedia bot operators often use the Pywikibot framework, which offers useful abstraction layers for large-scale batch operations, but Pywikibot has no SDC support yet (Phabricator task), so SDC-oriented bot operations are usually written entirely by hand.

In summary: there is no reliable, flexible and powerful tool yet to add and modify structured data for very large batches of files on Wikimedia Commons (think thousands of diverse files at once, across various categories, with a variety of structured data added to these files).

Batch uploading with SDC: No powerful tools are available for batch uploading files to Wikimedia Commons with structured data.

Large-scale file uploads to Wikimedia Commons with structured data can also currently only be done by volunteers with coding skills. There are no batch upload tools for Wikimedia Commons yet that support structured data.

  • The Wikimedia Commons UploadWizard supports uploads of up to 50 files at once, including structured data, but it is tedious manual work to add diverse structured data to a larger set of files.
  • Existing and currently widely used batch upload tools for Wikimedia Commons include the following. None of these tools supports structured data (yet).
    • Pattypan is a batch upload tool created by User:Yarl in 2015. It serves the simpler use case of larger-scale uploads of diverse files with Wikitext (think hundreds or several thousands of files at once). Pattypan is reasonably user-friendly, and it operates with datasets that are prepared in Excel format. So far, Pattypan has only supported wikitext, but there is a long-standing request (Phabricator task) to extend Pattypan with structured data functionality. Pattypan is currently not actively developed.
    • The GLAMWiki Toolset, developed since 2012 as an initiative of various European Wikimedia chapters and Europeana, is a batch upload tool for Wikimedia Commons that serves the more complex use case of uploading many thousands (sometimes up to tens of thousands) of files at once. The GLAMWiki Toolset takes XML files as source data; this makes it suitable for more technically proficient Wikimedians and for cultural institutions (GLAMs) with powerful collections management systems that have XML as an output format. However, the GLAMWiki Toolset has not been maintained since 2019 and its usage has declined. As of March 2021, Wikimedia community members are discussing deactivation of the tool (Phabricator task).

As there are no similar batch upload tools supporting SDC yet, currently only a very small group of people in the Wikimedia community (only those community members with coding skills who are able to operate bots) can do larger-scale uploads that include SDC. This, consequently, limits the availability of rich multilingual structured data on Commons.

With a flexible, powerful tool that supports SDC, Wikimedia Commons and GLAM-Wiki volunteers will be able to perform batch uploads with structured data from the get-go.

What is your solution?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

We want to offer easier ways for users to modify and upload files with structured data on/to Wikimedia Commons by extending the popular data wrangling tool OpenRefine, which is already extensively used in the Wikidata community for batch Wikidata editing, with new functionalities for

  1. editing structured data of existing files on Wikimedia Commons, and
  2. uploading new Wikimedia Commons files with structured data from the start.

About OpenRefine

OpenRefine is a free data wrangling tool that can be used to clean tabular data and connect it with knowledge bases, including Wikidata. It was previously developed by Google (under the name Google Refine), but has transitioned to a community-supported open source project (licensed under the BSD license). OpenRefine is used by quite diverse communities interested in data manipulation and cleaning: librarians, researchers, data scientists, data journalists… and by Wikidata contributors. (See the results of OpenRefine’s early 2020 community survey for an overview of how the tool is used.) In 2019, OpenRefine joined Code for Science & Society, a non-profit organization that supports open source projects of public interest as a fiscal sponsor and advisor in the areas of strategy, fundraising, leadership development, and sustainability.

OpenRefine can be downloaded as an application and works on desktop and laptop computers with Windows, Mac and Linux operating systems. It runs a small server on your computer and you then use a web browser to interact with it. It works best with browsers based on Webkit, such as Google Chrome, Chromium, Opera and Microsoft Edge, and is also supported on Firefox.

OpenRefine has a graphical user interface which is available in more than 15 languages.

Since 2018, starting from OpenRefine version 3.0, Wikidata items can be created and edited directly in OpenRefine via an extension developed by Antonin Delpeuch. This functionality is extensively used: OpenRefine’s successive versions (3.0 and more recent) were directly responsible for at least 8 million edits on Wikidata (and probably many more, as at least part of the QuickStatements edits on Wikidata have been made possible via an export from OpenRefine as well). According to data from the EditGroups tool, more than 500 unique Wikidata users have used OpenRefine for editing Wikidata until March 2021.

An essential feature to make batch editing Wikidata possible is the so-called reconciliation functionality. A source dataset (for instance a spreadsheet with data about public sculptures in a city) usually has various columns of data which will correspond with entities on Wikidata (in that same example: many of the artists who created the public sculptures will have a Wikidata item). The reconciliation functionality in OpenRefine takes the (usually plain-text) data from such a column, and then runs an algorithm that looks up the corresponding Wikidata Qids for each entry (again in that same example: a value like Hanke-Förster, Ursula will be reconciled with Ursula Hanke-Förster (Q1248030)). This functionality is very heavily used in OpenRefine.
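The lookup described above can be sketched in a few lines of Python. This is only an illustration of the general shape of a batch reconciliation request and of picking a match from the candidates a service returns; it is not OpenRefine's actual implementation. The function names, the sample candidate list, the scores and the placeholder identifier `Q0` are assumptions made up for this example, and no network request is made.

```python
import json
from urllib.parse import urlencode


def build_reconciliation_request(names, type_id=None, limit=3):
    """Build the form-encoded body of a batch reconciliation request.

    Each cell value becomes one query, keyed q0, q1, ... so that the
    service can answer all of them in a single round trip.
    """
    queries = {}
    for index, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id  # optionally restrict matches to one type
        queries[f"q{index}"] = query
    return urlencode({"queries": json.dumps(queries)})


def best_match(candidates):
    """Return the id of the highest-scoring confident candidate, or None."""
    confident = [c for c in candidates if c.get("match")]
    if not confident:
        return None
    return max(confident, key=lambda c: c["score"])["id"]


# Hypothetical candidate list, shaped like one entry of a service response.
# The scores and the placeholder id "Q0" are invented for illustration.
sample_candidates = [
    {"id": "Q1248030", "name": "Ursula Hanke-Förster", "score": 100, "match": True},
    {"id": "Q0", "name": "Placeholder candidate", "score": 40, "match": False},
]

body = build_reconciliation_request(["Hanke-Förster, Ursula"], type_id="Q5")
print(body)
print(best_match(sample_candidates))
```

In OpenRefine itself, cells without a confident match are left for the user to review manually, which is why `best_match` returns `None` instead of guessing.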

An end-to-end solution for all data-related workflows

 
OpenRefine is a comprehensive tool that is able to cover the following steps in a typical Wikimedia batch workflow: Prepare data; Reconcile; Upload; Corrections.

OpenRefine was first designed as a data cleaning and conversion application. With OpenRefine, users can load, clean and prepare a dataset in any of a wide variety of formats. This includes popular formats like CSV, TSV, Excel (XLS/XLSX) and ODS (OpenOffice / LibreOffice) spreadsheets, JSON, and formats that are regularly used in the GLAM sector such as MARC, RDF and XML files. OpenRefine offers very powerful functionalities for this purpose, including filtering and many out-of-the-box data transformation features.

For Wikimedia upload and editing workflows this is quite valuable. It is usually quite a bit of work to ‘massage’ and clean source data in order to make it conform to Wikidata and Wikimedia Commons formats. Other upload and editing tools, such as QuickStatements, require data cleaning and preparation in a different environment first (often a spreadsheet application like Excel or OpenOffice Calc). With OpenRefine, only one tool needs to be used for the entire workflow of data cleaning, data conversion, reconciliation, and editing/uploading to Wikidata. We want to make this single-tool-for-one-workflow principle available for Wikimedia Commons as well.

In addition, OpenRefine has a graphical user interface. After a bit of training (usually one or two hours are sufficient), regular staff in cultural organizations can actively use it with collections data. On Wikidata, OpenRefine is already frequently used by collections professionals at GLAM institutions (such as archivists and librarians), allowing them to edit Wikimedia projects without needing to involve their organization's IT department.

In 2020 an important technical step was taken to make OpenRefine ready for Wikimedia Commons integration. Originally, it was only possible to edit Wikidata (via OpenRefine’s Wikidata extension, mentioned above); but during a 2020 Google Summer of Code internship, Lu Liu generalized the Wikidata-specific features to work with arbitrary Wikibase instances. This means that OpenRefine is now able to perform edits to any Wikibase. However, Wikimedia Commons editing and upload is not possible in OpenRefine yet. Wikimedia Commons uses Wikibase in a novel way which requires further adaptations, due to the introduction of federation to reuse Wikidata items and of a new entity type (MediaInfo). The project plan below intends to carry out these adaptations.

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

  1. By mid-2022, there will be a cross-platform, free and open source tool (OpenRefine) which allows Wikimedians to edit and upload large and diverse batches of files with structured data on/to Wikimedia Commons.
  2. By mid-2022, Wikimedians who are new to OpenRefine can quickly learn to use this tool on Wikimedia Commons, because they have access to good training materials and documentation.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:
  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

Goal 1: tool that allows batch editing and uploading SDC

Outputs (what we will do between July 2021 and June 2022):

  • We will build a Wikimedia Commons SDC reconciliation service.
  • We will develop OpenRefine functionalities that allow users to batch edit the structured data of diverse sets of files on Wikimedia Commons.
  • We will develop OpenRefine functionalities that allow users to batch upload files with structured data to Wikimedia Commons.

Outcomes (continued positive impact for the Wikimedia movement):

  • After August 2022, Wikimedians without coding skills can do large-scale editing and uploading projects with SDC thanks to the availability of OpenRefine as the first all-round, cross-platform and open source tool to support this functionality.
  • For the longer term, the new functionalities provided by OpenRefine strengthen Wikimedia’s strategic goals around offering knowledge as a service, as mentioned in the 2030 strategic direction. GLAM institutions and other external organizations that want to share free knowledge with the world are encouraged to contribute new media files with rich metadata to Wikimedia Commons because the upload process is streamlined via a tool that many of them already use, and because the addition of multilingual, structured data to their files significantly increases their discoverability and impact.

Goal 2: training materials and documentation

Outputs (what we will do between September 2021 and August 2022):

  • We will develop online training materials for Wikimedians who want to teach themselves to work with OpenRefine for editing and uploading SDC-enhanced files on Wikimedia Commons.
    • To feed this (permanently available) online training material, we will also organize a set of online webinars. The video recordings of these webinars will also be edited and made available online.
  • We will create documentation pages on using OpenRefine for editing and uploading SDC-enhanced files on Commons.

Outcomes (continued positive impact for the Wikimedia movement):

  • Wikimedians have access to clear documentation and training materials that allow them to get started with batch editing and uploading SDC-enhanced files on Wikimedia Commons.

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.

By the end of June 2022:

  • At least 20 individual people and/or organizations have provided input and feedback during the development process of OpenRefine’s new Wikimedia Commons features, including representatives of at least 3 international GLAM institutions.
  • At least 100,000 Wikimedia Commons files have received some new structured data in the first six months of 2022, thanks to OpenRefine’s new editing functionality released around the end of 2021. (This can be measured because the edits through OpenRefine will have a dedicated tag.)
  • At least 20 Wikimedians have edited structured data of media files on Wikimedia Commons, by using the newly developed editing functionality in OpenRefine in the first six months of 2022.
  • At least 30 individual people and/or organizations have participated in OpenRefine’s new feature webinars, including representatives of at least 3 GLAM institutions.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

This project will run for one year (12 months). If all vacant positions are filled in time, development will start in July 2021, and we intend to finish the project on June 30, 2022.

Software development

Software development for extending OpenRefine with SDC functionalities happens in two consecutive phases:

  1. Implementation of support of structured metadata editing;
  2. Building upon this new functionality: implementation of support for file uploads with structured data.

For more in-depth information on the software development tasks and their interdependencies, see this document: https://hackmd.io/dqq4TBogSxK95ezsTmOiFg?view

Phase 1: Structured metadata editing

This phase includes the following tasks. Time estimates are for one full time developer.

Duration | Who | Task
2 months | Wikimedia tool developer | Develop a reconciliation service for Wikimedia Commons which, just like the Wikidata reconciliation service, conforms to the Reconciliation Service API / protocol. This is an essential task needed in order to make other functionalities possible: it allows OpenRefine (and tools outside of OpenRefine) to take a list of file names from Wikimedia Commons and to convert these file names to their corresponding entity identifiers (“M numbers” or M-ids - the Wikimedia Commons equivalent of Q-ids). These M-ids are needed to perform further SDC operations. At this moment this can be done with the Minefield tool, but that tool does not follow the reconciliation protocol.
1 month | OpenRefine developer | Rework OpenRefine’s Wikibase extension to work with any entity type, including the MediaInfo entity type used on Wikimedia Commons.
2 months | OpenRefine developer | Add support for Wikibase federation in OpenRefine’s Wikibase extension, so that Wikidata items can be used in structured data generated for Commons.
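To make the file-name-to-M-id conversion in the first task more concrete, here is a minimal Python sketch. It assumes that an M-id is the letter M followed by the page ID of the file page, and that page IDs can be read from a standard MediaWiki `action=query` response. The helper names and the sample response (including page ID 12345 and both file names) are hypothetical and only illustrate the idea; the planned reconciliation service may work differently.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_pageid_query(file_names):
    """Build a MediaWiki API URL that looks up several file pages at once."""
    titles = "|".join(
        name if name.startswith("File:") else f"File:{name}"
        for name in file_names
    )
    params = {"action": "query", "format": "json", "titles": titles}
    return API_URL + "?" + urlencode(params)


def mids_from_response(response):
    """Map each existing file title to its M-id ("M" + page ID)."""
    mids = {}
    for page in response["query"]["pages"].values():
        if "pageid" in page:  # pages that do not exist carry no page ID
            mids[page["title"]] = f"M{page['pageid']}"
    return mids


# Hypothetical API response: one existing file and one missing file.
sample_response = {
    "query": {
        "pages": {
            "12345": {"pageid": 12345, "ns": 6, "title": "File:Example one.jpg"},
            "-1": {"ns": 6, "title": "File:Does not exist.jpg", "missing": ""},
        }
    }
}

print(build_pageid_query(["Example one.jpg", "Does not exist.jpg"]))
print(mids_from_response(sample_response))
```

A reconciliation service adds value on top of such a lookup by exposing it through the standard protocol, so that OpenRefine and other tools can use it without custom code.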

With this plan, users should be able to use OpenRefine to batch edit structured metadata on existing files in Commons. In addition, these new functionalities will also make it possible in the future to do the same on other instances of MediaWiki which add structured data to their media files.
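As a rough illustration of what federation means for the data itself, the sketch below builds a statement for a Commons file whose value is a Wikidata item, in the general shape of Wikibase's JSON statement format. It is a sketch under stated assumptions, not the extension's actual code: `P180` is assumed to be the "depicts" property, `M12345` is a made-up file identifier, and the payload keys `id` and `data` are modelled on the parameters of the `wbeditentity` API module.

```python
import json

DEPICTS = "P180"  # assumption: property ID of "depicts"


def item_statement(property_id, qid):
    """Build one statement whose value is a (federated) Wikidata item."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(qid[1:]),
                    "id": qid,
                },
            },
        },
    }


def edit_payload(mid, statements):
    """Bundle statements for one MediaInfo entity (hypothetical M-id)."""
    return {"id": mid, "data": json.dumps({"claims": statements})}


payload = edit_payload("M12345", [item_statement(DEPICTS, "Q1248030")])
print(json.dumps(payload, indent=2))
```

The entity being edited (the M-id) lives on Commons while the value (the Q-id) lives on Wikidata, which is the situation the federation task has to support.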

Phase 2: File upload

Building upon development done in the previous phase, OpenRefine will then be extended with functionalities to upload files with structured data to Wikimedia Commons. We plan to support both upload from a local drive (i.e. from a user’s hard drive) and from external URLs.

This is a very new type of functionality for OpenRefine, for various reasons:

  • Uploading files to a platform has not been part of any OpenRefine feature so far.
  • Files on Wikimedia Commons are different from Wikidata items in various ways. One major difference is that, apart from structured data statements, files also still need basic Wikitext as part of their description and upload.

Duration | Who | Task
2 months | OpenRefine developer | Rework OpenRefine’s Wikibase extension to support distinct entity types with their own specific fields (media file and wikitext for MediaInfo, datatype for properties). Users can choose which entity type they want to edit when creating a schema. (In contrast, task 2 of phase 1 only presents them with a single generic interface to edit the statements and terms of entities.)
2.5 months | OpenRefine developer + Wikimedia tool developer | Develop export and upload functionalities (from hard drive or from URL). Two scenarios are possible:
  • Directly uploading files from OpenRefine. This requires embedding a Java library to interface with MediaWiki’s file upload API.
  • We are also considering designing and offering ‘upload packages’ in an interoperable, re-usable format that can be used by other tools as well. Possibly, upload will then not be performed by OpenRefine, but by an external tool, either built as part of this project, and/or built by another (volunteer) developer.
0.5 months | OpenRefine developer | Add quality assurance constraints specific to MediaInfo entities (validity of filenames, checks on Wikitext, statement values should all be existing items and not new ones…). This can be done progressively, as issues are discovered during testing.
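The kind of quality assurance checks listed in the last task could look roughly like the Python sketch below. The specific rules are illustrative assumptions, not the constraints that will actually be implemented: the set of characters rejected in file names, the 240-byte length limit, and the pattern for existing item identifiers are all chosen for this example, and the function name is hypothetical.

```python
import os
import re

# Assumptions for illustration only; the real constraints may differ.
ILLEGAL_TITLE_CHARS = set("#<>[]|{}")
MAX_FILE_NAME_BYTES = 240
EXISTING_ITEM = re.compile(r"^Q[1-9][0-9]*$")


def check_row(file_name, wikitext, statement_values):
    """Return a list of human-readable issues for one row of an upload batch."""
    issues = []

    bad_chars = sorted(ch for ch in set(file_name) if ch in ILLEGAL_TITLE_CHARS)
    if bad_chars:
        issues.append(f"file name contains illegal characters: {''.join(bad_chars)}")

    if not os.path.splitext(file_name)[1]:
        issues.append("file name has no extension")

    if len(file_name.encode("utf-8")) > MAX_FILE_NAME_BYTES:
        issues.append("file name is too long")

    if not wikitext.strip():
        issues.append("wikitext is empty")

    for value in statement_values:
        if not EXISTING_ITEM.match(value):
            issues.append(f"statement value {value!r} is not an existing item ID")

    return issues


print(check_row("Sculpture [1].jpg", "", ["Q1248030", "new item"]))
print(check_row("Sculpture.jpg", "== Summary ==", ["Q1248030"]))
```

Returning a list of issues per row, instead of stopping at the first problem, lets a user fix a whole batch in one pass before uploading.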

The two scenarios for the export functionalities are inspired by the two different ways OpenRefine users can currently upload data to Wikidata, either directly from OpenRefine or via QuickStatements. We will assess during the project whether to support one or both of such upload methods.

Community consultation, outreach and documentation

Duration | Who | Task
Continuous | Product manager (Sandra Fauconnier) | Survey user needs, collect use cases, and communicate progress with the Wikimedia community. Throughout the project, the project lead will continuously inform the Wikimedia Commons, Wikidata and GLAM-Wiki communities of relevant new developments and features, and will ask for input where needed.
2 months, at end of development phase 1 | Product manager (Sandra Fauconnier) | Produce documentation and training materials
  • Create documentation on how to edit SDC of Commons files with OpenRefine, as part of OpenRefine’s own documentation and on commons.wikimedia.org.
  • Organize an online webinar to demonstrate the new feature. The recording of that webinar will be edited into modular, translatable videos, used to illustrate the training materials and online documentation.
2 months, at end of development phase 2 | Product manager (Sandra Fauconnier) | Produce documentation and training materials
  • Create documentation on how to upload Commons files with structured data via OpenRefine, as part of OpenRefine’s own documentation wiki and on commons.wikimedia.org.
  • Organize an online webinar to demonstrate the new feature. The recording of that webinar will be edited into modular, translatable videos, used to illustrate the training materials and online documentation.

Budget

How you will use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

Budget table

Role / task | Weeks | Hrs/wk | Hourly rate | Subtotal
Contractors
Product manager | 50 | 10 | $40.00 | $20,000.00
OpenRefine developer | 35 | 30 | $40.00 | $42,000.00
Wikimedia developer | 20 | 20 | $40.00 | $16,000.00
Other direct costs
Additional developer honoraria | | | | $4,500.00
Meeting costs | | | | $1,500.00
Wire fees ($100 x 5 transfers) | | | | $500.00
Project leadership, administration, accounting, strategic support (15% of budget total) | | | | $14,911.76
Total Grant | | | | $99,411.76
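The totals in the budget table can be checked with a short calculation. The Python sketch below recomputes them from the figures in the table, under the assumption that the 15% line is 15% of the grand total (so that direct costs make up the remaining 85%); the variable names are just for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# (role, weeks, hours per week, hourly rate in USD), taken from the table
contractors = [
    ("Product manager", 50, 10, Decimal("40.00")),
    ("OpenRefine developer", 35, 30, Decimal("40.00")),
    ("Wikimedia developer", 20, 20, Decimal("40.00")),
]
other_direct_costs = [Decimal("4500"), Decimal("1500"), Decimal("500")]

contractor_total = sum(weeks * hours * rate for _, weeks, hours, rate in contractors)
direct_costs = contractor_total + sum(other_direct_costs)

# Direct costs are 85% of the grant if the CS&S share is 15% of the total.
cent = Decimal("0.01")
grant_total = (direct_costs / Decimal("0.85")).quantize(cent, ROUND_HALF_UP)
leadership_share = grant_total - direct_costs

print(contractor_total)   # 78000.00
print(direct_costs)       # 84500.00
print(leadership_share)   # 14911.76
print(grant_total)        # 99411.76
```

`Decimal` is used instead of floating point so that the cent-level rounding matches the amounts stated in the table.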

Budget rationale and tasks per role

  • Product management and community engagement
    • The hourly rate of USD $40.00/hr is based on typical hourly rates for freelance workers in the cultural sector in the Netherlands (where the product manager is based).
    • Tasks:
      • Heads the project, running regular meetings with project members
      • Surveys user needs and makes sure these are met by the new integration
      • Acts as Wikimedia community liaison throughout the project’s development
      • Produces documentation, publicizes the project and coordinates with other organizational stakeholders
      • Reports to the funder
  • OpenRefine development
    • As this position is still vacant, the hourly rate of USD $40.00/hr is an estimate. It's on the low side for software developers in Western Europe and the United States, but may be sufficient for developers based in other regions of the world. If this project grant is funded, we will run an open application process for this position through various channels, including relevant Wikimedia mailing lists. If the most suitable candidate is based in a country where higher hourly rates are customary, we will use budget from the 'Additional developer honoraria and contingency funds' budget line to fit that situation.
    • Tasks (this position is still vacant):
      • Develops the integration on OpenRefine's side
      • Coordinates with the product manager and the rest of the OpenRefine dev team
  • Wikimedia development
    • As this position is still vacant, the hourly rate of USD $40.00/hr is an estimate. It's on the low side for software developers in Western Europe and the United States, but may be sufficient for developers based in other regions of the world. If this project grant is funded, we will run an open application process for this position on various Wikimedia mailing lists. If the most suitable candidate is based in a country where higher hourly rates are customary, we will use budget from the next budget line ('Additional developer honoraria and contingency funds') to fit that situation.
    • Tasks (this position is still vacant):
      • Develops the reconciliation service for Commons and a batch upload tool (in QuickStatements or as a new tool)
      • Coordinates with the product manager and the rest of the OpenRefine dev team
  • Additional developer honoraria and contingency funds
    • As mentioned just above, we have planned this buffer in the budget to account for potential higher hourly rates of the still-to-be-recruited Wikimedia developer. Otherwise, this budget line will act as a contingency buffer for other unforeseen costs, for instance for additional development hours that may be needed.
  • Meeting costs
    • This budget line will be used for costs incurred for webinars, workshops and presentations related to OpenRefine-SDC training and outreach. It can be partly used for registration costs of external conferences and, if physical gatherings become possible again in 2022, for travel to and participation in one or two (nearby) physical workshops and events.
  • Wire fees
    • International wire transfer fees for contractor payments and other grant-related money transfers within, to and from the United States.
  • Project leadership, administration, accounting, strategic support
    • Code for Science & Society (CS&S) supports its sponsored open source projects in the following areas, for which in each project's grant, 15% of the total budget is reserved.
      • CS&S takes care of grant and financial administration, accounting, and administrative support.
      • CS&S organizes project leadership and strategic support, including regular reviews of grant progress, team management, and strategic project development. CS&S initiates and organizes regular meetings with OpenRefine's advisory and steering committees to look at longer-term planning and sustainability, and to explore and pursue new funding opportunities.

Notes

The budget above has been updated on May 27, 2021 in consultation with the Wikimedia Foundation's grants team and the grants committee.

Originally, we had very strongly hoped to work with a specific, experienced and proven OpenRefine developer based in China. However, it turns out that it is legally not possible for us to hire someone from China (some background here). So the position of OpenRefine developer is now a vacancy.

We're quite confident that we'll be able to recruit someone who can do this work very well. However, hourly rates for a developer based in another part of the world (for instance in Africa, as suggested by the grants committee on the talk page of this proposal) may be higher than the originally budgeted USD 20/hr. After consultation with the Wikimedia Foundation's grants team, and reasonably within the allowed margins of this grants program, the budget has been updated in the following ways:

  • The hourly rate for all developers is now budgeted at 40 USD/hr, which is more in line with average international rates for a junior developer.
  • We removed the budget line for professional video editing. Video recordings and documentation will still be produced, but in an unedited / raw state.
  • We slightly reduced the planned number of hours per week for both developers. This means that, depending on the experience of the developers we hire, we may not be able to develop some of the most advanced features that we planned. We still plan to develop all basic functionalities that belong to OpenRefine itself, but export functionalities for other tools (like QuickStatements) may be skipped if the developers are not able to create this functionality in the allocated (reduced) hours.

-- SFauconnier (talk) 15:25, 27 May 2021 (UTC)

Community engagement

Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?

For an overview of tasks that we have planned for this purpose, see above, in the section Community consultation, outreach and documentation under Project plan.

We plan to communicate and work with relevant communities via the channels where they are already active.

  • As soon as this project starts, we will create a project page about OpenRefine on Wikimedia Commons, with information about progress and (as soon as features are deployed) with documentation. The talk page of this landing page will be monitored for conversations with Wikimedians and specifically Wikimedia Commons contributors. Furthermore, this community will be informed, kept up to date, and engaged via their usual communications channels (mailing lists, Village Pump, Telegram group, IRC channel).
  • The GLAM-Wiki community will be informed and notified via their usual communications channels (mailing lists, This Month in GLAM newsletter, Telegram groups, Facebook groups), and will be invited to test the software, provide feedback, and participate in training webinars.
  • For the Wikimedia developer community, we will use Phabricator to keep track of Wikimedia-specific tasks and dependencies. (OpenRefine development is mainly tracked on GitHub, but we will create pointers from/to both platforms in order to promote discoverability of work done on each side.) We will also keep an eye on upcoming (virtual) Wikimedia hackathons and participate in them if it makes sense.
  • As is the case with other development of OpenRefine features, GitHub will be used for OpenRefine-specific issue tracking, product management and documentation, making sure that the OpenRefine developer and user community can stay up to date with, and check, new developments. The OpenRefine mailing list will also be used for communication about updates and new features.
  • The product manager will keep an eye on planned (virtual) conferences in the GLAM sector, especially at the end of the development period, and, if possible and relevant, will propose workshops there to promote and demonstrate OpenRefine’s new SDC features.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Anyone is free to sign up to the project as a volunteer or advisor. This does not result in any obligations! We see this as a sign that you are more interested than average, and are OK with, for instance, being directly approached on your talk page when new features can be tested and if we have questions or assumptions that we would like to check with the community.

  • Code for Science and Society, Inc. (CS&S) is a US-based 501(c)(3) non-profit organization that hosts the OpenRefine project. It is responsible for the accounting and administrative support for the project, and helps its sponsored projects in the areas of strategy, fundraising, leadership development, and sustainability.
  • Sandra Fauconnier is a freelance project lead for digital projects in the cultural sector. As a Wikimedia volunteer she is mainly active on Wikidata and Wikimedia Commons. From 2017 until 2020 she was also part of the Structured Data on Commons team at the Wikimedia Foundation, where she worked on GLAM pilot projects (including the ISA Tool) and documentation.
  • Antonin Delpeuch has been working on OpenRefine since 2017 and was behind the first Wikidata integration in the tool. He is now working in other areas (scalability, packaging, reproducibility).
  • Advisor The GLAM & Culture team at the Foundation is keen to support this project by coordinating pilots and documenting work flows and case studies. FRomeo (WMF) (talk) 11:53, 16 March 2021 (UTC)
  • Volunteer OR rules data Ecritures (talk) 12:19, 16 March 2021 (UTC)
  • Volunteer Happy to help with testing! DatamuseAttitude (talk) 19:05, 13 June 2022 (UTC)

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?

We notified the following mailing lists:

Village pumps and talk pages:

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  •   Support As a member of the OpenRefine steering committee, Schema.org community, and advocate of structured and linked data, I highly support this effort. GLAM institutions I have talked to are very excited about the future of sharing their collections, and much more importantly, the data they have collected about them over the years. Without good metadata, appropriate media files are impossible to find for many journalists, scientists, and researchers. Making this easier for non-coding skilled contributors in GLAM institutions will provide a wealth of incoming metadata to enhance SDC and its visibility. Thadguidry (talk) 13:17, 15 March 2021 (UTC)
  •   Support obviously, we very much need a tool to upload structured data on Commons and since OpenRefine is already known and used for Wikidata, it's a perfect match (pun intended). Cheers, VIGNERON * discut. 09:14, 16 March 2021 (UTC)
  •   Support OpenRefine is too good to go. Enock4seth (talk) 09:22, 16 March 2021 (UTC)
  •   Support OpenRefine is a top tool and would become even better when it supports structured data on commons uploads! Beireke1 (talk) 09:54, 16 March 2021 (UTC)
  •   Support This fills a structural need that isn't really filled by other tools and apps. —seav (talk) 09:56, 16 March 2021 (UTC)
  •   Support Yess! This is the #1 thing that can be done to promote broader use of Structured Data on Commons. I wholeheartedly support! – Susanna Ånäs (Susannaanas) (talk) 09:58, 16 March 2021 (UTC)
  •   Support We are missing good tools for working with SDC. OpenRefine is well-positioned to become one! Thanks for this proposal and I really hope it gets funded.--10:04, 16 March 2021 (UTC)
  •   Support Any improvement to OpenRefine is welcome! Ayack (talk) 10:05, 16 March 2021 (UTC)
  •   Support A vital and overdue step forward. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:11, 16 March 2021 (UTC)
  •   Support This is so much needed. Please. Do it. Sannita (talk) 10:14, 16 March 2021 (UTC)
  •   Support Some collections (for instance archaeological ones) have a place on Wikimedia Commons though not necessarily on Wikidata but would benefit greatly from structured data to make them searchable. OpenRefine is user friendly and therefore realistic for non-technical profiles in the heritage sector. Essential for structured data on Wikimedia Commons going forward. Sam.Donvil (talk) 10:33, 16 March 2021 (UTC)
  •   Support I agree with the argument right above mine, but would like to mention it also applies to paleontology collections on Commons (and possibly also zoology?), as each object also has a bibliography attached to it, and overarching data (higher taxons, etc.).--Flor WMCH (talk) 11:17, 16 March 2021 (UTC)
  •   Support OpenRefine is a great tool and SDC compatible Commons uploader is simply missing. Jklamo (talk) 11:47, 16 March 2021 (UTC)
  •   Strong support Cannot happen fast enough. It seems like the WM Foundation should just fund this, and not put it through a long grant process. Prburley (talk)
  •   Strong support I - as a linked data specialist in a large cultural heritage institution - would rather have this option today than tomorrow. This would allow CHIs to bulk upload all their data to the SDoC. Ecritures (talk) 12:22, 16 March 2021 (UTC)
  • I think it would be useful to give large uploaders like myself a way to replicate metadata already in Wikidata for the same files on Commons. I have uploaded many more files than are used on Wikidata and it would be nice to make those findable for other language speakers as well. Jane023 (talk) 13:44, 16 March 2021 (UTC)
  •   Strong support I would love to see better batch uploading tools in general - I've only done it once and it went terribly :). Mvolz (talk) 14:18, 16 March 2021 (UTC)
  • This would have been a game changer for me uploading batches of materials in Wikimedia Commons and I would love to take advantage of it in future. SDC could be so powerful but it has to get easier to add for that to be the case, and it makes sense to optimize other tools that we're already using for Wikidata in order to do it. I love Pattypan but the fewer workflows to create, the better. Librarian lena (talk) 15:01, 16 March 2021 (UTC)
  •   Support As a core committer for the OpenRefine project and long time Wikidata contributor, I'm pleased to see additional applications of OpenRefine in the Wikimedia community. Tfmorris1 (talk) 15:44, 16 March 2021 (UTC)
  •   Support absolutely necessary and helpful! --Elya (talk) 15:58, 16 March 2021 (UTC)
  •   Support This looks like a much needed project. Fjjulien (talk) 16:25, 16 March 2021 (UTC)
  •   Strong support this would be an extremely useful addition! We talked about it before and I'm very happy to see this proposal. Multichill (talk) 16:38, 16 March 2021 (UTC)
  •   Strong support Very important feature. Raymond (talk) 17:20, 16 March 2021 (UTC)
  •   Support Great idea! E.Doornbusch (talk) 17:32, 16 March 2021 (UTC)
  •   Support ProtoplasmaKid (talk) 17:34, 16 March 2021 (UTC)
  •   Support Tool support for creating and editing SDC is sorely needed. With many ~volunteers accustomed to using OpenRefine to batch create/edit on Wikidata I think it also comes with a built in user base. DivadH (talk) 17:43, 16 March 2021 (UTC)
  •   Support SDoC has great potential, but is also needy on the technical side. It requires both serious support, and a thoughtful approach leading to an adequate infrastructure. The use cases here are clear. Charles Matthews (talk) 17:47, 16 March 2021 (UTC)
  •   Support extremely useful addition to our toolkit. - PKM (talk) 18:13, 16 March 2021 (UTC)
  •   Strong support As someone who uses Wikidata, Wikibase and OpenRefine for GLAM projects, I believe this would be an incredibly powerful addition to an already great tool to support the work of GLAMs in making their collections more accessible, more (re)usable for research, and to generally improve the Wikidata-WikiCommons interconnections. I want to acknowledge that I am a member of the OpenRefine Steering Committee, but as a digital humanities scholar with experience in the GLAM field, I think this proposal would be valuable to a wide range of organizations and individuals. Loz.ross (talk) 18:36, 16 March 2021 (UTC)
  •   Support Jeroen De Dauw (talk) 19:13, 16 March 2021 (UTC)
  •   Support Tools to create or edit SDC are desperately needed. I definitely support this initiative. Ambrosia10 (talk) 19:53, 16 March 2021 (UTC)
  •   Support Full support. Christian Ferrer (talk) 19:59, 16 March 2021 (UTC)
  •   Support As a non-coding Wikidata contributor, I love OpenRefine for what it can let me do in bulk. Adding in SDC to the potential contributions would be fantastic.--DrThneed (talk) 20:12, 16 March 2021 (UTC)
  •   Support Any effort to give non-coding folks a tool to add SDC data should be applauded, so yes please! Husky (talk) 20:17, 16 March 2021 (UTC)
  •   Support Enhancing OpenRefine is, in my opinion, also a contribution to a larger and more diverse group of contributors, as OpenRefine has a well-built user interface. —The preceding unsigned comment was added by Simulo (talk) 20:55, 16 March 2021
  •   Support We need tools for SDoC, and OpenRefine has already proven to be invaluable for Wikidata editing. Extending it for Wikimedia Commons makes a lot of sense. Jean-Fred (talk) 21:32, 16 March 2021 (UTC)
  •   Support A very worthwhile project. The upload ability would be very nice. But perhaps what is even nicer would be the ability to get the M-ids and the values of selected properties from SDC into OpenRefine, which could also be very useful for a batch downloader-and-renamer; as well as for projects to assess and improve range and consistency of SDC content (eg based on retrieving M-ids and accession numbers from SDC or information pages, as keys to then upload additional catalogue metadata). If the reconciliation service could extract and provide information from Commons file description templates as well as SDC, that would also be nice. OpenRefine is absolutely the right platform to choose for developing this capability; but the advantage of the wiki-side reconciliation service is that that will be a general capability that will then be available for use by other approaches as well -- eg GoogleSheets, PAWS notebooks, pywikibot scripts, whatever Lucas or Magnus develop next, etc. The proposal seems technically realistic and very achievable, within the rapid timeframe proposed. My only reservation is the experience that OpenRefine can sometimes silently drop reconciliations if they time out, without signalling the fact. This is a long-standing issue with OpenRefine, which I do not expect this project to be able to address. Users just need to be aware that certain reconciliation returns may be incomplete, and adapt their workflows accordingly. But we manage to live with it for Wikidata, so we surely could live with it on Commons as well. All in all a very important proposal, which will greatly widen the number of people able to contribute to SDC at scale, as well as being very valuable to GLAMs and other at-scale providers of content and metadata. Very strongly recommended. Jheald (talk) 22:19, 16 March 2021 (UTC)
  •   Support Very cool and necessary project! Scann (talk) 22:23, 16 March 2021 (UTC)
  •   Support Jmmuguerza (talk) 03:35, 17 March 2021 (UTC)
  •   Support seems hot --Valerio Bozzolan (talk) 07:05, 17 March 2021 (UTC)
  •   Support This will benefit both Structured Data on Commons and OpenRefine. An easy way of publishing correct metadata next to the images --Alina data (talk) 08:30, 17 March 2021 (UTC)
  •   Support Seems to be a very practical and useful proposal! --Papuass (talk) 10:09, 17 March 2021 (UTC)
  •   Support Logical next step in the SDC/Wikidata chain of developments and discontinuation of the GWToolset OlafJanssen (talk) 10:45, 17 March 2021 (UTC)
  •   Strong support OpenRefine is one of the best and most widespread data cleaners, partly because of its interaction with the Wikidata reconciliation endpoint. The workflow is working well (except maybe for speed) and should be made available wherever necessary/useful. This will also benefit the growing community of Wikibase users. Johentsch (talk) 15:52, 17 March 2021 (UTC)
  •   Strong support I'm an OpenRefine user and contributor since 2019, and it is a great tool for Wikidata, in a world with not so many options. Having the ability to do the same for Wikimedia Commons is great. It's actually a must… Antoine2711 (talk) 20:22, 17 March 2021 (UTC)
  •   Support Will make it easier to mass-add structured data --Nintendofan885 (talk) 23:22, 17 March 2021 (UTC)
  •   Support --Zache (talk) 12:25, 19 March 2021 (UTC)
  •   Support Yes please. Thanks for proposing. Simon Cobb (Sic19 ; talk page) 19:11, 19 March 2021 (UTC)
  •   Support -- Regards, ZI Jony (Talk) 08:19, 20 March 2021 (UTC)
  •   Support ESM (talk) 11:15, 20 March 2021 (UTC)
  •   Support CaptSolo (talk) 13:09, 23 March 2021 (UTC)
  •   Support Sorely needed. Nashona (talk) 22:02, 23 March 2021 (UTC)
  •   Support Wikimedia Sverige is very positive about this proposal receiving funding. The new functionalities outlined will be very valuable for a lot of the work we are planning. We hope to be able to actively support and coordinate with this project in different ways. John Andersson (WMSE) (talk) 17:22, 24 March 2021 (UTC)
  •   Support Excellent idea and a real need among GLAM structures. Should facilitate the desire from GLAMs to use OpenRefine and make Wikidata and Wikimedia Commons real friends. Xavier Cailleau WMFr (talk)
  •   Support I've spent some time working on documentation for SDC and this tool will be extremely helpful especially if it includes documentation which does not assume too much prior knowledge. John Cummings (talk) 09:27, 11 April 2021 (UTC)
  •   Support Being able to use OpenRefine to batch edit Commons would be great. And improving the documentation will make that task accessible to more users. Simon.letort (talk) 15:05, 11 April 2021 (UTC)
  •   Support cheap tool development by proven team. SDC is hobbled without this functionality. Slowking4 (talk) 23:55, 14 April 2021 (UTC)
  •   Support KCVelaga (talk) 15:14, 21 April 2021 (UTC)
  •   Support As an OR user and an SDC contributor, of course, yes! That would bring so much power to the editors. -- Bodhisattwa (talk) 01:35, 22 April 2021 (UTC)
  •   Support of course :) –SJ talk  16:11, 7 July 2021 (UTC)
  •   Support Wholeheartedly :)-Vinayaraj (talk) 13:24, 18 January 2022 (UTC)