Grants:Project/Diegodlh/Web2Cit: Visual Editor for Citoid Web Translators/Timeline


Timeline for Diegodlh

Timeline | Date
Development milestone 1: Publish Web2Cit mockup (DONE) | 27 August 2021
Development milestone 2: Release Web2Cit translation engine (now "core library") v1 (DONE) | 14 February 2022
Development milestone 3: Release Web2Cit translation API (now "translation server") v1 (DONE) | 14 February 2022
Development milestone 4: Release Web2Cit frontend (aka "editor") v1 (DONE) | 17 June 2022
Development milestone 5: Release Web2Cit translation cache (now "monitor") v1 | 30 September 2022
Development milestone 6 (optional): Release Wikipedia User script (DONE) | 17 June 2022
Development milestone 7: Publish development documentation | 30 September 2022
Research milestone 1: Publish up-to-date Citoid coverage gap data & report | 30 September 2022
Research milestone 2: Release Citoid coverage gap estimator script | 31 July 2022


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

Month 1 (July)

  • Started meetings to organize the project between Diego & Scann.
  • Contacted potential researchers for the research component of Web2Cit, which will help establish a baseline for understanding the project's impact.
  • Issued a call for members for the Advisory Board (see "Call for members"). The call was shared through several communication channels, including:
    • personal social media (our Twitter accounts)
    • outreach to specific people previously identified with Diego
    • Village Pump
    • Café (Spanish Wikipedia)
    • Mailing lists
  • Started organizing the Meta page for the project.

Month 2 (August)

  • Communications & Community:
    • Closed Advisory Board call for members. Opened Board's mailing list and collected time availability for meetings.
    • Called first two meetings of the Advisory Board: September 14th (non-technical profiles) & September 15th (technical profiles).
  • Development:
  • Research:
    • Confirmed the members of the research group.
    • Had initial team meetings and agreed on the research subproject's deliverables (in Spanish).
    • Familiarized ourselves with citation templates and defined an initial corpus of Wikipedia articles.
    • Started development of the script to automatically evaluate the Citoid coverage gap.

Month 3 (September)

  • Communications & Community:
    • Held first two meetings of the Advisory Board: non-technical and technical meetings.
    • Active discussions in the Board's mailing list (27 posts by 5 participants in the last 30 days).
    • We created a pre-recorded video to present Web2Cit in Spanish and in English at WikiConference North America 2021.
  • Development:
    • Web2Cit umbrella project tag was requested in Phabricator.
  • Research:
    • Started discussing and considering limitations and alternatives concerning methodological assumptions within the team and with the Advisory Board.
    • Started exploring tools and strategies to extract metadata from citation templates.

Month 4 (October)

  • Communications & Community:
  • Project management:
    • Web2Cit umbrella project tag was created in Phabricator to keep track of open tasks.
  • Research:
    • Created git repository, including work-in-progress Jupyter Notebook: https://github.com/hdcaicyt/Web2Cit-research
    • Developed script to (1) fetch wikitext from a list of Wikipedia articles and (2) extract citation metadata from their citation templates.
    • Ongoing discussions concerning alternatives to get accurate citation metadata from Wikipedia articles.
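
Step (2) of the script mentioned above, extracting citation metadata from citation templates, could look roughly like the following. This is a minimal sketch and not the research group's actual code: the function name `extract_citations`, the regular expression, and the sample wikitext are all assumptions made for illustration. It only handles flat "cite ..." templates with named parameters, and it breaks on nested templates or on values that contain a pipe character.

```python
import re
from typing import Dict, List

# Hypothetical pattern: matches flat templates such as {{cite web |url=... |title=...}}.
# Nested templates (braces inside the template body) are deliberately not supported.
TEMPLATE_RE = re.compile(r"\{\{\s*(cite \w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def extract_citations(wikitext: str) -> List[Dict[str, str]]:
    """Return one dict of named parameters per citation template found."""
    citations = []
    for match in TEMPLATE_RE.finditer(wikitext):
        params = {"template": match.group(1).lower()}
        for part in match.group(2).split("|"):
            name, sep, value = part.partition("=")
            if sep:  # skip positional (unnamed) parameters
                params[name.strip()] = value.strip()
        citations.append(params)
    return citations


# Made-up sample wikitext, used only to exercise the sketch.
sample = (
    "Text.<ref>{{cite web |url=https://example.org/a "
    "|title=Example |date=2021-08-27}}</ref>"
)
citations = extract_citations(sample)
print(citations)
```

A dedicated wikitext parser would be more robust than a regular expression for real articles; the sketch only shows the shape of the extracted data.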

Month 5 (November)

  • Communications & Community:
    • Held second Advisory Board meeting on November 16.
  • Project management:
    • Requested changes in project's timeline.
  • Research:
    • Presented research subproject to the Advisory Board.
    • Discussed and agreed on strategy to support extracting citation templates from different Wikipedia languages.
    • Discussed and agreed on strategy to improve reliability of extracted data, using highlighted articles.

Month 6 (December)

Month 7 (January)

  • Communications & Community:
    • Published mockup presentation video, available at https://meta.wikimedia.org/wiki/Web2Cit
    • Had an Advisory Board meeting. From now on, Advisory Board meetings will be published on YouTube (unlisted) to document some software development decisions in a format that is easier to follow than written documentation. We're preparing the video of January's meeting.
  • Development:
    • Created Wikimedia Gitlab source code repositories for Web2Cit Core and Web2Cit Server components.
    • Created the web2cit tool account in Toolforge, which will probably serve (at least) the translation API from https://web2cit.toolforge.org/, and the static frontend from https://static.wmflabs.org/web2cit/
    • Web2Cit Core library development
      • initial HTTP and Citoid caching of target URLs
      • basic selection steps, including Citoid and XPath selections.
      • basic transformation steps, including Join, Split, Date, and Range transformations.
      • translation procedures support (i.e., sequences of multiple selection and transformation steps)
  • Research:
    • Created a mapping from Zotero (i.e., Citoid) fields to Web2Cit fields, which will allow us to compare the Citoid response against the citation metadata collected from featured articles.
    • Moved the automated script to Wikimedia PAWS for better performance, with time improvements ranging from 67% to 99%.
    • Downloaded wikitext of 10.5k featured articles from 4 language Wikipedias.
    • Managed to extract >450k references with URL from 94% of these articles (~50 references per article).
    • Fetched Citoid response for first ~10k URLs collected. Optimizations pending.
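
The translation procedures listed under Development above (sequences of selection and transformation steps) can be pictured with a small sketch. This is not the Web2Cit Core API: the function names, the mocked Citoid response, and the rule that selection outputs are concatenated before the transformations run in order are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Assumed shapes: a selection reads values from a (mocked) Citoid response,
# a transformation maps a list of strings to a new list of strings.
Selection = Callable[[Dict[str, List[str]]], List[str]]
Transformation = Callable[[List[str]], List[str]]


def citoid_selection(field: str) -> Selection:
    """Hypothetical selection step: pick one field from the response."""
    return lambda response: list(response.get(field, []))


def split(separator: str) -> Transformation:
    """Hypothetical Split step: split every input string on the separator."""
    return lambda values: [p.strip() for v in values for p in v.split(separator)]


def join(separator: str) -> Transformation:
    """Hypothetical Join step: merge all input strings into one string."""
    return lambda values: [separator.join(values)]


def run_procedure(
    response: Dict[str, List[str]],
    selections: List[Selection],
    transformations: List[Transformation],
) -> List[str]:
    """Concatenate the selection outputs, then apply each transformation in order."""
    values = [v for select in selections for v in select(response)]
    for transform in transformations:
        values = transform(values)
    return values


# Made-up response standing in for what Citoid might return for a target URL.
mock_response = {
    "author": ["Doe, Jane; Roe, Richard"],
    "title": ["An example page"],
}
authors = run_procedure(mock_response, [citoid_selection("author")], [split(";")])
print(authors)
```

Modelling each step as a function from a list of strings to a list of strings keeps the steps composable, which is one plausible way to support procedures of arbitrary length.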

Month 8 (February)

  • Communications & Community:
  • Development:
    • Completed development of our core library's initial version (development milestone 2). Briefly:
      • Integration of translation procedures into template fields, including procedure output validation.
      • Definition of Domain Configuration class, including methods to fetch and manage collaboratively-defined configuration revisions from our repository in Meta.
        • Definition of Template Configuration subclass, including method to translate a web target with a series of translation templates.
        • Definition of Pattern Configuration subclass, including method to sort URLs into URL path pattern groups.
      • Integration of all submodules into top-level Domain class
    • Initial translation server made available at https://web2cit.toolforge.org (development milestone 3, switched with development milestone 4 to allow for earlier testing), exposing the current capabilities of the core library. This includes:
      • target translation using manually defined translation templates and URL path patterns as described in our resources for early adopters;
      • translation results are included as embedded metadata ready to use by Wikipedia's automatic citation generator.
  • Research:
    • Optimization of high-volume requests to Citoid, including
    • Continued improvement of our collaborative citation template list, including addition of several parameter aliases.
    • Estimation of proportion of references excluded from our analysis; that is, inserted without using citation templates.
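
Sorting URLs into URL path pattern groups, as mentioned for the Pattern Configuration subclass above, might work along these lines. This is a hedged sketch only: the function `group_by_pattern`, the use of Python's `fnmatch` glob matching, the first-match-wins rule, and the catch-all fallback group are assumptions, not a description of how Web2Cit Core implements it.

```python
from fnmatch import fnmatchcase
from typing import Dict, List
from urllib.parse import urlsplit


def group_by_pattern(
    urls: List[str], patterns: List[str], fallback: str = "**"
) -> Dict[str, List[str]]:
    """Assign each URL to the first pattern its path matches, else to the fallback."""
    groups: Dict[str, List[str]] = {pattern: [] for pattern in patterns}
    groups.setdefault(fallback, [])
    for url in urls:
        path = urlsplit(url).path
        # Note: in fnmatch, "*" also matches "/" characters.
        matched = next((p for p in patterns if fnmatchcase(path, p)), fallback)
        groups[matched].append(url)
    return groups


# Made-up URLs and patterns for illustration.
urls = [
    "https://example.org/news/2022/item",
    "https://example.org/blog/post",
    "https://example.org/about",
]
groups = group_by_pattern(urls, ["/news/*", "/blog/*"])
print(groups)
```

Grouping by path pattern lets different translation templates apply to different sections of the same domain, with the fallback group catching everything else.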

Month 9 (March)

  • Communications & Community:
  • Development:
    • Recorded Web2Cit core library architecture meeting as video documentation, including overview UML file.
    • Added selection and transformation step types:
      • Fixed selection
      • Match transformation
    • Implemented the sandbox endpoint of the translation server, and updated the early adopter guidelines accordingly.
  • Project management
    • Wrote and submitted the project's midpoint report.
  • Research:
    • Submitted preliminary results to WikiWorkshop 2022 (our work was accepted and will be presented on April 25, 2022).
    • We continued working on URL validation to minimize unnecessary requests to Citoid (see T301519).
    • Started planning the details of how we will compare Citoid responses vs (presumably accurate) extracted metadata, including
      • mapping fields in Citoid responses to Web2Cit fields of interest, and
      • comparison strategies for fields where:
        • both the extracted metadata and the Citoid response are arrays of strings, or
        • the extracted metadata is an array of strings, and the Citoid response is a single string.
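
The two comparison cases listed above could be scored as follows. This is only one possible approach, shown to make the two cases concrete; the update does not say which strategies the team settled on. The function names, the case and whitespace normalization, and the "fraction of extracted values found" score are all assumptions.

```python
from typing import List


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.casefold().split())


def compare_lists(extracted: List[str], citoid: List[str]) -> float:
    """Case 1: both sides are arrays of strings.

    Returns the fraction of distinct extracted values also present in the
    Citoid values, ignoring order.
    """
    expected = {normalize(v) for v in extracted}
    if not expected:
        return 0.0
    found = {normalize(v) for v in citoid}
    return len(expected & found) / len(expected)


def compare_list_to_string(extracted: List[str], citoid: str) -> float:
    """Case 2: extracted metadata is an array, the Citoid response a single string.

    Returns the fraction of extracted values that occur as substrings of the
    Citoid string.
    """
    if not extracted:
        return 0.0
    haystack = normalize(citoid)
    hits = sum(1 for v in extracted if normalize(v) in haystack)
    return hits / len(extracted)


# Made-up values for illustration.
score_lists = compare_lists(["Jane Doe", "Richard Roe"], ["richard roe", "John Smith"])
score_string = compare_list_to_string(
    ["Jane Doe", "Richard Roe"], "Jane Doe and Richard Roe"
)
print(score_lists, score_string)
```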

Month 10 (April)

  • Communications & Community:
  • Development:
    • Implemented the debug endpoint of the translation server, and updated the early adopter guidelines accordingly.
    • Made improvements to the translation server:
      • General layout improvements, including preparing the results page to show translation test results (i.e., supporting multiple translation targets, and expected outputs and test scores).
      • An improved debug information section, with a hopefully clearer table format.
      • Internationalization and Spanish translation.
      • Added a home page, with a search field to enter a target URL, and basic Web2Cit translation options.
      • Added JSON editor pages, with JSON editor forms embedded, to simplify editing of configuration files.
    • Considerably improved our JSON schemas to simplify editing of JSON configuration files for early adopters using forms automatically generated by json-editor.
    • Created a user script to more seamlessly integrate Web2Cit into Wikipedia.
  • Research:
    • Finished making Citoid requests for the 380k+ reference URLs obtained in the extraction phase.
      • We managed to get citation metadata for 75% (280k+) of these URLs.
    • Finished mapping fields in Citoid responses to Web2Cit fields
    • Began cleaning the values obtained in the extraction phase, to continue with the comparison phase.
    • Presented our work at WikiWorkshop 2022, and met with Diego Sáez-Trumper from Wikimedia Research.

Month 11 (May)

  • Communications & Community:
    • We held our first workshop.
      • Around 12 people participated.
      • Recording: https://www.youtube.com/watch?v=wlf3On0YgcI
      • We learned some valuable lessons for our upcoming workshops (see the corresponding Workshop summary section on our Workshops page).
      • At least 2 translation tests and 2 translation templates were created by the participants.
    • Published a video describing what the Web2Cit ecosystem looks like.
    • Held a session at Wikimedia Hackathon 2022. We had around 11 participants, including Citoid's maintainer Marielle Volz.
  • Development:
    • Made several improvements to the configuration file editor to encourage participation until our visual editor is available.
    • Began implementing support for translation tests, both in Web2Cit Core and in Web2Cit server.
    • Started drafting a Web2Cit Monitor redesign using Meta to simplify its implementation (see here).
    • Began development of XPath selection.

Month 12 (June)

  • Communications & Community:
    • Gave a Citoid and Web2Cit workshop at Wikimedia Argentina's and Wikimedistas Uruguay's "Wikiherramientas". YouTube video.
    • Had a conversation with Giovanna Fontenelle. Shared the tool with her and discussed people who may be interested in using it.
  • Development:
  • Research:
    • Added a landing page for the research subproject at our home page.
    • Resumed work on comparing the extracted references with the results returned by Citoid.

Month 13 (July)

  • Communications & Community:
    • Held a remote workshop in Spanish organized by Wikimedia Colombia. See their flyer on Twitter.
    • ...
  • Development:
    • A new developer joined the project and started working on Web2Cit monitor.
    • Continued with development of the real-time version of the Web2Cit editor.
    • Published Web2Cit core as an npm library, for convenient reuse from Web2Cit server, Web2Cit editor, and any other project that would like to use Web2Cit functionality.
  • Research:
    • Finished comparing extracted citation metadata vs Citoid responses, reaching the end of the research script writing process.
    • Began writing the final report.
    • ...
  • Project management:
    • ...

Month 14 (August)

Month 15 (September)
