Talk:NLP for Wikipedia (EMNLP 2024)
Open science principles? (data availability, open reviews, open access)
I am wondering if the workshop's organizers have been making any efforts to require or at least encourage:
1. data availability statements
2. open reviews
3. publication of the accepted papers under open access
It appears that at least as of last year, the EMNLP conference organizers were strongly encouraging 1 and 2 (at least for the main conference):
For a transparent and open reviewing system, this year we have implemented a process by which some of the reviews, author responses and meta reviews will be made publicly available. Our motivation here was to provide increased transparency in the review process, to foster more accountability for reviewers and higher quality reviews as well as enabling peer review research by providing an open collection of papers and reviews. [...] The requirement of making reviews open meant that we had to move the conference platform from SoftConf to OpenReview. The OpenReview platform is also currently being used by the ACL Rolling Review and some other related conferences and is well suited to this type of process. We hope that more and more authors will choose to make their reviews and data available for the community, and more and more reviewers will engage in the rebuttal discussions.
But the page NLP for Wikipedia (EMNLP 2024)/Call for Papers mentions none of those three. What's more, the workshop's OpenReview page linked there is empty (i.e. no public reviews), in contrast to those of several other EMNLP 2024 workshops (example). Regards, HaeB (talk) 07:46, 25 October 2024 (UTC)
- Thanks for the feedback @HaeB! As context, this is the first time we're running this workshop (and at EMNLP), so this year was very much about figuring out the process and what works and what does not. We are waiting to hear whether we will be able to run a second iteration of this workshop next year; that is when we will be able to make some changes, because enough of the core of the workshop will be in place that we can properly plan for things like open review.
- Re open-access papers: yes, the proceedings will be open-access (CC-BY 4.0 per ACL), and many of the authors have also posted their papers to arXiv or elsewhere. So I think we're in agreement there. We will do our best to link to them ahead of the workshop.
- Re open reviews / data availability statements: that's something I will raise for discussion with my co-organizers for next year's workshop. I'm not opposed, but we'd have to figure out where it makes sense to incorporate them. It looks like opt-in has been the approach taken by others for open reviews (thanks for sharing the BlackBoxNLP example).
- This year we ran two tracks: a more standard peer-review track and a track for pre-published papers where we just reviewed for fit. Next year we're hoping to mix this up a bit to set us up for more interaction between NLP researchers and Wikimedians as part of the workshop:
- We'd like a way to have more Wikimedia community perspectives in the workshop without requiring a formal review process through OpenReview (e.g., maybe just asking folks to respond to a prompt on-wiki and then incorporating these responses into the workshop)
- We'd also like to have a central track that focuses on datasets related to core content policies, which would be a good place for the more data-specific transparency requirements. In general, we are considering what it would mean to implement a more wiki-specific ethics/transparency checklist for papers (example of a generic research checklist from NeurIPS).
- We might retain a more generic track too but try to make that track less of a focus for the workshop.
- Nothing has been decided yet, so any feedback on how to nudge research towards being more beneficial for the Wikimedia community is welcome (with the three suggestions you already shared noted). -- Isaac (WMF) (talk) 16:24, 31 October 2024 (UTC)
- Thank you for these thoughtful responses!
- Open access: That's awesome (and what I had expected). I would just suggest also including that information explicitly on the CfP page next time. By the way, I noticed that the proceedings are now online on the ACL website, so I took the liberty of adding the link myself.
- "We'd like a way to have more Wikimedia community perspectives in the workshop" - that sounds great (also considering that the workshop costs $200 to attend virtually). One aspect of doing that might be to encourage researchers to create Research:Projects pages per the usual best practice.
- "We'd also like to have a central track that focuses on datasets related to core content policies which would be a good place for the more data-specific transparency requirements. In general, we are considering what it would mean to implement a more wiki-specific ethics/transparency checklist for papers (example of generic research checklist from NeurIPS)." - that sounds good, but to clarify just in case: I did not bring up data availability statements in order to impose any wiki-specific values. (I am aware that the Wikimedia Foundation is not the sole organizer of the workshop.) Also, they are not just important for dataset papers. Rather, they are becoming a widespread practice across many research disciplines (as can be gleaned from the Google search link I had included above), also driven by serious concerns about replicability/reproducibility and research transparency. To quote one of the largest scientific publishers:
research is an activity that naturally thrives from previous findings and research. Allowing your readers to access research data is considered a good practice in science, and a mechanism to encourage transparency in scientific progress. It’s for this that most journals today demand a Data Availability Statement, where authors openly provide the necessary information for others to reproduce works stated or reported in an article. [...] It can’t be overemphasized how important Data Availability Statements are for transparency in science. They assure that researchers are properly recognized for their work and help science grow as a discipline, in which sustainable progress is a key pillar.
- It would be sad to see Wikimedia-supported research falling behind in this aspect.
- (PS: As context for others reading along, my questions above had been prompted by writing a review of one of this workshop's already published papers for the current issue of the research newsletter; see also the community discussion about the paper/the review here.)
- Regards, HaeB (talk) 06:34, 11 November 2024 (UTC)
- A bit slow on responding but thanks for the additional feedback:
- Open access: Good point re: advertising the open-access status. I'll make sure it's included in next year's CfP. FYI, we're in the early planning stages, but you can find the page here if you wish to continue the conversation: https://meta.wikimedia.org/wiki/NLP_for_Wikipedia_(ACL_2025)
- Accessibility: also good point about encouraging Meta pages for papers. I'll bring that back to the team.
- Checklists: the way I'm thinking about this is not as a hard filter but more as a nudge to remind researchers to follow some good practices, or to explain why they have not, which in turn assists reviewers. Given that the checklist would be new, I think we'd start with something relatively simple to see how authors respond to it. I actually think that in many cases the overhead will be appreciated because it takes some of the guesswork out. For example, if we incorporate language that asks authors to justify the sharing of any editor usernames, that is a simple way of letting authors know that we expect them to minimize the discoverability of editors unless they've received explicit permission. I'll see how this develops, but open access is certainly relevant too.
- Isaac (WMF) (talk) 01:56, 12 December 2024 (UTC)