Grants talk:IEG/ScalaWiki data processing toolbox
- 1 Strategic priority
- 2 April 12 Proposal Deadline: Reminder to change status to 'proposed'
- 3 OpenRefine
- 4 Feedback
- 5 Eligibility confirmed
- 6 Aggregated feedback from the committee for ScalaWiki data processing toolbox
- 7 Round 1 2016 decision
Can I fill "strategic priority" field with "encourage innovation"? It's one of the 5 strategic priorities in Wikimedia Movement Strategic Plan Summary but form suggests only 3 of them: "Choices of strategic goals: Increasing Reach (more people access Wikimedia projects), Improving Quality (better quality and quantity of content on Wikimedia projects), Increasing Participation (larger and more diverse groups of people are contributing to Wikimedia projects)" --Ilya (talk) 15:47, 12 April 2016 (UTC)
- Hi Ilya! Thanks for this question. In our call earlier, I told you I would look into this question and get back to you. Having taken a look at your project, I would prefer that you stick with the suggested guidelines. My recommendation is that you select "Increasing Reach (more people access Wikimedia projects)." I suggest this because your project seeks to improve access so that certain kinds of data present on Wikimedia projects won't be limited to users who are able to write scripts or tools for data access.
April 12 Proposal Deadline: Reminder to change status to 'proposed'
@Ilya: This is a final reminder that the deadline for Individual Engagement Grant (IEG) submissions this round is today (April 12th, 2016). To submit your proposal, you must (1) complete the proposal entirely, filling in all empty fields, and (2) change the status from "draft" to "proposed." As soon as you’re ready, you should begin to invite any communities affected by your project to provide feedback on your proposal talkpage. If you have any questions about finishing up or would like to brainstorm with us about your proposal, let me know.
I just skimmed the proposal for now, and I was not able to understand what you're actually proposing. However, a word popped into my mind: OpenRefine. Any resemblance? Nemo 05:31, 13 April 2016 (UTC)
- I won't comment on your "skimming" and "inability to understand", ok? I'm grateful for any suggestions about tools that might be similar or related. I stumbled upon OpenRefine once but did not have a chance to look at it closely and then forgot about it. Yes, it seems quite related, and I'm going to evaluate how it or any other software could be used for, or give ideas toward, the goals of this grant request. I found a mention of using OpenRefine for a Wikimedia project and a wikidata-reconcile service by Magnus Manske that uses it. Do you have other usage examples? It has several open issues regarding support for Wikimedia projects: Support Wiki table formats, Malformed input from reconcile APIs freezes OpenRefine, Implement Wikidata reconciliation.
- I found reports that it slows down and crashes with 1 GB datasets, unresolved issues saying that it crashes with an 88 MB dataset and cannot handle forms larger than 1 MB, and finally that it does not support streaming but loads all data into memory instead.
- Comment: "The current architecture requires all data to be in memory for processing. Changing that would require a major reworking of things and require GRefine to be backed by a "real" database to do all the processing. For the foreseeable future, you should assume that 1.x N memory is required for an N-size database, where the goal is to keep x small (say 5-20%)."
- So you were probably not aware of this? I cannot assume that you are suggesting processing Wikipedia data fully in memory, requiring 5-20 times more memory than the dataset size.
- Regarding other tools that may be related, I'm going to do some evaluation of pandas, Jupyter (and other PyData projects), R and Sparta. Any other suggestions? I'll ask this question on the Wikimedia tech lists too. --Ilya (talk) 09:10, 13 April 2016 (UTC)
- I think I'll have to provide more detailed examples --Ilya (talk) 15:14, 13 April 2016 (UTC)
- Yes, examples would be good. Nemo 12:07, 17 April 2016 (UTC)
- I am a regular user of OpenRefine. I think it is a valuable tool to clean up messy datasets from our GLAM partners. As an example, I just spent days aligning a Lepidoptera catalogue from my local university, which contains outdated names and badly identified taxa, with online resources (namely the Catalogue of Life and the Encyclopedia of Life) and Wikidata, just to be able to properly describe the specimens we photograph. OpenRefine's reconcile capacity is great, despite the flaws Ilya pointed out, and Magnus' own Wikidata-reconcile service is rather crude and has its own issues.
- --EdouardHue (talk) 12:38, 18 April 2016 (UTC)
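The "1.x N" memory rule quoted in this thread can be turned into a quick back-of-the-envelope estimate. The sketch below is only an illustration: it assumes x is read as a fractional overhead on top of the dataset size (5-20%), and both the function name and the 50 GiB dataset size are hypothetical, not figures from OpenRefine or from any Wikimedia dump.

```python
def estimated_memory_bytes(dataset_bytes: int, overhead_fraction: float) -> float:
    """Memory needed under the quoted "1.x N" rule.

    Assumption: x is a fractional overhead, so 5% means 1.05 * N.
    """
    if dataset_bytes < 0 or overhead_fraction < 0:
        raise ValueError("sizes and overheads must be non-negative")
    return dataset_bytes * (1.0 + overhead_fraction)


GIB = 1024 ** 3
HYPOTHETICAL_DATASET = 50 * GIB  # illustrative size only

for overhead in (0.05, 0.20):
    needed = estimated_memory_bytes(HYPOTHETICAL_DATASET, overhead)
    print(f"{overhead:.0%} overhead: about {needed / GIB:.1f} GiB of RAM")
```

Whatever the overhead, the point raised in the thread still holds under this reading: the whole dataset has to fit in memory, because the tool does not stream.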
Hi Ilya, as I have been in touch with you and have used your WLM tool, I have no doubt that you have the skills needed for this kind of task and also the required insider view of the movement. Still, I am having trouble understanding what you are offering to the movement; to me it is a bit too abstract. Proposal: could you clarify which kind of audience this toolset is intended for (the audience could be any newbie editor in the movement, an experienced editor, a bot owner, a programmer, ...) and list 5 use cases of concrete problems that I could solve with this toolset?
Off-topic: it is good to know that in the coming weeks your WLX jury tool will have an admin user interface so that we can manage our stuff without needing you, and it is surely good for you, as you can then relax :) If some of Wikimedia Ukraine's employees are now familiar with the code of the WLX tool and are even implementing a new UI, would it be possible to share that code so that other chapters can add features to it? E.g. if one jury member out of five accepts a picture, then it is skipped from the others' backlog, as it will make it to the next phase anyhow. Please let me know where we can discuss this topic, or whether I could discuss it during Berlin's WikiCon next week, as it is not directly linked to this grant but is of high interest to many people. Thank you and best regards, --Poco a poco (talk) 19:07, 15 April 2016 (UTC)
This Individual Engagement Grant proposal is under review!
We've confirmed your proposal is eligible for review and scoring. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period (through 2 May 2016).
The committee's formal review begins on 3 May 2016, and grants will be announced 17 June 2016. See the round 1 2016 schedule for more details.
Questions? Contact us at iegrants wikimedia · org .
Aggregated feedback from the committee for ScalaWiki data processing toolbox
|(A) Impact potential
|(B) Community engagement
|(C) Ability to execute
|(D) Measures of success
|Additional comments from the Committee:
- Hi MJue (WMF), thank you for the feedback. I'll provide a detailed answer in several days. Can I convert your list from unordered to ordered so I'll be able to reference the points more easily? --Ilya (talk) 19:33, 3 June 2016 (UTC)
- MJue (WMF), I'm going to provide detailed answers on each of the statements from these comments by Monday June 13th. Is that ok? --Ilya (talk) 03:24, 11 June 2016 (UTC)
- In progress..
I'll group by major concerns
- 13. these reports always need a human perspective, and I always adjust the measurement methods for them so that they are easy to build. I would not prefer using a one-size-fits-all reporting method.
- 1. "would serve a minority of our community, namely those conducting Wikipedia research or reporting on projects."
- 3. "unsure where this project fits strategically, what impact it will have"
- 6. "I am still looking for reasons to develop this project in the first place."
- 18. "I don't see the impact of this project."
- 19. "the delivery is limited in comparison with the work involved."
- 20. "I'd like to see more concrete use cases and negative examples"
- 13. Yes, I agree that reports need a human perspective and that we need to adjust measurement methods. That is exactly why I want to have a tool that allows humans to adjust what to measure and how. It's also important to be able to save, repeat and share with others the adjustments we make.
- 1. "Would serve a minority of our community, namely those conducting Wikipedia research or reporting on projects." That is true only if research or reporting is treated formally, as something done just by those who have to produce research or a report for some reason. Meaningful research is not about getting some numbers to satisfy a requirement, but about understanding reality and acting accordingly. In many cases what is interesting is not the numbers themselves, but finding specific subjects (users, articles, categories, areas) that are interesting to us by some criteria and trying to take some action on them.
Others do not see the impact, and upon rereading the grant request I think this is because I was mostly concerned with assuring readers that I understand the technical side and know how to implement it, and I naively took for granted that what can be done with the tool would be obvious from those technical details.
Let's find common ground by comparing with another grant, Grants:IEG/WIGI: Wikipedia Gender Index. That grant received 1.9 points more on Impact potential, with some good comments on it and none of the explicitly stated misunderstanding that is so prominent here. WIGI measures about a dozen fixed parameters on one particular subject (women), so it is a subset of the project I propose, which is going to provide a UI to select which subject and which parameters you want to query, and how you want to filter, group, aggregate and present the output. Let us say we want to promote or investigate something different (endangered species, biosphere reserves, architectural monuments, scientists, LGBT people, whatever). Does it make sense to create a separate tool for every subject that might be interesting to us? When one uses a calculator, it does not matter what real-life objects the numbers correspond to.
Examples of how it can be used:
- Find and improve under-represented areas or content. These are categories, articles, sources or images that are disproportionately few, non-existent or lacking some property in one Wikimedia project compared to another Wikimedia project or to some reference list (or in one country or region compared to another, or by some other property, such as one category of articles compared to another). A list of such areas can be used to attract volunteers to improve them, for example through thematic weeks, edit-a-thons or wikiprojects. This is a very general description too, but it can be used in many different variations.
- Ratings of how well institutions (for example museums or universities) are represented on Wikipedia can motivate them to improve their representation.
- Find the most active authors of articles on a specific topic to invite them to participate in such thematic weeks, or contact them to ask questions or suggest something on the topic.
- Usually reports measure only the impact of specific events without comparing it with what happens outside of them (no scientific control).
- Is overall activity during thematic weeks or contests higher than without them, or do users just switch to writing about the thematic topics?
- Do the events give long-term growth in the activity of users who participated, and of any users on the topics the events cover? Or can events with prizes demotivate users who did not receive them and decrease their contributions?
- How do the number and size of articles on a specific topic compare inside an event and outside it?
- In the case of image uploads, interesting questions are the geographic diversity and the estimated travel time and distance of contributors (taking lots of pictures at one time and place, like a big city, vs. traveling to many distant villages).
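The scientific-control questions in the list above amount to comparing the change in activity of event participants with the change in activity of a comparable group outside the event. Below is a minimal sketch of that comparison (a simple difference-in-differences), assuming weekly edit counts are already available. The function names and all numbers are hypothetical and are not taken from the proposal or from any real event.

```python
from statistics import mean


def mean_change(before, after):
    """Average activity after the event minus average activity before it."""
    return mean(after) - mean(before)


def difference_in_differences(participants_before, participants_after,
                              control_before, control_after):
    """Change among participants minus change among the control group."""
    return (mean_change(participants_before, participants_after)
            - mean_change(control_before, control_after))


# Hypothetical edits per week for three users in each group.
participants_before = [4, 6, 5]
participants_after = [9, 11, 10]
control_before = [5, 5, 5]
control_after = [6, 6, 6]

effect = difference_in_differences(participants_before, participants_after,
                                   control_before, control_after)
print(f"Estimated extra edits per week attributable to the event: {effect}")
```

A value near zero would suggest that activity grew just as much outside the event, which is the case a report without a control group cannot detect.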
Ability to execute
- 4. Could be a huge project with infinite complexity
- 10. I am very skeptical that one individual (no matter how brilliant) would be able to build out such a base set of libraries without active user participation in the given time frame.
- 12. My concern is merely whether or not this could be completed within 6 months.
- 17. May be too ambitious to complete in six months
First of all, I would like to thank you for that scepticism, which made me think more about how to make sure I stay within the time constraints.
About completion and the boundaries of the included functionality: it's not possible to include everything at once. The goal of the project is to become viable and compelling enough with some core starting functionality within this timeframe, so that it will be worth sustaining, experimenting with and evolving without much effort for each piece of functionality added later.
Let us put some measure on the difficulty. There is mw:Extension:ApiSandbox. It is based on development done in a GSoC project by a student, User:Salil, who did not know MediaWiki and its API beforehand. It was planned that he would spend about 3.5 months part-time in the evenings. Let's call it 1.5 months full-time. I think a UI like this (or, more generally, converting the metadata (which commands are available) into code that takes particular actions based on that metadata) is an essential part of the project. It gives a time estimate, and I can also start from that code and adapt it into a simpler form generator.
On other components
- I've been working on the MediaWiki API, database and dumps components for about 2 years, so they need polishing and completing rather than implementing from scratch.
- There are a lot of good tools for Wikidata. It is quite complex, and I do not have enough knowledge and experience to give it the same level of coverage, so I'm not going to spend that much time on it.
- I'm going to start with the simplest possible implementations. For example, I can start not with conversion from JSON to SQL but with plain SQL, and I can start with just a MediaWiki API query string and not the UI to build it.
Overall I think it's more of an integration project than something radically new and brilliant, and the main problem is not the ability to implement some design, but rather getting ideas or needs from other users to become part of it.
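The "plain query string first" starting point mentioned in the list above could look roughly like the sketch below. It only builds a URL and sends no request. The `build_query_url` helper and the user name `Example` are hypothetical, while `action=query`, `list=usercontribs`, `ucuser`, `uclimit` and `format=json` are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint, used purely as an example target.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(params: dict) -> str:
    """Build a MediaWiki API query URL from a plain dict of parameters."""
    # Defaults come first; caller-supplied keys can override them.
    merged = {"action": "query", "format": "json", **params}
    return f"{API_ENDPOINT}?{urlencode(merged)}"


# Hypothetical request: the ten latest contributions of a user named "Example".
url = build_query_url({"list": "usercontribs", "ucuser": "Example", "uclimit": 10})
print(url)
```

A form-based UI of the kind ApiSandbox provides would then only need to produce the parameter dict, so the query-string layer can be written and tested first.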
- Note that ApiSandbox was developed by experienced MediaWiki developers and mw:Google Summer of Code past projects doesn't list any related past project. The proposer doesn't seem to have contributed any patch. Nemo 16:02, 14 June 2016 (UTC)
- Following the comments, I decided to ask users for ideas before showcasing the first version, and not vice versa.
- 21. "There is also mention of participation in several events (Wikimania, CEE, WLM), but is it part of the budget?"
- I've already been funded for Wikimania 2016, and for CEEM I can be funded by Wikimedia Ukraine. There is no offline WLM event; I will communicate with the teams online.
- 3. 1) The users of the tools mentioned in the problem section may not be able to learn and write Scala code to solve issues, especially not with the (fairly heavyweight) projects mentioned later on in the proposal. 2) For the tool developers themselves, inside the Wikimedia Community Python / PHP are much more favored languages, and already have a large set of supporting libraries & communities. I'm unsure where this project fits strategically, what impact it will have, and how it'll be sustained afterwards.
Scala: Yes, Scala is much less favoured overall than Python. But:
- Many projects are actually developed mainly by a small number of people, so my being familiar and productive with the language is a major factor in the project's success.
- Scala is popular among data scientists. According to an O'Reilly survey, 10% of data scientists use Scala, and in particular 24% of data engineers use Scala. This is mainly due to Spark, which is developed in Scala and whose users largely use Scala, which shows that a good tool can increase the popularity of a technology.
- A large set of libraries for the same purpose can be a sign of fragmentation and dissatisfaction with them.
Heavyweight dependencies: I suggested Flink to provide enough power and functionality out of the box and avoid reimplementing things. For local usage, the small Apache Calcite library may be enough and cover most use cases.
- 17. project presents some criticism to Global Metrics tools that have not been scaled in Phabricator.
Following up on your answers
I spoke with Asaf Bartov yesterday and he pointed me here to your responses to the aggregated feedback from the committee. I am very sorry that there hasn't been any follow-up before now. I inadvertently failed to watch this talk page and I consequently overlooked your responses.
Unfortunately, this proposal was not recommended by enough reviewers to advance past the first phase of committee selection. This means that your responses to the aggregated committee feedback did not impact their decision about whether to fund your project. That should have been clearer in the feedback template and I'll make sure it is revised going forward. That said, the committee is very open to reviewing submissions again in future rounds, and they tend to look favorably on applicants who have taken their feedback seriously and used it to clarify and improve their proposal. If you are still interested in doing this project, you can incorporate your answers into your proposal and resubmit it into the current Project Grants open call. The deadline is August 2.
If you click on that link and scroll down to the "Upcoming events" section, you will see that WMF is hosting a series of proposal clinics via Hangouts. You are welcome to join one of those clinics if you would like individualized support for reworking your proposal for submission.
My apologies again that there wasn't a response here before now. It's clear that you have put care into responding to the committee's feedback and I'm sorry your answers were overlooked. I've sent a message to the IEG committee asking them to follow up with any further feedback in light of your answers, so that you can take it into account if you do decide to resubmit your proposal.
- @Mjohnson (WMF):
- Thank you for the explanation. The evaluation process looks broken to me :( But I do appreciate your being frank about it. And I hope that the organization of the process will now be clearer for future applicants and that they will not run into the same situation as I did.
- What are my steps to resubmit the application? Am I supposed to create a new proposal and insert my answers to the questions from this talk page there, so that people reading my new proposal will get the info they were interested in before? --Ilya (talk) 11:03, 14 July 2016 (UTC)
- I agree with you that the process was broken and I apologize for that oversight. I've corrected this for Project Grants going forward so that next steps will be clearer to future applicants. It's not necessary to transfer your answer to questions on this talkpage to the talkpage of your Project Grant proposal, but you should make sure that the Project Grant proposal itself addresses committee concerns raised here. The committee make-up will be different this time, so there will be different perspectives at work in the next review. If you get further comments from the committee on your Project Grants proposal talkpage, you will definitely want to respond there.
- Thank you again for your feedback.
- --Marti (WMF) (talk) 20:48, 9 August 2016 (UTC)
Round 1 2016 decision
This project has not been selected for an Individual Engagement Grant at this time.
We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding, but we hope you'll continue to engage in the program. Please drop by the IdeaLab to share and refine future ideas!
- Review the feedback provided on your proposal and ask for any clarifications you need using this talk page.
- Visit the IdeaLab to continue developing this idea and share any new ideas you may have.
- To reapply with this project in the future, please make updates based on the feedback provided in this round before resubmitting it for review in a new round.
- Check the schedule for the next open call to submit proposals - we look forward to helping you apply for a grant in a future round.