Please participate here! Please add your thoughts regarding the five questions into the appropriate sections below. If you have something else to address, create a new section!

Program Evaluation

Evaluating the program evaluation and design capacity-building initiative

How should we evaluate and report on the program evaluation and design capacity-building initiative? What measure of success would you find MOST useful for assessing progress toward team goals?
Enter comments below this line

Much of the current documentation regarding program evaluation considers activities from the perspective of what they're called, or what they are; an edit-a-thon is one thing, an editor workshop is yet another thing, and so forth. It may be worth considering them from the perspective of the desired outcome. What can we do to get more editors? What reduces the backlog of outstanding work on a project? The best measure of success may involve identifying all the different problems people are trying to solve, both in Wikimedia affiliate organizations and in purely online groups like WikiProjects on the English Wikipedia. harej (talk) 04:43, 17 May 2014 (UTC)

Evaluating learning opportunities and resources

Examining the evaluation learning opportunities and resources made available so far, (a) what have you found to be most useful in supporting program evaluation and design and (b) in what evaluation areas do you feel there is not enough resource support?

(a) What have you found to be most useful in supporting program evaluation and design?

Enter comments below this line

The opportunity to speak with evaluation experts and to have my work looked over by people who know what they are doing has been most helpful. harej (talk) 04:43, 17 May 2014 (UTC)

(b) In what evaluation areas do you feel there is not enough resource support?

Enter comments below this line

Attention needs to be paid to development of governance. Different groups and organizations have different needs, but the ability of an organization to carry out serious program evaluation depends on the basics: being able to recruit volunteers, being able to prepare a budget, being able to run a business. Running events is enough of a volunteer burden; running events as part of a broader plan requires additional volunteer work, and getting volunteers to do this depends on being able to effectively recruit and manage them. Many nonprofits invest heavily in volunteer training and Wikimedia should be no exception. harej (talk) 04:43, 17 May 2014 (UTC)

Evaluating beta report metrics

Examining Evaluation Reports (beta), please share about the (a) strengths and (b) weaknesses of the metrics piloted.

To discuss metrics more in-depth, please participate in the metrics brainstorm and prioritization we began at the Wikimedia Conference.
Follow this link to an OPEN DISCUSSION about the next steps for measuring program impacts.

(a) Strengths of the metrics piloted

Enter comments below this line

The report clearly defines the terms used and the projects reported on. Our collective lack of progress on program evaluation is really well explored. It's really fascinating just how behind the curve we've all been on program evaluation. We were never expected to do this kind of work, but the report discusses baseline expectations of measurement and now I am interested in seeing what we learn over the next year. harej (talk) 20:01, 18 May 2014 (UTC)

(b) Weaknesses of the metrics piloted

Enter comments below this line

I feel like there is more going on than the report would have us believe. Could future evaluation reports mine the annual reports filed by the affiliate organizations? I am also interested in learning what return we get on our money. One could spend $30,000 on a program, but who says we get more out of that than we would spending $30,000 on a new server or better wiki software? We need comprehensive metrics for the entire WMF budget because otherwise we have numbers without context. harej (talk) 20:01, 18 May 2014 (UTC)

Grantmaking

Evaluating grantmaking to groups and organizations

How should WMF Grantmaking evaluate its various grantmaking initiatives to groups and organizations (e.g., Annual Plan Grants, Project and Event Grants)? What strategies and/or metrics would you recommend for assessing grantmaking to groups and organizations?

Enter comments below this line

Grant proposals should be categorized according to what kind of program they want to implement. If a chapter seeks a grant to host an edit-a-thon, it should be flagged as an edit-a-thon themed grant and they should get the resources needed to carry it out. This kind of pre-emptive categorization ensures sufficient support from the beginning and it makes it easier to incorporate their findings into future reports. At the same time, we need to make sense of the diversity of program proposals within a category. Edit-a-thons geared toward building relationships with institutions function differently from edit-a-thons geared toward recruiting individuals to edit Wikipedia, and edit-a-thons held in Nairobi are going to be different from ones held in New York or Bangalore. Now if a group comes along and has a completely novel proposal for something that's never been done before, that's an opportunity to work with them to flesh out the program's goals and pilot test something that could ultimately be replicated by other organizations. Grantmaking should be responsible for lead generation for Program Evaluation and Design. harej (talk) 20:16, 18 May 2014 (UTC)
First time on this part of Meta. No links to the metrics or "beta".
WRT James's comment above, some programs might involve more than one macro-category. I'd be inclined to look at individual activity-types rather than trying to categorise whole programs or even projects.
I write about the PEG system as a member of GAC: the revamp of GAC I suspect will embrace volunteers in the reportage system more (except that there's a major structural problem in motivating on-site volunteers to participate in reviewing).
The revamp might see a system of occasional summaries of lessons learned, done by staff no doubt, since there are so few active volunteers in GAC that the system is almost grinding to a halt. What are the patterns, stepping back?
Reports are often unsatisfactorily vague, although at least we seem now to be insisting on largely numerical "measures of success" in applications.
I'd like to see a correlation of money spent with outcomes (and what the money could have done if spent on WMF engineering and products instead). Hard to do, though, because of the apples/oranges problem.
I think there's not sufficient emphasis on comparing the value added through physical off-line activities such as conferences vs programs of online contact, and room editathons vs the production, say, of online tutorial videos. Tony (talk) 04:44, 1 June 2014 (UTC)
- I agree with Tony that grants should be categorized by activity-types rather than large, "macro" categories that won't capture nuance. I also agree that we do not do a good enough job of measuring the value proposition of doing work offline, e.g., what edit-a-thons lack in editor retention they make up for by improving perceptions of Wikipedia's legitimacy among relevant institutions. harej (talk) 20:20, 4 June 2014 (UTC)

Evaluating grantmaking to individuals

How should WMF Grantmaking evaluate its various grantmaking initiatives to individuals (e.g., Travel scholarships, Individual Engagement Grants)? What strategies and/or metrics would you recommend for assessing grantmaking to individuals?
Enter comments below this line

Other comments and suggestions

Metrics for public support

Somewhere in all of this evaluation we need to look at whether our activities build public support for the encyclopedia and open knowledge. We need general public support to ensure that the Internet remains usable and open, and to ensure that we have the financial backing from the public. Much of the necessary financial, political, and cultural support needs to come from people who are readers, users, enthusiasts, but not necessarily high-edit-count Wikipedians themselves.

Solid metrics that measure what activities generate public support would require some understanding of public opinion research. My guess is that it might be more cost-effective at this point to view ongoing public outreach as part of how you do business rather than get sidetracked with polling and surveys.

We also need to consider that in-person events can be important for building relationships between people who are supportive of Wikipedia, and look at finding ways for people to contribute who aren't in a position to be power editors, so they can remain involved with the project as their circumstances change.

Maintaining the health of the organization is a part of the mix if we want sustainability. The "contribution to organizational health and sustainability" piece isn't a big part of our evaluation efforts so far. At some point, we might need to look at what other volunteer-based non-profits are doing with regard to evaluation, accounting, and transparency. Even if these tasks are performed pro bono, it may be in the Foundation's interests to check the qualifications of the people performing them. My guess is that some of these core tasks are typically better handled by paid staff, and that the trick is getting a system that's simple and modular enough that it's easy for the volunteers to add their input. Djembayz (talk) 02:43, 20 May 2014 (UTC)

Use of the term "Impact"

I think that "impact" as a technical term in Program Evaluation should be replaced with "long-term outcomes". This would be easier for everyone to understand. Otherwise we can have endless discussions about what we mean by "impact" and if short-term outcomes should be considered "impact". I will make this comment in other talk pages as well. --Pine^✉ 23:01, 31 May 2014 (UTC)

Suggestions for programs to evaluate

Conferences

I suggest evaluating Wikimania, the Wikimedia Conference, and the annual hackathon. No one responded to my question on Wikimedia-l about what strategic purposes Wikimania serves, and only one person responded when I asked this same question on the IEGCom email list. Evaluating the long-term outcomes of these programs and their strategic value could help make sure movement resources are being spent wisely and help these programs to be more successful at making progress toward strategic goals. If some programs are found to be more cost-effective than others at achieving long-term outcomes toward strategic goals then adjustments can be made. --Pine^✉ 23:13, 31 May 2014 (UTC)

Yes, good post. Evaluating the short- and long-term outcomes of any conference (academic, medical, Wikimedia) is very difficult. I'm tempted to say that this annual festival/party, and the annual Wikiconference, are of no use to the product (what our readers/donors want) and should be dumped in favour of a single conference every two years: a serious affair, with high-profile presentations all-round, and (ahem) streamed? I'm generally cynical about the quality of conference presentations, even professional academic/medical conferences. We're very slow to take up the internet as a form of presentation and Q&A. Tony (talk) 04:52, 1 June 2014 (UTC)

I agree we should pay attention to measuring the usefulness of conferences. Unlike some I generally believe that conferences are beneficial to the Wikimedia movement and to the Wikimedia projects as the movement's deliverables, but how? Having concrete objectives for conferences gives us a standard for reference. For instance, if the objective of Wikimania is to share research and insight among a global audience, we would want to encourage presentations to be rigorously researched and prepared and not something from the seat of one's pants. If the objective of Wikimania is actually to motivate volunteers by letting them network with their peers from around the world, then we would have to think of Wikimania not as an academic conference but as a gigantic social space with the requisite planning requirements and metrics. But we first need to know what our priority actually is with these conferences. harej (talk) 20:29, 4 June 2014 (UTC)

I'm going to be honest here. I'd have very little interest in attending the kind of conference that Tony suggests. I can read papers just fine from my comfy chair at home in my slippers and baggy pants. I go to serious educational conferences when someone else is paying; for me, it's too much like work. As well, I'm not certain there's enough useful scholarship to bring to such a conference if the scholars have to pay their own freight; average cost would be around $3000 - $7000 including flights, meals and accommodations, depending on where they live in comparison to where Wikimania or any other conference is held. Multiply that out by 30-50 sessions over several days, and we're talking big dollars just for the presenters.

I think Harej is closer to being right here; most WMF-related conferences, including Wikimania, are primarily networking opportunities with a side order of "check out the latest stuff". Risker (talk) 07:36, 7 June 2014 (UTC)

Risker, I wasn't proposing that WMF-related conferences be turned into academic affairs. I wrote "a serious affair, with high-profile presentations all-round". My premise is that we can present sessions on the sites and the movement that are of professional quality without being suitable for publication in academic conference proceedings (although a few in the mix could be that, if people wanted it and such papers were in the offing).

My problem, in reviewing every WM-related conference, is that they seem to achieve very little in-session, especially when considering the gigantic cost of these events. At Berlin, the sessions that seemed to be designed around let's-break-up-for-a-while-and-move-from-table-to-table, putting bits of coloured paper on whiteboards, are a good example of what Asaf Bartov said publicly at the close of the conference: too little preparation, both of the presentations and by the hundreds whose attendance donors bankrolled. I don't see the point, and I think the case to argue for real impact, or even its shorter-term cousin, progress, is theirs to make. Where is the evidence? Where are the reports matching goals to knowledge gain / decisions made?

All conferences are partly huge social-gaming spaces, of course, whether they hang on loosely organised and underprepared presentations or more tightly conceived sessions. I don't mind social gaming if the formal proceedings are up to scratch. But they're not, and they never have been. The fact that London still hasn't got together its keynotes, and has no particular timetable for publishing a schedule of the sessions, is telling. Tony (talk) 11:15, 7 June 2014 (UTC)

Thank you for these dialogue comments. Conferences, Hack-a-thons, and Wikimedians in Residence are actually the three additional programs our team is targeting for mapping and evaluation this year. We have been reaching out to program leaders to consult on their evaluation strategies and will be working toward mapping those evaluated outcomes and reporting on those programs in the second round of evaluation reports. Some evaluation data to look forward to: WMUK is conducting a follow-up survey for their last GLAM Conference to assess outcomes in terms of collaborations and projects; WMDE had a participant post-survey assessing conference content, networking, and learning, and also asked presenters to give feedback on their sessions; WMDC is working to gather feedback data as part of the WikiConference USA evaluation; and Wikimania evaluation strategies are in the works. The Zurich Hack-a-thon collected an exit survey and organizers have plans for a follow-up, and we have also consulted on survey strategies for a couple of smaller hack-a-thon events. Are there any other specific evaluation criteria you would expect to see for any of these programs? JAnstee (WMF) (talk) 17:49, 7 June 2014 (UTC)

Thanks, Jaime. Any chance we could see the questions asked? I've briefly thought through the post-questionnaire issue. A likely contaminating variable is that the respondents are self-selected and often funded by sponsors within the movement (affiliates and the Foundation itself); they may thus be less likely to be critical than if they were third-party observers. There are ways of designing the questions that might minimise this. Tony (talk) 05:26, 8 June 2014 (UTC)

The questions have varied, as each of these is essentially an independent effort that I have been consulted on to varying degrees. None have completed their reporting, as most are in collection and analysis, if not still in planning. I will be reaching out to share their designs and results as they become available. I do know that some struggle with access to larger community response as well. JAnstee (WMF) (talk) 18:57, 13 June 2014 (UTC)

Evaluating conferences, especially with a view to their longer-term impact on participants, is not a simple thing to do, but I agree that it's extremely valuable. I've been working on analysing a GLAM-Wiki 2013 conference in London, whose aim was to increase people's knowledge about the topics and improve collaboration between the participants. It's revealing to be looking at the results and I hope to be sharing them soon! Daria Cybulska (WMUK) (talk) 09:20, 3 July 2014 (UTC)

Ombudsman Commission

I also suggest evaluating the Ombudsman Commission, since the last I heard was that it is not producing much in the way of outcomes. An evaluation of its effectiveness and how to improve it may be beneficial. --Pine^✉ 23:13, 31 May 2014 (UTC)

I'm going out on a limb here to say that I think it is well outside the scope of the grants program to evaluate a committee of the Board which does not receive any grants. I'd suggest if you have concerns about the commission's effectiveness, you start on the applicable talk page. But remember that (a) their next report is not due for another few weeks and (b) their "product" is determination of whether or not the applicable policies have been violated. It's not customer satisfaction, and it's not going out and looking for problems that they then solve. Risker (talk) 07:16, 7 June 2014 (UTC)

Berlin evaluation survey sounds a warning

None of what I say here should be taken as critical of the job done by WMDE people in organising the WM Conference 2014: they set high standards (see Q20 results). Rather, it is systemic issues of underperformance that we are not yet efficiently capturing and analysing. Too many of the survey questions were not well targeted at eliciting results that will prompt and inform efforts to improve the demonstrable outcomes of these expensive events. My concern is that the exit questions were pitched in vague terms to a self-selected and partially conflicted sample:

A strikingly low 55% of attendees responded (why so low, I wonder, when people were generously funded to attend? How can this sampling rate be improved for future conferences?).
The conflict arises from the fact that almost all attendees were beneficiaries of an average of more than $1000 in costs, it appears, paid by donors. Attendees are less likely to look a gift horse in the mouth than independent third parties. (So how can a survey be framed in ways that counter this bias?)
Even without those two factors in play, many of the questions were always going to yield stunningly positive results that would make a PR company envious: "The conference ... gave me the opportunity to exchange ideas with others on movement issues" and "The conference ... was suitable for my background and experience" are framed in ways that are likely to yield very high proportions of agreement and strong agreement. It's asking a lot to expect respondents to reflect critically and declare that it wasn't suitable for their background (what background and experience would lead someone to disagree with that proposition? What exactly do you want to know?). I'm unsure why anyone would declare by implication that they'd had no opportunity to exchange ideas with others. I wouldn't have. (Surely it's what kind of ideas that counts.)

Given that the survey was on the long side, consideration should be given to dumping or tightening two types of question. First, those that could be predicted to generate almost full approval ("helped me to gain knowledge from other Wikimedians": 98% strike rate; I could have told you that in advance). And second, questions that haven't been reverse-engineered from outcome utility (the conference "helped me release tensions" got a tick from 79%: I could whimsically ask whether more than one in five participants finished stressed out; unless you want to consider free valium, massage, and encounter groups next time, this is not useful data).

Instead, deeper probing is required: less predictability, more detail, fewer questions (especially no puff), and more "coding" I'm afraid. In particular, work back from possible results and ask whether they'd be actionable, whether they'll contribute to lessons learned. I acknowledge that written answers were possible in quite a few places; perhaps a shorter survey might elicit more written responses from people. Unsure.

Let's take a few of many possible examples:

Q14 "how many new working contacts did you make" looks like we have thousands of new dotted lines among working networks, with more than one in five people citing "41–60" new working contacts. Not defining "working contact" makes the results impossible to interpret in useful ways. Why not ask instead: "How many new contacts did you make with whom you expect to have online or offline dialogue concerning movement-based planning or activities over the next 12 months?" That might yield useful information, and certainly would show more realistic numbers. At $1000 per person, donors would want to know how much expectation of serious site-supporting work will result.
Q15 "helped me to join or start an initiative"—good thing to know, but I'm a little suspicious that 84% of the sample gave that a tick. It would be more credible if people had been asked to specify the initiative (with a note that it wouldn't be published, of course, but was just for data clarification—this might have lessened the impact of the undoubted enthusiasm of the moment, and possibly linguistic/cultural misunderstandings).
Q11 "Which three sessions were your favorites?" Were these scores analysed in relation to attendance at sessions? Does the 50% for "Chapters Dialogue" just mean it was attended by more than the others?
Q22 "Expectations". From a participant–supplier perspective, it showed happy travellers. But it raises the circularity of privileging participants' expectations over those of the WMF's key stakeholders, the readers, which the new ED has rightly announced are the preeminent concern.
Q20 alone was quite enough without 22: it showed solid endorsement of the organisers—well done. My only issue with Q20 is that we still don't know why 55% gave the Wifi less than top endorsement, and especially why nearly one in six thought it sucked. (What were their expectations? Could they not view vids, use google drive, skype audio? These things could have been built into the questioning: "If disagree or strongly disagree, please specify the problems you encountered.")

I've treated only negatives, whereas some of the data were useful. But overall the administration and content of questionnaires needs to evolve. It would be interesting to conduct a retro-survey a few months later, probing people in their more typical state of mind, with fewer, much more focused questions; but without preparation and warning you'd probably get a low participation rate. In any case, will the organisers write up a brief set of bullets for other organisers of movement conferences, perhaps, advising dos and don'ts they've learned? Even a few tips about how to deal with AV contractors, pitfalls in room design and whiteboards ... managing visa issues—your call. Building and communicating in-house knowledge is so valuable.

And just a final matter: in some cultures it would be considered rude to say anything critical of such an event, especially if your costs were paid by others, even in an anonymised survey. Some limbering up, both in writing in the intro of the survey, and probably orally in a keynote, might lessen that potential distortion: telling people how important their critical ideas are for improving the next conference might reduce the fear-of-rudeness factor. Tony (talk) 02:11, 29 July 2014 (UTC)

Talk:Learning and Evaluation/Connect/Community Dialogue
