Community Insights Collaboration, Diversity, & Inclusion Supplement (2021)/Statistical endnotes

Appendix. Methodological and statistical endnotes

Community Insights survey participants were weighted based on their sampling group and the monthly edit count used for sampling, and were compared across groups to understand differences, in line with the main report and analysis (see also Survey Methodology and the tab labelled “Appendix: Methodological and Statistical Endnotes”). The analysis in this supplement diverges slightly from the main Community Insights report and analysis: here, all participants have been coded to one of five contributor groups according to their highest level of leadership in the movement (i.e. Editor, Developer, Admin, Movement Organizer, Organizing Admin), and those Movement Organizers who were sampled outside of the Editors sample are included in the between-groups analysis with a weight of “1”. As many as 2193 and as few as 568 individuals responded to the various question sets that capture our social climate factors. Participant counts vary across question sets because some sets were randomized to shorten survey length while others were presented to every survey participant (see Appendix: 2020 Descriptive statistics). All responses were collected on a 5-point Likert scale of agreement with an option to respond “unsure.” All responses within each question set were scored such that a higher score is positive; this means that some items were reverse scored where appropriate (as noted), and item scores were averaged to produce each factor score presented.
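The scoring rule described above (reverse score where appropriate, then average) can be illustrated with a minimal sketch. It is not the survey team's actual code: the function names, the 1 to 5 coding, and the choice to treat “unsure” as a missing value that is left out of the average are all assumptions made for illustration.

```python
from statistics import mean

SCALE_MAX = 5  # assumption: agreement is coded 1 (lowest) to 5 (highest)

def score_item(response, reverse=False):
    """Score one item; None stands for an "unsure" answer (assumed missing)."""
    if response is None:
        return None
    if not 1 <= response <= SCALE_MAX:
        raise ValueError("response must be between 1 and 5")
    # Reverse scoring maps 1<->5 and 2<->4 so that higher is always positive.
    return SCALE_MAX + 1 - response if reverse else response

def factor_score(responses, reverse_flags):
    """Average the scored items of one question set, skipping "unsure"."""
    scored = [score_item(r, rev) for r, rev in zip(responses, reverse_flags)]
    valid = [s for s in scored if s is not None]
    return mean(valid) if valid else None

# Hypothetical three-item set: the second item is negatively worded and the
# third was answered "unsure".
print(factor_score([4, 2, None], [False, True, False]))
```

Dropping “unsure” before averaging keeps an undecided participant from pulling a factor score toward the scale midpoint; whether the report handled it this way is not stated in the text.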

Along with the following statistical tests, means and medians are also reported in the data tables for clarity, even though the assumptions of normality and of no outliers are violated.^[1] In addition, comparisons of percentage favorable have been included in the narrative to help translate the data into more meaningful terms; these data, along with all key aggregates of mean data, will be available in the interactive report after publication. In some cases, t-test results are also reported alongside Mann-Whitney U results to ease interpretation.

Gender Gap

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Kruskal-Wallis^[2]^[3] test was conducted to determine if there were differences in demographic representativeness between contributor groups. There were differences for gender representativeness. Specifically, ratios of participants who identified as men were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in the proportion of participants identifying as men were statistically significant between contributor groups, χ2(4) = 77.831, p < .001, N = 1694. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significant differences between movement-organizing Admins (mean rank = 906.13) and Movement Organizers (mean rank = 700.03) (p = .002), as well as between Movement Organizers and Editors (mean rank = 872.70) (p < .001), Movement Organizers and Developers (mean rank = 905.38) (p = .010), and Movement Organizers and non-organizing Admins (mean rank = 917.25) (p < .001), but not between any other group combination.
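The two-step procedure used throughout these endnotes (an omnibus Kruskal-Wallis test, then Dunn's pairwise comparisons with a Bonferroni correction) can be sketched as follows. This is an illustrative standard-library implementation under stated assumptions, not the software used for the report: the function names and the toy data are hypothetical, and the tie corrections follow the usual textbook formulas.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank pooled values 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of positions i+1..j+1
        i = j + 1
    return ranks

def kruskal_dunn(groups):
    """groups: dict of name -> list of scores.
    Returns (H, mean rank per group, {(a, b): (z, Bonferroni-adjusted p)})."""
    names = list(groups)
    pooled = [x for name in names for x in groups[name]]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of t^3 - t over sets of tied values, used by both tie corrections.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t**3 - t for t in counts.values())

    sizes, mean_ranks, start = {}, {}, 0
    for name in names:
        n = len(groups[name])
        sizes[name] = n
        mean_ranks[name] = sum(ranks[start:start + n]) / n
        start += n

    # Kruskal-Wallis H, divided by the tie-correction factor.
    h = 12 / (n_total * (n_total + 1)) * sum(
        sizes[g] * mean_ranks[g] ** 2 for g in names) - 3 * (n_total + 1)
    h /= 1 - ties / (n_total**3 - n_total)

    # Dunn's z for each pair; two-sided p multiplied by the number of pairs.
    pairs = list(combinations(names, 2))
    var_base = n_total * (n_total + 1) / 12 - ties / (12 * (n_total - 1))
    pairwise = {}
    for a, b in pairs:
        se = math.sqrt(var_base * (1 / sizes[a] + 1 / sizes[b]))
        z = (mean_ranks[a] - mean_ranks[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2))
        pairwise[(a, b)] = (z, min(1.0, p * len(pairs)))
    return h, mean_ranks, pairwise

# Hypothetical toy data, not survey responses.
h, mean_ranks, pairwise = kruskal_dunn(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
print(round(h, 3), mean_ranks)
```

The mean ranks returned here are the same kind of quantity as the "mean rank" values quoted in the text: a group with a higher mean rank tends to have higher values on the measure being compared.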

Geo Gap

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Kruskal-Wallis^[2]^[3] test was again conducted to determine if there were differences in the distribution of contributor groups based on geography: distributions of contributor types were not similar for all regions, as assessed by visual inspection of a boxplot. Distributions were statistically significantly different between the different contributor groups, χ2(5) = 103.392, p < .001, N = 1716. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significant differences between Europe (mean rank = 818.72) and Africa (mean rank = 1359.64) (p < .001), and Europe and Asia (mean rank = 911.30) (p = .005), as well as between Africa and Oceania (mean rank = 740.45) (p < .001), Northern America (mean rank = 863.80) (p < .001), Latin America & the Caribbean (mean rank = 880.63) (p < .001), and Asia (p < .001), but not between any other group combination. Spaces with higher mean ranks were represented more by Movement Organizers and Admins than by less engaged active Editors.

Audience differences in Collaborative Engagement

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Kruskal-Wallis^[2]^[3] test was conducted to determine if there were differences in Collaborative Engagement factor scores between contributor groups: Editors, On-wiki Admins, Developers, Movement Organizers, and Movement Organizing Admins. (Note: n-values vary by item; see Appendix: 2020 Descriptive statistics for n-values, means, and medians.)

Distributions of Awareness of Self & Others scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Awareness of Self & Others scores were statistically significant between the different contributor types, χ2(4) = 13.099, p = .011, N = 1200. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly higher Awareness of Self & Others scores among Movement Organizers (mean rank = 673.71) compared to non-organizing On-wiki Admins (mean rank = 548.77) (p = .040) and Editors (mean rank = 590.12) (p = .018), but not between any other group combination.
Distributions of Collaborative Intention scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Collaborative Intention scores were statistically significant between the different contributor types, χ2(4) = 27.396, p < .001, N = 1210. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly higher Collaborative Intention scores among Movement Organizers (mean rank = 716.51) compared to non-organizing On-wiki Admins (mean rank = 562.10) (p = .004) and Editors (mean rank = 584.43) (p < .001), but not between any other group combination.
Distributions of Engagement scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Engagement scores were statistically significant between the different contributor groups, χ2(4) = 62.196, p < .001, N = 2119. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower Engagement scores among Editors (mean rank = 1015.62) (p < .001) and Admins (mean rank = 978.83) (p < .001) compared to Movement Organizers (mean rank = 1257.31), as well as for non-organizing Admins (p < .001), Developers (p = .033), and Editors (p < .001) compared to movement-organizing Admins (mean rank = 1411.32), but not between any other group combination.
Distributions of Feelings of Belonging scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Feelings of Belonging scores were statistically significant between the different contributor groups, χ2(4) = 90.705, p < .001, N = 1777. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower Feelings of Belonging scores among Editors (mean rank = 826.74) compared to Movement Organizers (mean rank = 1086.36) (p < .001) and movement-organizing Admins (mean rank = 1264.06) (p < .001), as well as among non-organizing Admins (mean rank = 898.44) compared to Movement Organizers (p = .003) and movement-organizing Admins (p < .001), but not between any other group combination.
Distributions of Fairness scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Fairness scores were statistically significant between the different contributor groups, χ2(4) = 29.135, p < .001, N = 1767. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly higher Fairness ratings among non-organizing Admins (mean rank = 1080.93) compared to Editors (mean rank = 860.57) (p < .001), Developers (mean rank = 793.33) (p = .020), and Movement Organizers (mean rank = 888.48) (p = .001), but not between any other groups.
Distributions of Movement Leadership scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Movement Leadership scores were statistically significant between the different contributor groups, χ2(4) = 31.575, p < .001, N = 2206. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower scores among Admins who did not organize (mean rank = 827.57) compared to Movement Organizers (mean rank = 1154.28) (p < .001), and higher scores among Editors who were not Admins (mean rank = 1117.32) compared to Admins as well (p < .001), but not between any other groups.
Distributions of Movement Strategy scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Movement Strategy scores were statistically significant between the different contributor groups, χ2(4) = 44.974, p < .001, N = 558. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower Movement Strategy scores among non-organizing Admins (mean rank = 187.86) compared to Editors (mean rank = 259.29) (p = .017), Movement Organizers (mean rank = 327.58) (p < .001), and movement-organizing Admins (mean rank = 291.13) (p = .014), as well as higher scores among Movement Organizers compared to Editors (p < .001), but not between any other group combination.
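The p-values attached to the χ2(4) statistics above come from referring the Kruskal-Wallis statistic to a chi-square distribution with k − 1 degrees of freedom (here k = 5 contributor groups, so 4 degrees of freedom). As a worked check, the sketch below evaluates the upper-tail probability with the closed form that exists for even degrees of freedom; the function name is hypothetical and the snippet is only an illustration, not the procedure used to produce the report.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with an
    even number of degrees of freedom, using the closed-form finite sum
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Awareness of Self & Others: chi-square(4) = 13.099
print(round(chi2_sf_even_df(13.099, 4), 3))  # 0.011
```

The same function returns values far below .001 for the larger statistics reported in this section, which is why those results are written as p < .001 and not as an exact value.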

Year-over-year differences in Collaborative Engagement

There are some differences between the year-over-year analysis completed this year and that of previous years, which excluded those who did not progress to at least 50% completion of the survey. That cut point was required to create stability with the previous 2018 comparison year.^[5] This year’s report does not exclude by progress cut point, and thus the 2019 comparison statistics also differ somewhat from what was reported previously. As reported in the methodology of the 2019 data report, this cut point tended to exclude from the sample those sharing less favorable ratings (who come from less engaged roles) more frequently than those with more favorable ratings (who are more deeply engaged in movement roles).

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Mann-Whitney U test was used to determine if there were differences in factor scores between 2019 and 2020 data.^[6] (Note: n-values vary by indicator and year and are specified in the parenthetical notes.) When compared to 2019, an independent-samples U test found contributors more likely to report experiencing higher levels of Engagement (2020 mean = 4.14, N = 3683, t = 6.918; U = 1876791.500, 2019 mean rank = 1701.51, 2020 mean rank = 1945.70, p < .001), Feelings of Belonging (2020 mean = 3.70, N = 3025, t = 4.637; U = 1216015.00, 2019 mean rank = 1427.13, 2020 mean rank = 1573.31, p < .001), Movement Leadership (2020 mean = 3.33, N = 4277, t = 3.232; U = 2413761.5, 2019 mean rank = 2076.49, 2020 mean rank = 2197.68, p = .001), and Fairness (2020 mean = 2.82, N = 2982, t = 4.506; U = 1173114.50, 2019 mean rank = 1409.47, 2020 mean rank = 1547.90, p < .001). While these may indicate a true difference, we recommend caution, as we have also undergone changes to the sample recruitment methodology (opt-in email invitations versus repeated talk page posting of invitation reminders). It is unknown what effect this may also have had on the comparison. (See also Appendix: Changes from 2019 to 2020.)
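For readers who want to see how the U statistics and mean ranks in a year-over-year comparison relate to each other, the following is a minimal sketch of a two-sided Mann-Whitney U test. It uses the large-sample normal approximation with a tie correction and no continuity correction, which is an assumption about the variant and not a statement of how the report's figures were computed; the function name and the toy data are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).
    Returns (U for x, mean rank of x, mean rank of y, p-value)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1                 # size of this set of tied values
        tie_term += t**3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average rank for ties
        i = j + 1

    r1 = sum(ranks[:n1])              # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, r1 / n1, (sum(ranks) - r1) / n2, p

# Hypothetical factor scores for two survey years (not real data).
scores_2019 = [3.0, 3.5, 4.0, 2.5, 3.0]
scores_2020 = [4.0, 4.5, 3.5, 5.0, 4.0]
u, rank_2019, rank_2020, p = mann_whitney_u(scores_2019, scores_2020)
print(u, rank_2019, rank_2020, round(p, 3))
```

As in the text, the year with the higher mean rank is the one whose scores tend to be higher; the t values reported alongside are a separate, parametric comparison included only to ease interpretation.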

Audience differences in Diversity & Inclusion

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Kruskal-Wallis^[2]^[3] test was again conducted to determine if there were differences in Diversity & Inclusion factor scores between contributor groups, as follows:

Distributions of Non-Discrimination scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Non-Discrimination scores were statistically significant between the contributor groups, χ2(4) = 36.513, p < .001, N = 1817. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed that discrimination was reported statistically significantly more often by Admins and Movement Organizers than by active Editors without these responsibilities. Non-Discrimination scores among Editors (mean rank = 948.59) were higher compared to Admins (mean rank = 820.05) (p = .033), organizing Admins (mean rank = 685.73) (p = .004), and Movement Organizers (mean rank = 808.89) (p < .001), but did not differ between any other group combination. When looking more closely at the two items underlying this construct with a binary lens for those who had experienced harassment in the last year vs those who had not, these associations become more pronounced, χ2(4) = 63.018, p < .001, N = 1598. This is also the case for those who had been made to feel unsafe or uncomfortable vs those who had not, χ2(4) = 17.656, p = .001, N = 1764. Post hoc analysis revealed that harassment and discrimination were experienced more often by Movement Organizers (mean rank = 909.63), and especially Organizing Admins (mean rank = 1069.19), than by active Editors without these responsibilities (mean rank = 760.19, p < .001), as well as by Organizing Admins compared to On-wiki Admins who did not organize (mean rank = 841.58, p = .006). Having felt unsafe or uncomfortable contributing to Wikimedia projects in the last 12 months was also more prevalent among On-wiki Admins (mean rank = 974.94; p = .032) and Movement Organizers (mean rank = 933.03; p = .059) compared to Editors (mean rank = 857.97), but this was less pronounced.
Distributions of Inclusive Culture scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Inclusive Interactions scores were statistically significant between the different contributor groups, χ2(4) = 18.762, p < .001, N = 1651. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower scores among Editors (mean rank = 806.75) (p = .002) and Admins (mean rank = 751.74) (p = .007) compared to Movement Organizers (mean rank = 919.38), but not between any other groups.
Distributions of Individual Commitment to Diversity scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Individual Commitment to Diversity scores were statistically significant between the different contributor groups, χ2(4) = 20.852, p < .001, N = 1289. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly lower scores among Editors (mean rank = 625.33) (p = .001) and Admins (mean rank = 769.01) (p = .020) compared to Movement Organizers (mean rank = 733.32), but not between any other groups.
Distributions of Leadership Commitment to Diversity scores were not similar for all groups, as assessed by visual inspection of a boxplot. Differences in Leadership Commitment to Diversity scores were statistically significant between the different contributor groups, χ2(4) = 29.981, p < .001, N = 1204. Subsequently, pairwise comparisons were performed using Dunn's (1964) procedure with a Bonferroni correction for multiple comparisons.^[4] This post hoc analysis revealed statistically significantly higher scores among Movement Organizers (mean rank = 702.78) compared to Editors (mean rank = 576.66) (p < .001) and Admins (mean rank = 555.95) (p = .006), but not between any other groups.

Year-over-year differences in Diversity & Inclusion

Due to the non-normal distribution of the data, which could not be corrected via statistical transformation, a Mann-Whitney U test was used to determine if there were differences in factor scores between 2019 and 2020 data.^[6] (Note: n-values vary by indicator and year and are specified in the parenthetical notes.) Once again, the year-over-year analysis included only the Editors group, for which we are able to apply partial propensity score matching to weight for better representation, given the higher tendency of our more active editors to both start and complete the survey. When compared to 2019, an independent-samples U test found overall that contributors reported higher levels of Non-Discrimination (mean = 4.31, N = 3074, t = 2.101; 2020 mean rank of 1563.04 compared to 1500.58 in 2019, U = 1188391, p = .036). While this may indicate a true difference, we again recommend caution, as we have also undergone changes to the sample recruitment methodology and, in addition, one of the underlying items changed presentation protocols: in 2019, only those who first reported witnessing harassment were asked how often, whereas in 2020 all participants were asked how often, with an option for “never,” without a screening question. While we have worked to ensure alignment as best as possible, it is unknown what effect this may also have had on the comparison. (See also Appendix: Changes from 2019 to 2020.)

References

[1] Laerd Statistics (2015). "Statistical tutorials and software guides". Laerd Statistics.
[2] Kruskal, W. H.; Wallis, W. A. (1952). "Use of ranks in one-criterion variance analysis". Journal of the American Statistical Association, 47(260), 583-621.
[3] Lehmann, E. L. (2006). Nonparametrics: Statistical methods based on ranks. New York: Springer.
[4] Dunn, O. J. (1964). "Multiple comparisons using rank sums". Technometrics, 6, 241-252.
[5] Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL, USA: Chapman & Hall/CRC Press.
[6] Mann, H. B.; Whitney, D. R. (1947). "On a test of whether one of two random variables is stochastically larger than the other". The Annals of Mathematical Statistics, 18(1), 50-60.