Problems

I know of at least 2 people who were provided the survey and are male in real life. I think if you want to have a survey that is adequately about women, you can't have any men involved even if they pretend to be female online. I'm sure that many women would think that it would be a problem if the results were skewed in any way because of people who are not actually women. It is one of the problems with allowing people to create their own identities and then trying to get statistics off of them. Ottava Rima (talk) 15:28, 7 November 2011 (UTC)

It's possible for men to lie about their gender, just as it's possible for women to lie about their gender as they might want to do if they feel at risk of harassment. We cannot prevent people from lying, so we do the best we can and hope that the lies are rare enough and/or mutual enough that they are not significant for the outcome of the analysis. Pine (talk) 06:54, 21 November 2011 (UTC)

Future directions

Hi! This was great reading, and fits in nicely with thoughts I've had about why people will engage with Wikipedia editing. It works extremely well as a discussion paper, which I assume was the aim. In reading it, though, I had a few thoughts on how it could be further developed, so I'm tossing them here on the grounds that it seems like the right place. :)

My main concern for the first part is the reliance on the Women and Wikimedia Survey 2011. I think Ottava has a valid point, but more fundamentally, the 2011 survey primarily looked at those who self-identify as women. Thus the figure of 9% quoted is not in relation to all female editors, but only in relation to self-identified female editors. I don't know if this is the case, but it might be possible that more people willing to self-identify belong to a particular group, and thus self-identified female editors on Wikipedia are not a representative portion of female editors on the whole. (There's also an error in "lesbians account for 24% of the contributors", which contradicts the rest of the statistics.) The other concern with the 2011 survey is that it also allowed participants to invite others to answer the survey. This will tend to skew the results, as people will tend to ask others whom they know, possibly through shared interests. Thus you might get an overrepresentation of one or more groups because of a tendency for people within those groups to invite others who are similar, rather than those from outside.

Because of that, I'd try and drop the reliance on the 2011 survey - it serves to point to what sorts of questions a full survey could do, and it does that really well, but the methodology needs to be refined before the data collected can be used as more than a discussion point.

I'm also a tad nervous about the "57% of Wikipedia's female contributors are single", as it seemed like an interesting figure, but according to ABS stats, 49% of adults in Australia are not in a relationship. In conjunction with the age range of the survey, which included people from the age of 12 and with a median age of 31, that figure seems unsurprising taken on its own.

More generally, and perhaps more fundamentally - the tool used here seems to be based on the assumption that authors have the opportunity to write in their natural style. The study was based on single-authored texts, not, as is the case of Wikipedia, collaboratively authored texts. So I'm curious about the extent to which writing style is influenced by the purpose of the writing (writing for an encyclopedia would have specific style requirements), and by other external factors such as the MoS. I'd love to see more about how that may influence the figures. Unfortunately, without knowing the extent to which that influences the results, the comparison with other wikis doesn't provide a lot of data. But it does point to some interesting future research.

Anyway, just some thoughts. As mentioned, it is in keeping with what I would expect to find, and I presume that this was intended to spark discussion, hence my response. :) There's certainly a lot worth discussing in the research. - Bilby 06:28, 8 November 2011 (UTC)

That's a good point about the proportion of "singletons" in the general population -- it is growing in Western countries. Maybe it would be more telling to ask whether or not female editors are raising children. I think that would give a better indication of how much available free time they have to do anything online, and how they prioritize their time. Food for thought. OttawaAC 23:46, 14 November 2011 (UTC)

Issues with identifying people's gender based on text style

Am I the only one who has reservations about this idea? I understand that large-scale studies have identified some words that are more associated with one gender, but how reliable are these 'author gender tests' really? I'd like to see some evidence that they are actually generally correct before accepting the results of this study at face value. The fact that they apparently identified all of the articles on the GeekFeminism wiki (a wiki with a large, if not overwhelming, proportion of female contributors) as written by men raises doubts about their reliability. I'm inclined to think that these tools are not measuring gender at all, but rather style/tone of writing, and the words they identify as 'male' are associated with a more encyclopaedic tone. So it's no wonder Wikipedia scores highly on such measures. Robofish 23:58, 16 November 2011 (UTC)

The very idea smells sexist, indeed. Measuring style/tone can still be meaningful and useful, though, as the page notes. Nemo 09:14, 17 November 2011 (UTC)
What struck me is the revelation that other wikis with much greater female participation rates -- the Geek Feminism wiki in particular -- actually have comparable writing styles (under this metric) to Wikipedia. I'd like to see a larger sample size from each wiki, of course, but with the results given here, it tells me that the writing style gender assignments are either fatally flawed, or inherent somehow in the nature of information writing. LtPowers 15:36, 20 November 2011 (UTC)
The importance of other wikis, for me, was that these other wikis are just like Wikipedia: They require the main text be written using a male writing style BUT these wikis still attract a large female editor base. Ultimately, for me, when coupled with the user page information, they confirm several things about Wikipedia's female population: 1) Wikipedia's female editor base is not representative of female writers in general, 2) Wikipedia's existing female writer base may actually be harmful to the goal of reducing the gender gap because of how they participate and literature that suggests some of these women may drive off other female participation, 3) "The nature of Wikipedia using male information styles" cannot be an excuse because other wikis can write in male styles and attract large amounts of female participation. --LauraHale 19:27, 20 November 2011 (UTC)

Possible re-interpretation of conclusions

This article concludes that Wikipedia attracts women who write like men (also, can we perhaps use the terms women and men, rather than males and females? The latter two are used to oppress trans-gendered men and women, and that is just the beginning of the problem). My interpretation of the conclusions would be more that the edits which are allowed to stand, and the women who are accepted as part of the Wikipedia community, are more likely to read as masculine. I suspect feminine writing is more likely to be challenged on NPOV, and that the sexism in online communities is likely to be more acutely felt by women who write using a feminine style, and as such I suspect Wikipedia is the cause of the masculine writing style, rather than its effect.

We chose the terms male and female, instead of men and women, because there are people under the age of 18 who edit Wikipedia. We felt that the terms male and female were more accurate. Pine (talk) 06:58, 21 November 2011 (UTC)

I also see many ways to interpret the conclusions and would like much more detailed discussion of them. There is a logical circularity to saying that you can, on the one hand, distinguish male and female writing styles, but on the other hand, women on Wikipedia write like men. Starting from the resulting data, one might conclude more simply that there are not pronounced gender differences in Wikipedia writing, even if there are differences in formal BNC writing, which is certainly neither a logical nor empirical absurdity. (That does not mean I disagree with your interesting inference in a different direction; you may well be correct, but I think substantially more support is needed.) The fact that the sample from which you draw your base observations about gender difference (formal writing from the British National Corpus) is very different from the user pages and/or a subset of gender-related Wikipedia pages is also troubling. The fact that Wikipedia pages are collaboratively edited, as the user above suggests, whereas BNC articles typically have much more traditional editing styles, also suggests that you may be comparing apples and oranges. I don't see a strong effort to distinguish between edits and original text contributions, for example.

This is an important contribution to an ongoing discussion that is widely covered in the academic literature; I would like to see a much closer integration of your methodology with current techniques of corpus analysis, including submission to a peer-reviewed journal, where you will receive more feedback from corpus linguists who can guide you to additional resources, methods, authors, and experiments that bear on your question. As presented now, there are too many unchecked assumptions and inferences without methodological backup to draw any strong conclusions from your obviously interesting data and work. I appreciate Wikipedia's DIY ethic, but I am not clear how the persistent side-stepping of existing research strands, and reliance on just a few (if important ones), helps your audience to have confidence in your results. Have you, for example, run the paper by Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni, whose methods and research you are relying on heavily? I think they would have a lot of advice and support for your efforts. 173.53.108.220 17:14, 20 January 2012 (UTC)

Wording differences on Wikipedia between male and female user pages

This isn't a fully processed thought, but rather a data sample based on what we've already done (and that wasn't included in the "paper" largely because it was getting unwieldy and there didn't appear to be much added value to putting this into the article beyond YUM! TASTY DATA! YUM!). Using the original list of self-identified male and female contributors from user boxes as in the original paper, there were 867 females represented in the female data set, and 568 males represented in the male data set. For each data set, we ran a program that counted the total number of occurrences of all words for all females and all males. Example: Female 1 uses the word pornography 5 times, baseball 2 times and love 15 times. Female 2 uses the word pornography 0 times, baseball 20 times and love 6 times. Total for all female scores: pornography 5 times, baseball 22 times and love 21 times. All words with 3 or fewer letters were removed from the list. All words with _ in them were removed. A number of words that implied HTML or wiki-specific coding were removed. These included words like table, alignleft, alignright. (Not all were removed, and some may have remained. Important to remember: every word that was excluded was excluded for both genders.)
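For anyone who wants to repeat the count, here is a minimal sketch of the procedure described above. It is not the program that was actually run: the tokenizer (lowercased runs of letters, digits and underscores), the function name `count_words` and the three-word `MARKUP_WORDS` set are assumptions for illustration, and the real list of removed markup words was longer.

```python
import re
from collections import Counter

# Assumed stop list: the post names only these three markup-like words.
MARKUP_WORDS = {"table", "alignleft", "alignright"}

def count_words(pages, markup_words=MARKUP_WORDS):
    """Sum word occurrences over every user page in one data set."""
    totals = Counter()
    for text in pages:
        # Assumed tokenizer: lowercase runs of letters, digits and underscores.
        for word in re.findall(r"[a-z0-9_]+", text.lower()):
            if len(word) <= 3:        # drop words with 3 or fewer letters
                continue
            if "_" in word:           # drop words containing an underscore
                continue
            if word in markup_words:  # drop HTML/wiki-specific tokens
                continue
            totals[word] += 1
    return totals

# The worked example from the post: two female user pages.
female_pages = [
    "pornography " * 5 + "baseball " * 2 + "love " * 15,
    "baseball " * 20 + "love " * 6,
]
female_totals = count_words(female_pages)
print(female_totals.most_common(3))
# [('baseball', 22), ('love', 21), ('pornography', 5)]
```

Running the same function over the male pages with the same stop list keeps the filtering identical for both genders, which is the point made at the end of the paragraph above.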

Women used 10,871 unique words. Men used 8,933 unique words. The table below shows the top 100 words used by each gender.

Female word list Female Use count Female word rank Male word list Male use count Male word rank
this 864 1 this 947 1
that 621 2 wikipedia 414 2
wikipedia 595 3 that 381 3
have 559 4 image 335 4
with 412 5 white 319 5
page 393 6 have 252 6
about 378 7 page 251 7
from 295 8 with 245 8
like 293 9 info 232 9
image 285 10 wiki 224 10
userboxes 285 11 from 223 11
also 273 12 articles 220 12
articles 273 13 user 218 13
white 270 14 nbsp 207 14
wiki 249 15 userboxes 177 15
user 223 16 about 171 16
more 210 17 article 145 17
info 208 18 middle 138 18
here 207 19 small 137 19
navframe 201 20 also 133 20
just 200 21 time 133 21
will 197 22 here 130 22
time 195 23 your 128 23
other 188 24 transparent 124 24
some 188 25 like 120 25
love 181 26 green 115 26
small 161 27 navframe 109 27
when 157 28 more 103 28
article 155 29 other 101 29
been 150 30 please 101 29
family 149 31 some 100 31
school 148 32 created 98 32
people 142 33 work 98 32
know 141 34 family 96 34
there 138 35 list 95 35
music 134 36 university 94 36
things 133 37 will 93 37
nbsp 127 38 interests 92 38
english 125 39 history 85 39
much 125 40 talk 84 40
middle 123 41 there 83 41
your 123 42 gray 82 42
interests 122 43 music 82 43
display 121 44 people 80 44
what 120 45 contributions 78 45
currently 117 46 been 77 46
pages 115 47 english 77 46
because 114 48 information 77 46
cellpadding 112 49 just 77 46
please 111 50 edits 76 50
auto 107 51 barnstar 74 51
editing 107 51 welcome 74 51
well 107 51 display 73 53
hello 106 54 supports 73 54
most 106 54 currently 71 55
navhead 105 56 ffffff 71 55
really 105 56 know 69 57
talk 105 56 cellpadding 68 58
navcontent 104 59 weight 68 58
read 103 60 hello 67 60
favorite 102 61 what 67 60
university 102 61 only 66 62
information 101 63 which 66 62
which 101 63 when 65 64
work 101 63 free 64 65
good 100 66 most 64 65
normal 99 67 united 64 65
since 98 68 would 64 65
they 97 69 category 63 69
very 97 69 pages 63 69
would 97 69 position 63 69
only 96 72 plainlinks 61 72
than 96 72 they 61 72
contributions 95 74 eats 60 74
help 95 74 school 60 74
history 94 76 world 60 74
life 94 76 auto 59 77
list 93 78 first 59 77
make 91 79 editing 58 78
many 91 79 make 58 78
years 91 79 since 58 78
serif 88 82 computer 57 82
barnstar 87 83 good 57 82
should 87 83 blue 56 82
wikimedia 86 85 live 56 82
feel 84 86 navcontent 56 82
ffffff 84 86 navhead 56 82
something 84 86 data 55 88
born 82 89 inherit 55 88
face 82 89 where 55 88
find 82 89 life 54 91
sans 82 89 orange 54 91
anything 81 93 party 54 91
first 81 93 related 54 91
live 80 95 serif 54 91
writing 80 96 things 54 91
reading 79 97 large 53 97
student 78 98 years 53 97
interested 77 99 collapsible 52 99
want 77 99 games 52 100
welcome 77 99 than 52 100
them 75 102 them 52 100
think 75 102 born 50 103

Is there a lot of difference between ranking of words on these lists which suggest different patterns in gender? 77 words appear on both lists for the top 100. They are: that, wikipedia, have, with, page, about, from, like, image, userboxes, also, articles, white, wiki, user, more, info, here, navframe, just, will, time, other, some, small, when, article, been, family, school, people, know, there, music, things, nbsp, english, middle, your, interests, display, what, currently, pages, cellpadding, please, auto, editing, hello, most, navhead, talk, navcontent, university, information, which, work, good, since, they, would, only, than, contributions, history, life, list, make, years, serif, barnstar, ffffff, first, live, welcome, them.

Words that appeared on the female list of top 100 words but not the male list include: love, much, because, well, really, read, favorite, normal, very, help, many, should, wikimedia, feel, something, born, face, find, sans, anything, writing, reading, student, interested, want, think. These words may still appear on both full lists, just outside the top 100. For example, born is ranked the 90th most used word for women and the 103rd most used word for men. Help ranks 118 for men, and 74 for women. (Helpful is the 1192nd most used word by women, and 1288th for men. Helping ranks 603 for men and 707 for women.) Without doing a lot of math, this data appears to confirm the conclusion from other data that male and female word usage on Wikipedia does not differ substantially. It appears highly unlikely that a word list could be used to identify the gender of Wikipedia users based on the text they write against a body of work based on known gendered Wikipedians. --LauraHale 04:42, 21 November 2011 (UTC)
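The overlap comparison above can be sketched as follows. This is only an illustration under assumptions: the function names are made up, the inputs are assumed to be word lists already sorted from most to least used, and `rank_of` returns a plain position, so it will not reproduce the tied ranks shown in the table.

```python
def compare_top_lists(female_ranked, male_ranked, top_n=100):
    """Split the two top-N lists into shared, female-only and male-only words."""
    female_top = female_ranked[:top_n]
    male_top = male_ranked[:top_n]
    female_set, male_set = set(female_top), set(male_top)
    shared = [w for w in female_top if w in male_set]
    female_only = [w for w in female_top if w not in male_set]
    male_only = [w for w in male_top if w not in female_set]
    return shared, female_only, male_only

def rank_of(word, ranked):
    """1-based position of a word in a ranked list, or None if it is absent."""
    return ranked.index(word) + 1 if word in ranked else None

# The first five rows of each column in the table, used as a toy input.
female_ranked = ["this", "that", "wikipedia", "have", "with"]
male_ranked = ["this", "wikipedia", "that", "image", "white"]

shared, female_only, male_only = compare_top_lists(female_ranked, male_ranked, top_n=5)
print(shared)       # ['this', 'that', 'wikipedia']
print(female_only)  # ['have', 'with']
print(male_only)    # ['image', 'white']
print(rank_of("wikipedia", female_ranked), rank_of("wikipedia", male_ranked))  # 3 2
```

A word that falls in `female_only` can then be looked up with `rank_of` against the full male list to see how far outside the top 100 it sits, as was done for born and help.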

And because the data was sitting around… a comparison of the top one hundred words used, based on the users and their stated gender/program-identified gender.

Stated F identified F Count Stated F identified M Word Stated F identified M Count Stated M identified F Word Stated M identified F Count Stated M identified M Word Stated M identified M Count
282 this 705 wikipedia 113 this 894
253 that 338 that 95 image 329
235 wikipedia 334 have 80 wikipedia 306
219 have 304 wiki 77 white 297
169 image 272 with 65 that 293
164 page 239 like 61 info 218
162 white 230 page 59 nbsp 200
136 about 201 articles 58 page 193
124 with 194 from 58 with 181
107 from 190 user 54 have 172
106 also 176 this 53 articles 166
106 articles 167 about 52 from 164
101 userboxes 161 data 48 user 164
100 info 153 label 47 wiki 146
100 user 149 userboxes 47 userboxes 131
99 wiki 145 also 40 middle 123
96 navframe 135 transparent 37 about 121
96 like 130 here 33 article 119
96 small 127 more 33 small 108
90 some 127 please 33 your 108
83 other 122 collapsible 31 green 106
77 more 111 auto 30 time 101
74 middle 108 edits 30 here 98
71 here 106 just 30 navframe 95
71 nbsp 103 talk 30 also 92
69 family 101 information 29 transparent 85
69 time 98 inherit 29 work 82
65 will 96 list 29 other 81
63 article 94 small 29 some 79
63 just 94 time 29 created 75
63 school 89 university 25 family 75
61 display 88 will 25 gray 74
60 english 85 english 24 more 70
58 love 85 music 24 supports 70
56 currently 84 welcome 24 university 69
55 people 83 dcdcdc 23 will 69
55 barnstar 82 interests 23 barnstar 68
52 music 82 school 23 please 68
52 been 81 white 23 interests 67
51 pages 79 because 22 list 67
49 interests 78 currently 22 history 66
49 read 75 know 22 there 66
49 serif 73 very 22 display 62
48 what 73 been 21 ffffff 61
48 face 72 category 21 eats 60
48 navcontent 72 family 21 people 59
47 navhead 72 hello 21 cellpadding 58
47 your 71 people 21 like 58
46 cellpadding 70 some 21 music 58
46 auto 69 years 21 what 58
46 favorite 69 your 21 been 56
46 there 69 article 20 contributions 56
46 sans 68 contributions 20 position 56
45 well 68 created 20 talk 55
45 university 67 history 20 weight 55
44 normal 66 only 20 english 53
43 work 66 other 20 blue 52
42 georgia 65 languages 19 united 52
41 information 65 love 19 free 51
41 list 65 make 19 information 50
40 gray 64 much 19 orange 50
40 history 63 there 19 welcome 50
40 hello 62 they 19 which 50
40 when 62 email 18 currently 49
39 only 61 help 18 navcontent 49
38 things 61 infobox 18 navhead 49
38 should 60 most 18 pages 49
38 contributions 59 myself 18 world 49
38 good 59 nowiki 18 hello 48
37 most 59 work 18 large 48
36 than 58 made 17 mijzelffan 48
36 which 58 student 17 when 48
36 wikimedia 58 when 17 would 48
35 editing 57 leave 16 first 47
35 logo 57 message 16 just 47
35 born 56 pages 16 know 47
34 film 56 related 16 life 47
34 first 56 which 16 airborne 46
34 ffffff 55 would 16 computer 46
34 please 55 education 15 most 46
34 know 53 high 15 only 46
33 site 53 into 15 plainlinks 46
33 years 53 male 15 edits 45
33 both 52 many 15 flag 45
33 much 52 plainlinks 15 than 45
32 cornflowerblue 51 politics 15 where 45
32 since 51 religion 15 editing 44
32 student 51 since 15 gold 44
32 writing 51 things 15 good 44
32 high 50 editing 14 live 44
32 life 49 everything 14 serif 44
31 they 49 info 14 born 43
31 topright 49 navframe 14 party 43
31 american 48 them 14 since 43
30 female 48 vandalism 14 wrote 43
30 find 48 bgcolor 13 best 42
30 interested 48 editor 13 category 42
30 something 47 interested 13 games 42
29 verdana 47 movies 13 they 42
29 supports 46 weight 13 make 41
29 welcome 46 world 13 absolute 40
would 46 space 40
well 40

Here, there appear to be a few differences. --LauraHale 07:17, 21 November 2011 (UTC)

Female wikiHow users as a benchmark for female Wikipedians: Female Wikipedians use more male styles of writing than female wikiHowians

I used the user boxes on wikiHow to create a list of wikiHow users by gender. I then ran their userpages through the same analysis I did for Wikipedia users where we knew their gender based on user boxes. Basically, the same methodology described on the main page, only switching sites. There were 256 unique users who had gender boxes identifying them by gender. When scores were removed for people scoring the same such as 0-0 (and effectively being gender neutral or undetermined), there were 231 users left, with 25 having been removed. The results are here:

Category                    Gender        Count  Percentage
wikiHow identified gender   Male          66     28.6%
wikiHow identified gender   Female        165    71.4%
Program identified gender   Male          131    56.7%
Program identified gender   Female        100    43.3%
Correctness                 Yes - Female  81     49.1%
Correctness                 Yes - Male    47     71.2%
Correctness                 No - Female   84     50.9%
Correctness                 No - Male     19     28.8%
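The percentages in the table above can be reproduced from the four correctness counts alone. The sketch below is a check on the arithmetic, not the original analysis code; the dictionary layout and the function name `summarize` are assumptions for illustration.

```python
from collections import defaultdict

# Counts copied from the table: (stated gender, program-identified gender).
counts = {
    ("female", "female"): 81,
    ("female", "male"): 84,
    ("male", "male"): 47,
    ("male", "female"): 19,
}

def summarize(counts):
    """Return totals by stated gender, totals by identified gender, and
    the share of each stated gender that the program identified correctly."""
    stated = defaultdict(int)
    identified = defaultdict(int)
    correct = defaultdict(int)
    for (stated_gender, program_gender), n in counts.items():
        stated[stated_gender] += n
        identified[program_gender] += n
        if stated_gender == program_gender:
            correct[stated_gender] += n
    accuracy = {g: correct[g] / stated[g] for g in stated}
    return dict(stated), dict(identified), accuracy

stated, identified, accuracy = summarize(counts)
print(stated)      # {'female': 165, 'male': 66}
print(identified)  # {'female': 100, 'male': 131}
for gender, share in accuracy.items():
    print(gender, f"{share:.1%}")  # female 49.1%, male 71.2%
```

Note that the two correctness percentages use different denominators (165 stated females, 66 stated males), so they are per-gender rates and do not sum to 100%.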

We're dealing with a much smaller sample from a site that has 43% female participation but an overall smaller contributor base. They have a greater percentage of females identifying with userboxes than Wikipedia does. (Suggests to me that women feel even more comfortable expressing femaleness there.) Beyond that, the program correctly identifies women 11% more often than it does on Wikipedia. 11% in this case seems significant to me. I think we can assume some bias toward male writing styles in both spaces because both exist to convey factual information… but Wikipedia's population and wikiHow's should match up if they were attracting similar types of females… and they don't, because there is that 11% difference. Beyond that, the program correctly identified male users within 1% (both roughly 71%) for wikiHow and Wikipedia. This says to me the two wikis have distinct groups of female users, as their characteristics are not the same. Wikipedia's females use male language much more in their personal space than wikiHow's female contributors.

Also worth noting, the female users appear to have their male/female word count points grouped more closely together than their male counterparts. Difference in STDEV for females between male and female scores is 77. For males, the difference is 188. Difference for women for mean between the male score and female score is 44. For men, it is 135. I'm guessing that we have a situation where, when plotted, the men would be further from the gender-neutral line than the women would be. A better idea of the users within the group…

STDEV Female score Male score
Female user 1380.1 1457.5
Male user 824.5 1012.9
MEAN Female score Male score
Female user 810.1 852.3
Male user 558.5 693.6
MEDIAN Female score Male score
Female user 324 367
Male user 276.5 319.5
MODE Female score Male score
Female user 0 0
Male user 0 213
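The gaps quoted before the table can be recomputed from the table values. This is a small sketch with made-up names (`stats`, `gap`), taking each gap as male score minus female score. From the rounded figures shown, the STDEV gaps come out at 77.4 and 188.4 and the male mean gap at 135.1, matching the text; the female mean gap comes out at 42.2 rather than the quoted 44, which may simply reflect unrounded source data.

```python
# (female score, male score) pairs copied from the table above.
stats = {
    "stdev":  {"female user": (1380.1, 1457.5), "male user": (824.5, 1012.9)},
    "mean":   {"female user": (810.1, 852.3),   "male user": (558.5, 693.6)},
    "median": {"female user": (324, 367),       "male user": (276.5, 319.5)},
}

def gap(stat, user):
    """Male score minus female score for one statistic and one user group."""
    female_score, male_score = stats[stat][user]
    return round(male_score - female_score, 1)

for stat in stats:
    for user in stats[stat]:
        print(stat, user, gap(stat, user))
```

A smaller gap means a group's male and female scores sit closer together, which is the sense in which the female users are described as closer to the gender-neutral line.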

All this new data supports the methodology being valid. You may not like the study about gendered language and the word lists, but they developed a method of determining different patterns of language usage between genders. The wikiHow data really, really supports the validity of it. Wikipedian female users over-representing as male supports the supposition that female Wikipedians are much more likely to use male-coded language and that they are not representative of the wider population of females. There is no real valid place to critique the methodology as flawed, because the distinct groupings validate it, and the wikiHow data is just the icing on the cake. --LauraHale 11:10, 21 November 2011 (UTC)

Return to "Mind the Gap" page.