Research:MoodBar/Observational study

Introduction

MoodBar has been designed as an engagement feature to encourage feedback from new users. The rationale for this feature is that the ability to send feedback may improve the experience of new users on Wikipedia and improve their chances of socializing in the community and overcoming early barriers to participation. In this report, we want to understand how MoodBar users compare with the general population of active users.

Terminology and definitions

  • An active user is a registered user[1] who clicked at least once on the "Edit" button, but who did not necessarily complete any edit.
  • A MoodBar user is an active user who has posted at least one piece of feedback using MoodBar. Hence MoodBar users are a subset of all active users.

Not all registered users are active users.[2] A graph, updated once a day, of the daily volume of new account registrations, active users, and active users who completed at least one edit is available here.

Our goal is to establish what kind of associations exist between the cumulative edit count of a user in the first 30 days of activity and a number of selected factors of interest, namely the reported mood (from now on, the mood) and the outcome of the interaction this user has had with MoodBar: whether a response was posted or not, and whether it was considered helpful or not. We consider these conditions as different treatments that new registered users receive.

The main dependent variable of the present analysis is the edit count. This variable is measured, for all users, on five different occasions during their first 30 days of activity, specifically at 1, 2, 5, 10, and 30 days after the activation of MoodBar.[3] We refer to these different snapshots as the age of a user account: a 5-day-old active account is the account of a user who clicked for the first time on the edit button 5 days before the edit count measurement was taken.

Research questions

Our research questions are the following:

  1. How does the typical active user behave in the first 30 days of activity? Are MoodBar users more or less productive than the general population of active users?
  2. Within the population of MoodBar users:
    1. Is receiving a response associated with a higher level of productivity, compared to not receiving any response?[4]
    2. Is receiving a response that is judged as helpful associated with a higher level of productivity, compared to a response that was not marked as helpful?[5]

Methods

To answer the above questions, we analyzed new user contribution data and MoodBar usage data. We performed a regression analysis with edit count as the dependent variable, using a regression model suitable for count data.[6]

We collected data in the interval between 2011-12-14 00:00:00 UTC and 2012-05-22 23:59:59 UTC (which we call the start and end of the observation period, respectively). We consider two separate samples of users: one containing all active users registered in that interval, and a subsample containing only those users who posted at least one piece of feedback with MoodBar.

Variables

As we said, our goal is to study the association of several variables -- account age, MoodBar treatment, and reported mood (see above for the terminology) -- with a user's edit count. Of course, several other variables might influence the edit count of a newly registered editor in her first 30 days of activity. We thus control for a number of additional variables in our analysis.

Because the two datasets refer to two different populations of users, not all variables are available in both datasets. The complete list of variables is given in the following table; control variables are marked with an asterisk:


Table 1: List of variables used in the analysis (control variables marked with *)
Variable | Description | All users | MoodBar users
EDITCOUNT | Dependent variable | Yes | Yes
AGE | Time of measurement: 1, 2, 5, 10, and 30 days since activation of MoodBar | Yes | Yes
TREATMENT | A categorical factor describing the user treatment: Reference, Feedback, Feedback+Response, Feedback+Helpful | Yes | Yes
REGISTRATION_MONTH* | Months since the start of the observation period[7] | Yes | Yes
ACTIVATION_LAG* | Time between account registration and activation of MoodBar | Yes | Yes
FEEDBACK_LAG* | Time between activation of MoodBar and first feedback | No | Yes
MOOD | Mood type: happy, sad, or confused | No | Yes
IS_EDITING* | Was the feedback reported while editing? | No | Yes
FEEDBACK_EDITCOUNT* | Edit count at the time of feedback | No | Yes
NAMESPACE* | Namespace of the page from which the feedback was sent | No | Yes
TEXT_LENGTH* | Number of characters in the feedback message | No | Yes
OS* | Operating system reported in the user agent (UA) string of the user's browser | No | Yes
BROWSER* | Browser model (no version) reported in the UA string | No | Yes
NUM_FEEDBACK* | Total number of feedback messages sent with MoodBar | No | Yes

Treatment type

The vast majority of MoodBar users in this sample sent only one piece of feedback and received no more than one response, so to determine the treatment group of a user we focus only on the first feedback. All active users who never sent any feedback represent the reference group. All other users are assigned to a treatment group based on the outcome of their first interaction with MoodBar (a minimal sketch of this assignment follows the list):[8]

  • Reference indicates active users who never used MoodBar.
  • Feedback indicates users who used MoodBar (at least) once but did not receive any response.
  • Feedback+Response indicates MoodBar users who received (at least) a response but did not mark it as “helpful”.[4]
  • Feedback+Helpful indicates MoodBar users who received (at least) a response and marked it as “helpful”.[5]
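As an illustration only, the assignment described above can be expressed in R as follows. This is a minimal sketch: the data frame users and its per-user flags sent_feedback, got_response, and marked_helpful are hypothetical names, not columns from the actual MoodBar tables.

Treatment assignment - R sketch
# Hypothetical per-user flags: sent_feedback, got_response, marked_helpful.
assign_treatment <- function(sent_feedback, got_response, marked_helpful) {
  ifelse(!sent_feedback, "Reference",
  ifelse(!got_response,  "Feedback",
  ifelse(!marked_helpful, "Feedback+Response", "Feedback+Helpful")))
}

# Store the result as an ordered factor with the four treatment levels.
users$treatment <- factor(
  assign_treatment(users$sent_feedback, users$got_response, users$marked_helpful),
  levels = c("Reference", "Feedback", "Feedback+Response", "Feedback+Helpful")
)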

The composition of our sample is highly skewed -- posting feedback via MoodBar is the exception, not the norm. We can see this from the table below, which reports the size of the different treatment groups:

Table 2: Sample composition: number of users by treatment
Reference Feedback Feedback+Response Feedback+Helpful
515,519 9,838 4,692 838

Data collection

The edit count is computed using the following query:

Data collection - SQL query
-- ? is the account age in days (1, 2, 5, 10, or 30);
-- @min_registration and @max_registration delimit the observation period.
SELECT 
    u.user_id as user_id,
    -- cumulative number of edits made within ? days since MoodBar activation
    SUM(CASE 
        WHEN udc.contribs IS NULL THEN 0
        WHEN udc.day - INTERVAL ? DAY <= DATE(ept.ept_timestamp) THEN udc.contribs 
        ELSE 0
    END) as editcount
FROM 
    user u
JOIN
    -- edit_page_tracking stores the timestamp of the first click on the "Edit" button
    edit_page_tracking ept
ON
    u.user_id = ept.ept_user
LEFT JOIN
    -- user_daily_contribs stores the number of contributions per user per day
    user_daily_contribs udc
ON 
    u.user_id = udc.user_id
WHERE 
    DATE(u.user_registration) >= @min_registration
    AND
    DATE(u.user_registration) < @max_registration
GROUP BY
    u.user_id

The queries for performing the data collection can be found here (edit count computation), here (full sample extraction), and here (MoodBar users sub-sample extraction).

Data analysis

For our analysis we used R, an open source statistical environment.[9]
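For illustration, the following sketch shows how a negative binomial regression of this kind can be fitted with the glm.nb routine from the MASS package.[9] The data frame allusers and its lower-cased column names (editcount, age, treatment, registration_month, activation_lag) are assumed here for readability; the formula follows the final model described in the footnotes.

All users model - R sketch
library(MASS)  # provides glm.nb for negative binomial regression

# 'allusers': one row per user and per measurement age (hypothetical layout).
fit.all <- glm.nb(
  editcount ~ I(age - 1) * treatment + registration_month + activation_lag,
  data = allusers
)

summary(fit.all)    # estimates, standard errors, z values (cf. Table 3)
exp(coef(fit.all))  # multiplicative effects (the "Effect" column)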

Results

Exploratory analysis

The exploratory analysis shows that MoodBar users tend to be more productive than the average active editor who has never sent feedback. Figure 1 shows the average edit count by treatment at 1, 2, 5, 10 and 30 days since the activation of MoodBar. Each panel corresponds to a day:

 
Figure 1.
Average cumulative edit count at 1, 2, 5, 10, and 30 days of activity for active users who sent at least one piece of feedback with MoodBar (green bars, further divided by treatment) and active users who never sent any feedback with MoodBar (Reference group, orange bars). Error bars represent the standard error of the mean.

Interestingly, compared to the Reference group, the edit count increase for the Feedback+Response group is lower than the increase for the Feedback group. In other words, receiving a response to one's feedback is associated with a lower average increase in edit count than just sending the feedback and not receiving any response.

We try to get a sense of the growth rate of the edit count over the whole 30-day period in the following plot, in which we also explore the role of the reported mood; since the mood is only defined for MoodBar users, we focus on that sub-sample only. Figure 2 shows smoothed growth curves (solid lines) of the average edit count, aggregated by treatment and by reported mood.[10] As a reference for reading the plot, we include the heights of the green bars of Figure 1, which represent the average edit count aggregated only by treatment.

 
Figure 2.
Smoothed average (LOESS regression) of cumulative edit count in the first 30 days of activity of MoodBar users, aggregated by reported mood (see legend) and by treatment (panels). Black points represent the mean edit count by treatment (not aggregated by reported mood), with standard errors (vertical black bars).
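A plot of this kind can be reproduced, for example, with ggplot2's LOESS smoother (local regression, cf. footnote [10]). This is only a sketch: the data frame mbusers and its layout (one row per MoodBar user and per day of account age, with columns age, editcount, mood, and treatment) are assumptions, not the actual analysis script.

Figure 2 - R sketch (LOESS smoothing)
library(ggplot2)

# 'mbusers': one row per MoodBar user and per day of account age (hypothetical).
ggplot(mbusers, aes(x = age, y = editcount, colour = mood)) +
  geom_smooth(method = "loess", se = FALSE) +  # local regression smoothing
  facet_wrap(~ treatment) +                    # one panel per treatment group
  labs(x = "Account age (days)", y = "Average cumulative edit count")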

Figure 2 shows a number of interesting things:

  • Once the reported mood is taken into account, there is quite a bit of heterogeneity for the Feedback (left panel) and the Feedback+Response groups (middle panel). This is particularly accentuated for people who did not receive a response to their “sad” feedback, whose growth trajectory stands apart from the group average, represented by the black points. Interestingly, these people tend to outperform their peers who reported a different mood.
  • For Feedback+Response (middle panel) we see that until day 5 people who report a “confused” and a “happy” mood tend to have similar trajectories. These are close to the group average, but soon after they depart, both from each other and from the group average. In contrast, those who report a “sad” mood tend to make initially more edits and to grow at a faster rate. This goes on until approximately day 6, when they experience a slowdown in productivity. Indeed, by day 30, the “happy” group seems to have caught up with them.
  • In contrast to the first two groups, Feedback+Helpful (right panel) seems to show less heterogeneity, as all trajectories seem to fall within the canonical two standard deviations from the group average. This might suggest that the reported mood is less important once you receive a helpful response.
  • However, compared with the other two groups, users in Feedback+Helpful tend to be more productive already from day 1, and to grow at a much higher rate than everybody else. By day 30, those people in this group who reported a “happy” mood have accrued an average of 30 edits, compared to roughly 11 edits for their peers in the other two groups who reported the same mood. Both observations suggest that there might be a certain degree of self-selection in the Feedback+Helpful group.

In summary:

  1. MoodBar users tend to be more productive than non-MoodBar users.
  2. There is some unexplained heterogeneity for the first two treatments (Feedback and Feedback+Response) once the mood is taken into account, and,
  3. there might be self-selection going on in the third group (Feedback+Helpful), which might explain why these users are by far the most productive in the whole sample.

Indeed, at this stage we do not yet control for several relevant factors, such as the point during the activity period at which the first feedback is posted, the individual level of productivity, etc. All these variables might be responsible for the trends we just saw in the data, and therefore we need to take them into account.

Regression analysis

All users (MoodBar and non-MoodBar)

We performed a first regression analysis of the edit count in the first 30 days of activity for all users in our sample.[11] The results of the analysis are displayed in Table 3 below.[12] The values in the Effect column indicate the predicted change in edit count associated with the corresponding variable, while keeping all other variables at zero (and thus at AGE = 1 day);[13] the last three rows refer to two-way interactions between AGE and TREATMENT.[14] All effects should then be interpreted as increases or decreases in edit count compared to the edit count of a typical non-MoodBar user (our Reference group).
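As a reading aid: the negative binomial model is fitted on a logarithmic scale (the log link used by glm.nb), so each estimate translates into a multiplicative effect on the predicted edit count by exponentiation. For instance, for the Feedback row of Table 3:

$$\mathrm{Effect} = e^{\hat\beta} = e^{0.8585132} \approx 2.36, \qquad \mathbb{E}[\mathrm{EDITCOUNT}] = \exp\big(\beta_0 + \beta_{\mathrm{AGE}}(\mathrm{AGE}-1) + \beta_{\mathrm{TREATMENT}} + \ldots\big).$$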

 
Figure 3.
Predicted edit count growth in the first 30 days since activation of MoodBar, controlling for the lag between registration and first edit click and for seasonal effects, for a typical active user registered between December 14, 2011 and May 22, 2012. For the fit results, see Table 3.
Table 3: regression analysis results, all users (MoodBar and non-MoodBar)
Row  Variable  Effect  Estimate  Std. Error  z value  Pr(>|z|)
0 intercept 2.12 0.6993402 0.0023719 294.84 <2e-16 ***
1 age - 1 1.017 0.0174006 0.0001033 168.50 <2e-16 ***
2 Feedback 2.36 0.8585132 0.0107617 79.78 <2e-16 ***
3 Feedback+Response 2.16 0.7719526 0.0154476 49.97 <2e-16 ***
4 Feedback+Helpful 4.34 1.4689070 0.0370844 39.61 <2e-16 ***
5 registration_month 0.99 -0.0102577 0.0007234 -14.18 <2e-16 ***
6 activation_lag 0.90 -0.1012595 0.0012240 -82.73 <2e-16 ***
7 (age - 1) AND Feedback 1.017 0.0169336 0.0007743 21.87 <2e-16 ***
8 (age - 1) AND Feedback+Response 1.012 0.0122573 0.0011115 11.03 <2e-16 ***
9 (age - 1) AND Feedback+Helpful 1.027 0.0264749 0.0026824 9.87 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These are the main results of the analysis (links to the corresponding rows in Table 3 are provided in the text):

  • The first row gives the baseline value of the Reference group at AGE = 1 day, which is equal to 2.12 edits [row 0]; after the first day, the edit count grows at a rate of only 1.7% per day [row 1];
  • Rows 2 to 4 refer to the effects of MoodBar usage. MoodBar usage is associated with a higher initial edit count (all coefficients are significant at p < 0.001):
    • Feedback: 2.36 times the Reference edit count [row 2];
    • Feedback+Response: 2.16 times the Reference edit count [row 3];
    • Feedback+Helpful: 4.34 times the Reference edit count [row 4];
  • Rows 5 and 6 are the control variables. Their effects are both small and negative.
  • Rows 7 to 9 refer to the interaction effects. All coefficients are positive ([row 7], [row 8], [row 9]), which means that, for each day after the first, each MoodBar treatment group also shows a small additional increase in the edit count growth rate over time (+1.7%, +1.2%, and +2.7% per day, respectively).

Figure 3 shows the growth trajectories of the edit count predicted by the model, for a typical user in each treatment group. The trajectories are obtained by setting the control variables (REGISTRATION_MONTH and ACTIVATION_LAG) to their average values, and letting AGE and TREATMENT vary.
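Trajectories of this kind can be computed with R's predict function on a grid of ages and treatments, holding the controls at their sample means. This is only a sketch, reusing the hypothetical fit.all model object and allusers data frame introduced in the earlier sketch.

Figure 3 - R sketch (predicted trajectories)
# Grid of ages and treatments, with controls held at their sample means.
newdata <- expand.grid(
  age = c(1, 2, 5, 10, 30),
  treatment = c("Reference", "Feedback", "Feedback+Response", "Feedback+Helpful"),
  registration_month = mean(allusers$registration_month),
  activation_lag = mean(allusers$activation_lag)
)

# Predicted edit count on the response scale (expected counts per user).
newdata$editcount <- predict(fit.all, newdata = newdata, type = "response")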

MoodBar users only

We performed a second regression analysis applied to MoodBar users only.[15] The results are shown in Table 4 below.[16] As this sample only includes MoodBar users, and we included an additional categorical variable (mood), the effects reported in this table should be interpreted as increases (or decreases) relative to the edit count of a typical MoodBar user who sends a “happy” feedback but does not receive any response (Feedback group, “happy” mood).

Table 4: regression analysis results, MoodBar users only
Row  Variable  Effect  Estimate  Std. Error  z value  Pr(>|z|)
0 intercept 1.827 0.6032 0.01976 30.533 < 2e-16 ***
1 age - 1 1.020 0.023 0.0007215 31.87 < 2e-16 ***
2 Feedback+Response 0.94 -0.05916 0.02279 -2.596 0.009445 **
3 Feedback+Helpful 1.41 0.3477 0.0468 7.780 7.23e-15 ***
4 confused 0.96 -0.037435 0.018827 -1.988 0.046767 *
5 sad 1.037 0.03662 0.03065 1.195 0.232159
6 registration_month 0.96 -0.04290 0.005857 -7.324 2.40e-13 ***
7 feedback_lag 1.00 -1.342e-07 3.264e-09 -41.116 < 2e-16 ***
8 activation_lag 1.00 -9.962e-08 8.056e-09 -12.366 < 2e-16 ***
9 feedback_editcount 1.043 0.04274 1.186e-04 360.452 < 2e-16 ***
10 text_length 1.0047 0.004706 1.530e-04 30.764 < 2e-16 ***
11 num_feedbacks 1.421 0.3517 0.008450 41.616 < 2e-16 ***
12 is_editing 0.57384 -0.5554 0.01775 -31.290 < 2e-16 ***
13 Feedback+Helpful AND sad 0.84 -0.1705 0.07814 -2.181 0.029154 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These are the main results:

  • With respect to MoodBar users who send feedback but do not receive any response (Feedback), MoodBar users whose response is not marked as helpful (Feedback+Response, [row 2]) are 6% less productive (Effect = 0.94),[4][5] while users who mark it as helpful are 41% more productive (Effect = 1.41, [row 3]). If we compare these figures with the effects from the previous table (from Table 3: Feedback+Response vs Feedback, 2.16/2.36 ≈ 0.92, an 8% decrease; Feedback+Helpful vs Feedback, 4.34/2.36 ≈ 1.84, an 84% increase; see [row 2], [row 3], and [row 4]), we see that accounting for the extra factors deflates the effect of the MoodBar treatment somewhat.
  • Regarding the reported mood:
    • There is mild evidence (p ≈ 0.047) that users who report a “confused” mood are only 4% less productive than users who report a “happy” mood [row 4]. There is no evidence that in the first 30 days users reporting a “sad” mood have a different productivity from users who report a “happy” mood [row 5].
    • There is mild evidence (p ≈ 0.029) that users who report a “sad” mood and mark the response as “helpful” (Feedback+Helpful, [row 13]) have, on average, an edit count 16% lower than users in the same group who report a different mood.
  • Finally, regarding the control variables:
    • The lag between registration and activation of MoodBar (first click on “Edit”; activation_lag, [row 8]) and the lag between activation of MoodBar and the moment the feedback was sent (feedback_lag, [row 7]) have practically no effect on the edit count. On the other hand, each edit already performed at the time the feedback was sent (feedback_editcount, [row 9]) is associated with a roughly 4% increase in edit count.[17]
    • Regardless of the reported mood and of the outcome of the interaction via MoodBar, users who send feedback while editing (is_editing, [row 12]) tend to be 43% less productive than those who don't (p < 0.001).

Figure 4 shows the typical edit count growth curves for the different treatment groups, broken down by mood type. As for the previous figure, control variables are set to their average values.

 
Figure 4.
Predicted edit count growth in the first 30 days since activation of MoodBar for a typical MoodBar user registered between December 14, 2011 and May 22, 2012.

Summary

The present analysis of the activity level of new users in their first 30 days shows that MoodBar users differ significantly from the average active user who never used MoodBar. Moreover, within the group of MoodBar users, there is evidence that the type of treatment a user receives (produced by MoodBar, the Feedback dashboard, and the Mark-as-helpful tool) is associated with different levels of activity within the first 30 days. The reported mood, on the other hand, does not seem to be strongly associated with overall different levels of activity.

In particular, regarding our research questions:

  • Research Question 1: MoodBar users are initially (day 1) more productive (between 2.2 and 4.3 times the edit count of the Reference group, depending on the treatment group), and their edit count grows at slightly faster rates, than the average non-MoodBar active user. See Table 3, rows 2 to 4.
  • Research Question 2 (a): there is mild evidence that receiving a response that is not marked as “helpful” is associated with a small decrease in edit count (-6%) with respect to the average MoodBar user who did not receive any response. We speculate that this might be because not all users who received a response were actually able to see it.[4][5] See Table 4, row 2.
  • Research Question 2 (b): MoodBar users who received a response and marked it as helpful have an even higher initial edit count than the average active non-MoodBar user (4.34 times the Reference baseline, see Table 3, row 4), and it grows at a slightly faster rate (an additional 2.7% per day, see Table 3, row 9). After controlling for individual level of productivity, time of feedback, etc. (see the list of factors above), and compared to the average MoodBar user who did not receive any response, the boost is +41%. See Table 4, row 3.

Moreover, there is only mild and circumstantial evidence that the reported mood is associated with a higher or lower edit count. In particular, MoodBar users who receive no response and who report a “confused” mood have a 4% lower edit count than those who report a “happy” mood (Table 4, row 4), while there is no statistical difference between reporting a “happy” and “sad” mood for this group (Table 4, row 5). There is no evidence of mood-based differences in the group of users who receive a response that is not marked as “helpful” (coefficients not reported), while in the group of users who receive a response and mark it as “helpful” there is only mild evidence of a lower edit count (-16%) for those who report a “sad” mood (Table 4, row 13).

Figures 3 and 4 display the predicted trajectories of edit count growth for MoodBar users versus non-MoodBar users and for the three categories of MoodBar users, respectively. The trajectories displayed in both figures take into account all relevant factors discussed above.

Follow-up analysis

We performed a follow-up analysis that improves the methodology presented here in two ways:

  1. Data cleaning using the CentralAuth DB.
  2. Controlling for the presence of an authenticated email address.

Data cleaning with CentralAuth

CentralAuth is a MediaWiki extension specifically implemented for Wikimedia wikis that lets registered users employ the same set of credentials to log in to any Wikimedia project. Each wiki (e.g. commons, enwiki, itwiki, etc.) has its own user database and CentralAuth acts as a bridge between them, enabling users to create a "global" account that can access all Wikimedia projects.

Because of CentralAuth, the data on the user registration timestamp in our original sample were not accurate, which could have influenced the results of the present study. Because we only have access to the enwiki database, we decided to clean our sample by removing all users who did not register their account on enwiki. This removed 45,145 non-local users from our original sample.

We re-ran our regression analysis and found very similar results to the original analysis with just minimal differences between the response coefficients of the regressions.

Controlling for authenticated email

Users with an authenticated email address can receive notifications from MediaWiki, including responses to their MoodBar feedback. Moreover, after sending feedback through MoodBar, users without an authenticated email address are reminded of the possibility of registering an email address and are asked if they want to add one.

Both actions could have an impact on the degree of retention and on the productivity of a user, and thus we decided to add two additional control variables to our regression analysis:

  1. email_auth: whether the user has an authenticated email.
  2. auth_mb_funnel: whether the user authenticated her email address after her first feedback item sent via MoodBar.

The second is a proxy for measuring whether the user registered an email address at the end of the MoodBar funnel.
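A minimal sketch of how these two controls can be added to the MoodBar-users regression and the model refitted; the model object fit.mb and the two new columns are assumed names, not the actual analysis script.

Email controls - R sketch
# Add the two email-related controls to the MoodBar-users model and refit.
fit.mb.email <- update(fit.mb, . ~ . + email_auth + auth_mb_funnel)

summary(fit.mb.email)     # significance of the two new controls
exp(coef(fit.mb.email))   # multiplicative effects, as in Tables 3 and 4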

We re-ran the regression analysis for the group of MoodBar users and found that having an authenticated email (email_auth = 1) was associated with a 24% higher edit count compared to having no authenticated email. On the other hand, registering an email address at the end of the MoodBar funnel did not reach statistical significance.

References

  1. Note that in this report we refer to “users” and not “editors” to include “inactive accounts”, i.e. users that never completed a single transaction after registering an account.
  2. Note that both MoodBar and the ability to measure the number of “active” users were introduced on July 25, 2011. Historical data before that date is not available.
  3. MoodBar activates itself after the user clicks for the first time on the “Edit” button. We consider this event to determine the baseline population as a population of active users, and not that of all registered users, as the majority of these users (up to 60% as of July 2012) never click on the edit button.
  4. a b c d Note that based on the current data we cannot distinguish whether a user did not receive a response or received a response but did not actually see it. It is reasonable to assume that every user returning to Wikipedia after a response is posted will see the response as it's prominently advertised via a sticky notification.
  5. a b c d As above, we cannot distinguish whether a user saw a response and decided not to mark it as helpful or simply didn't see the response. The current design of the Mark as Helpful extension doesn't support marking responses as unhelpful.
  6. Our data consists of event counts, and a common model for count data is the Poisson model. The Poisson model assumes that the variance of the data is equal to its mean (Var[X] = E[X]). Unfortunately our dataset is affected by overdispersion, which violates this assumption and makes the Poisson model a poor fit. Indeed, a goodness-of-fit test based on the deviance statistic confirmed that the Poisson model does not describe our data adequately. A quick assessment of overdispersion can be made by computing the mean edit count aggregated over the editor age and the treatment type, and comparing the mean values with their standard deviations (a sketch of this computation follows the tables below). The following two tables report the results of the aggregation:
                                    Mean
    Account age        1         2         5        10        30 (days)
    Reference   1.825837  1.949301  2.202360  2.505040  3.158209
    Feedback    3.900105  4.521921  5.828468  7.543435 11.723340
    Feed+Resp.  3.700528  4.183718  5.224784  6.373199  9.553554
    Feed+Help.  6.824638  8.085507 11.379710 16.268116 27.746377
    
    
                              Standard deviation
    Account age         1         2         5       10        30 (days)
    Reference     4.946832  5.625729  7.558456 11.13289 24.18149
    Feedback      9.335787 12.445637 19.410766 27.07077 53.46395
    Feed+Resp.    8.470853 10.315052 15.443742 22.46240 58.19352
    Feed+Help.   16.961913 19.088137 26.645785 46.08548 90.68130
    

    Our data is overdispersed. For example, the average edit count at 30 days for users who sent feedback is 11.72 edits, while its standard deviation is 53.46 edits. Therefore, we should employ a count model able to account for overdispersion. A good candidate is the negative binomial model, which is the one we used for the present analysis.
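    The aggregation above can be reproduced, for example, as follows (reusing the hypothetical allusers data frame from the sketches in the main text):

    Overdispersion check - R sketch
    # Mean and standard deviation of the edit count by treatment and account age.
    # Under the Poisson assumption the variance equals the mean, so the standard
    # deviation should be roughly the square root of the mean; the much larger
    # values reported above indicate overdispersion.
    aggregate(editcount ~ treatment + age, data = allusers, FUN = mean)
    aggregate(editcount ~ treatment + age, data = allusers, FUN = sd)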

  7. This is a control variable for seasonal and long-term trends in editor activity.
  8. Note that the outcome may happen at any time within the whole period of activity of the user, not just within the first 30 days. In practice, our previous analyses of the time to first feedback and of the response time show that both events happen within the very first days after the activation of MoodBar, so this does not constitute a problem for the present analysis.
  9. In particular we use the glm.nb routine from R's MASS package, which fits our model via iteratively reweighted least squares and separately estimates the θ (dispersion) parameter of the negative binomial model.
  10. To perform the smoothing we use a local regression method.
  11. The regression model for the edit count of an active user, in the formula notation of R, is the following: editcount ~ I(age - 1) * treatment + registration_month + activation_lag. During the course of the study we tested, for both samples, several different regression models, starting from a base one, which includes only the predictors of interest (age, treatment, and, for MoodBar users only, mood), and then adding controls and/or interaction effects. In this section we show only the results of the “final” models -- those that give the best description of both datasets, given the set of predictors at our disposal.
  12. The fit is good: the ratio between the residual deviance and the number of degrees of freedom is very close to 1, which indicates that the model accounts very reasonably for the over-dispersion in the data.
  13. Note that we use AGE - 1 instead of AGE (see [row 1]) so that the estimates refer to the average edit count on the first day, and not to AGE = 0 days.
  14. They indicate the extra change in the edit count that occurs when both variables change simultaneously. An example of this is a MoodBar user in the Feedback group (TREATMENT implicitly changes from Reference to Feedback) on all days after the first (AGE changes to 2, 3, and so on).
  15. The model for the sample of MoodBar users only extends the previous one by adding the MOOD factor, its interaction with TREATMENT, and CONTROLS, the list of all control variables available for this sub-sample (see Table 1).
  16. For simplicity we omit from Table 4 those variables which did not attain statistical significance (os and browser and all interaction effects except one).
  17. According to ANOVA (not shown in the report), this variable is also the most powerful predictor (in terms of residual deviance) in the whole list of predictors.