Research:Post-edit feedback/PEF-2



This page documents the results of the second iteration of the Post-edit feedback experiment. The goal of the experiment was to determine whether receiving feedback had any significant desirable or undesirable effect on the volume and quality of contributions by new registered users, compared to the control group.

We measured the effects on volume by analyzing the number of edits, contribution size and time to threshold for participants in each experimental condition; we measured the impact of the experiment on quality by looking at the rate of reverts and blocks in each experimental condition.

Prior to performing the analysis we generated a clean dataset from the entire population of participants in the experiment to filter out known outliers and focus on genuinely new registered users.

Unless otherwise noted, all analyses refer to a 2-week interval starting at registration time, which adds a supplementary week after the 1-week treatment period. We report where significant differences emerge between the in-treatment and post-treatment periods.

Research questions

  • RQ1. Does receiving feedback increase the number of edits?
  • RQ2. Does feedback lead to larger contributions?
  • RQ3. Does receiving feedback shorten the time to the second contribution?
  • RQ4. Does feedback affect the rate at which newcomers are blocked?
  • RQ5. Does feedback affect the success rate of newcomers?

Sample groups

Treatment  | 1 edit | 5 edits | 10 edits | 25 edits | 50 edits | 100 edits
Control    | 3535   | 843     | 389      | 118      | 42       | 13
Historical | 3607   | 853     | 371      | 99       | 33       | 11

Edit volume - Edit Count & Bytes Added

The following analysis addresses the question:

RQ1. Does receiving feedback increase the number of edits?

Edit count is the most direct measure of editor activity. For each experimental condition, we measured the total number of edits made by new editors in the first 14 days since registration. New users did not receive the treatment message until after completing their first edit; therefore, the first contribution of each editor included in the experiment was omitted.
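The counting rule described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function name `edit_count_after_first`, the input layout (a registration time plus a list of edit timestamps) and the sample dates are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def edit_count_after_first(registration, edit_times, window_days=14):
    """Count edits saved within `window_days` of registration, omitting the first."""
    cutoff = registration + timedelta(days=window_days)
    in_window = sorted(t for t in edit_times if registration <= t < cutoff)
    # The first contribution precedes any feedback message, so it is not counted.
    return max(len(in_window) - 1, 0)


# Hypothetical editor: four edits, one of them outside the 14-day window.
registered = datetime(2000, 1, 1, 12, 0)
edits = [
    registered + timedelta(hours=1),
    registered + timedelta(days=2),
    registered + timedelta(days=13),
    registered + timedelta(days=20),
]
print(edit_count_after_first(registered, edits))
```

Editors with no edits in the window simply get a count of zero in this sketch.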

RQ2: Does feedback lead to larger contributions?

The bytes added are computed in the following ways for each editor:

  • Net - the net sum of bytes added or removed
  • Positive - the sum of bytes added

Below are the mean bytes changed, normalized by edit count, for each group. We applied a logarithmic transformation to bytes changed in order to work with normally distributed data. To take the logarithm of the "Net" byte counts, samples with a negative net value were omitted (about 15% of total samples).

The total bytes added by each editor were normalized by that editor's edit count. For example, if an editor made five edits contributing 100, 200, 300, 50, and 50 bytes, the sample for this editor would be (100+200+300+50+50)/5 = 140.

Furthermore, the byte count data were verified to be log-normal under the Shapiro–Wilk test for each treatment and bytes-added metric and, given rejection of the null hypothesis (alpha=0.05), t-tests were performed over the transformed data sets.
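As a rough illustration of the preparation steps just described (per-editor normalization, dropping samples that cannot be log-transformed, and comparing the transformed groups), consider the sketch below. The helper names and the toy data are hypothetical, and the t statistic shown is Welch's unequal-variance form, which is an assumption since the page does not say which variant was used. The normality check is left out; a library routine such as `scipy.stats.shapiro` could be applied to the transformed samples.

```python
import math
from statistics import mean, variance


def bytes_per_edit(byte_changes):
    """Normalize one editor's byte changes by that editor's edit count."""
    return sum(byte_changes) / len(byte_changes)


def log_samples(editors):
    """Log-transform per-editor samples, dropping non-positive values."""
    per_edit = [bytes_per_edit(changes) for changes in editors if changes]
    return [math.log(value) for value in per_edit if value > 0]


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Toy data: each inner list holds the byte changes of one hypothetical editor.
control = [[120, 80, 40], [20, 40], [-30, 10], [500]]
historical = [[60, 60], [10, 30, 20], [900], [15]]

print(bytes_per_edit(control[0]))   # per-edit sample for the first editor
print(len(log_samples(control)))    # the editor with a negative net is dropped
print(welch_t(log_samples(control), log_samples(historical)))
```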

Finally, the sample group was sub-sampled based on the milestones reached and analysis was executed separately for each of these groups.

At least one edit:

Template:Hidden

Template:Hidden

At least five edits:

Template:Hidden

Template:Hidden

At least ten edits:

Template:Hidden

Template:Hidden

At least twenty-five edits:

Template:Hidden

Template:Hidden

At least fifty edits:

Template:Hidden

Template:Hidden

At least one hundred edits:

Template:Hidden

Template:Hidden

There were no significant differences in bytes added per edit when considering positive and negative bytes added, nor was there a significant result for edit counts. However, the control group had consistently larger edit counts. Given more data, the observed effect may prove to be significant.

Net Bytes Added per treatment. Box plots of PEF-1 treatments for net bytes added for bucketed users. Log transformed.
Positive Bytes Added per treatment. Box plots of PEF-1 treatments for positive bytes added for bucketed users. Log transformed.

Time to threshold

We measured the time to threshold as the number of minutes between historical milestones. Only editors that reached successive milestones were included in this analysis. Using this metric helps us address the following:

RQ3. Does receiving milestone feedback shorten the time to reach the next milestone?
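A minimal sketch of the time-to-threshold measure is given below, assuming each editor is represented by a list of edit timestamps and that the milestones are the 1st, 5th, 10th, 25th, 50th and 100th edits analyzed in this section. The function name `time_to_threshold` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

MILESTONES = [1, 5, 10, 25, 50, 100]


def time_to_threshold(edit_times):
    """Minutes between successive milestone edits reached by one editor."""
    times = sorted(edit_times)
    result = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        if len(times) < end:
            break  # only editors who reached the next milestone are included
        delta = times[end - 1] - times[start - 1]
        result[(start, end)] = delta.total_seconds() / 60
    return result


# Hypothetical editor who saved 12 edits, one every 30 minutes.
first_edit = datetime(2000, 1, 1)
edits = [first_edit + timedelta(minutes=30 * i) for i in range(12)]
print(time_to_threshold(edits))
```

Because this editor never reached a 25th edit, only the first two intervals appear in the result, mirroring the rule that an editor contributes to a milestone only after reaching it.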

Milestone: 1st Edit - 5th Edit

Template:Hidden

Milestone: 5th Edit - 10th Edit

Template:Hidden

Milestone: 10th Edit - 25th Edit

Template:Hidden

Milestone: 25th Edit - 50th Edit

Template:Hidden

Milestone: 50th Edit - 100th Edit

Template:Hidden


The table below contains the mean values and sample sizes of time-to-threshold for each milestone event:

Table 6. Time to threshold parameters and sample size per treatment

Treatment                 | 1st - 5th Edit | 5th - 10th Edit | 10th - 25th Edit | 25th - 50th Edit | 50th - 100th Edit *
Control Sample            | 1328           | 682             | 241              | 78               | 30
Historical Sample         | 1303           | 647             | 203              | 67               | 17
Control Mean (minutes)    | 428.63         | 450.48          | 694.44           | 764.54           | 631.8
Historical Mean (minutes) | 404.03         | 411.34          | 668.70           | 697.33           | 976.65

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The only milestone event that came close to a significant result was the 50th-100th edit event. However, the result is at best marginally significant and the sample sizes are relatively small. This may motivate investigating more targeted responses among more active early editors. Finally, it should be noted that for this milestone the time to threshold was shorter for the control treatment not receiving any feedback.

First to second edit time distribution for Control treatment.
First to second edit time distribution for Historical treatment.
First to fifth edit time distribution for Control treatment.
First to fifth edit time distribution for Historical treatment.

Quality

While feedback may increase the volume of newcomer edits, it might do so at the cost of decreased quality. This is a concern, since increasing the number of edits that need to be reverted is counterproductive. Similarly, increasing the number of newcomers who eventually need to be blocked would increase the burden on en:WP:AIV.

To explore whether the changes in the volume of newcomer work were coming at the cost of decreased quality, we examined the work that newcomers performed in their first two weeks of editing. We identified two aspects of newcomers and the work that they perform: the proportion of newcomers who were eventually blocked from editing and the rate at which newcomers' contributions were rejected (reverted or deleted). We used these metrics to answer the following questions:

RQ4. Does feedback affect the rate at which newcomers are blocked?
RQ5. Does feedback affect the success rate of newcomers?

Block rate

To determine which newcomers were blocked, we processed the logging table of the enwiki database to look for block events for newcomers in the experimental conditions. We considered a newcomer blocked if there was any event for them with log_type="block" AND log_action="block" between the beginning of the experimental period and midnight GMT Sept. 5th. The "Blocked newcomers" plot shows the proportion of newcomers who were blocked in each experimental condition. As the plot suggests, the difference in proportions is insignificant (around 0.072), which suggests that the experimental treatment had no meaningful effect on the rate at which newcomers were blocked from editing.
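The block criterion can be illustrated with a short sketch over rows already read from the logging table. The log_type and log_action conditions come from the text above; everything else is an assumption for illustration: the function name `blocked_users`, the use of log_title for the blocked account and log_timestamp (parsed into a datetime) for the event time, and the sample rows and dates.

```python
from datetime import datetime


def blocked_users(log_rows, newcomers, start, end):
    """Return the newcomers with at least one block event in [start, end)."""
    newcomers = set(newcomers)
    return {
        row["log_title"]
        for row in log_rows
        if row["log_type"] == "block"
        and row["log_action"] == "block"
        and row["log_title"] in newcomers
        and start <= row["log_timestamp"] < end
    }


# Hypothetical log rows and observation window, for illustration only.
rows = [
    {"log_type": "block", "log_action": "block",
     "log_title": "NewcomerA", "log_timestamp": datetime(2000, 1, 3)},
    {"log_type": "block", "log_action": "unblock",
     "log_title": "NewcomerB", "log_timestamp": datetime(2000, 1, 4)},
    {"log_type": "delete", "log_action": "delete",
     "log_title": "NewcomerC", "log_timestamp": datetime(2000, 1, 5)},
    {"log_type": "block", "log_action": "block",
     "log_title": "NewcomerD", "log_timestamp": datetime(2000, 2, 1)},
]
group = ["NewcomerA", "NewcomerB", "NewcomerC", "NewcomerD"]

blocked = blocked_users(rows, group, datetime(2000, 1, 1), datetime(2000, 1, 15))
block_rate = len(blocked) / len(group)
print(blocked, block_rate)
```

Unblock events, other log types, and blocks outside the observation window are ignored, so only one of the four hypothetical newcomers counts as blocked here.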

To make sure that this result wasn't due to blocks of editors who hadn't earned them through a series of bad-faith edits, we examined the relationship between the number of edits these newcomers saved and the proportion of them that were blocked. Blocked newcomers by revisions shows a steady increase in the proportion of newcomers that were blocked between 1 and 4 revisions. This seems likely due to the 4 levels of warnings that are used on the English Wikipedia.

Template:Hidden

The observed block rates for the two groups were extremely small: control = 0.099% and historical = 0.084%. Although the block rate for the historical group was smaller, the result was not significant.

Success rate

We examined the en:SHA1 checksum associated with the content of revisions to determine which revisions were reverted (see Research:Revert detection) by other editors. By comparing the number of revisions saved with the number of revisions reverted, we can build a proportion of reverted revisions (see Research:Metrics/revert_rate) and the success rate (the proportion of revisions saved by an editor that were not reverted). We use the success rate of an editor as a proxy for the quality of their work and a direct measure of the additional work their activities necessitate from Wikipedians.
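The idea of checksum-based revert detection and the resulting success rate can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure documented at Research:Revert detection: a revision is treated as a revert when its checksum matches an earlier revision of the same page within a look-back window (the `radius` of 15 revisions is an assumed value), the revisions in between are marked as reverted unless they were saved by the reverting user, and the names `reverted_revisions` and `success_rate` are hypothetical.

```python
def reverted_revisions(history, radius=15):
    """Find reverted revisions in one page history.

    history: list of (rev_id, user, sha1) tuples in chronological order.
    """
    reverted = set()
    for j, (_, reverter, checksum) in enumerate(history):
        # Look back for the most recent earlier revision with identical content.
        for i in range(j - 1, max(j - radius, 0) - 1, -1):
            if history[i][2] == checksum:
                # Everything saved between the two identical states was undone.
                for rev_id, user, _ in history[i + 1:j]:
                    if user != reverter:  # ignore self-reverts
                        reverted.add(rev_id)
                break
    return reverted


def success_rate(history, user, reverted):
    """Proportion of a user's saved revisions that were not reverted."""
    saved = [rev_id for rev_id, author, _ in history if author == user]
    if not saved:
        return None
    return 1 - sum(rev_id in reverted for rev_id in saved) / len(saved)


# Hypothetical page history: revision 3 restores the content of revision 1.
history = [
    (1, "Alice", "aaa"),
    (2, "Newbie", "bbb"),
    (3, "Alice", "aaa"),
    (4, "Newbie", "ccc"),
]
reverted = reverted_revisions(history)
print(reverted, success_rate(history, "Newbie", reverted))
```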

To look for evidence of a causal relationship between PEF and the quality of newcomer work, we calculated the mean success rate for newcomers in each experimental condition.

Template:Hidden

The revert rate for the control group was actually lower; however, the result is not significant.