Slightly Modified Evaluation Metric for Phase 3

While analysing the results of phase 2, we noticed that a single large outlier (i.e. a run where an otherwise well-performing policy failed) can have a significant impact on the resulting mean score.

To make the evaluation more robust against such single outliers, we are changing it slightly for phase 3: instead of the mean, we now use the median to aggregate the scores of the runs of the same level.
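To illustrate the effect, here is a minimal sketch comparing the two aggregates. The scores are made up for illustration (they are not actual challenge results), and the variable names are hypothetical rather than taken from the evaluation code.

```python
from statistics import mean, median

# Hypothetical scores of one policy on five runs of the same level.
# The last run is a single failure, i.e. one large outlier.
run_scores = [-100.0, -110.0, -90.0, -105.0, -2000.0]

# Aggregate the runs of this level in both ways for comparison.
mean_score = mean(run_scores)      # pulled strongly towards the outlier
median_score = median(run_scores)  # stays close to the typical run

print(f"mean:   {mean_score}")
print(f"median: {median_score}")
```

With these made-up numbers the mean is -481.0 while the median is -105.0, so the single failed run dominates the mean but barely affects the median.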

Note that this change would not have altered the final ranking of phase 2, as all participants are affected to a similar degree. We hope, however, that it leads to more consistent results when comparing different weeks.

The documentation has been updated accordingly.

