The Friedman test is a non-parametric method for determining whether there are significant differences between multiple ($k$) classification algorithms evaluated over multiple ($N$) datasets. The first step of our analysis is a hypothesis test to determine whether there is a significant difference in the performance of the classifiers. The null hypothesis states that there is no difference between the classifiers, while the alternative hypothesis is that there is. To test these hypotheses, the Friedman test ranks the algorithms on each dataset and then compares the average ranks, as detailed below:
Rank each algorithm by its performance on each dataset. In case of ties, average ranks are assigned. Example: if on a given dataset four classifiers $c_1$, $c_2$, $c_3$ and $c_4$ obtain accuracies $0.90$, $0.85$, $0.85$ and $0.70$, we can see that $c_2$ and $c_3$ are tied, $c_1$ has the highest accuracy and $c_4$ has the lowest, so the rankings would be: $1$, $2.5$, $2.5$ and $4$.
Calculate the average rank of each method over all datasets:

$$R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j,$$

where $r_i^j$ is the rank of the $j$-th algorithm on the $i$-th dataset.
Calculate the Friedman statistic:

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right]$$
For large values of $N$ and $k$ (as a rule of thumb, $N > 10$ and $k > 5$), the Friedman statistic follows the $\chi^2$ distribution with $k-1$ degrees of freedom. Hence we need to look up the critical value of $\chi^2_{k-1}$ for a chosen significance level $\alpha$ (typically $\alpha = 0.05$) in order to decide whether or not we can reject the null hypothesis. A $\chi^2$ distribution table can be found here. When the conditions on $N$ and $k$ are not met, a table of exact critical values for the Friedman statistic should be used instead [1].
Note: Iman and Davenport [2] showed that Friedman's $\chi_F^2$ tends to be conservative (i.e., it may have a high likelihood of a type II error) and thus derived the following statistic:

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2},$$
which follows the $F$-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom.
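For readers who prefer code to formulas, below is a minimal sketch of how these quantities could be computed with NumPy and SciPy, assuming an accuracy matrix `acc` with one row per dataset and one column per classifier (the function name `friedman_test` and the return format are our own choices; SciPy also ships `scipy.stats.friedmanchisquare`, which applies an additional tie correction and may therefore give slightly different values when ties occur):

```python
import numpy as np
from scipy.stats import rankdata, chi2, f


def friedman_test(acc, alpha=0.05):
    """Friedman test on an (N datasets x k classifiers) accuracy matrix.

    Higher accuracy is better, so ranks are computed on the negated
    accuracies (rank 1 = best classifier); ties get average ranks.
    """
    N, k = acc.shape

    # Step 1: rank the classifiers on each dataset (row by row)
    ranks = np.array([rankdata(-row) for row in acc])

    # Step 2: average rank R_j of each classifier over all datasets
    avg_ranks = ranks.mean(axis=0)

    # Step 3: Friedman statistic
    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)

    # Iman-Davenport correction
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

    # Critical values at significance level alpha
    chi2_crit = chi2.ppf(1 - alpha, k - 1)
    f_crit = f.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))

    return {
        "avg_ranks": avg_ranks,
        "chi2_F": chi2_F,
        "reject_chi2": chi2_F > chi2_crit,
        "F_F": F_F,
        "reject_F": F_F > f_crit,
    }
```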
Let's analyse a numerical example, using a subset of the classification accuracies of multiple classifiers on the UCR datasets, provided by the UEA & UCR Time Series Classification Repository [3]. For convenience, we will select only 5 classifiers from the table (ts-chief, rocket, boss, weasel, catch22) and 12 datasets (the rows of the table below).
Dataset | ts-chief | rocket | boss | weasel | catch22 |
---|---|---|---|---|---|
Beef | 0.632 | 0.760 | 0.612 | 0.740 | 0.473 |
BME | 0.996 | 0.997 | 0.866 | 0.948 | 0.905 |
Car | 0.879 | 0.912 | 0.848 | 0.834 | 0.746 |
CBF | 0.998 | 0.996 | 0.999 | 0.980 | 0.954 |
Crop | 0.762 | 0.752 | 0.686 | 0.724 | 0.653 |
Fish | 0.982 | 0.974 | 0.970 | 0.951 | 0.773 |
Ham | 0.805 | 0.855 | 0.837 | 0.821 | 0.694 |
Meat | 0.984 | 0.989 | 0.981 | 0.977 | 0.943 |
Rock | 0.832 | 0.805 | 0.803 | 0.855 | 0.705 |
UMD | 0.983 | 0.983 | 0.966 | 0.932 | 0.869 |
Wine | 0.898 | 0.914 | 0.893 | 0.930 | 0.700 |
Yoga | 0.873 | 0.914 | 0.910 | 0.892 | 0.804 |
The null hypothesis is that the accuracy is the same for all five classifiers and the alternative is that it is not:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_5 \qquad H_1: \text{at least one } \mu_j \text{ differs from the others},$$

where $\mu_j$ denotes the mean accuracy of the $j$-th classifier.
Note that the alternative hypothesis is not that all classifiers have different accuracies: that would require all five means to differ from one another before the null hypothesis could be rejected, whereas here it suffices that at least one classifier differs.
Step 1: We rank each algorithm on each dataset:
Dataset | ts-chief | rocket | boss | weasel | catch22 |
---|---|---|---|---|---|
Beef | 3 | 1 | 4 | 2 | 5 |
BME | 2 | 1 | 5 | 3 | 4 |
Car | 2 | 1 | 3 | 4 | 5 |
CBF | 2 | 3 | 1 | 4 | 5 |
Crop | 1 | 2 | 4 | 3 | 5 |
Fish | 1 | 2 | 3 | 4 | 5 |
Ham | 4 | 1 | 2 | 3 | 5 |
Meat | 2 | 1 | 3 | 4 | 5 |
Rock | 2 | 3 | 4 | 1 | 5 |
UMD | 1 | 2 | 3 | 4 | 5 |
Wine | 3 | 2 | 4 | 1 | 5 |
Yoga | 4 | 1 | 2 | 3 | 5 |
In the Beef dataset, ROCKET has the highest accuracy (0.760), so it is ranked $1$, while Catch22 has the lowest accuracy (0.473) and is hence ranked $5$.
Step 2: We can now calculate the average rank of each method, which is simply the mean of each column of the table above.
Classifier | Avg rank |
---|---|
ts-chief | 2.2 |
rocket | 1.7 |
boss | 3.2 |
weasel | 3.0 |
catch22 | 4.9 |
We can see that TS-CHIEF and ROCKET have the lowest average ranks, so they are the best performing classifiers, while Catch22 has the highest average rank (close to 5), which indicates that it is consistently the worst performing classifier.
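As a sanity check, Steps 1 and 2 can be reproduced in a few lines of NumPy/SciPy. This is just a sketch using the rounded accuracies from the table above; note that at three decimal places ts-chief and rocket are tied on UMD, so `rankdata` assigns both a rank of 1.5 there, which can shift their average ranks slightly compared to the table:

```python
import numpy as np
from scipy.stats import rankdata

classifiers = ["ts-chief", "rocket", "boss", "weasel", "catch22"]
datasets = ["Beef", "BME", "Car", "CBF", "Crop", "Fish",
            "Ham", "Meat", "Rock", "UMD", "Wine", "Yoga"]

# Accuracies from the table above (rows = datasets, columns = classifiers)
acc = np.array([
    [0.632, 0.760, 0.612, 0.740, 0.473],
    [0.996, 0.997, 0.866, 0.948, 0.905],
    [0.879, 0.912, 0.848, 0.834, 0.746],
    [0.998, 0.996, 0.999, 0.980, 0.954],
    [0.762, 0.752, 0.686, 0.724, 0.653],
    [0.982, 0.974, 0.970, 0.951, 0.773],
    [0.805, 0.855, 0.837, 0.821, 0.694],
    [0.984, 0.989, 0.981, 0.977, 0.943],
    [0.832, 0.805, 0.803, 0.855, 0.705],
    [0.983, 0.983, 0.966, 0.932, 0.869],
    [0.898, 0.914, 0.893, 0.930, 0.700],
    [0.873, 0.914, 0.910, 0.892, 0.804],
])

# Step 1: rank the classifiers on each dataset (rank 1 = highest accuracy)
ranks = np.array([rankdata(-row) for row in acc])

# Step 2: average rank of each classifier over all datasets
avg_ranks = ranks.mean(axis=0)
for name, r in zip(classifiers, avg_ranks):
    print(f"{name}: {r:.2f}")
```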
Step 3: Compute the Friedman statistic to decide whether the classifiers are significantly different. We have 12 datasets and 5 classifiers, thus $N = 12$ and $k = 5$:

$$\chi_F^2 = \frac{12 \cdot 12}{5(5+1)}\left[\sum_{j=1}^{5} R_j^2 - \frac{5(5+1)^2}{4}\right] = 4.8 \times (51.04 - 45) \approx 29.0,$$

where the $R_j$ are the (unrounded) average ranks from Step 2.
Step 4: The critical value of the $\chi^2$ distribution with $k-1 = 4$ degrees of freedom at $\alpha = 0.05$ is 9.488. Since $29.0 > 9.488$, we can reject the null hypothesis (Reject $H_0$).
Finally, we can also compute the correction by Iman and Davenport:

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} = \frac{11 \times 29.0}{12 \times 4 - 29.0} \approx 16.8$$
The critical value of the $F$-distribution with $k-1 = 4$ and $(k-1)(N-1) = 44$ degrees of freedom at $\alpha = 0.05$ is approximately 2.58. Since $16.8 > 2.58$, we again reject $H_0$, consistent with the result of the Friedman statistic.
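The same conclusions can be reached directly with SciPy: `scipy.stats.friedmanchisquare` runs the test given one sequence of accuracies per classifier (it applies a tie correction, so its statistic can differ slightly from the hand computation above), and the critical values can be looked up with `scipy.stats.chi2` and `scipy.stats.f` instead of a table. A short sketch, reusing the `acc` matrix defined earlier:

```python
from scipy.stats import friedmanchisquare, chi2, f

N, k = acc.shape  # 12 datasets, 5 classifiers

# SciPy's Friedman test: one argument per classifier (one column of the table)
stat, p_value = friedmanchisquare(*acc.T)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p_value:.4g}")

# Critical values looked up instead of read from a table
chi2_crit = chi2.ppf(0.95, k - 1)               # ~9.49 for 4 degrees of freedom
f_crit = f.ppf(0.95, k - 1, (k - 1) * (N - 1))  # ~2.58 for (4, 44) degrees of freedom
print(f"chi2 critical value = {chi2_crit:.3f}, F critical value = {f_crit:.3f}")
```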
After applying the Friedman test, if the null hypothesis is rejected, we can proceed with a post-hoc test to establish where the significant differences between the algorithms lie. We will cover possible post-hoc tests in future blog posts.
[1] Jerrold H. Zar. Biostatistical Analysis. 5th ed. Prentice-Hall/Pearson, 2010.
[2] Ronald L. Iman and James M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.
[3] Anthony Bagnall, Jason Lines, William Vickers and Eamonn Keogh. The UEA & UCR Time Series Classification Repository, www.timeseriesclassification.com.