The Friedman test is a non-parametric method for determining whether there are significant differences between multiple ($k$) classification algorithms evaluated over multiple ($N$) datasets. The first step of our analysis is a hypothesis test to determine whether there is a significant difference in the performance of the classifiers. The null hypothesis states that there is no difference between the classifiers, while the alternative hypothesis is that there is. To test these hypotheses, the Friedman test ranks the algorithms on each dataset and then compares the average ranks, as detailed below:
Rank each algorithm by its performance on each dataset. In case of ties, average ranks are assigned. Example: if on a given dataset four classifiers $c_1$, $c_2$, $c_3$ and $c_4$ obtain accuracies $0.90$, $0.85$, $0.85$ and $0.70$, we can see that $c_2$ and $c_3$ are tied, $c_1$ has the highest accuracy and $c_4$ has the lowest, so the rankings would be: $1$, $2.5$, $2.5$ and $4$.
Calculate the average rank of each method over all datasets:

$$R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j,$$

where $r_i^j$ is the rank of the $j$-th algorithm on the $i$-th dataset.
Calculate the Friedman statistic:

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right]$$
For large values of $N$ and $k$ (as a rule of thumb, $N > 10$ and $k > 5$), the Friedman statistic follows the $\chi^2$ distribution with $k-1$ degrees of freedom. Hence we need to look up the critical value of $\chi^2_{k-1}$ for a chosen significance level $\alpha$ (typically $\alpha = 0.05$) in order to decide whether or not we can reject the null hypothesis. A $\chi^2$ distribution table can be found here. When the conditions on $N$ and $k$ are not met, a table of exact critical values for the Friedman statistic should be used instead [1].
Note: Iman and Davenport [2] showed that Friedman's $\chi_F^2$ tends to be conservative (i.e., it may have a high likelihood of a type II error) and thus derived the following statistic:

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2},$$
which follows the $F$-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom.
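For readers who prefer code to formulas, below is a minimal sketch of how these quantities could be computed with NumPy and SciPy, assuming an accuracy matrix `acc` with one row per dataset and one column per classifier (the function name `friedman_test` and the return format are our own choices; SciPy also ships `scipy.stats.friedmanchisquare`, which applies an additional tie correction and may therefore give slightly different values when ties occur):

```python
import numpy as np
from scipy.stats import rankdata, chi2, f


def friedman_test(acc, alpha=0.05):
    """Friedman test on an (N datasets x k classifiers) accuracy matrix.

    Higher accuracy is better, so ranks are computed on the negated
    accuracies (rank 1 = best classifier); ties get average ranks.
    """
    N, k = acc.shape

    # Step 1: rank the classifiers on each dataset (row by row)
    ranks = np.array([rankdata(-row) for row in acc])

    # Step 2: average rank R_j of each classifier over all datasets
    avg_ranks = ranks.mean(axis=0)

    # Step 3: Friedman statistic
    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)

    # Iman-Davenport correction
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

    # Critical values at significance level alpha
    chi2_crit = chi2.ppf(1 - alpha, k - 1)
    f_crit = f.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))

    return {
        "avg_ranks": avg_ranks,
        "chi2_F": chi2_F,
        "reject_chi2": chi2_F > chi2_crit,
        "F_F": F_F,
        "reject_F": F_F > f_crit,
    }
```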
Let's analyse a numerical example, using a subset of the classification accuracies of multiple classifiers on the UCR datasets, provided by the UEA & UCR Time Series Classification Repository [3]. For convenience, we will select only 5 classifiers from the table (ts-chief, rocket, boss, weasel, catch22) and 12 datasets (the rows of the table below).
Dataset | ts-chief | rocket | boss | weasel | catch22 |
---|---|---|---|---|---|
Beef | 0.632 | 0.760 | 0.612 | 0.740 | 0.473 |
BME | 0.996 | 0.997 | 0.866 | 0.948 | 0.905 |
Car | 0.879 | 0.912 | 0.848 | 0.834 | 0.746 |
CBF | 0.998 | 0.996 | 0.999 | 0.980 | 0.954 |
Crop | 0.762 | 0.752 | 0.686 | 0.724 | 0.653 |
Fish | 0.982 | 0.974 | 0.970 | 0.951 | 0.773 |
Ham | 0.805 | 0.855 | 0.837 | 0.821 | 0.694 |
Meat | 0.984 | 0.989 | 0.981 | 0.977 | 0.943 |
Rock | 0.832 | 0.805 | 0.803 | 0.855 | 0.705 |
UMD | 0.983 | 0.983 | 0.966 | 0.932 | 0.869 |
Wine | 0.898 | 0.914 | 0.893 | 0.930 | 0.700 |
Yoga | 0.873 | 0.914 | 0.910 | 0.892 | 0.804 |
The null hypothesis is that the accuracy is the same for all five classifiers and the alternative is that it is not:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_5 \qquad H_1: \text{at least one } \mu_j \text{ differs from the others},$$

where $\mu_j$ denotes the mean accuracy of the $j$-th classifier.
Note that the alternative hypothesis is not that all classifiers have different accuracies: that would require all five means to differ from one another before the null hypothesis could be rejected, whereas here it suffices that at least one classifier differs.
Step 1: We rank each algorithm on each dataset:
Dataset | ts-chief | rocket | boss | weasel | catch22 |
---|---|---|---|---|---|
Beef | 3 | 1 | 4 | 2 | 5 |
BME | 2 | 1 | 5 | 3 | 4 |
Car | 2 | 1 | 3 | 4 | 5 |
CBF | 2 | 3 | 1 | 4 | 5 |
Crop | 1 | 2 | 4 | 3 | 5 |
Fish | 1 | 2 | 3 | 4 | 5 |
Ham | 4 | 1 | 2 | 3 | 5 |
Meat | 2 | 1 | 3 | 4 | 5 |
Rock | 2 | 3 | 4 | 1 | 5 |
UMD | 1 | 2 | 3 | 4 | 5 |
Wine | 3 | 2 | 4 | 1 | 5 |
Yoga | 4 | 1 | 2 | 3 | 5 |
In the Beef dataset, ROCKET has the highest accuracy (0.760), so it is ranked $1$, while Catch22 has the lowest accuracy (0.473) and is hence ranked $5$.
Step 2: We can now calculate the average rank of each method, which is simply the mean of each column of the table above.
Classifier | Avg rank |
---|---|
ts-chief | 2.2 |
rocket | 1.7 |
boss | 3.2 |
weasel | 3.0 |
catch22 | 4.9 |
We can see that TS-CHIEF and ROCKET have the lowest average ranks, so they are the best performing classifiers, while Catch22 has the highest average rank (close to 5), which indicates that it is consistently the worst performing classifier.
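As a sanity check, Steps 1 and 2 can be reproduced in a few lines of NumPy/SciPy. This is just a sketch using the rounded accuracies from the table above; note that at three decimal places ts-chief and rocket are tied on UMD, so `rankdata` assigns both a rank of 1.5 there, which can shift their average ranks slightly compared to the table:

```python
import numpy as np
from scipy.stats import rankdata

classifiers = ["ts-chief", "rocket", "boss", "weasel", "catch22"]
datasets = ["Beef", "BME", "Car", "CBF", "Crop", "Fish",
            "Ham", "Meat", "Rock", "UMD", "Wine", "Yoga"]

# Accuracies from the table above (rows = datasets, columns = classifiers)
acc = np.array([
    [0.632, 0.760, 0.612, 0.740, 0.473],
    [0.996, 0.997, 0.866, 0.948, 0.905],
    [0.879, 0.912, 0.848, 0.834, 0.746],
    [0.998, 0.996, 0.999, 0.980, 0.954],
    [0.762, 0.752, 0.686, 0.724, 0.653],
    [0.982, 0.974, 0.970, 0.951, 0.773],
    [0.805, 0.855, 0.837, 0.821, 0.694],
    [0.984, 0.989, 0.981, 0.977, 0.943],
    [0.832, 0.805, 0.803, 0.855, 0.705],
    [0.983, 0.983, 0.966, 0.932, 0.869],
    [0.898, 0.914, 0.893, 0.930, 0.700],
    [0.873, 0.914, 0.910, 0.892, 0.804],
])

# Step 1: rank the classifiers on each dataset (rank 1 = highest accuracy)
ranks = np.array([rankdata(-row) for row in acc])

# Step 2: average rank of each classifier over all datasets
avg_ranks = ranks.mean(axis=0)
for name, r in zip(classifiers, avg_ranks):
    print(f"{name}: {r:.2f}")
```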
Step 3: Compute the Friedman statistic to decide whether the classifiers are significantly different. We have 12 datasets and 5 classifiers, thus $N = 12$ and $k = 5$:

$$\chi_F^2 = \frac{12 \cdot 12}{5(5+1)}\left[\sum_{j=1}^{5} R_j^2 - \frac{5(5+1)^2}{4}\right] = 4.8 \times (51.04 - 45) \approx 29.0,$$

where the $R_j$ are the (unrounded) average ranks from Step 2.
Step 4: The critical value of the $\chi^2$ distribution with $k-1 = 4$ degrees of freedom at $\alpha = 0.05$ is 9.488. Since $29.0 > 9.488$, we can reject the null hypothesis (Reject $H_0$).
Finally, we can also compute the correction by Iman and Davenport:

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} = \frac{11 \times 29.0}{12 \times 4 - 29.0} \approx 16.8$$
The critical value of the $F$-distribution with $k-1 = 4$ and $(k-1)(N-1) = 44$ degrees of freedom at $\alpha = 0.05$ is approximately 2.58. Since $16.8 > 2.58$, we again reject $H_0$, consistent with the result of the Friedman statistic.
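The same conclusions can be reached directly with SciPy: `scipy.stats.friedmanchisquare` runs the test given one sequence of accuracies per classifier (it applies a tie correction, so its statistic can differ slightly from the hand computation above), and the critical values can be looked up with `scipy.stats.chi2` and `scipy.stats.f` instead of a table. A short sketch, reusing the `acc` matrix defined earlier:

```python
from scipy.stats import friedmanchisquare, chi2, f

N, k = acc.shape  # 12 datasets, 5 classifiers

# SciPy's Friedman test: one argument per classifier (one column of the table)
stat, p_value = friedmanchisquare(*acc.T)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p_value:.4g}")

# Critical values looked up instead of read from a table
chi2_crit = chi2.ppf(0.95, k - 1)               # ~9.49 for 4 degrees of freedom
f_crit = f.ppf(0.95, k - 1, (k - 1) * (N - 1))  # ~2.58 for (4, 44) degrees of freedom
print(f"chi2 critical value = {chi2_crit:.3f}, F critical value = {f_crit:.3f}")
```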
After applying the Friedman test, if the null hypothesis is rejected, we can proceed with a post-hoc test to establish where the significant differences between the algorithms lie. We will cover possible post-hoc tests in future blog posts.
[1] Jerrold H. Zar. Biostatistical Analysis. 5th ed. Prentice-Hall/Pearson, 2010.
[2] Ronald L. Iman and James M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.
[3] Anthony Bagnall, Jason Lines, William Vickers and Eamonn Keogh. The UEA & UCR Time Series Classification Repository, www.timeseriesclassification.com.