Statistical tests for benchmark comparison
Suppose you have two algorithms for solving an optimization model and you want to benchmark their performance over a large set of instances, using a single performance metric such as running time. I want to show that one algorithm is "faster" than the other by means of a statistical test.
My question: which test is the correct one to use?
More precisely: most papers in OR just compare mean times, which is clearly too weak. At the other extreme, you can use a sign test, but it does not consider the magnitude of the difference, only whether or not you are faster on each instance. Others (including me) use performance profiles, but this kind of plot ignores the one-to-one comparison on each instance (you can be slower on more than half of the instances and still have a performance curve that dominates the other).
My knowledge of statistics is low (for a mathematician), but if I'm not wrong, you need:
- a nonparametric test (because I cannot assume normality or any other distribution),
- paired (because the times of the two algorithms on the same instance are not independent),
- one-sided (because I want to show that one algorithm is better than the other).
Also, does it make a difference whether or not the instances are randomly and independently generated?
Tags: optimization, algorithms, benchmark, hypothesis-testing
asked 8 hours ago by Borelian; edited 6 hours ago by TheSimpliFire♦
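To make the three requirements concrete, here is a minimal sketch (not part of the original question) of a paired, one-sided, nonparametric comparison in Python with SciPy. The running times are invented, and the `alternative` argument and `binomtest` assume a reasonably recent SciPy version.

```python
# Minimal sketch (illustration only): paired, one-sided, nonparametric
# comparison of two algorithms' running times with SciPy.
import numpy as np
from scipy import stats

# Hypothetical paired running times (seconds), one entry per instance.
rng = np.random.default_rng(0)
times_b = rng.lognormal(mean=1.0, sigma=0.5, size=50)                 # algorithm B
times_a = times_b * rng.lognormal(mean=-0.1, sigma=0.2, size=50)      # algorithm A, usually faster

# Wilcoxon signed-rank test: H1 = "A's times tend to be smaller than B's".
# Paired and nonparametric, and (unlike the sign test) it uses the
# magnitudes of the differences through their ranks.
w_stat, w_p = stats.wilcoxon(times_a, times_b, alternative="less")
print(f"Wilcoxon signed-rank: statistic={w_stat:.1f}, p-value={w_p:.4f}")

# Sign test: only counts on how many instances A beats B, ignoring magnitudes.
wins_a = int(np.sum(times_a < times_b))
n = len(times_a)  # assumes no exact ties; tied instances would normally be dropped
s_p = stats.binomtest(wins_a, n, p=0.5, alternative="greater").pvalue
print(f"Sign test: A faster on {wins_a}/{n} instances, p-value={s_p:.4f}")
```

Because the signed-rank test ranks raw time differences, instances with long absolute running times dominate; applying it to log-ratios, e.g. `stats.wilcoxon(np.log(times_a / times_b), alternative="less")`, is a common way to compare relative speedups instead.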
I found that the Wilcoxon signed-rank test satisfies these conditions, but apparently this test evaluates whether one algorithm is always faster (on all instances), which can be too strict.
– Borelian, 8 hours ago
But isn't that what you want to show, that one algorithm is better than another? Of course you will get an answer about the statistical significance of the performance differences between the two algorithms. I would go for the Wilcoxon signed-rank test.
– JakobS, 8 hours ago
I think it depends on the definition of being faster. You can also use a sign test, which compares, instance by instance, whether you are better or not (ignoring the magnitude of the difference). But you could be slightly slower on half of the instances and much faster on the others, and you would still fail this test. At the other extreme, you can just compare mean times (as is usual in many, many papers). That's my question: which test should be considered a good definition of "being best" in the algorithmic context? I'll add this to the question. Thanks!
– Borelian, 6 hours ago
Have you looked into Design of Experiments or Experimental Methods? Perhaps this or this link might be useful to you.
– abbas omidi, 6 hours ago
3 Answers
I think there are many different factors to consider. There's a very good paper by Coffin and Saltzman (Statistical Analysis of Computational Tests of Algorithms and Heuristics, INFORMS JOC 12(1):24-44, 2000) that discusses this issue in detail.
– answered 6 hours ago by LarrySnyder610♦
Options for you: McNemar (NP), $t$ test with variants (P), Wilcoxon (NP), sign test (NP), Friedman (NP); here NP marks a non-parametric test and P a parametric one.
Dietterich (1998)
Five statistical tests are compared, primarily on the type I error they produce. Emphasis mine.
Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences $t$ test based on taking several random train-test splits. A third test, a paired-differences $t$ test based on $10$-fold cross-validation, exhibits somewhat elevated probability of type I error.
In your case this shouldn't matter anyway, since a $t$ test assumes normality to a certain extent.
The cross-validated $t$ test is the most powerful. The $5\times 2$ cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed $10$ times, the $5\times 2$ cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
Note that McNemar's test is non-parametric, similar to a goodness-of-fit test based on the $\chi^2$ distribution.
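The McNemar option applies to paired binary outcomes rather than running times; in a benchmarking setting one natural binary outcome is whether each algorithm solved an instance within the time limit. The following is a hedged sketch (not part of the answer, counts invented) using `statsmodels`.

```python
# Hedged sketch: McNemar's test on paired binary outcomes
# (e.g. solved / not solved within a time limit on the same instances).
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table over the same instances (invented counts):
#                      B solved   B failed
#   A solved              60          9
#   A failed               2         14
table = [[60, 9],
         [2, 14]]

# exact=True uses the binomial distribution on the discordant pairs (9 vs 2),
# which is advisable when those counts are small; the p-value is two-sided.
result = mcnemar(table, exact=True)
print(f"McNemar statistic={result.statistic}, p-value={result.pvalue:.4f}")
```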
Demšar (2006)
This paper is more interesting, as it considers some alternative non-parametric approaches, such as the Wilcoxon signed-rank test.
When the assumptions of the paired $t$-test are met, the Wilcoxon signed-ranks test is less powerful than the paired $t$-test. On the other hand, when the assumptions are violated, the Wilcoxon test can be even more powerful than the $t$-test.
Another non-parametric test is the Friedman test. It is similar to ANOVA, but still uses ranking as part of the expression for the test statistic.
A simpler method is to use the sign test; the larger the number of data sets, the closer this test comes to an equivalent $z$-test. However, the cost of this simplicity is highlighted below.
This test does not assume any commensurability of scores or differences, nor does it assume normal distributions, and it is thus applicable to any data (as long as the observations, i.e. the data sets, are independent). On the other hand, it is much weaker than the Wilcoxon signed-ranks test.
In conclusion...
Overall, the non-parametric tests, namely the Wilcoxon and Friedman tests, are suitable for our problems. They are appropriate since they assume some, but limited, commensurability. They are safer than parametric tests since they do not assume normal distributions or homogeneity of variance. As such, they can be applied to classification accuracies, error ratios or any other measure for evaluation of classifiers, including even model sizes and computation times. Empirical results suggest that they are also stronger than the other tests studied.
References
[1] Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10:1895-1923.
[2] Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1-30.
– answered and edited 6 hours ago by TheSimpliFire♦
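If the comparison is later extended to three or more algorithms evaluated on the same instance set, the Friedman test mentioned above is the natural paired, rank-based choice. Below is a minimal sketch with invented data; `scipy.stats.friedmanchisquare` expects one array of measurements per algorithm, aligned instance by instance.

```python
# Hedged sketch: Friedman test for k >= 3 algorithms measured on the
# same instances (a paired, rank-based, nonparametric test).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_instances = 40
base = rng.lognormal(mean=1.0, sigma=0.6, size=n_instances)

# Hypothetical running times of three algorithms on the same instances.
times = {
    "alg_a": base * 0.8,
    "alg_b": base * 1.0,
    "alg_c": base * rng.uniform(0.7, 1.4, size=n_instances),
}

# Null hypothesis: all algorithms perform the same (in terms of ranks)
# across instances.
stat, p = stats.friedmanchisquare(*times.values())
print(f"Friedman chi-square={stat:.2f}, p-value={p:.4f}")

# A significant result only says that *some* algorithm differs; pairwise
# follow-up comparisons (e.g. Wilcoxon tests with a multiple-testing
# correction) would identify which ones.
```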
I know that you explicitly ask for a statistical test, but maybe this is because you don't know about alternatives that are fairly well established in the community. When comparing algorithms, my number-one choice is performance profiles.
They were introduced in this article: Elizabeth D. Dolan and Jorge J. Moré, Benchmarking optimization software with performance profiles, Mathematical Programming, 91(2):201–213, 2002.
– answered 5 hours ago by Marco Lübbecke
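For completeness, here is a short sketch (an illustration, not from the answer) of how a Dolan and Moré performance profile can be computed from a matrix of running times, with rows as problem instances and columns as solvers; all names and data are invented.

```python
# Hedged sketch: Dolan-Moré performance profile from a times matrix.
# times[p, s] = running time of solver s on problem p (use np.inf for failures).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
times = rng.lognormal(mean=1.0, sigma=0.7, size=(60, 2))  # 60 instances, 2 solvers
solver_names = ["algorithm A", "algorithm B"]

# Performance ratios: r[p, s] = times[p, s] / best time on problem p.
best = times.min(axis=1, keepdims=True)
ratios = times / best

# rho_s(tau) = fraction of problems on which solver s is within a factor tau
# of the best solver.
taus = np.linspace(1.0, ratios[np.isfinite(ratios)].max(), 200)
for s, name in enumerate(solver_names):
    rho = [(ratios[:, s] <= tau).mean() for tau in taus]
    plt.step(taus, rho, where="post", label=name)

plt.xlabel(r"performance ratio $\tau$")
plt.ylabel(r"fraction of instances $\rho_s(\tau)$")
plt.legend()
plt.title("Performance profile (Dolan and Moré, 2002)")
plt.show()
```

The curve's value at $\tau = 1$ is the fraction of instances on which a solver is (tied for) the fastest, while its height for large $\tau$ reflects robustness, i.e. how often the solver stays within a factor $\tau$ of the best.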