Statistical tests for benchmark comparison


Suppose you have two algorithms for solving an optimization model, and you want to benchmark their performance over a large set of instances (with a single performance metric, for example the running time). I want to show that one algorithm is "faster" than the other by using some statistical test.

My question: Which test is the correct one to use?

More precisely: Most papers in OR just compare mean times, which is clearly too weak. At the other extreme, you can use a sign test, but it doesn't consider the magnitude of the differences, only whether you are faster or not on each instance. Others (including me) use performance profiles, but this graph ignores the one-to-one comparison on each instance (you can be slower on more than half of the instances and still have a performance curve that dominates the other).

My knowledge of statistics is limited (for a mathematician), but if I'm not mistaken, you need:

  • a nonparametric test (because I cannot assume normality or any other distribution),

  • a paired test (because the times of the two algorithms on the same instance are not independent),

  • a one-sided test (because I want to show that one algorithm is better than the other).

Also, does it make a difference whether the instances are randomly and independently generated or not?
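
For concreteness, here is a minimal sketch (not an endorsement of any particular test) of how such a paired, one-sided, nonparametric comparison could be run in Python with SciPy; the arrays t_a and t_b are hypothetical running times of the two algorithms on the same instances, filled with dummy data here just so the script runs:

    # A minimal sketch, assuming SciPy >= 1.7. t_a and t_b are hypothetical
    # arrays of running times, one entry per benchmark instance.
    import numpy as np
    from scipy.stats import wilcoxon, binomtest

    rng = np.random.default_rng(0)
    t_a = rng.lognormal(mean=1.0, sigma=0.5, size=50)        # algorithm A
    t_b = t_a * rng.lognormal(mean=0.1, sigma=0.3, size=50)  # algorithm B

    # Sign test: counts only wins/losses, ignoring the size of the differences.
    wins_a = int(np.sum(t_a < t_b))
    sign_res = binomtest(wins_a, n=len(t_a), p=0.5, alternative="greater")

    # Wilcoxon signed-rank test: paired, nonparametric, one-sided.
    # alternative="less" tests whether the differences t_a - t_b tend to be
    # negative, i.e. whether algorithm A tends to be faster.
    wilcoxon_res = wilcoxon(t_a, t_b, alternative="less")

    print(f"sign test     p-value: {sign_res.pvalue:.4f}")
    print(f"Wilcoxon test p-value: {wilcoxon_res.pvalue:.4f}")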







Tags: optimization, algorithms, benchmark, hypothesis-testing






asked 8 hours ago by Borelian; edited 6 hours ago by TheSimpliFire










  • I found that the Wilcoxon signed-rank test satisfies these conditions, but apparently this test evaluates whether one algorithm is always faster (for all instances), which can be too strict. – Borelian, 8 hours ago

  • But isn't that what you want to show? That one algorithm is better than another. Of course you will get an answer about the statistical significance of the performance difference between the two algorithms. I would go for the Wilcoxon signed-rank test. – JakobS, 8 hours ago

  • I think it depends on the definition of being faster. You can also use a sign test, which compares, instance by instance, whether you are better or not (ignoring the magnitude of the difference). But you can be slightly slower on half of the instances and much faster on the others, and still fail this test. At the other extreme, you can just compare mean times (as is usual in many, many papers). That's my question: which test should be considered a good definition of "being best" in the algorithmic context? I'll add this to the question. Thanks! – Borelian, 6 hours ago

  • Have you looked into Design of Experiments or experimental methods? Maybe this or this link could be useful to you. – abbas omidi, 6 hours ago












3 Answers






I think there are many different factors to consider. There's a very good paper by Coffin and Saltzman ("Statistical Analysis of Computational Tests of Algorithms and Heuristics", INFORMS Journal on Computing, 12(1):24-44, 2000) that discusses this issue in detail.







answered 6 hours ago by LarrySnyder610
























Options for you: McNemar (NP), $t$ test with variants (P), Wilcoxon (NP), sign test (NP), Friedman (NP) — where NP denotes a nonparametric test and P a parametric one.

Dietterich (1998)

Five statistical tests are compared, primarily on the type I error produced. Emphasis mine.

    Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences $t$ test based on taking several random train-test splits. A third test, a paired-differences $t$ test based on $10$-fold cross-validation, exhibits somewhat elevated probability of type I error.

In your case, this shouldn't matter much, as a $t$ test assumes normality to a certain extent.

    The cross-validated $t$ test is the most powerful. The $5\times 2$ cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed $10$ times, the $5\times 2$ cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.

Note that McNemar's test is nonparametric, similar to a goodness-of-fit test that uses a $\chi^2$ distribution.

Demšar (2006)

This paper is more interesting here, as it considers some alternative, nonparametric approaches such as the Wilcoxon signed-rank test.

    When the assumptions of the paired $t$-test are met, the Wilcoxon signed-ranks test is less powerful than the paired $t$-test. On the other hand, when the assumptions are violated, the Wilcoxon test can be even more powerful than the $t$-test.

Another nonparametric test is the Friedman test. It is similar to ANOVA, and uses ranking as part of the expression for the test statistic.

A simpler method is the sign test; the larger the number of data sets, the closer this test comes to the $z$-test. However, the cost of this simplicity is highlighted below.

    This test does not assume any commensurability of scores or differences nor does it assume normal distributions and is thus applicable to any data (as long as the observations, i.e. the data sets, are independent). On the other hand, it is much weaker than the Wilcoxon signed-ranks test.

In conclusion...

    Overall, the non-parametric tests, namely the Wilcoxon and Friedman test, are suitable for our problems. They are appropriate since they assume some, but limited, commensurability. They are safer than parametric tests since they do not assume normal distributions or homogeneity of variance. As such, they can be applied to classification accuracies, error ratios or any other measure for evaluation of classifiers, including even model sizes and computation times. Empirical results suggest that they are also stronger than the other tests studied.

References

[1] Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10:1895-1923.

[2] Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1-30.
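
As a hedged illustration of the Friedman test mentioned above (a sketch only, not part of the original answer), assuming the per-instance running times of three algorithms are available as arrays of equal length:

    # A sketch of the Friedman test for comparing three or more algorithms on
    # the same set of instances (scipy.stats.friedmanchisquare). The running
    # times below are hypothetical dummy data.
    import numpy as np
    from scipy.stats import friedmanchisquare

    rng = np.random.default_rng(1)
    base = rng.lognormal(mean=1.0, sigma=0.6, size=40)  # 40 instances
    times_a = base
    times_b = base * rng.lognormal(mean=0.05, sigma=0.2, size=40)
    times_c = base * rng.lognormal(mean=0.15, sigma=0.2, size=40)

    # H0: all algorithms perform the same. The test ranks the algorithms
    # within each instance and compares their average ranks.
    stat, p = friedmanchisquare(times_a, times_b, times_c)
    print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.4f}")
    # If H0 is rejected, a post-hoc procedure (e.g. the Nemenyi test discussed
    # in Demšar 2006) indicates which pairs of algorithms actually differ.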







answered 6 hours ago by TheSimpliFire (edited 6 hours ago)
























                            I know that you explicitly ask for a statistical test, but maybe this is because you don't know about alternatives that are rather established in the community. When comparing algorithms, my number one is performance profiles.



                            They were introduced in this article: Elizabeth D. Dolan and Jorge J. Moré, Benchmarking optimization software with performance profiles, Mathematical Programming, 91(2):201–213, 2002.
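
For illustration, a minimal sketch of how a Dolan–Moré performance profile can be computed and plotted, assuming a hypothetical matrix `times` of running times with one row per instance and one column per solver:

    # A minimal sketch of a Dolan-Moré performance profile. `times` is a
    # hypothetical matrix of running times, one row per instance and one
    # column per solver; failed runs could be marked with np.inf.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    times = rng.lognormal(mean=1.0, sigma=0.7, size=(60, 2))  # 60 instances, 2 solvers

    # Performance ratio: each solver's time divided by the best time on that instance.
    best = times.min(axis=1, keepdims=True)
    ratios = times / best

    # rho_s(tau) = fraction of instances on which solver s is within a factor
    # tau of the best solver; plotted as a step function.
    taus = np.linspace(1.0, ratios.max(), 200)
    for s, label in enumerate(["solver A", "solver B"]):
        profile = [(ratios[:, s] <= tau).mean() for tau in taus]
        plt.step(taus, profile, where="post", label=label)

    plt.xlabel("performance ratio tau")
    plt.ylabel("fraction of instances within factor tau of the best")
    plt.legend()
    plt.show()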







answered 5 hours ago by Marco Lübbecke






























