
From: Johan Råde (rade_at_[hidden])
Date: 2008-04-24 12:49:46


John Maddock wrote:
> Johan Råde wrote:
>> A typical data mining scenario might be to calculate the cdf of the
>> t- or F-distribution for each value in an array of, say, 100,000
>> single or double precision floating point numbers.
>> (I tend to use double precision.)
>> Anything that could speed up that task would be interesting.
>
> Nod, the question is which combinations of arguments actually get
> passed to the incomplete beta: if the data isn't unduly sensitive, it
> would be really useful to have a log of those values, so we can see
> which parts of the implementation are getting hammered the most.

In data mining applications, most of the variables usually satisfy the null hypothesis;
in other words, their distribution is the t- or F-distribution at hand.
So you can generate realistic test data by starting with an array of uniform [0,1]
random numbers and applying the inverse of the cdf to each value.

Concerning the degrees of freedom, in the problems we analyze:
For the t-distribution: 10 - 1000, typically around 100.
For the F-distribution: the first number of degrees of freedom is 2 - 10;
the second is 10 - 1000, typically around 100.
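
For what it's worth, here is a minimal sketch of that recipe using Boost.Math's
students_t; the seed, the array size, and the choice of 100 degrees of freedom
are just illustrative assumptions (the F case would use fisher_f instead):

#include <boost/math/distributions/students_t.hpp>
#include <random>
#include <vector>

int main()
{
    // ~100 degrees of freedom, typical for the t-distribution problems above
    boost::math::students_t dist(100);

    std::mt19937 gen(42);                               // arbitrary seed
    std::uniform_real_distribution<double> u01(0.0, 1.0);

    // start from uniform [0,1] draws, then apply the inverse cdf
    std::vector<double> data(100000);
    for (double& x : data)
        x = quantile(dist, u01(gen));                   // found via ADL

    // running cdf(dist, x) over this array now exercises the incomplete
    // beta with realistic argument combinations
    return 0;
}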

HTH
Johan Råde

