DA4. Probabilities
Think of something that you might want to measure that is affected by random variation. Identify what you want to measure, then describe its (approximate) sample space. Give a rough description of the probabilities associated with those values (you can simply specify if they are all the same probability or if values in one range will be more likely than values in another range).
Suppose we have written a program, and we want to measure the time it takes to complete a specific task, as well as the effect of input size on that time.
The population for this problem is the set of ALL executions of our program with the input under study; this population may be infinite or unknown, since we don’t know how many times the program will be executed with the same input.
The problem is complicated because we don’t know the exact specifications of the machine that will run our program, nor the amount of available memory, the computing power, or the interactions with other running programs; we also don’t know the exact size of the input that will be supplied to our program.
To approach this problem, we idealize our sample a bit: we fix the input, then run our program 100 times on the SAME machine and in the SAME environment (e.g. memory, other running programs, etc.), supplying the same input each time and recording the time each execution consumes.
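The measurement procedure described above can be sketched in R. This is only a minimal sketch under stated assumptions: `runTask` is a hypothetical stand-in for the program being studied, and `fixedInput` is an assumed input chosen for illustration. `system.time()` is base R, and its `"elapsed"` entry reports wall-clock time in seconds.

```r
# Hypothetical stand-in for the program being measured (an assumption,
# not the actual program discussed in the text).
runTask <- function(input) {
  sum(sort(input))
}

# Fixed input, reused for every run so that only random variation remains.
set.seed(1)
fixedInput <- runif(10000)

nRuns <- 100

# Time each of the 100 runs; "elapsed" is wall-clock time in seconds.
timeConsumed <- vapply(
  seq_len(nRuns),
  function(i) system.time(runTask(fixedInput))[["elapsed"]],
  numeric(1)
)

# One row per run, mirroring the processId / time consumed layout.
measurements <- data.frame(
  processId = seq_len(nRuns),
  timeConsumed = timeConsumed
)
head(measurements, 3)
```

The recorded times will differ from run to run and from machine to machine, which is exactly the random variation being studied.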
We are going to get a table similar to this:
Time consumed per run for input size x.
processId (subject identifier) | time consumed (unit of time) |
---|---|
1 | 196 |
2 | 166 |
3 | 176 |
We calculate the time consumed for this input size as the average of the times recorded in the table above, using R.
inputSizes <- c(1,2,3,4,5,6,7,8,9,10)
For each input size we run the following:
Record the measurements of timeConsumed:
input.x.timeConsumed = c(182,168,172,154,174,176,193,156,157, …)
Replace the values passed to the function c() with the measurements recorded for this input size; x stands for the relevant inputSize.
Find the average of timeConsumed for this inputSize.
avg.x = mean(input.x.timeConsumed)
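The per-size averaging steps above could be collected into one pass, roughly as follows. This is a sketch, not the original analysis: the run times are the example values that appear in the text, while the labels `sizeA` and `sizeB` are hypothetical, since the text does not say which input sizes those runs belong to.

```r
# Recorded run times per input size. The values come from the examples in
# the text; the labels "sizeA" and "sizeB" are hypothetical placeholders.
timeConsumedBySize <- list(
  sizeA = c(196, 166, 176),
  sizeB = c(182, 168, 172, 154, 174, 176, 193, 156, 157)
)

# Average the run times for each input size.
avgBySize <- vapply(timeConsumedBySize, mean, numeric(1))

# Collect the results in a table of input size vs. average time consumed.
averages <- data.frame(
  inputSize = names(avgBySize),
  timeConsumed = unname(avgBySize)
)
print(averages)
```

With real data, each list entry would hold the full set of 100 measurements for one input size.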
This gives us the following table, which links each input size to the average time consumed when running with that input:
Average time consumed for each input size
inputSize | timeConsumed |
---|---|
1 | 154 |
2 | 142 |
3 | 164 |
To estimate the overall average time consumed by our program, we need to draw a sample of random input sizes and then average the time consumed over that sample, since we will never know in advance the size of the inputs that our program will run on.
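One way to picture this last step is the sketch below. It reuses the three per-size averages from the table above and assumes, purely for illustration, that every input size is equally likely; a real workload may follow a very different distribution.

```r
# Average time per input size, taken from the table in the text.
avgTime <- c("1" = 154, "2" = 142, "3" = 164)

# Assumption: every input size is equally likely (not stated in the text).
set.seed(42)
sampledSizes <- sample(names(avgTime), size = 1000, replace = TRUE)

# Estimate the overall average as the mean over the sampled input sizes.
overallAvg <- mean(avgTime[sampledSizes])
overallAvg
```

The estimate necessarily falls between the smallest and largest per-size averages, and it shifts if some input sizes are more likely than others.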
What would you say to a person who says that he or she “knows” what the outcome of an individual observation will be (an outcome of something that has not happened yet that is subject to random error)?
In statistics, randomness means uncertainty: no one can know the outcome of an individual observation in advance when it is subject to random variation. At best, we can say which outcomes are more or less likely. As we showed earlier, the population might be infinite, so claiming to know in advance the exact time our program will consume is unfounded.
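The run-to-run variation can be made concrete with the nine run times listed earlier in the text (the remaining values are elided there, so only those nine are used). The bin boundaries below are an arbitrary choice for illustration, not part of the original analysis.

```r
# The nine run times listed in the text; the rest are elided there.
timeConsumed <- c(182, 168, 172, 154, 174, 176, 193, 156, 157)

# Spread of the individual observations.
range(timeConsumed)  # smallest and largest recorded time
sd(timeConsumed)     # sample standard deviation

# Relative frequency of each coarse range of values: a rough, empirical
# picture of which outcomes are more likely than others.
# The break points are an arbitrary choice for this sketch.
bins <- cut(timeConsumed, breaks = c(150, 165, 180, 195))
relFreq <- table(bins) / length(timeConsumed)
relFreq
```

Even under identical conditions the individual times differ, so the most anyone can state beforehand is a range and a rough likelihood, not the exact value of the next run.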