DA3. Identify Population
Specify a large population that you might want to study and describe the type of numeric measurement that you will collect (examples: a count of things, the height of people, a score on a survey, the weight of something). What would you do if you found a couple of outliers in a sample of size 100? What would you do if you found two values that were twice as big as the next highest value?
Let’s imagine that we have written a program and want to evaluate the time it takes to complete a specific task, that is, running it with the same input and expecting the same output.
The population for this problem is ALL the times our program takes across every run with the input we are studying; this population is effectively infinite, or at least of unknown size, since we cannot know how many times the program will be executed with that input.
To approach this problem, we run our program 100 times on the SAME machine and in the SAME environment (e.g. memory, other running programs, etc.), supplying the same input and recording the time each execution takes.
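The sampling step above could be sketched in Python as follows. This is only an illustration under assumptions: `measure_runs` and `example_task` are hypothetical names, and `example_task` stands in for the real program under study. The clock used here is `time.perf_counter` from the standard library.

```python
import time

def measure_runs(task, runs=100):
    """Call task() `runs` times and return the elapsed time of each call in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        task()
        timings.append(time.perf_counter() - start)
    return timings

def example_task():
    # Hypothetical stand-in for the program under study: same input on every call.
    return sum(range(10_000))

sample = measure_runs(example_task, runs=100)

# Same two-column shape as the table: (processId, timeConsumed)
rows = list(enumerate(sample, start=1))
```

Using one clock for every run keeps the measurements comparable, which matters later when we ask whether an extreme value comes from the program or from the measurement itself.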
We are going to get a table similar to this:
processId (subject identifier) | timeConsumed (unit of time) |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
To analyze the sample that we have, we can start constructing the shape of the distribution of the sample by calculating:
- Mean (average).
- The median.
- First Quartile.
- Third Quartile.
- Minimum value.
- Maximum value.
- The outliers, found by comparing each value with the range expected from the rest of the sample (for example, values far below the first quartile or far above the third quartile).
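As a rough sketch, the values listed above can be computed with Python's standard `statistics` module. Two things here are assumptions and not part of the original plan: the timings are made up for illustration, and outliers are flagged with the common 1.5 × IQR fence rule, which is one convention among several. The helper name `summarize` is hypothetical.

```python
import statistics

def summarize(times):
    """Return the mean, five-number summary, and outliers of a sample."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed rule: values beyond 1.5 * IQR from the quartiles count as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(times),
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": min(times),
        "max": max(times),
        "outliers": [t for t in times if t < low_fence or t > high_fence],
    }

# Made-up timings (arbitrary units), not real measurements.
times = [10, 11, 12, 12, 13, 13, 14, 15, 30, 32]
summary = summarize(times)
print(summary)
```

With these made-up timings the two largest values, 30 and 32, fall above the upper fence and are reported as outliers, while the median stays at 13 even though the mean is pulled up to 16.2.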
After we calculate the above values we can display them in a plot; a box plot seems a good fit for this problem, since it shows the median, the quartiles, the extremes and the outliers at a glance.
To analyze the data further, we can measure how far each value lies from the centre of the data by computing the deviations from the mean, the variance and the standard deviation.
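A minimal sketch of these spread measures, again using made-up timings and the standard `statistics` module. It assumes the sample versions of the variance and standard deviation (dividing by n - 1), since the 100 runs are a sample and not the whole population.

```python
import statistics

# Made-up timings (arbitrary units), not real measurements.
times = [10, 11, 12, 12, 13, 13, 14, 15, 30, 32]

mean = statistics.mean(times)

# Deviation of each observation from the mean; these always sum to zero.
deviations = [t - mean for t in times]

# Sample variance: sum of squared deviations divided by (n - 1).
sample_variance = statistics.variance(times)

# Sample standard deviation: square root of the sample variance.
sample_std = statistics.stdev(times)

print(deviations, sample_variance, sample_std)
```

Because the deviations are squared, a few very large timings inflate the variance and standard deviation much more than they move the median, which is one reason to inspect outliers before trusting these numbers.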
Outliers usually mean that something extraordinary happened while the measurement was taken. Finding a couple of outliers in a sample of size 100 is a serious issue: it requires us to re-examine the measurement method we are using, or to analyze the environment surrounding the experiment for hidden factors that were NOT taken into account when the experiment was designed. Sometimes it may even require us to redesign the entire experiment from scratch.
For our timing problem, finding 2 outliers with very large times requires us to re-examine the clock we use to measure the time, and to check whether that clock behaved the same way for ALL measurements.
If the clock is confirmed to be valid for all observations, then we can start analyzing the environment, for example whether the machine did anything different while the outlier observation was being measured, such as reconnecting to the internet or starting a heavy process in the background. Then we can check whether the input changed, for any reason, during that observation.
If ALL environmental factors are confirmed to be identical across all our observations, then we might re-audit the whole experiment design.
If that review confirms that the experiment is valid and no changes are required, then we can accept these outliers as legitimate values for this experiment.