2. Sampling and Data structures
data sampling considerations
1. variation in data
- quantifying how much the variation in data contributes to uncertainty when making inferences is the statistician's main concern.
- variation is present in any set of data.
- different samples taken from the same population may vary from one another.
- if we are investigating the same sample, then any variation in the results indicates a problem with either the data collection method or the accuracy of data recording.
2. variation in samples
- two or more samples from the same population may each reflect the characteristics of the population and still differ from each other.
- types of sampling:
- random sampling: randomly select a given number of elements from the entire population; it is the easier method to apply. e.g. choose 30 students at random from the entire school.
- cluster sampling: divide the population into groups (clusters), randomly choose one or more clusters, and study every member of the chosen clusters. e.g. randomly choose one class of 30 students and study every student in that class.
- the sample size (the number of observations) is important.
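The two sampling methods above can be sketched in a few lines of Python. This is only an illustration: the school layout (10 classes of 30 students), the student IDs, and the seed are all made-up assumptions, not data from these notes.

```python
import random

# Hypothetical school: 10 classes of 30 students each (all IDs are invented).
classes = {
    f"class_{c}": [f"student_{c}_{i}" for i in range(30)]
    for c in range(10)
}
population = [s for members in classes.values() for s in members]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Random sampling: 30 students drawn without replacement from the entire school.
random_sample = rng.sample(population, k=30)

# Cluster sampling: pick one class at random, then take every student in it.
chosen_class = rng.choice(sorted(classes))
cluster_sample = classes[chosen_class]

# A second random sample from the same population will usually differ from the first.
another_sample = rng.sample(population, k=30)
shared = set(random_sample) & set(another_sample)

print(len(random_sample), len(cluster_sample))
print(len(shared), "students appear in both random samples")
```

Both samples have the same size, but the random sample is spread over the whole school while the cluster sample comes from a single class.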
3. Frequency
- frequency distribution is the primary way of summarizing the variability of data.
- frequency is the number of times a given datum occurs in a data set.
- the sum of all frequencies must equal the sample size.
- relative frequency = frequency / sample size. Relative frequencies can be written as fractions, percentages, or decimals.
- example: if the value 2 appears 3 times in a data set of 20 observations, its frequency is 3 and its relative frequency is 3/20 = 0.15 (or 15%) of all occurrences.
- The sum of the relative frequencies should always be equal to 1.
- cumulative relative frequency: the running total of the relative frequencies; for each value, it is that value's relative frequency plus the relative frequencies of all the values before it.
- the last cumulative relative frequency must always be 1.
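A minimal Python sketch of the three quantities defined above. The data set is invented for illustration (20 observations, chosen so that the value 2 occurs 3 times); `Fraction` is used only so that the sums come out exactly equal to 1 rather than approximately.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

# Made-up data set of 20 observations (illustrative only).
data = [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 6, 7]
n = len(data)

# Frequency: how many times each value occurs, ordered by value.
frequency = dict(sorted(Counter(data).items()))

# Relative frequency: frequency / sample size.
relative = {value: Fraction(count, n) for value, count in frequency.items()}

# Cumulative relative frequency: running total of the relative frequencies.
cumulative = dict(zip(relative, accumulate(relative.values())))

print("value  freq  rel.freq  cum.rel.freq")
for value in frequency:
    print(
        f"{value:>5}  {frequency[value]:>4}  "
        f"{float(relative[value]):>8.2f}  {float(cumulative[value]):>12.2f}"
    )
```

The checks listed in the notes hold here: the frequencies sum to the sample size, the relative frequencies sum to 1, and the last cumulative relative frequency is 1.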
4. critical evaluation
- it is important to evaluate critically the statistical analyses we encounter before accepting the conclusions that are obtained as a result of these analyses.
- biased samples: samples that are not representative of their population; they produce inaccurate or invalid results.
- data quality: errors in sampling or data processing reduce data quality. Some of these errors are avoidable, and the data should be cleaned of them as far as possible.
- self-selected samples: responses come only from people who choose to respond, as in call-in surveys; such samples are often biased.
- sample size issues: small samples are unreliable; in general, the larger the sample size the better.
- undue influence: collecting data or asking questions in a way that influences the response.
- what makes a sample biased:
- failing to represent the entire population
- low data quality.
- self-selected samples
- very small sample size
- undue influence
- causality: a relationship between two variables does not mean that one causes the other; both may instead be related to a third variable.
- self-funded or self-interested studies: a study performed by a person or organization in order to support their own claim.
- confounding: occurs when the effects of multiple factors cannot be separated from one another.
R data structures
- data frame: the standard tabular format for storing statistical data; the columns of the table are called variables.
- if the data of the columns (variables) are numeric, we call them quantitative variables or numeric variables.
- if the data of a column are qualitative values drawn from a fixed set of levels (similar to strings or enums), we call the variable a factor.
- the rows of the table are called observations and correspond to the subjects.
- example:
- there are 100 rows in this data frame (table) => 100 subjects
- each row has three variables (columns): id, sex, height
- the variables (columns) id and height are quantitative.
- the variable sex is a factor.
- quantitative discrete data: data that result from counting, e.g. 1, 2, 3, ... (integers).
- quantitative continuous data: data that result from measuring on a continuous scale, e.g. angles in radians, weights, ... (floats).
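The table layout described above can be mimicked in plain Python as a rough analogue. This is not R's data frame itself: the three rows, their values, and the helper `kind` are hypothetical and only illustrate the terms variable, observation, factor, discrete, and continuous.

```python
# Hypothetical observations; the values are invented for illustration.
rows = [
    {"id": 1, "sex": "F", "height": 162.5},
    {"id": 2, "sex": "M", "height": 175.0},
    {"id": 3, "sex": "F", "height": 158.2},
]

def kind(values):
    """Roughly classify a column using the vocabulary of the notes."""
    if all(isinstance(v, int) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative (continuous)"
    return "factor"

# Columns are the variables; rows are the observations (subjects).
variables = list(rows[0])
columns = {name: [row[name] for row in rows] for name in variables}
kinds = {name: kind(values) for name, values in columns.items()}

# The levels of a factor are its distinct values.
levels = sorted(set(columns["sex"]))

print(len(rows), "observations,", len(variables), "variables")
for name in variables:
    print(name, "->", kinds[name])
print("levels of sex:", levels)
```

Here `id` is classified as quantitative only because it is stored as whole numbers, which matches how the example in the notes treats it.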