2. Sampling and Data structures

data sampling considerations

quantifying how much the variation in data contributes to the uncertainty of an inference is the statistician’s main concern
variation is present in any set of data
when taking different samples from a population, the data may vary from sample to sample
if we are investigating the same sample, then any variation in the results indicates an issue with either the data collection method or the accuracy of data recording

2 or more samples from the same population may all share the characteristics of the population, but the samples themselves might still differ from each other.
types of sampling:
1. random sampling: randomly choose a given number of elements from the entire population; it is the easier method to apply. e.g. choose 30 students randomly from the entire school.
2. cluster sampling: divide the population into groups (clusters), choose one or more clusters at random, and study every member of the chosen clusters. e.g. randomly choose a class of 30 students and study every student in that class.
the sample size (the number of observations) is important.
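The two sampling types above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the school, its 10 classes of 30 students, and all names (`school`, `random_sample`, `cluster_sample`) are hypothetical and not part of the original notes.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical school: 10 classes of 30 students each (300 students in total).
school = {
    f"class_{c}": [f"student_{c}_{i}" for i in range(30)]
    for c in range(10)
}
all_students = [s for members in school.values() for s in members]

# Random sampling: 30 students drawn without replacement from the entire school.
random_sample = random.sample(all_students, k=30)

# Cluster sampling: pick one class at random and keep every student in it.
chosen_class = random.choice(list(school))
cluster_sample = school[chosen_class]

print(len(random_sample), len(cluster_sample))
```

Both samples contain 30 observations, but the random sample usually spreads across several classes, while the cluster sample comes from a single class.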

frequency distribution is the primary way of summarizing the variability of data.
frequency is the number of times a given datum occurs in a data set.
the sum of all frequencies must be equal to the sample size.
relative frequency = frequency / sample size. it can be written as a fraction, a percentage or a decimal.
for example, if the number 2 appears 3 times in a sample of 20 observations, its frequency is 3 and its relative frequency is 3/20 = 0.15 (or 15%) of all occurrences.
The sum of the relative frequencies should always be equal to 1.
cumulative relative frequency: the running total of the relative frequencies; for each value, it is that value's relative frequency plus the relative frequencies of all the values before it.
the last cumulative relative frequency must always be 1.
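The definitions above can be checked with a short Python sketch. The data set is made up for illustration (20 observations in which the value 2 occurs 3 times), and the variable names are assumptions.

```python
from collections import Counter
from itertools import accumulate

# Made-up sample of 20 observations; the value 2 occurs 3 times.
data = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

n = len(data)  # sample size

# Frequency: how many times each value occurs, ordered by value.
freq = dict(sorted(Counter(data).items()))

# Relative frequency: frequency divided by the sample size.
rel_freq = {value: count / n for value, count in freq.items()}

# Cumulative relative frequency: running total of the relative frequencies.
cum_rel_freq = dict(zip(rel_freq, accumulate(rel_freq.values())))

for value in freq:
    print(value, freq[value], f"{rel_freq[value]:.2f}", f"{cum_rel_freq[value]:.2f}")
```

The frequencies add up to the sample size, the relative frequencies add up to 1 (up to floating-point rounding), and the last cumulative relative frequency is 1.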

it is important to critically evaluate the statistical analyses we encounter before accepting their conclusions.
biased samples: samples that are not representative of their population; they produce results that are inaccurate or invalid.
data quality: errors in sampling or data processing reduce data quality. some of these errors are avoidable, and the data must be cleaned of them as much as possible.
self selected samples: responses come only from people who choose to respond, such as in call-in surveys; these samples are often biased.
sample size issues: small samples are unreliable; larger samples generally give more reliable estimates, provided they are collected without bias.
undue influence: collecting data or asking questions in a way that influences the response.
what makes a sample biased:
1. failing to represent the entire population
2. low data quality.
3. self selected samples
4. very small sample size
5. undue influence
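The point about very small sample sizes can be illustrated with a small simulation. This is a sketch under assumptions: the population of heights is generated artificially, and the helper `spread_of_sample_means` is a hypothetical name introduced here.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 heights in cm, roughly bell-shaped.
population = [random.gauss(170, 10) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the means of `repeats` random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_sample_means(5)
large = spread_of_sample_means(500)
print(f"spread of sample means, n=5:   {small:.2f}")
print(f"spread of sample means, n=500: {large:.2f}")
```

The means of the small samples vary much more from one sample to the next than the means of the large samples, which is why conclusions drawn from a very small sample are unreliable.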
causality: a relationship between 2 variables does not mean that one causes the other; both may instead be related to a third variable.
self funded or self-interested studies: a study performed by a person or organization in order to support their own claim.
confounding: occurs when the effects of multiple factors cannot be separated, so it is unclear which factor is responsible for the result.

data frame: the standard tabular format for storing statistical data. the columns of the table are called variables.
if the data in a column (variable) are numeric, we call it a quantitative variable or numeric variable.
if the data in a column are qualitative or level values (aka. strings or enums), we call it a factor.
the rows of the table are called observations and correspond to the subjects.
example:
- there are 100 rows in this data frame (table) => 100 subjects
- each row has three variables (columns): id, sex, height
- the variables (columns) id and height are quantitative.
- the variable sex is a factor.
quantitative discrete data: data that result from counting, e.g. 1, 2, 3 .. (aka. integers)
quantitative continuous data: data that result from measuring on a continuous scale, e.g. angles in radians, weights .. (aka. floats).
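A data frame and its variable types can be sketched with plain Python. The four rows below are made-up values using the same three variables (id, sex, height) as the example, and `variable_kind` is a hypothetical helper. Classifying by Python type is a simplification: integer codes can also stand for the levels of a factor.

```python
# A tiny data frame stored as a mapping from variable (column) name to values.
# The values are made up for illustration.
data_frame = {
    "id": [1, 2, 3, 4],
    "sex": ["F", "M", "F", "M"],
    "height": [162.5, 178.0, 170.2, 181.3],
}

def variable_kind(values):
    """Classify a column by the Python type of its values (a simplification)."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative (continuous)"
    return "factor"

# Each row is one observation, so the number of rows is the column length.
n_observations = len(data_frame["id"])

kinds = {name: variable_kind(column) for name, column in data_frame.items()}
levels = sorted(set(data_frame["sex"]))  # the distinct levels of the factor

print(n_observations, kinds, levels)
```

Here id is discrete, height is continuous, and sex is a factor with two levels.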