2. Sampling and Data structures

data sampling considerations

1. variation in data

  • quantifying how much variation in the data contributes to uncertainty in inference is the statistician’s main concern
  • variation is present in any set of data
  • different samples taken from the same population may contain different data
  • if we are investigating the same sample, then any variation in the results indicates an issue with either the data collection methods or the accuracy of data recording

2. variation in samples

  • two or more samples from the same population can each share the characteristics of the population and still differ from each other.
  • types of sampling:
    1. random sampling: randomly select a given number of elements from the entire population; it is the easier method to apply. eg. choose 30 students randomly from the entire school.
    2. cluster sampling: divide the population into groups (clusters), randomly choose one or more clusters, and study every member of the chosen clusters. eg. randomly choose a class of 30 students and study every student in this class.
  • the size of the sample (the number of observations) is important.
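
The two sampling types above can be sketched in a few lines of Python. This is only an illustration: the school layout (10 classes of 30 students) and all names are assumptions made up for the sketch, not part of the notes.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical school: 10 classes of 30 students each (assumed numbers).
school = {
    f"class_{c}": [f"student_{c}_{s}" for s in range(30)]
    for c in range(10)
}
all_students = [s for members in school.values() for s in members]

# Random sampling: draw 30 distinct students from the entire school.
random_sample = random.sample(all_students, k=30)

# Cluster sampling: pick one class at random and keep every student in it.
chosen_class = random.choice(list(school))
cluster_sample = school[chosen_class]

# A second random sample from the same population will usually differ
# from the first one (variation in samples).
another_sample = random.sample(all_students, k=30)
```

Both samples have the same size, but the random sample is spread over the whole school while the cluster sample comes from a single class.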

3. Frequency

  • frequency distribution is the primary way of summarizing the variability of data.
  • frequency is the number of times a given datum occurs in a data set.
  • the sum of all frequencies must be equal to the sample size.
  • relative frequency = frequency / sample size. it can be written as a fraction, a percentage or a decimal.

    relative frequency

  • so in the frequency table, the value 2 appears 3 times, which makes up 0.15 (or 15%) of all observations (its relative frequency)

  • The sum of the relative frequencies should always be equal to 1.
  • cumulative relative frequency: the running total of the relative frequencies; for each value, it is its own relative frequency plus the relative frequencies of all the values before it.

    cumulative relative frequency

  • the last cumulative relative frequency must always be 1.
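
As a worked illustration of the definitions above, the following Python sketch computes the three kinds of frequency. The data set is made up: it is chosen only so that the value 2 occurs 3 times among 20 observations, which matches the 0.15 figure mentioned earlier; the original table may contain different values.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

# Hypothetical data set of 20 observations; the value 2 occurs 3 times.
data = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 6]
n = len(data)  # sample size

# Frequency: how many times each value occurs.
frequency = Counter(data)
values = sorted(frequency)

# Relative frequency = frequency / sample size (exact fractions avoid rounding).
relative = [Fraction(frequency[v], n) for v in values]

# Cumulative relative frequency: running total of the relative frequencies.
cumulative = list(accumulate(relative))

for v, rel, cum in zip(values, relative, cumulative):
    print(v, frequency[v], float(rel), float(cum))
```

Exact fractions are used here so that the checks "the frequencies sum to the sample size", "the relative frequencies sum to 1" and "the last cumulative relative frequency is 1" hold exactly rather than up to floating-point rounding.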

4. critical evaluation

  • it is important to critically evaluate the statistical analyses we encounter before accepting their conclusions.
  • biased samples: samples that are not representative of their population; they produce results that are inaccurate or invalid.
  • data quality: errors in sampling or data processing reduce data quality; some of these errors are avoidable, and the data should be cleaned of them as much as possible.
  • self-selected samples: responses come only from people who choose to respond, as in call-in surveys; such samples are often biased.
  • sample size issues: small samples are unreliable; in general, the larger the sample size the better.
  • undue influence: collecting data or asking questions in a way that influences the response.
  • what makes a sample biased:

    1. failing to represent the entire population
    2. low data quality
    3. self-selected samples
    4. very small sample size
    5. undue influence
  • causality: a relationship between 2 variables does not mean that one causes the other; both may instead be related to a third variable.

  • self-funded or self-interested studies: a study performed by a person or organization in order to support their own claim.
  • confounding: occurs when the effects of multiple factors on a response cannot be separated.
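
The point about sample size can be made concrete with a small simulation. This is a minimal sketch under assumed numbers: the population (10,000 heights with mean 170 and standard deviation 10), the sample sizes 5 and 500, and the helper name `spread_of_sample_means` are all hypothetical and chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 heights in cm (assumed, for illustration only).
population = [random.gauss(170, 10) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_sample_means(5)    # means of very small samples
large = spread_of_sample_means(500)  # means of much larger samples
```

Under these assumptions, `small` should come out noticeably larger than `large`: the mean of a 5-element sample jumps around much more from sample to sample than the mean of a 500-element sample, which is why conclusions drawn from very small samples are less reliable.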

R data structures

  • data frame: the standard tabular format for storing statistical data; the columns of the table are called variables.
  • if the data of the columns (variables) are numeric, we call them quantitative variables or numeric variables.
  • if the data of the columns are qualitative or level values (aka. strings or enums), we call them factors.
  • the rows of the table are called observations and correspond to the subjects.
  • example:
    • data frame example
    • there are 100 rows in this data frame (table) => 100 subjects
    • each row has three variables (columns): id, sex, height
    • the variables (columns) id and height are quantitative.
    • the variable sex is a factor.
  • quantitative discrete data: data that result from counting, eg. 1, 2, 3, .. (aka. integers)
  • quantitative continuous data: data that result from measuring on a continuous scale, eg. angles in radians, weights, .. (aka. floats).
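
The notes describe R data frames; as a rough analogue only, the same ideas can be sketched in plain Python. The column names id, sex and height come from the example above, while the three rows, their values, and the helper `variable_kind` are made up for illustration. In R itself the table would be a data frame and sex would be stored as a factor.

```python
# Made-up observations; only the column names (id, sex, height) come from the notes.
rows = [
    {"id": 1, "sex": "F", "height": 165.2},
    {"id": 2, "sex": "M", "height": 178.0},
    {"id": 3, "sex": "F", "height": 171.5},
]

def variable_kind(column):
    """Classify a column as discrete quantitative, continuous quantitative, or factor."""
    values = [row[column] for row in rows]
    if all(isinstance(v, int) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative (continuous)"
    return "factor"

# One entry per variable (column) of the table.
kinds = {column: variable_kind(column) for column in rows[0]}

# The distinct levels of the factor variable.
levels = sorted({row["sex"] for row in rows})

print(kinds)
print(levels)
```

Each dictionary plays the role of one observation (row), and each key plays the role of one variable (column). The type check is a simplification: it treats whole numbers as counted (discrete) data and decimals as measured (continuous) data.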