Chapter 1:
Data Collection

161250 Data Analysis

Census, Survey & Experiments

  • In a census, every element of the population is contacted, counted, and other information collected or evaluated.

  • In a survey, a sample from the population is selected (using sampling schemes) and information collected.

  • Experiments are more active than surveys because a treatment is applied. The aim of an experiment is to study the effects of induced conditions.

Types of data

  • Categorical (or Qualitative)

    • Nominal data: a label is assigned to each category

      Mathematical manipulation of nominal data makes no sense

    • Ordinal data, e.g. rating scales

  • Quantitative Data

    • Discrete data
    • Continuous data
  • Ratio & Interval scales

    • interval scale - differences (subtraction) are meaningful, but multiplication & division may not be, since there is no true zero
    • ratio scale - all arithmetic manipulation can be done
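
The difference between the two scales can be illustrated with temperature. The sketch below is only an illustration: the two readings are made-up values, and Celsius and Kelvin are used as assumed examples of an interval scale and a ratio scale respectively.

```python
# Celsius is an interval scale (arbitrary zero); Kelvin is a ratio scale (true zero).
def celsius_to_kelvin(c):
    return c + 273.15

a_c, b_c = 10.0, 20.0  # two hypothetical readings in degrees Celsius

# Differences agree on both scales, so subtraction is meaningful.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios disagree: 20 C is not "twice as hot" as 10 C.
ratio_c = b_c / a_c
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 4))
```

The ratio computed in Celsius depends on where zero was placed, which is why it has no physical meaning.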

Measurement issues

  • Measuring Devices or Instruments

    • a physical device - measuring rule to gauge the heights of plants
    • a counting device - a Geiger counter for measuring radioactive material
    • a questionnaire - requires a more subjective response.
  • Measurement Error

    • arises if the instrument tends to be faulty
  • Indirect measures

    • e.g. measure of fitness - using BMI?
    • temperature - gauged by expansion of mercury!

Non-response

  • a non-sampling error

  • Selection stage: an element may be selected but not found

    • e.g. sheep in a flock may each be tagged with an individual identification number, but one may not be found at the time of the survey.
  • Collection stage: it may not be possible to take a measurement

    • some respondents may forget, or refuse, to answer the questionnaire
  • Documentation stage

    • Incorrect record of measurement
  • Call-backs reduce non-response

Principle of randomisation

  • to avoid bias
  • to select a sample with properties similar to those of the population
  • to estimate how closely a sample reflects the population
  • epsem (equal probability of selection) sampling methods follow the randomisation principle

Bias vs variance

  • sampling variation (i.e. sample-to-sample variation) is different from bias
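
A small simulation can make the distinction concrete. This is a sketch under stated assumptions: the population is synthetic, and the "biased" scheme (only the larger half of the population can be reached) is a hypothetical illustration, not a method from these notes.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic population (assumed for illustration only).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Hypothetical biased frame: only the larger half of the units can be reached.
reachable = sorted(population)[len(population) // 2:]

def sample_mean(frame, n):
    """Mean of a simple random sample of size n drawn from frame."""
    return statistics.mean(rng.sample(frame, n))

srs_means = [sample_mean(population, 50) for _ in range(500)]
biased_means = [sample_mean(reachable, 50) for _ in range(500)]

# Bias: how far the average estimate sits from the truth.
srs_bias = statistics.mean(srs_means) - true_mean
biased_bias = statistics.mean(biased_means) - true_mean

# Sampling variation: how much the estimates differ from sample to sample.
srs_spread = statistics.stdev(srs_means)
biased_spread = statistics.stdev(biased_means)

print(round(srs_bias, 2), round(srs_spread, 2))
print(round(biased_bias, 2), round(biased_spread, 2))
```

Both schemes show sample-to-sample variation, but only the restricted frame gives estimates that are systematically off target. Taking a larger sample reduces the spread, not the bias.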

Simple Random Sampling (SRS)

  • Random - refers to process not outcome

    • Each (sampling) unit has same chance of being selected
    • Units can be selected with or without replacement

  • To apply the SRS method, the population needs to be homogeneous

  • SRS is easy to handle; it is suitable even with a poor sampling frame

  • SRS can be costlier; may lead to a politically incorrect sample!

  • SRS estimates are usually more variable than those from stratified sampling
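
A minimal sketch of drawing a simple random sample with and without replacement, using Python's standard `random` module. The frame of 100 unit labels is a made-up example.

```python
import random

rng = random.Random(1)

# Hypothetical sampling frame: unit labels 1..100.
frame = list(range(1, 101))

# Without replacement: no unit can appear twice in the sample.
without_repl = rng.sample(frame, 10)

# With replacement: the same unit may be drawn more than once.
with_repl = rng.choices(frame, k=10)

print(sorted(without_repl))
print(sorted(with_repl))
```

In both cases every unit has the same chance of selection on each draw; "random" describes that process, not how the resulting sample happens to look.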

Stratified Random Sampling (STRS)

  • Suitable for heterogeneous populations

  • Population is divided into homogeneous groups called strata and samples are selected from each stratum

  • Sampling Approaches

    • Sample the larger strata more heavily (suitable when all the strata are equally variable)
    • Sample the more varied strata more heavily
  • Advantages of STRS

    • leads to efficient estimation. That is, the variance (of an estimate) is usually less than that under SRS
    • sample is spread throughout population
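
The two sampling approaches above can be sketched as allocation rules. The notes do not name them; the code below assumes they correspond to proportional allocation (sample size proportional to stratum size) and Neyman allocation (proportional to stratum size times stratum standard deviation). The stratum names, sizes and standard deviations are hypothetical.

```python
def proportional_allocation(n, sizes):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(sizes.values())
    return {h: n * size / total for h, size in sizes.items()}

def neyman_allocation(n, sizes, sds):
    """Split n in proportion to (stratum size x stratum standard deviation)."""
    weights = {h: sizes[h] * sds[h] for h in sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}

# Hypothetical strata.
sizes = {"urban": 6000, "rural": 4000}
sds = {"urban": 5.0, "rural": 15.0}

prop = proportional_allocation(100, sizes)
ney = neyman_allocation(100, sizes, sds)

print(prop)
print({h: round(v, 1) for h, v in ney.items()})
```

Here the smaller but more variable stratum receives the larger share under the second rule. In practice the fractional sizes would be rounded, and an SRS drawn within each stratum.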

Cluster sampling

  • A convenient method of sampling

  • population is composed of clusters (groups)

  • Select certain clusters (randomly) and collect measurements from a random selection of the elements within the chosen clusters

  • Larger variance than SRS!
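
A minimal sketch of the selection step described above: choose some clusters at random, then take a random selection of elements within each chosen cluster. The population of 20 clusters with 30 elements each, and the sample sizes, are made up for illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population: 20 clusters (e.g. villages) of 30 elements each.
clusters = {c: [f"c{c}-e{i}" for i in range(30)] for c in range(20)}

# Step 1: select clusters at random.
chosen = rng.sample(sorted(clusters), 4)

# Step 2: select elements at random within each chosen cluster.
sample = [e for c in chosen for e in rng.sample(clusters[c], 5)]

print(chosen)
print(len(sample))
```

Only the chosen clusters need to be visited and listed, which is where the convenience comes from. Elements within a cluster tend to resemble each other, which is one reason the variance is usually larger than under SRS.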

Systematic Random sampling (SyRS)

  • Select every \(k^{th}\) element!

  • Random start within the first block of elements.

    • Convenient, and the sample will tend to be representative of the population

    • Variance of estimates - generally greater than those of SRS

    • Inefficient/inappropriate, if cycle or trend is present
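
A short sketch of systematic selection: pick a random start within the first block of \(k\) elements, then take every \(k\)th element after it. The frame of 100 labels and \(k = 10\) are assumed values for illustration.

```python
import random

def systematic_sample(frame, k, rng):
    """Random start within the first block of k elements, then every k-th element."""
    start = rng.randrange(k)  # index 0..k-1
    return frame[start::k]

rng = random.Random(3)
frame = list(range(1, 101))  # hypothetical ordered frame of 100 units

sample = systematic_sample(frame, 10, rng)
print(sample)
```

If the frame has a cycle whose period matches \(k\) (for example, every 10th unit is a corner house), every selected unit shares that feature, which is why cycles and trends make the method inappropriate.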

PPS (probability proportional to size) sampling

  • each element has its own probability of selection

  • requires a known auxiliary variable (e.g. from a previous census) that gives each element in the population a value (its size)

  • e.g. “Dollar-Unit Sampling”

    • audit sampling depending on the size.
    • That is, the chance of selecting an account is proportional to its value.
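
A minimal sketch of selection with probability proportional to size, using account values as the size measure. The three accounts and their dollar values are hypothetical, and draws are made with replacement to keep the example simple; practical audit schemes often select without replacement.

```python
import random
from collections import Counter

rng = random.Random(11)

# Hypothetical accounts and their dollar values (the "size" variable).
accounts = {"A": 100.0, "B": 900.0, "C": 9000.0}
names = list(accounts)
weights = [accounts[a] for a in names]

# Each draw selects an account with probability proportional to its value.
draws = rng.choices(names, weights=weights, k=10_000)

# Observed selection shares should be close to 1%, 9% and 90%.
share = {a: count / len(draws) for a, count in Counter(draws).items()}
print({a: round(s, 3) for a, s in sorted(share.items())})
```

The large account dominates the selections, reflecting the idea that each dollar, rather than each account, has an equal chance of being sampled.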

Other Sampling methods

  • Multistage
    • e.g. 1st stage - cluster; 2nd stage - SRS
  • Volunteers
    • e.g. Blood donors…
    • randomisation?
  • Snowball / opportunity / Purposive
    • e.g. Study of HIV/AIDS patients…
    • randomisation?
  • Capture-Recapture Methods
    • e.g. estimating wildlife populations
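
The notes do not say which capture-recapture estimator is intended. As one common choice, the sketch below assumes the Lincoln-Petersen estimator: mark \(n_1\) animals, later capture \(n_2\), and if \(m\) of those are marked, estimate the population as \(n_1 n_2 / m\). The counts used are made-up numbers.

```python
def lincoln_petersen(n_marked_first, n_second, n_recaptured):
    """Estimate population size as (first catch x second catch) / recaptures.

    Assumes a closed population and that marked animals are as likely
    to be caught as unmarked ones.
    """
    if n_recaptured == 0:
        raise ValueError("no marked animals recaptured; estimate is undefined")
    return n_marked_first * n_second / n_recaptured

# Hypothetical survey: 100 marked, 80 caught later, 20 of them marked.
estimate = lincoln_petersen(100, 80, 20)
print(estimate)
```

The reasoning is that the marked fraction in the second catch (20 of 80) should roughly equal the marked fraction in the whole population (100 of N).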

Example

Effective Sample Size (rule of thumb)

Sample Design   Design Effect (\(d\))   Effective Sample Size (\(\frac{n}{d}\))
SRS             1.00                    \(n\)
STRS            0.80 to 0.90            \(\frac{n}{0.9}\) to \(\frac{n}{0.8}\)
Cluster         1.02 to 1.26            \(\frac{n}{1.26}\) to \(\frac{n}{1.02}\)
SyRS            1.05                    \(\frac{n}{1.05}\)
Quota           2                       \(\frac{n}{2}\)
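
The rule of thumb can be applied directly: divide the actual sample size \(n\) by the design effect \(d\). The sketch below uses the single-valued design effects from the table and an assumed sample size of \(n = 1000\).

```python
def effective_sample_size(n, d):
    """Effective sample size n / d for a design with design effect d."""
    return n / d

# Design effects taken from the table; n = 1000 is an assumed example value.
design_effects = {"SRS": 1.00, "SyRS": 1.05, "Quota": 2.0}
n = 1000

ess = {design: effective_sample_size(n, d) for design, d in design_effects.items()}
print({design: round(value, 1) for design, value in ess.items()})
```

Read this way, a quota sample of 1000 carries roughly the information of an SRS of 500, while a design with \(d < 1\), such as STRS, behaves like a larger SRS.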

Summary

  • Issues to address

    • WHAT data are collected?
    • WHO does the data collection?
    • HOW are the data collected?
  • Bias occurs due to

    • SELECTION
    • COLLECTION
    • NON-RESPONSE (the single largest cause of bias!)
  • A sample may have the same biases as a census along with sampling errors