STOR 155 Introduction to Data Models and Inference

Aug 20, 2024 » Teaching

From Deterministic to Probabilistic Scenarios

Modeling a problem that has a definite solution versus those that involve uncertainty and probability.
Question 1: Chris and Amy bought a total of 10 books. Chris bought 4 books. How many books did Amy buy?
1. Set a variable x to formulate the problem.
2. Solve for x.
Question 2: Minji has perfectly symmetrical dice. Roll the dice, and let X be the number on the top.
1. What is your guess for X?
2. Are you 100% certain about your guess in (1)?
3. If we’re not 100% certain, does that mean we know nothing about X? What can we still say about X?
To describe a random variable, we can talk about possible outcomes and probabilities assigned to them.
What are some other examples of random variables?
- Is it common to find randomness in data from everyday life?
- The dice example used a discrete set of numbers. Think of a random variable that uses a continuous set of numbers.

What Do We Learn from STOR 155?

Four steps of scientific inquiry:
1. Identify a question or problem of interest.
  Move up as a researcher!
2. Collect relevant data.
  Methods for data collection: sampling strategies, observational studies, experiments, and ways to collect reliable data.
3. Analyze the data.
  Calculating summary statistics, regression and correlation, hypothesis testing, confidence intervals.
4. Form a conclusion.
  Given a confidence interval or the result of a hypothesis test, what can we say about our data?
Statistics as the language of Science!

Case Study: Using Stents to Prevent Strokes

Objective: Evaluate the effectiveness of stents in treating patients at risk of stroke.
Research Question: Does the use of stents reduce the risk of stroke?

Study Details: The researchers conducted an experiment with 451 at-risk patients. Patients randomly assigned 224 patients to the control group and 227 to the treatment group. The table below shows the distribution of patients who had a stroke at the 365-day follow-up.

Group	Stroke
	Yes	No	Total
Treatment	28	199	227
Control	45	179	224
Total	73	378	451

Proportion with stroke in treatment group: approximately 12 %
Proportion with good outcomes in control group: approximately 20%

Understanding the Results

Do the data show a “real” difference between the groups?
- Suppose we have a fair coin and flip it 100 times. Let X represent the number of heads observed.
- What is the expected number of heads and tails? Do we actually observe that in reality?
- While the chance a coin lands heads in any given flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data-generating process.

Generalizing the Results

Are the results of this study generalizable to all at-risk patients?
- This set of patients could have specific characteristics, so it may not represent all stroke at-risk patients.
A Soup Example
- Is an 80% non-random sample “better” than a 5% random sample in measurable terms? 90%? 95%? 99%?
- Which should we trust more: a 1% survey with a 60% response rate or a non-probabilistic dataset covering 80% of the population?
- Think about tasting soup and wanting to know how salty it is.
  - Stir it well, then a few bits are sufficient regardless of the size of the container!
  - Stirring corresponds to a randomization process in statistics.
- This example is from a lecture by Meng: See this YouTube video.