# In Module Two, you learned to visually analyze data

Focus

In Module Two, you learned to visually analyze data via frequency distributions, and then expanded upon this knowledge through an investigation of numerical summaries of location and spread. In Module Three, you will explore probability and distributions.

Probability

Have you ever seen a weather forecaster give a percentage representing the chance of rain? How about looking at the back of a scratch lottery ticket to determine the odds of winning? Each of these examples represents a calculation that determines the relative frequency of an event. At its most basic level, if an event can be defined by counting, then the probability of the event occurring can easily be calculated. For example, when you flip a coin, there are only two possibilities: heads or tails. Only one outcome can happen at a time, and both outcomes are equally likely, so the probability of the coin landing on tails is ½ or 0.50. In more complex situations, the number of outcomes or the probability of occurrence cannot be so easily observed, and therefore a variety of techniques have been developed to help biostatisticians calculate the probabilities of various events.

Probability relies on random variables, whether discrete or continuous. A discrete random variable is one that has a countable set of possible outcomes (often a finite one). For example, the number of breast cancer cases in a particular geographic area at a given time and the number of smokers in a sample of a certain size are each discrete random variables. A continuous random variable is one whose possible values form an unbroken continuum, so there are infinitely many of them. For example, the height of an individual at a given point in time and the amount of time it takes to finish a project are each continuous random variables.

With discrete random variables, we identify the probability of each value occurring as a way to describe the distribution of the variable. There are a few different notations for the probability of a particular event occurring. Let us call the event A. An event is a single value, or a set of values, of the discrete random variable that is of interest; for example, a coin turning up tails. The probability of event A occurring can be represented by Pr(A) or P(A).
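The counting idea above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard routine: the helper name `probability_by_counting` and the two-flip example are assumptions introduced here, and the approach only applies when every outcome is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def probability_by_counting(event_outcomes, sample_space):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes only)."""
    favorable = [o for o in sample_space if o in event_outcomes]
    return Fraction(len(favorable), len(sample_space))


# One coin flip: two equally likely outcomes.
coin = ["H", "T"]
p_tails = probability_by_counting({"T"}, coin)

# Discrete random variable X = number of tails in two flips (hypothetical example).
two_flips = list(product(coin, repeat=2))  # HH, HT, TH, TT
counts = Counter(flips.count("T") for flips in two_flips)
distribution = {x: Fraction(c, len(two_flips)) for x, c in sorted(counts.items())}

print(p_tails)  # 1/2
for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```

Listing the probability of each value of X in this way is what it means to describe the distribution of a discrete random variable.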

Probabilities have certain properties that must hold:

1. All probabilities must be between the values 0 and 1, inclusive.

2. All probabilities in a sample space (the set of all possible values) must sum to exactly 1. For example, in the coin toss example, P(heads) = 0.5 and P(tails) = 0.5, there are no other possible outcomes, and P(heads) + P(tails) = 0.5 + 0.5 = 1.0.

3. The complement of an event is all the other values that are not a part of the event, and the probability of a complement event is equal to 1 minus the probability of the event. For example, suppose you are studying the hair color of a population, and you take note of each person walking by. Seeing someone with red hair is one possible event. Seeing someone without red hair is the complement of this event. So P(not tails) = 1 – P(tails) = 1 – 0.5 = 0.5, and P(red hair) = 1 – P(not having red hair). P(red hair) in this scenario stands for the probability that the next person who walks by you will have red hair.

4. The probabilities of disjoint events (i.e., events that can never occur at the same time) can be added to determine the probability of their union. For example, P(natural red hair or natural blond hair) = P(natural red hair) + P(natural blond hair).
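As a rough check of the four properties, the sketch below uses a made-up hair-color distribution. The probabilities are assumed values chosen only so the arithmetic is easy to follow; they are not real population data.

```python
import math

# Hypothetical hair-color probabilities (illustrative values only, not real data).
hair = {"red": 0.02, "blond": 0.14, "brown": 0.44, "black": 0.40}

# Property 1: every probability lies between 0 and 1, inclusive.
all_valid = all(0.0 <= p <= 1.0 for p in hair.values())

# Property 2: the probabilities over the whole sample space sum to 1.
sums_to_one = math.isclose(sum(hair.values()), 1.0)

# Property 3: the complement rule, P(not red) = 1 - P(red).
p_not_red = 1.0 - hair["red"]

# Property 4: red and blond are disjoint, so P(red or blond) is the sum.
p_red_or_blond = hair["red"] + hair["blond"]

print(all_valid, sums_to_one)
print(f"P(not red) = {p_not_red:.2f}")
print(f"P(red or blond) = {p_red_or_blond:.2f}")
```

The sum is compared with `math.isclose` rather than `==` because floating-point addition can introduce tiny rounding errors.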

More advanced rules of probability will also be covered, each of which will aid in problem solving and applications. The concept of independence is fundamental to probability. Two events A and B are independent if and only if the probability of both events occurring is the product of their individual probabilities of occurring. In statistical notation, P(A and B) = P(A) × P(B).

Conditional probability is also important. It is the probability that event B will occur given that another event A is known to have occurred, and it can be calculated using the following formula: P(B|A) = P(A and B) / P(A). For example, what is the probability of being diagnosed with breast cancer (event B) given that your mother had breast cancer (event A)?

This module will also introduce you to Bayes’ theorem, which determines how a probability is affected by new information. There are many ways biostatisticians apply this theorem. For example, suppose you are trying to find the probability of selecting someone who has a college degree, and you know that 47% of the population has a college degree. With no other information, the probability of selecting such an individual from the population is 0.47. However, suppose that you later learn that 70% of people who have a high school diploma also have a college degree. You can take this new piece of information and use Bayes’ theorem to calculate a revised, more precise probability that the person selected has a college degree, given that the person has a high school diploma.
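The three ideas above (independence, conditional probability, and Bayes’ theorem) can be tried out on a small table of counts. The counts below are invented purely for illustration and do not describe any real population; Bayes’ theorem is used here in its standard form, P(A|B) = P(B|A) × P(A) / P(B).

```python
from fractions import Fraction

# Hypothetical counts for 1,000 women (illustrative only, not real data).
# A = mother had breast cancer, B = woman is diagnosed with breast cancer.
n_total = 1000
n_a = 100        # women for whom A occurred
n_b = 120        # women for whom B occurred
n_a_and_b = 30   # women for whom both A and B occurred

p_a = Fraction(n_a, n_total)
p_b = Fraction(n_b, n_total)
p_a_and_b = Fraction(n_a_and_b, n_total)

# Conditional probability: P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

# Independence check: A and B are independent iff P(A and B) = P(A) * P(B).
independent = p_a_and_b == p_a * p_b

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B|A) = {p_b_given_a}")
print(f"Independent? {independent}")
print(f"P(A|B) = {p_a_given_b}")
```

With these made-up counts, P(B|A) is larger than P(B), so knowing that A occurred changes the probability of B, which is exactly what it means for the two events not to be independent.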

Understanding Distributions

It is helpful in biostatistics to not only learn the general concepts but to delve into the specifics, too. Depending on the facts given, certain rules, concepts, and techniques will be applicable, which provide more precise answers. For example, biostatisticians will often talk generally about the distribution of a population, but given certain characteristics, you will be able to deduce exactly the type of distribution to which the biostatistician refers. For example, the most commonly used distribution for discrete random variables is the binomial distribution.

There are three main characteristics of the binomial distribution:

1. There are a certain number of observations, n.

2. Each observation can be classified as a success or failure.

3. The probability of having a success for each observation is defined by some constant number, p.

Because of its basic assumptions, calculating a binomial probability is straightforward. Binomial probabilities are calculated using the formula $P(X = x) = {}_nC_x \, p^x q^{n-x}$. The term ${}_nC_x$ is called the binomial coefficient, n is the number of independent trials (for example, the number of coins you tossed or the number of people you include in your sample), x is the number of successes of interest, p is the probability of success (the probability of getting a specific outcome), and q = 1 – p is the probability of failure. Students may recall that the binomial coefficient is also called the “choose function.” It is common to denote that a variable follows a binomial distribution using the notation X ~ b(n, p), which is read aloud as, “X is distributed as a binomial random variable with parameters n and p.” But what is X? X is the number of successes.

So, X can be the number of coins that landed tails or the number of people with red hair in your sample, and X takes a value between 0 and n. If you wanted to know the probability of three of your four children having red hair, then n = 4, x = 3, p = P(a child of yours has red hair), which depends on your genes, by the way, and q = 1 – p = P(a child of yours does not have red hair). The binomial coefficient is the number of ways you can select three items out of four. Think of arranging your children by order of birth: NRRR, RNRR, RRNR, RRRN. There are four ways, or four places, that the non-redhead (N) can occur.

Often in practice, a researcher does not just want to know the probability of a specific value occurring, but rather the probability of observing a certain value or less, i.e., P(X ≤ x). For this, biostatisticians will compute a cumulative probability. The same calculation answers “greater than” questions through the complement. For example, you might want to know the probability that the number of breast cancer cases in your community is greater than 25 because your cancer center can only handle 25 patients at a time; this is P(X > 25) = 1 – P(X ≤ 25).

When probability distributions are discussed, it is important to remember that the mean of a distribution (for a discrete random variable) is often referred to as its expected value. When all is said and done, a binomial distribution is useful when a researcher wants to determine whether a specific number of “successful” outcomes is expected or not.

Whereas the binomial distribution is the most commonly used distribution for discrete random variables, the normal distribution is the most commonly used for continuous random variables. If you compare a histogram (a typical representation of the binomial distribution) with the normal distribution (bell-shaped line graph), you will notice that, for continuous random variables, the probability distribution is much smoother. See Figure 3.1.

Figure 3.1: Histogram of plating thickness in mils (x-axis, 3.41 to 3.57) by percent (y-axis, 0 to 25), with a smooth curve drawn over the bars (Base SAS (R) 9.2 Procedures Guide, 2015)
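Returning to the binomial formula and the cumulative probability described above, the following sketch works through the four-children example. The value p = 0.25 is an assumption made only for illustration (the text does not give a value), and the function names `binomial_pmf` and `binomial_cdf` are hypothetical helpers written for this example.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1.0 - p
    return comb(n, x) * p**x * q ** (n - x)


def binomial_cdf(x, n, p):
    """Cumulative probability P(X <= x), summing the pmf from 0 to x."""
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))


n = 4
p = 0.25  # assumed probability that a child has red hair (illustrative only)

ways = comb(4, 3)                        # number of birth orders with 3 redheads
p_three = binomial_pmf(3, n, p)          # P(X = 3)
p_at_most_one = binomial_cdf(1, n, p)    # P(X <= 1)
p_more_than_one = 1.0 - p_at_most_one    # P(X > 1), via the complement
expected_value = n * p                   # mean of a binomial random variable

print(ways)
print(p_three)
print(p_at_most_one, p_more_than_one)
print(expected_value)
```

The "greater than" probability is obtained as 1 minus the cumulative probability, the same step that would be used for a question such as P(X > 25) in the cancer center example.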

The normal distribution is defined by its mean, µ, and its standard deviation, σ. The shorthand notation for the normal distribution is X ~ N(µ, σ), which is read aloud as, “X is distributed as a normal random variable with mean, µ, and standard deviation, σ.” The mean value tells us where the center of the distribution falls on the number line, and the standard deviation measures how spread out the data are around the mean value. The larger the standard deviation, the more spread out the data. Normal distributions are often described as “bell-shaped” and symmetrical, and their mean, median, and mode match (if they are perfectly symmetrical).

Because all normal distributions share the same symmetric shape, generalizations can be made. For example, approximately 68% of the observations/data points fall within one standard deviation of the mean (meaning between one standard deviation below the mean and one standard deviation above the mean), approximately 95% of the observations fall within two standard deviations of the mean, and approximately 99.7% of the observations fall within three standard deviations of the mean. The normal distribution can be used to draw many inferences about a population and allows us to compare different normal distributions to identify which population might have a larger mean or more variation/spread in its data.

A key takeaway from this discussion of the normal distribution is that one cannot simply apply a normal distribution to any sample or population. One must ensure that the random variable of interest actually is approximately normal before applying the distribution’s properties. In order to make this assessment, first examine the shape of the distribution using familiar graphing techniques such as stem plots, box plots, or histograms. If it looks very different from the traditional bell shape, then the normal distribution should not be applied to the data.
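The 68%, 95%, and 99.7% figures quoted above can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `prob_within` and the height parameters (mean 170, standard deviation 10) are assumptions chosen only for illustration.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# Standard normal (mean 0, standard deviation 1).
for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")

# The same proportions hold for any mean and standard deviation,
# e.g. hypothetical heights with mean 170 and standard deviation 10.
print(f"heights within 1 SD: {prob_within(1, mu=170.0, sigma=10.0):.4f}")
```

The proportions do not depend on µ or σ, which is why the rule can be stated once for every normal distribution. It says nothing about data that are not approximately normal, so the graphical check described above still comes first.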

In Module Four, you will be introduced to statistical inference, in which biostatisticians use data from a sample to draw conclusions about a population. You will also learn about hypothesis testing, which is one of the more common ways in which statistical inferences are made. The binomial and normal distributions are used to make these statistical inferences.

References

Base SAS (R) 9.2 Procedures Guide: Statistical Procedures, 3rd ed. (2015). Retrieved from http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect074.htm

3-1 Discussion: Probability Distributions and Questions

Available on Jun 18, 2022 11:59 PM.

In this module, you have learned about two of the most frequently used probability distributions in biostatistics: the binomial and normal distributions. The binomial distribution and the normal distribution are similar in several ways.

In fact, under certain conditions, the normal distribution is used to approximate the binomial. For your initial post, discuss the following three questions. Provide examples of data that follow each distribution to help illustrate your points.

What are the basic differences between the two distributions?

Under what circumstances do you think it works well to approximate the binomial using the normal, considering the differences?

Under what public health or medical circumstances would it be helpful to identify the probability of an event? Provide some real-life examples.

To complete this assignment, review the Discussion Rubric PDF document.