## Descriptive Stats

In this section we will cover the following descriptive stats:

- mean
- median
- mode
- variance
- standard deviation
- inter-quartile range
- proportion

### Scenario 1: Dependent Variable is Continuous Numeric

In this scenario you describe your dependent variable with a pair of descriptors: a measure of centre and a measure of spread.

Measures of centre include the arithmetic mean (a.k.a. the average), the median, and the mode. Most of the time we use the average, and it is meant to give an idea of a "typical" value in your dataset.

On its own, a measure of centre is not particularly useful, because one doesn't know if all data points are close to the "typical" (average) value, or if most of them are quite different from the average (in which case your average value isn't particularly "typical"!). This is why we need to describe the spread of your data too. Measures of spread include the variance \(s^{2}\), the standard deviation \(s\), and the inter-quartile range (IQR). The IQR is the difference between the values in the dataset that lie at the 25th and 75th percentiles (or first and third quartiles). We do not discuss the IQR further here.

You'll learn in BIOL202 that characteristics of your data dictate which pair of descriptors is best suited for describing them. For now, you should know that the average should be paired with the standard deviation, and this is typically the preferred pair to use. The median should be paired with the IQR. The mode is less commonly used in biology.
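As an illustration of the two pairings, here is a minimal Python sketch using the standard `statistics` module. The `heights` values are hypothetical and are not course data. Software packages use different conventions for locating quartiles, so the `method="inclusive"` choice below is an assumption, and your calculator or other software may report a slightly different IQR.

```python
import statistics

# Hypothetical plant heights in cm (illustrative values only)
heights = [12.1, 13.4, 11.8, 14.0, 12.9, 13.4, 15.2, 12.5]

# Pair 1: mean with standard deviation
mean_height = statistics.mean(heights)
sd_height = statistics.stdev(heights)  # sample standard deviation (n - 1 denominator)

# Pair 2: median with inter-quartile range
median_height = statistics.median(heights)
# quantiles(n=4) returns the three cut points: Q1, Q2 (the median), Q3
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

# The mode is the most frequently occurring value
mode_height = statistics.mode(heights)

print("mean:", mean_height, "sd:", sd_height)
print("median:", median_height, "IQR:", iqr)
print("mode:", mode_height)
```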

**Variance \(s^{2}\)** is a measure of how far, on average, data values deviate from the mean. A small variance indicates the data are tightly clustered around the mean. The larger the variance, the more spread out the data. Variance is calculated by summing all the squared deviations from the mean (a deviation is the difference between an individual measurement and the mean) and dividing this sum by the number of data entries minus one.

\[s^{2} =\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}{n-1}\]

**Standard deviation \(s\)** is simply the square root of the variance.

\[s =\sqrt{(\frac{1}{n-1})\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\]
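The by-hand calculation can be sketched step by step in Python. The five `heights` values are hypothetical, chosen only to keep the arithmetic short. The last lines compare the result against the `statistics` module's built-in sample variance and standard deviation.

```python
import math
import statistics

# Hypothetical plant heights in cm (illustrative values only)
heights = [12.1, 13.4, 11.8, 14.0, 12.9]

n = len(heights)
mean_height = sum(heights) / n

# A deviation is the difference between one measurement and the mean
squared_deviations = [(x - mean_height) ** 2 for x in heights]

# Sum the squared deviations and divide by n - 1 to get the variance
variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance
sd = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", sd)

# Compare with the library's sample variance and standard deviation
print(math.isclose(variance, statistics.variance(heights)))
print(math.isclose(sd, statistics.stdev(heights)))
```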

**Calculating and presenting your descriptive statistics**

PRO TIP: Most scientific calculators have a "DATA" mode that includes a set of functions for calculating descriptive statistics such as the average and standard deviation. But it is still important that you understand how to do calculations by hand, and what the descriptors represent.

For detailed instructions on how to calculate the mean and standard deviation by hand, consult this web resource. If you're keen, you can consult these instructions on how to use the "R" software to do the calculations.

Based on the UBCO Biology Guidelines for data presentation, measures of spread should be reported to one more decimal place than the number of decimal places that the data entries contain.
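A small sketch of how this rounding rule might be applied in Python. The data are hypothetical and recorded to one decimal place, so the standard deviation is formatted to two. The variable names are illustrative only.

```python
import statistics

# Hypothetical plant heights in cm, each recorded to 1 decimal place
heights = [12.1, 13.4, 11.8, 14.0, 12.9]
data_decimals = 1

sd = statistics.stdev(heights)

# Report the measure of spread to one more decimal place than the data
report_decimals = data_decimals + 1
sd_reported = f"{sd:.{report_decimals}f}"

print("standard deviation to report:", sd_reported)
```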

If your experiment included a treatment variable that is categorical, like in our plant experiment where the three temperature treatments are handled as ordered categories (10, 20, and 30 degrees), then you should plan to calculate your descriptive statistics on the dependent variable (plant height) using the data from each of the treatment groups separately. So in our example you'd calculate the average and standard deviation of plant height for each of the three groups of height measurements, and report these in a table as described in the Procedures and Guidelines document.
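The per-group calculation could look like the following Python sketch. The heights for each temperature treatment are invented for illustration, and the dictionary layout is just one convenient way to organise the data, not a required format.

```python
import statistics

# Hypothetical plant heights (cm) for each temperature treatment (degrees)
heights_by_temperature = {
    "10": [8.2, 7.9, 8.8, 8.5],
    "20": [12.1, 13.4, 11.8, 14.0],
    "30": [10.3, 9.7, 11.0, 10.4],
}

# Calculate the descriptive statistics separately for each treatment group
summary = {}
for treatment, heights in heights_by_temperature.items():
    summary[treatment] = {
        "n": len(heights),
        "mean": statistics.mean(heights),
        "sd": statistics.stdev(heights),
    }

for treatment, stats in summary.items():
    print(treatment, "n =", stats["n"],
          "mean =", round(stats["mean"], 2),
          "sd =", round(stats["sd"], 2))
```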

### Scenario 2: Dependent Variable is Categorical

Recall that for categorical dependent variables, like "food type" in our example choice experiment, we tally the frequency with which each category occurs. Imagine that the low protein food was chosen by 6 of the mice, the high fibre food was chosen 4 times, and the high protein food was chosen 10 times. These data are very straightforward to present "as is" in a table, but we should also calculate the main descriptive statistic for categorical data, called the "proportion" \(p\). That is simply the frequency of the particular category (say, high fibre food) divided by the total number of trials (or total sample size), here 20. Thus, for the high fibre food, the corresponding proportion \(p = 4/20 = 0.2\). The proportion value should always fall between 0 and 1. You should plan to report the raw frequency of each category alongside its corresponding proportion.
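The same calculation can be sketched in Python using the frequencies from the example above (6, 4, and 10 choices out of 20 trials). The variable names are illustrative.

```python
# Frequencies from the example choice experiment
counts = {"low protein": 6, "high fibre": 4, "high protein": 10}

# Total number of trials (total sample size)
n = sum(counts.values())

# Proportion = frequency of the category divided by the total number of trials
proportions = {food: count / n for food, count in counts.items()}

# Report the raw frequency alongside its proportion
for food, count in counts.items():
    print(food, "frequency =", count, "proportion =", proportions[food])
```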

Be sure to report your sample size \(n\) for each treatment group, regardless of what type of variable your dependent variable is. Also make sure to tally any missing values in any of the groups (e.g. those that may have arisen from problems during the experiment).

If your choice experiment includes an independent categorical variable, such as sex (male / female), then you should calculate and report the raw frequencies and corresponding proportions for each category of the dependent variable (food type) separately within each category of the independent variable (here, male and female).
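A minimal Python sketch of this grouped calculation follows. The split of choices between female and male mice is hypothetical, invented so that the totals match the 6, 4, and 10 choices in the example above. Each proportion is calculated within a sex, so the proportions for each sex sum to 1.

```python
# Hypothetical frequencies of food choice, split by sex (illustrative only)
counts_by_sex = {
    "female": {"low protein": 3, "high fibre": 1, "high protein": 6},
    "male": {"low protein": 3, "high fibre": 3, "high protein": 4},
}

proportions_by_sex = {}
for sex, counts in counts_by_sex.items():
    n = sum(counts.values())  # sample size for this sex
    proportions_by_sex[sex] = {food: count / n for food, count in counts.items()}

# Report the raw frequencies alongside the proportions for each sex
for sex, counts in counts_by_sex.items():
    for food, count in counts.items():
        print(sex, food, "frequency =", count,
              "proportion =", proportions_by_sex[sex][food])
```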