Descriptive statistics

Tips for teaching maths skills to our future chemists, by Paul Yates of Keele University

Most students will be familiar with the concept of laboratory measurements, particularly in physical chemistry. As chemists we read burettes, thermometers, stopwatches, pH meters and a multitude of other instruments, and the values read are fed into equations to calculate quantities such as concentrations, enthalpy changes, rate constants and electrode potentials, to name a few. Normally we get only one chance to perform a given experiment, although we may take multiple readings in some cases.

Textbooks, however, are full of well-established quantities whose values we take for granted, eg the gas constant, the electronic charge and the acceleration due to gravity. A moment's thought will suggest that these values were not arrived at as a result of a single laboratory session of two to three hours!

Descriptive statistics

Descriptive statistics allow us to deal with multiple measurements of the same quantity. As a very simple example, consider the following sets of multiple measurements:

5 10 15 20 25

13 14 15 16 17

It is fairly obvious that the "middle" value is 15 in both cases. However, it is also fairly clear that there is a much greater "spread" of values in the first case. In real data sets, things are often not quite so obvious, so we need to use some statistical techniques to describe the data, hence the term descriptive statistics.
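As a rough illustration of the point, the short Python sketch below compares the two sets. The use of the range (largest value minus smallest) as a quick indicator of spread is an assumption made for this example only, as are the names `set_a`, `set_b` and `middle_and_range`.

```python
from statistics import mean

# The two simple data sets quoted above
set_a = [5, 10, 15, 20, 25]
set_b = [13, 14, 15, 16, 17]


def middle_and_range(values):
    """Return the mean and the range (largest minus smallest value)."""
    return mean(values), max(values) - min(values)


middle_a, range_a = middle_and_range(set_a)
middle_b, range_b = middle_and_range(set_b)

# Both sets share the same middle value, but the first is far more spread out
print("Set A: middle =", middle_a, "range =", range_a)
print("Set B: middle =", middle_b, "range =", range_b)
```

The range is used here only because it is easy to read off; more useful measures of spread are calculated later.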

To illustrate the techniques we will work with a more realistic, although still relatively simple, data set. The values below represent integral absorbance determinations of a series of identical solutions containing 10 ng of Sb(III)¹:

501 499 496 496 505 507

As a useful first step, rearrange the data in ascending order to give:

496 496 499 501 505 507

We are now in a position to calculate the location and spread of the data.

Location

This gives us a way of determining which single value should be quoted to represent our set of data. There are a number of ways of calculating it.

The most common and useful is known as the mean. To determine this we simply add up all the values and divide by the number of data. If we are measuring x, the mean would be denoted as x̄. In this case

x̄ = (496 + 496 + 499 + 501 + 505 + 507)/6

= 3004/6

= 501 (to the nearest whole number)

One would naturally expect the value obtained to lie within the range of data given, thus providing a simple check on reasonableness.

The median is the middle value when the values are placed in order. In this case we have an even number of data, so we take the mean of the two central values, the third and fourth values. This achieves the objective of the median, which is to split the data into two halves.² Consequently, the median is

(499 + 501)/2 = 500

This can be a useful value if the data are skewed to the high or low end of values.

The mode is the value which occurs most frequently. In our example the value is 496, as this is the only one which occurs twice. But the mode is not necessarily a good indication of central tendency,² as can be seen in this example where the mode is at the extreme of the values obtained.
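For readers who want to check these three measures of location by computer, a minimal sketch using Python's standard `statistics` module is given below. The variable names are chosen for this example only, and `multimode` assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Integral absorbance values from the data set above
absorbances = [501, 499, 496, 496, 505, 507]

# Sorting is not required by these functions, but mirrors the manual method
ordered = sorted(absorbances)

mean_value = mean(ordered)      # sum of the values divided by their number
median_value = median(ordered)  # mean of the two central values when n is even
modes = multimode(ordered)      # list of the most frequently occurring values

print("mean   =", round(mean_value))
print("median =", median_value)
print("mode   =", modes)
```

Note that the unrounded mean is 3004/6, so it is rounded here to match the whole number quoted in the text.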

Spread

The easiest quantity to calculate to give a measure of spread is the mean absolute deviation. To obtain this we first subtract the mean from each value:

496 - 501 = -5
496 - 501 = -5
499 - 501 = -2
501 - 501 = 0
505 - 501 = 4
507 - 501 = 6

The values obtained are known as deviations. We now take what is termed the absolute value of each value obtained - each value being considered positive, regardless of whether or not it is preceded by a negative sign. The set of values thus becomes

5 5 2 0 4 6

These are the absolute deviations. Finally we take the mean of these deviations to give

(5 + 5 + 2 + 0 + 4 + 6)/6 = 3.7
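The same steps can be sketched in Python. This example follows the manual calculation by using the mean rounded to 501; the variable names are illustrative only.

```python
# Ordered absorbance values from the data set above
values = [496, 496, 499, 501, 505, 507]

# Mean rounded to the nearest whole number, as used in the manual calculation
rounded_mean = 501

# Step 1: subtract the mean from each value to obtain the deviations
deviations = [v - rounded_mean for v in values]

# Step 2: take the absolute value of each deviation
absolute_deviations = [abs(d) for d in deviations]

# Step 3: take the mean of the absolute deviations
mean_absolute_deviation = sum(absolute_deviations) / len(values)

print(deviations)
print(absolute_deviations)
print(round(mean_absolute_deviation, 1))
```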

It is, however, more useful to calculate what is known as the standard deviation, σ or s. The reasons for this will be clearer once we have some numbers to work with. Continuing with the same data set, we take each of the deviations determined above and square them:

(-5)² = 25
(-5)² = 25
(-2)² = 4
0² = 0
4² = 16
6² = 36

We now add them to give

25 + 25 + 4 + 0 + 16 + 36 = 106

We divide the sum by the number of data items minus one, in this case 6 - 1 = 5 to give

106/5 = 21.2

The final stage is to take the square root of this number, which is 4.6. This is the standard deviation of the data.
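A short Python sketch of this calculation is shown below. It reproduces the manual route using the rounded mean of 501, and then compares the result with `statistics.stdev`, which works from the unrounded mean and also divides by n - 1. The variable names are assumptions for this example.

```python
from math import sqrt
from statistics import stdev

# Ordered absorbance values from the data set above
values = [496, 496, 499, 501, 505, 507]
rounded_mean = 501  # mean rounded to the nearest whole number, as in the text

# Square each deviation from the mean and add the squares together
squared_deviations = [(v - rounded_mean) ** 2 for v in values]
sum_of_squares = sum(squared_deviations)

# Divide by the number of data items minus one, then take the square root
variance = sum_of_squares / (len(values) - 1)
s_manual = sqrt(variance)

# Library calculation, which uses the unrounded mean
s_library = stdev(values)

print(sum_of_squares, variance)
print(round(s_manual, 1), round(s_library, 1))
```

The two routes differ slightly in later decimal places because of the rounded mean, but agree to the one decimal place quoted in the text.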

The standard deviation is useful because it can be used to determine the confidence interval for a set of measurements: a specified interval above and below the mean within which we can be 95 per cent confident that the true value lies. If we knew the true value rather than having to work with the mean, and if we had an infinite number of measurements, about 95 per cent of measured values would lie within approximately twice the standard deviation of it. Because the mean is only an estimate of the true value and we are likely to be dealing with relatively small samples, a correction is required, and the required interval is actually

ts/√n

where n is the number of measurements and t can be found from tables (see, for example, references 2 or 3). The appropriate value of t for a 95 per cent confidence interval and an n - 1 of five is 2.57, so the required interval is

(2.57 × 4.6)/√6 = 11.8/2.45 = 4.8

So, referring back to the mean calculated earlier, we can say with 95 per cent confidence that the true value lies within ±5 of the mean value of 501. It is, of course, possible to calculate other confidence intervals by choosing appropriate values of t.
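The interval ts/√n can also be evaluated with a few lines of Python, as sketched below. The value of t is simply typed in from the tables, as in the text, rather than computed; the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# Absorbance values from the data set above
values = [496, 496, 499, 501, 505, 507]

# Tabulated t for a 95 per cent confidence interval with n - 1 = 5,
# taken from the text rather than calculated here
t_95 = 2.57

n = len(values)
s = stdev(values)  # sample standard deviation (divides by n - 1)

# Half-width of the interval: t * s / sqrt(n)
half_width = t_95 * s / sqrt(n)

# Limits of the interval on either side of the mean
lower = mean(values) - half_width
upper = mean(values) + half_width

print(round(half_width, 1))
print(round(lower, 1), round(upper, 1))
```

Changing `t_95` to another tabulated value gives the corresponding interval for a different confidence level.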

Chemical applications

Although we have considered relatively straightforward statistics, there are interesting applications in chemistry even at this level. The Cambridge Structural Database contains a wealth of data on molecular structures, and descriptive statistics can be used, for example, to analyse the variation in bond length across a range of compounds.⁴ Fluorescence correlation spectroscopy requires fitting parameters in a theoretical model to experimental data, and it is only relatively recently that measurements of the standard deviation have become possible, allowing the procedure to be improved.⁵

Teaching issues

Garfield⁶ has considered the issue of how students learn statistics, and has suggested a number of principles for learning statistics. None of these are particularly specific to the discipline, but she does suggest that calculators and computers should be used to help students visualise and explore data. Yilmaz⁷ has addressed the particular issue of teaching statistics to non-specialists, and notes that technical tools for dealing with statistical issues will interest a student only once the need for those tools has been established. This can be done by introducing applications specific to the student's field of study, but also through situations of general interest or experiences in daily life. Sowey⁸ believes that teachers should address five important attributes when teaching statistics, particularly in service courses - coherence, perspective, intellectual excitement, resilience to challenging questioning, and demonstration of practical usefulness.

There is a wealth of real data available for teaching statistics, particularly from analytical chemistry. This addresses the issues of relevance,⁷,⁸ but the data set used needs to be chosen with care. The data set used in this example has two particular advantages: the quantities measured do not have units (their introduction would probably be the next stage in teaching the subject), and most of the calculations involve integers, which are "user friendly" and reduce the possibility of mistakes. Additionally, with this data set the calculations can be performed on a pocket calculator,⁶ and time should be spent showing students how to do this. The procedure is often not straightforward, and confidence in the manual calculation is desirable both as a backup and as an aid to understanding.

References

1. M. O. Andreae et al., Analytical Chemistry, 1981, 53, 1766.
2. T. H. Wonnacott and R. J. Wonnacott, Introductory Statistics, New York: Wiley, 1990.
3. J. N. Miller and J. C. Miller, Statistics and Chemometrics for Analytical Chemistry, Harlow: Prentice Hall, 2000.
4. F. H. Allen, O. Kennard and R. Taylor, Acc. Chem. Res., 1983, 16, 146.
5. T. Wohland, R. Rigler and H. Vogel, Biophys. J., 2001, 80, 2987.
6. J. Garfield, Int. Stat. Rev., 1995, 63, 25.
7. M. R. Yilmaz, J. Stat. Educ., 1996, 4.
8. E. R. Sowey, J. Stat. Educ., 1995, 3.