One Variable Stats

Intro to Stats

Basic Single Variable Statistics

Single variable statistics tell us about the distribution of a single table column. For instance, if we have a table listing the heights of American men, we would like to know what the most common height is, what the average height is, and how common extremely short or tall men are.

Visible Statistics provides a large number of single variable statistics along with two charts. Combined, they should provide you with a comprehensive picture of a numerical column.

Finding the Middle

In data analysis, finding the "center" of a data column is one of the first goals of any data analyst. The word "center" is ambiguous, so we provide several different statistics which could be viewed as the "center" of a data column.

  1. Mean (μ): Shows you the average element of the data column. Mathematically, this is the sum of all the elements divided by the number of elements.
  2. Median: Shows you the middle element of the data column. That is, if the elements are sorted, the median is either the middle element (if there are an odd number of elements) or the average of the two middle elements (if there are an even number of elements). There are equal numbers of data larger than and smaller than the median.
  3. Mode: The mode is the most common element in a data column. For instance, if the data column consists of {1, 3, 4, 5, 1, 2, 1, 2}, then 1 appears three times and is thus the mode. Should several different elements tie for the most common, our software simply provides one of them.
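
As a rough illustration of the three statistics above, here is a minimal Python sketch that uses the standard library's statistics module on the example column from the Mode entry. It is only a sketch and not necessarily how our software computes these values. In particular, statistics.mode returns the first mode it encounters when several elements tie (Python 3.8 and later), which may differ from the tie-breaking our software uses.

```python
import statistics

# Example column taken from the Mode entry above.
column = [1, 3, 4, 5, 1, 2, 1, 2]

# Mean: sum of all elements divided by the number of elements.
mean = statistics.mean(column)

# Median: middle element of the sorted column, or the average of the
# two middle elements when the number of elements is even.
median = statistics.median(column)

# Mode: most common element (first one encountered if there is a tie).
mode = statistics.mode(column)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```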

How far is data generally from the middle?

Although the mean, median and mode all give you a sense of where your data is centered, they do not tell you how much and in what manner the data deviates from that center. While both {99, 100, 101} and {1, 100, 1000} have a median of 100, the numbers in {99, 100, 101} are clearly much closer to the median than those in {1, 100, 1000}. Visible Statistics provides both statistics and graphs for examining how the data is distributed.

Provided Statistics

  1. Standard Deviation (σ): The standard deviation measures the deviation of the values from their mean. Large standard deviations indicate that the data is spread out a good deal while small standard deviations indicate that the data is generally very close to the mean.
  2. 1st Quartile: The element above which about 75% of the data lies and below which about 25% of the data lies. Thus, about 75% of data are larger than the first quartile.
  3. 3rd Quartile: The element above which about 25% of data lies and below which about 75% of the data lies. Thus, about 75% of the data are smaller than the 3rd quartile.
  4. Inter-Quartile Range (IQR): The distance between the 3rd and 1st quartiles. Can be seen as a simple metric for how spread out the data is.
  5. Min & Max: The smallest and largest values in the data column.
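
The statistics in this list can be sketched in a few lines of Python. Two assumptions are baked in and may not match our software: statistics.stdev computes the sample standard deviation (dividing by n - 1), and statistics.quantiles uses its default "exclusive" interpolation method, which is only one of several common ways of computing quartiles. The function name spread_summary is hypothetical.

```python
import statistics

def spread_summary(values):
    """Return the spread statistics for a column with at least two values."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    # Assumption: Python's default "exclusive" method.
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return {
        "stdev": statistics.stdev(data),  # assumption: sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "min": data[0],
        "max": data[-1],
    }

# The two example columns from the text: same median, very different spread.
tight = spread_summary([99, 100, 101])
wide = spread_summary([1, 100, 1000])
print("tight:", tight)
print("wide:", wide)
```

Both columns have a median of 100, but the standard deviation and IQR of the second are far larger, which is exactly the difference the median alone cannot show.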

Box Plot (aka Box and Whisker Plot)

The box plot is merely a graphical representation of the min, 1st quartile, median, 3rd quartile, IQR and max. The box plot consists of two whiskers at the top and bottom of the box which represent the max (on the top) and min (on the bottom). Long black lines connect the whiskers to the blue box. The top of the box represents the 3rd quartile and the bottom of the box represents the 1st quartile. The median is represented by a red line inside of the box. We also add a green line that shows where the datum your mouse is presently over falls.

Here are some telling qualitative observations you can make from the box plot:

  • Where the median is relative to the quartiles
  • Where the median is relative to the min and max
  • Where the IQR is relative to the min and max
  • The size of the IQR versus the total difference between min and max
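
If you prefer numbers to eyeballing, the observations above can be approximated as positions along the min-to-max range. The following Python sketch is only an illustration: the function name box_plot_positions is hypothetical, and the quartiles come from Python's default statistics.quantiles method, which may differ from the one our software uses.

```python
import statistics

def box_plot_positions(values):
    """Express the box plot landmarks as fractions of the min-to-max range.

    0.0 corresponds to the min (bottom whisker), 1.0 to the max (top whisker).
    """
    data = sorted(values)
    low, high = data[0], data[-1]
    total = high - low
    if total == 0:
        raise ValueError("all values are identical, so the range is zero")
    # Assumption: Python's default "exclusive" quartile method.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {
        "q1_position": (q1 - low) / total,
        "median_position": (median - low) / total,
        "q3_position": (q3 - low) / total,
        # Size of the IQR versus the total difference between min and max.
        "iqr_share": (q3 - q1) / total,
    }

positions = box_plot_positions([1, 2, 3, 4, 5, 6, 7])
print(positions)
```

A median position near 0.5 with the quartiles evenly spaced around it suggests a symmetric column, while a small iqr_share means most of the range is taken up by the whiskers.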

The Histogram

The histogram is another tool for examining the distribution of data. The histogram breaks the distribution up into bins and creates bars whose heights show what fraction of the distribution is in each bin. As mentioned at the beginning of this document, there is no single mathematically obvious way of binning the data, so our software uses a technique that generally produces good results.

Our histogram is broken into 13 bins. The center bin represents data that are within 2/5ths of a standard deviation of the mean. All other bins also have a width of 2/5ths of a standard deviation, with 6 bins on either side of the center bin.
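
One way to picture this binning scheme is the Python sketch below. It rests on several assumptions that the description above does not settle: the center bin is taken to be centered on the mean and 2/5ths of a standard deviation wide in total (that is, the mean plus or minus 1/5th of a standard deviation), the sample standard deviation is used, and values beyond the outermost bins are simply left out of the bars. The function name histogram_fractions is hypothetical.

```python
import statistics

BIN_COUNT = 13          # from the description above
BIN_WIDTH_IN_SD = 0.4   # 2/5ths of a standard deviation

def histogram_fractions(values):
    """Return the fraction of all values that falls into each of the 13 bins."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("all values are identical, so the bins have zero width")
    width = BIN_WIDTH_IN_SD * sd
    # Assumption: the 13 bins are laid out symmetrically around the mean,
    # so the left edge of the first bin is 6.5 bin widths below the mean.
    left_edge = mean - (BIN_COUNT / 2) * width
    counts = [0] * BIN_COUNT
    for value in values:
        index = int((value - left_edge) // width)
        # Assumption: values outside the 13 bins are not drawn.
        if 0 <= index < BIN_COUNT:
            counts[index] += 1
    return [count / len(values) for count in counts]

fractions = histogram_fractions([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for i, fraction in enumerate(fractions):
    print(f"bin {i:2d}: {fraction:.2f}")
```

Under these assumptions the bins cover the mean plus or minus 2.6 standard deviations, so anything farther out does not appear in the bars.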

The histogram is good at showing us a picture of the data distribution, though it can fail to show extreme outlier values. It can tell us if the data is close to its mean and how data farther from the mean falls off. It can tell us if a distribution is symmetric (equal on either side of the mean) or has a gradual fall off to one side and an abrupt end on the other. Take a look at the histogram and see what the characteristics of your distribution are.

Over the histogram, we superimpose an image of a bell curve, also known as a normal curve. Many phenomena in nature are distributed according to this curve. Using this image, you can graphically check if your histogram resembles the bell curve. If it does, we say the data is "normally distributed". We also have some more advanced tools for testing the normality of data, but that's beyond the scope of this simple overview.
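
To get a feel for what the bell curve predicts, the following Python sketch computes the fraction of a perfectly normal distribution that would fall into each bin. It assumes 13 bins, each 2/5ths of a standard deviation wide and laid out symmetrically around the mean, which is one reading of the bin description and may not match exactly how our software draws the curve. The function name normal_bin_fractions is hypothetical.

```python
from statistics import NormalDist

def normal_bin_fractions(bin_count=13, bin_width_in_sd=0.4):
    """Fraction of a normal distribution expected in each bin.

    Bin edges are measured in standard deviations from the mean, so the
    standard normal distribution (mean 0, standard deviation 1) suffices.
    """
    standard_normal = NormalDist()
    left = -(bin_count / 2) * bin_width_in_sd
    edges = [left + i * bin_width_in_sd for i in range(bin_count + 1)]
    # The area under the curve between two edges is the difference of the CDF.
    return [
        standard_normal.cdf(upper) - standard_normal.cdf(lower)
        for lower, upper in zip(edges, edges[1:])
    ]

expected = normal_bin_fractions()
for i, fraction in enumerate(expected):
    print(f"bin {i:2d}: {fraction:.3f}")
```

Comparing these expected fractions with the observed bar heights is a numeric version of the visual check described above. It is a rough comparison only, not a formal test of normality.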

Other Statistics

In addition to helping you determine the center and distribution of your data column, we also provide some other statistics that you may find useful. To understand these, we need to talk briefly about populations and sampling.

Often statisticians are asked to randomly sample a large population in order to characterize the entire population as a whole. For instance, a pollster may poll a few thousand randomly chosen Americans to try to draw a picture of the country as a whole or an industrial engineer may test a few thousand randomly chosen widgets from a production line to see how quickly they fail.

In cases where you have such a sample and plug the data into our tools, you often would like to know how precisely the mean of the sample corresponds to the mean of the entire population. For instance, if you polled only 5 people and found that 2 of them, or 40%, hated ice cream, you could not reasonably conclude that 40% of Americans hate ice cream because you did not sample enough Americans to make your results very precise.

  1. Set Size: Set size tells you how many numeric values were in the data column. Note that this may not be the same as the number of rows in the table because some of them may not have numeric values.
  2. Standard Error (SE): If your data is from a random sample of a large population, the standard error is a measure of how accurately the mean of your sample data reflects the mean of the population as a whole. The larger the standard error, the less accurately the sample mean reflects the mean of the whole population.
  3. Margin for Error for Mean (+/-): Provided that your data is a random sample from a much larger population, the "+/-" field gives you the margin of error with which the sample mean estimates the population mean. For instance, "+/- 2.5" tells you that the population's mean is likely between the sample mean - 2.5 and the sample mean + 2.5. For about 95% of random samples, the population mean will fall within the range specified by the "+/-" field.
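
The three statistics above can be sketched in Python using the ice cream poll as data, with 1 meaning "hates ice cream" and 0 meaning "does not". This sketch assumes the textbook formulas: the standard error is the sample standard deviation divided by the square root of the set size, and the 95% margin is 1.96 standard errors (the normal approximation). Our software may use a different critical value, for example one from the t distribution, which would be noticeably larger for a sample this small. The function name sample_summary is hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # assumption: normal-approximation critical value for 95%

def is_numeric(value):
    """True for ints and floats, excluding booleans and NaN."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return not math.isnan(value)

def sample_summary(column):
    """Set size, mean, standard error and 95% margin for a table column."""
    numeric = [value for value in column if is_numeric(value)]
    set_size = len(numeric)  # may be smaller than the number of rows
    if set_size < 2:
        raise ValueError("need at least two numeric values")
    mean = statistics.mean(numeric)
    standard_error = statistics.stdev(numeric) / math.sqrt(set_size)
    return {
        "set_size": set_size,
        "mean": mean,
        "standard_error": standard_error,
        "margin": Z_95 * standard_error,
    }

# Five poll answers plus two rows without a numeric value.
poll = [1, 1, 0, 0, 0, None, "n/a"]
summary = sample_summary(poll)
print(summary)
```

With only 5 answers the margin comes out at roughly 0.48 around a mean of 0.4, which shows why such a small poll says very little about the population as a whole.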

