| One Variable Stats |
Basic Single Variable Statistics
Single variable statistics describe the distribution of a single table column. For instance, given a table of the heights of American men, we would like to know what the most common height is, what the average height is, and how common extremely short or tall men are. Visible Statistics provides a large number of single variable statistics along with two charts. Combined, they should give you a comprehensive picture of a numerical column.
Finding the Middle
Finding the "center" of a data column is one of the first goals of any data analyst. The word "center" is ambiguous, so we provide several different statistics, each of which could be viewed as the "center" of a data column.
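As a rough illustration of how different notions of "center" can disagree, the sketch below computes the mean, median and mode with Python's standard `statistics` module. The height values are made up for this example, and the code is not how Visible Statistics itself computes these numbers.

```python
from statistics import mean, median, multimode

# Hypothetical heights in inches, invented purely for illustration.
heights = [66, 68, 69, 69, 70, 70, 70, 71, 72, 78]

center_mean = mean(heights)        # arithmetic average of all values
center_median = median(heights)    # middle value of the sorted data
center_modes = multimode(heights)  # most frequent value(s), returned as a list

print("mean:", center_mean)
print("median:", center_median)
print("mode(s):", center_modes)
```

In this sample the single tall value (78) pulls the mean slightly above the median and mode, which is one reason to look at more than one measure of center.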
How far is data generally from the middle?
Although the mean, median and mode all give you a sense of where your data is centered, they do not tell you how far, or in what manner, the data deviates from that center. While both {99, 100, 101} and {1, 100, 1000} have a median of 100, the numbers in {99, 100, 101} are clearly much closer to the median than those in {1, 100, 1000}. Visible Statistics provides both statistics and graphs for examining the distribution of the data.
Provided Statistics
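The contrast between the two example sets can be checked numerically. The sketch below is only an illustration: the helper name `spread_summary` is hypothetical, and the choice of the population standard deviation (`pstdev`) is an assumption, since the text does not say which variant Visible Statistics reports.

```python
from statistics import median, pstdev

def spread_summary(values):
    """Return the median plus two simple measures of spread."""
    return {
        "median": median(values),
        "range": max(values) - min(values),  # distance from min to max
        "stdev": pstdev(values),             # population standard deviation (assumed)
    }

tight = spread_summary([99, 100, 101])
wide = spread_summary([1, 100, 1000])

print(tight)
print(wide)
```

Both sets report the same median, but the range and standard deviation of the second set are several hundred times larger, which is the difference the median alone cannot show.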
Box Plot (aka Box and Whisker Plot)
The box plot is a graphical representation of the min, 1st quartile, median, 3rd quartile, interquartile range (IQR) and max. Two whiskers, one above and one below the box, mark the max (on the top) and the min (on the bottom). Long black lines connect the whiskers to the blue box. The top of the box represents the 3rd quartile and the bottom of the box represents the 1st quartile, so the height of the box is the IQR. The median is represented by a red line inside the box. We also add a green line that shows where the datum currently under your mouse falls. Here are some telling qualitative observations you can make from the box plot:
The Histogram
The histogram is another tool for examining the distribution of data. It breaks the distribution up into bins and draws bars whose heights show what fraction of the distribution falls in each bin. As mentioned at the beginning of this document, there is no single mathematically obvious way of binning numerical data, so our software uses a technique that generally produces good results. Our histogram is broken into 13 bins. The center bin covers the data around the mean and is 2/5ths of a standard deviation wide. All other bins are also 2/5ths of a standard deviation wide, and there are 6 of them on either side of the center bin, so the 13 bins together span 5.2 standard deviations.
The histogram is good at showing a picture of the data distribution, though it can fail to show extreme outlier values. It can tell us whether the data is close to its mean and how the data farther from the mean falls off. It can tell us whether a distribution is symmetric (equal on either side of the mean) or has a gradual fall off on one side and an abrupt end on the other. Take a look at the histogram and consider what the characteristics of your distribution are.
Over the histogram, we superimpose an image of a bell curve, also known as a normal curve. Many phenomena in nature are distributed according to this curve. Using this image, you can graphically check whether your histogram resembles the bell curve. If it does, we say the data is "normally distributed". We also have some more advanced tools for testing the normality of data, but those are beyond the scope of this simple overview.
Other Statistics
In addition to helping you determine the center and distribution of your data column, we also provide some other statistics that you may find useful. To understand these, we need to talk briefly about populations and sampling. Often statisticians are asked to randomly sample a large population in order to characterize the population as a whole.
For instance, a pollster may poll a few thousand randomly chosen Americans to draw a picture of the country as a whole, or an industrial engineer may test a few thousand randomly chosen widgets from a production line to see how quickly they fail. When you have such a sample and plug the data into our tools, you often would like to know how precisely the mean of the sample corresponds to the mean of the entire population. For instance, if you polled only 5 people and found that 2 of them, or 40%, hated ice cream, you could not reasonably conclude that 40% of Americans hate ice cream, because you did not sample enough Americans to make your results very precise.
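To put a rough number on "not very precise", the sketch below uses the textbook standard error of a proportion, sqrt(p(1 - p)/n), with a 1.96 multiplier for an approximate 95% margin. This is an assumption made for illustration: the text does not say which precision statistic Visible Statistics reports, the helper name `proportion_margin` is hypothetical, and the normal approximation behind the formula is itself unreliable for a sample as small as 5. The 800-of-2000 poll is an invented comparison case.

```python
import math

def proportion_margin(successes, n, z=1.96):
    """Return the sample proportion and an approximate 95% margin of error."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)  # textbook formula for a proportion
    return p, z * standard_error

# The 5-person ice cream poll from the text: 2 of 5 respondents.
small_p, small_margin = proportion_margin(2, 5)

# A hypothetical poll with the same 40% result but 2000 respondents.
large_p, large_margin = proportion_margin(800, 2000)

print(f"n=5:    {small_p:.0%} plus or minus {small_margin:.1%}")
print(f"n=2000: {large_p:.0%} plus or minus {large_margin:.1%}")
```

With 5 respondents the margin is wider than the estimate itself, while the same 40% from 2000 respondents is pinned down to within a few percentage points.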
|