Data Analysis Part 1: Database Management, Distributions, and Bivariates

Binary Wrapping Paper available at “Think Geek!”

This is part 1 of how to analyze data, for those who have never analyzed data…

Part 1: Database Management, Distributions (aka Descriptive Statistics), and Bivariate Analysis

If you have never created, organized, or utilized data from a database, analysis will truly suck for you.  If you have experience working with and organizing data in Microsoft Excel or Access, or some other database program like FileMaker Pro, then it will just suck.

Step 1: Filter and Clean your data. This is not a one-stop shop, but an iterative process that will evolve along with your understanding and appreciation of your data, or lack thereof.  All aspects of your field work, be they survey responses, laboratory results, observations, etc., must be evaluated to determine the most logical way of representing the information.  Do not take advice from others on this process, unless these “others” have looked at your data and at least marginally begun the iterative process with you.  Only you will ever look at your database, work with your database, and wish you had never created it.

That said, do not expect to organize your information, say in tabular format, and then quickly proceed to analysis.  It literally takes multiple attempts to develop a structure that logically extends into the analysis process.  Binary wrapping paper (see above) is an example of how not to organize your data.  I, for example, assigned binary values to all variables prior to deeply analyzing the distribution of the data and determining whether a binary bivariate analysis of categorical data was in fact what I needed.  I have personally re-developed my database three times: binary; non-binary; and finally a combination of binary and non-binary variables determined by a deep analysis of variable distribution, logical assumption, and intended utilization.  This third version was only possible after multiple phases of trial and error with the bivariate analysis, Step 3.

Each reorganization of the database was deeply painful.  Never before in my life had I walked aimlessly around a room out of sheer desperate boredom.  It is like prison and solitary confinement, and it can drive you mad unless you have interpersonal interaction.  Do this process in the company of others to feel less crazy.  Get exercise; try not to smoke.  Keep lots of whiskey handy.  Drink heavily after pushing “save.”

Get a better office.

Step 2: Distributions. So you think your database is ready, eh?  Well OK. Let’s find out.  Do you know statistics?  If yes, good for you.  Benjamin Disraeli was quoted once as saying statistics are a form of lying: “lies, damned lies, and statistics.”  You will quickly learn why.  But first, you will quickly learn why you don’t need to know statistics, and therefore why you have no idea what is happening inside your computer with your data.  Thus, you won’t be able to explain your data mathematically, and will be forced to simply accept that your computer has done its job, and that your statistical results are not lies, though you have no idea why.  Quite like a creationist’s explanation of Genesis: “The bible said so, so it is, I have dominion.”  No explanation necessary.  Yet science is not dogmatic, because we’re statistically less crazy than creationists.

Anyway, I use JMP Statistical Software from the SAS Institute.  JMP is a menu-driven program that is very visual and requires no previous experience with code languages.  Why did I choose JMP?  To maximize profit.  JMP is cheap at UC Davis.  Less than $30.  It must be an incentive program to get young scientists addicted to JMP, so when they graduate they pay the full amount for a professional license.  Like handing out crack for free on the corner, knowing they’ll pay a premium once they’re hooked, like this guy…

A typical UC Davis JMP user in the Population Biology Graduate Group…

Descriptive statistics are done for you by your program.  If you want to know the theory, take a stats class or go to the library.  It’s boring.  Basically, a series of distributions are shown of your data.  Depending on the type of data (nominal/categorical, ordinal, or continuous), your program will spit out some figures and charts and tables.  You get to look at these and say: Thanks JMP!  But then you have to remember that you’re moving towards Step 3: Bivariate Analysis, and so maybe you should be more thorough.  For example, maybe the herd sizes in your study have a natural breaking point around 124 animals based on a histogram output shown by these descriptive statistics, where 52% of herds have fewer than 124 head of cattle, and 48% have more.  You might decide that you want to develop a new variable called “HerdSize_Binomial” where 0 indicates small herds (<124 animals) and 1 indicates large herds (≥124 animals).  Take a look at Example 1 below….
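If you did want to do this recode outside of JMP, a minimal sketch looks like the following. The variable names (herd_size_binomial, mirroring the “HerdSize_Binomial” column in the text) and the sample herd sizes are invented for illustration; only the 124-animal cutoff comes from the example above.

```python
# Hypothetical sketch: recoding a continuous herd-size variable into a
# binary one at the breakpoint of 124 animals described in the text.
CUTOFF = 124

def herd_size_binomial(hrd_size):
    """Return 0 for small herds (< 124 animals), 1 for large herds (>= 124)."""
    return 0 if hrd_size < CUTOFF else 1

# Invented sample data for illustration
herd_sizes = [9, 48, 84, 121, 250, 487, 1328]
binomial = [herd_size_binomial(h) for h in herd_sizes]
print(binomial)  # → [0, 0, 0, 0, 1, 1, 1]
```

The point is only that the new column is a deterministic function of the old one, so it can always be regenerated if (when) you reorganize the database again.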

Example 1: Distribution of Herd Size (HRD_SIZE)




Quantiles
  100.0%  maximum   1328.0
   99.5%            1328.0
   97.5%            1186.3
   90.0%             487.2
   75.0%  quartile   249.8
   50.0%  median     121.5
   25.0%  quartile    84.3
   10.0%              48.0
    2.5%              12.9
    0.5%               9.0
    0.0%  minimum      9.0

Moments
  Mean             221.90385
  Std Dev          246.72103
  Std Err Mean      34.21405
  Upper 95% Mean   290.59142
  Lower 95% Mean   153.21627
  N                 52

Here you can see several fun things.  For instance, in the Moments table we can see the sample size (n = 52 households), the mean (221.9 animals), and the confidence interval about the mean.  Above, in the Quantiles table, we find the rationale for the percentage breakdown.  Our median (the number separating the higher half of the sample from the lower half) is 121.5 animals, meaning that about half of the households have more than 121.5 animals, and half have fewer.  We could have used this as the breakdown point as well.  It’s really up to the scientist, and that’s you, remember!
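For the curious, the numbers in a Moments table like the one above are cheap to reproduce with nothing but Python’s standard library. The herd-size values below are invented stand-ins (the real table came from the 52 study households); the point is just what each statistic is.

```python
# Minimal sketch of the "Moments" statistics JMP prints, using only the
# standard library. Sample data is invented for illustration.
import statistics

herd_sizes = [9, 13, 48, 84, 121, 122, 250, 487, 1186, 1328]

mean = statistics.mean(herd_sizes)
median = statistics.median(herd_sizes)
sd = statistics.stdev(herd_sizes)          # sample standard deviation
sem = sd / len(herd_sizes) ** 0.5          # standard error of the mean

print(f"N {len(herd_sizes)}")
print(f"Mean {mean:.1f}  Median {median:.1f}")
print(f"Std Dev {sd:.1f}  Std Err Mean {sem:.1f}")
```

Note that with these made-up numbers the median also happens to land at 121.5, the midpoint of the two middle observations, which is exactly how the real median in the Quantiles table is defined.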

Descriptive statistics are really fun to look at, especially if you’re interested in your study.  If you’re not, they will suck.  But why would we want to do this in the first place?  Well, beyond providing a lot of interesting information that will help us understand our data and write about it, they also help us move towards Step 3…

Step 3: Bivariate Analysis. You’ve just created a HerdSize_Binomial variable based on your descriptive statistical breakdown of the data on the number of cattle per household.  You have two scores: 0 and 1, for small and large herds respectively.  If there is any point at all to your study, you also have a dependent variable, outcome variable, or something that you are investigating.  Since I have been looking at diarrhea, and the causes of diarrhea in calves, let’s use diarrhea as an outcome: 0 means no diarrhea in the herd, and 1 means diarrhea.  These values were obtained by observing animals with diarrhea in the field.  Now what we want to do is determine whether herd size is statistically associated with diarrhea, or whether the outcome variable is explained by the covariate (explanatory variable).  If you want to know the theory, please go to the library and look up the statistics.  If you’re in a hurry, just push a button on your stats program, or enable your vast working knowledge of code to run R or SAS software and “Fit Y by X” to compare outcome by covariate.

What happens, especially in the case of two nominal variables (0,1 vs. 0,1), is an output of contingency tables (2×2 tables) with subsequent Chi-Square values and their associated probabilities.  If there is a statistically significant relationship, these probabilities will be less than 0.05 or 0.01, depending on what you have determined to be statistically significant.  I used 0.1 for the bivariates to consider a greater range of possible covariates to explain diarrhea, for use in further, more complex analytical procedures.   There will be a Fisher’s Exact test as well, which, with 2×2 tables, is more rigorous than the Likelihood Ratio or Pearson tests run by your program.
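If you would rather run this bivariate test outside of JMP, a sketch with SciPy looks like the following. To be clear about what is assumed: the cell counts below are invented (they are not the study’s real 2×2 table), and `chi2_contingency` / `fisher_exact` are standard SciPy functions, not anything specific to this analysis.

```python
# Hedged sketch: Pearson chi-square and Fisher's exact test on a 2x2
# contingency table, as described in the text. Counts are invented.
from scipy.stats import chi2_contingency, fisher_exact

# rows: HerdSize_Binomial (0 = small herd, 1 = large herd)
# cols: CLN_DIAR (0 = no diarrhea, 1 = diarrhea)
table = [[6, 20],
         [18, 8]]

chi2, p_pearson, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table, alternative="two-sided")

print(f"Pearson chi-square: {chi2:.3f}, p = {p_pearson:.4f}")
print(f"Fisher's exact (2-tail): p = {p_fisher:.4f}")
```

You would then compare each p-value to whatever significance level you chose (0.05, 0.01, or the looser 0.1 used above for screening covariates).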

Example 2: Bivariate Test of Clinical Diarrhea (CLN_DIAR) by Herd Size (Total_HerdSize_Rank_Binomial):

Contingency Table (2×2 Table): Total_HerdSize_Rank_Binomial by CLN_DIAR

[The table of cell counts (Total %, Col %, Row % for herd-size ranks 1 and 2 by CLN_DIAR = 0, 1) did not survive formatting; the test statistics were:]

Tests
  N   DF  -LogLike    RSquare (U)
  52  1   6.7637548   0.1885

  Test              ChiSquare  Prob>ChiSq
  Likelihood Ratio  13.528     0.0002
  Pearson           12.942     0.0003

Fisher’s Exact Test
  Left    0.0004  Prob(CLN_DIAR=1) is greater for Total_HerdSize_Rank_Binomial=1 than 2
  Right   1.0000  Prob(CLN_DIAR=1) is greater for Total_HerdSize_Rank_Binomial=2 than 1
  2-Tail  0.0007  Prob(CLN_DIAR=1) is different across Total_HerdSize_Rank_Binomial

In this case, we can see that Clinical Diarrhea is indeed associated with herd size, as there is a statistically significant relationship shown by the P-values (Prob>ChiSq) of 0.0002 and 0.0003 for the Likelihood Ratio and Pearson tests, and verified by the Fisher’s Exact Test (p=0.0004), where these values are less than 0.05, our level of significance.  Therefore, we reject the null hypothesis that there is no relationship between the variables.  However, this significance is between Clinical Diarrhea in calves and small herds (see the Fisher’s Exact Test: Left, and associated Alternative Hypothesis), as the probability of diarrhea in calves is greater when the herd size value is 1 (small herd value).  This requires logically looking at your data, reviewing the distributions of herd size against other factors like location of these herds, water sources, management practices, numbers of small ruminants, and other factors that might be influencing diarrhea beyond just herd size.  But even though it is significant in a way that is contrary to our expectations, it is still worth reporting, and maybe worth analyzing in combination with other factors in a more complex multivariate model, the next phase of analysis.

Next Steps. You will conduct this procedure with every single covariate/variable in your data (though not all will be contingency tables, as not all data will be categorical or dichotomous).  You will then interpret the results, repeating Steps 1 and 2 of database management and descriptives to obtain more easily analyzable data (like changing number of animals to a herd size rank), and then re-analyzing, continually repeating this cycle until you have no idea what the file names are, where the outputs are stored, where you live, when you last showered, why your fiancé is mad at you, and what your study objectives were in the first place.

At this point, it is generally a good idea to re-read your research proposal, remind yourself of why you conducted fieldwork and under what hypotheses, refresh the logical and theoretical assumptions and framework that influenced the data collection, and then revisit your analytical results.  Otherwise, you are sure to attend a meeting where you present these bivariates, warranting a firm and authoritative “what’s the point of this study, and why are you showing me this?” response from your advisers and team members.

Who cares?

Next time: Analysis Part 2: Multivariate Modeling and the Beginning of the End….

