Very Basic Intro to Pandas

There are probably thousands of Pandas tutorials online, and many of them are presumably better than this one. With that said, it never hurts to have one more. This tutorial is aimed at people with little or no experience with Python and Pandas. Whenever I need to manipulate and analyze data in Python, I find myself doing the same few things over and over again. I have a couple of techniques that I fall back on, and those techniques can be used for just about everything. This guide introduces some basic concepts in Pandas, explains a few simple operations that frequently come up, and outlines two approaches that you might find useful. I assume you have Python set up and Pandas installed. If you don’t have Pandas installed, just go to the command line and type “pip install pandas” and you should be good. I also assume you know how to write and run a simple Python script.

Pandas Basics

Pandas is a Python library that provides data structures and data analysis tools. For our purposes, what matters is the DataFrame. It looks like this:

A B C D
0 4.0 E 0.1 2016-09-15
1 9.0 F 0.8 2017-03-22
2 1.0 G 9.0 2016-02-11

You can create a DataFrame with pd.DataFrame(). For example:

import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4),
                  index=[0, 1, 2, 3, 4, 5],
                  columns=['A', 'B', 'C', 'D'])

The index= part is completely unnecessary here, as pandas will set the index starting at zero counting up. I simply included it here in case you want to create a different index in the future, say with dates. The resulting DataFrame looks like this:

A B C D
0 0.24 0.81 -1.84 -0.46
1 0.18 -0.23 -1.79 1.60
2 0.19 0.08 -1.50 0.25
3 -0.35 0.44 0.21 -0.71
4 0.39 -0.04 0.18 1.82
5 -0.44 0.07 0.74 -0.45

You won’t normally be creating DataFrames from scratch. Usually you will already have some data, and you will want to use Python to analyze it. Pandas can read a variety of different file formats, but for this tutorial I am going to use this CSV of UFO sightings data. Download the CSV and put it in the same directory where your Python code is. You can then load it into Pandas using the following code. For reference, I also show you how to write a DataFrame to a CSV.

import pandas as pd

# Read from csv
df = pd.read_csv('ufo.csv')
# Write to csv
df.to_csv('ufo.csv')

The DataFrame should now look like this:

city state country ... lat lng time
0 sanmarcos tx us ... 29.88 -97.94 20:30
1 lackland tx NaN ... 29.38 -98.58 21:00
... ... ... ... ... ... ... ...
79636 vienna va us ... 38.90 -77.26 22:20
79637 edmond ok us ... 35.65 -97.47 23:00

There are a number of columns that aren’t shown in the DataFrame above. These are: shape, seconds_seen, time_seen, description, and date. If you’re following along at home, take a second to look at the DataFrame and figure out what is in each column.

Groupby

There are three operations that I want to explain: groupby, pivot, and merge. These will take care of a lot of the data manipulation work that you’re interested in, so it is worth learning them. At the very least, it is helpful to know they exist. First is groupby, which is used to split data into groups, apply a function to each group, and combine the data. In short, you use a groupby to combine data according to some function. For this tutorial, I want to find out how many times each shape of UFO has appeared in each state in the US. I can do that with the following code:

df = df[df['country'] == 'us']
df = df.groupby(['state', 'shape'])['shape'].count()

In the first line I limit my data to UFO sightings in the US, although you can ignore that part for now, as I will return to it later. In the second line, I group the data by both state and shape, and then count the appearance of each shape. You don’t have to perform a count function on the data, though; you might, for example, be interested in the mean number of seconds each UFO is seen by shape and state:

df = df[df['country'] == 'us']
df = df.groupby(['state', 'shape'])['seconds_seen'].mean()

This returns a Series. I prefer to work with DataFrames, so underneath that code I would add:

df = df.to_frame()
df = df.reset_index()

This converts the Series to a DataFrame, and then resets the index, which converts the DataFrame into the format we are used to working with:

state shape seconds_seen
0 ak changing 5484.00
1 ak cigar 317.50
2 ak circle 1055.12
... ... ... ...
1056 wy sphere 2447.85
1057 wy triangle 328.79
1058 wy unknown 222.56

However, if I want to get a DataFrame containing shape counts, I run into a problem. I want to group by the shape column while also counting the shape column, which results in an error when converting to a DataFrame, as there are two columns named shape. There are a number of ways around this, but my personal favorite is the hacky solution of just creating a new column of counts. Each row represents one sighting, so we can create a “shape_count” column equal to 1, and count that column instead:

df = df[df['country'] == 'us']
df['shape_count'] = 1
df = df.groupby(['state', 'shape'])['shape_count'].count()
df = df.to_frame()
df = df.reset_index()

I prefer this method of creating a new column to other approaches, as it comes in handy when trying to perform functions on multiple columns. We now have the DataFrame we want:

state shape shape_count
0 ak changing 5
1 ak cigar 6
2 ak circle 26
... ... ... ...
1056 wy sphere 13
1057 wy triangle 14
1058 wy unknown 9

Pivot

The above section was longer than expected, so I will keep this short. Pivot is used to reshape a DataFrame. Specifically, pivot will take a DataFrame and use the columns to create new columns and populate the values. For example, I can use pivot to rearrange the above DataFrame to have one row per state, with each shape type as a column. Following on from the code above:

df = df.pivot(index='state', columns='shape', values='shape_count')
df = df.fillna(0)
df = df.reset_index()

If a shape has never been seen in a given state, by default pandas will fill the value with NaN (not a number). I therefore fillna(0) to change NaN to 0. We’re left with the following DataFrame:

state changing ... teardrop triangle
0 ak 5 ... 1 25
1 al 12 ... 6 78
2 ar 16 ... 7 84
... ... ... ... ... ...
49 wi 17 ... 14 111
50 wv 10 ... 3 54
51 wy 4 ... 0 14

Merge

The last operation that I want to show is merge, which can be used to combine two DataFrames. The Pandas documentation is especially useful here, and odds are you will be able to find a case that fits your exact needs. I therefore will only provide a simple example here. Let’s say we have the DataFrame we created with the pivot (named df), and we want to combine it with some other data. For example, maybe we want to add a column containing the number of votes Donald Trump received in each state in the 2016 election. Here is a CSV containing that data. All we have to do is tell Pandas we want to merge the two DataFrames using the shared column “state”:

df_trump = pd.read_csv('trump_vote.csv')
result = pd.merge(df, df_trump, on='state')

The DataFrame “result” only contains 50 lines, one for each state, rather than the 52 in the original DataFrame (which contained DC and Puerto Rico); that is because merge, by default, will only merge the rows that are shared by the two DataFrames. This behavior can be changed, of course, but hopefully this example suffices to demonstrate how merge works. Like I said, if you view the documentation that I linked to you should be able to find an example that matches your exact case.

Solving Problems in Pandas

With all of that covered, there are two final things that I want to discuss. First, I regularly find that I need to limit the data in my DataFrame, or drop some data, based on a condition. This is incredibly easy, as it turns out. We’ve already seen an example of this above, where I only wanted the UFO data from the US:

df = df[df['country'] == 'us']

You can use this general pattern to do more complex things as well. For example:

df = df[(df['state'] == 'ny') & (df['seconds_seen'] > 30.0)]

This will give you every row where the state is “ny” and the UFO was seen for more than 30 seconds. You have to use &, not “and”, for somewhat confusing reasons. For the same reason, use | instead of “or”. Also know that isin() exists:

df = df[df['state'].isin(['tx', 'ny'])]

This, of course, will give you every row where the state is either “tx” or “ny”. Second, if you’re stuck you can probably solve your problem by iterating the index and individually accessing whatever element or elements you are interested in. The two patterns to remember are for i in df.index: (to iterate over the DataFrame) and df.ix[i, 'column_name'] (to access the element for the row, i, that you are on). Say you want to create a column that displays the number of votes Trump received in a state by the number of sightings of either the circle or the fireball, whichever is greater. There are a few “cute” ways of doing that, but if you can’t figure it out you can just iterate over the index. Having renamed “result” from the above merge to df:

for i in df.index:
    if df.ix[i, 'circle'] > df.ix[i, 'fireball']:
        df.ix[i, 'trump_over_c_or_f'] = df.ix[i, 'trump'] / df.ix[i, 'circle']
    else:
        df.ix[i, 'trump_over_c_or_f'] = df.ix[i, 'trump'] / df.ix[i, 'fireball']

Notice that “trump_over_c_or_f” is a column that doesn’t exist yet, so you can use this approach to create new columns. Of course, this example is silly because I have no idea why you would want to do this operation, but hopefully it serves to demonstrate the technique. “When in doubt, iterate (loop) over the entire DataFrame” isn’t a horrible rule to follow for beginners.

Conclusion

I can already hear the whining of experienced Python programmers complaining about this tutorial and the way I did certain things. But this tutorial isn’t for those people; it’s for people who don’t really know Pandas at all. I’ll wrap this up now, but hopefully it was helpful. Feel free to DM me on Twitter if you have any thoughts or questions.