Python Pandas Tutorial

In my previous post, Cost Functions: Why R-Squared Does Not Cut It, I showed a sample of alternative cost functions using Python pandas. Afterwards, I figured it might be useful to give a Python pandas tutorial for those who are just starting out, and it'll be something I can reference in my future posts.

I'm sure pandas techniques will evolve, so I'll try to keep this post as current as possible, but if you think something should be done differently, please leave me a comment and I'll include it.

Loading data into a pandas dataframe

Pandas data objects are called DataFrames, similar to the data frames you'd find in R. Getting your data into pandas is really easy, so let's break it down by data source.

The raw data we’ll be working with can be found here.

CSV into Python Pandas DataFrame
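A minimal sketch, assuming the raw data linked above has been saved locally (the filename here is a placeholder):

    import pandas as pd

    # read the CSV and use property_id as the index
    df = pd.read_csv('properties.csv', index_col='property_id')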

Yup, that easy. I’ll explain the indexes later, but for now all you need to know is that each dataframe has an index column. In this case, let’s set it to property_id, which is the primary key for this data. If you don’t specify an index_col, one will be created for you.

The pandas documentation page has a good explanation of what setting dia.quoting does:

By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes it to fail when it finds a newline before it finds the closing double quote.

We can get around this using a dialect:
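A sketch of the workaround, building a dialect object (dia) and relaxing its quoting behavior:

    import csv

    # start from the default Excel dialect and loosen the quoting rules
    dia = csv.excel()
    dia.quoting = csv.QUOTE_NONE  # stop treating double quotes as quote characters

    df = pd.read_csv('properties.csv', index_col='property_id', dialect=dia)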

Excel into Python Pandas DataFrame

Note: openpyxl is required to read .xlsx files.
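A sketch, with the filename again a placeholder:

    # requires openpyxl for .xlsx files
    df = pd.read_excel('properties.xlsx', index_col='property_id')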

You can read more about importing Excel files here, but one thing to note is that importing large Excel files takes significantly longer than importing the equivalent text files.

MySQL table into Pandas DataFrame

To read from a MySQL database you’ll need to import a couple additional modules.
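Something along these lines, using MySQLdb for connectivity and pandas' SQL helpers for the query (credentials and table name are placeholders):

    import MySQLdb
    import pandas.io.sql as psql

    # open a connection to the database
    conn = MySQLdb.connect(host='localhost', user='root', passwd='secret', db='real_estate')

    # pull the whole table into a dataframe
    df = psql.read_sql('SELECT * FROM properties;', con=conn)
    conn.close()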

The output dataframe won’t have a specified index column, but that’s ok since we can add one. I’ll show you how to do that in the indexes section of this post.

Renaming Columns

Renaming columns is very common, especially when working with external data sources that don't comply with *your* naming conventions.

The suggested way to rename columns is to do it “in place,” meaning you’re not creating a new dataframe. This will be important when you’re running into memory constraints.

You can also create a new dataframe by excluding the inplace option or setting it to False; both variants are sketched below.
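Both variants, with a hypothetical column rename:

    # rename in place: no new dataframe is created
    df.rename(columns={'units': 'n_units'}, inplace=True)

    # or return a renamed copy instead
    df2 = df.rename(columns={'n_units': 'units'}, inplace=False)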

Indexing

One of the better features of pandas is that every dataframe comes with a prebuilt index (i.e., a primary key, for the database folks). This makes querying and joining data really fast. Let's take a look at how to set an index if you haven't done so at import time.

Here is how we'd pull the subset of the data we'll use going forward:
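Something like this, with assumed column names:

    # keep a handful of columns and summarize them
    df = df[['district', 'year_built', 'n_units', 'mkt_value_per_sqft']]
    df.describe()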

I'll explain the subsetting by columns and the .describe() function later, but this gives us a manageable set of columns to work with. There is a default 0 to n-1 index on the data, but let's say we wanted to set an index on 'district'; here is what we would do.

Set new index
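A sketch:

    # set 'district' as the index, in place;
    # drop=False keeps the original column, append=False replaces any existing index
    df.set_index('district', drop=False, append=False, inplace=True)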

This sets a new index in place. The drop= parameter says to keep the original column; if it were set to True, the 'district' column would be removed. The append parameter is called out to show that you can stack the new column on top of your existing index if you want to.

Reindex

Sometimes you'll need to reindex your dataframe 1 to n (this becomes useful when you've taken a subset of your data, run it through some procedure that outputs only a Series, and need to merge back to the original dataframe).
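For example:

    # reset to a fresh 0..n-1 integer index, discarding the old one
    df = df.reset_index(drop=True)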

The drop=True statement says to discard the original index, because trying to re-insert the categorical district index as a column would throw errors. There are cases where having to re-index is useful, and we'll explore some of them in later posts.

Selecting Data Using An Index

Let's reindex our data using 'district', and I'll show you how to select data using .ix.
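A sketch of both selections (n_units is an assumed column name; .ix is what the post uses, while modern pandas would use .loc):

    # all rows where the district index equals 'FINANCIAL'
    financial = df.ix['FINANCIAL']

    # of those, only the rows where the number of units equals 42
    financial_42 = financial[financial['n_units'] == 42]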

In the first step we select all the rows where the district index equals 'FINANCIAL', and in the second step we also subset to where the number of units equals 42.

You can also select all rows where the number of units is 42 or greater by slicing a sorted numeric index: .ix[42:].

Group By Column

One of the common operations in data analysis is grouping data by a variable to derive summary statistics per group. You can do this in pandas with literally one line of code.
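For example, the mean number of units per district, stored as dfg so we can join it back later:

    # group by the district column and take the mean of n_units
    dfg = df.groupby('district')[['n_units']].mean()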

Group By Index

Of course it’s more efficient to group by an index, so let me show you how to do that. Let’s say you have a multi-index on columns district and year_built, and you want a count of units in each district and then a count by district and year_built:
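A sketch:

    # multi-index on district and year_built
    dfi = df.set_index(['district', 'year_built'])

    # count of units by district (the first index level)
    dfi.groupby(level=0)['n_units'].count()

    # count by district and year_built (both levels)
    dfi.groupby(level=[0, 1])['n_units'].count()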


That’s the basics of it. Let’s see how we can join these statistics back to our original data.

Merge, Join, and Concatenate

Let's take a scenario where we want to compute a new variable that's defined as the number of units divided by the mean number of units in the district. We already have the mean number of units per district stored in the dfg (cryptic, I know) dataframe, which has a 'district' index. There are a couple of options when merging and joining data; let's look at them one by one.

Merge
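A sketch, with dfg being the per-district means from the group-by section and df indexed by district:

    # merge on the indexes of both dataframes;
    # suffixes disambiguate the shared n_units column
    merged = pd.merge(df, dfg, left_index=True, right_index=True,
                      suffixes=('', '_district_mean'))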

This was an example that uses the index to merge; note that both datasets are indexed on the same keys. If you don't have indexes on your dataframes, you can use left_on= and right_on= instead of right_index/left_index.

Join

If your indexes are properly set, the join method does the same thing as merge, just with fewer arguments (even less code).
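For example:

    # join matches dfg's index against df's index; how= defaults to 'left'
    joined = df.join(dfg, rsuffix='_district_mean')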

You can specify inner/outer parameters as well.

Creating new variables/columns

Creating new variables is exactly like you would think it is. Let’s assume we wanted to create a constant variable for regression, mean-impute the number of units, and normalize the number of units by the mean.
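A sketch of all three (column names are assumed):

    # constant for regression
    df['const'] = 1

    # mean-impute missing unit counts
    df['n_units'] = df['n_units'].fillna(df['n_units'].mean())

    # normalize by the mean
    df['units_norm'] = df['n_units'] / df['n_units'].mean()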

Create variables with apply

When doing data analysis, you're often transforming data to correct for skewed distributions or to minimize the impact of outliers. You'll find yourself needing to apply functions such as square root to the data.

Numpy Functions

Apply works with numpy functions out of the box. For example, if you wanted to create a new variable that's computed as the square root of the number of units:
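A sketch (np.sqrt applied via .apply; the vectorized np.sqrt(df['n_units']) would work too):

    import numpy as np

    # apply numpy's square root element-wise
    df['units_sqrt'] = df['n_units'].apply(np.sqrt)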

Custom Functions

Let's say you wanted to create categories out of the n_units variable, like 'low', 'medium', and 'high.' For that, you'd need a custom function that reads the values and returns a category. Apply allows you to do just that:
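A sketch, with hypothetical min/max cut-offs passed through args=:

    # categorize unit counts; min_units and max_units arrive via the args tuple
    def unit_category(n, min_units, max_units):
        if n < min_units:
            return 'low'
        elif n > max_units:
            return 'high'
        return 'medium'

    df['unit_cat'] = df['n_units'].apply(unit_category, args=(10, 100))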

Pay attention to how the function is defined and how it's called, because you'll run into this pattern quickly. The documentation for apply says that it only takes one functional argument … but it allows an args= tuple to be passed through. If we define our function with the extra parameters (min/max) and pass them through using the args tuple, we've achieved passing multiple parameters into the function. #winning

Basic Statistics

I’ve kind of inserted some of these throughout the post already, but let’s sum them up:
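The usual one-liners look like this (column name assumed):

    df['n_units'].mean()
    df['n_units'].median()
    df['n_units'].std()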

To get more detailed statistics or to set variables to equal a specific metric, try something like this:
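A sketch:

    # full summary, or pull a single metric out of it
    stats = df['n_units'].describe()
    mean_units = stats['mean']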


Regression

The models I'm going to build in this example are for demonstration purposes only, and I'll try to capture as much useful syntax as possible.

Let's build a descriptive model where we look at the market value per square foot as a function of the number of units, square footage per unit, and age.
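A quick look at the variables first (the column names here are assumed stand-ins for the post's originals):

    df[['mkt_value_per_sqft', 'n_units', 'sqft_per_unit', 'age']].describe()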


Now that we know what the data looks like, let's do some feature scaling on the variables. (Normally, we'd plot these to figure out the appropriate transformation, but since this is a pandas example, I'm just going to apply the standard normal transformation to the independent variables and take a log of the dependent variable.)
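A sketch of those transformations:

    import numpy as np

    # standard normal transform for the independent variables
    for col in ['n_units', 'sqft_per_unit', 'age']:
        df[col + '_norm'] = (df[col] - df[col].mean()) / df[col].std()

    # log of the dependent variable
    df['log_value'] = np.log(df['mkt_value_per_sqft'])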

Now we have some normalized variables. Don't worry about full_market_value; we're not actually going to use it in our overly simplistic model.
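Fitting an OLS with statsmodels is then a couple of lines (a sketch; the original model also included two comparable-sales variables that I'm not reconstructing here):

    import statsmodels.api as sm

    X = df[['const', 'n_units_norm', 'sqft_per_unit_norm', 'age_norm']]
    model = sm.OLS(df['log_value'], X).fit()
    print(model.summary())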

The output is something we're all familiar with looking at. As you can see, age and the two comparable-property variables are the best predictors of price, and n_units isn't significant (probably because that relationship is better captured by the two comps). Let's remove the n_units variable and see what the "final" model looks like.
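A sketch of the refit:

    # drop n_units and refit
    X_final = df[['const', 'sqft_per_unit_norm', 'age_norm']]
    final_model = sm.OLS(df['log_value'], X_final).fit()
    print(final_model.summary())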


Sklearn

Python pandas makes sklearn even more enjoyable. Let's say we wanted to build a random forest model (yeah… on 1.1k observations…) to see if it can classify buildings as Financial District vs Other.
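First, a quick look at the class balance (the 0/1 encoding is an assumption):

    # flag Financial District properties and check their share of the data
    df['is_financial'] = (df['district'] == 'FINANCIAL').astype(int)
    df['is_financial'].mean()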


Ok, so only 2.6% of the observations are in the Financial District, roughly 26 properties … let's see if we can train a random forest to predict those. And by train, I mean execute some simple commands to give us a baseline prediction.
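A minimal sketch (the feature list is an assumption):

    from sklearn.ensemble import RandomForestClassifier

    features = ['n_units', 'sqft_per_unit', 'age']
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(df[features], df['is_financial'])

    pred = rf.predict(df[features])                    # hard 0/1 labels at the 0.5 cut-off
    pred_proba = rf.predict_proba(df[features])[:, 1]  # probability of class 1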

The pred_proba output gives us the probability of each class (0/1), and as you can imagine most probabilities will be in favor of 0 (not Financial District). The pred output gives us a 1/0 answer on whether it thinks the property is in the Financial District or not; it sets 1 where pred_proba > 0.5, which is why all predictions will be zero in this case.

Let's see if we can determine a reasonable cut-off of the predicted probabilities that defines a 1 vs a 0 (instead of the default 0.5).
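One simple way is to scan a grid of cut-offs and see how many positives each one produces (a sketch):

    import numpy as np

    for cutoff in np.arange(0.05, 0.55, 0.05):
        print(round(cutoff, 2), (pred_proba > cutoff).sum())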

At this point we have our original data and predicted values; how do we measure model performance? My favorite way, and I'll address other approaches later, is looking at the average of the response by predicted quantile. The idea is that the quantiles with the highest predicted values should have a higher percentage of records where the response variable = 1.
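A sketch of that quantile check:

    # bucket predictions into deciles (ranking first so ties don't break the bins)
    df['pred_proba'] = pred_proba
    df['decile'] = pd.qcut(df['pred_proba'].rank(method='first'), 10, labels=False)

    # percentage of actual Financial District properties in each decile
    df.groupby('decile')['is_financial'].mean()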

I think this tells the story well: the top quantile has the highest percentage of Financial District properties, 8.4% vs the 2.6% baseline, and 0.2% in the bottom quantile.

This concludes my introduction to Python pandas. I'll be using it heavily in later posts to demonstrate the techniques that statisticians and data scientists use.

4 comments on “Python Pandas Tutorial”
  1. Megha says:

    nice post!! thanks…

  2. SS says:

    Hi,
    This is very useful. If I have a dataframe which has the following columns: [product, items sold, timestamp], and the timestamp looks like "2012-10-01 00:00:00", how do I get a subset of the data frame based on a product name and for all months < 05 and year = 2012?

    Thanks,
    S.S

    • mikhail.semeniuk says:

      Hi SS,

      Sorry for the way late reply, but you can do something like:
      1. Set your timestamp to be an index
      2. Create beg/end positions like:
      import datetime as dt
      beg = df.index.searchsorted(dt.datetime(2012, 1, 1))
      end = df.index.searchsorted(dt.datetime(2012, 12, 1))

      3. Slice your dataframe by those positions:
      df.ix[beg:end]

      Best,
      M

  3. Reddy says:

    Thanks for the very easy explanation of ‘application of SKLearn with financial district problem’

