Practical Business Python

Taking care of business, one python script at a time

Mon 21 November 2016

Building a Financial Model with Pandas

Posted by Chris Moffitt in articles   

article header image

Introduction

In my previous articles, I have discussed how to use pandas as a replacement for Excel as a data wrangling tool. In many cases, a python + pandas solution is superior to the highly manual processes many people use for manipulating data in Excel. However, Excel is used for many scenarios in a business environment - not just data wrangling. This specific post will discuss how to do financial modeling in pandas instead of Excel. For this example, I will build a simple amortization table in pandas and show how to model various outcomes.

In some ways, building the model is easier in Excel (there are many examples just a google search away). However, as an exercise in learning about pandas, it is useful because it forces one to think about how to use pandas strengths to solve a problem in a way different from the Excel solution. In my opinion the solution is more powerful because you can build on it to run multiple scenarios, easily chart various outcomes and focus on aggregating the data in a way most useful for your needs.

What is an amortization schedule?

Financial modeling can take many forms but for this article, I wanted to focus on a problem that many people will encounter in their lifetime. Namely, the finance aspects of a large loan.

The wikipedia page has a good explanation of an amortization schedule. In the simplest terms, an amortization schedule is a table that shows the periodic principal and interest payments needed to pay of a debt. The most common case is the payoff schedule for a mortgage.

Using the mortgage as an example, in each period (typically monthly) a home owner writes a check to their bank for a specified amount. This amount is split into a principal and interest payment. The bank keeps the interest and the principal is applied to the outstanding loan. Over a period of time the final balance will go to 0 and you will own the home.

Even with historically low interest rates, the amount of interest paid on a typical 15 or 30 year mortgage is very high. It is possible that you can pay almost as much in interest as the original loan was worth. Because of the financial importance of this purchase, it is important to understand all the financial ramifications of a mortgage. In addition, there are many variables that can affect the mortgage payments:

  • Interest rate
  • Duration of the loan
  • Payment frequency (monthly vs bi-weekly, etc)
  • Additional principal payments

There are certainly many on-line calculators and examples that show how to build tools in Excel. However, using the pandas solution is handy as a teaching tool to understand pandas in more detail and in using pandas to build a simple way to model and compare multiple scenarios. Before I go through the pandas-based solution, it’s helpful to see the Excel based solution so we have a model to use as a basis for the pandas solution:

Amortization Screen Shot

The basic model is simple. Each period results in a small decrease in the principal. At the end of 30 years, the balance is $0 and the loan is complete.

This model assumes that an individual pays exactly the prescribed amount each period. However, there can be financial benefits to paying extra principal and paying off the loan faster. As I think about modeling my mortgage, I’m curious to understand things like:

  • How much do I save in interest if I contribute a little more principal each payment period?
  • When will I pay off the loan?
  • What is the impact of various interest rates?

Using the pandas solution can be useful for comparing and contrasting multiple options.

Payment, Principal and Interest

Not surprisingly, the numpy library has all the built in functions we need to do the behind the scenes math. In fact, the documentation shows one approach to build the amortization table. This approach certainly works but I’d like to include the results in a pandas DataFrame so that I can more easily dump the results to Excel or visualize the results.

I am going to walk through the basic parts of the solution for a 30 year $200K mortgage structured with a monthly payment and an annual interest rate of 4%. For an added twist, I’m going to build the solution with an extra $50/month to pay down the principal more quickly.

Get started with the imports of all the modules we need:

import pandas as pd
import numpy as np
from datetime import date

Define the variables for the mortgage:

Interest_Rate = 0.04
Years = 30
Payments_Year = 12
Principal = 200000
Addl_Princ = 50
start_date = (date(2016,1,1))

Now, let’s play with the basic formulas so we understand how they work. Calculating the total payment requires us to pass the right values to the numpy pmt function.

pmt = np.pmt(Interest_Rate/Payments_Year, Years*Payments_Year, Principal)
-954.83059093090765

This means that every month we need to pay $954.83 (which matches the Excel solution above). But, how much of this is interest and how much is principal? Well, it depends. The payment stays constant over time but the amount applied to principal increases and the interest decreases as we move forward in time.

For example, for period 1, here is the interest and principal:

# Period to calculate
per = 1

# Calculate the interest
ipmt = np.ipmt(Interest_Rate/Payments_Year, per, Years*Payments_Year, Principal)

# Calculate the principal
ppmt = np.ppmt(Interest_Rate/Payments_Year, per, Years*Payments_Year, Principal)

print(ipmt, ppmt)
-666.6666666666667 -288.163924264

In other words, the first payment of $954.83 is composed of $666.67 in interest and only $288.16 in principal. Ouch.

Let’s look at what the breakdown is for period 240 (20 years in the future).

per = 240 # Period to calculate
# Calculate the interest
ipmt = np.ipmt(Interest_Rate/Payments_Year, per, Years*Payments_Year, Principal)

# Calculate the principal
ppmt = np.ppmt(Interest_Rate/Payments_Year, per, Years*Payments_Year, Principal)
print(ipmt, ppmt)
-316.49041533656924 -638.340175594

In this case, we are paying much more towards the principal ($638.34) and much less towards the interest ($316.49).

That should be fairly straightforward. But, what if I want to know what my balance is at period 240? Well, then I need to understand the cumulative effect of all my principal payments. This is not as straightforward in pandas. This is where the Excel solution is a little simpler to conceptualize.

In Excel, it is easy to reference the row above and use that value in the current row. Here is the Excel version for maintaining the balance due:

Excel Formula for Balance

As you can see, in row 10, the balance formula references row 9. This type of formula is simple in Excel but in pandas a reference like this seems difficult. Your first instinct might be to try writing a loop but we know that is not optimal. Fortunately there is another approach that is more consistent with pandas. I will get to that in a moment. Before we go there, let’s get the basic pandas structure in place.

Looping in Pandas
You should avoid looping code in pandas. It is very inefficient and there are normally better options to accomplish the same goal.

Building the Table

To answer the question about the balance change over time, we need to build a pandas DataFrame from scratch. There are extra steps here (as compared to Excel) but this is a useful adventure into some of the pandas functions I have not discussed previously.

First, let’s build a DateTimeIndex for the next 30 years based on MS (Month Start):

rng = pd.date_range(start_date, periods=Years * Payments_Year, freq='MS')
rng.name = "Payment_Date"
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
               '2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
               '2016-09-01', '2016-10-01',
               ...
               '2045-03-01', '2045-04-01', '2045-05-01', '2045-06-01',
               '2045-07-01', '2045-08-01', '2045-09-01', '2045-10-01',
               '2045-11-01', '2045-12-01'],
              dtype='datetime64[ns]', name='Payment_Date', length=360, freq='MS')

This helpful function creates a range for the next 30 years starting on Jan 1, 2016. The range will be used to build up the basic DataFrame we will use for the amortization schedule. Note that we need to make sure the first period is 1 not 0, hence the need to use the df.index += 1 :

df = pd.DataFrame(index=rng,columns=['Payment', 'Principal', 'Interest', 'Addl_Principal', 'Balance'], dtype='float')
df.reset_index(inplace=True)
df.index += 1
df.index.name = "Period"

Here is what the stub DataFrame looks like:

Payment_Date Payment Principal Interest Addl_Principal Balance
Period
1 2016-01-01 NaN NaN NaN NaN NaN
2 2016-02-01 NaN NaN NaN NaN NaN
3 2016-03-01 NaN NaN NaN NaN NaN
4 2016-04-01 NaN NaN NaN NaN NaN
5 2016-05-01 NaN NaN NaN NaN NaN

This looks similar to what we have in Excel so we’re on the right track.

Adding the payment is easy because it is a simple formula that produces a consistent value.

df["Payment"] = np.pmt(Interest_Rate/Payments_Year, Years*Payments_Year, Principal)

However the interest and principal change over time. Fortunately the formula is based on the period which we have available in our DataFrame as df.index . We can reference it in our formula to get the unique values for the specified period:

df["Principal"] = np.ppmt(Interest_Rate/Payments_Year, df.index, Years*Payments_Year, Principal)
df["Interest"] = np.ipmt(Interest_Rate/Payments_Year, df.index, Years*Payments_Year, Principal)

The final step is to add the Additional Principal (as a negative number) and round the values:

# Convert to a negative value in order to keep the signs the same
df["Addl_Principal"] = -Addl_Principal
df = df.round(2)

The table is starting to come together:

Payment_Date Payment Principal Interest Addl_Principal Curr_Balance
Period
1 2016-01-01 -954.83 -288.16 -666.67 -50 NaN
2 2016-02-01 -954.83 -289.12 -665.71 -50 NaN
3 2016-03-01 -954.83 -290.09 -664.74 -50 NaN
4 2016-04-01 -954.83 -291.06 -663.78 -50 NaN
5 2016-05-01 -954.83 -292.03 -662.81 -50 NaN

All that’s left is figuring out how to manage the Curr_Balance column. Before I show you the better solution (I won’t say best because I would not be surprised if there is an even better option), I am going to show you the ugly approach I first took.

Maintaining the Balance - Try 1

Do Not Try This At Home
The code below works but it is not the optimal way to solve this problem in pandas.

I am showing this example because I suspect many novice pandas users would go down this path when trying to solve a similar problem. It also shows how a little time spent thinking about the solution yields a much better approach than just charging in with the first idea that comes to mind.

First, we calculate the balance for the first period by doing the calculation for the first row:

df["Balance"] = 0
df.loc[1, "Balance"] = Principal + df.loc[1, "Principal"] + df.loc[1, "Addl_Principal"]
Payment_Date Payment Principal Interest Addl_Principal Balance
Period
1 2016-01-01 -954.830591 -288.163924 -666.666667 -50 199661.836076
2 2016-02-01 -954.830591 -289.124471 -665.706120 -50 0.000000
3 2016-03-01 -954.830591 -290.088219 -664.742372 -50 0.000000
4 2016-04-01 -954.830591 -291.055180 -663.775411 -50 0.000000
5 2016-05-01 -954.830591 -292.025364 -662.805227 -50 0.000000

It works but it’s starting to get a little cumbersome.

My next step was to loop through each row and calculate the balance:

for i in range(2, len(df)+1):
    # Get the previous balance as well as current payments
    prev_balance = df.loc[i-1, 'Balance']
    principal = df.loc[i, 'Principal']
    addl_principal = df.loc[i, "Addl_Principal"]

    # If there is no balance, then do 0 out the principal and interest
    if prev_balance == 0:
        df.loc[i, ['Payment', 'Principal', 'Interest', 'Balance', 'Addl_Principal']] = 0
        continue

    # If this payment does not pay it off, reduce the balance
    if abs(principal + addl_principal) <= prev_balance:
        df.loc[i, 'Balance'] = principal + prev_balance + addl_principal

    # If it does pay it off, zero out the balance and adjust the final payment
    else:
        # Just adjust the principal down
        if prev_balance <= abs(principal):
            principal = -prev_balance
            addl_principal = 0
        else:
            addl_principal = (prev_balance - abs(principal_payment))
        df.loc[i, 'Balance'] = 0
        df.loc[i, 'Principal'] = principal
        df.loc[i, 'Addl_Principal'] = addl_principal
        df.loc[i, "Payment"] = principal +  df.loc[i, "Interest"]

df = df.round(2)
Payment_Date Payment Principal Interest Addl_Principal Balance
Period
1 2016-01-01 -954.83 -288.16 -666.67 -50 199661.84
2 2016-02-01 -954.83 -289.12 -665.71 -50 199322.71
3 2016-03-01 -954.83 -290.09 -664.74 -50 198982.62
4 2016-04-01 -954.83 -291.06 -663.78 -50 198641.57
5 2016-05-01 -954.83 -292.03 -662.81 -50 198299.54
Payoff Time lines

Oh boy. That works but the code smell is quite intense. At this point, I almost ditched this article because the solution was not very pretty.

I decided to regroup by doing some research and found this post by Brandon Rhodes which helped me re-frame my problem and develop a much better solution.

Maintaining the Balance - Try 2

Calculation are incorrect
Please review the updates at the end of this article. The calculations as described below are not correct in those instances where additional principal payments are being made. Work is underway to resolve. (26-Nov-2016)

After reading Brandon’s article, I realized that by adding an additional column with my cumulative principal payments, I could very easily calculate the balance. The pandas authors realized some of the challenges of calculating results based on prior rows of data so they included several cumulative functions.

In this example, I will use cumsum to build a running total of my principal payments.

df["Cumulative_Principal"] = (df["Principal"] + df["Addl_Principal"]).cumsum()

One thing that is interesting is that with the additional principal payments, I end up with paying more in principal that I originally planned to.

Payment_Date Payment Principal Interest Addl_Principal Curr_Balance Cumulative_Principal
Period
356 2045-08-01 -954.83 -939.07 -15.76 -50 NaN -214012.32
357 2045-09-01 -954.83 -942.20 -12.63 -50 NaN -215004.52
358 2045-10-01 -954.83 -945.35 -9.49 -50 NaN -215999.87
359 2045-11-01 -954.83 -948.50 -6.33 -50 NaN -216998.37
360 2045-12-01 -954.83 -951.66 -3.17 -50 NaN -218000.03

This is obviously not correct so I need to put a floor (or clip ) the results so that I never exceed $200,000 in total principal payments:

df["Cumulative_Principal"] = df["Cumulative_Principal"].clip(lower=-Principal)

Now that I have that out of the way, the Current Balance for any given period is very simple to calculate:

df["Curr_Balance"] = Principal + df["Cumulative_Principal"]
Payment_Date Payment Principal Interest Addl_Principal Curr_Balance Cumulative_Principal
Period
1 2016-01-01 -954.83 -288.16 -666.67 -50 199661.84 -338.16
2 2016-02-01 -954.83 -289.12 -665.71 -50 199322.72 -677.28
3 2016-03-01 -954.83 -290.09 -664.74 -50 198982.63 -1017.37
4 2016-04-01 -954.83 -291.06 -663.78 -50 198641.57 -1358.43
5 2016-05-01 -954.83 -292.03 -662.81 -50 198299.54 -1700.46

Wow. This approach is much simpler than the looping solution I tried in my first iteration. The only thing left is figuring out how to clean up the table if we pay it off early.

The Big Payoff

When an amortization table is built, the assumption is that the payments over each period will just be enough to cover the principal and interest and at the end of the time period, the balance goes to 0. However, there may be scenarios where you want to accelerate the payments in order to pay off the loan earlier. In the example we have been running with, the model includes $50 extra each month.

In order to find the last payment, we want to find the the payment where the Curr_Balance first goes to 0:

Payment_Date Payment Principal Interest Addl_Principal Curr_Balance Cumulative_Principal
Period
340 2044-04-01 -954.83 -890.38 -64.45 -50 1444.24 -198555.76
341 2044-05-01 -954.83 -893.35 -61.48 -50 500.89 -199499.11
342 2044-06-01 -954.83 -896.33 -58.50 -50 0.00 -200000.00
343 2044-07-01 -954.83 -899.32 -55.52 -50 0.00 -200000.00

Based on this view, you can see that our last payment would be in period 342. We can find this value by using idxmax

last_payment = df.query("Curr_Balance <= 0")["Curr_Balance"].idxmax(axis=1, skipna=True)
df.loc[last_payment]
Payment_Date            2044-06-01 00:00:00
Payment                             -954.83
Principal                           -896.33
Interest                              -58.5
Addl_Principal                          -50
Curr_Balance                              0
Cumulative_Principal                -200000
Name: 342, dtype: object

Now we know the last payment period, but astute readers may have noticed that we payed $896.33 + $50 in principal but we only owed $500.89. We can clean this up with a couple of statements using last_payment as the index:

df.loc[last_payment, "Principal"] = -(df.loc[last_payment-1, "Curr_Balance"])
df.loc[last_payment, "Payment"] = df.loc[last_payment, ["Principal", "Interest"]].sum()
df.loc[last_payment, "Addl_Principal"] = 0
Payment_Date Payment Principal Interest Addl_Principal Curr_Balance Cumulative_Principal
Period
338 2044-02-01 -954.83 -884.48 -70.36 -50 3322.04 -196677.96
339 2044-03-01 -954.83 -887.42 -67.41 -50 2384.62 -197615.38
340 2044-04-01 -954.83 -890.38 -64.45 -50 1444.24 -198555.76
341 2044-05-01 -954.83 -893.35 -61.48 -50 500.89 -199499.11
342 2044-06-01 -559.39 -500.89 -58.50 0 0.00 -200000.00

For a final step, we can truncate the DataFrame so that we only include through period 342:

df = df.loc[0:last_payment]

Now we have a complete table, we can summarize and compare results.

Time to Analyze

It has taken some time to pull this solution together but now that we know how to solve the problem, we can put it into a function that allows us to input various scenarios, summarize the results and visualize them in various ways.

I have built an amortization table function that looks like this:

def amortization_table(interest_rate, years, payments_year, principal, addl_principal=0, start_date=date.today()):
    """ Calculate the amortization schedule given the loan details

     Args:
        interest_rate: The annual interest rate for this loan
        years: Number of years for the loan
        payments_year: Number of payments in a year
        principal: Amount borrowed
        addl_principal (optional): Additional payments to be made each period. Assume 0 if nothing provided.
                                   must be a value less then 0, the function will convert a positive value to
                                   negative
        start_date (optional): Start date. Will start on first of next month if none provided

    Returns:
        schedule: Amortization schedule as a pandas dataframe
        summary: Pandas dataframe that summarizes the payoff information
    """

Refer to this notebook for the full code as well as example usage.

You can call it to get summary info as well as the detailed amortization schedule:

schedule1, stats1 = amortization_table(0.05, 30, 12, 100000, addl_principal=0)

Which yields a schedule:

Payment_Date Payment Principal Interest Addl_Principal Curr_Balance Cumulative_Principal
Period
1 2016-12-01 -536.82 -120.15 -416.67 0 99879.85 -120.15
2 2017-01-01 -536.82 -120.66 -416.17 0 99759.19 -240.81
3 2017-02-01 -536.82 -121.16 -415.66 0 99638.03 -361.97
4 2017-03-01 -536.82 -121.66 -415.16 0 99516.37 -483.63
5 2017-04-01 -536.82 -122.17 -414.65 0 99394.20 -605.80

and summary stats:

payoff_date Interest Rate Number of years Period_Payment Payment Principal Addl_Principal Interest
0 11-01-2046 0.05 30 -536.82 -193255.2 -100000.02 0.0 -93255.69

The powerful aspect of this approach is that you can run multiple scenarios and combine them into 1 table:

schedule2, stats2 = amortization_table(0.05, 30, 12, 100000, addl_principal=-200)
schedule3, stats3 = amortization_table(0.04, 15, 12, 100000, addl_principal=0)

# Combine all the scenarios into 1 view
pd.concat([stats1, stats2, stats3], ignore_index=True)
payoff_date Interest Rate Number of years Period_Payment Payment Principal Addl_Principal Interest
0 11-01-2046 0.06 30 -599.55 -215838.00 -99999.92 0.0 -115838.23
1 04-01-2037 0.05 30 -736.82 -131689.78 -51200.00 -48800.0 -80490.16
2 11-01-2031 0.04 15 -739.69 -133144.20 -100000.03 0.0 -33143.85

Finally, because the data is in a DataFrame, we can easily plot the results to see what the payoff time lines look like for the various scenarios:

fig, ax = plt.subplots(1, 1)
schedule1.plot(x='Payment_Date', y='Curr_Balance', label="Scenario 1", ax=ax)
schedule2.plot(x='Payment_Date', y='Curr_Balance', label="Scenario 2", ax=ax)
schedule3.plot(x='Payment_Date', y='Curr_Balance', label="Scenario 3", ax=ax)
plt.title("Pay Off Timelines")
Payoff Time lines

Or, we can look at the interest payments by year:

fig, ax = plt.subplots(1, 1)

y1_schedule = schedule1.set_index('Payment_Date').resample("A")["Interest"].sum().abs().reset_index()
y1_schedule["Year"] = y1_schedule["Payment_Date"].dt.year
y1_schedule.plot(kind="bar", x="Year", y="Interest", ax=ax, label="30 Years @ 5%")

plt.title("Interest Payments");
Payoff Time lines

Obviously there are lots of available options for visualizing the results but this gives you a flavor for some of the options.

Closing Out

Thank you for reading through this example. I have to admit that this was one of my more time consuming articles. It was also one where I learned a lot about how to work with pandas and use some functions that I did not have much familiarity with. I hope this article will help others build their knowledge of python and pandas and might even be useful as a tool to analyze your own mortgage payment options.

Article Updates

Nov-26-2016 - Calculation Accuracy:

  • Based on feedback in the comments and discussions off-line, I realized that the calculations are not correctly working with the extra principal payments. After looking into this in more detail, I figured out that the interest and principal payments do indeed to be recalculated each period which is proving to be problematic in pandas. I am working on a solution but in the meantime want to make sure to note the issue.

    I am keeping the article up since I think it is helpful to show additional pandas functionality but do regret that the results are not correct.

    If you have ideas on how to fix, please let me know.

Dec-19-2016 - Corrected Article:

  • A new article has been posted that contains corrected code to fix the errors identified above.

Dec-13-2019 - Removed ix

  • Removed ix and used loc to be compatible with current version of pandas.
  • Also updated the referenced notebook with the .loc usage

Comments