Data Preparation#

Getting Started#

This tutorial focuses on data preparation, including a brief discussion of how best to prepare your data so that it works well with the chainladder package.

Make sure your packages are up to date. For more information on how to update your packages, visit Keeping Packages Updated.

# Black linter, optional
%load_ext lab_black

import pandas as pd
import numpy as np
import chainladder as cl
import matplotlib.pyplot as plt

print("pandas: " + pd.__version__)
print("numpy: " + np.__version__)
print("chainladder: " + cl.__version__)
pandas: 2.1.4
numpy: 1.24.3
chainladder: 0.8.18

Disclaimer#

Note that many of the examples shown might not be applicable in a real-world scenario; they are only meant to demonstrate some of the functionality included in the package. The user should always follow all applicable laws, the Code of Professional Conduct, applicable Actuarial Standards of Practice, and exercise their best actuarial judgement.

Converting Triangle Data into Long Format#

One of the most commonly asked questions is whether the data needs to be in the tabular long format, as opposed to the already processed triangle format, when loading it for use.

Unfortunately, the chainladder package requires the data to be in long format.

Suppose you have a wide triangle.

df = cl.load_sample("raa").to_frame(origin_as_datetime=True)
df
12 24 36 48 60 72 84 96 108 120
1981-01-01 5012.0 8269.0 10907.0 11805.0 13539.0 16181.0 18009.0 18608.0 18662.0 18834.0
1982-01-01 106.0 4285.0 5396.0 10666.0 13782.0 15599.0 15496.0 16169.0 16704.0 NaN
1983-01-01 3410.0 8992.0 13873.0 16141.0 18735.0 22214.0 22863.0 23466.0 NaN NaN
1984-01-01 5655.0 11555.0 15766.0 21266.0 23425.0 26083.0 27067.0 NaN NaN NaN
1985-01-01 1092.0 9565.0 15836.0 22169.0 25955.0 26180.0 NaN NaN NaN NaN
1986-01-01 1513.0 6445.0 11702.0 12935.0 15852.0 NaN NaN NaN NaN NaN
1987-01-01 557.0 4020.0 10946.0 12314.0 NaN NaN NaN NaN NaN NaN
1988-01-01 1351.0 6947.0 13112.0 NaN NaN NaN NaN NaN NaN NaN
1989-01-01 3133.0 5395.0 NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-01 2063.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN

You can use pandas to unstack the data into the long format.

df = df.unstack().dropna().reset_index()
df.head(10)
level_0 level_1 0
0 12 1981-01-01 5012.0
1 12 1982-01-01 106.0
2 12 1983-01-01 3410.0
3 12 1984-01-01 5655.0
4 12 1985-01-01 1092.0
5 12 1986-01-01 1513.0
6 12 1987-01-01 557.0
7 12 1988-01-01 1351.0
8 12 1989-01-01 3133.0
9 12 1990-01-01 2063.0

Let’s clean up our column names before we get too far.

df.columns = ["age", "origin", "values"]
df.head()
age origin values
0 12 1981-01-01 5012.0
1 12 1982-01-01 106.0
2 12 1983-01-01 3410.0
3 12 1984-01-01 5655.0
4 12 1985-01-01 1092.0

Next, we will need a valuation column (think Schedule P style triangle). For example, origin year 1981 at age 12 corresponds to valuation year 1981 + 12/12 - 1 = 1981.

df["valuation"] = (df["origin"].dt.year + df["age"] / 12 - 1).astype(int)
df.head()
age origin values valuation
0 12 1981-01-01 5012.0 1981
1 12 1982-01-01 106.0 1982
2 12 1983-01-01 3410.0 1983
3 12 1984-01-01 5655.0 1984
4 12 1985-01-01 1092.0 1985

Now, we are finally ready to load it into the chainladder package!

cl.Triangle(
    df, origin="origin", development="valuation", columns="values", cumulative=True
)
12 24 36 48 60 72 84 96 108 120
1981 5,012 8,269 10,907 11,805 13,539 16,181 18,009 18,608 18,662 18,834
1982 106 4,285 5,396 10,666 13,782 15,599 15,496 16,169 16,704
1983 3,410 8,992 13,873 16,141 18,735 22,214 22,863 23,466
1984 5,655 11,555 15,766 21,266 23,425 26,083 27,067
1985 1,092 9,565 15,836 22,169 25,955 26,180
1986 1,513 6,445 11,702 12,935 15,852
1987 557 4,020 10,946 12,314
1988 1,351 6,947 13,112
1989 3,133 5,395
1990 2,063
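If your long-format data held incremental amounts instead, you could load it with cumulative=False and accumulate afterwards. A minimal sketch, hypothetically assuming df contained incremental values:

# Hypothetical: df holds incremental values, so flag it as such and
# convert to a cumulative triangle afterwards.
tri = cl.Triangle(
    df, origin="origin", development="valuation", columns="values", cumulative=False
)
tri = tri.incr_to_cum()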

Sparse Triangles#

By default, the chainladder Triangle is a wrapper around a numpy array. Numpy is optimized for high performance, which allows chainladder to achieve decent computational speeds. Despite being fast, numpy can be memory inefficient with triangle data because triangles are inherently sparse: memory is allocated for every cell, even the ones that store no data.

The lower half of an incomplete triangle is generally blank, which means about 50% of the array's size is wasted on empty space. As we include more granular index and column values in our Triangle, the sparsity increases, consuming RAM unnecessarily. Chainladder automatically eliminates this extraneous memory consumption by switching to a sparse array representation when the Triangle becomes sufficiently large.
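If you want to control the backend yourself, recent versions of chainladder expose array-backend options. The following is a minimal sketch, assuming the ARRAY_BACKEND and AUTO_SPARSE options available in recent releases; check your version's documentation before relying on them.

# Assumed options API: force the sparse backend for all Triangles instead
# of relying on the automatic switch; AUTO_SPARSE governs that automatic
# behavior when the backend is numpy.
cl.options.set_option("ARRAY_BACKEND", "sparse")
cl.options.set_option("AUTO_SPARSE", True)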

Let’s load the prism dataset and include each claim number in the index of the Triangle. The dataset is claim level and includes over 130,000 triangles.

prism = cl.load_sample("prism")
prism
Triangle Summary
Valuation: 2017-12
Grain: OMDM
Shape: (34244, 4, 120, 120)
Index: [ClaimNo, Line, Type, ClaimLiability, Limit, Deductible]
Columns: [reportedCount, closedPaidCount, Paid, Incurred]

Let’s also look at the array representation of the Triangle and notice how it is no longer a numpy array, but instead a sparse array.

prism.values
Format: coo
Data Type: float64
Shape: (34244, 4, 120, 120)
nnz: 121178
Density: 6.143513381095148e-05
Read-only: True
Size: 2.8M
Storage ratio: 0.00

The sparse array consumes about 2.9MB of memory. We can also see that its density is very low; this is because an individual claim will exist in at most one origin period. Let's approximate the size of this Triangle if we had used a dense array representation instead, assuming 8 bytes (for float64) of memory for each cell in the array.

print("Dense array size:", np.prod(prism.shape) / 1e6 * 8, "MB.")
print("Sparse array size:", prism.values.nbytes / 1e6, "MB.")
print(
    "Dense array is",
    round((np.prod(prism.shape) / 1e6 * 8) / (prism.values.nbytes / 1e6), 1),
    "times larger!",
)
Dense array size: 15779.6352 MB.
Sparse array size: 2.908272 MB.
Dense array is 5425.8 times larger!

Incremental vs Cumulative Triangles#

Cumulative triangles are naturally denser than those stored in an incremental fashion. While almost all actuarial techniques rely on cumulative triangles, it may be worthwhile to maintain and manipulate triangles as incremental triangles until you are ready to apply a model.

prism.values
Format: coo
Data Type: float64
Shape: (34244, 4, 120, 120)
nnz: 121178
Density: 6.143513381095148e-05
Read-only: True
Size: 2.8M
Storage ratio: 0.00
prism = prism.incr_to_cum()
prism.values
Format: coo
Data Type: float64
Shape: (34244, 4, 120, 120)
nnz: 5750047
Density: 0.00291517360299939
Read-only: True
Size: 219.3M
Storage ratio: 0.01

Our incremental triangle is under 3MB, but when we convert it to a cumulative triangle, it balloons to an astonishing 219MB, despite its density still being under 0.3%!
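If memory is a concern, you can keep working with the incremental form and convert only when a model needs the cumulative view. A minimal sketch using cum_to_incr(), the inverse of incr_to_cum():

# Convert back to incremental form; the sparse array shrinks back to
# roughly its original size. Reapply incr_to_cum() when a cumulative
# triangle is needed for fitting.
prism_incr = prism.cum_to_incr()
prism_incr.values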

Claim-Level Data#

The sparse representation of triangles allows substantially more data to be pushed through chainladder. This gives us some nice capabilities that we would not otherwise have with aggregate data.

For example, we can now drill into the individual claim makeup of any cell in our Triangle. Let’s look at January 2017 claim details at age 12.

claims = prism[prism.origin == "2017-01"][prism.development == 12].to_frame(
    origin_as_datetime=True
)
claims[abs(claims).sum(axis="columns") != 0].reset_index()
ClaimNo Line Type ClaimLiability Limit Deductible reportedCount closedPaidCount Paid Incurred
0 38339 Auto PD False 8000.0 1000 1.0 0.0 0.000000 0.000000
1 38436 Auto PD True 15000.0 1000 1.0 1.0 8337.875863 8337.875863
2 38142 Auto PD True 8000.0 1000 1.0 1.0 7000.000000 7000.000000
3 38195 Auto PD True 20000.0 1000 1.0 1.0 19000.000000 19000.000000
4 38158 Auto PD True 20000.0 1000 1.0 1.0 10686.229420 10686.229420
... ... ... ... ... ... ... ... ... ... ...
155 38393 Auto PD True 8000.0 1000 1.0 1.0 7000.000000 7000.000000
156 38396 Auto PD True 8000.0 1000 1.0 1.0 7000.000000 7000.000000
157 38455 Auto PD True 20000.0 1000 1.0 1.0 9927.351224 9927.351224
158 38457 Auto PD True 15000.0 1000 1.0 1.0 7874.879070 7874.879070
159 38460 Auto PD True 15000.0 1000 1.0 1.0 3125.840672 3125.840672

160 rows × 10 columns

We can also examine the data as the usual aggregated Triangle, by applying sum(). We’ll also apply grain() so we can better visualize the data.

plt.plot(prism["Paid"].sum().grain("OYDM").to_frame(origin_as_datetime=True).T)
[Line plot of cumulative paid amounts by development period, one line per origin year]

With claim level data, we can set a claim large loss cap or create an excess Triangle on the fly.

prism["Capped 100k Paid"] = cl.minimum(prism["Paid"], 100000)
prism["Excess 100k Paid"] = prism["Paid"] - prism["Capped 100k Paid"]
plt.plot(
    prism["Excess 100k Paid"].sum().grain("OYDM").to_frame(origin_as_datetime=True).T
)
[Line plot of paid amounts excess of 100k by development period, one line per origin year]

Claim-Level IBNR Estimates#

Let's see how we can use the API to create claim-level IBNR estimates. When using aggregate actuarial techniques, it makes sense to perform the model fitting at an aggregate level.

We use aggregate data to fit the model to generate reasonable development patterns.

agg_data = prism.sum()[["Paid", "reportedCount"]]
model_cl = cl.Chainladder().fit(agg_data)

With the fitted model, we are not limited to predicting ultimates at the aggregated grain. Let's predict chainladder ultimates at the claim level. Here, we use model_cl, which was fit on agg_data, to make predictions on prism, our claim-level data.

cl_ults = model_cl.predict(prism[["Paid", "reportedCount"]]).ultimate_
cl_ults
Triangle Summary
Valuation: 2261-12
Grain: OMDM
Shape: (34244, 2, 120, 1)
Index: [ClaimNo, Line, Type, ClaimLiability, Limit, Deductible]
Columns: [Paid, reportedCount]

We could stop here, but let's try a Bornhuetter-Ferguson method as well. We will infer an a priori severity from our chainladder model, model_cl, above.

plt.plot(
    (model_cl.ultimate_["Paid"] / model_cl.ultimate_["reportedCount"]).to_frame(
        origin_as_datetime=True
    ),
)
[Line plot of implied severity (ultimate paid / ultimate reported counts) by origin period]

40K seems like a reasonable a priori severity, at least for the last two years or so (2016-2017).

Now, let’s fit an aggregate Bornhuetter-Ferguson model. Like the chainladder example, we fit the model in aggregate (summing all claims) to create a stable model from which we can generate granular predictions. We will use our Chainladder ultimate claim counts as our sample_weight (exposure) for the BornhuetterFerguson method.

paid_bf = cl.BornhuetterFerguson(apriori=40000).fit(
    X=prism["Paid"].sum().incr_to_cum(), sample_weight=cl_ults["reportedCount"].sum()
)
plt.bar(
    paid_bf.ultimate_.grain("OYDM").to_frame(origin_as_datetime=False).index.year,
    paid_bf.ultimate_.grain("OYDM").to_frame(origin_as_datetime=False)["2261-12"],
)
[Bar chart of aggregate Bornhuetter-Ferguson ultimates by origin year]

We can now create claim-level BornhuetterFerguson predictions using our claim-level Triangle. Ideally, the results should tie to the aggregate results.

bf_ults = paid_bf.predict(
    prism["Paid"].incr_to_cum(), sample_weight=cl_ults["reportedCount"]
).ultimate_
plt.bar(
    bf_ults.sum().grain("OYDM").to_frame(origin_as_datetime=False).index.year,
    bf_ults.sum().grain("OYDM").to_frame(origin_as_datetime=False)["2261-12"],
)
[Bar chart of summed claim-level Bornhuetter-Ferguson ultimates by origin year; it matches the aggregate chart above]
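To confirm the tie-out numerically, here is a minimal sanity check (a sketch, not part of the original analysis) comparing the aggregate Bornhuetter-Ferguson ultimates against the sum of the claim-level ultimates:

# The two sets of ultimates should agree up to floating-point noise.
agg_frame = paid_bf.ultimate_.to_frame(origin_as_datetime=False)
claim_frame = bf_ults.sum().to_frame(origin_as_datetime=False)
print(np.allclose(agg_frame.values, claim_frame.values))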