# Development Tutorial
## Getting Started
This tutorial focuses on selecting the development factors. 

Make sure your packages are up to date. For more information on how to update your packages, visit [Keeping Packages Updated](https://chainladder-python.readthedocs.io/en/latest/library/install.html#keeping-packages-updated).

In [1]:
# Black linter, optional
%load_ext lab_black

import pandas as pd
import numpy as np
import chainladder as cl

print("pandas: " + pd.__version__)
print("numpy: " + np.__version__)
print("chainladder: " + cl.__version__)

pandas: 1.4.2
numpy: 1.22.4
chainladder: 0.8.12


## Disclaimer
Note that many of the examples shown might not be applicable in a real-world scenario and are only meant to demonstrate some of the functionality included in the package. The user should always follow all applicable laws, the Code of Professional Conduct, and applicable Actuarial Standards of Practice, and exercise their best actuarial judgement.

## Testing for Violation of Chain Ladder's Assumptions

The chain ladder method is based on the strong assumptions of independence across origin periods and across valuation periods. Mack developed tests to verify if these assumptions hold, and these tests have been implemented in the `chainladder` package.

Before the chain ladder model can be used, we should verify that the data satisfies the underlying assumptions by running these tests at the desired confidence level. If the assumptions are violated, we should consider whether ultimates can be estimated using other models.

There are two main tests that we need to perform:
- The `valuation_correlation` test: 
    - This test checks the assumption of independence across accident years. In practice, it tests for correlation across calendar periods (diagonals) and, by extension, origin periods (rows).
    - An additional parameter, `total`, controls whether the valuation correlation is calculated in total across all valuation years (`True`) or for each valuation year separately (`False`).
    - The test uses a Z-statistic.
- The `development_correlation` test:
    - This test checks the chain ladder assumption that subsequent development factors (columns) are not correlated.
    - The test uses a T-statistic.

In [2]:
raa = cl.load_sample("raa")
print(
    "Are valuation years correlated? Or, are the origins correlated?",
    raa.valuation_correlation(p_critical=0.1, total=True).z_critical.values,
)
print(
    "Are development periods correlated?",
    raa.development_correlation(p_critical=0.5).t_critical.values,
)

Are valuation years correlated? Or, are the origins correlated? [[False]]
Are development periods correlated? [[False]]


Neither test rejects independence for the `raa` triangle, so there is no evidence that the chain ladder model is inappropriate for developing the ultimate amounts. It is worth reviewing Mack's papers to ensure a proper understanding of the methodology and the choice of `p_critical`.

Mack also showed that the valuation correlation can be tested one valuation year at a time. To do so, we set `total` to `False`.

In [3]:
raa.valuation_correlation(p_critical=0.1, total=False).z_critical

Unnamed: 0,1982,1983,1984,1985,1986,1987,1988,1989,1990
1981,False,False,False,False,False,False,False,False,False


Note that the tests are run across all four dimensions of the `Triangle`.

## Estimator Basics

All development methods follow the `sklearn` estimator API.  These estimators have a few properties that are worth getting used to.

We instantiate the estimator with our choice of assumptions.  Where we don't specify any assumptions, defaults are chosen for us.

At this point, we've chosen an estimator and assumptions (even if default), but we have not shown our estimator a `Triangle`.  So far it is merely a set of instructions on how to fit development patterns; no patterns exist yet.

All estimators have a `fit` method and you can pass a triangle to your estimator.  Let's `fit` a `Triangle` in a `Development` estimator.  Let's also assign the estimator to a variable so we can reference attributes about it.

In [4]:
genins = cl.load_sample("genins")
dev = cl.Development().fit(genins)

Now that we have `fit` a `Development` estimator, it has many additional properties that didn't exist before fitting.  For example, we can view the `ldf_`.

In [5]:
dev.ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


We can also view the `cdf_`.

In [6]:
dev.cdf_

Unnamed: 0,12-Ult,24-Ult,36-Ult,48-Ult,60-Ult,72-Ult,84-Ult,96-Ult,108-Ult
(All),14.4466,4.1387,2.3686,1.6252,1.3845,1.2543,1.1547,1.0956,1.0177


We can also convert between LDFs and CDFs using `incr_to_cum()` and `cum_to_incr()`, just as with triangles.

In [7]:
dev.ldf_.incr_to_cum()

Unnamed: 0,12-Ult,24-Ult,36-Ult,48-Ult,60-Ult,72-Ult,84-Ult,96-Ult,108-Ult
(All),14.4466,4.1387,2.3686,1.6252,1.3845,1.2543,1.1547,1.0956,1.0177


In [8]:
dev.cdf_.cum_to_incr()

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177
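
The relationship between the two sets of factors can be sketched in plain Python. This is only an illustration of the arithmetic, assuming no tail factor beyond the last age; the helper names `ldf_to_cdf` and `cdf_to_ldf` are hypothetical and are not part of `chainladder`. The LDF values are the volume-weighted `genins` factors printed in this tutorial.

```python
# Volume-weighted genins LDFs, copied from the output printed in this tutorial.
ldf = [3.490607, 1.747333, 1.457413, 1.173852, 1.103824,
       1.086269, 1.053874, 1.076555, 1.017725]


def ldf_to_cdf(ldfs):
    """Age-to-ultimate factors: product of each LDF with all later LDFs."""
    cdfs = []
    running = 1.0  # assumes no tail factor after the last age
    for factor in reversed(ldfs):
        running *= factor
        cdfs.append(running)
    return cdfs[::-1]


def cdf_to_ldf(cdfs):
    """Age-to-age factors: each CDF divided by the CDF of the next age."""
    following = list(cdfs[1:]) + [1.0]
    return [current / nxt for current, nxt in zip(cdfs, following)]


cdf = ldf_to_cdf(ldf)
print([round(c, 4) for c in cdf])
print([round(f, 4) for f in cdf_to_ldf(cdf)])
```

The first printed list should line up with the `cdf_` output above, and the second recovers the original LDFs.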


Notice these attributes have a trailing underscore (`_`). This is scikit-learn's API convention, as its [documentation](https://scikit-learn.org/dev/developers/develop.html) states, "attributes that have been estimated from the data must always have a name ending with trailing underscore, for example the coefficients of some regression estimator would be stored in a `coef_` attribute after `fit` has been called." In summary, the trailing underscore is scikit-learn's convention for denoting fitted attributes, that is, attributes estimated from the data.

In [9]:
print("Assumption parameter (no underscore):", dev.average)
print("Estimated parameter (underscore):\n", dev.ldf_)

Assumption parameter (no underscore): volume
Estimated parameter (underscore):
           12-24     24-36     36-48     48-60     60-72     72-84     84-96    96-108   108-120
(All)  3.490607  1.747333  1.457413  1.173852  1.103824  1.086269  1.053874  1.076555  1.017725
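
To make the convention concrete, here is a minimal toy estimator. `ToyDevelopment` is a hypothetical class written only for this illustration (it is not the `chainladder` implementation), and the input numbers are made up.

```python
class ToyDevelopment:
    """Hypothetical stand-in for an estimator; not part of chainladder."""

    def __init__(self, average="volume"):
        # Assumption chosen by the user: stored as-is, no trailing underscore.
        self.average = average

    def fit(self, prior, latest):
        # Quantity estimated from data: stored with a trailing underscore.
        if self.average == "volume":
            self.ldf_ = sum(latest) / sum(prior)
        elif self.average == "simple":
            ratios = [l / p for p, l in zip(prior, latest)]
            self.ldf_ = sum(ratios) / len(ratios)
        else:
            raise ValueError("unsupported average: " + str(self.average))
        return self  # returning self allows ToyDevelopment().fit(...).ldf_


est = ToyDevelopment(average="simple")
print(hasattr(est, "ldf_"))  # the fitted attribute does not exist before fit
est.fit(prior=[100.0, 200.0], latest=[150.0, 250.0])
print(est.average, est.ldf_)
```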


## Development Averaging

Now that we have a grounding in triangle manipulation and the basics of estimators, we can start getting more creative with customizing our development factors.

The basic `Development` estimator uses a weighted regression through the origin for estimating parameters. Mack showed that using weighted regressions allows for:
1. `volume` weighted average development patterns
2. `simple` average development factors
3. OLS `regression` estimate of development factor where the regression equation is Y = mX + 0

While he posited this framework to suggest the `MackChainladder` stochastic method, it is an elegant form even for deterministic development pattern selection.
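
The three options above can be reproduced by hand as one weighted least squares fit through the origin with different weights. The sketch below is plain Python, not the library's code; `wls_through_origin` is a hypothetical helper, and the data are the age 12 and age 24 values of the `genins` sample triangle for origins 2001 to 2009.

```python
# Cumulative genins losses at ages 12 and 24 for origins 2001-2009.
age_12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
age_24 = [1124788, 1236139, 1292306, 1418858, 1136350,
          1333217, 1288463, 1421128, 1363294]


def wls_through_origin(x, y, weights):
    """Slope m that minimizes sum(w * (y - m * x) ** 2)."""
    numerator = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    denominator = sum(w * xi * xi for w, xi in zip(weights, x))
    return numerator / denominator


# Weights 1/x reduce to sum(y) / sum(x): the volume-weighted average.
volume = wls_through_origin(age_12, age_24, [1 / xi for xi in age_12])
# Weights 1/x^2 reduce to the plain mean of the individual link ratios.
simple = wls_through_origin(age_12, age_24, [1 / xi**2 for xi in age_12])
# Equal weights give the ordinary least squares slope through the origin.
regression = wls_through_origin(age_12, age_24, [1.0] * len(age_12))

print(round(volume, 4), round(simple, 4), round(regression, 4))
```

The first two values should agree with the 12-24 factors that `average="volume"` and `average="simple"` produce for `genins`.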

In [10]:
genins = cl.load_sample("genins")
genins

Unnamed: 0,12,24,36,48,60,72,84,96,108,120
2001,357848,1124788.0,1735330.0,2218270.0,2745596.0,3319994.0,3466336.0,3606286.0,3833515.0,3901463.0
2002,352118,1236139.0,2170033.0,3353322.0,3799067.0,4120063.0,4647867.0,4914039.0,5339085.0,
2003,290507,1292306.0,2218525.0,3235179.0,3985995.0,4132918.0,4628910.0,4909315.0,,
2004,310608,1418858.0,2195047.0,3757447.0,4029929.0,4381982.0,4588268.0,,,
2005,443160,1136350.0,2128333.0,2897821.0,3402672.0,3873311.0,,,,
2006,396132,1333217.0,2180715.0,2985752.0,3691712.0,,,,,
2007,440832,1288463.0,2419861.0,3483130.0,,,,,,
2008,359480,1421128.0,2864498.0,,,,,,,
2009,376686,1363294.0,,,,,,,,
2010,344014,,,,,,,,,


We can also print the `age_to_age` factors.

In [11]:
genins.age_to_age

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,1.2377,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,4.568,1.5471,1.7118,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,2.0157,,,,,,,
2009,3.6192,,,,,,,,


We can also color-code the factors with `heatmap()`.

In [12]:
genins.age_to_age.heatmap()



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,1.2377,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,4.568,1.5471,1.7118,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,2.0157,,,,,,,
2009,3.6192,,,,,,,,


In [13]:
vol = cl.Development(average="volume").fit(genins).ldf_
vol

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


In [14]:
sim = cl.Development(average="simple").fit(genins).ldf_
sim

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5661,1.7456,1.452,1.181,1.1112,1.0848,1.0527,1.0748,1.0177


In most cases, estimator attributes are `Triangle`s themselves and can be manipulated just like raw triangles.

In [15]:
print("LDF Type: ", type(vol))
print("Difference between volume and simple average:")
vol - sim

LDF Type:  <class 'chainladder.core.triangle.Triangle'>
Difference between volume and simple average:


Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),-0.0755,0.0018,0.0055,-0.0071,-0.0074,0.0015,0.0011,0.0018,


We can specify how the LDFs are averaged independently for each age-to-age period. For example, we can use `volume` averaging on the first pattern, `simple` on the second, `regression` on the third, and then repeat the cycle three times for the 9 age-to-age factors that we need. Note that the list of selected methods must be of the same length as the number of age-to-age factors.

In [16]:
cl.Development(average=["volume", "simple", "regression"] * 3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7456,1.4619,1.1739,1.1112,1.0873,1.0539,1.0748,1.0177


As another example, we can use `volume`-weighting for the first factor, `simple` averaging for the next 5 factors, and `volume`-weighting for the last 3 factors.

In [17]:
cl.Development(average=["volume"] + ["simple"] * 5 + ["volume"] * 3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7456,1.452,1.181,1.1112,1.0848,1.0539,1.0766,1.0177
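
Since the list must line up with the age-to-age periods, it can help to build it with ordinary list arithmetic and check its length before passing it to `average`. The snippet below is a plain-Python sketch; the period labels are copied from the `genins` output above.

```python
periods = ["12-24", "24-36", "36-48", "48-60", "60-72",
           "72-84", "84-96", "96-108", "108-120"]

# Repeating a three-method cycle three times gives 9 entries.
cycle = ["volume", "simple", "regression"] * 3
# Concatenating blocks: 1 + 5 + 3 = 9 entries.
blocks = ["volume"] + ["simple"] * 5 + ["volume"] * 3

for spec in (cycle, blocks):
    # One method per age-to-age factor.
    assert len(spec) == len(periods)

# Pair each period with its method to review the selection.
for period, method in zip(periods, blocks):
    print(period, method)
```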


## Averaging Period

`Development` comes with an `n_periods` parameter that allows you to select the latest `n` origin periods for fitting your development patterns. `n_periods=-1` indicates that all available periods are used, which is also the default if the parameter is not specified. The units of `n_periods` follow the `origin_grain` of the underlying triangle.

In [18]:
cl.Development().fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


In [19]:
cl.Development(n_periods=-1).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4906,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


In [20]:
cl.Development(n_periods=3).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4604,1.8465,1.392,1.1539,1.0849,1.0974,1.0539,1.0766,1.0177


Much like `average`, `n_periods` can also be set for each age-to-age period individually.

In [21]:
cl.Development(n_periods=[8, 2, 6, 5, -1, 2, -1, -1, 5]).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5325,1.9502,1.4808,1.1651,1.1038,1.0825,1.0539,1.0766,1.0177


Note that if we provide an `n_periods` value that is greater than the number of periods available for any particular age-to-age period, all available periods will be used instead.

In [22]:
cl.Development(n_periods=[1, 2, 3, 4, 5, 6, 7, 8, 9]).fit(
    genins
).ldf_ == cl.Development(n_periods=[1, 2, 3, 4, 5, 4, 3, 2, 1]).fit(genins).ldf_

True
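
The effect of `n_periods` on a single age-to-age period can be checked by hand. The sketch below is plain Python under the assumption of volume weighting; `latest_n` and `volume_ldf` are hypothetical helpers, and the data are the `genins` values at ages 12 and 24 for origins 2001 to 2009.

```python
# Cumulative genins losses at ages 12 and 24 for origins 2001-2009.
age_12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
age_24 = [1124788, 1236139, 1292306, 1418858, 1136350,
          1333217, 1288463, 1421128, 1363294]


def latest_n(values, n_periods):
    """Keep the latest n_periods entries; -1 or an oversized n keeps everything."""
    if n_periods == -1 or n_periods >= len(values):
        return list(values)
    return list(values[-n_periods:])


def volume_ldf(x, y, n_periods=-1):
    """Volume-weighted factor over the selected origin periods."""
    return sum(latest_n(y, n_periods)) / sum(latest_n(x, n_periods))


print(round(volume_ldf(age_12, age_24), 4))               # all origins
print(round(volume_ldf(age_12, age_24, n_periods=3), 4))  # latest 3 origins
# Asking for more periods than exist falls back to all available periods.
print(volume_ldf(age_12, age_24, n_periods=20) == volume_ldf(age_12, age_24))
```

The first two values should agree with the 12-24 factors shown above for the default and for `n_periods=3`.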

## Discarding Problematic Link Ratios

Even with `n_periods`, there are situations where you might want to be more surgical in your selections. For example, you could have a valuation period with bad data and wish to omit the entire diagonal from your averaging.

In [23]:
cl.Development(drop_valuation="2004").fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7517,1.4426,1.1651,1.1038,1.0863,1.0539,1.0766,1.0177


We can also use Olympic averaging (i.e., excluding the highest and lowest link ratio from each period).

In [24]:
cl.Development(drop_high=True, drop_low=True).fit(genins).ldf_



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5201,1.7277,1.4351,1.193,1.1018,1.0825,1.0573,1.0766,1.0177


`drop_high` and `drop_low` also accept integers. For example, we can drop the highest 3 factors from each period.

In [25]:
cl.Development(drop_high=3).fit(genins).ldf_



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.1614,1.6392,1.3687,1.1222,1.0601,1.0441,1.0539,1.0766,1.0177


There is also a `preserve` parameter, which specifies the minimum number of link ratios required for the LDF calculation. If this minimum is not met, `drop_high` and `drop_low` are ignored for that age. This is especially useful in the tail, where the data is thin.

In [26]:
cl.Development(drop_high=3, drop_low=2, preserve=2).fit(genins).ldf_



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.4108,1.7012,1.4061,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177
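
The interaction of `drop_high`, `drop_low`, and `preserve` can be sketched by hand for a single age-to-age period. This is a plain-Python approximation that assumes volume weighting and assumes the drops are skipped whenever fewer than `preserve` link ratios would remain; `trimmed_volume_ldf` is a hypothetical helper, and the library's internal handling (for example of ties) may differ.

```python
# Cumulative genins losses for the 12-24 period (origins 2001-2009)
# and for the thinner 84-96 period (origins 2001-2003).
age_12 = [357848, 352118, 290507, 310608, 443160, 396132, 440832, 359480, 376686]
age_24 = [1124788, 1236139, 1292306, 1418858, 1136350,
          1333217, 1288463, 1421128, 1363294]
age_84 = [3466336, 4647867, 4628910]
age_96 = [3606286, 4914039, 4909315]


def trimmed_volume_ldf(x, y, drop_high=0, drop_low=0, preserve=1):
    """Volume-weighted factor after dropping the highest/lowest link ratios."""
    ratios = [yi / xi for xi, yi in zip(x, y)]
    drop_high, drop_low = int(drop_high), int(drop_low)  # True -> 1, False -> 0
    keep = set(range(len(ratios)))
    # Assumption: only drop if at least `preserve` (>= 1) link ratios remain.
    if len(ratios) - drop_high - drop_low >= preserve:
        ranked = sorted(keep, key=lambda i: ratios[i])
        keep -= set(ranked[:drop_low])
        keep -= set(ranked[len(ranked) - drop_high:])
    return sum(y[i] for i in keep) / sum(x[i] for i in keep)


print(round(trimmed_volume_ldf(age_12, age_24, drop_high=True, drop_low=True), 4))
print(round(trimmed_volume_ldf(age_12, age_24, drop_high=3), 4))
print(round(trimmed_volume_ldf(age_12, age_24, drop_high=3, drop_low=2, preserve=2), 4))
# Only 3 link ratios exist here, so dropping 3 is skipped.
print(round(trimmed_volume_ldf(age_84, age_96, drop_high=3), 4))
```

The printed values should line up with the 12-24 and 84-96 factors in the corresponding outputs above.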


We can also use an array of booleans or ints.

In [27]:
cl.Development(drop_high=[True, True, False, True], drop_low=[1, 2, 0, 3]).fit(
    genins
).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.5201,1.7685,1.4574,1.2342,1.1038,1.0863,1.0539,1.0766,1.0177


Or maybe there is just a single outlier link-ratio that you don't think is indicative of future development.  For these, you can specify the intersection of the origin and development age of the **denominator** of the link-ratio to `drop`.

In [28]:
genins.age_to_age.heatmap()



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,1.2377,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,4.568,1.5471,1.7118,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,2.0157,,,,,,,
2009,3.6192,,,,,,,,


Let's say we believe the 4.5680 factor from origin 2004 between ages 12 and 24 should be dropped. We can use `drop=('2004', 12)`.

In [29]:
cl.Development(drop=("2004", 12)).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7473,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177


If there is more than one outlier, you can also pass a list of tuples to the `drop` argument.

In [30]:
cl.Development(drop=[("2004", 12), ("2008", 24)]).fit(genins).ldf_

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
(All),3.3797,1.7041,1.4574,1.1739,1.1038,1.0863,1.0539,1.0766,1.0177
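
The arithmetic behind `drop` can be verified by hand. The sketch below is plain Python assuming volume weighting; the `triangle` dictionary holds the first three development ages of `genins`, and `volume_ldf` is a hypothetical helper that skips any link ratio whose denominator cell is listed in `drop`.

```python
# First three development ages of the genins triangle, keyed by origin.
triangle = {
    "2001": {12: 357848, 24: 1124788, 36: 1735330},
    "2002": {12: 352118, 24: 1236139, 36: 2170033},
    "2003": {12: 290507, 24: 1292306, 36: 2218525},
    "2004": {12: 310608, 24: 1418858, 36: 2195047},
    "2005": {12: 443160, 24: 1136350, 36: 2128333},
    "2006": {12: 396132, 24: 1333217, 36: 2180715},
    "2007": {12: 440832, 24: 1288463, 36: 2419861},
    "2008": {12: 359480, 24: 1421128, 36: 2864498},
    "2009": {12: 376686, 24: 1363294},
}


def volume_ldf(triangle, age, next_age, drop=()):
    """Volume-weighted factor, skipping (origin, age) denominators in `drop`."""
    dropped = set(drop)
    numerator = denominator = 0
    for origin, row in triangle.items():
        if age in row and next_age in row and (origin, age) not in dropped:
            denominator += row[age]
            numerator += row[next_age]
    return numerator / denominator


outliers = [("2004", 12), ("2008", 24)]
print(round(volume_ldf(triangle, 12, 24, drop=outliers), 4))
print(round(volume_ldf(triangle, 24, 36, drop=outliers), 4))
```

Both values should agree with the first two factors in the output above.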


## Transformers
In `sklearn`, there are two types of estimators: transformers and predictors. A transformer transforms the input data (X) in some way, and a predictor predicts a new value (or values, Y) using the input data X.

`Development` is a transformer: the returned object is a means to create development patterns, which are used to estimate ultimates, but it is not itself a reserving model (predictor). 

Transformers come with the `transform` and `fit_transform` methods. These return a `Triangle` object augmented with additional information for use in a subsequent IBNR model (a predictor). In the example below, `drop_high` takes an array of boolean values indicating whether the highest factor should be dropped in each LDF calculation; `drop_low` works the same way.

In [31]:
transformed_triangle = cl.Development(drop_high=[True] * 4 + [False] * 5).fit_transform(
    genins
)
transformed_triangle

Unnamed: 0,12,24,36,48,60,72,84,96,108,120
2001,357848,1124788.0,1735330.0,2218270.0,2745596.0,3319994.0,3466336.0,3606286.0,3833515.0,3901463.0
2002,352118,1236139.0,2170033.0,3353322.0,3799067.0,4120063.0,4647867.0,4914039.0,5339085.0,
2003,290507,1292306.0,2218525.0,3235179.0,3985995.0,4132918.0,4628910.0,4909315.0,,
2004,310608,1418858.0,2195047.0,3757447.0,4029929.0,4381982.0,4588268.0,,,
2005,443160,1136350.0,2128333.0,2897821.0,3402672.0,3873311.0,,,,
2006,396132,1333217.0,2180715.0,2985752.0,3691712.0,,,,,
2007,440832,1288463.0,2419861.0,3483130.0,,,,,,
2008,359480,1421128.0,2864498.0,,,,,,,
2009,376686,1363294.0,,,,,,,,
2010,344014,,,,,,,,,


In [32]:
transformed_triangle.link_ratio

Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,,1.5471,,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,,,,,,,,
2009,3.6192,,,,,,,,


Our transformed triangle behaves like our original `genins` triangle.  However, notice that the `link_ratio` values exclude any dropped values you specified.

In [33]:
transformed_triangle.link_ratio.heatmap()



Unnamed: 0,12-24,24-36,36-48,48-60,60-72,72-84,84-96,96-108,108-120
2001,3.1432,1.5428,1.2783,,1.2092,1.0441,1.0404,1.063,1.0177
2002,3.5106,1.7555,1.5453,1.1329,1.0845,1.1281,1.0573,1.0865,
2003,4.4485,1.7167,1.4583,1.2321,1.0369,1.12,1.0606,,
2004,,1.5471,,1.0725,1.0874,1.0471,,,
2005,2.5642,1.873,1.3615,1.1742,1.1383,,,,
2006,3.3656,1.6357,1.3692,1.2364,,,,,
2007,2.9228,1.8781,1.4394,,,,,,
2008,3.9533,,,,,,,,
2009,3.6192,,,,,,,,


In [34]:
print(type(transformed_triangle))
transformed_triangle.latest_diagonal

<class 'chainladder.core.triangle.Triangle'>


Unnamed: 0,2010
2001,3901463
2002,5339085
2003,4909315
2004,4588268
2005,3873311
2006,3691712
2007,3483130
2008,2864498
2009,1363294
2010,344014


However, it has other attributes that make it IBNR model-ready.

In [35]:
transformed_triangle.cdf_

Unnamed: 0,12-Ult,24-Ult,36-Ult,48-Ult,60-Ult,72-Ult,84-Ult,96-Ult,108-Ult
(All),13.1367,3.887,2.2809,1.6131,1.3845,1.2543,1.1547,1.0956,1.0177


`fit_transform()` is equivalent to calling `fit` and `transform` in succession on the same triangle.  Again, this should feel very familiar to the `sklearn` practitioner.

In [36]:
cl.Development().fit_transform(genins) == cl.Development().fit(genins).transform(genins)

True

The reason you might want to use `fit` and `transform` separately is to apply development patterns to a different triangle.  For example, we can:

1. Extract the commercial auto triangles from the `clrd` dataset
2. Summarize to an industry level and `fit` a `Development` object
3. `transform` the individual company triangles with the industry development patterns

In [37]:
clrd = cl.load_sample("clrd")
comauto = clrd[clrd["LOB"] == "comauto"]["CumPaidLoss"]

comauto_industry = comauto.sum()
industry_dev = cl.Development().fit(comauto_industry)

industry_dev.transform(comauto)

Unnamed: 0,Triangle Summary
Valuation:,1997-12
Grain:,OYDY
Shape:,"(157, 1, 10, 10)"
Index:,"[GRNAME, LOB]"
Columns:,[CumPaidLoss]
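
The idea of fitting on one dataset and transforming another can be sketched without the library. Everything below is hypothetical: the company figures are made up (they are not taken from `clrd`), and `fit_volume_ldf` and `transform` are toy stand-ins that only mimic the separation between estimating a pattern and attaching it to other data.

```python
# Hypothetical cumulative losses for two companies (not from the clrd sample).
companies = {
    "A": {"age_12": [100.0, 120.0], "age_24": [150.0, 180.0]},
    "B": {"age_12": [10.0, 12.0], "age_24": [20.0, 18.0]},
}


def fit_volume_ldf(age_12, age_24):
    """Toy 'fit': estimate a single 12-24 volume-weighted factor."""
    return sum(age_24) / sum(age_12)


def transform(company, ldf):
    """Toy 'transform': copy the company data and attach the supplied factor."""
    return {**company, "ldf_": ldf}


# Summarize to an industry level and fit there.
industry_12 = [sum(v) for v in zip(*(c["age_12"] for c in companies.values()))]
industry_24 = [sum(v) for v in zip(*(c["age_24"] for c in companies.values()))]
industry_ldf = fit_volume_ldf(industry_12, industry_24)

# Attach the industry factor to every individual company.
transformed = {name: transform(c, industry_ldf) for name, c in companies.items()}

for name, c in companies.items():
    own = fit_volume_ldf(c["age_12"], c["age_24"])
    print(name, "own:", round(own, 4), "industry:", round(transformed[name]["ldf_"], 4))
```

Each company keeps its own data but carries the same industry factor, which is the behavior the `fit` then `transform` pattern above relies on.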


## Working with Multidimensional Triangles

Several (though not all) of the estimators in `chainladder` can be fit to several triangles simultaneously. While this can be a convenient shorthand, all these estimators use the same assumptions across every triangle.

In [38]:
clrd = cl.load_sample("clrd").groupby("LOB").sum()["CumPaidLoss"]
print("Fitting to " + str(len(clrd.index)) + " lines of business simultaneously.")
cl.Development().fit_transform(clrd).cdf_

Fitting to 6 lines of business simultaneously.


Unnamed: 0,Triangle Summary
Valuation:,2261-12
Grain:,OYDY
Shape:,"(6, 1, 1, 9)"
Index:,[LOB]
Columns:,[CumPaidLoss]


For greater control, you can slice individual triangles out and fit separate patterns to each.

In [39]:
print(cl.Development(average="simple").fit(clrd.loc["wkcomp"]))
print(cl.Development(n_periods=4).fit(clrd.loc["ppauto"]))
print(cl.Development(average="regression", n_periods=6).fit(clrd.loc["comauto"]))

Development(average='simple')
Development(n_periods=4)
Development(average='regression', n_periods=6)
