Write your own version of Mint, part 2

Last time I demonstrated how to write a simple Java program to process your bank statements. Now let’s add more automation by using machine learning (ML) to categorize each transaction automatically: is this grocery or entertainment?

First, I need a training data set. With 6 months' worth of bank statements, I was able to get 400+ transactions by running my program and manually tagging each one with a category. This is the resulting CSV file:

Id,Description,Amount,Category
1,NETFLIX.COM NETFLIX.COM CA,10.94,Entertainment
2,AT&T*BILL PAYMENT WWW.ATT.COM TX,74.88,Phone
3,COSTCO GAS #0006 TUKWILA WA,21.51,Car

Note the following changes:

  • The Id column replaces the Date column so I can use it as the row id, a unique identifier for each record.
  • The Card column is removed because I do not think it helps with model prediction. However, if a certain card is always used to pay a certain type of bill, it could be a valuable feature.

The ML model will consume Description and Amount as features to predict the label Category. ML is all about finding patterns. After skimming through my data set, I think the model will do a good job predicting Grocery, Health, Phone, and Entertainment, but a poor one on Restaurant and Shopping, because restaurant names can be anything. Additionally, 400+ records is a small training set for a multi-class model with ~10 labels. So I expect the model to land at around 50% accuracy.
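Skimming works, but a quick tally of rows per category makes it easier to see which labels are thin. The following is only a minimal sketch: the class name `CategoryCounts` is made up for illustration, and it assumes the training file is comma-separated, has the header row shown above, keeps Category as the last column, and contains no embedded commas or quotes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts how many training rows carry each category label. */
final class CategoryCounts {

    /**
     * Tallies the last comma-separated field of every row after the header.
     * Assumption: no field contains an embedded comma or quote.
     */
    static Map<String, Integer> count(List<String> csvLines) {
        Map<String, Integer> counts = new TreeMap<>();
        // Start at 1 to skip the header row (Id,Description,Amount,Category).
        for (int i = 1; i < csvLines.size(); i++) {
            String line = csvLines.get(i);
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.split(",");
            String category = fields[fields.length - 1].trim();
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Id,Description,Amount,Category",
            "1,NETFLIX.COM NETFLIX.COM CA,10.94,Entertainment",
            "2,AT&T*BILL PAYMENT WWW.ATT.COM TX,74.88,Phone",
            "3,COSTCO GAS #0006 TUKWILA WA,21.51,Car");
        count(sample).forEach((category, n) -> System.out.println(category + ": " + n));
    }
}
```

Categories with only a handful of rows are the ones I would expect the model to struggle with, regardless of how distinctive the merchant names are.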

I am going to use AWS Machine Learning to train a model because it is super easy to use. I upload the CSV file as the training data set with the following input schema:

{
  "version" : "1.0",
  "rowId" : "Id",
  "rowWeight" : null,
  "targetAttributeName" : "Category",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "Id",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "Description",
    "attributeType" : "TEXT"
  }, {
    "attributeName" : "Amount",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "Category",
    "attributeType" : "CATEGORICAL"
  } ],
  "excludedAttributeNames" : [ ]
}

Then I will use this modified recipe:

{
  "groups": {
    "NUMERIC_VARS_QB_500": "group('Amount')"
  },
  "assignments": {},
  "outputs": [
    "ALL_CATEGORICAL",
    "quantile_bin(NUMERIC_VARS_QB_500,500)",
    "ngram(lowercase(no_punct('Description')),3)"
  ]
}

to configure the model settings. The AWS ML service will automatically split the training data set 70%-30%, meaning a randomly selected 70% of the data is used for training while the remaining 30% is used for evaluation. It takes a few minutes for the service to finish building the model and running the evaluation. At the end, my model shows an F1 score of 0.600, not bad at all!
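To make the text transformation in the second JSON block (what Amazon ML calls a recipe) more concrete, here is a rough Java approximation of `ngram(lowercase(no_punct('Description')),3)`. It is a sketch of my understanding of the documented behavior, not the service's actual implementation: tokens are split on whitespace, punctuation is trimmed only from the start and end of each token, and all runs of 1 to 3 consecutive words are emitted. The class and method names are hypothetical, and the real service may tokenize or join n-grams differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical approximation of the recipe's Description transformation. */
final class DescriptionNgrams {

    /** Lowercases, splits on whitespace, and trims punctuation from both ends of each token. */
    static List<String> tokens(String description) {
        List<String> out = new ArrayList<>();
        for (String raw : description.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            // Strip leading/trailing ASCII punctuation only; punctuation inside a token stays.
            String token = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Returns every run of 1..maxSize consecutive tokens, joined by a single space. */
    static List<String> ngrams(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int size = 1; size <= maxSize; size++) {
            for (int start = 0; start + size <= tokens.size(); start++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One of the sample rows from the training file.
        System.out.println(ngrams(tokens("COSTCO GAS #0006 TUKWILA WA"), 3));
    }
}
```

Seen this way, a merchant like Costco or Netflix contributes the same few tokens on every statement, while each new restaurant contributes tokens the model has never seen, which matches my guess about where it will do well.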
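For readers who want to reproduce the evaluation number offline: as far as I know, the F1 score reported for a multi-class model is a macro average, that is, the unweighted mean of the per-class F1 scores. The sketch below computes that from two parallel lists of labels. The class name `MacroF1` is hypothetical, and the choice to average over every class that appears in either list is my assumption, not something confirmed by the service.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: macro-averaged F1 over parallel lists of actual and predicted labels. */
final class MacroF1 {

    /** Unweighted mean of per-class F1 over every class seen in either list (an assumption). */
    static double score(List<String> actual, List<String> predicted) {
        if (actual.size() != predicted.size()) {
            throw new IllegalArgumentException("actual and predicted must have the same length");
        }
        Set<String> classes = new TreeSet<>(actual);
        classes.addAll(predicted);
        if (classes.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String c : classes) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < actual.size(); i++) {
                boolean isActual = actual.get(i).equals(c);
                boolean isPredicted = predicted.get(i).equals(c);
                if (isActual && isPredicted) tp++;
                else if (isPredicted) fp++;
                else if (isActual) fn++;
            }
            // Per-class F1 = 2TP / (2TP + FP + FN); guard against a zero denominator.
            int denom = 2 * tp + fp + fn;
            sum += denom == 0 ? 0.0 : (2.0 * tp) / denom;
        }
        return sum / classes.size();
    }

    public static void main(String[] args) {
        List<String> actual = Arrays.asList("Car", "Car", "Phone", "Phone");
        List<String> predicted = Arrays.asList("Car", "Phone", "Phone", "Phone");
        System.out.println(score(actual, predicted));
    }
}
```

Because every class counts equally in a macro average, a few badly predicted small categories can pull the score down even when the large categories are predicted well.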

By the end of Q2, I will use this model to predict the categories! Why Q2? Because I release our financial report once every quarter 🙂
