Tuesday, February 17, 2009

Data Preparation

My view is that feature extraction (also called variable creation or data preparation) is the single activity with the largest influence on the quality of data mining results. I believe that feature extraction has an order of magnitude more impact on the final result of a data mining exercise than the algorithm used.

I have several motivations for data preparation:

1) Current data mining tools cannot directly ingest the normalized data found in OLTP and data warehouse systems. They require the data to be "flattened" or "de-normalized", creating attributes related to the entities of interest (e.g. customers, locations, transactions, suppliers). A minimal sketch of this flattening step follows this list. As an aside, algorithms which could mine the data in its normalized format would be fabulous; it is something I've thought about for years but haven't figured out yet.

2) The creation of variables creates context: it relates an atomic transaction to the complete set of information about a particular entity. For example, the fact that I had a credit card transaction at 2:47pm on February 16, 2009 for $10.11 at Starbucks is of little relevance or use in this form. Relating this transaction to my entire transaction behaviour is the critical leap: this transaction represents 4% of my week's spend, is 10% more than I spent at Starbucks last week, is 25% of my quarterly coffee purchases, etc. (This pattern is sketched together with item 3 below.)

3) The selection of which features or variables to create is guided by my engineering background: develop features which describe all the properties of a system. In an engineering system such as air flow around an aircraft we measure temperature, pressure, density, three-dimensional velocity, viscosity, etc. In a retail system, for example, if we are modelling customer behaviour we should ask what properties describe customer shopping behaviour over a period of time. Recency, frequency, monetary value, product mix, channel mix, consistency, trends, counters of/from major events, time series metrics, etc. are many of the feature categories we can derive from a customer's transaction series. Using this approach we can create millions of features, e.g. days since most recent transaction, number of transactions by month, monetary value of transactions by month. If we calculate these 3 metrics by month over the last 3 years for 20 product categories, we have 3 x 12 x 3 x 20 = 2,160 variables. If we further calculate month-over-month and year-over-year changes, we can roughly double that again. Depending on the granularity of time and product, you can easily see how we could derive millions of variables. The largest number of variables I created for a data mining project was on the order of 25,000; I typically create ~1,000 attributes. The second sketch after this list shows this grid in code.
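To make item 1 concrete, here is a minimal flattening sketch in Python using pandas. The table and column names (customer_id, category, amount) are hypothetical, and in practice the joins and aggregates would usually be pushed into SQL or an ETL tool first; this only illustrates the shape of the transformation.

```python
import pandas as pd

# Hypothetical normalized transaction table, one row per transaction,
# as it might arrive from an OLTP system.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "category":    ["coffee", "grocery", "coffee", "coffee", "fuel"],
    "amount":      [10.11, 54.20, 3.75, 4.10, 40.00],
})

# "Flatten" to one row per customer: spend and transaction count
# per product category become columns.
flat = transactions.pivot_table(
    index="customer_id",
    columns="category",
    values="amount",
    aggfunc=["sum", "count"],
    fill_value=0,
)

# Collapse the (aggregate, category) column pairs into plain names
# such as "sum_coffee" or "count_fuel".
flat.columns = [f"{agg}_{cat}" for agg, cat in flat.columns]
print(flat)
```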
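And for items 2 and 3, a sketch of the time-by-product feature grid, again with hypothetical names: recency plus per-month, per-category frequency and monetary value. The contextual ratios of item 2 (share of the week's spend, and so on) come from dividing a single transaction's amount by aggregates of exactly this kind.

```python
import pandas as pd

# Hypothetical transaction history with a timestamp per transaction.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "category":    ["coffee", "coffee", "grocery", "fuel", "coffee"],
    "amount":      [10.11, 9.20, 54.20, 40.00, 3.75],
    "date": pd.to_datetime(
        ["2009-02-16", "2009-02-09", "2009-01-30", "2009-02-10", "2008-12-01"]
    ),
})
as_of = pd.Timestamp("2009-02-17")

# Recency: days since each customer's most recent transaction.
recency = (as_of - tx.groupby("customer_id")["date"].max()).dt.days

# Frequency and monetary value per calendar month per category:
# every (metric, month, category) cell becomes its own variable.
tx["month"] = tx["date"].dt.to_period("M")
grid = tx.pivot_table(
    index="customer_id",
    columns=["month", "category"],
    values="amount",
    aggfunc=["count", "sum"],
    fill_value=0,
)
grid.columns = ["_".join(str(part) for part in col) for col in grid.columns]

features = grid.assign(days_since_last_tx=recency)
# Over 36 months and 20 categories the same pivot would yield the
# 2,160-variable grid described above; month-over-month and
# year-over-year deltas would roughly double it.
print(features.shape)
```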

The exercise in step 3 results in a flattened data set where we have thousands of variables for each entity that we are modelling. This extensive set of variables provides a very rich description of an entity's behaviour over a period of time. It provides the context about an entity which allows us to make intelligent conclusions, predictions or assumptions about that entity, which are useful for improving decision making.

I apply the same variable creation method to all problems in all industries. The categories recency, frequency, monetary value, product mix, channel mix, trends, stochastics, counters, etc. all have analogies in every problem I have encountered. I prefer to describe a system and let the algorithms and feature selection approaches determine which features are important to the problem at hand, as opposed to hypothesizing a small set of features based on my knowledge. A toy illustration of such a filter follows.
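This is a crude univariate correlation filter in Python/numpy, standing in for whatever selection machinery a real data mining tool provides; the data, the function name and the threshold are made up for the example.

```python
import numpy as np

def top_k_by_correlation(X, y, k=50):
    """Rank candidate features by absolute Pearson correlation with the
    target and keep the k strongest: a crude univariate filter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre, then correlate each column of X with y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / denom, 0.0)
    return np.argsort(-np.abs(corr))[:k]

# Example: 200 entities described by 1,000 machine-generated features,
# only one of which actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
y = 2 * X[:, 3] + rng.normal(size=200)
print(top_k_by_correlation(X, y, k=5))  # column 3 should rank at or near the top
```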

In my experience, I can generate better data mining results through data preparation or feature extraction, building a rich description of a system's entities over time and relating atomic transactions or events to the entity's behaviour as a whole, than by focusing on improving an algorithm's performance.

Comments welcome.