By Yanchang Zhao, RDataMining.com
Building prediction/classification models is one of the most widely-seen data mining tasks in business applications. To share experience on building prediction models with R, I have started a discussion at RDataMining group on LinkedIn with the following questions. And my experience can be found at the end of question list. Pls join our discussion if you are interested.
1. What is your application about?
2. What techniques did you use? E.g., linear/logistic/non-linear regression, decision tree, random forest, neural networks, SVM, k-NN classification.
3. Which tool/function/package did you use? E.g., package rpart or party in R.
4. Why did you choose the above technique/tool/package?
5. Was your data a mixture of numerical and categorical attributes? If yes, what did you do to preprocess data before feeding them into the above techniques/functions?
6. Was your data of imbalanced classes? If yes, what did you do to achieve good prediction results?
7. Did your data have many missing values or extreme values? If yes, what did you do with them?
8. Was your table very wide, i.e., having many attributes/variables? If yes, how did you do variable/feature selection or dimensionality reduction before building predictive models?
Below are my experiences in a business application.
It was an application to model risk of customers. There were two classes, good and bad, labelled as 0 and 1. A model was to build to predict classes of new cases.
I used decision tree, because the tree is easy to understand by business people and managers, and the rules are simple and easily to be accepted by business, as compared to SVM or neural networks. We finally built a tree with less than 30 leaf nodes, that is, less than 30 rules. Although random forest can make a similar model, but it would end up with too many rules. To sum up, decision trees are easy to understand, have good performance, accommodate both categorical and numerical data and also missing values, and produce simple models.
I used ctree() in package party. The reason for this choice is not technical at all. I first tried rpart, but when I plotted rpart tree, I got a tree without any labels, and the tree does not look like a tree at all. Then I tried ctree(), and fell in love with its plot, a nice looking tree with all labels I needed. I believe both packages produce similar results, and my choice of ctreeI() is simply personal preference.
The reason I didn’t use package randomForest is that it cannot handle missing values, and in our application, we didn’t believe filling those missing values would produce good result, because many missing values were introduced when joining tables (e.g., there were no records for a customer in some tables). However, it may work in other applications. I did try cforest() in package party, it produced similar result as ctree(), so I finally chose ctree() because it produces much simpler trees than a random forest. When performance is similar, the simpler, the better.
The data was a mixture of numerical and categorical attributes, and ctree() handles that very well.
There were many missing values. We didn’t fill them. However, for attributes with too many missing values, e.g., if above 60% are missing, we simply excluded the attribute from modelling. SAS EM uses a similar strategy when importing data into EM, and I think the default threshold for missing values are 20% or 30% in EM.
The data were large and also composed of over 300 variables from dozens of tables. We didn’t do any feature selection first, but found that it took too long to train a model, especially with some categorical variables having many value levels. An option is to use a small sample to train models. I took a different way to use as many training data as possible. To make it work, I draw 20 random samples of training data, and built 20 decision trees, one tree for each sample. There are around 20-30 variables in each tree, and many trees share similar set of variables. Then I collected all variables appearing in those trees, and got around 60 variables. After that, I used all original training data for training without any sampling, but with the above 60 variables only. That’s my way to select features in that application, where all training cases were used to build a final model, but with only those attributes having appeared in the 20 trees on sampled data.
More information on R and data mining is available at:
Group on Linkedin: http://group.rdatamining.com
Group on Google: http://group2.rdatamining.com