Two free online courses starting soon: Data Analysis (with R) and Social Network Analysis

Two online courses are starting soon on Coursera, and both are free to register for.

1. Data Analysis (with R)

It is an 8-week online course starting on Jan 22nd 2013.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write up data analyses. Then it will cover some of the most popular and widely used statistical methods, such as linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in data analysis.
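To give a flavour of the kind of analysis described above, here is a minimal sketch of a linear regression in R. It is not taken from the course materials; the choice of the built-in cars dataset and the variable names fit, slope and p_value are assumptions made purely for illustration.

```r
# Illustrative only: fit a simple linear regression on the built-in
# cars dataset (stopping distance as a function of speed).
fit <- lm(dist ~ speed, data = cars)

# Pull the estimated slope and its p-value out of the coefficient table.
coefs   <- summary(fit)$coefficients
slope   <- coefs["speed", "Estimate"]
p_value <- coefs["speed", "Pr(>|t|)"]
cat(sprintf("slope = %.3f, p-value = %.3g\n", slope, p_value))

# A quick diagnostic: residuals should be roughly centred on zero.
print(summary(residuals(fit)))
```

The slope is read as the expected change in stopping distance per unit of speed, and the residual summary is a first check for problems with the fit.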

2. Social Network Analysis

It is a 9-week online course starting on March 4th 2013.

This course will use social network analysis, both its theory and computational tools, to make sense of the social and information networks that have been fueled and rendered accessible by the internet.

Posted in Data Mining, R

R code and data for book “R and Data Mining: Examples and Case Studies”

R code and data for the book “R and Data Mining: Examples and Case Studies” are now available online. A PDF version of the book (the first 11 chapters only) can also be downloaded.

Below are its details and table of contents.

Book title: R and Data Mining: Examples and Case Studies
Author: Yanchang Zhao
Publisher: Elsevier
Publish date: December 2012
ISBN: 978-0-123-96963-7
234 pages

Table of Contents
1 Introduction
1.1 Data Mining
1.2 R
1.3 Datasets
1.3.1 The Iris Dataset
1.3.2 The Bodyfat Dataset

2 Data Import and Export
2.1 Save and Load R Data
2.2 Import from and Export to .CSV Files
2.3 Import Data from SAS
2.4 Import/Export via ODBC
2.4.1 Read from Databases
2.4.2 Output to and Input from EXCEL Files

3 Data Exploration
3.1 Have a Look at Data
3.2 Explore Individual Variables
3.3 Explore Multiple Variables
3.4 More Explorations
3.5 Save Charts into Files

4 Decision Trees and Random Forest
4.1 Decision Trees with Package party
4.2 Decision Trees with Package rpart
4.3 Random Forest

5 Regression
5.1 Linear Regression
5.2 Logistic Regression
5.3 Generalized Linear Regression
5.4 Non-linear Regression

6 Clustering
6.1 The k-Means Clustering
6.2 The k-Medoids Clustering
6.3 Hierarchical Clustering
6.4 Density-based Clustering

7 Outlier Detection
7.1 Univariate Outlier Detection
7.2 Outlier Detection with LOF
7.3 Outlier Detection by Clustering
7.4 Outlier Detection from Time Series
7.5 Discussions

8 Time Series Analysis and Mining
8.1 Time Series Data in R
8.2 Time Series Decomposition
8.3 Time Series Forecasting
8.4 Time Series Clustering
8.4.1 Dynamic Time Warping
8.4.2 Synthetic Control Chart Time Series Data
8.4.3 Hierarchical Clustering with Euclidean Distance
8.4.4 Hierarchical Clustering with DTW Distance
8.5 Time Series Classification
8.5.1 Classification with Original Data
8.5.2 Classification with Extracted Features
8.5.3 k-NN Classification
8.6 Discussions
8.7 Further Readings

9 Association Rules
9.1 Basics of Association Rules
9.2 The Titanic Dataset
9.3 Association Rule Mining
9.4 Removing Redundancy
9.5 Interpreting Rules
9.6 Visualizing Association Rules
9.7 Discussions and Further Readings

10 Text Mining
10.1 Retrieving Text from Twitter
10.2 Transforming Text
10.3 Stemming Words
10.4 Building a Term-Document Matrix
10.5 Frequent Terms and Associations
10.6 Word Cloud
10.7 Clustering Words
10.8 Clustering Tweets
10.8.1 Clustering Tweets with the k-means Algorithm
10.8.2 Clustering Tweets with the k-medoids Algorithm
10.9 Packages, Further Readings and Discussions

11 Social Network Analysis
11.1 Network of Terms
11.2 Network of Tweets
11.3 Two-Mode Network
11.4 Discussions and Further Readings

12 Case Study I: Analysis and Forecasting of House Price Indices
12.1 Importing HPI Data
12.2 Exploration of HPI Data
12.3 Trend and Seasonal Components of HPI
12.4 HPI Forecasting
12.5 The Estimated Price of a Property
12.6 Discussion

13 Case Study II: Customer Response Prediction and Profit Optimization
13.1 Introduction
13.2 The Data of KDD Cup 1998
13.3 Data Exploration
13.4 Training Decision Trees
13.5 Model Evaluation
13.6 Selecting the Best Tree
13.7 Scoring
13.8 Discussions and Conclusions

14 Case Study III: Predictive Modeling of Big Data with Limited Memory
14.1 Introduction
14.2 Methodology
14.3 Data and Variables
14.4 Random Forest
14.5 Memory Issue
14.6 Train Models on Sample Data
14.7 Build Models with Selected Variables
14.8 Scoring
14.9 Print Rules
14.9.1 Print Rules in Text
14.9.2 Print Rules for Scoring with SAS
14.10 Conclusions and Discussion

15 Online Resources
15.1 R Reference Cards
15.2 R
15.3 Data Mining
15.4 Data Mining with R
15.5 Classification/Prediction with R
15.6 Time Series Analysis with R
15.7 Association Rule Mining with R
15.8 Spatial Data Analysis with R
15.9 Text Mining with R
15.10 Social Network Analysis with R
15.11 Data Cleansing and Transformation with R
15.12 Big Data and Parallel Computing with R

R Reference Card for Data Mining


General Index

Package Index

Function Index

Posted in Data Mining, R

CFP: DMApps 2013 – Workshop on Data Mining Applications in Industry and Government, submission due by Jan 6, 2013

DMApps 2013: the International Workshop on Data Mining Applications in Industry & Government
In conjunction with PAKDD 2013, Gold Coast, Australia, April 14-17, 2013

The 2013 International Workshop on Data Mining Applications in Industry & Government (DMApps 2013) will provide a platform for industrial data mining practitioners to share knowledge and experience, and a bridge between academia and industry for applying new advanced data mining techniques to industrial applications. The audience will be composed of industrial data mining practitioners, as well as academic researchers who are interested in designing algorithms to meet industrial needs. The workshop will foster collaboration between academia and industry and speed up the transfer of new techniques from academic research to industrial applications.

The workshop focuses on applications of data mining in real-world projects. Topics include, but are not limited to, data mining applications in:
• Finance
• Retail
• Insurance
• Telecommunications
• Crime & Homeland Security
• Stock Market
• Social Welfare
• Social Media
• Medicine and Health
• Education
• Sports
• Transport
• Environment
• Manufacturing
• Government
• Other Fields

Long and Short Papers
Two types of paper can be submitted. A long paper covers research into real-world data mining applications in industry and government. A short paper, from managers and practitioners, covers a challenging and informative issue in data mining: what the issue was, how it was managed and what lessons were learned. The page limit is 12 pages for long papers and 4 pages for short papers. All papers should use a 10pt font size and follow the Springer LNCS/LNAI manuscript submission guidelines. The submission due date is January 6, 2013.

Important Dates
Submission due:                           January 6, 2013
Notification to authors:              January 31, 2013
Camera-ready due:                      February 15, 2013
Workshop date:                            April 14, 2013

Submission Procedure
All papers must be submitted electronically in PDF format. All submitted papers will be reviewed by 2 or 3 reviewers. Selected outstanding long papers presented at the workshop will be included in the LNCS/LNAI post-proceedings of the PAKDD Workshops, published by Springer.

Submitting a paper to the workshop means that if the paper is accepted, at least one author should attend the workshop to present the paper.

Organising Committee
Workshop Chairs

Warwick Graco
Operational Analytics,
Australian Taxation Office

Inna Kolyshkina
Chair of the South Australian Chapter
Australian Institute of Analytics Professionals

Program Chairs

Yanchang Zhao
Department of Immigration & Citizenship,
Australia; and

Clifton Phua
Data Analytics Department,
Institute for Infocomm Research, Singapore

Posted in Data Mining

Call for contribution: the RDataMining package – an R package for data mining

Join the RDataMining project to build a comprehensive R package for data mining

We have started the RDataMining project on R-Forge to build an R package for data mining. The package will provide various functionalities for data mining, with contributions from many R users. If you have developed or will implement any data mining algorithms in R, please participate in the project to make your work available to R users worldwide.

Although there are many R packages for various data mining functionalities, many more new algorithms are designed and published every year without any R implementation. It is far beyond the capability of a single team, or even several teams, to build packages for all upcoming data mining algorithms. On the other hand, many R users have developed their own implementations of new data mining algorithms but, unfortunately, use them for their own work only, without sharing them with other R users. The reason could be that they do not know how, or do not have time, to build packages to share their code, or they might think that it is not worth building a package with only one or two functions.

To foster the development of data mining capability in R and facilitate sharing of data mining code, functions and algorithms among R users, we started this project on R-Forge to collaboratively build an R package for data mining, with contributions from many R users, including ourselves.

How it works
The project works in a way similar to an edited book. We, as organizers, send out a call for participation and invite R users to join this project and contribute their implemented functions and algorithms. The contributed functions will together make up the package.

Function authors will be responsible for the development, maintenance and documentation of their contributed functions. We will put all functions together as one package and also make a manual for the package.

Function authors will be acknowledged as authors of the corresponding functions in the help documentation and manual of the package. We, as the organizers of the package, will be shown as the manager/maintainer of the whole package.

It’s free to join or quit the project at any time, and authors can withdraw their contributed functions at any time.
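As an illustration of what a contributed function with author acknowledgement might look like, here is a minimal sketch. It assumes roxygen2-style documentation comments, which this post does not specify; the function rescale01 and the author name are hypothetical placeholders, not part of the RDataMining package.

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' Hypothetical example of a contributed function; the author field below
#' is where a contributor would be credited in the generated help page.
#'
#' @param x A numeric vector.
#' @return A numeric vector of the same length as x.
#' @author A. Contributor (placeholder name)
#' @export
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant vector has no spread to rescale, so map it to zeros.
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

print(rescale01(c(2, 4, 6)))
```

Keeping the documentation next to the code in this way would let each author maintain both together, which matches the division of responsibilities described above.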

The RDataMining package and project are hosted on R-Forge.

Yanchang Zhao <yanchang at>

Join the RDataMining Project, and we will work together to build a comprehensive R package for data mining.

Posted in Data Mining, R

CFP: AusDM 2012, deadline extended to 31 August 2012

The Tenth Australasian Data Mining Conference (AusDM 2012)
Sydney, Australia
5-7 December 2012

Deadline extended to 31 August 2012

The Australasian Data Mining Conference has established itself as the premier
Australasian meeting for both practitioners and researchers in data mining.
Since AusDM’02 the conference has showcased research in data mining,
providing a forum for presenting and discussing the latest research and
developments. This year’s conference, AusDM’12, co-hosted with the Australian
Joint Conference on Artificial Intelligence, builds on this tradition of
facilitating the cross-disciplinary exchange of ideas, experience and
potential research directions. Specifically, the conference seeks to showcase:
Industry Case Studies; Research Prototypes; Practical Analytics Technology;
and Research Student Projects. AusDM’12 will be a meeting place for pushing
forward the frontiers of data mining in industry and academia.

Publication and topics

We are calling for papers, both research and applications, and from both
academia and industry, for presentation at the conference. All papers will go
through double-blind peer review by a panel of international experts.
Accepted papers will be published in an upcoming volume (Data Mining and
Analytics 2012) of the Conferences in Research and Practice in Information
Technology (CRPIT) series by the Australian Computer Society, which is also
held in full text on the ACM Digital Library. We require that at least one
author of each accepted paper register for the conference and present
their work. Authors of selected papers will be invited to submit extended
versions to the Journal of Research and Practice in Information Technology.

Topics of interest include:

- Applications and Case Studies | Lessons and Experiences
- Biomedical and Health Data Mining
- Business Analytics
- Data Integration, Matching and Linkage
- Data Preparation, Cleaning and Preprocessing
- Data Stream Mining
- Evaluation of Results and their Communication
- Link, Graph, Network and Process Mining
- Multimedia Data Mining
- New Data Mining Algorithms
- Privacy-preserving Data Mining
- Spatial and Temporal Data Mining
- Text Mining and Web Mining
- Visual Analytics

Submission of papers

The length of submissions is not restricted, but we encourage submissions of
6-10 pages. We will use a double-blind review process, i.e. paper
submissions must NOT include authors' names or affiliations (or
acknowledgements referring to funding bodies). Self-citing references should
also be removed from the submitted papers (they can be added back after the
review) to support double-blind reviewing.

Paper submissions are required to follow the general format specified for
papers in the CRPIT series. LaTeX is suggested. The electronic submissions
should be in PDF and made through the AusDM’12 Submission Page.

Important Dates

Submission of full papers:              31 August 2012 (extended)
Notification of authors:                1 October 2012
Final version and author registration:  15 October 2012
Conference:                             5-7 December 2012

Organising Committee

Program Chairs
Yanchang Zhao, Department of Immigration & Citizenship, Australia; and
Jiuyong Li, University of South Australia, Adelaide

Conference Chairs
Peter Christen, Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Victoria
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat

Posted in Data Mining

Examples of profiling R code

by Yanchang Zhao

Below are simple examples of profiling R code, which helps to find out which steps or functions are the most time-consuming. Profiling is very useful for improving the efficiency of R code.

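# profiling of running time
Rprof("myFunction.out")                      # start the profiler
y <- myFunction(x)  # this is the function to profile
Rprof(NULL); summaryRprof("myFunction.out")  # stop the profiler, then summarise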

The example below profiles memory as well. Memory allocation can also be profiled with function Rprofmem().

# profiling of both time and memory
Rprof("myFunction.out", memory.profiling = TRUE)
y <- myFunction(x)
Rprof(NULL)  # stop profiling before reading the output file
summaryRprof("myFunction.out", memory = "both")
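The snippets above assume that myFunction and x already exist. For readers who want something that runs as-is, here is a self-contained sketch of the same time-profiling workflow. The body of myFunction, the input size and the temporary file name are all assumptions chosen for illustration, and the sketch assumes an R build with profiling support, which is the default on most platforms.

```r
# Hypothetical stand-in for the function being profiled: it repeatedly
# sorts random vectors so that the profiler has something to sample.
myFunction <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + sum(sort(runif(1000)))
  }
  total
}

# Write the profile to a temporary file rather than the working directory.
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)  # start sampling every 10 ms
y <- myFunction(5000)
Rprof(NULL)                        # stop the profiler

# by.self lists the time spent in each function itself, excluding callees.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
print(prof$sampling.time)
```

Functions near the top of by.self are usually the first candidates for optimisation. If the profiled code finishes too quickly to be sampled, increasing the workload or reducing the sampling interval may help.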

A detailed example of profiling R code can be found at

Posted in R

R is reported as being used by about half of all data miners in the 2011 Data Miners Survey

by Yanchang Zhao

R is reported as now being used by close to half of all data miners (47%) in the 2011 Data Miners Survey by Rexer Analytics.

The following is quoted from the survey highlights on data mining tools.

“TOOLS:  R continued its rise this year and is now being used by close to half
of all data miners (47%).  R users report preferring it for being free, open
source, and having a wide variety of algorithms.  Many people also cited R’s
flexibility and the strength of the user community.  In the 2011 survey we
asked R users to tell us more about their use of R.  Read the R user
comments about why they use R (pros), the cons of using R, why they select
their R interface, and how they use R in conjunction with other tools.
STATISTICA is selected as the primary data mining tool by the most data
miners (17%).  Data miners report using an average of 4 software tools
overall.  STATISTICA, KNIME, Rapid Miner and Salford Systems received the
strongest satisfaction ratings in 2011.”

See the survey highlights at

Some insights from R users can be found at

Posted in Data Mining, R