Free Stanford online course on Statistical Learning (with R) starting on 19 Jan 2015

This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).

The lectures cover all the material in An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013). Since January 5, 2014, the PDF of this book has been available for free, with the consent of the publisher, on the book website.

Classes Start: Jan 19, 2015
Classes End: Apr 03, 2015
Course Staff: Prof. Trevor Hastie, Prof. Rob Tibshirani
Price: Free

Posted in Data Mining, R

AusDM 2014 Conference Program

The program of the AusDM 2014 conference is now available. It features two keynote talks, one on Learning in Sequential Decision Problems by Prof Peter Bartlett from UC Berkeley, and the other on Making Sense of a Random World through Statistics by Prof Geoff McLachlan from the University of Queensland. It also has a half-day workshop on R and Data Mining, providing hands-on experience of data mining with R. In addition, there will be 24 presentations of accepted papers, covering machine learning, information retrieval, health & bioinformatics, collaborative filtering & recommendation, clustering, data fusion, record linkage and sensor networks.

See the detailed conference program and register for the conference on the AusDM 2014 website.

Posted in Data Mining, R

SBS documentary “The Age of Big Data”

by Yanchang Zhao

“Data is becoming a powerful and most valuable commodity in 21st century. It is leading to scientific insights and new ways of understanding human behaviour. Data can also make you rich. Very rich.”
– SBS documentary “The Age of Big Data”

Last Friday, there was an interesting documentary on SBS, “The Age of Big Data”. It presented applications of data mining and big data in crime detection, medicine, financial markets, advertising and astronomy.

It started with Los Angeles police officers driving a car with a laptop in front of them, which guided them to possible crime hotspots for the next 24 hours, as predicted by data mining models. University researchers have used similar models to predict earthquake aftershocks, and they are now using such models to predict human behaviour and crime hotspots.

It then showed applications of DNA and genome analysis in medical diagnosis, the prediction of price variations for trading in financial markets, and decision theory used by NASA to select the best of 35 billion possible Man-to-Mars missions. It also showed how data mining is used in advertising to predict what people might want to buy, sometimes picking up clues even before people realize it themselves! It ended with an application in astronomy, where a telescope array is collecting 30 terabytes of data per second to unlock the secrets of the universe.

Although it says nothing about big data techniques like Hadoop, it is an easy-to-understand introduction to data mining and big data for people who know little or nothing about them, like your boss, family and friends. It provides an opportunity to educate them and let them know what you are doing.

The video is available on SBSonDemand and can also be found elsewhere online.

Again, I love the statement given at the very beginning of this post, and am looking forward to getting rich one day. :-)

Posted in Big Data, Data Mining

Call for participation: AusDM 2014, Brisbane, 27-28 November

12th Australasian Data Mining Conference (AusDM 2014)
Brisbane, Australia
27-28 November 2014


The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. Since AusDM’02 the conference has showcased research in data mining, providing a forum for presenting and discussing the latest research and developments.

This year’s conference, AusDM’14, builds on this tradition of facilitating the cross-disciplinary exchange of ideas, experience and potential research directions. Specifically, the conference seeks to showcase: Industry Case Studies; Research Prototypes; Practical Analytics Technology; and Research Student Projects. AusDM’14 will be a meeting place for pushing forward the frontiers of data mining in industry and academia. We have lined up an excellent Keynote Speaker program.


Registration site:
Registration fees:
Standard Registration: $495
Student Standard Registration: $320

If you are registering as a student, contact us by email with evidence that you are an active student. We will issue you a discount code to use on the registration website.


Keynote I: Learning in sequential decision problems
Prof. Peter Bartlett, University of California, Berkeley, USA

Abstract: Many problems of decision making under uncertainty can be formulated as sequential decision problems in which a strategy’s current state and choice of action determine its loss and next state, and the aim is to choose actions so as to minimize the sum of losses incurred.  For instance, in internet news recommendation and in digital marketing, the optimization of interactions with users to maximize long-term utility needs to exploit the dynamics of users. We consider three problems of this kind: Markov decision processes with adversarially chosen transition and loss structures; policy optimization for large scale Markov decision processes; and linear tracking problems with adversarially chosen quadratic loss functions. We present algorithms and optimal excess loss bounds for these three problems. We show situations where these algorithms are computationally efficient, and others where hardness results suggest that no algorithm is computationally efficient.
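To make the setting concrete, here is a minimal sketch of a sequential decision problem in Python. It is not one of the algorithms from the talk: the two states, the two actions and all loss values are made up for illustration, the transitions are deterministic and known in advance, and the planner is plain backward induction over a fixed horizon. The adversarial and large-scale settings named in the abstract are much harder than this toy case.

```python
# Toy sequential decision problem (states, actions and losses are hypothetical).
# TRANSITIONS[state][action] = (loss incurred, next state)
TRANSITIONS = {
    "A": {"stay": (2.0, "A"), "move": (1.0, "B")},
    "B": {"stay": (0.5, "B"), "move": (3.0, "A")},
}


def plan(horizon):
    """Backward induction: return (cost_to_go, policy) for a finite horizon.

    cost_to_go[s] is the smallest total loss achievable from state s;
    policy[t][s] is the action to take in state s at step t.
    """
    cost_to_go = {s: 0.0 for s in TRANSITIONS}  # nothing left to pay after the last step
    policy = []
    for _ in range(horizon):
        new_cost, step_policy = {}, {}
        for state, actions in TRANSITIONS.items():
            # Loss now plus the best achievable loss from the resulting state.
            best_action, best_total = min(
                ((a, loss + cost_to_go[nxt]) for a, (loss, nxt) in actions.items()),
                key=lambda pair: pair[1],
            )
            new_cost[state] = best_total
            step_policy[state] = best_action
        cost_to_go = new_cost
        policy.insert(0, step_policy)  # steps are computed last-to-first
    return cost_to_go, policy


def simulate(start, policy):
    """Follow the policy from a start state and return the sum of losses."""
    state, total = start, 0.0
    for step_policy in policy:
        loss, state = TRANSITIONS[state][step_policy[state]]
        total += loss
    return total


cost, policy = plan(horizon=3)
print(cost)
print(simulate("A", policy))
```

Starting in state A, paying a loss of 1.0 to move to B and then staying there is cheaper over three steps than staying in A, which is what the planner finds.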

Keynote II: Making Sense of a Random World through Statistics
Prof. Geoff McLachlan, University of Queensland, Brisbane, Australia

Abstract: With the growth in data in recent times, it is argued in this talk that there is a need for even more statistical methods in data mining. In so doing, we present some examples in which there is a need to adopt some fairly sophisticated statistical procedures (at least not off-the-shelf methods) to avoid misleading inferences being made about patterns in the data due to randomness. One example concerns the search for clusters in data. Having found an apparent clustering in a dataset, as evidenced in a visualisation of the dataset in some reduced form, the question arises of whether this clustering is representative of an underlying group structure or is merely due to random fluctuations. Another example concerns the supervised classification in the case of many variables measured on only a small number of objects. In this situation, it is possible to construct a classifier based on a relatively small subset of the variables that provides a perfect classification of the data (that is, its apparent error rate is zero). We discuss how statistics is needed to correct for the optimism in these results due to randomness and to provide a realistic interpretation.
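The optimism described in the second example can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the method from the talk: the data are pure Gaussian noise with arbitrary labels, the classifier is a simple nearest-centroid rule, and the correction is leave-one-out cross-validation with the variable selection repeated inside every fold. All function names and sizes (20 objects, 2000 variables, 20 selected variables) are hypothetical.

```python
import random


def make_noise_data(n_objects=20, n_variables=2000, seed=0):
    """Pure noise: the labels carry no information about the variables."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_variables)] for _ in range(n_objects)]
    y = [i % 2 for i in range(n_objects)]  # balanced, arbitrary labels
    return X, y


def class_means(X, y, rows, columns):
    """Per-class mean of each chosen column, using only the given rows."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in columns]
    return means


def select_variables(X, y, rows, k):
    """Keep the k variables whose class means differ most on the given rows."""
    columns = range(len(X[0]))
    means = class_means(X, y, rows, columns)
    gaps = [abs(a - b) for a, b in zip(means[0], means[1])]
    return sorted(columns, key=lambda j: gaps[j], reverse=True)[:k]


def predict(X, y, train_rows, columns, test_row):
    """Nearest-centroid rule restricted to the chosen variables."""
    means = class_means(X, y, train_rows, columns)
    point = [X[test_row][j] for j in columns]

    def squared_distance(label):
        return sum((p - m) ** 2 for p, m in zip(point, means[label]))

    return 0 if squared_distance(0) <= squared_distance(1) else 1


def apparent_error(X, y, k):
    """Select, train and test on the same objects (optimistic)."""
    rows = list(range(len(X)))
    columns = select_variables(X, y, rows, k)
    wrong = sum(predict(X, y, rows, columns, i) != y[i] for i in rows)
    return wrong / len(rows)


def honest_loo_error(X, y, k):
    """Leave-one-out error with variable selection redone inside each fold."""
    rows = list(range(len(X)))
    wrong = 0
    for held_out in rows:
        train = [i for i in rows if i != held_out]
        columns = select_variables(X, y, train, k)  # held-out object is never seen here
        wrong += predict(X, y, train, columns, held_out) != y[held_out]
    return wrong / len(rows)


X, y = make_noise_data()
apparent = apparent_error(X, y, k=20)
honest = honest_loo_error(X, y, k=20)
print(f"apparent error: {apparent:.2f}")
print(f"leave-one-out error with selection inside each fold: {honest:.2f}")
```

Because the labels are unrelated to the variables, no classifier can truly do better than chance here. The apparent error nevertheless tends to come out close to zero, while the leave-one-out estimate tends to sit near 50%, which is the kind of correction the abstract argues for.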


Half-day workshop on R and Data Mining, Thursday afternoon, 27 November
Dr. Yanchang Zhao

The workshop will present an introduction to data mining with R, providing R code examples for classification, clustering, association rules and text mining. Slides for the workshop are available online.

Accepted Papers

Comparison of athletic performances across disciplines and disability classes
Chris Barnes

Factors Influencing Robustness and Effectiveness of Conditional Random Fields in Active Learning Frameworks
Mahnoosh Kholghi, Laurianne Sitbon, Guido Zuccon and Anthony Nguyen

Tree Based Scalable Indexing for Multi-Party Privacy Preserving Record Linkage
Thilina Ranbaduge, Peter Christen and Dinusha Vatsalan

Towards Social Media as a Data Source for Opportunistic Sensor Networking
James Meneghello, Kevin Lee and Nik Thompson

A Case Study of Utilising Concept Knowledge in a Topic Specific Document Collection
Gavin Shaw and Richi Nayak

An Efficient Tagging Data Interpretation and Representation Scheme for Item Recommendation
Noor Ifada and Richi Nayak

Evolving Wavelet Neural Networks for Breast Cancer Classification
Maryam Khan, Stephan Chalup and Alexandre Mendes

Dynamic Class Prediction with Classifier Based Distance Measure
Senay Yasar Saglam and Nick Street

Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors
Yeshey Peden and Richi Nayak

Improving Scalability and Performance of Random Forest Based Learning-to-Rank Algorithms by Aggressive Subsampling
Muhammad Ibrahim and Mark Carman

A Multidimensional Collaborative Filtering Fusion Approach with Dimensionality Reduction
Xiaoyu Tang, Yue Xu, Ahmad Abdel-Hafez and Shlomo Geva

The Schema Last Approach to Data Fusion
Neil Brittliff and Dharmendra Sharma

A Triple Store Implementation to support Tabular Data
Neil Brittliff and Dharmendra Sharma

Pruned Annular Extreme Learning Machine Optimization based on RANSAC Multi Model Response Regularization
Lavneet Singh and Girija Chetty

Automatic Detection of Cluster Structure Changes using Relative Density Self-Organizing Maps
Denny, Pandu Wicaksono and Ruli Manurung

Decreasing Uncertainty for Improvement of Relevancy Prediction
Libiao Zhang, Yuefeng Li and Moch Arif Bijaksana

Identifying Product Families Using Data Mining Techniques in Manufacturing Paradigm
Israt Jahan Chowdhury and Richi Nayak

Market Segmentation of EFTPOS Retailers
Ashishkumar Singh, Grace Rumantir and Annie South

Locality-Sensitive Hashing for Protein Classification
Lawrence Buckingham, James Hogan, Shlomo Geva and Wayne Kelly

Real-time Collaborative Filtering Recommender Systems
Huizhi Liang, Haoran Du and Qing Wang

Pattern-based Topic Modelling for Query Expansion
Yang Gao, Yue Xu and Yuefeng Li

Hartigan’s Method for K-modes Clustering and Its Advantages
Zheng Rong Xiang and Zahidul Islam

Data Cleansing during Data Collection from Wireless Sensor Networks
Md Zahidul Islam, Quazi Mamun and Md Geaur Rahman

Content Based Image Retrieval Using Signature Representation
Dinesha Chathurani Nanayakkara Wasam Uluwitige, Shlomo Geva, Vinod Chandran and Timothy Chappell

Organising Committee

Conference Chairs
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Paul Kennedy, University of Technology, Sydney

Program Chairs (Research)
Lin Liu, University of South Australia, Adelaide
Xue Li, University of Queensland, Brisbane, Australia

Program Chairs (Application)
Kok-Leong Ong, Deakin University, Melbourne
Yanchang Zhao, Department of Immigration & Border Protection, Australia

Sponsorship Chair
Andrew Stranieri, University of Ballarat, Ballarat

Local Chair
Yue Xu, Brisbane, Australia

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, The Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Melbourne
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat
Geoff Webb, Monash University, Melbourne

Join us on LinkedIn

Posted in Data Mining

My visits and RDataMining talks in North America


RDataMining Talk at Twitter

I visited Mexico recently and am now travelling in the US. In the first week of October, I delivered a keynote talk at the CONAIS 2014 conference in Mexico, as well as a one-day workshop on data mining with R for students and researchers at UJAT.

After that, I visited Amazon in Seattle on 8 October and gave a two-hour seminar there on association rule mining and text mining with R. I then visited Google in the Bay Area. Yesterday, 10 October, I visited Twitter in San Francisco and gave a talk there on R and data mining. The work environment at Amazon, Google and Twitter is very nice. I especially enjoyed the delicious free meals provided at Google and Twitter, which are not available at Amazon. It is a pity that I missed Microsoft in Seattle.

I will conclude my travel in the US within a week and then head back to Australia.

The above talks were based on my RDataMining slides series, which can be downloaded from my website.

Posted in Data Mining, R

RDataMining Slides Series

by Yanchang Zhao

I have made a series of slides on R and data mining, based on my book R and Data Mining: Examples and Case Studies. The slides will be used in my seminars for graduate students at Universidad Juárez Autónoma de Tabasco (UJAT), prior to my keynote speech on Analysing Twitter Data with Text Mining and Social Network Analysis at the CONAIS 2014 conference in Mexico in October 2014.

The slides cover seven topics, each available for download as a PDF file.

I will make more slides in the near future, for example on social network analysis with R and big data analysis with R. Stay tuned.

Posted in Data Mining, R

Slides of 12 tutorials at ACM SIGKDD 2014

Slides of 12 tutorials taught by data science experts and thought leaders at ACM SIGKDD 2014 are available online. Below is a list of them.

1. Scaling Up Deep Learning
Yoshua Bengio

2. Constructing and mining web-scale knowledge graphs
Antoine Bordes, Evgeniy Gabrilovich

3. Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies
Jiawei Han, Chi Wang, Ahmed El-Kishky

4. Computational Epidemiology
Madhav Marathe, Naren Ramakrishnan, Anil Kumar S. Vullikanti

5. Management and Analytic of Biomedical Big Data with Cloud-based In-Memory Database and Dynamic Querying: A Hands-on Experience with Real-world Data
Roger Mark, John Ellenberger, Mengling Feng, Mohammad Ghassemi, Thomas Brennan, Ishrar Hussain

6. The Recommender Problem Revisited
Xavier Amatriain, Bamshad Mobasher

7. Correlation clustering: from theory to practice
Francesco Bonchi, David Garcia-Soriano, Edo Liberty

8. Deep Learning
Ruslan Salakhutdinov

9. Network Mining and Analysis for Social Applications
Feida Zhu, Huan Sun, Xifeng Yan

10. Sampling for Big Data
Graham Cormode, Nick Duffield

11. Statistically Sound Pattern Discovery
Geoff Webb, Wilhelmiina Hamalainen

12. Recommendation in Social Media
Jiliang Tang, Jie Tang, Huan Liu

Posted in Big Data, Data Mining