Call for participation: AusDM 2014, Brisbane, 27-28 November

12th Australasian Data Mining Conference (AusDM 2014)
Brisbane, Australia
27-28 November 2014


The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. Since AusDM’02 the conference has showcased research in data mining, providing a forum for presenting and discussing the latest research and developments.

This year’s conference, AusDM’14, builds on this tradition of facilitating the cross-disciplinary exchange of ideas, experience and potential research directions. Specifically, the conference seeks to showcase: Industry Case Studies; Research Prototypes; Practical Analytics Technology; and Research Student Projects. AusDM’14 will be a meeting place for pushing forward the frontiers of data mining in industry and academia. We have lined up an excellent Keynote Speaker program.


Registration site:
Registration fees:
Standard Registration: $495
Student Standard Registration: $320

If you are registering as a student, contact us by email with evidence that you are an active student. We will issue you a discount code to use on the registration website.


Keynote I: Learning in sequential decision problems
Prof. Peter Bartlett, University of California, Berkeley, USA

Abstract: Many problems of decision making under uncertainty can be formulated as sequential decision problems in which a strategy’s current state and choice of action determine its loss and next state, and the aim is to choose actions so as to minimize the sum of losses incurred.  For instance, in internet news recommendation and in digital marketing, the optimization of interactions with users to maximize long-term utility needs to exploit the dynamics of users. We consider three problems of this kind: Markov decision processes with adversarially chosen transition and loss structures; policy optimization for large scale Markov decision processes; and linear tracking problems with adversarially chosen quadratic loss functions. We present algorithms and optimal excess loss bounds for these three problems. We show situations where these algorithms are computationally efficient, and others where hardness results suggest that no algorithm is computationally efficient.

Keynote II: Making Sense of a Random World through Statistics
Prof. Geoff McLachlan, University of Queensland, Brisbane, Australia

Abstract: With the growth in data in recent times, it is argued in this talk that there is a need for even more statistical methods in data mining. In so doing, we present some examples in which there is a need to adopt some fairly sophisticated statistical procedures (at least not off-the-shelf methods) to avoid misleading inferences being made about patterns in the data due to randomness. One example concerns the search for clusters in data. Having found an apparent clustering in a dataset, as evidenced in a visualisation of the dataset in some reduced form, the question arises of whether this clustering is representative of an underlying group structure or is merely due to random fluctuations. Another example concerns the supervised classification in the case of many variables measured on only a small number of objects. In this situation, it is possible to construct a classifier based on a relatively small subset of the variables that provides a perfect classification of the data (that is, its apparent error rate is zero). We discuss how statistics is needed to correct for the optimism in these results due to randomness and to provide a realistic interpretation.
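The second example in the abstract can be illustrated with a small simulation. This is only a sketch under made-up assumptions (10 objects, 5000 pure-noise binary variables, a one-variable classifier, and the function name selection_optimism are all hypothetical choices for illustration, not taken from the talk). It picks the variable that best fits the training labels and then checks how that choice behaves on fresh data.

```python
import random

def selection_optimism(n_train=10, n_features=5000, n_test=1000, seed=0):
    """Pick the single pure-noise feature that best fits the training labels.

    Returns (apparent_error, fresh_error) for the selected feature.
    """
    rng = random.Random(seed)
    # Labels and features are independent coin flips: there is no real signal.
    labels = [rng.randint(0, 1) for _ in range(n_train)]

    best_error = None
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_train)]
        error = sum(f != y for f, y in zip(feature, labels)) / n_train
        if best_error is None or error < best_error:
            best_error = error

    # The selected feature is still noise, so on fresh objects its values are
    # new coin flips that are unrelated to the new labels.
    mistakes = sum(rng.randint(0, 1) != rng.randint(0, 1) for _ in range(n_test))
    return best_error, mistakes / n_test

apparent, fresh = selection_optimism()
print(f"apparent error: {apparent:.2f}, error on fresh data: {fresh:.2f}")
```

With so many candidate variables and so few objects, the apparent error of the selected variable is typically at or near zero, while its error on fresh data stays close to one half. That gap is the optimism that needs a statistical correction.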


Half-day workshop on R and Data Mining, Thursday afternoon, 27 November
Dr. Yanchang Zhao

The workshop will present an introduction to data mining with R, providing R code examples for classification, clustering, association rules and text mining. See workshop slides at

Accepted Papers

Comparison of athletic performances across disciplines and disability classes
Chris Barnes

Factors Influencing Robustness and Effectiveness of Conditional Random Fields in Active Learning Frameworks
Mahnoosh Kholghi, Laurianne Sitbon, Guido Zuccon and Anthony Nguyen

Tree Based Scalable Indexing for Multi-Party Privacy Preserving Record Linkage
Thilina Ranbaduge, Peter Christen and Dinusha Vatsalan

Towards Social Media as a Data Source for Opportunistic Sensor Networking
James Meneghello, Kevin Lee and Nik Thompson

A Case Study of Utilising Concept Knowledge in a Topic Specific Document Collection
Gavin Shaw and Richi Nayak

An Efficient Tagging Data Interpretation and Representation Scheme for Item Recommendation
Noor Ifada and Richi Nayak

Evolving Wavelet Neural Networks for Breast Cancer Classification
Maryam Khan, Stephan Chalup and Alexandre Mendes

Dynamic Class Prediction with Classifier Based Distance Measure
Senay Yasar Saglam and Nick Street

Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors
Yeshey Peden and Richi Nayak

Improving Scalability and Performance of Random Forest Based Learning-to-Rank Algorithms by Aggressive Subsampling
Muhammad Ibrahim and Mark Carman

A Multidimensional Collaborative Filtering Fusion Approach with Dimensionality Reduction
Xiaoyu Tang, Yue Xu, Ahmad Abdel-Hafez and Shlomo Geva

The Schema Last Approach to Data Fusion
Neil Brittliff and Dharmendra Sharma

A Triple Store Implementation to support Tabular Data
Neil Brittliff and Dharmendra Sharma

Pruned Annular Extreme Learning Machine Optimization based on RANSAC Multi Model Response Regularization
Lavneet Singh and Girija Chetty

Automatic Detection of Cluster Structure Changes using Relative Density Self-Organizing Maps
Denny, Pandu Wicaksono and Ruli Manurung

Decreasing Uncertainty for Improvement of Relevancy Prediction
Libiao Zhang, Yuefeng Li and Moch Arif Bijaksana

Identifying Product Families Using Data Mining Techniques in Manufacturing Paradigm
Israt Jahan Chowdhury and Richi Nayak

Market Segmentation of EFTPOS Retailers
Ashishkumar Singh, Grace Rumantir and Annie South

Locality-Sensitive Hashing for Protein Classification
Lawrence Buckingham, James Hogan, Shlomo Geva and Wayne Kelly

Real-time Collaborative Filtering Recommender Systems
Huizhi Liang, Haoran Du and Qing Wang

Pattern-based Topic Modelling for Query Expansion
Yang Gao, Yue Xu and Yuefeng Li

Hartigan’s Method for K-modes Clustering and Its Advantages
Zheng Rong Xiang and Zahidul Islam

Data Cleansing during Data Collection from Wireless Sensor Networks
Md Zahidul Islam, Quazi Mamun and Md Geaur Rahman

Content Based Image Retrieval Using Signature Representation
Dinesha Chathurani Nanayakkara Wasam Uluwitige, Shlomo Geva, Vinod Chandran and Timothy Chappell

Organising Committee

Conference Chairs
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Paul Kennedy, University of Technology, Sydney

Program Chairs (Research)
Lin Liu, University of South Australia, Adelaide
Xue Li, University of Queensland, Brisbane, Australia

Program Chairs (Application)
Kok-Leong Ong, Deakin University, Melbourne
Yanchang Zhao, Department of Immigration & Border Protection, Australia; and

Sponsorship Chair
Andrew Stranieri, University of Ballarat, Ballarat

Local Chair
Yue Xu, Brisbane, Australia

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, The Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Melbourne
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat
Geoff Webb, Monash University, Melbourne

Join us on LinkedIn

Posted in Data Mining

My visits and RDataMining talks in North America


RDataMining Talk at Twitter

I visited Mexico recently and am now travelling in the US. In the first week of October, I delivered a keynote talk at the CONAIS 2014 conference in Mexico, as well as a one-day workshop on data mining with R to students and researchers at UJAT.

After that I visited Amazon in Seattle on 8 October and gave a two-hour seminar there on association rule mining and text mining with R. Then I visited Google in the Bay Area. Yesterday, 10 October, I visited Twitter in San Francisco and gave a talk there on R and data mining. The work environments at Amazon, Google and Twitter are very nice. I especially enjoyed the delicious free meals provided at Google and Twitter, which are not available at Amazon. It is a pity that I missed Microsoft in Seattle.

I will conclude my travels in the US within a week and then head back to Australia.

The above talks were based on my RDataMining slides series, which can be downloaded from the website.

Posted in Data Mining, R

RDataMining Slides Series

by Yanchang Zhao

I have made a series of slides on R and data mining, based on my book titled R and Data Mining: Examples and Case Studies. The slides will be used in my seminars for graduate students at Universidad Juárez Autónoma de Tabasco (UJAT), prior to my keynote speech on Analysing Twitter Data with Text Mining and Social Network Analysis at the CONAIS 2014 conference in Mexico in October 2014.

The slides cover seven topics and can be downloaded as PDF files.

I will make more slides in the near future, on topics such as social network analysis with R and big data analysis with R. Stay tuned with

Posted in Data Mining, R

Slides of 12 tutorials at ACM SIGKDD 2014

Slides of 12 tutorials taught by data science experts and thought leaders at ACM SIGKDD 2014 are available online. Below is a list of them.

1. Scaling Up Deep Learning
Yoshua Bengio

2. Constructing and mining web-scale knowledge graphs
Antoine Bordes, Evgeniy Gabrilovich

3. Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies
Jiawei Han, Chi Wang, Ahmed El-Kishky

4. Computational Epidemiology
Madhav Marathe, Naren Ramakrishnan, Anil Kumar S. Vullikanti

5. Management and Analytic of Biomedical Big Data with Cloud-based In-Memory Database and Dynamic Querying: A Hands-on Experience with Real-world Data
Roger Mark, John Ellenberger, Mengling Feng, Mohammad Ghassemi, Thomas Brennan, Ishrar Hussain

6. The Recommender Problem Revisited
Xavier Amatriain, Bamshad Mobasher

7. Correlation clustering: from theory to practice
Francesco Bonchi, David Garcia-Soriano, Edo Liberty

8. Deep Learning
Ruslan Salakhutdinov

9. Network Mining and Analysis for Social Applications
Feida Zhu, Huan Sun, Xifeng Yan

10. Sampling for Big Data
Graham Cormode, Nick Duffield

11. Statistically Sound Pattern Discovery
Geoff Webb, Wilhelmiina Hamalainen

12. Recommendation in Social Media
Jiliang Tang, Jie Tang, Huan Liu

Posted in Big Data, Data Mining

RDataMining group has 6000 members today

RDataMining Group:
Twitter: @RDataMining

The RDataMining group has 6000 members today, 5 July 2014.

Created in August 2011, this group has developed into a big community with 6000 members within three years. Since its creation, many members have shared their knowledge and experiences on R and data mining in the group, such as posting useful online resources that they came across, or examples that they made themselves. Without their contributions, this group would not have grown to this size today.

Many group members have asked questions and sought help with technical problems, which has also contributed a lot to discussions. Those questions drew many helpful responses and insightful solutions from more experienced group members. Some examples are discussions on good books for learning R and data mining, the best way to determine a threshold, and visualization of decision trees. Such discussions are fantastic for knowledge sharing and group learning, from which many members (including myself) have benefited a lot.

Big thanks to those who have responded to the above questions and discussions, helping other group members and sharing their knowledge and experiences. I appreciate their time and effort in helping others out.

Thanks also go to job advertisers and group members who shared job vacancies in the group. Many group members are interested not only in improving their knowledge and skill set, but also in new opportunities.

Last but not least, I’d also like to thank members for inviting their friends and colleagues to this group, or mentioning this group to people who might be interested in it. Without their recommendations, the group would not have 6000 members today.

Thanks to all members for their contributions. I look forward to more discussions and interactions with you.

If you are not a member yet, please join us for knowledge sharing with 6000 members.

Best Regards
Yanchang Zhao

P.S. I’d like to take this opportunity to introduce two other professional groups that I manage, which are respectively for Australia-wide and Canberra-based data miners.

AusDM: Data Mining & Analytics
This is a group for the AusDM conference, and is also a forum for Australia-wide data miners and analysts to exchange ideas and share experiences.

Canberra Data Miners
We are Canberra data miners. We get together on weekends for bush walking and experience sharing on data mining and analytics. Join us to have fun, get fit, and share knowledge.

Posted in Data Mining, R

Currency Exchange Rate Forecasting with ARIMA and STL

I have made an example of time series forecasting with R, demonstrating currency exchange rate forecasting with the ARIMA and STL models. The example is easy to understand and follow.

R source files are provided to run the example.

The example was produced with R Markdown. If you want to learn R Markdown, you can try the Rmd source file, which is also provided.

Check the example and source files at
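For readers who have not met ARIMA before, here is a rough idea of what the simplest member of that family does. This is a minimal sketch in plain Python, not the R code from the example: it assumes an ARIMA(1,1,0) model without a constant, fitted by least squares, and the numbers are made up rather than real exchange rates. The function names are hypothetical.

```python
def fit_ar1_on_differences(series):
    """Least-squares estimate of phi in d[t] = phi * d[t-1],
    where d is the first difference of the series."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator

def forecast(series, steps):
    """Forecast future values by extrapolating the differences
    and adding them back onto the last observed level."""
    phi = fit_ar1_on_differences(series)
    level = series[-1]
    diff = series[-1] - series[-2]
    predictions = []
    for _ in range(steps):
        diff = phi * diff        # next expected change
        level = level + diff     # undo the differencing
        predictions.append(level)
    return predictions

# Made-up illustrative rates, not real exchange-rate data.
rates = [1.000, 1.040, 1.060, 1.070, 1.075]
print(forecast(rates, 3))
```

The sketch leaves out everything that makes a real ARIMA fit useful in practice (order selection, moving-average terms, prediction intervals), and it does not cover STL at all. In R these are handled by existing functions such as arima() and stl().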


Posted in Data Mining, R

Step-by-Step Guide to Setting Up an R-Hadoop System

by Yanchang Zhao

Following my first R-Hadoop system setup guide written in Sept 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase. I have tested it both on a single computer and on a cluster of computers. The process is described in a newer version of the guide to setting up an R-Hadoop system, which was updated on 30 May 2014. The guide also provides links to MapReduce and Hadoop documents and to examples of R-Hadoop code.

See the detailed guide for full instructions; below is a summary of it.

A list of software used for this setup:
– OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
– Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17
– R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


1. Set up single-node Hadoop
1.1 Download Hadoop
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
1.2.2 Set up remote desktop and enable self-login
1.2.3 Run Hadoop
1.3 Test Hadoop
1.3.1 Example 1 – calculate pi
1.3.2 Example 2 – word count

2. Set up Hadoop in cluster mode
2.1 Switching between different modes
2.2 Set up name node (master machine)
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
2.4 Copy public key
2.5 Firewall
2.6 Set up data nodes (slave machines)
2.7 Format name node
2.8 Run Hadoop
2.9 Test Hadoop

3. Set up HBase
3.1 Set up HBase
3.2 Switching between different modes

4. Install R

5. Install GCC, Homebrew, git, pkg-config and thrift
5.1 Download and install GCC
5.2 Install Homebrew
5.3 Install git and pkg-config
5.4 Install thrift 0.9.0

6. Environment settings: HADOOP_PREFIX and HADOOP_CMD

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
7.3 Install RHadoop packages

8. Run an R job on Hadoop for word counting
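The word count in step 8 follows the usual MapReduce pattern of map, shuffle and reduce. The guide runs it as an R job on Hadoop; the sketch below only imitates that logic in plain Python on a single machine, with no Hadoop involved. The function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_count(["R on Hadoop", "Hadoop runs R jobs"]))
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework does the shuffle, which is why the job is written as two small independent functions.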

If you have successfully built your R-Hadoop system, could you please share your success with R users at this thread? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to post your questions to the above thread or to the RDataMining group at


Posted in Big Data, R