RDataMining group reaches 6000 members today

RDataMining Group: http://group.rdatamining.com
Twitter: @RDataMining
Website: http://www.RDataMining.com

The RDataMining group has 6000 members today, 5 July 2014.

Created in August 2011, this group has grown into a big community of 6000 members within three years. Since its creation, many members have shared their knowledge and experience of R and data mining in the group, such as useful online resources they came across or examples they built themselves. Without their contributions, the group would not have grown to this size today.

Many group members have asked questions and sought help on technical problems, which has also contributed a lot to discussions. Those questions prompted many helpful responses and insightful solutions from more experienced group members. Some examples are discussions on good books for learning R and data mining, the best way to determine a threshold, and visualisation of decision trees. Such discussions are fantastic for knowledge sharing and group learning, from which many members (including myself) have benefited a lot.

Big thanks to those who have responded to the above questions and discussions, helping other group members and sharing their knowledge and experience. I appreciate their time and effort in helping others out.

Thanks also go to job advertisers and group members who shared job vacancies in the group. Many group members are interested not only in improving their knowledge and skill sets but also in new opportunities.

Last but not least, I’d also like to thank members for inviting their friends and colleagues to this group, or mentioning it to people who might be interested. Without their recommendations, the group would not have 6000 members today.

Thanks to all members for their contributions. I look forward to more discussions and interactions with you.

If you are not a member yet, please join us and share knowledge with 6000 fellow members.

Best Regards
Yanchang Zhao

P.S. I’d like to take this opportunity to introduce two other professional groups that I manage, which are respectively for Australia-wide and Canberra-based data miners.

AusDM: Data Mining & Analytics
URL: http://www.linkedin.com/groups/AusDM-4907891
This is a group for the AusDM conference, and is also a forum for Australia-wide data miners and analysts to exchange ideas and share experiences.

Canberra Data Miners
URL: www.meetup.com/Canberra-Data-Miners
We are Canberra data miners. We get together on weekends for bush walking and experience sharing on data mining and analytics. Join us to have fun, get fit, and share knowledge.

Posted in Data Mining, R

Currency Exchange Rate Forecasting with ARIMA and STL

I have made an example of time series forecasting with R, demonstrating currency exchange rate forecasting with the ARIMA and STL models. The example is easy to understand and follow.

R source files are provided to run the example.

The example was produced with R Markdown. If you want to learn R Markdown, you can try the Rmd source file, which is also provided.

Check the example and source files at http://www.rdatamining.com/examples/time-series-forecasting
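For a taste of what the example covers, below is a minimal forecasting sketch with base R. It uses the built-in AirPassengers series as a stand-in for the exchange-rate data (which come from the source files linked above), and a standard airline-style ARIMA specification chosen only for illustration.

```r
# A minimal sketch of ARIMA and STL with base R (package stats).
# AirPassengers stands in for the exchange-rate series; the model
# order below is illustrative, not the one used in the example.
series <- log(AirPassengers)

# fit a seasonal ARIMA(0,1,1)(0,1,1)[12] model
fit <- arima(series, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# forecast 12 steps ahead and back-transform to the original scale
pred <- predict(fit, n.ahead = 12)
exp(pred$pred)

# STL decomposition of the same series into seasonal, trend and remainder
decomp <- stl(series, s.window = "periodic")
plot(decomp)
```

The full example replaces AirPassengers with the exchange-rate data and walks through model selection in detail.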


Posted in Data Mining, R

Step-by-Step Guide to Setting Up an R-Hadoop System

by Yanchang Zhao

Following my first R-Hadoop system setup guide written in September 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase, both on a single computer and on a cluster of computers. The process is described in a newer version of the guide to setting up an R-Hadoop system, updated on 30 May 2014. The guide also provides links to MapReduce and Hadoop documents and to examples of R-Hadoop code.

See the detailed guide at http://www.rdatamining.com/tutorials/r-hadoop-setup-guide, and below is a summary of it.

A list of software used for this setup:
– OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
– Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17
– R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


1. Set up single-node Hadoop
1.1 Download Hadoop
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
1.2.2 Set up remote desktop and enable self-login
1.2.3 Run Hadoop
1.3 Test Hadoop
1.3.1 Example 1 – calculate pi
1.3.2 Example 2 – word count

2. Set up Hadoop in cluster mode
2.1 Switching between different modes
2.2 Set up name node (master machine)
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
2.4 Copy public key
2.5 Firewall
2.6 Set up data nodes (slave machines)
2.7 Format name node
2.8 Run Hadoop
2.9 Test Hadoop

3. Set up HBase
3.1 Set up HBase
3.2 Switching between different modes

4. Install R

5. Install GCC, Homebrew, git, pkg-config and thrift
5.1 Download and install GCC
5.2 Install Homebrew
5.3 Install git and pkg-config
5.4 Install thrift 0.9.0

6. Environment settings: HADOOP_PREFIX and HADOOP_CMD

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
7.3 Install RHadoop packages

8. Run an R job on Hadoop for word counting
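As a flavour of step 8, a word-count job with rmr2 might look like the sketch below. This is a sketch only: the HADOOP_CMD/HADOOP_STREAMING paths and the HDFS input path are placeholders for illustration, so substitute the paths from your own Hadoop installation (see steps 6 and 7.2 of the guide for the tested settings).

```r
# Hedged sketch of an rmr2 word-count job. The paths below are
# placeholders -- replace them with those of your own installation.
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar")

library(rmr2)

# map: emit (word, 1) for every word in each line of input
wc.map <- function(., lines) {
  keyval(unlist(strsplit(lines, split = " +")), 1)
}

# reduce: sum the counts for each word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

# run the job on a text file already uploaded to HDFS
out <- mapreduce(input = "/user/rhadoop/wc-input.txt",
                 input.format = "text",
                 map = wc.map, reduce = wc.reduce)

# fetch the result from HDFS back into the R session
results <- from.dfs(out)
head(data.frame(word = keys(results), count = values(results)))
```

The full guide walks through the same job against a running Hadoop cluster.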

If you have successfully set up your R-Hadoop system, could you please share your success with R users at this thread? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to post your questions to the above thread or to the RDataMining group at http://group.rdatamining.com.


Posted in Big Data, R

A Sequence of 9 Courses on Data Science Starts on Coursera on 2 June and 7 July 2014

A sequence of 9 courses on Data Science will start on Coursera on 2 June and 7 July 2014, lectured by (Associate/Assistant) Professors of Johns Hopkins University. The courses are designed for students to learn to become data scientists and apply their skills in a capstone project.

You can take the courses for free. However, if you want a Verified Certificate for a course, the Specialization Certificate, or to take the Capstone Project, you will have to pay. The cost is
$49 each × 9 courses + $49 Capstone project = $490 Specialization Certificate.

Below is course information taken from the course homepages on the Coursera website; more details can be found at https://www.coursera.org/specialization/jhudatascience/1.

Course 1: The Data Scientist’s Toolbox
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-4 hours/week
URL: https://www.coursera.org/course/datascitoolbox
Description: Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account.

Course 2: R Programming
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/rprog
Description: The course will cover the following material each week:
Week 1: Overview of R, R data types and objects, reading and writing data
Week 2: Control structures, functions, scoping rules, dates and times
Week 3: Loop functions, debugging tools
Week 4: Simulation, code profiling

Course 3: Getting and Cleaning Data
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/getdata
Description: Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation.

Course 4: Exploratory Data Analysis
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/exdata
Description: After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques.

Course 5: Reproducible Research
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/repdata
Description: In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others.

Course 6: Statistical Inference
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/statinference
Description: In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use the skills developed as a roadmap for more complex inferential challenges.

Course 7: Regression Models
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/regmods
Description: In this course students will learn how to fit regression models, how to interpret coefficients, and how to investigate residuals and variability. Students will further learn special cases of regression models, including the use of dummy variables and multivariable adjustment. Extensions to generalized linear models, especially Poisson and logistic regression, will be reviewed.

Course 8: Practical Machine Learning
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
URL: https://www.coursera.org/course/predmachlearn
Description: Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools, and will learn to apply these tools to build and evaluate predictors on real data.

Course 9: Developing Data Products
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/devdataprod
Description: Students will learn how to communicate using statistics and statistical products. Emphasis will be paid to communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products.

Capstone Project
Duration: 4 weeks
Description: The capstone project class will allow students to create a usable/public data product that can be used to show their skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners. The capstone project will be four weeks long, offered in conjunction with the series. The capstone class will be offered thrice yearly. The Capstone Project is available after you have completed all courses in the Specialization.

Posted in Data Mining, R

A Coursera course on Machine Learning starts on 16 June

A 10-week course on Machine Learning by Andrew Ng from Stanford University will start on Coursera on 16 June. Below are descriptions of the course taken from Coursera.

The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI).

The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

See details and join the course at http://www.coursera.org/course/ml

Posted in Data Mining

CFP: AusDM 2014 – the 12th Australasian Data Mining Conference

12th Australasian Data Mining Conference (AusDM 2014)
Brisbane, Australia
27-28 November 2014

Data Mining is the art and science of intelligent analysis of (usually big) data sets for meaningful insights. Data mining is actively applied across all industries including defence, medicine, science, finance, customer relationship management, government, insurance, telecommunications, retail and distribution, transportation, and utilities.

The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. Since AusDM’02 the conference has showcased research in data mining, providing a forum for presenting and discussing the latest research and developments. Since 2006, all proceedings have been printed as volumes in the CRPIT series.

This year’s conference, AusDM’14, builds on this tradition of facilitating the cross-disciplinary exchange of ideas, experience and potential research directions. Specifically, the conference seeks to showcase: Industry Case Studies; Research Prototypes; Practical Analytics Technology; and Research Student Projects. AusDM’14 will be a meeting place for pushing forward the frontiers of data mining in industry and academia. We have lined up an excellent Keynote Speaker program.

Publication and topics

We are calling for papers, both research and applications, from both academia and industry, for publication and presentation at the conference. All papers will go through peer review by a panel of international experts. Accepted papers will be published in an upcoming volume (Data Mining and Analytics 2014) of the Conferences in Research and Practice in Information Technology (CRPIT) series by the Australian Computer Society, which is also held in full text in the ACM Digital Library. The proceedings will be distributed in electronic form at the conference. For more details on CRPIT please see http://www.crpit.com.

This year we are introducing a new track, “Industry Showcase”, for industry participants to present state-of-the-art analytics projects. These submissions can be of non-academic-publication style and will be for presentation only. These case studies and data mining experiences will not be included in the conference proceedings.

Please note that we require at least one author of each accepted paper to register for the conference and present their work.

AusDM invites contributions addressing current research in data mining and knowledge discovery as well as experiences, novel applications and future challenges. Topics of interest include, but are not restricted to:

– Applications and Case Studies | Lessons and Experiences
– Big Data Analytics
– Biomedical and Health Data Mining
– Business Analytics
– Computational Aspects of Data Mining
– Data Integration, Matching and Linkage
– Data Mining Education
– Data Mining in Security and Surveillance
– Data Preparation, Cleaning and Preprocessing
– Data Stream Mining
– Evaluation of Results and their Communication
– Implementations of Data Mining in Industry
– Integrating Domain Knowledge
– Link, Tree, Graph, Network and Process Mining
– Multimedia Data Mining
– New Data Mining Algorithms
– Professional Challenges in Data Mining
– Privacy-preserving Data Mining
– Social Network and Social Media Mining
– Spatial and Temporal Data Mining
– Text Mining
– Visual Analytics
– Web Mining and Personalization

Submission of papers

We invite three types of submissions for AusDM 2014:

- Research Track:
Normal academic submissions reporting on research progress, with a paper length of between 8 and 12 pages in CRPIT style, as detailed below. For academic submissions we will use a double-blind review process, i.e. submissions must NOT include authors’ names, affiliations, or acknowledgements referring to funding bodies. Self-citations should also be removed from submitted papers for double-blind reviewing. This information can be added back after the review.

- Application Track:
Submissions on specific data mining implementations and experiences in government and industry settings. Submissions in this category can be between 4 and 8 pages in CRPIT style, as detailed below. A committee made up of a mix of academic and industry representatives will review these submissions.

- Industry Showcase:
Submissions in this track are for presentation only. In this track, government and industry participants can present case studies and their experiences without worrying about publication. We call for an extended abstract of up to two pages to assess these submissions. A special committee made up of industry representatives will review these submissions.

Paper submissions in Research and Application tracks are required to follow the general format specified for papers in the CRPIT series by the Australian Computer Society. Submission details are available from http://crpit.com/AuthorsSubmitting.html. LaTeX styles and Word templates may be found on this site. LaTeX is the recommended typesetting package.

The electronic submissions must be in PDF only, and made through the AusDM’14 Submission Page at https://www.easychair.org/conferences/?conf=ausdm2014.

Important Dates

Submission of abstracts: 28 July 2014
Submission of full papers: 4 August 2014 (midnight PST)
Notification of authors: 22 September 2014
Final version and author registration: 14 October 2014
Conference: 27-28 November 2014

Organising Committee

Program Chairs (Research)
Lin Liu, University of South Australia, Adelaide
Xue Li, University of Queensland, Brisbane, Australia

Program Chairs (Application)
Yanchang Zhao, Department of Immigration & Border Protection, Australia; and RDataMining.com
Kok-Leong Ong, Deakin University, Melbourne

Conference Chairs
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Paul Kennedy, University of Technology, Sydney

Sponsorship Chair
Andrew Stranieri, University of Ballarat, Ballarat

Local Chair
Yue Xu, Brisbane, Australia

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, The Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Melbourne
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat
Geoff Webb, Monash University, Melbourne

Join us on LinkedIn

Posted in Data Mining

Multidimensional Scaling (MDS) with R

This page shows Multidimensional Scaling (MDS) with R, demonstrated with an example of automatically laying out Australian cities based on the distances between them. The layout obtained with MDS is very close to their locations on a map.

At first, the distances between 8 cities in Australia are loaded from http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv.

dist.au <- read.csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")

Alternatively, we can download the file first and then read it into R from the local drive.

dist.au <- read.csv("dist-Aus.csv")
##    X    A   AS    B    D    H    M    P    S
## 1  A    0 1328 1600 2616 1161  653 2130 1161
## 2 AS 1328    0 1962 1289 2463 1889 1991 2026
## 3  B 1600 1962    0 2846 1788 1374 3604  732
## 4  D 2616 1289 2846    0 3734 3146 2652 3146
## 5  H 1161 2463 1788 3734    0  598 3008 1057
## 6  M  653 1889 1374 3146  598    0 2720  713
## 7  P 2130 1991 3604 2652 3008 2720    0 3288
## 8  S 1161 2026  732 3146 1057  713 3288    0

Then we remove the first column (the acronyms of the cities) and use it as row names.

row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
##       A   AS    B    D    H    M    P    S
## A     0 1328 1600 2616 1161  653 2130 1161
## AS 1328    0 1962 1289 2463 1889 1991 2026
## B  1600 1962    0 2846 1788 1374 3604  732
## D  2616 1289 2846    0 3734 3146 2652 3146
## H  1161 2463 1788 3734    0  598 3008 1057
## M   653 1889 1374 3146  598    0 2720  713
## P  2130 1991 3604 2652 3008 2720    0 3288
## S  1161 2026  732 3146 1057  713 3288    0

After that, we run Multidimensional Scaling (MDS) with the function cmdscale() and get the x and y coordinates.

fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
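As a quick sanity check (not part of the original example), the eigenvalues returned when eig = TRUE indicate how well two dimensions represent the distances:

```r
# Continuing from the fit above: share of the (absolute) eigenvalue
# mass captured by the first two coordinates. A value close to 1
# suggests a 2-D layout represents the distances well.
sum(abs(fit$eig[1:2])) / sum(abs(fit$eig))
```

For near-planar data such as city distances, this ratio is typically high.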

Then we visualise the result, which shows that the positions of the cities are very close to their relative locations on a map.

plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin", "Hobart", 
    "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)



By flipping both the x- and y-axes, Darwin and Brisbane are moved to the top (north), which makes the layout easier to compare with a map.

x <- 0 - x
y <- 0 - y
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
text(x, y, pos = 4, labels = city.names)



MDS is also implemented in the igraph package as layout.mds.

g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)



Posted in R