A Sequence of 9 Courses on Data Science Starts on Coursera on 2 June and 7 July 2014

A sequence of 9 courses on Data Science will start on Coursera on 2 June and 7 July 2014, taught by (Associate/Assistant) Professors of Johns Hopkins University. The courses are designed to help students become Data Scientists and apply their skills in a capstone project.

You can take the courses for free. However, if you want a Verified Certificate for a course or the Specialization Certificate, or want to take the Capstone Project, you will have to pay. The cost is
$49 each × 9 courses + $49 Capstone Project = $490 for the Specialization Certificate.

Below is course information taken from the course homepages on the Coursera website; more details can be found at https://www.coursera.org/specialization/jhudatascience/1.

Course 1: The Data Scientist’s Toolbox
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-4 hours/week
URL: https://www.coursera.org/course/datascitoolbox
Description: Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account.

Course 2: R Programming
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/rprog
Description: The course will cover the following material each week:
Week 1: Overview of R, R data types and objects, reading and writing data
Week 2: Control structures, functions, scoping rules, dates and times
Week 3: Loop functions, debugging tools
Week 4: Simulation, code profiling

Course 3: Getting and Cleaning Data
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/getdata
Description: Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation.

Course 4: Exploratory Data Analysis
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/exdata
Description: After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques.

Course 5: Reproducible Research
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/repdata
Description: In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others.

Course 6: Statistical Inference
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/statinference
Description: In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use the skills developed as a roadmap for more complex inferential challenges.

Course 7: Regression Models
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/regmods
Description: In this course students will learn how to fit regression models, how to interpret coefficients, and how to investigate residuals and variability. Students will further learn special cases of regression models, including use of dummy variables and multivariable adjustment. Extensions to generalized linear models, especially Poisson and logistic regression, will be reviewed.

Course 8: Practical Machine Learning
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
URL: https://www.coursera.org/course/predmachlearn
Description: Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools. You will also learn to apply these tools to build and evaluate predictors on real data.

Course 9: Developing Data Products
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/devdataprod
Description: Students will learn how to communicate using statistics and statistical products. Emphasis will be placed on communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products.

Capstone Project
Duration: 4 weeks
Description: The capstone project class will allow students to create a usable/public data product that can be used to show their skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners. The capstone project will be four weeks long, offered in conjunction with the series. The capstone class will be offered thrice yearly. The Capstone Project is available after you’ve completed all courses in the Specialization.


A Coursera course on Machine Learning starts on 16 June

A 10-week course on Machine Learning by Andrew Ng from Stanford University will start on Coursera on 16 June. Below are descriptions of the course picked up from Coursera.

The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI).

The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

See details and join the course at http://www.coursera.org/course/ml


CFP: AusDM 2014 – the 12th Australasian Data Mining Conference

*********************************************************
12th Australasian Data Mining Conference (AusDM 2014)
Brisbane, Australia
27-28 November 2014
http://ausdm14.ausdm.org/
*********************************************************

Data Mining is the art and science of intelligent analysis of (usually big) data sets for meaningful insights. Data mining is actively applied across all industries including defence, medicine, science, finance, customer relationship management, government, insurance, telecommunications, retail and distribution, transportation, and utilities.

The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. Since AusDM’02 the conference has showcased research in data mining, providing a forum for presenting and discussing the latest research and developments. Since 2006, all proceedings have been printed as volumes in the CRPIT series.

This year’s conference, AusDM’14, builds on this tradition of facilitating the cross-disciplinary exchange of ideas, experience and potential research directions. Specifically, the conference seeks to showcase: Industry Case Studies; Research Prototypes; Practical Analytics Technology; and Research Student Projects. AusDM’14 will be a meeting place for pushing forward the frontiers of data mining in industry and academia. We have lined up an excellent Keynote Speaker program.

Publication and topics
======================

We are calling for papers, both research and applications, and from both academia and industry, for publication and presentation at the conference. All papers will go through peer review by a panel of international experts. Accepted papers will be published in an upcoming volume (Data Mining and Analytics 2014) of the Conferences in Research and Practice in Information Technology (CRPIT) series by the Australian Computer Society, which is also held in full text on the ACM Digital Library. An electronic version of the proceedings will be distributed at the conference. For more details on CRPIT please see http://www.crpit.com.

This year we are introducing a new track, “Industry Showcase”, for industry participants to present state-of-the-art analytics projects. These submissions can be of non-academic-publication style and will be for presentation only. These case studies and data mining experiences will not be included in the conference proceedings.

Please note that we require at least one author of each accepted paper to register for the conference and present their work.

AusDM invites contributions addressing current research in data mining and knowledge discovery as well as experiences, novel applications and future challenges. Topics of interest include, but are not restricted to:

– Applications and Case Studies | Lessons and Experiences
– Big Data Analytics
– Biomedical and Health Data Mining
– Business Analytics
– Computational Aspects of Data Mining
– Data Integration, Matching and Linkage
– Data Mining Education
– Data Mining in Security and Surveillance
– Data Preparation, Cleaning and Preprocessing
– Data Stream Mining
– Evaluation of Results and their Communication
– Implementations of Data Mining in Industry
– Integrating Domain Knowledge
– Link, Tree, Graph, Network and Process Mining
– Multimedia Data Mining
– New Data Mining Algorithms
– Professional Challenges in Data Mining
– Privacy-preserving Data Mining
– Social Network and Social Media Mining
– Spatial and Temporal Data Mining
– Text Mining
– Visual Analytics
– Web Mining and Personalization

Submission of papers
====================

We invite three types of submissions for AusDM 2014:

- Research Track:
Normal academic submissions reporting on research progress, with a paper length of between 8 and 12 pages in CRPIT style, as detailed below. For academic submissions we will use a double-blind review process, i.e. paper submissions must NOT include authors' names or affiliations, or acknowledgments referring to funding bodies. Self-citing references should also be removed from the submitted papers for the purpose of double-blind reviewing. This information can be added after the review.

- Application Track:
Submissions on specific data mining implementations and experiences in government and industry settings. Submissions in this category can be between 4 and 8 pages in CRPIT style, as detailed below. A committee made up of a mix of academic and industry representatives will review these submissions.

- Industry Showcase:
Submissions in this track are for presentation only. In this track, government and industry participants can present case studies and their experiences without worrying about publication. We call for an extended abstract of up to two pages to assess these submissions. A special committee made up of industry representatives will review these submissions.

Paper submissions in Research and Application tracks are required to follow the general format specified for papers in the CRPIT series by the Australian Computer Society. Submission details are available from http://crpit.com/AuthorsSubmitting.html. LaTeX styles and Word templates may be found on this site. LaTeX is the recommended typesetting package.

The electronic submissions must be in PDF only, and made through the AusDM’14 Submission Page at https://www.easychair.org/conferences/?conf=ausdm2014.

Important Dates
===============

Submission of abstracts: 28 July 2014
Submission of full papers: 4 August 2014 (midnight PST)
Notification of authors: 22 September 2014
Final version and author registration: 14 October 2014
Conference: 27-28 November 2014

Organising Committee
====================

Program Chairs (Research)
Lin Liu, University of South Australia, Adelaide
Xue Li, University of Queensland, Brisbane, Australia

Program Chairs (Application)
Yanchang Zhao, Department of Immigration & Border Protection, Australia; and RDataMining.com
Kok-Leong Ong, Deakin University, Melbourne

Conference Chairs
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Paul Kennedy, University of Technology, Sydney

Sponsorship Chair
Andrew Stranieri, University of Ballarat, Ballarat

Local Chair
Yue Xu, Brisbane, Australia

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, The Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Melbourne
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat
Geoff Webb, Monash University, Melbourne

Join us on LinkedIn
===================
http://www.linkedin.com/groups/AusDM-4907891


Multidimensional Scaling (MDS) with R

This page shows Multidimensional Scaling (MDS) with R. It demonstrates with an example of automatic layout of Australian cities based on distances between them. The layout obtained with MDS is very close to their locations on a map.

First, the distances between 8 cities in Australia are loaded from http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv.

dist.au <- read.csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")

Alternatively, we can download the file first and then read it into R from a local drive.

dist.au <- read.csv("dist-Aus.csv")
dist.au
##    X    A   AS    B    D    H    M    P    S
## 1  A    0 1328 1600 2616 1161  653 2130 1161
## 2 AS 1328    0 1962 1289 2463 1889 1991 2026
## 3  B 1600 1962    0 2846 1788 1374 3604  732
## 4  D 2616 1289 2846    0 3734 3146 2652 3146
## 5  H 1161 2463 1788 3734    0  598 3008 1057
## 6  M  653 1889 1374 3146  598    0 2720  713
## 7  P 2130 1991 3604 2652 3008 2720    0 3288
## 8  S 1161 2026  732 3146 1057  713 3288    0

Then we use the first column (the city acronyms) as row names and remove it from the data frame.

row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##       A   AS    B    D    H    M    P    S
## A     0 1328 1600 2616 1161  653 2130 1161
## AS 1328    0 1962 1289 2463 1889 1991 2026
## B  1600 1962    0 2846 1788 1374 3604  732
## D  2616 1289 2846    0 3734 3146 2652 3146
## H  1161 2463 1788 3734    0  598 3008 1057
## M   653 1889 1374 3146  598    0 2720  713
## P  2130 1991 3604 2652 3008 2720    0 3288
## S  1161 2026  732 3146 1057  713 3288    0
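Before running MDS, it can be worth checking that the table really behaves like a distance matrix. The sketch below is an illustration only: it re-enters the printed values as a plain matrix named d (a name not used in the original post) so that it runs without the CSV file, then checks symmetry and the zero diagonal.

```r
# Distances copied from the output above; `d` and `city` are illustrative names
city <- c("A", "AS", "B", "D", "H", "M", "P", "S")
d <- matrix(c(
     0, 1328, 1600, 2616, 1161,  653, 2130, 1161,
  1328,    0, 1962, 1289, 2463, 1889, 1991, 2026,
  1600, 1962,    0, 2846, 1788, 1374, 3604,  732,
  2616, 1289, 2846,    0, 3734, 3146, 2652, 3146,
  1161, 2463, 1788, 3734,    0,  598, 3008, 1057,
   653, 1889, 1374, 3146,  598,    0, 2720,  713,
  2130, 1991, 3604, 2652, 3008, 2720,    0, 3288,
  1161, 2026,  732, 3146, 1057,  713, 3288,    0
), nrow = 8, byrow = TRUE, dimnames = list(city, city))

isSymmetric(d)       # distance from X to Y should equal distance from Y to X
all(diag(d) == 0)    # every city is at distance 0 from itself
d.dist <- as.dist(d) # compact "dist" object, also accepted by cmdscale()
```

With the data frame from the post, the same checks would apply to as.matrix(dist.au).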

After that, we run Multidimensional Scaling (MDS) with function cmdscale(), and get x and y coordinates.

fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
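To make it clearer what cmdscale() computes, here is a minimal sketch of classical MDS done by hand: square the distances, double-centre them, and take the top two eigenvectors scaled by the square roots of their eigenvalues. The matrix d and the names J, B, e and manual are assumptions introduced for this illustration; the distances are the ones printed above. Coordinates are compared in absolute value because the sign of each axis is arbitrary.

```r
# Distances copied from the output above; all names here are illustrative
city <- c("A", "AS", "B", "D", "H", "M", "P", "S")
d <- matrix(c(
     0, 1328, 1600, 2616, 1161,  653, 2130, 1161,
  1328,    0, 1962, 1289, 2463, 1889, 1991, 2026,
  1600, 1962,    0, 2846, 1788, 1374, 3604,  732,
  2616, 1289, 2846,    0, 3734, 3146, 2652, 3146,
  1161, 2463, 1788, 3734,    0,  598, 3008, 1057,
   653, 1889, 1374, 3146,  598,    0, 2720,  713,
  2130, 1991, 3604, 2652, 3008, 2720,    0, 3288,
  1161, 2026,  732, 3146, 1057,  713, 3288,    0
), nrow = 8, byrow = TRUE, dimnames = list(city, city))

n <- nrow(d)
J <- diag(n) - matrix(1 / n, n, n)   # centring matrix
B <- -0.5 * J %*% (d^2) %*% J        # double-centred squared distances
e <- eigen(B, symmetric = TRUE)
# coordinates: top two eigenvectors scaled by sqrt of their eigenvalues
manual <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))

fit <- cmdscale(d, eig = TRUE, k = 2)
# largest discrepancy, ignoring the arbitrary sign of each axis
max(abs(abs(manual) - abs(fit$points)))
```

The discrepancy should be at the level of floating-point noise, which suggests the two routes give the same layout up to axis flips.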

Then we visualise the result, which shows the positions of cities are very close to their relative locations on a map.

plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin", "Hobart", 
    "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)

 

[Figure mds1: MDS layout of the eight Australian cities]

By flipping both the x- and y-axes, Darwin and Brisbane are moved to the top (north), which makes it easier to compare with a map.

x <- 0 - x
y <- 0 - y
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
text(x, y, pos = 4, labels = city.names)
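Flipping is legitimate because an MDS layout is only determined up to rotation and reflection: the method preserves distances between points, not their orientation. The following sketch, which re-enters the distances as an illustrative matrix d and uses made-up names pts, flipped and mirrored, checks that negating the coordinates leaves all pairwise distances unchanged.

```r
# Distances copied from the output above; all names here are illustrative
city <- c("A", "AS", "B", "D", "H", "M", "P", "S")
d <- matrix(c(
     0, 1328, 1600, 2616, 1161,  653, 2130, 1161,
  1328,    0, 1962, 1289, 2463, 1889, 1991, 2026,
  1600, 1962,    0, 2846, 1788, 1374, 3604,  732,
  2616, 1289, 2846,    0, 3734, 3146, 2652, 3146,
  1161, 2463, 1788, 3734,    0,  598, 3008, 1057,
   653, 1889, 1374, 3146,  598,    0, 2720,  713,
  2130, 1991, 3604, 2652, 3008, 2720,    0, 3288,
  1161, 2026,  732, 3146, 1057,  713, 3288,    0
), nrow = 8, byrow = TRUE, dimnames = list(city, city))

pts <- cmdscale(d, k = 2)            # 8 x 2 matrix of coordinates
flipped <- -pts                      # flip both axes (a 180-degree rotation)
mirrored <- pts %*% diag(c(-1, 1))   # flip the x-axis only (a reflection)

# pairwise distances in the layout are identical in all three versions
all.equal(as.matrix(dist(pts)), as.matrix(dist(flipped)))
all.equal(as.matrix(dist(pts)), as.matrix(dist(mirrored)))
```

Which flips are needed to match a map can differ between machines, since the sign of each eigenvector is arbitrary.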

 

[Figure mds2: MDS layout after flipping both axes]

MDS is also implemented in the igraph package as layout.mds.

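library(igraph)
# complete graph with one vertex per city
# (newer igraph releases name these make_full_graph() and layout_with_mds())
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
# MDS layout computed from the distance matrix
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)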

[Figure mds3: MDS layout drawn with igraph]

 


New book release: Data Mining Applications with R

Book title: Data Mining Applications with R
Editors: Yanchang Zhao, Yonghua Cen
Publisher: Elsevier
Publish date: December 2013
ISBN: 978-0-12-411511-8
Length: 514 pages
URL: http://www.rdatamining.com/books/dmar

An edited book titled Data Mining Applications with R, featuring 15 real-world applications of data mining with R, was released in December 2013.

Book preview on Google Books

R code, data and color figures for the book

Buy the book on
Amazon
Elsevier
Google Books

Below is its table of contents.

  • Foreword
    Graham Williams
  • Chapter 1 Power Grid Data Analysis with R and Hadoop
    Terence Critchlow, Ryan Hafen, Tara Gibson and Kerstin Kleese van Dam
  • Chapter 2 Picturing Bayesian Classifiers: A Visual Data Mining Approach to Parameters Optimization
    Giorgio Maria Di Nunzio and Alessandro Sordoni
  • Chapter 3 Discovery of emergent issues and controversies in Anthropology using text mining, topic modeling and social network analysis of microblog content
    Ben Marwick
  • Chapter 4 Text Mining and Network Analysis of Digital Libraries in R
    Eric Nguyen
  • Chapter 5 Recommendation systems in R
    Saurabh Bhatnagar
  • Chapter 6 Response Modeling in Direct Marketing: A Data Mining Based Approach for Target Selection
    Sadaf Hossein Javaheri, Mohammad Mehdi Sepehri and Babak Teimourpour
  • Chapter 7 Caravan Insurance Policy Customer Profile Modeling with R Mining
    Mukesh Patel and Mudit Gupta
  • Chapter 8 Selecting Best Features for Predicting Bank Loan Default
    Zahra Yazdani, Mohammad Mehdi Sepehri and Babak Teimourpour
  • Chapter 9 A Choquet Integral Toolbox and its Application in Customer’s Preference Analysis
    Huy Quan Vu, Gleb Beliakov and Gang Li
  • Chapter 10 A Real-Time Property Value Index based on Web Data
    Fernando Tusell, Maria Blanca Palacios, María Jesús Bárcena and Patricia Menéndez
  • Chapter 11 Predicting Seabed Hardness Using Random Forest in R
    Jin Li, Justy Siwabessy, Zhi Huang, Maggie Tran and Andrew Heap
  • Chapter 12 Supervised classification of images, applied to plankton samples using R and zooimage
    Kevin Denis and Philippe Grosjean
  • Chapter 13 Crime analyses using R
    Madhav Kumar, Anindya Sengupta and Shreyes Upadhyay
  • Chapter 14 Football Mining with R
    Maurizio Carpita, Marco Sandri, Anna Simonetto and Paola Zuccolotto
  • Chapter 15 Analyzing Internet DNS(SEC) Traffic with R for Resolving Platform Optimization
    Emmanuel Herbert, Daniel Migault, Stephane Senecal, Stanislas Francfort and Maryline Laurent

Preview of book Data Mining Applications with R

An edited book titled Data Mining Applications with R, featuring 15 real-world applications of data mining with R, will be on the market soon. A preview of the book is available on Google Books. R code, data and color figures for the book can be downloaded at RDataMining.com.


Step by step to build my first R Hadoop System

by Yanchang Zhao, RDataMining.com

After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about 2 weeks, I have finally built my first R Hadoop system and successfully run some R examples on it. My experience and the steps I took are presented at http://www.rdatamining.com/big-data/rhadoop. Hopefully it will make it easier for R users who are new to Hadoop to try RHadoop. Note that I tried this on a Mac only, and some steps might be different on Windows.

Before going through the complex steps, you may want to have a look at what you can get with R and Hadoop. There is a video showing Wordcount MapReduce in R at http://www.youtube.com/watch?v=hSrW0Iwghtw.

If you are interested enough to try R on Hadoop, please follow the steps below, whose details are available at http://www.rdatamining.com/big-data/rhadoop.

1. Install Hadoop
2. Run Hadoop
3. Install R
4. Install RHadoop
5. Run R jobs on Hadoop
6. What’s Next
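For readers who have not seen MapReduce before, the word count idea can be sketched in plain R without Hadoop. This is only an illustration of the map and reduce steps, not the RHadoop API: the input lines and the names map_fn, pairs and counts are made up for this example, and nothing here runs on a cluster.

```r
# Hypothetical input: each element stands for one line of a text file
lines <- c("hadoop runs mapreduce jobs",
           "r runs on hadoop",
           "mapreduce in r")

# Map step: split a line into words and emit a (word, 1) pair per word
map_fn <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(lines, map_fn))

# Shuffle and reduce steps: group the pairs by word and sum the values
counts <- tapply(pairs$value, pairs$key, sum)
counts
```

On a real Hadoop system the map and reduce functions run in parallel over blocks of a large file, but the logic of each step is the same as in this sketch.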

Enjoy MapReducing with R!
