RDataMining Slides Series

by Yanchang Zhao, RDataMining.com

I have made a series of slides on R and data mining, based on my book R and Data Mining: Examples and Case Studies. I will use the slides in seminars for graduate students at Universidad Juárez Autónoma de Tabasco (UJAT), prior to my keynote speech on Analysing Twitter Data with Text Mining and Social Network Analysis at the CONAIS 2014 conference in Mexico in October 2014.

The slides cover seven topics and are available for download as PDF files.

I will make more slides in the near future, on topics such as social network analysis with R and big data analysis with R. Stay tuned to RDataMining.com.

Posted in Data Mining, R

Slides of 12 tutorials at ACM SIGKDD 2014

Slides of 12 tutorials taught by data science experts and thought leaders at ACM SIGKDD 2014 are provided at http://www.kdd.org/kdd2014/tutorials.html. Below is a list of them.

1. Scaling Up Deep Learning
Yoshua Bengio

2. Constructing and mining web-scale knowledge graphs
Antoine Bordes, Evgeniy Gabrilovich

3. Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies
Jiawei Han, Chi Wang, Ahmed El-Kishky

4. Computational Epidemiology
Madhav Marathe, Naren Ramakrishnan, Anil Kumar S. Vullikanti

5. Management and Analytic of Biomedical Big Data with Cloud-based In-Memory Database and Dynamic Querying: A Hands-on Experience with Real-world Data
Roger Mark, John Ellenberger, Mengling Feng, Mohammad Ghassemi, Thomas Brennan, Ishrar Hussain

6. The Recommender Problem Revisited
Xavier Amatriain, Bamshad Mobasher

7. Correlation clustering: from theory to practice
Francesco Bonchi, David Garcia-Soriano, Edo Liberty

8. Deep Learning
Ruslan Salakhutdinov

9. Network Mining and Analysis for Social Applications
Feida Zhu, Huan Sun, Xifeng Yan

10. Sampling for Big Data
Graham Cormode, Nick Duffield

11. Statistically Sound Pattern Discovery
Geoff Webb, Wilhelmiina Hamalainen

12. Recommendation in Social Media
Jiliang Tang, Jie Tang, Huan Liu

Posted in Big Data, Data Mining

RDataMining group has 6000 members today

RDataMining Group: http://group.rdatamining.com
Twitter: @RDataMining
Website: http://www.RDataMining.com

The RDataMining group has 6000 members today, 5 July 2014.

Created in August 2011, this group has developed into a big community with 6000 members within three years. Since its creation, many members have shared their knowledge and experiences on R and data mining in the group, such as posting useful online resources that they came across, or examples that they made themselves. Without their contributions, this group would not have grown to this size today.

Many group members have asked questions and sought help with technical problems, which has also contributed a lot to discussions. Those questions drew many helpful responses and insightful solutions from more experienced group members. Some examples are discussions on good books for learning R and data mining, the best way to determine a threshold, and the visualization of decision trees. Such discussions are fantastic for knowledge sharing and group learning, from which many members (including myself) have benefited a lot.

Big thanks to those who have responded to the above questions and discussions, helping other group members and sharing their knowledge and experiences. I appreciate their time and effort in helping others out.

Thanks also go to job advertisers and group members who shared job vacancies in the group. Many group members are interested not only in improving their knowledge and skill set, but also in new opportunities.

Last but not least, I’d also like to thank members for inviting their friends and colleagues to this group, or mentioning this group to people who might be interested in it. Without their recommendations, the group would not have 6000 members today.

Thanks to all members for their contributions. I look forward to more discussions and interactions with you.

If you are not a member yet, please join us for knowledge sharing with 6000 members.

Best Regards
Yanchang Zhao

P.S. I’d like to take this opportunity to introduce two other professional groups that I manage, which are respectively for Australia-wide and Canberra-based data miners.

AusDM: Data Mining & Analytics
URL: http://www.linkedin.com/groups/AusDM-4907891
This is a group for the AusDM conference, and is also a forum for Australia-wide data miners and analysts to exchange ideas and share experiences.

Canberra Data Miners
URL: www.meetup.com/Canberra-Data-Miners
We are Canberra data miners. We get together on weekends for bush walking and experience sharing on data mining and analytics. Join us to have fun, get fit, and share knowledge.

Posted in Data Mining, R

Currency Exchange Rate Forecasting with ARIMA and STL

I have made an example of time series forecasting with R, demonstrating currency exchange rate forecasting with the ARIMA and STL models. The example is easy to understand and follow.

R source files are provided to run the example.

The example was produced with R Markdown. If you want to learn R Markdown, you can try the Rmd source file, which is also provided.

Check the example and source files at http://www.rdatamining.com/examples/time-series-forecasting
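
For readers who want a feel for the ARIMA part before opening the R files, here is a minimal sketch in Python rather than R. It is not the code from the example above: it hand-rolls an ARIMA(1,1,0) model without a constant (difference the series once, fit an AR(1) coefficient to the differences by least squares, then integrate the predicted changes back). The function names and the input values are hypothetical, the numbers are synthetic rather than real exchange rates, and STL decomposition is not covered. It assumes at least three observations that are not all equally spaced in value.

```python
def fit_ar1_on_differences(series):
    """Estimate phi in d[t] = phi * d[t-1] + e[t], where d is the first difference."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Least-squares slope through the origin of d[t] regressed on d[t-1].
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator, diffs[-1]


def forecast(series, steps):
    """Forecast `steps` values ahead by rolling the AR(1) recursion forward."""
    phi, last_diff = fit_ar1_on_differences(series)
    level, predictions = series[-1], []
    for _ in range(steps):
        last_diff = phi * last_diff  # predicted next change
        level += last_diff           # integrate back to the original scale
        predictions.append(level)
    return phi, predictions


# Synthetic values for illustration only; these are not real exchange rates.
rates = [1.00, 1.02, 1.03, 1.035, 1.0375]
phi, predictions = forecast(rates, steps=2)
print(round(phi, 3), [round(p, 5) for p in predictions])
```

In practice a library routine would also select the model order and estimate a constant term; the sketch only shows the difference, fit, and integrate steps.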


Posted in Data Mining, R

Step-by-Step Guide to Setting Up an R-Hadoop System

by Yanchang Zhao

Following my first R-Hadoop system setup guide written in Sept 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase. I have tested it both on a single computer and on a cluster of computers. The process is described in a newer version of the guide to setting up an R-Hadoop system, which was updated on 30 May 2014. The guide also provides links to MapReduce and Hadoop documents and to examples of R-Hadoop code.

See the detailed guide at http://www.rdatamining.com/big-data/r-hadoop-setup-guide. A summary of it follows.

A list of software used for this setup:
– OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
– Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17
– R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


1. Set up single-node Hadoop
1.1 Download Hadoop
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
1.2.2 Set up remote desktop and enable self-login
1.2.3 Run Hadoop
1.3 Test Hadoop
1.3.1 Example 1 – calculate pi
1.3.2 Example 2 – word count

2. Set up Hadoop in cluster mode
2.1 Switching between different modes
2.2 Set up name node (master machine)
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
2.4 Copy public key
2.5 Firewall
2.6 Set up data nodes (slave machines)
2.7 Format name node
2.8 Run Hadoop
2.9 Test Hadoop

3. Set up HBase
3.1 Set up HBase
3.2 Switching between different modes

4. Install R

5. Install GCC, Homebrew, git, pkg-config and thrift
5.1 Download and install GCC
5.2 Install Homebrew
5.3 Install git and pkg-config
5.4 Install thrift 0.9.0

6. Environment settings: HADOOP_PREFIX and HADOOP_CMD

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
7.3 Install RHadoop packages

8. Run an R job on Hadoop for word counting
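
The word-counting job in step 8 follows the usual MapReduce pattern. As a rough illustration of that pattern only, the sketch below simulates the map, shuffle, and reduce phases locally in plain Python. It is not the R code from the guide and does not use Hadoop or the RHadoop packages; all function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, ones):
    """Sum the emitted ones for a single word."""
    return word, sum(ones)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, ones) for word, ones in shuffle(pairs).items())


# Made-up input lines for illustration.
sample = ["R runs on Hadoop", "Hadoop runs MapReduce jobs"]
counts = word_count(sample)
print(counts)
```

On a real cluster the map and reduce functions run on different nodes and the shuffle is handled by Hadoop; only the two user-supplied functions carry over from a sketch like this.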

If you have successfully built your R-Hadoop system, could you please share your success with R users in this thread? Please also do not forget to forward this tutorial to friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to post them to the above thread or to the RDataMining group at http://group.rdatamining.com.


Posted in Big Data, R

A Sequence of 9 Courses on Data Science Starts on Coursera on 2 June and 7 July 2014

A sequence of 9 courses on Data Science will start on Coursera on 2 June and 7 July 2014, taught by (Associate/Assistant) Professors at Johns Hopkins University. The courses are designed for students to learn to become Data Scientists and apply their skills in a capstone project.

You can take the courses for free. However, if you want a Verified Certificate for a course or the Specialization Certificate, or want to take the Capstone Project, you will have to pay. The cost is
$49 each × 9 courses + $49 Capstone Project = $490 for the Specialization Certificate.
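
The total can be checked with a few lines of Python. The variable names are illustrative, and the fees are the ones quoted above.

```python
COURSE_FEE = 49    # USD per Verified Certificate, as quoted above
NUM_COURSES = 9
CAPSTONE_FEE = 49  # USD for the Capstone Project

total = COURSE_FEE * NUM_COURSES + CAPSTONE_FEE
print(total)  # 441 for the nine courses plus 49 for the capstone = 490
```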

Below is course information taken from the course homepages on the Coursera website. More details can be found at https://www.coursera.org/specialization/jhudatascience/1.

Course 1: The Data Scientist’s Toolbox
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-4 hours/week
URL: https://www.coursera.org/course/datascitoolbox
Description: Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account.

Course 2: R Programming
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/rprog
Description: The course will cover the following material each week:
Week 1: Overview of R, R data types and objects, reading and writing data
Week 2: Control structures, functions, scoping rules, dates and times
Week 3: Loop functions, debugging tools
Week 4: Simulation, code profiling

Course 3: Getting and Cleaning Data
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/getdata
Description: Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation.

Course 4: Exploratory Data Analysis
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/exdata
Description: After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques.

Course 5: Reproducible Research
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/repdata
Description: In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others.

Course 6: Statistical Inference
Upcoming Session: 2 June, 7 July
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/statinference
Description: In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use the skills developed as a roadmap for more complex inferential challenges.

Course 7: Regression Models
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/regmods
Description: In this course students will learn how to fit regression models, how to interpret coefficients, how to investigate residuals and variability. Students will further learn special cases of regression models including use of dummy variables and multivariable adjustment. Extensions to generalized linear models, especially considering Poisson and logistic regression will be reviewed.

Course 8: Practical Machine Learning
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
URL: https://www.coursera.org/course/predmachlearn
Description: Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools. You will also learn to apply these tools to build and evaluate predictors on real data.

Course 9: Developing Data Products
Upcoming Session: 2 June, 7 July, 4 August
Duration: 4 weeks
Estimated Workload: 3-5 hours/week
URL: https://www.coursera.org/course/devdataprod
Description: Students will learn how to communicate using statistics and statistical products. Emphasis will be paid to communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products.

Capstone Project
Duration: 4 weeks
Description: The capstone project class will allow students to create a usable/public data product that can be used to show your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners. The capstone project will be four weeks long, offered in conjunction with the series. The capstone class will be offered thrice yearly. The Capstone Project is available after you’ve completed all courses in the Specialization.

Posted in Data Mining, R

A Coursera course on Machine Learning starts on 16 June

A 10-week course on Machine Learning by Andrew Ng from Stanford University will start on Coursera on 16 June. The description below is taken from Coursera.

The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI).

The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

See details and join the course at http://www.coursera.org/course/ml

Posted in Data Mining