Tag Archives: algorithms
Using Rapid Miner and R for Sports Analytics #rstats
Ajay- Why did you choose Rapid Miner and R? What were the other software alternatives you considered and discarded?
Analyst- We considered most of the other major players in statistics/data mining or enterprise BI. However, we found that the value proposition for an open source solution was too compelling to justify the premium pricing that the commercial solutions would have required. The widespread adoption of R and the variety of packages and algorithms available for it made it an easy choice. We liked RapidMiner as a way to design structured, repeatable processes, and for its ability to optimize learner parameters in a systematic way. It also handled large data sets better than R on 32-bit Windows did. The GUI, particularly when 5.0 was released, made it more usable than R for analysts who weren’t experienced programmers.
Ajay- What analytics do you think Rapid Miner and R are best suited for?
Analyst- We use RM+R mainly for sports analysis so far, rather than for more traditional business applications. It has been quite suitable for that, and I can easily see how it would be used for other types of applications.
Ajay- Any experiences as an enterprise customer? How was the installation process? How good is the enterprise level support?
Analyst- Rapid-I has been one of the most responsive tech companies I’ve dealt with, either in my current role or with previous employers. They are small enough to be able to respond quickly to requests, and in more than one case, have fixed a problem, or added a small feature we needed within a matter of days. In other cases, we have contracted with them to add larger pieces of specific functionality we needed at reasonable consulting rates. Those features are added to the mainline product, and become fully supported through regular channels. The longer consulting projects have typically had a turnaround of just a few weeks.
Ajay- What challenges, if any, did you face in executing a pure open source analytics bundle?
Analyst- As Rapid-I is a smaller company based in Europe, the availability of training and consulting in the USA isn’t as extensive as for the major enterprise software players, and the time zone differences sometimes slow down the communications cycle. There were times when we were the first customer to attempt a specific integration point in our technical environment, and with no prior experiences to fall back on, we had to work with Rapid-I to figure out how to do it. Both R and RM tend to have sparse, terse, occasionally incomplete documentation. The situation is getting better, but still lags behind what the traditional enterprise software vendors provide.
Ajay- What are the things you can do in R, and what are the things you prefer to do in Rapid Miner (comparison for technical synergies)?
Analyst- Our experience has been that RM is superior to R at writing and maintaining structured processes, better at handling larger amounts of data, and more flexible at fine-tuning model parameters automatically. The biggest limitation we’ve had with RM compared to R is that R has a larger library of user-contributed packages for additional data mining algorithms. Sometimes we opted to use R because RM hadn’t yet implemented a specific algorithm. The introduction of the R extension has allowed us to combine the strengths of both tools in a very logical and productive way.
In particular, extending RapidMiner with R helped address RM’s weakness in the breadth of algorithms, because it brings the entire R ecosystem into RM (similar to how Rapid-I implemented much of the Weka library early on in RM’s development). Further, because the R user community releases packages that implement new techniques faster than the enterprise vendors can, this helps turn a potential weakness into a potential strength. However, R packages tend to be of varying quality, and are more prone to go stale due to lack of support/bug fixes. This depends heavily on the package’s maintainer and its prevalence of use in the R community. So when RapidMiner has a learner with a native implementation, it’s usually better to use it than the R equivalent.
RCOMM 2012 goes live in August
An awesome conference for an awesome piece of software. Rapid Miner remains one of the leading enterprise-grade open source tools, and it can help you do a lot of things, including flow-driven data modeling, web mining and web crawling, which even other software can't.
Presentations include:
- Mining Machine 2 Machine Data (Katharina Morik, TU Dortmund University)
- Handling Big Data (Andras Benczur, MTA SZTAKI)
- Introduction of RapidAnalytics at Telenor (Telenor and United Consult)
- and more
Here is the complete program:
Program
| Time | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|
| 09:00 – 10:30 | Introductory Speech: Ingo Mierswa (Rapid-I). Invited Talk: Resource-aware Data Mining or M2M Mining, Katharina Morik (TU Dortmund University). Data Analysis: NeurophRM: Integration of the Neuroph framework into RapidMiner | Invited Talk: To be announced, Andras Benczur. Recommender Systems: Extending RapidMiner with Recommender Systems Algorithms; Implementation of User Based Collaborative Filtering in RapidMiner | Parallel Training / Workshop Session: Advanced Data Mining and Data Transformations or | |
| 10:30 – 11:00 | Coffee Break | Coffee Break | Coffee Break | |
| 11:00 – 12:30 | Data Analysis: Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner; Customers’ LifeStyle Targeting on Big Data using Rapid Miner; Robust GPGPU Plugin Development for RapidMiner | Extensions: Optimization Plugin For RapidMiner; Image Mining Extension – Year After; Incorporating R Plots into RapidMiner Reports | | |
| 12:30 – 13:30 | Lunch | Lunch | Lunch | |
| 13:30 – 15:30 | Parallel Training / Workshop Session: Basic Data Mining and Data Transformations or | Applications: Introduction of RapidAnalytics Enterprise Edition at Telenor Hungary; Application of RapidMiner in Steel Industry Research and Development; A Comparison of Data-driven Models for Forecast River Flow; Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner | Extensions: An Octave Extension for RapidMiner. Unstructured Data: Processing Data Streams with the RapidMiner Streams-Plugin; Automated Creation of Corpuses for the Needs of Sentiment Analysis. Demonstration: News from the Rapid-I Labs. This short session demonstrates the latest developments from the Rapid-I lab and shows how you can build powerful analysis processes and routines by using those RapidMiner tools. | Certification Exam |
| 15:30 – 16:00 | Coffee Break | Coffee Break | Coffee Break | |
| 16:00 – 18:00 | Book Presentation and Game Show: Data Mining for the Masses: A New Textbook on Data Mining for Everyone. Matthew North presents his new book “Data Mining for the Masses”, introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems. Game Show | User Support: Get some Coffee for free – Writing Operators with RapidMiner Beans; Meta-Modeling Execution Times of RapidMiner operators. Conference day ends at ca. 17:00. | | |
| 19:30 | Social Event (Conference Dinner) | Social Event (Visit of Bar District) | | |
You should also have a look at https://rapid-i.com/rcomm2012f/index.php?option=com_content&view=article&id=65
The conference is in Budapest, Hungary, Europe.
(Disclaimer- Rapid Miner is an advertising sponsor of Decisionstats.com, in case you did not notice the two banner-sized ads.)
Interview John Myles White , Machine Learning for Hackers
Here is an interview with one of the younger researchers and rock stars of the R Project, John Myles White, co-author of Machine Learning for Hackers.
Ajay- What inspired you guys to write Machine Learning for Hackers? What has been the public response to the book? Are you planning to write a second edition or a next book?
John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.
Ajay- What are the key things that a potential reader can learn from this book?
John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.
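To make the pairing of regularization and cross-validation concrete, here is a minimal Python sketch. It is not taken from the book (whose examples are in R); the toy data, the candidate penalties and the function names `ridge_slope` and `k_fold_mse` are all made up for illustration. It fits a one-variable ridge regression without intercept and uses k-fold cross-validation to compare penalty strengths.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a no-intercept 1-D ridge fit: sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def k_fold_mse(xs, ys, lam, k=5):
    """Average held-out mean squared error over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = ridge_slope(train_x, train_y, lam)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Toy data (made up for illustration): y is roughly 2*x plus a fixed wobble.
xs = [0.5 * i for i in range(1, 21)]
wobble = [0.3, -0.2, 0.1, -0.4, 0.2] * 4
ys = [2.0 * x + e for x, e in zip(xs, wobble)]

# Candidate penalty strengths (arbitrary choices for the sketch).
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_error = {lam: k_fold_mse(xs, ys, lam) for lam in lambdas}
best_lambda = min(cv_error, key=cv_error.get)
print(best_lambda, cv_error[best_lambda])
```

The point of the sketch is the workflow rather than the numbers: the penalty is chosen by error on held-out data, not by fit on the training data, which is the habit the answer above hopes readers will pick up.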
Ajay- Describe your journey as a science student up to your Ph.D. What are your current research interests, and what initiatives have you taken with them?
John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?
Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?
John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.
(TIL he has played in several rock bands!)
Rapid Miner User Conference 2012
One of those cool conferences that are on my bucket list, this time in Hungary (that’s a nice place).
But I am especially interested in seeing how far Radoop has come along !
Disclaimer- Rapid Miner has been a Decisionstats.com sponsor for many years. It is also very cool software, but I like the R Extension facility even more!
—————————————————————
It is also not very expensive compared to other user conferences in Europe:
http://rcomm2012.org/index.php/registration/prices
Information about Registration
- Early Bird registration until July 20th, 2012.
- Normal registration from July 21st, 2012 until August 13th, 2012.
- Latest registration from August 14th, 2012 until August 24th, 2012.
- Students have to provide a valid Student ID during registration.
- The Dinner is included in the All Days and in the Conference packages.
- All prices below are net prices. Value added tax (VAT) has to be added if applicable.
Prices for Regular Visitors
| Days and Event | Early Bird Rate | Normal Rate | Latest Registration |
|---|---|---|---|
| Tuesday (Training / Development 1) | 190 Euro | 230 Euro | 280 Euro |
| Wednesday + Thursday (Conference) | 290 Euro | 350 Euro | 420 Euro |
| Friday (Training / Development 2 and Exam) | 190 Euro | 230 Euro | 280 Euro |
| All Days (Full Package) | 610 Euro | 740 Euro | 900 Euro |
Prices for Authors and Students
In case of students, please note that you will have to provide a valid student ID during registration.
| Days and Event | Early Bird Rate | Normal Rate | Latest Registration |
|---|---|---|---|
| Tuesday (Training / Development 1) | 90 Euro | 110 Euro | 140 Euro |
| Wednesday + Thursday (Conference) | 140 Euro | 170 Euro | 210 Euro |
| Friday (Training / Development 2 and Exam) | 90 Euro | 110 Euro | 140 Euro |
| All Days (Full Package) | 290 Euro | 350 Euro | 450 Euro |
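
As a quick check on the price lists above, the following Python sketch compares the All Days package with booking the three parts separately. The prices are copied from the two tables; the dictionary layout and the helper name `package_savings` are just illustrative choices.

```python
# Net prices in euros, copied from the two tables above:
# (early bird, normal, latest registration).
REGULAR = {
    "Tuesday": (190, 230, 280),
    "Wednesday + Thursday": (290, 350, 420),
    "Friday": (190, 230, 280),
    "All Days": (610, 740, 900),
}
STUDENT = {
    "Tuesday": (90, 110, 140),
    "Wednesday + Thursday": (140, 170, 210),
    "Friday": (90, 110, 140),
    "All Days": (290, 350, 450),
}


def package_savings(prices):
    """Euros saved per rate tier by booking All Days instead of each part."""
    parts = [rates for name, rates in prices.items() if name != "All Days"]
    separate = [sum(tier) for tier in zip(*parts)]
    return [s - full for s, full in zip(separate, prices["All Days"])]


print(package_savings(REGULAR))  # [60, 70, 80]
print(package_savings(STUDENT))  # [30, 40, 40]
```

So the full package is always somewhat cheaper than the sum of its parts, before VAT.
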
| Time | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|
| 09:00 – 10:30 | Introductory Speech: Ingo Mierswa (Rapid-I). Data Analysis: NeurophRM: Integration of the Neuroph framework into RapidMiner | Invited Talk: To be announced (speaker to be announced). Recommender Systems: Extending RapidMiner with Recommender Systems Algorithms; Implementation of User Based Collaborative Filtering in RapidMiner | Parallel Training / Workshop Session: Advanced Data Mining and Data Transformations or | |
| 10:30 – 12:30 | Data Analysis: Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner; Customers’ LifeStyle Targeting on Big Data using Rapid Miner; Robust GPGPU Plugin Development for RapidMiner | Extensions: Image Mining Extension – Year After; Incorporating R Plots into RapidMiner Reports; An Octave Extension for RapidMiner | | |
| 12:30 – 13:30 | Lunch | Lunch | Lunch | |
| 13:30 – 15:00 | Parallel Training / Workshop Session: Basic Data Mining and Data Transformations or | Applications: Application of RapidMiner in Steel Industry Research and Development; A Comparison of Data-driven Models for Forecast River Flow; Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner | Unstructured Data: Processing Data Streams with the RapidMiner Streams-Plugin; Automated Creation of Corpuses for the Needs of Sentiment Analysis. Demonstration: News from the Rapid-I Labs. This short session demonstrates the latest developments from the Rapid-I lab and shows how you can build powerful analysis processes and routines by using those RapidMiner tools. | Certification Exam |
| 15:00 – 17:00 | Book Presentation and Game Show: Data Mining for the Masses: A New Textbook on Data Mining for Everyone. Matthew North presents his new book “Data Mining for the Masses”, introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems. Game Show | User Support: Get some Coffee for free – Writing Operators with RapidMiner Beans; Meta-Modeling Execution Times of RapidMiner operators | | |
| 19:00 | Social Event (Conference Dinner) | Social Event (Visit of Bar District) | | |
Training: Basic Data Mining and Data Transformations
This is a short introductory training course for users who are not yet familiar with RapidMiner or have only a little experience with it so far. The topics of this training session include
- Basic Usage
- User Interface
- Creating and handling RapidMiner repositories
- Starting a new RapidMiner project
- Operators and processes
- Loading data from flat files
- Storing data, processes, and results
- Predictive Models
- Linear Regression
- Naïve Bayes
- Decision Trees
- Basic Data Transformations
- Changing names and roles
- Handling missing values
- Changing value types by discretization and dichotomization
- Normalization and standardization
- Filtering examples and attributes
- Scoring and Model Evaluation
- Applying models
- Splitting data
- Evaluation methods
- Performance criteria
- Visualizing Model Performance
Training: Advanced Data Mining and Data Transformations
This is a short introductory training course for users who already know some basic concepts of RapidMiner and data mining and have already used the software before, for example in the first training on Tuesday. The topics of this training session include
- Advanced Data Handling
- Sampling
- Balancing data
- Joins and Aggregations
- Detection and removal of outliers
- Dimensionality reduction
- Control process execution
- Remember process results
- Recall process results
- Loops
- Using branches and conditions
- Exception handling
- Definition of macros
- Usage of macros
- Definition of log values
- Clearing log tables
- Transforming log tables to data
Development Workshop Part 1 and Part 2
Want to exchange ideas with the developers of RapidMiner? Or learn more tricks for developing your own operators and extensions? During our development workshops on Tuesday and Friday, we will build small groups of developers, each working on a small development project around RapidMiner. Beginners will get a comprehensive overview of the architecture of RapidMiner before taking their first steps and learning how to write their own operators. Advanced developers will form groups with our experienced developers, identify shortcomings of RapidMiner, and develop a new extension which might already be presented during the conference. Unfinished work can be continued in the second workshop on Friday before results are published on the Marketplace or taken home as a starting point for new custom operators.
Google Cloud is finally here
Amazon gets some competition, and customers should see some relief, unless Google withdraws commitment on these products after a few years of trying (like it often does now!)
http://cloud.google.com/products/index.html
| Machine Type Pricing | ||||||
|---|---|---|---|---|---|---|
| Configuration | Virtual Cores | Memory | GCEU * | Local disk | Price/Hour | $/GCEU/hour |
| n1-standard-1-d | 1 | 3.75GB *** | 2.75 | 420GB *** | $0.145 | 0.053 |
| n1-standard-2-d | 2 | 7.5GB | 5.5 | 870GB | $0.29 | 0.053 |
| n1-standard-4-d | 4 | 15GB | 11 | 1770GB | $0.58 | 0.053 |
| n1-standard-8-d | 8 | 30GB | 22 | 2 x 1770GB | $1.16 | 0.053 |
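
The last column of the machine type table can be reproduced from the two columns before it. The Python sketch below does that division and also adds a rough monthly figure; the 30-day month of continuous use is my own assumption, not something Google's price list states.

```python
# (configuration, GCEU, price per hour in USD), copied from the table above.
MACHINE_TYPES = [
    ("n1-standard-1-d", 2.75, 0.145),
    ("n1-standard-2-d", 5.5, 0.29),
    ("n1-standard-4-d", 11, 0.58),
    ("n1-standard-8-d", 22, 1.16),
]

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, instance never stopped


def price_per_gceu(gceu, price_per_hour):
    """Hourly price divided by compute units, i.e. the $/GCEU/hour column."""
    return price_per_hour / gceu


for name, gceu, price in MACHINE_TYPES:
    per_unit = round(price_per_gceu(gceu, price), 3)
    monthly = round(price * HOURS_PER_MONTH, 2)
    print(name, per_unit, monthly)
```

Every configuration comes out at about 0.053 dollars per GCEU-hour, so in this price list the larger machines carry no volume discount: price scales linearly with compute units.
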
| Network Pricing | |
|---|---|
| Ingress | Free |
| Egress to the same Zone. | Free |
| Egress to a different Cloud service within the same Region. | Free |
| Egress to a different Zone in the same Region (per GB) | $0.01 |
| Egress to a different Region within the US | $0.01 **** |
| Inter-continental Egress | At Internet Egress Rate |
| Internet Egress (Americas/EMEA destination) per GB | |
| 0-1 TB in a month | $0.12 |
| 1-10 TB | $0.11 |
| 10+ TB | $0.08 |
| Internet Egress (APAC destination) per GB | |
| 0-1 TB in a month | $0.21 |
| 1-10 TB | $0.18 |
| 10+ TB | $0.15 |
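
To see how the internet egress tiers add up, here is a hedged Python sketch. It assumes the tiers are marginal (each rate applies only to the traffic that falls inside its band) and that 1 TB means 1024 GB; the table does not spell out either point, so treat both as assumptions. The function name `egress_cost` is illustrative.

```python
TB = 1024  # assumption: 1 TB = 1024 GB

# Each tier is (upper bound in GB, or None for unbounded; USD per GB),
# with rates copied from the table above.
AMERICAS_EMEA = [(1 * TB, 0.12), (10 * TB, 0.11), (None, 0.08)]
APAC = [(1 * TB, 0.21), (10 * TB, 0.18), (None, 0.15)]


def egress_cost(gb, tiers):
    """Monthly internet egress cost in USD, assuming marginal tier pricing."""
    cost = 0.0
    lower = 0
    for upper, price in tiers:
        if upper is None or gb < upper:
            cost += (gb - lower) * price  # remaining traffic ends in this band
            break
        cost += (upper - lower) * price  # this band is filled completely
        lower = upper
    return cost


print(round(egress_cost(2 * TB, AMERICAS_EMEA), 2))  # 235.52
print(round(egress_cost(2 * TB, APAC), 2))           # 399.36
```

Under these assumptions, 2 TB a month to an Americas/EMEA destination costs 1024 GB at $0.12 plus 1024 GB at $0.11.
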
| Persistent Disk Pricing | |
|---|---|
| Provisioned space | $0.10 GB/month |
| Snapshot storage** | $0.125 GB/month |
| IO Operations | $0.10 per million |
| IP Address Pricing | |
|---|---|
| Static IP address (assigned but unused) | $0.01 per hour |
| Ephemeral IP address (attached to instance) | Free |
** coming soon
*** 1GB is defined as 2^30 bytes
**** promotional pricing; eventually will be charged at internet download rates
Google Prediction API
Tap into Google’s machine learning algorithms to analyze data and predict future outcomes.
Leverage machine learning without the complexity
Use the familiar RESTful interface
Enter input in any format – numeric or text
Build smart apps
Learn how you can use Prediction API to build customer sentiment analysis, spam detection or document and email classification.
Google Translation API
Use Google Translate API to build multilingual apps and programmatically translate text in your webpage or application.
Translate text into other languages programmatically
Use the familiar RESTful interface
Take advantage of Google’s powerful translation algorithms
Build multilingual apps
Learn how you can use Translate API to build apps that can programmatically translate text in your applications or websites.
Google BigQuery
Analyze Big Data in the cloud using SQL and get real-time business insights in seconds using Google BigQuery. Use a fully-managed data analysis service with no servers to install or maintain.
Reliable & Secure
Complete peace of mind as your data is automatically replicated across multiple sites and secured using access control lists.
Scale infinitely
You can store up to hundreds of terabytes, paying only for what you use.
Blazing fast
Run ad hoc SQL queries on multi-terabyte datasets in seconds.
Google App Engine
Create apps on Google’s platform that are easy to manage and scale. Benefit from the same systems and infrastructure that power Google’s applications.
Focus on your apps
Let us worry about the underlying infrastructure and systems.
Scale infinitely
See your applications scale seamlessly from hundreds to millions of users.
Business ready
Premium paid support and 99.95% SLA for business users
Interview Jason Kuo SAP Analytics #Rstats
Here is an interview with Jason Kuo, who works with SAP Analytics as Group Solutions Marketing Manager. Jason answers questions on SAP Analytics and its increasing involvement with the R statistical language.
Ajay- What made you choose R as the language to tie together important parts of your technology platform, like HANA and SAP Predictive Analysis? Did you consider other languages like Julia or Python?
Jason- It’s the most popular. Over 50% of the statisticians and data analysts use R. With 3,500+ algorithms, it’s arguably the most comprehensive statistical analysis language. That said, we are not closing the door on others.
Ajay- When did you first start getting interested in R as an analytics platform?
Jason- SAP has been tracking R for 5+ years. With R’s explosive growth over the last year or two, it made sense for us to dramatically increase our investment in R.
Ajay- Can we expect SAP to give back to the R community like Google and Revolution Analytics do, by sponsoring package development or sponsoring user meets and conferences?
Will we see SAP’s R HANA package at this year’s R conference, useR! 2012 in Nashville?
Jason- Yes. We plan to provide a specific driver for HANA tables for input of the data to native R. This is planned for the end of 2012. We’ll then review our event strategy. SAP has been a sponsor of Predictive Analytics World for several years and was indeed a founding sponsor. We may be attending this year’s R conference in Nashville.
Ajay- What has been some of the initial customer feedback on your analytics expansion and offerings?
Jason- We have completed two very successful Pilots of the R Integration for HANA with two of SAP’s largest customers.
About-
Jason has over 15 years of BI and Data Warehousing industry experience. Having worked at Oracle, Business Objects, and now SAP, Jason has been involved in numerous technical marketing roles involving performance management dashboards, information management, text analysis, predictive analytics, and now big data. He has a Bachelor of Science in operations research from the University of Michigan.
R for Business Analytics- Book by Ajay Ohri
So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for the past 2 years. Special thanks to my mentors, detractors, readers and students- I owe you a beer!
You can also go here-
http://www.springer.com/statistics/book/978-1-4614-4342-1
R for Business Analytics
Ohri, Ajay
2012, XVI, 300 p. 208 illus., 162 in color.
ISBN 978-1-4614-4342-1
Due: September 30, 2012
- Covers full spectrum of R packages related to business analytics
- Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
- Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models
R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.
This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.
Content Level » Professional/practitioner
Keywords » Business Analytics - Data Mining - Data Visualization - Forecasting - GUI - Graphical User Interface - R software - Text Mining
Related subjects » Business, Economics & Finance - Computational Statistics - Statistics

