Software Review- BigML.com – Machine Learning meets the Cloud
I had a chance to dekko the new startup BigML https://bigml.com/ and was suitably impressed by the briefing and my own puttering around the site. Here is my review-
1) The website is very intuitively designed- you can create a dataset from an uploaded file in one click, and you can create a Decision Tree model in one click as well. I wish other cloud computing websites like the Google Prediction API made their design so intuitive and easy to understand. Also, unlike the Google Prediction API, the models are not black-box models, but have a description which can be understood.
2) It includes some well known data sources for people trying it out. They were kind enough to offer 5 invite codes for readers of DecisionStats (if you want to check it out yourself, use the codes below the post; note they are one-time only, so the first five get the invites).
BigML is still invite only but plans to move to an open release soon.
3) Data sources can currently only be created by uploading files (csv), but they plan to change this, hopefully to pull data from buckets (S3? or Google?) and from URLs.
4) The one-click operation to convert a data source into a dataset shows a histogram (distribution) of the individual variables. The back end is Clojure, because the team explained it made the easiest sense and fit well with Java. The good news (?) is you would never see the Clojure code at the back end. You can read about it at http://clojure.org/
As cloud computing takes off (someday) I expect Clojure's popularity to take off as well.
Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language – it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.
Clojure is a dialect of Lisp
5) As of now decision trees are the only distributed algorithm, but they expect to roll out other machine learning methods soon. Hopefully this includes regression (both logit and linear) and k-means clustering. The trees are created and pruned in real time, which gives a slightly animated (and impressive) effect. And yes, model building is a one-click operation.
The real-time live pruning is really impressive, and I wonder why/how it could ever be replicated in desktop-based software, given its sheer interactive nature.
Making the model is just half the work. Creating predictions and scoring the model is the real money-earner. It is one click, and customization is quite intuitive. It is not quite PMML compliant yet, so I hope some Zemanta-like functionality can be added so huge numbers of models can be applied to predictions or used to score data in real time.
If you are a developer/data hacker, you should check out this section too- it is quite impressive that the designers of BigML have planned for API access so early.
BigML.io gives you:
- Secure programmatic access to all your BigML resources.
- Fully white-box access to your datasets and models.
- Asynchronous creation of datasets and models.
- Near real-time predictions.
Note: For your convenience, some of the snippets below include your real username and API key.
Please keep them secret.
REST API
BigML.io conforms to the design principles of Representational State Transfer (REST). BigML.io is entirely HTTP-based.
BigML.io gives you access to four basic resources: Source, Dataset, Model and Prediction. You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.
All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with an HTTP PUT using the "multipart/form-data" content-type.
HTTPS
All access to BigML.io must be performed over HTTPS
and https://bigml.com/developers/quick_start (I think an R package which uses JSON and RCurl would further help in enhancing ease of use).
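To make the resource/method description above concrete, here is a minimal sketch in Python of how such a REST request could be assembled. The endpoint path, auth query string, and resource id are illustrative assumptions, not taken from BigML's documentation- check the developer docs linked above before use:

```python
import json
import urllib.request

# Placeholder credentials -- BigML.io authenticates with a username/API-key pair.
USERNAME = "my_user"
API_KEY = "my_api_key"
AUTH = f"username={USERNAME};api_key={API_KEY}"
BASE = "https://bigml.io"  # all access is over HTTPS


def dataset_request(source_id):
    """Build (but do not send) the JSON POST that would create a
    Dataset resource from an existing Source resource."""
    body = json.dumps({"source": source_id}).encode()
    return urllib.request.Request(
        f"{BASE}/dataset?{AUTH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A hypothetical source id; actually sending it would be
# urllib.request.urlopen(req)
req = dataset_request("source/4f52824203ce893c0a000053")
```

GET, PUT and DELETE on the other resources follow the same shape, with a JSON body only where one is needed.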
Summary-
Overall, a welcome addition that makes software in the realm of cloud computing and statistical computing/business analytics both easy to use and easy to deploy, with fail-safe mechanisms built in.
Check out https://bigml.com/ for yourself to see.
The invite codes are here- one-time use only- first five get the invites- so click and try your luck at machine learning on the cloud.
If you don't get an invite (or it is already used), just leave your email there and wait a couple of days for approval.
Easter Eggs in #Rstats
Yes.
Cite-http://en.wikipedia.org/wiki/Easter_egg_(media)
A virtual Easter egg is an intentional hidden message, in-joke, or feature in a work such as a computer program, web page, video game, movie, book, or crossword. The term was coined — according to Warren Robinett — by Atari after they were pointed to the secret message left by Robinett in the game Adventure.[1] It draws a parallel with the custom of the Easter egg hunt observed in many Western nations as well as the last Russian imperial family’s tradition of giving elaborately jeweled egg-shaped creations by Carl Fabergé which contained hidden surprises
In R.
Cite-http://stackoverflow.com/questions/7910270/are-there-any-easter-eggs-in-base-r-or-in-major-packages
I like this
just type
example(readline)
and these two
on 32-bit R, type
memory.limit(4096)
and on any version try four question marks
Perhaps the prettiest eggs are the demos in the animation package.
But there is magic in asking for help on internal functions in R
Just type ?.Internal
and you get the sobering thought that you probably are an R Muggle
Call an Internal Function
Description
.Internal performs a call to an internal code which is built in to the R interpreter.
Only true R wizards should even consider using this function, and only R developers can add to the list of internal functions.
Usage
.Internal(call)
Arguments
call: a call expression
See Also
.Primitive, .External (the nearest equivalent available to users).
I liked that I could see the actual internal functions in svn at http://svn.r-project.org/R/trunk/src/main/names.c
The opening of the internals document floored me.
It must have been a curious year in 2003-4 when the copyright of R was held (briefly, it seems) by the R Foundation and also by the R Development Core Team. (Which sounds better?)
* R : A Computer Language for Statistical Data Analysis
* Copyright (C) 1995, 1996 Robert Gentleman and Ross Ihaka
* Copyright (C) 1997--2012 The R Development Core Team
* Copyright (C) 2003, 2004 The R Foundation
My contribution
R help discourages the for loop
Try ??for or ?for
you go into a loop till you hit escape
If you want more-just write .Internal(inspect(ls())) at the end of your R program.
How to learn Hacking Part 2
Now that you have read the basics here at http://www.decisionstats.com/how-to-learn-to-be-a-hacker-easily/ (please do read that before reading on)
Here is a list of tutorials that you should study (in order of ease)
1) LEARN BASICS – enough to get you a job maybe if that’s all you wanted.
http://www.offensive-security.com/metasploit-unleashed/Main_Page

2) READ SOME MORE-
Lena’s Reverse Engineering Tutorial-“Use Google.com for finding the Tutorial”
Lena’s Reverse Engineering tutorial. It includes 36 parts of individual cracking techniques and will teach you the basics of protection bypassing
01. Olly + assembler + patching a basic reverseme
02. Keyfiling the reverseme + assembler
03. Basic nag removal + header problems
04. Basic + aesthetic patching
05. Comparing on changes in cond jumps, animate over/in, breakpoints
06. “The plain stupid patching method”, searching for textstrings
07. Intermediate level patching, Kanal in PEiD
08. Debugging with W32Dasm, RVA, VA and offset, using LordPE as a hexeditor
09. Explaining the Visual Basic concept, introduction to SmartCheck and configuration
10. Continued reversing techniques in VB, use of decompilers and a basic anti-anti-trick
11. Intermediate patching using Olly’s “pane window”
12. Guiding a program by multiple patching.
13. The use of API’s in software, avoiding doublechecking tricks
14. More difficult schemes and an introduction to inline patching
15. How to study behaviour in the code, continued inlining using a pointer
16. Reversing using resources
17. Insights and practice in basic (self)keygenning
18. Diversion code, encryption/decryption, selfmodifying code and polymorphism
19. Debugger detected and anti-anti-techniques
20. Packers and protectors : an introduction
21. Imports rebuilding
22. API Redirection
23. Stolen bytes
24. Patching at runtime using loaders from lena151 original
25. Continued patching at runtime & unpacking armadillo standard protection
26. Machine specific loaders, unpacking & debugging armadillo
27. tElock + advanced patching
28. Bypassing & killing server checks
29. Killing & inlining a more difficult server check
30. SFX, Run Trace & more advanced string searching
31. Delphi in Olly & DeDe
32. Author tricks, HIEW & approaches in inline patching
33. The FPU, integrity checks & loader versus patcher
34. Reversing techniques in packed software & a S&R loader for ASProtect
35. Inlining inside polymorphic code
36. Keygenning
If you want more free training – hang around this website
http://www.owasp.org/index.php/Cheat_Sheets
OWASP Cheat Sheet Series
- OWASP Top Ten Cheat Sheet
- Authentication Cheat Sheet
- Cross-Site Request Forgery (CSRF) Prevention Cheat Sheet
- Transport Layer Protection Cheat Sheet
- Cryptographic Storage Cheat Sheet
- Input Validation Cheat Sheet
- XSS Prevention Cheat Sheet
- DOM based XSS Prevention Cheat Sheet
- Forgot Password Cheat Sheet
- Query Parameterization Cheat Sheet
- SQL Injection Prevention Cheat Sheet
- Session Management Cheat Sheet
- HTML5 Security Cheat Sheet
- Web Service Security Cheat Sheet
- Application Security Architecture Cheat Sheet
- Logging Cheat Sheet
- JAAS Cheat Sheet
Draft OWASP Cheat Sheets
- Access Control Cheat Sheet
- REST Security Cheat Sheet
- Abridged XSS Prevention Cheat Sheet
- PHP Security Cheat Sheet
- Password Storage Cheat Sheet
- Secure Coding Cheat Sheet
- Threat Modeling Cheat Sheet
- Clickjacking Cheat Sheet
- Virtual Patching Cheat Sheet
- Secure SDLC Cheat Sheet
3) SPEND SOME MONEY on TRAINING
http://www.corelan-training.com/index.php/training/corelan-live/
Course overview
Module 1 – The x86 environment
- System Architecture
- Windows Memory Management
- Registers
- Introduction to Assembly
- The stack
Module 2 – The exploit developer environment
- Setting up the exploit developer lab
- Using debuggers and debugger plugins to gather primitives
Module 3 – Saved Return Pointer Overwrite
- Functions
- Saved return pointer overwrites
- Stack cookies
Module 4 – Abusing Structured Exception Handlers
- Abusing exception handler overwrites
- Bypassing Safeseh
Module 5 – Pointer smashing
- Function pointers
- Data/object pointers
- vtable/virtual functions
Module 6 – Off-by-one and integer overflows
- Off-by-one
- Integer overflows
Module 7 – Limited buffers
- Limited buffers, shellcode splitting
Module 8 – Reliability++ & reusability++
- Finding and avoiding bad characters
- Creative ways to deal with character set limitations
Module 9 – Fun with Unicode
- Exploiting Unicode based overflows
- Writing venetian alignment code
- Creating and Using venetian shellcode
Module 10 – Heap Spraying Fundamentals
- Heap Management and behaviour
- Heap Spraying for Internet Explorer 6 and 7
Module 11 – Egg Hunters
- Using and tweaking Egg hunters
- Custom egghunters
- Using Omelet egghunters
- Egghunters in a WoW64 environment
Module 12 – Shellcoding
- Building custom shellcode from scratch
- Understanding existing shellcode
- Writing portable shellcode
- Bypassing Antivirus
Module 13 – Metasploit Exploit Modules
- Writing exploits for the Metasploit Framework
- Porting exploits to the Metasploit Framework
Module 14 – ASLR
- Bypassing ASLR
Module 15 – W^X
- Bypassing NX/DEP
- Return Oriented Programming / Code Reuse (ROP) )
Module 16 – Advanced Heap Spraying
- Heap Feng Shui & heaplib
- Precise heap spraying in modern browsers (IE8 & IE9, Firefox 13)
Module 17 – Use After Free
- Exploiting Use-After-Free conditions
Module 18 – Windows 8
- Windows 8 Memory Protections and Bypass
ALSO GET CERTIFIED http://www.offensive-security.com/information-security-training/penetration-testing-with-backtrack/ ($950 cost)
the syllabus is here at
http://www.offensive-security.com/documentation/penetration-testing-with-backtrack.pdf
4) HANG AROUND OTHER HACKERS
At http://attrition.org/attrition/
or The Noir Hat Conferences-
http://blackhat.com/html/bh-us-12/training/bh-us-12-training_complete.html
or read this website
http://software-security.sans.org/developer-how-to/
5) GET A DEGREE
Yes it is possible
See http://web.jhu.edu/jhuisi/
The Johns Hopkins University Information Security Institute (JHUISI) is the University’s focal point for research and education in information security, assurance and privacy.
Scholarship Information
The Information Security Institute is now accepting applications for the Department of Defense’s Information Assurance Scholarship Program (IASP). This scholarship includes full tuition, a living stipend, books and health insurance. In return each student recipient must work for a DoD agency at a competitive salary for six months for every semester funded. The scholarship is open to American citizens only.
http://web.jhu.edu/jhuisi/mssi/index.html
MASTER OF SCIENCE IN SECURITY INFORMATICS PROGRAM
The flagship educational experience offered by Johns Hopkins University in the area of information security and assurance is represented by the Master of Science in Security Informatics degree. Over thirty courses are available in support of this unique and innovative graduate program.
———————————————————–
Disclaimer- I haven't done any of these things- this is just a curated list from Quora, so I am open to feedback.
You use this at your own risk of conscience, local legal jurisdictions and your own legal liability.
send email by R
For automated report delivery I have often used the send email options in BASE SAS. For R, for scheduling tasks and sending myself automated mails on completion of tasks, I have two R options and one Windows OS scheduling option. Note: red font denotes the parameters that should be changed. Anything else should NOT be changed.
Option 1-
Use the mail package at
http://cran.r-project.org/web/packages/mail/mail.pdf
> library(mail)
Attaching package: 'mail'
The following object(s) are masked from 'package:sendmailR':
sendmail
>
> sendmail("ohri2007@gmail.com", subject="Notification from R", message="Calculation finished!", password="rmail")
[1] "Message was sent to ohri2007@gmail.com! You have 19 messages left."
Disadvantage- only 20 email messages per IP address per day (but that's ok!).
Option 2-
use the sendmailR package at http://cran.r-project.org/web/packages/sendmailR/sendmailR.pdf
install.packages("sendmailR")
library(sendmailR)
from <- sprintf("<sendmailR@%s>", Sys.info()[4])
to <- "<ohri2007@gmail.com>"
subject <- "Hello from R"
body <- list("It works!", mime_part(iris))
sendmail(from, to, subject, body, control=list(smtpServer="ASPMX.L.GOOGLE.COM"))
BiocInstaller version 1.2.1, ?biocLite for help
> install.packages("sendmailR")
Installing package(s) into '/home/ubuntu/R/library'
(as 'lib' is unspecified)
also installing the dependency 'base64'
trying URL 'http://cran.at.r-project.org/src/contrib/base64_1.1.tar.gz'
Content type 'application/x-gzip' length 61109 bytes (59 Kb)
opened URL
==================================================
downloaded 59 Kb
trying URL 'http://cran.at.r-project.org/src/contrib/sendmailR_1.1-1.tar.gz'
Content type 'application/x-gzip' length 6399 bytes
opened URL
==================================================
downloaded 6399 bytes
* installing *source* package 'base64' ...
** package 'base64' successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/local/lib64/R/include -I/usr/local/include -fpic -g -O2 -c base64.c -o base64.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o base64.so base64.o -L/usr/local/lib64/R/lib -lR
installing to /home/ubuntu/R/library/base64/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded
* DONE (base64)
* installing *source* package 'sendmailR' ...
** package 'sendmailR' successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded
* DONE (sendmailR)
The downloaded packages are in
'/tmp/RtmpsM222s/downloaded_packages'
> library(sendmailR)
Loading required package: base64
> from <- sprintf("<sendmailR@%s>", Sys.info()[4])
> to <- "<ohri2007@gmail.com>"
> subject <- "Hello from R"
> body <- list("It works!", mime_part(iris))
> sendmail(from, to, subject, body,
+ control=list(smtpServer="ASPMX.L.GOOGLE.COM"))
$code
[1] "221"
$msg
[1] "2.0.0 closing connection ff2si17226764qab.40"
Disadvantage- this worked when I used the Amazon cloud with the BioConductor AMI (free for 2 hours) at http://www.bioconductor.org/help/cloud/
It did NOT work when I tried to use it from my Windows 7 Home Premium PC via my Indian ISP (!!).
It gave me this error-
Error in wait_for(250) :
SMTP Error: 5.7.1 [180.215.172.252] The IP you're using to send mail is not authorized
PAUSE–
ps Why do this (send email by R)?
Note you can add either of the two programs at the end of any code that you want to be notified about automatically (like daily tasks).
This is mostly done for repeated business analytics tasks (like reports and analysis that need to be run at specific periods of time)
pps- What else can I do with this?
It can be modified to send SMS, tweets, or even blog posts by email, by modifying the "to" location appropriately.
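As a side note, the same notify-on-completion pattern outside R looks like this with Python's standard smtplib/email modules. The sender address and SMTP server are placeholders; the point is that swapping the "to" address for a carrier's SMS gateway or a post-by-email address is all it takes:

```python
import smtplib
from email.message import EmailMessage


def notification(to_addr, subject, body):
    """Build the notification message; swapping to_addr for an
    SMS/email gateway address is what turns this into a text message."""
    msg = EmailMessage()
    msg["From"] = "reports@example.com"   # placeholder sender
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


msg = notification("ohri2007@gmail.com", "Notification from R",
                   "Calculation finished!")
# To actually send it (requires a reachable SMTP server):
# with smtplib.SMTP("smtp.example.com", 587) as s:
#     s.starttls()
#     s.login("user", "password")
#     s.send_message(msg)
```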
3) Using Windows Task Scheduler to run R code automatically (either of the above)
or just sending an email
go to Start > All Programs > Accessories > System Tools > Task Scheduler (or by default C:\Windows\system32\taskschd.msc)
Create a basic task
Now you can use this to run your daily/or scheduled R code or you can send yourself email as well.
and modify the parameters- note the SMTP server (you can use Google's, as in example 2: ASPMX.L.GOOGLE.COM)
and check if it works!
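For reference, the same basic task can also be created from the command line with Windows' schtasks utility; here is a sketch (the task name, script path and schedule are hypothetical), assembled in Python so each piece is labelled:

```python
import subprocess  # only needed when you actually run the command

# Hypothetical scheduled task: run an R report script every day at 07:00.
cmd = [
    "schtasks", "/Create",
    "/TN", "DailyRReport",                      # task name
    "/TR", r"Rscript.exe C:\scripts\report.R",  # program to run
    "/SC", "DAILY",                             # schedule type
    "/ST", "07:00",                             # start time
]

# On a Windows machine you would execute:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```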
Geeky Things, Bro
–
Configuring IIS on your Windows 7 Home Edition-
note path to do this is-
Control Panel>All Control Panel Items> Program and Features>Turn Windows features on or off> Internet Information Services
and
http://stackoverflow.com/questions/709635/sending-mail-from-batch-file
R for Predictive Modeling- PAW Toronto
A nice workshop on using R for Predictive Modeling, by Max Kuhn, Director, Nonclinical Statistics, Pfizer, is on at PAW Toronto.
Workshop
Monday, April 23, 2012 in Toronto
Full-day: 9:00am – 4:30pm
R for Predictive Modeling:
A Hands-On Introduction
Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”
Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.

Workshop Description
This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.
The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:
- A working knowledge of the R system
- The strengths and limitations of the R language
- Preparing data with R, including splitting, resampling and variable creation
- Developing predictive models with R, including decision trees, support vector machines and ensemble methods
- Visualization: Exploratory Data Analysis (EDA), and tools that persuade
- Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting
Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.
Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.
Schedule
- Workshop starts at 9:00am
- Morning Coffee Break at 10:30am – 11:00am
- Lunch provided at 12:30 – 1:15pm
- Afternoon Coffee Break at 2:30pm – 3:00pm
- End of the Workshop: 4:30pm
Instructor
Max Kuhn, Director, Nonclinical Statistics, Pfizer
Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the pharmaceutical industries for over 15 years.
He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.
Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.
Source-
http://www.predictiveanalyticsworld.com/toronto/2012/r_for_predictive_modeling.php
Book Review- Machine Learning for Hackers
This is a review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O'Reilly). The book is about hacking code in R.
The preface introduces the reader to the authors' conception of what machine learning and hacking are all about. If the book had been named Machine Learning for Business Analysts or Data Miners, I am sure the content would have been unchanged, though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed, the many wise and learned professors of statistics departments throughout the civilized world would be mildly surprised and bemused to see their day-to-day activities described as hacking or teaching hackers. The book follows a case study and example based approach and uses the ggplot2 package within R almost to the point of ignoring any other graphics system native to R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.
Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment, and using the ggplot2 package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at an extremely helpful pace, illustrating the powerful and flexible nature of the R functions that can be used for data cleansing. The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like the packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding from the reader than the average R book.
Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students and newcomers to the field of data analysis. It introduces the reader to type determination functions, as well as different kinds of encoding. The introduction to creating functions is quite elegant and simple, and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2, it fails to create a more generic framework for data exploration, or rules to assist the reader in visual data exploration in non-standard data situations. While the examples are very helpful, there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies, however elaborately explained, cannot do justice to the vast field of data exploration, and especially visual data exploration.
Chapter 3 discusses binary classification for the specific purpose of spam filtering, using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining using the tm package in R. Some of the example code could have been better commented for easier readability in the book. Overall it is quite an easy tutorial for creating a naïve Bayes classifier, even for beginners.
Chapter 4 discusses the issues in importance ranking and creating recommendation systems, specifically for ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime, difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again, the code within this chapter is sparsely commented, which can cause difficulties for people not used to reading reams of code (the comments may have been part of the code attached with the book, but I am reading an electronic copy and did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by flow charts explaining the path and process followed, rather than overly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary would aid both in revising the concepts and in elucidating the approach taken. A suggestion for the authors would be to compile the list of useful functions they introduce in this book as a sort of reference card for R hackers, or at least to have a chapter-wise summary of functions, datasets and packages used.
Chapter 5 discusses linear regression, and the introduction to regression theory is surprisingly thin and not very good. However, the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of a relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability, and a separate section on the regression metrics that make a regression result good or not so good would be a welcome addition. Functions introduced include lm.
Chapter 6 showcases Generalized Additive Models (GAM) and polynomial regression, including an introduction to singularity and over-fitting. Functions included in this chapter are transform and poly, while the package glmnet is also used here. The chapter also introduces the reader formally to the concepts of cross-validation (though examples of cross-validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end of this chapter.
Chapter 7 is about optimization. It describes error metrics in a very easy to understand way. It creates a grid by using nested loops over various values of the intercept and slope of a regression equation and computing the sum of squared errors. It then describes the optim function in detail, including how it works and its various parameters. It introduces the curve function. The chapter then describes ridge regression, including its definition and the hyperparameter lambda. The use of the optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Constants introduced in this chapter include .Machine$double.eps.
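The grid-search idea described in this chapter is language-independent; here is a toy re-sketch in Python (made-up data, not the book's example): for each candidate intercept/slope pair on a grid, compute the sum of squared errors and keep the best.

```python
# Toy data, roughly y = 1 + 2x with a little noise.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 3.0, 5.2, 6.9, 9.1]


def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Brute-force a grid of (intercept, slope) pairs in steps of 0.1.
grid = [(a / 10, b / 10) for a in range(0, 31) for b in range(0, 31)]
best = min(grid, key=lambda p: sse(*p))
print(best)  # lands near the true (1, 2)
```

optim in R (or scipy.optimize in Python) replaces this brute force with a proper numerical search, which is exactly the step the chapter takes next.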
Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from the lubridate package to convert strings to date objects, and the cast function from the reshape package to further manipulate the structure of the data. Using the princomp function enables PCA in R. The case study creates a stock market index and compares the results with the Dow Jones Index.
Chapter 9 deals with multidimensional scaling, as well as clustering US senators on the basis of similarity in their voting records on legislation. It showcases matrix multiplication using %*% and also the dist function to compute a distance matrix.
Chapter 10 covers k-Nearest Neighbors for recommendation systems. Packages used include class and reshape, and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating the Euclidean distance between the centers of centroids and the data. The case study used is the R package recommendation contest on Kaggle. Overall, a simple introduction to creating a recommendation system using k nearest neighbors, without getting into any of the prepackaged R packages that deal with association analysis, clustering or recommendation systems.
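The custom kNN the chapter builds can be sketched in a few lines; this is a generic Python version with toy data (Euclidean distance plus a majority vote), not the book's Kaggle code:

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; vote among the
    k nearest neighbours of query by Euclidean distance."""
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5)))  # the nearest points are all "A"
```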
Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using Erdős numbers as an interesting example of social networks of mathematicians. The example of hacking Google's Social Graph API is quite new and intriguing (though a bit obsolete due to API changes, which should be rectified in either the errata or the next edition). However, there exist packages within R that should at least be referenced or used within this chapter (like the twitteR package that uses the Twitter API, and the ROAuth package for other social networks). Packages used within this chapter include RCurl, RJSONIO and igraph, and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example builds a recommendation engine for whom to follow on Twitter using R.
Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyperparameters within default algorithms. A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It concludes by using mean squared error as a method for comparing models built with different algorithms.
Overall the book is a welcome addition to the library of books based on the R programming language, and the refreshing flow of the material and the practicality of its case studies make this a recommended addition for both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.
Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do