Talking on Big Data Analytics

I am going  being sponsored to a Government of India sponsored talk on Big Data Analytics at Bangalore on Friday the 13 th of July. If you are in Bangalore, India you may drop in for a dekko. Schedule and Abstracts (i am on page 7 out 9) .

Your tax payer money is hard at work- (hassi majak only if you are a desi. hassi to fassi.)

13 July 2012 (9.30 – 11.00 & 11.30 – 1.00)
Big Data Big Analytics
The talk will showcase using open source technologies in statistical computing for big data, namely the R programming language and its use cases in big data analysis. It will review case studies using the Amazon Cloud, custom packages in R for Big Data, tools like Revolution Analytics RevoScaleR package, as well as the newly launched SAP Hana used with R. We will also review Oracle R Enterprise. In addition we will show some case studies using BigML.com (using Clojure) , and approaches using PiCloud. In addition it will showcase some of Google APIs for Big Data Analysis.

Lastly we will talk on social media analysis ,national security use cases (i.e. cyber war) and privacy hazards of big data analytics.

Schedule

View more presentations from Ajay Ohri.
Abstracts

View more documents from Ajay Ohri.

 

R for Business Analytics- Book by Ajay Ohri

So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for past 2 years. Special thanks to my mentors, detractors, readers and students- I owe you a beer!

You can also go here-

http://www.springer.com/statistics/book/978-1-4614-4342-1

 

R for Business Analytics

R for Business Analytics

Ohri, Ajay

2012, 2012, XVI, 300 p. 208 illus., 162 in color.

Hardcover
Information

ISBN 978-1-4614-4342-1

Due: September 30, 2012

(net)

approx. 44,95 €
  • Covers full spectrum of R packages related to business analytics
  • Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
  • Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages.  With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.

 

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

Content Level » Professional/practitioner

Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining

Related subjects » Business, Economics & Finance – Computational Statistics – Statistics

TABLE OF CONTENTS

Why R.- R Infrastructure.- R Interfaces.- Manipulating Data.- Exploring Data.- Building Regression Models.- Data Mining using R.- Clustering and Data Segmentation.- Forecasting and Time-Series Models.- Data Export and Output.- Optimizing your R Coding.- Additional Training Literature.- Appendix

Interview Alvaro Tejada Galindo, SAP Labs Montreal, Using SAP Hana with #Rstats

Here is a brief interview with Alvaro Tejada Galindo aka Blag who is a developer working with SAP Hana and R at SAP Labs, Montreal. SAP Hana is SAP’s latest offering in BI , it’s also a database and a computing environment , and using R and HANA together on the cloud can give major productivity gains in terms of both speed and analytical ability, as per preliminary use cases.

Ajay- Describe how you got involved with databases and R language.
Blag-  I used to work as an ABAP Consultant for 11 years, but also been involved with programming since the last 13 years, so I was in touch with SQLServer, Oracle, MySQL and SQLite. When I joined SAP, I heard that SAP HANA was going to use an statistical programming language called “R”. The next day I started my “R” learning.

Ajay- What made the R language a fit for SAP HANA. Did you consider other languages? What is your view on Julia/Python/SPSS/SAS/Matlab languages

Blag- I think “R” is a must for SAP HANA. As the fastest database in the market, we needed a language that could help us shape the data in the best possible way. “R” filled that purpose very well. Right now, “R” is not the only language as “L” can be used as well (http://wiki.tcl.tk/17068) …not forgetting “SQLScript” which is our own version of SQL (http://goo.gl/x3bwh) . I have to admit that I tried Julia, but couldn’t manage to make it work. Regarding Python, it’s an interesting question as I’m going to blog about Python and SAP HANA soon. About Matlab, SPSS and SAS I haven’t used them, so I got nothing to say there.

Ajay- What is your view on some of the limitations of R that can be overcome with using it with SAP HANA.

Blag-  I think mostly the ability of SAP HANA to work with big data. Again, SAP HANA and “R” can work very nicely together and achieve things that weren’t possible before.

Ajay-  Have you considered other vendors of R including working with RStudio, Revolution Analytics, and even Oracle R Enterprise.

Blag-  I’m not really part of the SAP HANA or the R groups inside SAP, so I can’t really comment on that. I can only say that I use RStudio every time I need to do something with R. Regarding Oracle…I don’t think so…but they can use any of our products whenever they want.

Ajay- Do you have a case study on an actual usage of R with SAP HANA that led to great results.

Blag-   Right now the use of “R” and SAP HANA is very preliminary, I don’t think many people has start working on it…but as an example that it works, you can check this awesome blog entry from my friend Jitender Aswani “Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps “ (http://allthingsr.blogspot.com/#!/2012/04/big-data-r-and-hana-analyze-200-million.html)

Ajay- Does your group in SAP plan to give to the R ecosystem by attending conferences like UseR 2012, sponsoring meets, or package development etc

Blag- My group is in charge of everything developers, so sure, we’re planning to get more in touch with R developers and their ecosystem. Not sure how we’re going to deal with it, but at least I’m going to get myself involved in the Montreal R Group.

 

About-

http://scn.sap.com/people/alvaro.tejadagalindo3

Name: Alvaro Tejada Galindo
Email: a.tejada.galindo@sap.com
Profession: Development
Company: SAP Canada Labs-Montreal
Town/City: Montreal
Country: Canada
Instant Messaging Type: Twitter
Instant Messaging ID: Blag
Personal URL: http://blagrants.blogspot.com
Professional Blog URL: http://www.sdn.sap.com/irj/scn/weblogs?blog=/pub/u/252210910
My Relation to SAP: employee
Short Bio: Development Expert for the Technology Innovation and Developer Experience team.Used to be an ABAP Consultant for the last 11 years. Addicted to programming since 1997.

http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/overview/index.epx

and from

http://en.wikipedia.org/wiki/SAP_HANA

SAP HANA is SAP AG’s implementation of in-memory database technology. There are four components within the software group:[1]

  • SAP HANA DB (or HANA DB) refers to the database technology itself,
  • SAP HANA Studio refers to the suite of tools provided by SAP for modeling,
  • SAP HANA Appliance refers to HANA DB as delivered on partner certified hardware (see below) as anappliance. It also includes the modeling tools from HANA Studio as well replication and data transformation tools to move data into HANA DB,[2]
  • SAP HANA Application Cloud refers to the cloud based infrastructure for delivery of applications (typically existing SAP applications rewritten to run on HANA).

R is integrated in HANA DB via TCP/IP. HANA uses SQL-SHM, a shared memory-based data exchange to incorporate R’s vertical data structure. HANA also introduces R scripts equivalent to native database operations like join or aggregation.[20] HANA developers can write R scripts in SQL and the types are automatically converted in HANA. R scripts can be invoked with HANA tables as both input and output in the SQLScript. R environments need to be deployed to use R within SQLScript

More blog posts on using SAP and R together

Dealing with R and HANA

http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
R meets HANA

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/01/29/r-meets-hana

HANA meets R

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/01/26/hana-meets-r
When SAP HANA met R – First kiss

http://scn.sap.com/community/developer-center/hana/blog/2012/05/21/when-sap-hana-met-r–first-kiss

 

Using RODBC with SAP HANA DB-

SAP HANA: My experiences on using SAP HANA with R

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/02/21/sap-hana-my-experiences-on-using-sap-hana-with-r

and of course the blog that started it all-

Jitender Aswani’s http://allthingsr.blogspot.in/

 

 

Protected: Converting SAS language code to Java

This content is password-protected. To view it, please enter the password below.

Software Review- BigML.com – Machine Learning meets the Cloud

I had a chance to dekko the new startup BigML https://bigml.com/ and was suitably impressed by the briefing and my own puttering around the site. Here is my review-

1) The website is very intutively designed- You can create a dataset from an uploaded file in one click and you can create a Decision Tree model in one click as well. I wish other cloud computing websites like  Google Prediction API make design so intutive and easy to understand. Also unlike Google Prediction API, the models are not black box models, but have a description which can be understood.

2) It includes some well known data sources for people trying it out. They were kind enough to offer 5 invite codes for readers of Decisionstats ( if you want to check it yourself, use the codes below the post, note they are one time only , so the first five get the invites.

BigML is still invite only but plan to get into open release soon.

3) Data Sources can only be by uploading files (csv) but they plan to change this hopefully to get data from buckets (s3? or Google?) and from URLs.

4) The one click operation to convert data source into a dataset shows a histogram (distribution) of individual variables.The back end is clojure , because the team explained it made the easiest sense and fit with Java. The good news (?) is you would never see the clojure code at the back end. You can read about it from http://clojure.org/

As cloud computing takes off (someday) I expect clojure popularity to take off as well.

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language – it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp

 

5) As of now decision trees is the only distributed algol, but they expect to roll out other machine learning stuff soon. Hopefully this includes regression (as logit and linear) and k means clustering. The trees are created and pruned in real time which gives a slightly animated (and impressive effect). and yes model building is an one click operation.

The real time -live pruning is really impressive and I wonder why /how it can ever be replicated in other software based on desktop, because of the sheer interactive nature.

 

Making the model is just half the work. Creating predictions and scoring the model is what is really the money-earner. It is one click and customization is quite intuitive. It is not quite PMML compliant yet so I hope some Zemanta like functionality can be added so huge amounts of models can be applied to predictions or score data in real time.

 

If you are a developer/data hacker, you should check out this section too- it is quite impressive that the designers of BigML have planned for API access so early.

https://bigml.com/developers

BigML.io gives you:

  • Secure programmatic access to all your BigML resources.
  • Fully white-box access to your datasets and models.
  • Asynchronous creation of datasets and models.
  • Near real-time predictions.

 

Note: For your convenience, some of the snippets below include your real username and API key.

Please keep them secret.

REST API

BigML.io conforms to the design principles of Representational State Transfer (REST)BigML.io is enterely HTTP-based.

BigML.io gives you access to four basic resources: SourceDatasetModel and Prediction. You cancreatereadupdate, and delete resources using the respective standard HTTP methods: POSTGET,PUT and DELETE.

All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with a HTTP PUT using the “multipart/form-data” content-type

HTTPS

All access to BigML.io must be performed over HTTPS

and https://bigml.com/developers/quick_start ( In think an R package which uses JSON ,RCurl  would further help in enhancing ease of usage).

 

Summary-

Overall a welcome addition to make software in the real of cloud computing and statistical computation/business analytics both easy to use and easy to deploy with fail safe mechanisms built in.

Check out https://bigml.com/ for yourself to see.

The invite codes are here -one time use only- first five get the invites- so click and try your luck, machine learning on the cloud.

If you dont get an invite (or it is already used, just leave your email there and wait a couple of days to get approval)

  1. https://bigml.com/accounts/register/?code=E1FE7
  2. https://bigml.com/accounts/register/?code=09991
  3. https://bigml.com/accounts/register/?code=5367D
  4. https://bigml.com/accounts/register/?code=76EEF
  5. https://bigml.com/accounts/register/?code=742FD

How to learn Hacking Part 2

Now that you have read the basics here at http://www.decisionstats.com/how-to-learn-to-be-a-hacker-easily/ (please do read this before reading the below)

 

Here is a list of tutorials that you should study (in order of ease)

1) LEARN BASICS – enough to get you a job maybe if that’s all you wanted.

http://www.offensive-security.com/metasploit-unleashed/Main_Page

2) READ SOME MORE-

Lena’s Reverse Engineering Tutorial-“Use Google.com  for finding the Tutorial

Lena’s Reverse Engineering tutorial. It includes 36 parts of individual cracking techniques and will teach you the basics of protection bypassing

01. Olly + assembler + patching a basic reverseme
02. Keyfiling the reverseme + assembler
03. Basic nag removal + header problems
04. Basic + aesthetic patching
05. Comparing on changes in cond jumps, animate over/in, breakpoints
06. “The plain stupid patching method”, searching for textstrings
07. Intermediate level patching, Kanal in PEiD
08. Debugging with W32Dasm, RVA, VA and offset, using LordPE as a hexeditor
09. Explaining the Visual Basic concept, introduction to SmartCheck and configuration
10. Continued reversing techniques in VB, use of decompilers and a basic anti-anti-trick
11. Intermediate patching using Olly’s “pane window”
12. Guiding a program by multiple patching.
13. The use of API’s in software, avoiding doublechecking tricks
14. More difficult schemes and an introduction to inline patching
15. How to study behaviour in the code, continued inlining using a pointer
16. Reversing using resources
17. Insights and practice in basic (self)keygenning
18. Diversion code, encryption/decryption, selfmodifying code and polymorphism
19. Debugger detected and anti-anti-techniques
20. Packers and protectors : an introduction
21. Imports rebuilding
22. API Redirection
23. Stolen bytes
24. Patching at runtime using loaders from lena151 original
25. Continued patching at runtime & unpacking armadillo standard protection
26. Machine specific loaders, unpacking & debugging armadillo
27. tElock + advanced patching
28. Bypassing & killing server checks
29. Killing & inlining a more difficult server check
30. SFX, Run Trace & more advanced string searching
31. Delphi in Olly & DeDe
32. Author tricks, HIEW & approaches in inline patching
33. The FPU, integrity checks & loader versus patcher
34. Reversing techniques in packed software & a S&R loader for ASProtect
35. Inlining inside polymorphic code
36. Keygenning

If you want more free training – hang around this website

http://www.owasp.org/index.php/Cheat_Sheets

OWASP Cheat Sheet Series

Draft OWASP Cheat Sheets

3) SPEND SOME MONEY on TRAINING

http://www.corelan-training.com/index.php/training/corelan-live/

Course overview

Module 1 – The x86 environment

  • System Architecture
  • Windows Memory Management
  • Registers
  • Introduction to Assembly
  • The stack

Module 2 – The exploit developer environment

  • Setting up the exploit developer lab
  • Using debuggers and debugger plugins to gather primitives

Module 3 – Saved Return Pointer Overwrite

  • Functions
  • Saved return pointer overwrites
  • Stack cookies

Module 4 – Abusing Structured Exception Handlers

  • Abusing exception handler overwrites
  • Bypassing Safeseh

Module 5 – Pointer smashing

  • Function pointers
  • Data/object pointers
  • vtable/virtual functions

Module 6 – Off-by-one and integer overflows

  • Off-by-one
  • Integer overflows

Module 7 – Limited buffers

  • Limited buffers, shellcode splitting

Module 8 – Reliability++ & reusability++

  • Finding and avoiding bad characters
  • Creative ways to deal with character set limitations

Module 9 – Fun with Unicode

  • Exploiting Unicode based overflows
  • Writing venetian alignment code
  • Creating and Using venetian shellcode

Module 10 – Heap Spraying Fundamentals

  • Heap Management and behaviour
  • Heap Spraying for Internet Explorer 6 and 7

Module 11 – Egg Hunters

  • Using and tweaking Egg hunters
  • Custom egghunters
  • Using Omelet egghunters
  • Egghunters in a WoW64 environment

Module 12 – Shellcoding

  • Building custom shellcode from scratch
  • Understanding existing shellcode
  • Writing portable shellcode
  • Bypassing Antivirus

Module 13 – Metasploit Exploit Modules

  • Writing exploits for the Metasploit Framework
  • Porting exploits to the Metasploit Framework

Module 14 – ASLR

  • Bypassing ASLR

Module 15 – W^X

  • Bypassing NX/DEP
  • Return Oriented Programming / Code Reuse (ROP) )

Module 16 – Advanced Heap Spraying

  • Heap Feng Shui & heaplib
  • Precise heap spraying in modern browsers (IE8 & IE9, Firefox 13)

Module 17 – Use After Free

  • Exploiting Use-After-Free conditions

Module 18 – Windows 8

  • Windows 8 Memory Protections and Bypass
TRAINING SCHEDULES AT

ALSO GET CERTIFIED http://www.offensive-security.com/information-security-training/penetration-testing-with-backtrack/ ($950 cost)

the syllabus is here at

http://www.offensive-security.com/documentation/penetration-testing-with-backtrack.pdf

4) HANG AROUND OTHER HACKERS

At http://attrition.org/attrition/

or The Noir  Hat Conferences-

http://blackhat.com/html/bh-us-12/training/bh-us-12-training_complete.html

or read this website

http://software-security.sans.org/developer-how-to/

5) GET A DEGREE

Yes it is possible

 

See http://web.jhu.edu/jhuisi/

The Johns Hopkins University Information Security Institute (JHUISI) is the University’s focal point for research and education in information security, assurance and privacy.

Scholarship Information

 

The Information Security Institute is now accepting applications for the Department of Defense’s Information Assurance Scholarship Program (IASP).  This scholarship includes full tuition, a living stipend, books and health insurance. In return each student recipient must work for a DoD agency at a competitive salary for six months for every semester funded. The scholarship is open to American citizens only.

http://web.jhu.edu/jhuisi/mssi/index.html

MASTER OF SCIENCE IN SECURITY INFORMATICS PROGRAM

The flagship educational experience offered by Johns Hopkins University in the area of information security and assurance is represented by the Master of Science in Security Informatics degree.  Over thirty courses are available in support of this unique and innovative graduate program.

———————————————————–

Disclaimer- I havent done any of these things- This is just a curated list from Quora  so I am open to feedback.

You use this at your own risk of conscience ,local legal jurisdictions and your own legal liability.

 

 

 

 

 

 

R for Predictive Modeling- PAW Toronto

A nice workshop on using R for Predictive Modeling by Max Kuhn Director, Nonclinical Statistics, Pfizer is on at PAW Toronto.

Workshop

Monday, April 23, 2012 in Toronto
Full-day: 9:00am – 4:30pm

R for Predictive Modeling:
A Hands-On Introduction

Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.


What prior attendees have exclaimed


Workshop Description

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.

The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:

  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Preparing data with R, including splitting, resampling and variable creation
  • Developing predictive models with R, including decision trees, support vector machines and ensemble methods
  • Visualization: Exploratory Data Analysis (EDA), and tools that persuade
  • Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting

Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.

Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.


Schedule

  • Workshop starts at 9:00am
  • Morning Coffee Break at 10:30am – 11:00am
  • Lunch provided at 12:30 – 1:15pm
  • Afternoon Coffee Break at 2:30pm – 3:00pm
  • End of the Workshop: 4:30pm

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the pharmaceutical industries for over 15 years.

He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.

Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.

Source-

http://www.predictiveanalyticsworld.com/toronto/2012/r_for_predictive_modeling.php