Top 10 Regrets on Learning the SAS Language

  1. I didn’t learn the SAS Macro Language enough. SAS Macros are cool, and fast. Ditto for arrays and ODS.
  2. Not keeping up with the changes in Version 9+, especially the hash object. (Why name a technique after a recreational drug? Most unfair.)
  3. Not studying more statistics theory.
  4. Flunking SAS Certification Twice.
  5. Not making enough money, because customers need a solution, not a p-value.
  6. There is no Proc common sense. There is no Proc Clean the Data.
  7. No macros to automate the model: here is dirty data, there is the clean model. Wait till Version 16.
  8. Not getting selected by SAS R&D. Not applying to SAS R&D.
  9. Google has better voice recognition for typing notes. There is no voice recognition in the SAS language to type syntax.
  10. Enhanced Editor and EG are both idiotic junk pushed by Marketing!

Inspired by true events at

http://www.sascommunity.org/wiki/Category:Bricolage

Why Online Education

1) Huge variety of courses from the best professors in the world (see the Gamification course from Coursera below), or Machine Learning and Human-Computer Interaction.


2) They are free! (Well, that is a mistake: your time is not free.)

Also, Signature Track courses at Coursera now offer a credible certificate option for $39, and they come with more support.

Why do you as a student need support? Because sometimes you get stuck, and sometimes you need human interaction to stay motivated.

3) Coursera – I love these things:

Can run the course faster, at 1.75x speed (because, seriously, I get distracted otherwise)

Can turn on multi-language CC (captions) – reading is so much faster

Best feature – in-video quizzes

The largest number of courses

Free!

4) Codecademy –

Makes learning fun

Makes it easy to learn a language

I wish someone could mash up more of Coursera’s content with Codecademy’s gamification and teach hacking and data science to the next generation of hackers!

The rest of the websites are good too, but I stick to Coursera and Codecademy!

5) Education empowers! Every person who learns R or JMP through a free MOOC will create more value for themselves, their customers, their society, and their country than if they had remained uneducated because they could not afford the training.

 

R 3.0 launched #rstats

The 3.0 era for R starts today! Changes include better support for Big Data.

Read the NEWS here

  • install.packages() has a new argument quiet to reduce the amount of output shown.
  • New functions cite() and citeNatbib() have been added, to allow generation of in-text citations from "bibentry" objects. A cite() function may be added to bibstyle() environments.
  • merge() works in more cases where the data frames include matrices. (Wish of PR#14974.)
  • sample.int() has some support for n >= 2^31: see its help for the limitations. A different algorithm is used for (n, size, replace = FALSE, prob = NULL) for n > 1e7 and size <= n/2. This is much faster and uses less memory, but does give different results.
  • list.files() (aka dir()) gains a new optional argument no.. which allows "." and ".." to be excluded from listings.
  • Profiling via Rprof() now optionally records information at the statement level, not just the function level.
  • available.packages() gains a "license/restricts_use" filter which retains only packages for which installation can proceed solely based on packages which are guaranteed not to restrict use.
  • File ‘share/licenses/licenses.db’ has some clarifications, especially as to which variants of ‘BSD’ and ‘MIT’ are intended and how to apply them to packages. The problematic licence ‘Artistic-1.0’ has been removed.
  • The breaks argument in hist.default() can now be a function that returns the breakpoints to be used (previously it could only return the suggested number of breakpoints); see the short sketch after this list.
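A few of these in action (a quick sketch assuming R 3.0.0; the package name passed to install.packages() is just a placeholder):

    # quieter installs via the new 'quiet' argument
    install.packages("ggplot2", quiet = TRUE)

    # list a directory including hidden files, but without "." and ".."
    list.files(all.files = TRUE, no.. = TRUE)

    # 'breaks' can now be a function returning the breakpoints themselves
    x <- rnorm(1000)
    hist(x, breaks = function(v) seq(min(v), max(v), length.out = 21))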

LONG VECTORS

This section applies only to 64-bit platforms.

  • There is support for vectors longer than 2^31 - 1 elements. This applies to raw, logical, integer, double, complex and character vectors, as well as lists. (Elements of character vectors remain limited to 2^31 - 1 bytes.)
  • Most operations which can sensibly be done with long vectors work: others may return the error ‘long vectors not supported yet’. Most of these are because they explicitly work with integer indices (e.g. anyDuplicated() and match()) or because other limits (e.g. of character strings or matrix dimensions) would be exceeded or the operations would be extremely slow.
  • length() returns a double for long vectors, and lengths can be set to 2^31 or more by the replacement function with a double value (see the short sketch after this list).
  • Most aspects of indexing are available. Generally double-valued indices can be used to access elements beyond 2^31 - 1.
  • There is some support for matrices and arrays with each dimension less than 2^31 but total number of elements more than that. Only some aspects of matrix algebra work for such matrices, often taking a very long time. In other cases the underlying Fortran code has an unstated restriction (as was found for complex svd()).
  • dist() can produce dissimilarity objects for more than 65536 rows (but for example hclust() cannot process such objects).
  • serialize() to a raw vector is unlimited in size (except by resources).
  • The C-level function R_alloc can now allocate 2^35 or more bytes.
  • agrep() and grep() will return double vectors of indices for long vector inputs.
  • Many calls to .C() have been replaced by .Call() to allow long vectors to be supported (now or in the future). Regrettably several packages had copied the non-API .C() calls and so failed.
  • .C() and .Fortran() do not accept long vector inputs. This is a precaution as it is very unlikely that existing code will have been written to handle long vectors (and the R wrappers often assume that length(x) is an integer).
  • Most of the methods for sort() work for long vectors.
  • rank(), sort.list() and order() support long vectors (slowly except for radix sorting).
  • sample() can do uniform sampling from a long vector.
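A minimal long-vector sketch, assuming a 64-bit build of R 3.0.0 and roughly 2 GB of free RAM (a raw vector uses one byte per element):

    x <- raw(2^31)            # one element more than the old 2^31 - 1 limit
    length(x)                 # 2147483648, returned as a double
    is.double(length(x))      # TRUE
    x[2^31] <- as.raw(255)    # double-valued index beyond 2^31 - 1
    rm(x)                     # free the ~2 GB promptly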

PERFORMANCE IMPROVEMENTS

  • More use has been made of R objects representing registered entry points, which is more efficient as the address is provided by the loader once only when the package is loaded.

    This has been done for packages base, methods, splines and tcltk: it was already in place for the other standard packages.

    Since these entry points are always accessed by the R entry points they do not need to be in the load table which can be substantially smaller and hence searched faster. This does mean that .C / .Fortran / .Call calls copied from earlier versions of R may no longer work – but they were never part of the API.

  • Many .Call() calls in package base have been migrated to .Internal() calls.
  • solve() makes fewer copies, especially when b is a vector rather than a matrix.
  • eigen() makes fewer copies if the input has dimnames.
  • Most of the linear algebra functions make fewer copies when the input(s) are not double (e.g. integer or logical).
  • A foreign function call (.C() etc) in a package without a PACKAGE argument will only look in the first DLL specified in the ‘NAMESPACE’ file of the package rather than searching all loaded DLLs. A few packages needed PACKAGE arguments added.
  • The @<- operator is now implemented as a primitive, which should reduce some copying of objects when used. Note that the operator object must now be in package base: do not try to import it explicitly from package methods.

SIGNIFICANT USER-VISIBLE CHANGES

  • Packages need to be (re-)installed under this version (3.0.0) of R.
  • There is a subtle change in behaviour for numeric index values 2^31 and larger. These never used to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.
  • It is now possible for 64-bit builds to allocate amounts of memory limited only by the OS. It may be wise to use OS facilities (e.g. ulimit in a bash shell, limit in csh), to set limits on overall memory consumption of an R process, particularly in a multi-user environment. A number of packages need a limit of at least 4GB of virtual memory to load.

    64-bit Windows builds of R are by default limited in memory usage to the amount of RAM installed: this limit can be changed by the command-line option --max-mem-size or by setting the environment variable R_MAX_MEM_SIZE.
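From inside R on Windows, the same cap can be queried and raised with memory.limit() (a sketch; memory.limit() is Windows-only, and the OS must actually be able to supply the memory):

    memory.limit()              # current limit in MB
    memory.limit(size = 8192)   # request a cap of roughly 8 GB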

 

Interview Dr. Ian Fellows Fellstat.com #rstats Deducer

Here is an interview with Dr. Ian Fellows, creator of acclaimed R packages like Deducer, and the founder and president of Fellstat.com.
Ajay- Describe your involvement with the Deducer project and the various plugins associated with it. What have the usage of and response to Deducer been from the R community?
Ian- Deducer is a graphical user interface for data analysis built on R. It sprang out of a disconnect between the toolchain used by myself and the toolchain of the psychologists I worked with at the University of California, San Diego. They were primarily SPSS users, whereas I liked to use R, especially for anything that was not a standard analysis.
I felt that there was a big gap in the audience that R serves. Not all consumers or producers of statistics can be expected to have the computational background (command-line programming) that R requires. I think it is important to recognize and work with the areas of expertise that statistical users have. I’m not an expert in psychology, and they didn’t expect me to be one. They are not experts in computation, and I don’t think that we should expect them to be in order to be a part of the R toolchain community.
This was the impetus behind Deducer, so it is fundamentally designed to be a familiar experience for users coming from an SPSS background and provides a full implementation of the standard methods in statistics, and data manipulation from descriptives to generalized linear models. Additionally, it has an advanced GUI for creating visualizations which has been well received, and won the John Chambers award for statistical software in 2011.
Uptake of the system is difficult to measure as CRAN does not track package downloads, but from what I can tell there has been a steadily increasing user base. The online manual has been accessed by over 75,000 unique users, with over 400,000 page views. There is a small, active group of developers creating add-on packages supporting various sub-disciplines of statistics. There are 8 packages on CRAN extending/using Deducer, and quite a few more on R-Forge.
Ajay- Do you see any potential for Deducer as an enterprise software product (like RStudio et al.)?
Ian- Like R Studio, Deducer is used in enterprise environments but is not specifically geared towards that environment. I do see potential in that realm, but don’t have any particular plan to make an enterprise version of Deducer.
Ajay- Describe your work in Texas Hold’em poker. Do you see any potential for R to diversify into casino analytics, which has hitherto been served exclusively by non-open-source analytics vendors?
Ian- As a statistician, I’m very much interested in problems of inference under uncertainty, especially when the problem space is huge. Creating an artificial intelligence that can play (heads-up limit) Texas Hold’em poker at a high level is a perfect example of this. There is uncertainty created by the random drawing of cards, the problem space has on the order of 10^18 states, and our opponent can adapt to any strategy that we employ.
While high-level chess AIs have existed for decades, the first viable program to tackle full-scale poker was introduced in 2003 by the incomparable Computer Poker Research Group at the University of Alberta. Thus poker represents a significant challenge that can be used as a test bed to break new ground in applied game theory. In 2007 and 2008 I submitted entries to the AAAI’s annual computer poker competition, which pits AIs from universities across the world against each other. My program, which was based on an approximate game-theoretic equilibrium calculated using a co-evolutionary process called fictitious play, came in second behind the Alberta team.
Ajay- Describe your work in social media analytics for R. What potential do you see for social network analysis, given its current usage in business analytics and business intelligence tools for the enterprise?
Ian- My dissertation focused on new model classes for social network analysis (http://arxiv.org/pdf/1208.0121v1.pdf and http://arxiv.org/pdf/1303.1219.pdf). R has a great collection of tools for social network analysis in the statnet suite of packages, which represents the forefront of the literature on the statistical modeling of social networks. I think that if the analytics data is small enough for the models to be fit, these tools can represent a qualitative leap in the understanding and prediction of user behavior.
Most uses of social networks in enterprise analytics that I have seen are limited to descriptive statistics (what is a user’s centrality; what is the degree distribution), and the use of these descriptive statistics as fixed predictors in a model. I believe that this approach is an important first step, but ignores the stochastic nature of the network, and the dynamics of tie formation and dissolution. Realistic modeling of the network can lead to more principled, and more accurate predictions of the quantities that enterprise users care about.
The rub is that the Markov Chain Monte Carlo Maximum Likelihood algorithms used to fit modern generative social network models (such as exponential-family random graph models) do not scale well at all. These models are typically limited to fitting networks with fewer than 50,000 vertices, which is clearly insufficient for most analytics customers who have networks more on the order of 50,000,000.
This problem is not insoluble though. Part of my ongoing research involves scalable algorithms for fitting social network models.
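For readers who want a taste of the statnet toolchain Ian mentions, here is a minimal sketch of fitting an exponential-family random graph model in R; the Florentine marriage network used below ships with the ergm package:

    library(ergm)                                # part of the statnet suite
    data(florentine)                             # loads flomarriage, a 16-family network
    fit <- ergm(flomarriage ~ edges + triangle)  # density plus triadic closure
    summary(fit)                                 # coefficients and fit diagnostics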
Ajay- You decided to go from your PhD into consulting (www.fellstat.com). What were some of the options you considered in this career choice?
Ian– I’ve been working in the role of a statistical consultant for the last 7 years, starting as an in-house consultant at UCSD after obtaining my MS. Fellows Statistics has been operating for the last 3 years, though not full-time until January of this year. As I had already been consulting, it was a natural progression to transition to consulting full-time once I graduated with my PhD.
This has allowed me to both work on interesting corporate projects, and continue research related to my dissertation via sub-awards from various universities.
Ajay- What does Fellstat.com offer in its consulting practice?
Ian– Fellows Statistics offers personalized analytics services to both corporate and academic clients. We are a boutique company that can scale from a single statistician to a small team of analysts chosen specifically with the client’s needs in mind. I believe that by being small, we can provide better, close-to-the-ground, responsive service to our clients.
As a practice, we live at the intersection of mathematical sophistication and computational skill, with a hint of UI design thrown into the mix. Corporate clients can expect a diverse range of analytic skills, from the development of novel algorithms to the design and presentation of data for a general audience. We’ve worked with Revolution Analytics developing algorithms for their ScaleR product, the Centers for Disease Control developing graphical user interfaces set to be deployed for worldwide HIV surveillance, and Prospectus analyzing clinical trial data for retinal surgery. With access to the cutting-edge research taking place in the academic community, and the skills to implement it in corporate environments, Fellows Statistics is able to offer clients world-class analytics services.
Ajay- How does big data affect the practice of statistics in business decisions?
Ian– There is a big gap between how the basic practice of statistics is taught in most universities and the types of analyses that are useful when data sizes become large. Back when I was at UCSD, I remember a researcher there jokingly saying that everything is correlated at rho = .2. He was joking, but there is a lot of truth to that statement: as data sizes get larger, everything becomes significant if a hypothesis test is done, because the test has the power to detect even trivial relationships.
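A quick sketch that makes the point concrete (the effect size below is arbitrary; any trivial true correlation behaves the same way at scale):

    set.seed(1)
    n <- 1e6                     # a 'big data' sample size
    x <- rnorm(n)
    y <- 0.02 * x + rnorm(n)     # true correlation of about 0.02: practically nothing
    cor.test(x, y)$p.value       # vanishingly small: 'significant' yet trivial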
Ajay- How is the R community, including developers, coping with the Big Data era? What more do you think R can do for Big Data?
Ian- On the open source side, there has been a lot of movement to improve R’s handling of big data. The bigmemory project and the ff package both serve to extend R’s reach beyond in-memory data structures. Revolution Analytics also has the ScaleR package, which costs money but is lightning fast and has an ever-growing list of analytic techniques implemented. There are also several packages integrating R with Hadoop.
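As a flavor of the out-of-memory approach, here is a minimal bigmemory sketch (the file names are placeholders):

    library(bigmemory)
    # A file-backed matrix: the data live on disk rather than in R's heap
    x <- filebacked.big.matrix(nrow = 1e8, ncol = 3, type = "double",
                               backingfile = "big.bin",
                               descriptorfile = "big.desc")
    x[1, ] <- c(1, 2, 3)         # indexed like an ordinary matrix
    x[1, ]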
Ajay- Describe your research into data visualization, including the wordcloud package and others. What do you think of Shiny, D3.js, and online data visualization?
Ian- I recently had the opportunity to delve into d3.js for a client project, and absolutely love it. Combined with Shiny, d3 and R let one very quickly create a web visualization of an R modeling technique. One limitation of d3 is that it doesn’t work well with Internet Explorer 6-8. Once these browsers finally leave the ecosystem, I expect an explosion of sites using d3.
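For context, the Shiny half of that stack can be this small (a sketch of a minimal app; wiring a d3.js front end onto it is a separate step not shown here):

    library(shiny)
    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )
    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }
    shinyApp(ui, server)         # launches the app in the browser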
Ajay- Do you think the word cloud is an overused data visualization type, and how can it be refined?
Ian- I would say yes, but not for the reasons you would think. A lot of people criticize word clouds because they convey the same information as a bar chart, but with less specificity: with a bar chart you can actually see the frequency, whereas with a word cloud you only get a relative idea based on the size of the word.
I think this is both an absolutely correct statement, and misses the point completely. Visualizations are about communicating with the reader. If your readers are statisticians, then they will happily consume the bar chart, following the bar heights to their point on the y-axis to find the frequencies. A statistician will spend time with a graph, will mull it over, and consider what deeper truths are found there. Statisticians are weird though. Most people care as much about how pretty the graph looks as its content. To communicate to these people (i.e. everyone else) it is appropriate and right to sacrifice statistical specificity to design considerations. After all, if the user stops reading you haven’t conveyed anything.
But back to the question… I would say that they are over used because they represent a very superficial analysis of a text or corpus. The word counts do convey an aspect of a text, but not a very nuanced one. The next step in looking at a corpus of texts would be to ask how are they different and how are they the same. The wordcloud package has the comparison and commonality word clouds, which attempt to extend the basic word cloud to answer these questions (see: http://blog.fellstat.com/?p=101).
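A sketch of those two extensions using the wordcloud package (the toy term matrix below is invented purely for illustration):

    library(wordcloud)
    # rows = words, columns = two documents, cells = word counts
    tm <- matrix(c(40, 4, 10, 2, 30, 25), ncol = 2,
                 dimnames = list(c("data", "model", "plot"),
                                 c("doc1", "doc2")))
    comparison.cloud(tm)     # sizes words by how much each document over-uses them
    commonality.cloud(tm)    # sizes words by frequency shared across documents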
About-

Dr. Ian Fellows is a professional statistician based out of the University of California, Los Angeles. His research interests range over many sub-disciplines of statistics. His work in statistical visualization won the prestigious John Chambers Award in 2011, and in 2007-2008 his Texas Hold’em AI programs were ranked second in the world.

Applied data analysis has been a passion for him, and he is accustomed to providing accurate, timely analysis for a wide range of projects, and assisting in the interpretation and communication of statistical results. He can be contacted at info@fellstat.com

Happy April Lulz Day

Happy April Fools Day

  1. Right now, people are breaking promises they said they would keep. This includes politicians, leaders, businessmen, CEOs, friends, and family.
  2. Right now, people are working hard to convince you their religion, software, country, or way of thinking is good for you, when they know it is not.
  3. Right now is a good time to discuss being selfish and selfless.
  4. Right now, evil people are plotting evil things and saying that is good, and good people are doing evil things to stop them, thinking that is better.
  5. “Man sacrifices his health in order to make money. Then he sacrifices money to recuperate his health. Then he is so anxious about the future that he doesn’t enjoy the present: the result being that he does not live in the present or the future; he lives as if he is never going to die, and then dies having never really lived” – Dalai Lama
  6. What if I told you insane was working fifty hours a week in some office for fifty years at the end of which they tell you to piss off; ending up in some retirement village hoping to die before suffering the indignity of trying to make it to the toilet on time? Wouldn’t you consider that to be insane?
  7. Life is a dream for the wise, a game for the fool, a comedy for the rich, a tragedy for the poor.

    Sholom Aleichem

  8. You can fool all the people some of the time, and some of the people all the time, but you cannot fool all the people all the time. Abe Lincoln

  9. Failure is unimportant. It takes courage to make a fool of yourself. Charlie Chaplin

  10. Stay Hungry and Stay Foolish

Book Promotion- Click, Buy, Lie , Die

To build awareness of Eric Siegel’s new, acclaimed book, Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (published by Wiley Feb. 19), here is an offer ya can’t refuse.
Order the book on April 3 via Amazon ($15) for:

1. Free access to the first of 4 modules of the author’s online training program, Predictive Analytics Applied

2. A 35% discount off the full training ($495), or its in-person version, Predictive Analytics for Business, Marketing & Web ($1,495 – Apr 25-26 in NYC)

3. Automatic entrance into a drawing to receive a pass for any Predictive Analytics World this year (San Francisco, Chicago, DC, Boston, London, or Berlin).

Ajay- At $15 a pop, and quite a nice book, it’s a steal! See the book review here:

https://decisionstats.com/2013/02/25/book-review-predictive-analytics-the-power-to-predict-who-will-click-buy-lie-or-die/

 

How to learn SQL injection

In my previous post in the hacker series, https://decisionstats.com/2013/03/20/hacking-for-beginners-top-website-hacks/, we noted that SQL injection remains a top method of exploiting security vulnerabilities. Accordingly, here is a list of resources for learning SQL injection.

Definition

SQL injection is a code injection technique that exploits a security vulnerability in an application’s software. The vulnerability occurs when user input is either incorrectly filtered for string-literal escape characters embedded in SQL statements, or is not strongly typed and is thereby unexpectedly executed. SQL injection is mostly known as an attack vector for websites but can be used to attack any type of SQL database.
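The core mistake and its standard fix are easy to show in R with the DBI and RSQLite packages (a sketch; the table and payload below are invented for illustration):

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbExecute(con, "CREATE TABLE users (name TEXT, pass TEXT)")
    dbExecute(con, "INSERT INTO users VALUES ('alice', 's3cret')")

    input <- "' OR '1'='1"    # classic injection payload

    # VULNERABLE: user input pasted straight into the SQL string
    q <- paste0("SELECT * FROM users WHERE name = '", input, "'")
    dbGetQuery(con, q)        # returns every row: the filter is bypassed

    # SAFE: parameterized query; input is bound as data, never parsed as SQL
    dbGetQuery(con, "SELECT * FROM users WHERE name = ?", params = list(input))
    dbDisconnect(con)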

Basic Tools

  • SQL Inject Me

https://addons.mozilla.org/en-us/firefox/addon/sql-inject-me/

SQL Inject Me is the Exploit-Me tool used to test for SQL Injection vulnerabilities.

The tool works by submitting your HTML forms, substituting the form values with strings that are representative of an SQL injection attack – in practice, database escape strings. It then looks for database error messages that are output into the rendered HTML of the page.

The tool does not attempt to compromise the security of the given system. It looks for possible entry points for an attack against the system. There is no port scanning, packet sniffing, password hacking, or firewall attack done by the tool.

  • Hackbar

https://addons.mozilla.org/en-US/firefox/addon/hackbar/

and http://code.google.com/p/hackbar/

This toolbar will help you in testing SQL injections, XSS holes, and site security. It is NOT a tool for executing standard exploits, and it will NOT teach you how to hack a site.

  • SQLMap

http://sqlmap.org/

sqlmap is an open source penetration testing tool that automates the process of detecting and exploiting SQL injection flaws and taking over database servers.

Basic Tutorials (in order of learning)

http://sqlzoo.net/hack/

A site for testing SQL injection attacks. It is a test system and can be used for honing your SQL skills.


Intermediate Tutorials on End to End SQL Injection

Step 1: Finding Vulnerable Website:

Step 2: Checking the Vulnerability:

To check for the vulnerability, add a single quote (') at the end of the URL and hit enter.

If you get an error message, it means that the site is vulnerable (a sketch of automating this check in R appears after the step list).

Step 3: Finding Number of columns:

Step 4: Find the Vulnerable columns:

Step 5: Finding version,database,user

Step 6: Finding the Table Name

Step 8: Finding the Admin Panel:

from http://www.breakthesecurity.com/2010/12/hacking-website-using-sql-injection.html
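Step 2 can be automated from R with the httr package (a sketch; the URL below is purely hypothetical, and such probes should only ever be aimed at systems you are authorized to test):

    library(httr)
    # append a single quote to the id parameter of a hypothetical page
    resp <- GET("http://example.com/page.php?id=5'")
    body <- content(resp, as = "text")
    grepl("SQL syntax", body, fixed = TRUE)   # TRUE hints at an injectable parameter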

 

  • The next tutorial uses an automated tool called Havij, from

http://www.itsecteam.com/products/havij-v116-advanced-sql-injection/

and the tutorial is at

http://cybersucks.blogspot.in/2013/01/hacking-website-using-sql-injectionfull.html