Ten steps to analysis using R

I am listing a set of basic R functions that let you start on business analytics, that is, analyzing a dataset (a data.frame). I am doing this as a reference both for myself and for anyone who wants to learn R quickly.

I am not putting in data import functions, because data manipulation is a separate baby altogether. Instead I assume you have a dataset ready for analysis, and ask: what are the top R commands you would need to analyze it?

For anyone who thought R was too hard to learn, here are ten functions for learning R.

1) str(dataset) shows you the structure of the dataset

2) names(dataset) gives you the names of variables

3) sapply(dataset, mean) returns the mean of each numeric variable (current versions of R no longer accept a whole data.frame in mean(dataset))

4) sapply(dataset, sd) returns the standard deviation of each numeric variable (likewise, sd(dataset) no longer works on a data.frame)

5) summary(dataset) gives the minimum, quartiles, median, mean and maximum of each variable

That about gives me the basic stats I need for a dataset.

```
> data(faithful)
> names(faithful)
[1] "eruptions" "waiting"
```
```
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
```
```
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0

> sapply(faithful, mean)
eruptions   waiting
 3.487783 70.897059
> sapply(faithful, sd)
eruptions   waiting
 1.141371 13.594974
```
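Real datasets usually mix numeric and non-numeric columns, and mean() or sd() will not make sense for the latter. Here is a minimal sketch of one way to handle that, assuming the built-in iris dataset (not used elsewhere in this post) as the example; the variable names are my own.

```r
# Minimal sketch: column-wise statistics restricted to numeric columns.
data(iris)                                   # built-in dataset with one factor column (Species)
numeric_cols <- sapply(iris, is.numeric)     # TRUE/FALSE for each column
col_means <- sapply(iris[numeric_cols], mean)
col_sds   <- sapply(iris[numeric_cols], sd)

# One row per statistic, one column per numeric variable
round(rbind(mean = col_means, sd = col_sds), 3)
```

Selecting the numeric columns first avoids the warnings or errors you would get from applying mean() to a factor.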

6) I can do a basic frequency analysis of a particular variable using the table command and the $ operator (similar to dataset.variable name in other statistical languages)

```
> table(faithful$waiting)

43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1
```
or I can do frequency analysis of the whole dataset using

```
> table(faithful)
          waiting
eruptions 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67
    1.6    0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    1.667  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
    1.7    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    1.733  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
.....output truncated
```
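A frequency table of a continuous variable gets sparse quickly, as the output above shows. One common workaround is to bin the values first with cut(). This is only a sketch, and the 10-minute bin width is an arbitrary choice of mine.

```r
# Minimal sketch: bin a continuous variable before tabulating it.
data(faithful)
breaks <- seq(40, 100, by = 10)              # assumed bin edges; covers the range 43 to 96
waiting_bins <- cut(faithful$waiting, breaks = breaks)
bin_counts <- table(waiting_bins)
bin_counts                                   # counts per interval such as (40,50]
```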
7) plot(dataset)

It plots the dataset (for a data.frame with two variables, like faithful, you get a scatterplot of one against the other).
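As a rough illustration with the faithful data, the sketch below draws the scatterplot using the formula interface and overlays a least-squares line. The axis labels and the fitted line are my additions, not something plot(dataset) does by itself.

```r
# Minimal sketch: scatterplot of the two faithful variables with a fitted line.
data(faithful)
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption duration (min)",
     ylab = "Waiting time to next eruption (min)",
     main = "Old Faithful")
fit <- lm(waiting ~ eruptions, data = faithful)  # simple linear regression
abline(fit, col = "red")                         # add the regression line to the plot
```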

8) hist(dataset$variable) draws a histogram, which is better for looking at the distribution of a single variable

hist(faithful$waiting)
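If you want to see the numbers behind the histogram, hist() can return them without drawing anything. A small sketch follows; the value breaks = 20 is only a suggestion that R treats as a hint, not an exact bin count.

```r
# Minimal sketch: inspect histogram bins, then draw the plot.
data(faithful)
h <- hist(faithful$waiting, breaks = 20, plot = FALSE)  # compute bins only
h$breaks                                                # bin edges R actually chose
h$counts                                                # observations per bin

hist(faithful$waiting, breaks = 20,
     main = "Waiting time between eruptions", xlab = "Minutes")
```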

9) boxplot(dataset)
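For example, with faithful this draws one box per variable. The sketch below also pulls out the five numbers behind one of the boxes using boxplot.stats() from base R.

```r
# Minimal sketch: boxplot of every column, plus the numbers behind one box.
data(faithful)
boxplot(faithful, main = "Old Faithful")     # one box per column

stats <- boxplot.stats(faithful$waiting)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond the whiskers, if any
```

Since the two variables are on very different scales, plotting them one at a time, for example boxplot(faithful$eruptions), may be easier to read.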

10) The tenth function for a beginner would be cor(dataset$var1, dataset$var2), or cor(dataset) for the correlation matrix of all variables

```
> cor(faithful)
          eruptions   waiting
eruptions 1.0000000 0.9008112
waiting   0.9008112 1.0000000
```
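A short sketch of the two-variable form, together with two variations a beginner may want next: a rank-based (Spearman) correlation and a significance test with cor.test(). Both are in base R, and the variable names are mine.

```r
# Minimal sketch: pairwise correlation, a rank-based variant, and a test.
data(faithful)
pearson  <- cor(faithful$eruptions, faithful$waiting)                      # default method
spearman <- cor(faithful$eruptions, faithful$waiting, method = "spearman") # rank-based
ct <- cor.test(faithful$eruptions, faithful$waiting)                       # Pearson test

pearson
spearman
ct$conf.int   # 95% confidence interval for the Pearson correlation
```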

I am assuming that as a beginner you would use one of the GUIs listed at http://rforanalytics.wordpress.com/graphical-user-interfaces-for-r/ to import and export data. I will deal with ten steps to data manipulation in R in another post.

After resisting for two weeks I have decided to write a Google Plus review. It covers the changed design parameters, the invite-driven growth features, and the main things you can do in G+: Stream, Share, Hang Out, Pictures, Circles.

I have 2500 people in my circles and I am in 91 circles, so to keep it simple I have noted the following main points.

1) Content Dissemination-

• Sharing Blog Articles
• Micro-Blogging
• Sharing Pictures

2) Online Professional Networking  and 3) Online Personal Socializing

4) Spam Control / Malware /Phishing/Porn Protection

5) Time Cost versus Networking Benefit

————————————————————————————————————————————————————–

1) Content Dissemination-

• Sharing Blog Articles

Sharing is as simple as on Facebook, but the design makes it simpler.

Note that G+ uses fewer colors, bigger fonts, and slightly bigger icons to reduce the appearance of clutter.

Contrast the two share boxes (screenshots not included).

It is interesting that G+ has four types of media to share, besides writing the status/micro-blog (unfettered by 140 characters). Note that these are shown as icons only, with hover text to tell you what each icon stands for.

Photo, Video, URL, Location (which seems to be Twitter-like, attached to every share)

Facebook has 5 types of sharing. Note the slightly different order, as well as the fact that showing both icon and text makes it slightly more cluttered: Status (which is clearly redundant), Photo, Link, Video, Question

G+ thus lacks a polls/questions feature. As of now it is much easier to share content on Facebook automatically, whereas for G+ you need to share the URL privately. There are G+ memes already, thanks to re-sharing in G+, which seems to be inspired by Tumblr (?).

G+ has sharing by circles, whereas Facebook has only Everyone, Friends, Friends of Friends, and Customize. This makes the G+ interface slightly better for tweaking the spread of content to a targeted audience, especially for bloggers.

• For sharing photos, G+ goes in for a whole new separate tab (one out of four), whereas Facebook treats photo sharing less prominently.
• Google has less white space between photos (the Facebook way used to be to just snap a photo on the iPhone and send it by email to auto-post), and the privacy in sharing photos is much better in G+, as the dropdowns in Facebook are neither as granular nor as nifty in icon design.
• Also I like the feature where the photo grows bigger on hover, and the auto import from Picasa (but I would like to auto-import into G+ from Flickr just as I can do in Facebook)
• Google Plus also has a much more detailed flow for sharing videos than for photos, compared with Facebook's upload photo options
• G+ has much more focus on auto-sharing from mobiles

2) Online Professional Networking  and 3) Online Personal Socializing

Organizing contacts in Google Plus, together with separate privacy controls, makes it easier to customize sharing without getting too complex. You can make as many circles as you like and drag and drop contacts very easily, instead of manually clicking a dropdown box. Effectively, Facebook has just 4 kinds of circles and does not distinguish between various types of friends, which is great from a philosophical point of view but not so good at enforcing separateness between professional and personal networks. Note that Facebook's privacy settings are overwhelming despite the groovy data viz.

4) Spam Control / Malware /Phishing/Porn Protection

Comparing spam control in Facebook with that in Google Plus, note the different options in Google Plus (including the ability to NOT reshare). I am not aware of any protection beyond what is already available for Gmail. Spam is what really killed off a lot of social networks, and the ability to control or reduce spam will be a critical design choice.

5) Time Cost versus Networking Benefit

LinkedIn has the lowest cost in time spent for the networking done. If G+ adds a resume section for jobs and recruiters, and adds in Zynga games, the benefit from G+ will expand. As of now G+ is a minimal social network with minimalism as its design ethos.

(Zynga would do well to partner with G+)

Contribution to #Rstats by Revolution

I have been watching Revolution Analytics' products almost since the inception of the company. It has managed to sail over storms, naysayers and critics with a simple and effective strategy of launching good software, making good partnerships and keeping up media visibility with white papers, joint webinars, blogs, conferences and events.

However, this is a listing of the technical contributions made by Revolution Analytics products to the #rstats project.

1) Useful Packages mostly in parallel processing or more efficient computing like

2) The RevoScaleR package to beat R’s memory problem (this is probably the best in my opinion, as it is yet to be replicated by the open source version and is a clear-cut reason for going in for the paid version)

http://www.revolutionanalytics.com/products/enterprise-big-data.php

• Efficient XDF File Format designed to efficiently handle huge data sets.
• Data Step Functionality to quickly clean, transform, explore, and visualize huge data sets.
• Data selection functionality to store huge data sets out of memory, and select subsets of rows and columns for in-memory operation with all R functions.
• Visualize Large Data sets with line plots and histograms.
• Built-in Statistical Algorithms for direct analysis of huge data sets:
  • Summary Statistics
  • Linear Regression
  • Logistic Regression
  • Crosstabulation
• On-the-fly data transformations to include derived variables in models without writing new data files.
• Extend Existing Analyses by writing user-defined R functions to “chunk” through huge data sets.
• Direct import of fixed-format text data files and SAS data sets into .xdf format
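To give a feel for what "chunking" through a data set means, here is a rough sketch in plain base R. It is not the RevoScaleR API and does not use the XDF format. The function chunked_mean() and its arguments are hypothetical names I made up for illustration, and it assumes a simple comma-separated file with a header row.

```r
# Base-R sketch of chunked processing; NOT the RevoScaleR API.
# chunked_mean() is a hypothetical helper: it reads a CSV a few rows at a
# time and keeps only running totals in memory.
chunked_mean <- function(path, column, chunk_size = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  header <- gsub('"', "", header)            # write.csv quotes column names
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)  # next block of rows as text
    if (length(lines) == 0) break            # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Usage: write faithful to a temporary file and process it 50 rows at a time
tmp <- tempfile(fileext = ".csv")
write.csv(faithful, tmp, row.names = FALSE)
chunk_result <- chunked_mean(tmp, "waiting", chunk_size = 50)
chunk_result
```

The point of the design is that only one chunk is in memory at any time, so the same pattern works for files far larger than RAM, at the cost of being limited to statistics that can be accumulated incrementally.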

3) RevoDeployR for an API-based R solution. I somehow think this feature will get more important as time goes on, but it seems a lower-visibility offering right now.

http://www.revolutionanalytics.com/products/enterprise-deployment.php

• Collection of Web services implemented as a RESTful API.
• JavaScript and Java client libraries, allowing users to easily build custom Web applications on top of R.
• .NET Client library — includes a COM interoperability to call R from VBA
• Management Console for securely administrating servers, scripts and users through HTTP and HTTPS.
• XML and JSON format for data exchange.
• Built-in security model for authenticated or anonymous invocation of R Scripts.
• Repository for storing R objects and R Script execution artifacts.

4) Revolution's IDE (or Productivity Environment) for a faster coding environment than the command line. The GUI by Revolution Analytics is in the works. Having used the IDE, I find that only the Code Snippets function is a clear differentiator from newer IDEs and GUIs. The Code Snippets feature is awesome though, and even someone who doesn't know much R can get an analysis set up quite fast and accurately.

http://www.revolutionanalytics.com/products/enterprise-productivity.php

• Full-featured Visual Debugger for debugging R scripts, with call stack window and step-in, step-over, and step-out capability.
• Enhanced Script Editor with hover-over help, word completion, find-across-files capability, automatic syntax checking, bookmarks, and navigation buttons.
• Run Selection, Run to Line and Run to Cursor evaluation
• R Code Snippets to automatically generate fill-in-the-blank sections of R code with tooltip help.
• Object Browser showing available data and function objects (including those in packages), with context menus for plotting and editing data.
• Solution Explorer for organizing, viewing, adding, removing, rearranging, and sourcing R scripts.
• Customizable Workspace with dockable, floating, and tabbed tool windows.
• Version Control Plug-in available for the open source Subversion version control software.

Marketing contributions from Revolution Analytics-

1) Sponsoring R sessions and user meets

2) Evangelizing R at conferences  and partnering with corporate partners including JasperSoft, Microsoft , IBM and others at http://www.revolutionanalytics.com/partners/

3) Helping with online initiatives like http://www.inside-r.org/ (which is curiously dormant and now largely superseded by R-Bloggers.com) and the syntax highlighting tool at http://www.inside-r.org/pretty-r. In addition, Revolution has been proactive in reaching out to the community.

4) Helping pioneer blogging about R and Twitter hashtag discussions, and contributing to Stack Overflow discussions. Within a short while, the #rstats online community has overtaken a lot of more established names, partly due to the decentralized nature of its working.

Did I miss something out? Yes, they share their code under the GPL.

Let me know with your feedback.

Heritage offers 3 million chump change for Monkeys

My perspective is that life is not fair, and if someone offers me 1 mill a year so they make 1 bill a year, I would still take it, especially if it leads to better human beings and better humanity on this planet. Health care isn't toothpaste.

Unless there are even more fine print changes involved, there exist several players in the pharma sector who do build and deploy models internally for denying claims or prospecting medical doctors with freebies, but they might just get caught by the new open data movement.

————————————————————————————————–

A note from KDNuggets-

Heritage Health Prize released a second set of data on May 4. They also recently modified their rules, which now demand complete exclusivity and seem to disallow use of other tools (emphasis mine – Gregory PS)

By registering for the Competition, each Entrant (a) grants to Sponsor and its designees a worldwide, exclusive (except with respect to Entrant) , sub-licensable (through multiple tiers), transferable, fully paid-up, royalty-free, perpetual, irrevocable right to use, not use, reproduce, distribute (through multiple tiers), create derivative works of, publicly perform, publicly display, digitally perform, make, have made, sell, offer for sale and import the entry and the algorithm used to produce the entry, as well as any other algorithm, data or other information whatsoever developed or produced at any time using the data provided to Entrant in this Competition (collectively, the “Licensed Materials”), in any media now known or hereafter developed, for any purpose whatsoever, commercial or otherwise, without further approval by or payment to Entrant (the “License”) and
(b) represents that he/she/it has the unrestricted right to grant the License.
Entrant understands and agrees that the License is exclusive except with respect to Entrant: Entrant may use the Licensed Materials solely for his/her/its own patient management and other internal business purposes but may not grant or otherwise transfer to any third party any rights to or interests in the Licensed Materials whatsoever.

This has led to a call to boycott the competition by Tristan, who also notes that academics cannot publish their results without prior written approval of the Sponsor.

Anthony Goldbloom, CEO of Kaggle, emailed the HHP participants on May 4:

HPN have asked me to pass on the following message: “The Heritage Provider Network is sponsoring the Heritage Health Prize to spur innovation and creative thinking in healthcare. HPN, however, is a medical group and must retain an exclusive license to the algorithms created using its data so as to ensure that the algorithms are used responsibly, and are only used to provide better health care to patients and not for improper purposes.
Put simply, while the competition hopes to spur innovation, this is not a competition regarding movie ratings or chess results. We hope that the clarifications we have made to the Rules and the FAQ adequately address your concerns and look forward to your participation in the competition.”

What do you think? Will the exclusive license prevent you from participating?

Why does Matt (of WordPress) hate Matt (of Google)

120,000 views in 2010

Money earned by Matt (of WordPress) = $$$$$ from me

Money earned by Mutt (that's me) = 000,000,000

but do it PHAST.

Else, transfer the blog using Blog Export, and divide the XML file into 13 files using Notepad copy and paste

go to Appspot

Convert the files to Blogger files

before these two Matts got into dog fights.

https://wordpress2blogger.appspot.com/

Ever wanted to move your WordPress blogs over to Blogger? This site can aid in the process!

Instructions

1) Log in to your WordPress account and navigate to the Dashboard for the blog that you’d like to transfer to Blogger.
2) Click on the Manage tab below the Blog name.
3) Click on the Export link below the Manage tab.
4) Download the WordPress WXR export file by clicking on Download Export File. Save this file to your local machine.
5) Browse to that saved document with the form below and click Convert. Save this file to your local machine. This file will be the contents of your posts/comments from WordPress in a Blogger export file.
6) Log in to your Blogger account or create a new user.
7) Once logged in, click on the Create a Blog link from the user dashboard, and then click on the Import Blog Tool.
8) Follow the instructions and upload your Blogger export file when prompted.
9) After completing the import wizard, you should have a set of imported posts from WordPress that you can now publish to Blogger. Have fun!