Ten steps to analysis using R

I am just listing down a set of basic R functions that allow you to start the task of business analytics, or analyzing a dataset(data.frame). I am doing this both as a reference for myself as well as anyone who wants to learn R- quickly.

I am not putting in data import functions, because data manipulation is a seperate baby altogether. Instead I assume you have a dataset ready for analysis and what are the top R commands you would need to analyze it.

 

For anyone who thought R was too hard to learn- here is ten functions to learning R

1) str(dataset) helps you with the structure of dataset

2) names(dataset) gives you the names of variables

3)mean(dataset) returns the mean of numeric variables

4)sd(dataset) returns the standard deviation of numeric variables

5)summary(variables) gives the summary quartile distributions and median of variables

That about gives me the basic stats I need for a dataset.

> data(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0

> mean(faithful)
eruptions   waiting
 3.487783 70.897059
> sd(faithful)
eruptions   waiting
 1.141371 13.594974

6) I can do a basic frequency analysis of a particular variable using the table command and $ operator (similar to dataset.variable name in other statistical languages)

> table(faithful$waiting)

43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1
or I can do frequency analysis of the whole dataset using
> table(faithful)
         waiting
eruptions 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67
    1.6    0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    1.667  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
    1.7    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    1.733  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
.....output truncated
7) plot(dataset)
It helps plot the dataset

8) hist(dataset$variable) is better at looking at histograms

hist(faithful$waiting)

9) boxplot(dataset)

10) The tenth function for a beginner would be cor(dataset$var1,dataset$var2)

> cor(faithful)
          eruptions   waiting
eruptions 1.0000000 0.9008112
waiting   0.9008112 1.0000000

 

I am assuming that as a beginner you would use the list of GUI at http://rforanalytics.wordpress.com/graphical-user-interfaces-for-r/  to import and export Data. I would deal with ten steps to data manipulation in R another post.

 

Review of Google Plus

After resisting for two weeks I have decided to write a Google Plus review. This includes both the changed designed parameters, the invite growth features and all of the main sub-items and activities you can do in the G+  Stream, Share, Hang Out, Pictures, Circles.

Since I have 2500 people in my circles and I am in 91 circles

To keep it simple – I have noted the following 6 main sub-points.

1) Content Dissemination-

 

  • Sharing Blog Articles
  • Micro-Blogging
  • Sharing Pictures

2) Online Professional Networking  and 3) Online Personal Socializing

4) Spam Control / Malware /Phishing/Porn Protection

5) Time Cost versus Networking Benefit

————————————————————————————————————————————————————–

1) Content Dissemination-

  • Sharing Blog Articles

 

Sharing is as simple as Facebook but the design makes it simpler.

Note G+ uses lower number of colors, bigger fonts, slightly bigger icons to reduce the appearance of clutter.

Contrast this

with this-

 

Interesting to see that G+ has four types of media to share- besides writing the status/micro-blog (unfettered by 140 characters). Note these show icons only with hover text to tell you what the icon stands for.

Photo,Video,URL,Location (which seems to be Twitter like in every share)

Facebook has 5 types of Sharing and note the slightly different order as well the fact that both icon and text make it slightly more cluttered- Status (which is redundant clearly ),Photo,Link,Video,Question

G+ thus lacks polls /questions features. It is much easier to share content on Facebook automatically as of now- but for G+ you need to share the URL privately though. There exist G+ meme-s already thanks to re-sharing in G+ plus which seems to be inspired by Tumblr (?).

In addition Google has made your Google Profile the number one SERP for searching your name, so there seem clear tied in benefits of SEO with content disseminated here.

G+ has sharing in circles whereas Facebook has only Everyone, Friends, Friends of Friends ,Customize.  This makes G+ interface slightly better in tweaking the spread of content to targeted audience esp by Bloggers.

  • For sharing Photos– G+ goes in for a whole new separate tab (one out of four) whereas Facebook treats photo sharing less prominently.
  • Google has lesser white space between photos, (The Facebook way used to be just snap photo by iPhone and send by email to auto-post), and the privacy in sharing photos is much better in G+ as the dropdowns in Facebook are not as granular and neither as nifty in icon design.
  •  
  • Also I like the hover and photo grows bigger feature and the auto import from Picassa (but I would like to auto-import into G+ from Flickr just as I can do in Facebook)
  • Google Plus also has a much more detailed version for sharing videos than photos as compared to Facebook  -upload Photo options  versus
  • G+ has much more focus on auto-sharing from mobiles

 

 

 

2) Online Professional Networking  and 3) Online Personal Socializing Organizing Contacts in Google Plus and seperate privacy controls make it easier to customize sharing without getting too complex. You can make as many circles and drag and drop very easily instead of manually clicking a dropdown box. Effectively speaking Facebook has just 4 kinds of circles and it does not distinguish between various types of friends which is great from philosophical point of view but not so goodn enforcing separateness between professional and personal networks. Note Facebook privacy settings are overwhelming despite the groovy data viz

4) Spam Control / Malware /Phishing/Porn Protection 

Spam Control in Facebook versus in Google Plus- note the different options in Google Plus (including the ability to NOT reshare). I am not aware of more enhanced protection than the ones available for Gmail already. Spam is what really killed off a lot many social networks and the ability to control or reduce spam will be a critical design choice

5) Time Cost versus Networking Benefit

Linkedin has the lowest cost in time spent and networking done. If G+ adds a resume section for jobs, recruiters, and adds in Zynga games, the benefit from G+ will expand. As of now G+ is a minimal social network with minimalism as design ethos.

(Zynga would do well to partner with G+)

 

Contribution to #Rstats by Revolution

I have been watching for Revolution Analytics product almost since the inception of the company. It has managed to sail over storms, naysayers and critics with simple and effective strategy of launching good software, making good partnerships and keeping up media visibility with white papers, joint webinars, blogs, conferences and events.

However this is a listing of all technical contributions made by Revolution Analytics products to the #rstats project.

1) Useful Packages mostly in parallel processing or more efficient computing like

 

2) RevoScaler package to beat R’s memory problem (this is probably the best in my opinion as it is yet to be replicated by the open source version and is a clear cut reason for going in for the paid version)

http://www.revolutionanalytics.com/products/enterprise-big-data.php

  • Efficient XDF File Format designed to efficiently handle huge data sets.
  • Data Step Functionality to quickly clean, transform, explore, and visualize huge data sets.
  • Data selection functionality to store huge data sets out of memory, and select subsets of rows and columns for in-memory operation with all R functions.
  • Visualize Large Data sets with line plots and histograms.
  • Built-in Statistical Algorithms for direct analysis of huge data sets:
    • Summary Statistics
    • Linear Regression
    • Logistic Regression
    • Crosstabulation
  • On-the-fly data transformations to include derived variables in models without writing new data files.
  • Extend Existing Analyses by writing user- defined R functions to “chunk” through huge data sets.
  • Direct import of fixed-format text data files and SAS data sets into .xdf format

 

3) RevoDeploy R for  API based R solution – I somehow think this feature will get more important as time goes on but it seems a lower visibility offering right now.

http://www.revolutionanalytics.com/products/enterprise-deployment.php

  • Collection of Web services implemented as a RESTful API.
  • JavaScript and Java client libraries, allowing users to easily build custom Web applications on top of R.
  • .NET Client library — includes a COM interoperability to call R from VBA
  • Management Console for securely administrating servers, scripts and users through HTTP and HTTPS.
  • XML and JSON format for data exchange.
  • Built-in security model for authenticated or anonymous invocation of R Scripts.
  • Repository for storing R objects and R Script execution artifacts.

 

4) Revolutions IDE (or Productivity Environment) for a faster coding environment than command line. The GUI by Revolution Analytics is in the works. – Having used this- only the Code Snippets function is a clear differentiator from newer IDE and GUI. The code snippets is awesome though and even someone who doesnt know much R can get analysis set up quite fast and accurately.

http://www.revolutionanalytics.com/products/enterprise-productivity.php

  • Full-featured Visual Debugger for debugging R scripts, with call stack window and step-in, step-over, and step-out capability.
  • Enhanced Script Editor with hover-over help, word completion, find-across-files capability, automatic syntax checking, bookmarks, and navigation buttons.
  • Run Selection, Run to Line and Run to Cursor evaluation
  • R Code Snippets to automatically generate fill-in-the-blank sections of R code with tooltip help.
  • Object Browser showing available data and function objects (including those in packages), with context menus for plotting and editing data.
  • Solution Explorer for organizing, viewing, adding, removing, rearranging, and sourcing R scripts.
  • Customizable Workspace with dockable, floating, and tabbed tool windows.
  • Version Control Plug-in available for the open source Subversion version control software.

 

Marketing contributions from Revolution Analytics-

1) Sponsoring R sessions and user meets

2) Evangelizing R at conferences  and partnering with corporate partners including JasperSoft, Microsoft , IBM and others at http://www.revolutionanalytics.com/partners/

3) Helping with online initiatives like http://www.inside-r.org/ (which is curiously dormant and now largely superseded by R-Bloggers.com) and the syntax highlighting tool at http://www.inside-r.org/pretty-r. In addition Revolution has been proactive in reaching out to the community

4) Helping pioneer blogging about R and Twitter Hash tag discussions , and contributing to Stack Overflow discussions. Within a short while, #rstats online community has overtaken a lot more established names- partly due to decentralized nature of its working.

 

Did I miss something out? yes , they share their code by GPL.

 

Let me know by feedback

RStudio 3- Making R as simple as possible but no simpler

From the nice shiny blog at http://blog.rstudio.org/, a shiny new upgraded software (and I used the Cobalt theme)–this is nice!

awesome coding!!!

 

http://www.rstudio.org/download/

Download RStudio v0.94

Diagram desktop

If you run R on your desktop:

Download RStudio Desktop

OR

Diagram server

If you run R on a Linux server and want to enable users to remotely access RStudio using a web browser:

Download RStudio Server

 

RStudio v0.94 — Release Notes

June 15th, 2011

 

New Features and Enhancements

Source Editor and Console

  • Run code:
    • Run all lines in source file
    • Run to current line
    • Run from current line
    • Redefine current function
    • Re-run previous region
    • Code is now run line-by-line in the console
  • Brace, paren, and quote matching
  • Improved cursor placement after newlines
  • Support for regex find and replace
  • Optional syntax highlighting for console input
  • Press F1 for help on current selection
  • Function navigation / jump to function
  • Column and line number display
  • Manually set/switch document type
  • New themes: Solarized and Solarized Dark

Plots

  • Improved image export:
    • Formats: PNG, JPEG, TIFF, SVG, BMP, Metafile, and Postscript
    • Dynamic resize with preview
    • Option to maintain aspect ratio when resizing
    • Copy to clipboard as bitmap or metafile
  • Improved PDF export:
    • Specify custom sizes
    • Preview before exporting
  • Remove individual plots from history
  • Resizable plot zoom window

History

  • History tab synced to loaded .Rhistory file
  • New commands:
    • Load and save history
    • Remove individual items from history
    • Clear all history
  • New options:
    • Load history from working directory or global history file
    • Save history always or only when saving .RData
    • Remove duplicate entries in history
  • Shortcut keys for inserting into console or source

Packages

  • Check for package updates
  • Filter displayed packages
  • Install multiple packages
  • Remove packages
  • New options:
    • Install from repository or local archive file
    • Target library
    • Install dependencies

Miscellaneous

  • Find text within help topic
  • Sort file listing by name, type, size, or modified
  • Set working directory based on source file, files pane, or browsed for directory.
  • Console titlebar button to view current working directory in files pane
  • Source file menu command
  • Replace space and dash with dot (.) in import dataset generated variable names
  • Add decimal separator preference for import dataset
  • Added .tar.gz (Linux) and .zip (Windows) distributions for non-admin installs
  • Read /etc/paths.d on OS X to ensure RStudio has the same path as terminal sessions do
  • Added manifest to rsession.exe to prevent unwanted program files and registry virtualization

Server

  • Break PAM auth into its own binary for improved compatibility with 3rd party PAM authorization modules.
  • Ensure that AppArmor profile is enforced even after reboot
  • Ability to add custom LD library path for all sessions
  • Improved R discovery:
    • Use which R then fallback to scanning for R script
    • Run R discovery unconfined then switch into restricted profile
  • Default to uncompressed save.image output if the administrator or user hasn’t specified their own options (improved suspend/resume performance)
  • Ensure all running sessions are automatically updated during server version upgrade
  • Added verify-installation command to rstudio-server utility for easily capturing configuration and startup related errors

 

Bug Fixes

Source Editor

  • Undo to unedited state clears now dirty bit
  • Extract function now captures free variables used on lhs
  • Selected variable highlight now visible in all themes
  • Syncing to source file updates made outside of RStudio now happens immediately at startup and does not cause a scroll to the bottom of the document.
  • Fixed various issues related to copying and pasting into word processors
  • Fixed incorrect syntax highlighting issues in .Rd files
  • Make sure font size for printed source files matches current editor setting
  • Eliminate conflict with Ctrl+F shortcut key on OS X
  • Zoomed Google Chrome browser no longer causes cursor position to be off
  • Don’t prevent opening of unknown file types in the editor

Console

  • Fixed sporadic missing underscores (and other bottom clipping of text) in console
  • Make sure console history is never displayed offscreen
  • Page Up and Page Down now work properly in the console
  • Substantially improved console performance for both rapid output and large quantities of output

Miscellaneous

  • Install successfully on Windows with special characters in home directory name
  • make install more tolerant of configurations where it can’t write into /usr/share
  • Eliminate spurious stderr output in forked children of multicore package
  • Ensure that file modified times always update in the files pane after a save
  • Always default to installing packages into first writeable path of .libPaths()
  • Ensure that LaTeX log files are always preserved after compilePdf
  • Fix conflicts with zap function from epicalc package
  • Eliminate shortcut key conflicts with Ubuntu desktop workspace switching shortcuts
  • Always prompt when attempting to save files of the same name
  • Maximized main window now properly restored when reopening RStudio
  • PAM authorization works correctly even if account has password expiration warning
  • Correct display of manipulate panel when Plots pane is on the left

 

Previous Release Notes

 

Heritage offers 3 million chump change for Monkeys

My perspective is life is not fair, and if someone offers me 1 mill a year so they make 1 bill a year, I would still take it, especially if it leads to better human beings and better humanity on this planet. Health care isnt toothpaste.

Unless there are even more fine print changes involved- there exist several players in the pharma sector who do build and deploy models internally for denying claims or prospecting medical doctors with freebies, but they might just get caught with the new open data movement

————————————————————————————————–

A note from KDNuggets-

Heritage Health Prizereleased a second set of data on May 4. They also recently modified their ruleswhich now demand complete exclusivity and seem to disallow use of other tools (emphasis mine – Gregory PS)

21. LICENSE
By registering for the Competition, each Entrant (a) grants to Sponsor and its designees a worldwide, exclusive (except with respect to Entrant) , sub-licensable (through multiple tiers), transferable, fully paid-up, royalty-free, perpetual, irrevocable right to use, not use, reproduce, distribute (through multiple tiers), create derivative works of, publicly perform, publicly display, digitally perform, make, have made, sell, offer for sale and import the entry and the algorithm used to produce the entry, as well as any other algorithm, data or other information whatsoever developed or produced at any time using the data provided to Entrant in this Competition (collectively, the “Licensed Materials”), in any media now known or hereafter developed, for any purpose whatsoever, commercial or otherwise, without further approval by or payment to Entrant (the “License”) and
(b) represents that he/she/it has the unrestricted right to grant the License. 
Entrant understands and agrees that the License is exclusive except with respect to Entrant: Entrant may use the Licensed Materials solely for his/her/its own patient management and other internal business purposes but may not grant or otherwise transfer to any third party any rights to or interests in the Licensed Materials whatsoever.

This has lead to a call to boycott the competition by Tristan, who also notes that academics cannot publish their results without prior written approval of the Sponsor.

Anthony Goldbloom, CEO of Kaggle, emailed the HHP participants on May 4

HPN have asked me to pass on the following message: “The Heritage Provider Network is sponsoring the Heritage Health Prize to spur innovation and creative thinking in healthcare. HPN, however, is a medical group and must retain an exclusive license to the algorithms created using its data so as to ensure that the algorithms are used responsibly, and are only used to provide better health care to patients and not for improper purposes.
Put simply, while the competition hopes to spur innovation, this is not a competition regarding movie ratings or chess results. We hope that the clarifications we have made to the Rules and the FAQ adequately address your concerns and look forward to your participation in the competition.”

What do you think? Will the exclusive license prevent you from participating?

Why does Matt (of WordPress) hate Matt (of Google)

Biz Stone, co-founder of Twitter
Image via Wikipedia

I want to show some bad ads of Google Ad sense. I pay through my nose for video upgrades and extra space to keep people happy.

120,000 views in 2010

Money earned By Matt (of WordPress)= $$$$$ from me

Money earned by Mutt -(thats me)= 000,000,000

Please allow me to run ads on wordpress.com

or create your own fucking ad networks

but do it PHAST.

ESLE blog trsnfer using Blog Export, divide Xml file into 13 files  using Notepad copy and paste

go to Appspot

Convert files to Blogger files\

Thats the company BIZ stone OF tWITTER  made

before these Two matts got into dog fights.

https://wordpress2blogger.appspot.com/

Ever wanted to move your WordPress blogs over to Blogger? This site can aid in the process!

Instructions

  1. Login to your WordPress account and navigate to the Dashboard for the blog that you’d like to transfer to Blogger.
  2. Click on the Manage tab below the Blog name.
  3. Click on the Export link below the Manage tab.
  4. Download the WordPress WXR export file by clicking on Download Export File.
  5. Save this file to your local machine.
  6. Browse to that saved document with the form below and click Convert.
     
  7. Save this file to your local machine. This file will be the contents of your posts/comments from WordPress in a Blogger export file.
  8. Login to your Blogger or create a new user.
  9. Once logged in, click on the Create a Blog link from the user dashboard, and then click on the Import Blog Tool
  10. Follow the instructions and upload your Blogger export file when prompted.
  11. After completing the import wizard, you should have a set of imported posts from WordPress that you can now publish to Blogger. Have fun!

NOTE: This hosted application will only allow downloads smaller than 1MB.

For information on how to run this conversion on your own, visit the open source project hosted at code.google.com

Powered by Google App Engine