Tag Archives: rstats
Polyglots for Data Science #python #sas #r #stats #spss #matlab #julia #octave
In the future, I think analysts will need to be polyglots: you will need to know more than one language for crunching data.
SAS, Python, R, Julia, SPSS, Matlab: pick any two ;) or any three.
No, you can’t count C or Java as a statistical language :) :)
Efforts to promote polyglots in statistical software include:
1) R for SAS and SPSS Users (free or book)
 JMP and R reference http://www.jmp.com/support/help/Working_with_R.shtml
2) R for Stata Users (book)
4) Using Python and R together
 Accessing R from Python (Rpy2) http://www.bytemining.com/wp-content/uploads/2010/10/rpy2.pdf
 Big Data with R and Python (though these have been made separately)
 Python for Data Analysis is a book by Wes McKinney.
We probably need a "Python and R for Data Analysis" book, just like the books we already have for SAS and R.
 The RPy2 documentation is handy http://rpy.sourceforge.net/rpy2/doc-2.1/html/introduction.html
 A nice tutorial, which was also the inspiration for writing this post, is here http://files.meetup.com/1225993/Laurent%20Gautier_R_toPython_bridge_to_R.pdf#!
5) Matlab and R
Reference (http://mathesaurus.sourceforge.net/matlab-python-xref.pdf ) includes Python
5) Octave and R
package http://cran.r-project.org/web/packages/RcppOctave/vignettes/RcppOctave.pdf includes Matlab
reference http://cran.r-project.org/doc/contrib/R-and-octave.txt
6) Julia and Python
 Julia and IPython https://github.com/JuliaLang/IJulia.jl
 PyPlot uses the Julia PyCall package to call Python’s matplotlib directly from Julia
7) SPSS and Python is here
8) SPSS and R is as below
 The Essentials for R for Statistics versions 22, 21, 20, and 19 are available here.
 This link will take you to the SourceForge site where the Version 18 Essentials and Plugins are hosted.
9) Using R from Clojure – Incanter
Use embedded R from Clojure and Incanter http://github.com/jolby/rincanter
Using R for random number creation from time stamps #rstats
Suppose, let us just suppose, you want to create random numbers that are reproducible and derived from time stamps.
Here is the code in R
> a=as.numeric(Sys.time())
> set.seed(a)
> rnorm(log(a))
Note that you can apply a custom function to the system time (I used the log) to decide how many random numbers to generate. This creates a vector of pseudo-random numbers (since nothing machine driven is purely random in the strict philosophical sense of the word).
a=as.numeric(Sys.time())
set.seed(a)
abs(100000000*rnorm(abs(log(a))))
[1] 39621645 99451316 109889294 110275233 278994547 6554596 38654159 68748122 8920823 13293010
[11] 57664241 24533980 174529340 105304151 168006526 39173857 12810354 145341412 241341095 86568818
[21] 105672257
Possible applications: things that need both random numbers (like encryption keys) and time stamps (like events, web or industrial logs, or pseudo-random pass codes in Google two-factor authentication).
Note I used the rnorm function, but you could also choose the distribution function itself at random (rnorm or rcauchy).
Again, I would trust my own randomness rather than one generated by an arm of the US Govt (see http://www.nist.gov/itl/csd/ct/nist_beacon.cfm ).
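The "reproducible" part of the idea above can be checked with a small sketch. It assumes you keep the time stamp (for example in a log) so the same seed can be fed back in later; draw_from_stamp is a hypothetical helper name of mine, not something from base R.

```r
# Keep the time stamp as a whole number of seconds so it can be logged
# and reused exactly as the seed.
stamp <- as.integer(Sys.time())

# Hypothetical helper: seed from the stamp, then draw floor(log(stamp)) numbers,
# scaled the same way as in the post.
draw_from_stamp <- function(stamp) {
  set.seed(stamp)
  n <- floor(log(stamp))
  abs(1e8 * rnorm(n))
}

first_run  <- draw_from_stamp(stamp)
second_run <- draw_from_stamp(stamp)

# Same stamp, same numbers
identical(first_run, second_run)
```

Anyone who knows the time stamp can regenerate the same numbers, which is exactly what makes the draws reproducible.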
Update Random numbers in R
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Random.html
Details
The currently available RNG kinds are given below. kind is partially matched to this list. The default is "Mersenne-Twister".
"Wichmann-Hill": The seed, .Random.seed[-1] == r[1:3] is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).
"Marsaglia-Multicarry": A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list 'sci.stat.math'. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).
"Super-Duper": Marsaglia's famous Super-Duper from the 70's. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).
We use the implementation by Reeds et al. (1982–84).
The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S's .Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.
"Mersenne-Twister": From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The 'seed' is a 624-dimensional set of 32-bit integers plus a current position in that set.
"Knuth-TAOCP-2002": A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is
X[j] = (X[j-100] - X[j-37]) mod 2^30
and the 'seed' is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.
"Knuth-TAOCP": An earlier version from Knuth (1997).
The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported 'weakness', and already scrambled the seeds.
Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.
"L'Ecuyer-CMRG": A 'combined multiple-recursive generator' from L'Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.
The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.
This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.
"user-supplied": Use a user-supplied generator.
Function RNGkind allows user-coded uniform and normal random number generators to be supplied.
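To make the quoted documentation a little more concrete, here is a minimal sketch of switching generators with RNGkind and looking at how long the stored seed is for two of the kinds described above. The seed value 2014 is arbitrary, and the variable names are mine.

```r
old_kind <- RNGkind()                    # remember the current settings

RNGkind("Wichmann-Hill")
set.seed(2014)
wh_seed_length <- length(.Random.seed)   # 4: one code for the kind plus the 3 seeds
wh_draws <- runif(3)

RNGkind("Mersenne-Twister")
set.seed(2014)
mt_seed_length <- length(.Random.seed)   # 626: kind code, position, 624 integers
mt_draws <- runif(3)

# Put things back the way they were
do.call(RNGkind, as.list(old_kind))
```

The same set.seed value gives different streams under different kinds, so if you need reproducible results it is worth recording the RNG kind along with the seed.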
Iris for Big Data #rstats #bigdata
Quote of the Day
it is impossible to be a data scientist without knowing iris
#Anonymous #Quotes
Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.
http://www.revolutionanalytics.com/subscriptions/datasets/
http://packages.revolutionanalytics.com/datasets/
Site was updated so here are the new links
While the datasets collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and "tidy" CSV versions. A little bit of RODBC usage should help, and so will some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than do it alone.
Overall, there could be an R package for this (like a Big Data version of the famous datasets package in R).
But it is a nice and very useful effort.
Revolution R Datasets
 ../
 AirOnTime87to12/                                    09-Nov-2013 00:46
 AirOnTimeCSV2012/                                   09-Nov-2013 00:30
 AirOnTime2012.xdf                                   08-Nov-2013 18:08   190110335
 AirOnTime7Pct.xdf                                   08-Nov-2013 17:42   103317987
 AirlineData87to08.tar.gz                            03-May-2013 21:05   5521408
 AirlineData87to08.zip                               09-May-2013 14:59   1802240
 AirlineData87to08_11811.tar.gz                      08-Nov-2013 03:27   1428527359
 AirlineData87to08_83010.zip                         08-Nov-2013 06:37   1477052425
 AirlineDataSubsample.xdf                            08-Nov-2013 07:27   390789536
 Census5PCT2000.tar.gz                               08-Nov-2013 10:55   871208970
 Census5PCT2000.zip                                  08-Nov-2013 12:52   925929427
 CensusUS5Pct2000.xdf                                08-Nov-2013 21:27   1204906764
 ccFraud.csv                                         23-Apr-2013 20:57   291737157
 ccFraudScore.csv                                    23-Apr-2013 21:10   273848249
 ccFraudScore10_CreateLoadTableQuotedColumns.fas..>  23-Apr-2013 21:10   981
 ccFraud_CreateLoadTable_QuotedColumns.fastload      23-Apr-2013 21:10   984
 index.php.txt                                       09-May-2013 22:17   3983
 mortDefault.tar.gz                                  08-Nov-2013 12:59   61585580
 mortDefault.zip                                     08-Nov-2013 13:08   63968310
More code
http://blog.revolutionanalytics.com/2013/08/bigdatasetsforr.html
Also a recent project made by a student of mine on Revolution Datasets and using their blog posts.
Using ifelse in R for creating new variables #rstats #data #manipulation
The ifelse function is simple and powerful and can help with data manipulation within R. Here I create a categorical variable from specific values in a numeric variable.
> data(iris)
> iris$Type=ifelse(iris$Sepal.Length<5.8,"Small Flower","Big Flower")
> table(iris$Type)
Big Flower Small Flower
77 73
The parameters of ifelse are quite simple
Usage
ifelse(test, yes, no)
Arguments
test
an object which can be coerced to logical mode.
yes
return values for true elements of test.
no
return values for false elements of test.
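Because ifelse works element by element, calls can be nested to get more than two categories. The sketch below is only an illustration: the cut-offs 5.1 and 6.4 and the column name Size are arbitrary choices of mine, not from any reference.

```r
data(iris)

# Nested ifelse: the inner call is only used where the outer test is FALSE.
# The cut-offs are arbitrary and purely illustrative.
iris$Size <- ifelse(iris$Sepal.Length < 5.1, "Small",
             ifelse(iris$Sepal.Length < 6.4, "Medium", "Large"))
table(iris$Size)

# A missing value in test stays missing in the result
na_demo <- ifelse(c(1, NA, 10) > 5, "high", "low")
na_demo   # "low" NA "high"
```

For many categories the cut function is usually tidier than deeply nested ifelse calls.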
Basics of Data Handling for R beginners #rstats
 Assigning Objects
We can create new data objects and variables quite easily within R. We use the = or the <- operator to assign an object to its name (R also accepts the rightward form ->). For the purpose of this article we will use = to assign objects to object names. This is very useful when we are doing data manipulation, as we can reuse the manipulated data as inputs for other steps in our analysis.
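A quick sketch of assignment and reuse follows. The object names are made up for illustration; the three assignment forms shown are all standard R.

```r
x = c(10, 20, 30)    # assign with =
y <- x * 2           # assign with <-, reusing x as an input
y -> z               # rightward assignment: same value, new name
mean_z = mean(z)     # 40, computed from the manipulated data
```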
Types of Data Objects in R
 Lists
A list is simply a collection of data. We create one using the c function. (Strictly speaking, c creates what R calls a vector, which holds values of a single type; this article uses "list" loosely for it.)
The following code creates a list named numlist from 6 numeric values
numlist=c(1,2,3,4,5,78)
The following code creates a list named charlist from 5 character values
charlist=c("John","Peter","Simon","Paul","Francis")
The following code creates a list named mixlist from both numeric and character data. Because all values must share one type, the numbers are stored as character strings.
mixlist=c(1,2,3,4,"R language","Ajay")
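The sketch below checks what c actually returns and contrasts it with the list function, which keeps each element's own type. The name truelist is mine.

```r
numlist <- c(1, 2, 3, 4, 5, 78)
mixlist <- c(1, 2, 3, 4, "R language", "Ajay")

class(numlist)    # "numeric"
class(mixlist)    # "character": the numbers were coerced to strings
mixlist[1]        # "1", a string rather than a number

# list() keeps each element's own type
truelist <- list(1, 2, "R language")
class(truelist[[1]])   # "numeric"
class(truelist[[3]])   # "character"
```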
 Matrices
A matrix is a two dimensional collection of data in rows and columns, unlike a list, which is basically one dimensional. We can create a matrix using the matrix command while specifying the number of rows with the nrow parameter and the number of columns with the ncol parameter.
In the following code, we create a matrix named ajay. The data is input in 3 rows as specified, but it is filled into the first column, then the second column, and so on.
ajay=matrix(c(1,2,3,4,5,6,12,18,24),nrow=3)
ajay
[,1] [,2] [,3]
[1,] 1 4 12
[2,] 2 5 18
[3,] 3 6 24
However, please note the effect of using the byrow=T (TRUE) option. In the following code we create a matrix named ajay and the data is input in 3 rows as specified, but it is filled into the first row, then the second row, and so on.
>ajay=matrix(c(1,2,3,4,5,6,12,18,24),nrow=3,byrow=T)
>ajay
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 12 18 24
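To see the difference between the two fill orders side by side, here is a small sketch using the same data. The names by_col and by_row are mine.

```r
vals <- c(1, 2, 3, 4, 5, 6, 12, 18, 24)

by_col <- matrix(vals, nrow = 3)                 # default: fill column by column
by_row <- matrix(vals, nrow = 3, byrow = TRUE)   # fill row by row

dim(by_col)      # 3 3
by_col[2, 3]     # 18: row 2, column 3
by_row[2, 3]     # 6

# For this square example, filling by row is the transpose of filling by column
identical(by_row, t(by_col))   # TRUE
```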
 Data Frames
A data frame is a list of variables of the same number of rows with unique row names. The column names are the names of the variables.
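A minimal sketch of a data frame follows. The column names and the ages are made up for illustration; data.frame, nrow, names and rownames are all base R.

```r
# Each column is a variable; all columns have the same number of rows.
people <- data.frame(
  name = c("John", "Peter", "Simon"),
  age  = c(31, 45, 27),            # made-up values
  stringsAsFactors = FALSE
)

nrow(people)        # 3
names(people)       # "name" "age"
rownames(people)    # "1" "2" "3": unique row names are created automatically
people$age          # one variable pulled out by its column name
people[people$age > 30, ]   # rows that satisfy a condition
```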
6 weeks Data Scientist Online Courses #rstats
Hosting a 6 weekend live online certification course on Business Analytics with R starting June 1 at Edureka. Check www.edureka.in/rforanalytics for more details. The course has been designed to offer more open data science than current expensive offerings, which are tech oriented rather than business oriented, but with more support and customization than a MOOC. This is because many business customers don't care if it is lapply or ddply, or command line or GUI, as long as they get good ROI on the time and money spent in shifting to R from other analytics software.
Using a Linux only package in Windows #rstats
Here is some R code for using an R package that has only a tar.gz file available (the format used to install R packages on Linux) and no zip file available (the format used to install R packages on Windows).
Step 1 Download the tar.gz file.
Step 2 Unzip it (twice: first the .gz, then the .tar) using 7-Zip
Step 3 Change the path variable below to your unzipped, downloaded location for the R sub folder within the package folder .
Step 4 Copy and Paste this in R
Step 5 Start using the R package in Windows (where 75% of the money and clients and businesses still are)
Caveat Emptor No X Dependencies (ok!)
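The snippet that Step 4 refers to is not reproduced on this page, so what follows is only a sketch of one way the steps could work, assuming the package is pure R (no compiled code). The function name source_r_folder and the example path are hypothetical.

```r
# Load every .R file found in the package's R sub folder.
# Assumes a pure-R package: nothing here compiles C or Fortran code.
source_r_folder <- function(path, envir = globalenv()) {
  files <- list.files(path, pattern = "\\.[Rr]$", full.names = TRUE)
  for (f in files) {
    sys.source(f, envir = envir)   # evaluate the file in the chosen environment
  }
  invisible(basename(files))       # names of the files that were loaded
}

# Example call; replace the placeholder path with your unzipped location:
# source_r_folder("C:/Downloads/somepackage/R")
```

This only defines the package's functions in your session; it does not install help pages or the namespace. For a pure-R package, install.packages with repos = NULL and type = "source" pointed at the tar.gz file is another route worth trying on Windows.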
 WE DO NOT BREAK USERSPACE!

 Torvalds, Linus (2012-12-23). Linus Torvalds, LKML