Tag Archives: rstats
Datamind.com, whom I interact with on and off, and who are also the masterminds behind http://www.rdocumentation.org/, have finally created their platform for interactive and gamified R learning on the web. Take a look: it does look slightly better than Codecademy’s interface, doesn’t it? The platform is called http://www.r-fiddle.org/#/
More power to R for Cloud Computing!
Now if they could only collaborate with other players like Quandl, BigML and even StatAce for an even cooler offering. Even Revolution Analytics and RStudio, who have very expensive training modules, should be able to use this for self-paced online learning courses!
Quote- A software of beauty is a joy forever – Keats
In the future I think analysts need to be polyglots- you will need to know more than one language for crunching data.
SAS, Python, R, Julia, SPSS, Matlab: pick any two or any three.
No, you can’t count C or Java as a statistical language.
Efforts to promote Polyglots in Statistical Software are-
- JMP and R reference http://www.jmp.com/support/help/Working_with_R.shtml
2) R for Stata Users (book)
4) Using Python and R together
- Accessing R from Python (Rpy2) http://www.bytemining.com/wp-content/uploads/2010/10/rpy2.pdf
- Big Data with R and Python (though these have been made separately)
- Python for Data Analysis is a book by Wes McKinney
Probably we need a Python and R for Data Analysis book, just like the books we have for SAS and R.
- The RPy2 documentation is handy http://rpy.sourceforge.net/rpy2/doc-2.1/html/introduction.html
- A nice tutorial, which was also the inspiration for writing this post, is here: http://files.meetup.com/1225993/Laurent%20Gautier_R_toPython_bridge_to_R.pdf#!
5) Matlab and R
Reference (http://mathesaurus.sourceforge.net/matlab-python-xref.pdf ) includes Python
5) Octave and R
package http://cran.r-project.org/web/packages/RcppOctave/vignettes/RcppOctave.pdf includes Matlab
6) Julia and python
- Julia and IPython https://github.com/JuliaLang/IJulia.jl
- PyPlot uses the Julia PyCall package to call Python’s matplotlib directly from Julia
7) SPSS and Python is here
8) SPSS and R is as below
- The Essentials for R for Statistics versions 22, 21, 20, and 19 are available here.
- This link will take you to the SourceForge site where the Version 18 Essentials and Plugins are hosted.
9) Using R from Clojure – Incanter
Use embedded R from Clojure and Incanter http://github.com/jolby/rincanter
Suppose, let us just suppose, you want to create random numbers that are reproducible and derived from time stamps.
Here is the code in R
Note: you can also apply a custom function (I used the log) to the system time when generating the random numbers. This creates a list of pseudo-random numbers (since nothing machine-driven is purely random in the strict philosophical sense of the word).
 39621645 99451316 109889294 110275233 278994547 6554596 38654159 68748122 8920823 13293010
 57664241 24533980 174529340 105304151 168006526 39173857 12810354 145341412 241341095 86568818
Possible applications: things that need both random numbers (like encryption keys) and time stamps (like events, web or industrial logs, or pseudo-random pass codes in Google 2-factor authentication).
Note: I used the rnorm function, but you could possibly also pick the distribution function itself (rnorm or rcauchy) as a random input.
Again, I would trust my own randomness more than one generated by an arm of the US Govt (see http://www.nist.gov/itl/csd/ct/nist_beacon.cfm ).
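The idea can be sketched roughly as follows. This is not the original code from this post, only a minimal illustration under assumptions: the time stamp is a made-up fixed value (so the draw can be replayed), and the names stamp, seed, first_draw and second_draw are hypothetical.

```r
# Hypothetical time stamp; in practice this could be a Sys.time() value saved in a log.
stamp <- as.POSIXct("2014-05-01 10:30:00", tz = "UTC")

# Seconds since 1970-01-01 still fit in an R integer (until the year 2038).
seed <- as.integer(as.numeric(stamp))

# Seeding the generator with the time stamp makes the draw reproducible.
set.seed(seed)
first_draw <- rnorm(10)

# Replaying the same time stamp later gives exactly the same numbers.
set.seed(seed)
second_draw <- rnorm(10)

identical(first_draw, second_draw)  # TRUE
```

Swapping rnorm for rcauchy (or any other r* function) changes the distribution but not the reproducibility, since that comes entirely from the seed.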
Update- Random numbers in R
The currently available RNG kinds are given below.
kind is partially matched to this list. The default is "Mersenne-Twister".
- "Wichmann-Hill": The seed, .Random.seed[-1] == r[1:3], is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).
- "Marsaglia-Multicarry": A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list ‘sci.stat.math’. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).
- "Super-Duper": Marsaglia’s famous Super-Duper from the 70s. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).
We use the implementation by Reeds et al. (1982–84).
The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S’s .Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.
- "Mersenne-Twister": From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 – 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.
- "Knuth-TAOCP-2002": A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is
X[j] = (X[j-100] - X[j-37]) mod 2^30
and the ‘seed’ is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.
- "Knuth-TAOCP": An earlier version from Knuth (1997).
The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported ‘weakness’, and already scrambled the seeds.
Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.
- "L'Ecuyer-CMRG": A ‘combined multiple-recursive generator’ from L’Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.
The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than
This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.
- "user-supplied": Use a user-supplied generator.
RNGkind allows user-coded uniform and normal random number generators to be supplied.
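As a small illustration of the list above, the following sketch switches between two of the generators and inspects the seed vector. It uses only base R (RNGkind, set.seed, runif); the variable names and the seed value 42 are arbitrary choices for this example.

```r
# Remember the current settings so they can be restored at the end.
old_kind <- RNGkind()

# Wichmann-Hill: .Random.seed holds one code for the kind plus 3 seed integers.
RNGkind("Wichmann-Hill")
set.seed(42)
wh_len <- length(.Random.seed)
wh_first <- runif(3)

# Re-seeding the same generator replays the same stream.
set.seed(42)
wh_second <- runif(3)
identical(wh_first, wh_second)  # TRUE

# L'Ecuyer-CMRG: one code for the kind plus the 6-element seed.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
cmrg_len <- length(.Random.seed)

c(wh_len, cmrg_len)  # 4 and 7

# Put the original generator settings back.
do.call(RNGkind, as.list(old_kind))
```

Restoring the previous kind at the end matters in shared scripts, because RNGkind changes the generator for the whole session.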
Quote of the Day-
it is impossible to be a data scientist without knowing iris
Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.
While the Datasets collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and “tidy” csv versions. A little bit of RODBC usage should help, and so will some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than doing it alone.
Overall, there could be an R package for this (like a Big Data version of the famous datasets package in R).
But a nice and very useful effort
Revolution R Datasets
Also a recent project made by a student of mine on Revolution Datasets and using their blog posts.
The ifelse function is simple and powerful and can help in data manipulation within R. Here I create a categorical variable from specific values in a numeric variable.
> iris$Type=ifelse(iris$Sepal.Length<5.8,"Small Flower","Big Flower")
Big Flower Small Flower
The parameters of ifelse are quite simple:
ifelse(test, yes, no)
test: an object which can be coerced to logical mode.
yes: return values for true elements of test.
no: return values for false elements of test.
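A slightly fuller version of the iris example, as a sketch: it works on a copy of the built-in iris data (the name flowers is just for illustration) so the original dataset is left untouched, and then tabulates the new variable.

```r
# Work on a copy so the built-in iris data is not modified.
flowers <- iris

# test: Sepal.Length < 5.8; yes: "Small Flower"; no: "Big Flower"
flowers$Type <- ifelse(flowers$Sepal.Length < 5.8, "Small Flower", "Big Flower")

# Count how many rows fall into each category.
type_counts <- table(flowers$Type)
type_counts
```

Because ifelse is vectorised, the test is applied to every row at once and no explicit loop is needed.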
- Assigning Objects
We can create new data objects and variables quite easily within R. We use the = or the <- operator to assign an object to its name. For the purpose of this article we will use = to assign objects to object names. This is very useful when we are doing data manipulation, as we can reuse the manipulated data as inputs for other steps in our analysis.
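A short sketch of assignment and reuse; the object names and values are made up for illustration.

```r
# Both operators assign an object to a name.
sales = c(10, 20, 30)
doubled <- sales * 2

# An assigned object can be reused as input for a later step.
total = sum(doubled)
total  # 120
```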
Types of Data Objects in R
A list (more precisely, a vector in R terminology) is simply a one-dimensional collection of data. We create one using the c function.
The following code creates a list named numlist from 6 input numeric data
The following code creates a list named charlist from 6 input character data
The following code creates a list named mixlist from both numeric and character data.
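The three objects described above could be created along these lines. The object names follow the text, but the values are placeholders chosen for this sketch, since the original inputs are not shown.

```r
# Six numeric values (placeholder data).
numlist = c(1, 2, 3, 4, 5, 6)

# Six character values (placeholder data).
charlist = c("a", "b", "c", "d", "e", "f")

# Mixing both types: c() converts the numbers to character strings.
mixlist = c(1, 2, 3, "a", "b", "c")

class(numlist)   # "numeric"
class(charlist)  # "character"
class(mixlist)   # "character"
```

The last line is worth noting: objects built with c hold a single type, so numbers mixed with text are stored as text.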
A matrix is a two-dimensional collection of data in rows and columns, unlike a list, which is basically one-dimensional. We can create a matrix using the matrix command while specifying the number of rows with the nrow parameter and the number of columns with the ncol parameter.
In the following code, we create a matrix named ajay. The data is input in 3 rows as specified, but it is entered into the first column, then the second column, and so on.
     [,1] [,2] [,3]
[1,]    1    4   12
[2,]    2    5   18
[3,]    3    6   24
However, please note the effect of using the byrow=T (TRUE) option. In the following code we create a matrix named ajay. The data is input in 3 rows as specified, but it is entered into the first row, then the second row, and so on.
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]   12   18   24
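Code along the following lines produces the two matrices printed above. The input values are read off the printed output; the second object is called ajay_byrow here only so that both versions can be compared side by side.

```r
values = c(1, 2, 3, 4, 5, 6, 12, 18, 24)

# Default: data fills the first column, then the second, and so on.
ajay = matrix(values, nrow = 3, ncol = 3)

# byrow = TRUE: data fills the first row, then the second, and so on.
ajay_byrow = matrix(values, nrow = 3, ncol = 3, byrow = TRUE)

ajay
ajay_byrow
```

With a square matrix like this one, the byrow = TRUE result is simply the transpose of the default result.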
- Data Frames
A data frame is a list of variables of the same number of rows with unique row names. The column names are the names of the variables.
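A minimal sketch of a data frame, with made-up data and hypothetical column names:

```r
# Two variables (columns) with the same number of rows.
scores = data.frame(
  student = c("A", "B", "C"),
  marks = c(72, 85, 64),
  stringsAsFactors = FALSE
)

names(scores)      # the column names are the variable names
nrow(scores)       # 3
row.names(scores)  # unique row names, "1" "2" "3" by default
scores$marks       # a single variable is accessed with $
```

Unlike a matrix, each column can hold a different type, which is why data frames are the usual container for datasets such as iris.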
I am hosting a 6-weekend live online certification course on Business Analytics with R starting June 1 at Edureka. Check www.edureka.in/r-for-analytics for more details. The course has been designed to offer more open data science than current expensive offerings that are tech-oriented rather than business-oriented, but with more support and customization than a MOOC. This is because many business customers don’t care if it is lapply or ddply, or command line or GUI, as long as they get good ROI on the time and money spent in shifting to R from other analytics software.