Tag Archives: rstats
The guys at StatAce released major updates. I am particularly excited about the ability to create a custom GUI box for your own analysis, or for sharing with consulting clients or students.
What does that mean? Basically they are making it a bit like R Commander extensions: if you have a package or analysis you would rather run visually than in code, you can create a GUI module for it. The modular extension is quite cool in my opinion, but the proof will be in how well designed the pudding is.
Public sharing of results
Now you can share your analysis results for the world to see (example). Just click Share in the results pane.
Google Drive integration
We added integration with Google Drive. This makes collaboration and synchronization of large files even easier. Don’t forget we also support Dropbox. Just click the Connect to menu in the file manager.
Plots zoom and SVG export
Now you can open plots in a separate window that supports zoom in and zoom out. From it, you can also export to the SVG format which is ideal for printing. Just click the lens icon next to any plot.
Point-and-click PCA + data transformation without R knowledge
You can now carry out a PCA by just pointing and clicking through Analysis > Dimensional Analysis > Principal Components Analysis. We also added the Data menu which allows you to filter and sort datasets without any knowledge of R.
(Secret) Build your own visual dialog box to run R code
Do you have colleagues who don’t know R but need to use functionality you developed? Do you do consulting and want your customers to be able to run your models with point-and-click? Do you want to share a piece of R code with the world in an easy-to-use way?
StatAce now allows you to easily create a custom graphical interface for your R code. The process is entirely visual (no coding) and is what we use to build our own Data & Analysis menus (e.g. the bivariate correlation and linear regression dialog boxes). We are testing the functionality with a limited number of users, and their feedback has been great. Drop us a line at firstname.lastname@example.org to request early access.
Datamind.com, whom I interact with on and off and who are also the masterminds behind http://www.rdocumentation.org/, have finally created their platform for interactive and gamified R learning on the web. Take a look: it does look slightly better than Codecademy’s interface, doesn’t it? The platform is called http://www.r-fiddle.org/#/
More power to R for Cloud Computing!
Now if they could only collaborate with other players like Quandl, BigML and even StatAce for an even cooler suggestion. Even Revolution Analytics and RStudio, who have very expensive training modules, should be able to use this for self-paced online learning courses!
Quote- A software of beauty is a joy forever – Keats
In the future I think analysts need to be polyglots: you will need to know more than one language for crunching data.
SAS, Python, R, Julia, SPSS, Matlab: pick any two ;) or any three.
No, you can’t count C or Java as a statistical language :) :)
Efforts to promote polyglots in statistical software include:
- JMP and R reference http://www.jmp.com/support/help/Working_with_R.shtml
2) R for Stata Users (book)
4) Using Python and R together
- Accessing R from Python (Rpy2) http://www.bytemining.com/wp-content/uploads/2010/10/rpy2.pdf
- Big Data with R and Python (though these have been made separately)
- Python for Data Analysis, a book by Wes McKinney
Probably we need a Python and R for Data Analysis book, just like the books we have for SAS and R.
- The RPy2 documentation is handy http://rpy.sourceforge.net/rpy2/doc-2.1/html/introduction.html
- A nice tutorial, which was also the inspiration for writing this post, is here: http://files.meetup.com/1225993/Laurent%20Gautier_R_toPython_bridge_to_R.pdf#!
5) Matlab and R
Reference (http://mathesaurus.sourceforge.net/matlab-python-xref.pdf ) includes Python
5) Octave and R
package http://cran.r-project.org/web/packages/RcppOctave/vignettes/RcppOctave.pdf includes Matlab
6) Julia and Python
- Julia and IPython https://github.com/JuliaLang/IJulia.jl
- PyPlot uses the Julia PyCall package to call Python’s matplotlib directly from Julia
7) SPSS and Python is here
8) SPSS and R is as below
- The Essentials for R for Statistics versions 22, 21, 20, and 19 are available here.
- This link will take you to the SourceForge site where the Version 18 Essentials and Plugins are hosted.
9) Using R from Clojure – Incanter
Use embedded R from Clojure and Incanter http://github.com/jolby/rincanter
Suppose, let us just suppose, you want to create random numbers that are reproducible and derived from time stamps.
Here is the code in R
Note: you can also create a custom function (I used the log) of the system time to generate the random numbers. This creates a list of pseudo-random numbers (since nothing machine-driven is purely random in the strict philosophical sense of the word).
 39621645 99451316 109889294 110275233 278994547 6554596 38654159 68748122 8920823 13293010
 57664241 24533980 174529340 105304151 168006526 39173857 12810354 145341412 241341095 86568818
Possible applications: things that need both random numbers (like encryption keys) and time stamps (like events, web or industrial logs, or pseudo-random pass codes in Google 2-factor authentication).
Note I used the rnorm function, but you could possibly also draw the function itself as a random input (rnorm or rcauchy).
Again, I would rather trust my own randomness than one generated by an arm of the US Govt (see http://www.nist.gov/itl/csd/ct/nist_beacon.cfm )
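A minimal sketch of the idea, not the code that produced the numbers above. It assumes the seed is derived from the log of the numeric system time; the helper names seed_from_time and draw_from_time, the 1e10 scaling and the sd of 1e8 are all hypothetical choices made for illustration.

```r
# Hypothetical helper: turn a time stamp into an integer seed via log().
# The 1e10 scaling is an arbitrary assumption so that time stamps about a
# second apart map to different integer seeds.
seed_from_time <- function(t = Sys.time()) {
  as.integer((log(as.numeric(t)) * 1e10) %% .Machine$integer.max)
}

# Hypothetical helper: draw n reproducible pseudo-random integers for a time stamp.
draw_from_time <- function(t, n = 10) {
  set.seed(seed_from_time(t))              # same time stamp gives the same seed
  abs(round(rnorm(n, mean = 0, sd = 1e8))) # rnorm could be swapped for rcauchy
}

stamp  <- as.POSIXct("2013-11-15 10:30:00", tz = "UTC")
first  <- draw_from_time(stamp)
second <- draw_from_time(stamp)
identical(first, second)   # the same time stamp reproduces the same numbers
```

Anyone who knows the time stamp and the transformation can regenerate the numbers, which is what makes them reproducible.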
Update- Random numbers in R
The currently available RNG kinds are given below.
kind is partially matched to this list. The default is Mersenne-Twister.
- Wichmann-Hill: The seed, .Random.seed[-1] == r[1:3], is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).
- Marsaglia-Multicarry: A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list ‘sci.stat.math’. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).
- Super-Duper: Marsaglia’s famous Super-Duper from the 70’s. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).
We use the implementation by Reeds et al. (1982–84).
The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S’s
.Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.
- Mersenne-Twister: From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.
- Knuth-TAOCP-2002: A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is
X[j] = (X[j-100] - X[j-37]) mod 2^30
and the ‘seed’ is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.
- Knuth-TAOCP: An earlier version from Knuth (1997).
The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported ‘weakness’, and already scrambled the seeds.
Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.
- L’Ecuyer-CMRG: A ‘combined multiple-recursive generator’ from L’Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.
The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than
This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.
- user-supplied: Use a user-supplied generator.
Function RNGkind allows user-coded uniform and normal random number generators to be supplied.
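As a quick illustration of the generators described above, here is a short sketch using base R's RNGkind and set.seed. The seed value 123 and the variable names are arbitrary choices for the example.

```r
old <- RNGkind()                         # remember the current kinds so they can be restored

RNGkind("Wichmann-Hill")                 # switch the uniform generator
set.seed(123)
wh_seed_length <- length(.Random.seed)   # one code element plus the length-3 seed

RNGkind("Mersenne-Twister")              # back to the default uniform generator
set.seed(123)
mt_seed_length <- length(.Random.seed)   # one code element, 624 integers and a position

# The same kind and the same seed reproduce the same stream
set.seed(123); u1 <- runif(5)
set.seed(123); u2 <- runif(5)
same_stream <- identical(u1, u2)

RNGkind(kind = old[1], normal.kind = old[2])   # restore the earlier settings
```

Calling RNGkind() with no arguments simply reports the kinds currently in use.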
Quote of the Day-
it is impossible to be a data scientist without knowing iris
Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.
Site was updated so here are the new links
While the Datasets collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and “tidy” csv versions. A little bit of RODBC usage should help, and so will some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than do it alone.
Overall there could be an R package (like a Big Data version of the famous datasets package in R).
But a nice and very useful effort
Revolution R Datasets
- AirOnTime87to12/ 09-Nov-2013 00:46 -
- AirOnTimeCSV2012/ 09-Nov-2013 00:30 -
- AirOnTime2012.xdf 08-Nov-2013 18:08 190110335
- AirOnTime7Pct.xdf 08-Nov-2013 17:42 103317987
- AirlineData87to08.tar.gz 03-May-2013 21:05 5521408
- AirlineData87to08.zip 09-May-2013 14:59 1802240
- AirlineData87to08_11811.tar.gz 08-Nov-2013 03:27 1428527359
- AirlineData87to08_83010.zip 08-Nov-2013 06:37 1477052425
- AirlineDataSubsample.xdf 08-Nov-2013 07:27 390789536
- Census5PCT2000.tar.gz 08-Nov-2013 10:55 871208970
- Census5PCT2000.zip 08-Nov-2013 12:52 925929427
- CensusUS5Pct2000.xdf 08-Nov-2013 21:27 1204906764
- ccFraud.csv 23-Apr-2013 20:57 291737157
- ccFraudScore.csv 23-Apr-2013 21:10 273848249
- ccFraudScore10_CreateLoadTableQuotedColumns.fas..> 23-Apr-2013 21:10 981
- ccFraud_CreateLoadTable_QuotedColumns.fastload 23-Apr-2013 21:10 984
- index.php.txt 09-May-2013 22:17 3983
- mortDefault.tar.gz 08-Nov-2013 12:59 61585580
- mortDefault.zip 08-Nov-2013 13:08 63968310
Also a recent project made by a student of mine on Revolution Datasets and using their blog posts.
The ifelse function is simple and powerful and can help in data manipulation within R. Here I create a categorical variable from specific values in a numeric variable.
> iris$Type=ifelse(iris$Sepal.Length<5.8,"Small Flower","Big Flower")
Big Flower Small Flower
The parameters of ifelse are quite simple
ifelse(test, yes, no)
test: an object which can be coerced to logical mode.
yes: return values for true elements of test.
no: return values for false elements of test.
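A small sketch of how the three arguments work together, first on a made-up vector and then on the built-in iris data. The vector lengths and the variable names are assumptions for illustration only.

```r
# ifelse() is vectorised: each element of the test picks from yes or no
lengths <- c(4.9, 5.8, 6.3)                                  # made-up sepal lengths
labels  <- ifelse(lengths < 5.8, "Small Flower", "Big Flower")
labels

# The same call on a copy of iris, followed by a count of each category
flowers <- iris
flowers$Type <- ifelse(flowers$Sepal.Length < 5.8, "Small Flower", "Big Flower")
counts <- table(flowers$Type)
counts
```

Because the test 5.8 < 5.8 is FALSE, a flower of exactly 5.8 falls in the "Big Flower" group.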
- Assigning Objects
We can create new data objects and variables quite easily within R. We use the = or the <- operator to assign an object to its name. For the purpose of this article we will use = to assign object names and objects. This is very useful when we are doing data manipulation, as we can reuse the manipulated data as inputs for other steps in our analysis.
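A minimal sketch of assignment and reuse; the object names and values are made up for illustration.

```r
x = c(10, 20, 30)   # assign a numeric object to the name x using =
y <- x * 2          # the <- operator does the same job
total = sum(y)      # the manipulated data is reused as input for the next step
total
```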
Types of Data Objects in R
A vector, which this article loosely calls a list, is simply a one-dimensional collection of data. We create one using the c() function.
The following code creates a list named numlist from 6 input numeric data
The following code creates a list named charlist from 6 input character data
The following code creates a list named mixlist from both numeric and character data.
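A sketch of the three objects just described. The names numlist, charlist and mixlist come from the text, but the values are hypothetical.

```r
numlist  = c(1, 2, 3, 4, 5, 6)                # six numeric values
charlist = c("a", "b", "c", "d", "e", "f")    # six character values
mixlist  = c(1, 2, 3, "a", "b", "c")          # numeric and character values together

class(numlist)
class(charlist)
class(mixlist)   # c() coerces mixed input to a single type
```

When numbers and text are combined with c(), the numbers are stored as character strings.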
A matrix is a two-dimensional collection of data in rows and columns, unlike a list, which is basically one-dimensional. We can create a matrix using the matrix command while specifying the number of rows with the nrow parameter and the number of columns with the ncol parameter.
In the following code, we create a matrix named ajay. The data is input in 3 rows as specified, but it is entered into the first column, then the second column, and so on.
     [,1] [,2] [,3]
[1,]    1    4   12
[2,]    2    5   18
[3,]    3    6   24
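The matrix printed above can be reproduced with a call along these lines. The data values are read off the printed output; the exact original call is an assumption.

```r
# Nine values filled column by column (the default) into 3 rows and 3 columns
ajay = matrix(c(1, 2, 3, 4, 5, 6, 12, 18, 24), nrow = 3, ncol = 3)
ajay
```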
However, please note the effect of using the byrow=T (TRUE) option. In the following code we create a matrix named ajay. The data is input in 3 rows as specified, but it is entered into the first row, then the second row, and so on.
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]   12   18   24
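The row-wise version printed above can be reproduced with a call along these lines. Again the data values are read off the printed output, and the exact original call is an assumption.

```r
# The same nine values, now filled row by row because byrow = TRUE
ajay = matrix(c(1, 2, 3, 4, 5, 6, 12, 18, 24), nrow = 3, ncol = 3, byrow = TRUE)
ajay
```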
- Data Frames
A data frame is a list of variables of the same number of rows with unique row names. The column names are the names of the variables.
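A small sketch of a data frame built by hand; the object name flowers, the column names and the values are made up for illustration.

```r
# Each argument becomes one variable (column); all columns have the same number of rows
flowers = data.frame(
  name   = c("rose", "tulip", "daisy"),
  petals = c(30, 6, 21),
  stringsAsFactors = FALSE
)

names(flowers)      # the column names are the names of the variables
rownames(flowers)   # row names are unique, by default "1", "2", "3"
dim(flowers)        # number of rows and columns
```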