NumFOCUS- The Python Statistical Community

I really liked the mature design and foundation of this charitable organization. While it is similar to FOAS in many ways (http://www.foastat.org/projects.html), I like the projects. Excellent projects, some of which I think should be featured in the Journal of Statistical Software (since there is a separate R Journal), unless that journal wants to remain overtly R focused.

 

In the same manner, I think some non-Python projects should try and reach out to NumFOCUS (if it does not want to be so PyFocus-ed).

Here it is, NumFOCUS:

NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.

Money donated through NumFOCUS goes to sponsor things like:

  • Coding sprints (food and travel)
  • Technical fellowships (sponsored students and mentors to work on code)
  • Equipment grants (to developers and projects)
  • Conference attendance for students (to PyData, SciPy, and other conferences)
  • Fees for continuous integration and other software engineering tools
  • Documentation development
  • Web-page hosting and bandwidth fees for projects

Core Projects

NumPy

NumPy is the fundamental package needed for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which allows NumPy to integrate seamlessly and speedily with a wide variety of databases. Repositories for NumPy binaries: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy; a variety of versions: http://sourceforge.net/projects/numpy/files/NumPy/; version 1.6.1: http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/.
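As a quick illustration of the "arbitrary data-types" point, here is a minimal sketch of a structured dtype acting as a container for record-like data (the field names and values are made up for the example):

```python
import numpy as np

# A structured dtype: each record holds a name, an age, and a weight.
record = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.5), ("Bob", 42, 81.0)], dtype=record)

print(people["age"])         # the age column as an integer array
print(people["age"].mean())  # 36.0
```

Because the records are laid out contiguously in memory, this kind of array maps naturally onto database rows and binary file formats.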

SciPy

SciPy is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.
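For instance, a minimal sketch of the numerical integration and optimization routines mentioned above (assuming SciPy is installed; the functions being integrated and minimized are made up for the example):

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi (exact answer: 2).
area, err = integrate.quad(np.sin, 0, np.pi)

# Minimize (x - 3)^2, a parabola with its minimum at x = 3.
res = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(round(area, 6))  # ≈ 2.0
print(round(res.x, 4))  # ≈ 3.0
```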

Matplotlib

A 2D plotting library for Python that produces high-quality figures that can be used in various hardcopy and interactive environments. matplotlib is compatible with Python scripts and the Python and IPython shells.

IPython

A high-quality open source Python shell that includes tools for high-level and interactive parallel computing.

SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.
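A minimal sketch of what symbolic (as opposed to numerical) computation looks like, assuming SymPy is installed:

```python
import sympy as sp

x = sp.symbols("x")

# Expand a polynomial and differentiate a product, symbolically.
expanded = sp.expand((x + 1) ** 2)      # x**2 + 2*x + 1
derivative = sp.diff(x * sp.sin(x), x)  # x*cos(x) + sin(x)

print(expanded)
print(derivative)
```

The results are exact expressions, not floating-point approximations.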

Other Projects

Cython

Cython is a language based on Pyrex that makes writing C extensions for Python as easy as writing them in Python itself. Cython supports calling C functions and declaring C types on variables and class attributes, allowing the compiler to generate very efficient C code from Cython code.

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
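A minimal sketch of those data structures in action, assuming pandas is installed (the data frame contents are made up):

```python
import pandas as pd

# A tiny data frame and a split-apply-combine summary.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 10]})
means = df.groupby("group")["value"].mean()

print(means["a"])  # 1.5
print(means["b"])  # 10.0
```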

PyTables

PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features a Pythonic interface combined with C/Cython extensions for the performance-critical parts of the code. This makes it a fast, yet extremely easy to use, tool for very large amounts of data. http://pytables.github.com/

scikit-image

A free, high-quality, peer-reviewed, volunteer-produced collection of algorithms for image processing.

scikit-learn

A module designed for scientific Python that provides accessible solutions to machine learning problems.
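A minimal sketch of a supervised fit, assuming scikit-learn is installed (the training data are made up and exactly linear, y = 2x + 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a line to exactly linear data: y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

print(model.coef_[0], model.intercept_)  # close to 2.0 and 1.0
```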

Scikits-Statsmodels

Statsmodels is a Python package that provides a complement to SciPy for statistical computations, including descriptive statistics and estimation of statistical models.

Spyder

An interactive development environment for Python that features advanced editing, interactive testing, debugging, and introspection capabilities, as well as a numerical computing environment made possible through the support of IPython, NumPy, SciPy, and matplotlib.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Associated Projects

NumFOCUS is currently looking for representatives to enable us to promote the following projects. For information contact us at: info@NumFOCUS.org.

Sage

An open source mathematics software system that combines existing open-source packages into a Python-based interface.

NetworkX

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
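A minimal sketch of what that looks like, assuming NetworkX is installed (the graph is made up for the example):

```python
import networkx as nx

# A small undirected graph: a triangle plus one pendant node.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

print(G.number_of_nodes(), G.number_of_edges())  # 4 4
print(nx.shortest_path(G, "a", "d"))             # ['a', 'c', 'd']
```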

Python(X,Y)

Free scientific and engineering development software used for numerical computations, and analysis and visualization of data, using the Python programming language.

 

Task View on Web Technologies #rstats

Task Views on CRAN offer a good way to navigate the 5,000-plus packages. They are here: http://cran.r-project.org/web/views/

rOpenSci has a CRAN Task View too, except it is on GitHub, and it is on using web services from within R. I think it is more like an R API view!

UPDATE- It is now on CRAN here:

I wish CRAN allowed more Markdown, especially in the views, or even across the whole site…

😉

http://cran.r-project.org/web/views/WebTechnologies.html

You can see a lot of "not on CRAN" packages, huh?

CRAN Task View: Web Technologies and Services

Maintainer: Scott Chamberlain, Karthik Ram, Christopher Gandrud, Patrick Mair
Contact: scott at ropensci.org
Version: 2013-10-02
This task view contains information about using R to obtain and parse data from the web. The base version of R does not ship with many tools for interacting with the web. Thankfully, there are an increasingly large number of tools for interacting with the web. If you have any comments or suggestions for additions or improvements please contact the maintainer of the task view. A list of available packages and functions is presented below, grouped by the type of activity.

Tools for Working with the Web from R

Parsing Data from the Web

  • The repmis package contains a source_data() command to load plain-text data from a URL (either http or https).
  • The package XML contains functions for parsing XML and HTML, and supports xpath for searching XML (think regex for strings). A helpful function to read data from one or more HTML tables is readHTMLTable().
  • scrapeR provides additional tools for scraping data from HTML and XML documents.
  • The XML2R package (to be on CRAN soon) is a collection of convenient functions for coercing XML into data frames.
  • The rjson package converts R objects into JavaScript Object Notation (JSON) and vice versa.
  • An alternative to rjson is RJSONIO, which also converts to and from data in JSON format (and is fast for parsing).
  • An alternative to the XML package is selectr, which parses CSS3 Selectors and translates them to XPath 1.0 expressions.

Curl/HTTP/FTP and Authentication:

  • RCurl: A low-level curl wrapper that allows one to compose general HTTP requests and provides convenient functions to fetch URIs, get/post forms, etc. and process the results returned by the web server. This provides a great deal of control over the HTTP/FTP connection and the form of the request while providing a higher-level interface than is available just using R socket connections. It also provides tools for web authentication.
  • httr: A light wrapper around RCurl that makes many things easier, but still allows you to access the lower-level functionality of RCurl. It has convenient http verbs: GET(), POST(), PUT(), DELETE(), PATCH(), HEAD(), BROWSE(). These wrapper functions are more convenient to use, though less configurable, than their counterparts in RCurl. http status codes are helpful for debugging http calls; this package makes that easier: for example, stop_for_status() gets the http status code from a response object and stops the function if the call was not successful.
  • Using web resources can require authentication, either via API keys, OAuth, username:password combination, or via other means. ROAuth is a package that provides a separate R interface to OAuth. OAuth is the most complicated authentication process, and can be most easily done using httr (see package demos).

Web Frameworks

  • The shiny package makes it easy to build interactive web applications with R.
  • The Rook web server interface contains the specification and convenience software for building and running Rook applications.
  • The opencpu framework for embedded statistical computation and reproducible research exposes a web API interfacing R, LaTeX and Pandoc. This API is used for example to integrate statistical functionality into systems, share and execute scripts or reports on centralized servers, and build R based apps.

JavaScript

  • ggvis (not on CRAN) makes it easy to describe interactive web graphics in R. It fuses the ideas of ggplot2 and shiny, rendering graphics on the web with Vega.
  • rCharts (not on CRAN) allows for interactive javascript charts from R.
  • rVega (not on CRAN) is an R wrapper for Vega.
  • clickme (not on CRAN) is an R package to create interactive plots.

Data Sources on the Web Accessible via R

Ecological and Evolutionary Biology

  • rvertnet: A wrapper to the VertNet collections database API.
  • rgbif: Interface to the Global Biodiversity Information Facility API methods.
  • rfishbase: A programmatic interface to fishbase.org.
  • treebase: An R package for discovery, access and manipulation of online phylogenies.
  • taxize: Taxonomic information from around the web.
  • dismo: Species distribution modeling, with wrappers to some APIs.
  • rnbn (not on CRAN): Access to the UK National Biodiversity Network data.
  • rWBclimate (not on CRAN): R interface for the World Bank climate data.
  • rbison (not on CRAN): Wrapper to the USGS Bison API.
  • neotoma (not on CRAN): Programmatic R interface to the Neotoma Paleoecological Database.
  • rnoaa (not on CRAN): R interface to NOAA Climate data API.
  • rnpn (not on CRAN): Wrapper to the National Phenology Network database API.
  • rfisheries: Package for interacting with fisheries databases at openfisheries.org.
  • rebird: A programmatic interface to the eBird database.
  • flora: Retrieve taxonomical information of botanical names from the Flora do Brasil website.
  • Rcolombos: This package provides programmatic access to Colombos, a web based interface for exploring and analyzing comprehensive organism-specific cross-platform expression compendia of bacterial organisms.
  • Reol: An R interface to the Encyclopedia of Life (EOL) API. Includes functions for downloading and extracting information off the EOL pages.
  • rPlant: An R interface to the many computational resources iPlant offers through their RESTful application programming interface. Currently, rPlant functions interact with the iPlant foundational API, the Taxonomic Name Resolution Service API, and the Phylotastic Taxosaurus API. Before using rPlant, users will have to register with the iPlant Collaborative. http://www.iplantcollaborative.org/discover/discovery-environment

Genes and Genomes

  • cgdsr: R-Based API for accessing the MSKCC Cancer Genomics Data Server (CGDS).
  • rsnps (not on CRAN): Wrapper to the openSNP data API and the Broad Institute SNP Annotation and Proxy Search.
  • rentrez: Talk with NCBI entrez using R.

Earth Science

  • RNCEP: Obtain, organize, and visualize NCEP weather data.
  • crn: Provides the core functions required to download and format data from the Climate Reference Network. Both daily and hourly data are downloaded from the ftp, a consolidated file of all stations is created, and station metadata is extracted. In addition, functions for selecting individual variables and creating R-friendly datasets from them are provided.
  • BerkeleyEarth: Data input for Berkeley Earth Surface Temperature.
  • waterData: An R Package for retrieval, analysis, and anomaly calculation of daily hydrologic time series data.
  • CHCN: A compilation of historical through contemporary climate measurements scraped from the Environment Canada website, including tools for scraping data, creating metadata, and formatting temperature files.
  • decctools: Provides functions for retrieving energy statistics from the United Kingdom Department of Energy and Climate Change and related data sources. The current version focuses on total final energy consumption statistics at the local authority, MSOA, and LSOA geographies. Methods for calculating the generation mix of grid electricity and its associated carbon intensity are also provided.
  • Metadata: Collates metadata for climate surface stations.
  • sos4R: A client for Sensor Observation Services (SOS) as specified by the Open Geospatial Consortium (OGC). It allows users to retrieve metadata from SOS web services and to interactively create requests for near real-time observation data based on the available sensors, phenomena, observations et cetera using thematic, temporal and spatial filtering.

Economics

  • WDI: Search, extract and format data from the World Bank’s World Development Indicators.
  • FAOSTAT: The package hosts a list of functions to download, manipulate, construct and aggregate agricultural statistics provided by the FAOSTAT (Food and Agricultural Organization of the United Nations) database.

Chemistry

  • rpubchem: Interface to the PubChem Collection.

Agriculture

  • cimis: R package for retrieving data from CIMIS, the California Irrigation Management Information System.

Literature, Metadata, Text, and Altmetrics

  • rplos: A programmatic interface to the Web Service methods provided by the Public Library of Science journals for search.
  • rbhl (not on CRAN): R interface to the Biodiversity Heritage Library (BHL) API.
  • rmetadata (not on CRAN): Get scholarly metadata from around the web.
  • RMendeley: Implementation of the Mendeley API in R.
  • rentrez: Talk with NCBI entrez using R.
  • rorcid (not on CRAN): A programmatic interface to the Orcid.org API.
  • rpubmed (not on CRAN): Tools for extracting and processing Pubmed and Pubmed Central records.
  • rAltmetric (not on CRAN): Query and visualize metrics from Altmetric.com.
  • rImpactStory: Programmatic interface to the ImpactStory API.
  • alm (not on CRAN): R wrapper to the altmetrics API platform developed by PLoS.
  • ngramr: Retrieve and plot word frequencies through time from the Google Ngram Viewer.

Marketing

  • anametrix: Bidirectional connector to Anametrix API.

Data Depots

  • dvn: Provides access to The Dataverse Network API.
  • rfigshare: Programmatic interface for Figshare.
  • factualR: Thin wrapper for the Factual.com server API.
  • dataone: A package that provides read/write access to data and metadata from the DataONE network of Member Node data repositories.
  • yhatr: Lets you deploy, maintain, and invoke models via the Yhat REST API.
  • RSocrata: Provided with a Socrata dataset resource URL, or a Socrata SoDA web API query, returns an R data frame. Converts dates to POSIX format. Supports CSV and JSON. Manages throttling by Socrata.

Machine Learning as a Service

  • bigml: BigML, a machine learning web service.
  • MTurkR: Access to Amazon Mechanical Turk Requester API via R.

Web Analytics

  • rgauges (not on CRAN): Interface to Gaug.es API.
  • RSiteCatalyst: Functions for accessing the Adobe Analytics (Omniture SiteCatalyst) Reporting API.
  • r-google-analytics (not on CRAN): Provides access to Google Analytics.

News

  • GuardianR: Provides an interface to the Open Platform’s Content API of the Guardian Media Group. It retrieves content from the news outlets The Observer, The Guardian, and guardian.co.uk from 1999 to the current day.

Images, Videos, Music

  • imguR: A package to share plots using the image hosting service imgur.com.
  • RLastFM: A package to interface to the last.fm API.

Sports

  • nhlscrapr: Compiling the NHL Real Time Scoring System Database for easy use in R.

Maps

  • RgoogleMaps: This package serves two purposes: It provides a comfortable R interface to query the Google server for static maps, and use the map as a background image to overlay plots within R.
  • osmar: This package provides infrastructure to access OpenStreetMap data from different sources to work with the data in common R manner and to convert data into available infrastructure provided by existing R packages (e.g., into sp and igraph objects).
  • ggmap: Allows for the easy visualization of spatial data and models on top of Google Maps, OpenStreetMaps, Stamen Maps, or CloudMade Maps using ggplot2.

Social media

  • streamR: This package provides a series of functions that allow R users to access Twitter’s filter, sample, and user streams, and to parse the output into data frames. OAuth authentication is supported.
  • twitteR: Provides an interface to the Twitter web API.

Government

  • wethepeople: An R client for interacting with the White House’s “We The People” petition API.
  • govdat: Interface to various APIs for government data, including New York Times congress API, and the Sunlight Foundation set of APIs.
  • govStatJPN: Functions to get public survey data in Japan.

Other

  • sos4R: R client for the OGC Sensor Observation Service.
  • datamart: Provides an S4 infrastructure for unified handling of internal datasets and web based data sources. Examples include dbpedia, eurostat and sourceforge.
  • rDrop (not on CRAN): Dropbox interface.
  • zendeskR: This package provides an R wrapper for the Zendesk API.

The original was on github

https://github.com/ropensci/webservices

Some new packages I really liked!

  • rDrop: Dropbox interface.
  • nhlscrapr: Compiling the NHL Real Time Scoring System Database for easy use in R.
  • osmar: This package provides infrastructure to access OpenStreetMap data from different sources, to work with the data in common R manner, and to convert data into available infrastructure provided by existing R packages (e.g., into sp and igraph objects).
  • MTurkR: Access to Amazon Mechanical Turk Requester API via R. more

  • rgauges: Interface to Gaug.es API more (not on CRAN)

  • RSiteCatalyst: Adobe Analytics (Omniture SiteCatalyst) Reporting API

  • GuardianR: Provides an interface to the Open Platform’s Content API of the Guardian Media Group.

  • imguR: A package to share plots using the image hosting service imgur.com
  • RLastFM: A package to interface to the last.fm API.

Family as a basic social unit- Stats

Some stats from http://www.stepfamilies.info/stepfamily-fact-sheet.php

Sadly, they decided to discontinue providing estimates of marriage, divorce, and remarriage except for those available from the current census. Thus, many of the following current estimates were derived from the 1990 census and earlier data sources. This is for the US, but I would be interested in plotting, say, how GDP and family unit size change across countries.

 

Current estimates from 1988-1990 suggest:

  • 92% of all men and women marry by age 50
  • 43% of first marriages will end in divorce within 15 years
  • 25% of all men and women report being married two or more times by age 50
  • In 2004, 42% of all marriages were remarriages for at least one partner
  • Of those who get divorced, 75% will remarry (65% bring children from a previous union)
  • 60% of those who get remarried redivorce
  • 15% of remarriages will end in divorce within 3 years, 25% within 5 years, 39% within 10 years
  • The average length of first marriages and remarriages that end in divorce is about 8 years
  • The average time between first divorce and remarriage is about 3.5 years
  • 54% of women will remarry within 5 years of first divorce and 75% within 10 years
  • 50% of men who remarry after first divorce do so within 3-4 years
  • Having low income and living in poor neighborhoods are associated with a lower chance of remarriage
  • Younger adults are more likely to remarry than older ones
  • Whites and Latinos are more likely to remarry than African Americans
  • After 5 years of divorce Whites are most likely to remarry (58%), followed by Latinos (44%) and African Americans (32%)
  • These proportions show a marked downward trend when compared to national samples in 1976, which indicated the probability of remarriage within 5 years of divorce was 73% for Whites and nearly 50% for African Americans
  • Estimates suggest that, by the time they are 18, anywhere from 1/3 to 1/2 of children will have been part of a stepfamily

In addition, while billions are being spent on data software, how can we cut down the cost of collecting census data using newer technologies (rather than those paper forms!)?

Google introduces Analytics Academy for e-learning

I really liked this and promptly signed up at https://analyticsacademy.withgoogle.com/course

I of course passed the Google Web Analytics IQ test some 2 years back (but it's only valid for 18 months).

Digital Analytics Fundamentals

This three-week course provides a foundation for marketers and analysts seeking to understand the core principles of digital analytics and to improve business performance through better digital measurement.

Course highlights include:

  • An overview of today’s digital measurement landscape
  • Guidance on how to build an effective measurement plan
  • Best practices for collecting actionable data
  • Descriptions of key digital measurement concepts, terminology and analysis techniques
  • Deep-dives into Google Analytics reports with specific examples for evaluating your digital marketing performance
  • View lessons from experts

    Watch or read lessons from digital analytics advocate Justin Cutroni, all at your own pace.

  • Test your knowledge

    Apply what you learn in the course by completing short quizzes and practice exercises.

  • Join the learning community

    Engage with other course participants and analytics experts in the course forum and on Google+.

Algorithms.io is Dataweek Startup of September

Andy Bartley, one of the guys I keep shooting ideas with on an irregular basis, has a startup, Algorithms.io, which just won a startup competition.

 

http://dataweek.co/algorithms-io-wins-data-2-0-summit-2013-startup-pitch-competition/

Andy was kind enough to mention me at the link above (I extracted it here). What is really cool is that they are now going to demo analytics for wearable computing. That's right: analytics + Google Glass? Any takers? 🙂

See-

Visit Algorithms.io tomorrow and Thursday at Dataweek 2013 at the Fort Mason center in San Francisco. We will be in booth #118 giving a live demo of our new machine learning platform for wearable devices.

This new platform intelligently classifies streaming data from wearable devices into actionable events that can be used to build predictive applications.  It combines a data scientist, dev ops engineer, and developer all into one simple service.


Geoff: Is Algorithms.io a “marketplace for algorithms” or do you plan on producing / curating most of the algorithms internally?

Andy:  Right now we are performing the curation internally.  When you get past the marketing hype around Big Data, Machine Learning, Predictive Analytics, etc. what you’ll find is most companies still aren’t sure exactly how these technologies can benefit their business.  We talk with Fortune 500 companies every week who have few if any data scientists in house, and aren’t using any intelligent algorithms.  Our main focus right now is working with those companies to help them understand the use cases and how they integrate with the business model.

Longer term, we think there is an opportunity for an algorithm marketplace. This isn’t a new topic; one of our advisors, Ajay Ohri, also the author of Springer’s book on R, wrote about this idea back in 2011 (http://readwrite.com/2011/06/01/an-app-store-for-algorithms#awesm=~ohfvTpPiq6Jmt5). We’ve discussed this topic with folks at some of the potential players, like Google, who could be interested in this type of marketplace. Two of the primary gating factors for an algorithm marketplace are data quality and use cases. Data quality is still a fundamental challenge, and the really compelling business use cases today can be tackled with a relatively limited set of algorithms. As companies get more sophisticated data infrastructure in the next 2-3 years, the bar will begin to rise and an opportunity could emerge for commerce around algorithms. We’re doing a number of things on the technology and IP fronts to position us to play in this space when it emerges.

 

Using R for random number creation from time stamps #rstats

Suppose, let us just suppose, you want to create random numbers that are reproducible and derived from time stamps.

Here is the code in R

> a <- as.numeric(Sys.time())  # time stamp in seconds since the epoch
> set.seed(a)                  # seed the RNG so the draw is reproducible from a
> rnorm(log(a))                # log(a) is about 21, so this draws 21 normal deviates

Note: you can apply a custom function (I used log) to the system time as well. This creates a list of pseudo-random numbers (since nothing machine-driven is purely random in the strict sense of the word).

a <- as.numeric(Sys.time())
set.seed(a)
# scale up and take absolute values to get large pass-code-like numbers
abs(100000000 * rnorm(abs(log(a))))

[1]  39621645  99451316 109889294 110275233 278994547   6554596  38654159  68748122   8920823  13293010
[11]  57664241  24533980 174529340 105304151 168006526  39173857  12810354 145341412 241341095  86568818
[21] 105672257

Possible applications: things that need both random numbers (like encryption keys) and time stamps (like events, web or industrial logs, or pseudo-random pass codes as in Google 2-factor authentication).

Note: I used the rnorm function, but you could also draw the generating function itself at random (rnorm or rcauchy).
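The same idea sketched in Python, for comparison (time.time() plays the role of Sys.time(); the draws are reproducible from the recorded seed, but note that time-seeded PRNGs are not cryptographically secure):

```python
import math
import random
import time

a = time.time()       # time stamp in seconds since the epoch
random.seed(a)        # seed the PRNG so the draw is reproducible from a
n = int(math.log(a))  # about 21 for epoch times near 1.4e9

# Scaled absolute normal deviates, like abs(100000000 * rnorm(...)) in the R code.
draws = [abs(1e8 * random.gauss(0, 1)) for _ in range(n)]
print(len(draws))
```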

Again, I would trust my own randomness more than one generated by an arm of the US Govt (see http://www.nist.gov/itl/csd/ct/nist_beacon.cfm).

Update- Random numbers in R

http://stat.ethz.ch/R-manual/R-patched/library/base/html/Random.html

Details

The currently available RNG kinds are given below. kind is partially matched to this list. The default is "Mersenne-Twister".

"Wichmann-Hill"
The seed, .Random.seed[-1] == r[1:3] is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).

"Marsaglia-Multicarry":
A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list ‘sci.stat.math’. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).

"Super-Duper":
Marsaglia’s famous Super-Duper from the 70’s. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).

We use the implementation by Reeds et al. (1982–84).

The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S’s .Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.

"Mersenne-Twister":
From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.

"Knuth-TAOCP-2002":
A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is

X[j] = (X[j-100] - X[j-37]) mod 2^30

and the ‘seed’ is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.

"Knuth-TAOCP":
An earlier version from Knuth (1997).

The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported ‘weakness’, and already scrambled the seeds.

Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.

"L'Ecuyer-CMRG":
A ‘combined multiple-recursive generator’ from L’Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.

The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.

This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.

"user-supplied":
Use a user-supplied generator.

 

Function RNGkind allows user-coded uniform and normal random number generators to be supplied.

BigML goes on hyperdrive- exciting new features

Some changes in BigML.com, whose CEO I have interviewed here.

Their earlier innovation in making a marketplace for models (like the similar marketplaces for apps) was written about here.

I like the concept of BigMLer, a command line tool: https://bigml.com/bigmler

New changes are-

1) Text analysis now available. It seems like a rudimentary tdm (term document matrix), but I have yet to test whether I can do clustering within text data too.

2) A cloud server called BigML Predict Server, making adoption faster by addressing data hygiene concerns in sensitive industries like finance, etc.

3) Confusion matrix for evaluation, a long overdue step. Maybe some curves should be added to the evaluation here 😉

4) Miscellaneous technical upgrades that are more complex to execute and less interesting to write about:

  • multi label classification
  • secret urls for sharing models (view model only not data)
  • export to MS Excel ( maybe add Google docs export ?)
  • etc
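A confusion matrix of the kind mentioned in (3) is just a tally of (actual, predicted) label pairs; a minimal sketch in Python (the labels and predictions are made up for the example):

```python
from collections import Counter

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# Count each (actual, predicted) pair: rows are the actual class,
# columns the predicted class.
matrix = Counter(zip(actual, predicted))

print(matrix[("spam", "spam")], matrix[("spam", "ham")])  # 2 1
print(matrix[("ham", "ham")], matrix[("ham", "spam")])    # 2 1
```

The diagonal entries are correct predictions; everything off the diagonal is an error, which is what makes this a useful evaluation summary.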

Overall, with the addition of training courses as well, this is a new phase in this data science startup that I have been tracking for the past few years.
