NumFOCUS: The Python Statistical Community

I really liked the mature design and foundation of this charitable organization. While it is similar to FOAS in many ways (http://www.foastat.org/projects.html), I like the projects it supports. They are excellent, and I think some of them should be featured in the Journal of Statistical Software (since there is a separate R Journal), unless it wants to be overtly R-focused.

 

In the same manner, I think some non-Python projects should try to reach out to NumFOCUS (if it does not want to be so PyFocus-ed).

Here it is: NumFOCUS

NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.

Money donated through NumFOCUS goes to sponsor things like:

  • Coding sprints (food and travel)
  • Technical fellowships (sponsored students and mentors to work on code)
  • Equipment grants (to developers and projects)
  • Conference attendance for students (to PyData, SciPy, and other conferences)
  • Fees for continuous integration and other software engineering tools
  • Documentation development
  • Web-page hosting and bandwidth fees for projects

Core Projects

NumPy

NumPy is the fundamental package needed for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. Repositories for NumPy binaries: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy, a variety of versions – http://sourceforge.net/projects/numpy/files/NumPy/, version 1.6.1 – http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/.

SciPy

SciPy is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.

Matplotlib

2D plotting library for Python that produces high quality figures that can be used in various hardcopy and interactive environments. matplotlib is compatible with Python scripts and the Python and IPython shells.

IPython

High quality open source Python shell that includes tools for high level and interactive parallel computing.

SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

Other Projects

Cython

Cython is a language based on Pyrex that makes writing C extensions for Python as easy as writing them in Python itself. Cython supports calling C functions and declaring C types on variables and class attributes, allowing the compiler to generate very efficient C code from Cython code.

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

PyTables

PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features a Pythonic interface combined with C / Cython extensions for the performance-critical parts of the code. This makes it a fast, yet extremely easy to use tool for very large amounts of data. http://pytables.github.com/

scikit-image

Free, high-quality, peer-reviewed, volunteer-produced collection of algorithms for image processing.

scikit-learn

Module designed for scientific Python that provides accessible solutions to machine learning problems.

Scikits-Statsmodels

Statsmodels is a Python package that provides a complement to SciPy for statistical computations including descriptive statistics and estimation of statistical models.

Spyder

Interactive development environment for Python that features advanced editing, interactive testing, debugging and introspection capabilities, as well as a numerical computing environment made possible through the support of IPython, NumPy, SciPy, and matplotlib.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Associated Projects

NumFOCUS is currently looking for representatives to enable us to promote the following projects. For information contact us at: info@NumFOCUS.org.

Sage

Open source mathematics software system that combines existing open-source packages into a Python-based interface.

NetworkX

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Python(X,Y)

Free scientific and engineering development software used for numerical computations, and analysis and visualization of data using the Python programming language.

 

Iris for Big Data #rstats #bigdata

Quote of the Day-

it is impossible to be a data scientist without knowing iris 

#Anonymous #Quotes

 

Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.

http://www.revolutionanalytics.com/subscriptions/datasets/

http://packages.revolutionanalytics.com/datasets/

Site was updated so here are the new links

 

While the datasets collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and “tidy” csv versions. A little bit of RODBC usage should help, and so will some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than doing it alone.
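As a rough illustration of what a “tidy” csv version could look like, here is a minimal sketch using only base R and the built-in iris data. The file name iris_tidy.csv is hypothetical, and this says nothing about how Revolution Analytics actually builds its files.

```r
# Start from a clean copy of the built-in iris data
data(iris)

# Hypothetical output location: a temporary directory, so nothing is overwritten
out_file <- file.path(tempdir(), "iris_tidy.csv")

# One row per observation, one column per variable, no row-name column
write.csv(iris, out_file, row.names = FALSE)

# Read it back to confirm the round trip keeps the same shape
iris_back <- read.csv(out_file, stringsAsFactors = TRUE)
str(iris_back)
```

Dropping the row names keeps the file easy to load in other tools, which is most of what “tidy” asks for here.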

 

Overall, there could be an R package for this (like a Big Data version of the famous datasets package in R).

But it is a nice and very useful effort.

Revolution R Datasets

More code-

http://blog.revolutionanalytics.com/2013/08/big-data-sets-for-r.html

Also, a recent project made by a student of mine using the Revolution datasets and their blog posts.

Note how much better the above project is than using the mini and super-clean datasets within R (like Boston).

 

Hat tip: R’s very own Mr Smith
Unrelated:
For more on IRIS

 

Using ifelse in R for creating new variables #rstats #data #manipulation

The ifelse function is simple and powerful, and can help with data manipulation within R. Here I create a categorical variable from specific values in a numeric variable.

> data(iris)
> iris$Type <- ifelse(iris$Sepal.Length < 5.8, "Small Flower", "Big Flower")
> table(iris$Type)

  Big Flower Small Flower
          77           73
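The split above can be cross-checked against the logical test itself. This is a minimal sketch in base R; the name type_check and the choice to store Type as a factor are illustrative assumptions, not part of the original example.

```r
# Rebuild the Type variable from the built-in iris data
data(iris)
iris$Type <- ifelse(iris$Sepal.Length < 5.8, "Small Flower", "Big Flower")

# ifelse returns a character vector here; a factor is often more convenient
# for grouping or modelling
iris$Type <- factor(iris$Type, levels = c("Small Flower", "Big Flower"))

# Count the rows that pass the test and compare with the new variable
type_check <- sum(iris$Sepal.Length < 5.8)
stopifnot(type_check == sum(iris$Type == "Small Flower"))

# Mean sepal length within each new category
aggregate(Sepal.Length ~ Type, data = iris, FUN = mean)
```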

The parameters of ifelse are quite simple:

Usage

ifelse(test, yes, no)
Arguments

test
an object which can be coerced to logical mode.

yes
return values for true elements of test.

no
return values for false elements of test.
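To see how the three arguments behave, here is a small sketch on a made-up vector x. The values and the 5.0 and 6.5 cut-offs are arbitrary and only for illustration.

```r
# A small made-up vector, including a missing value
x <- c(4.9, 5.8, 6.3, NA)

# test is evaluated element by element; an NA in test gives NA in the result
size <- ifelse(x < 5.8, "Small", "Big")
print(size)

# yes and no can be vectors too; here values above 6 are capped at 6
# and everything else is passed through unchanged
capped <- ifelse(x > 6, 6, x)
print(capped)

# Nesting ifelse in the no argument gives more than two categories
size3 <- ifelse(x < 5.0, "Small",
                ifelse(x < 6.5, "Medium", "Large"))
print(size3)
```

For more than a handful of categories, nested ifelse calls get hard to read, and the base R function cut may be a better fit.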