Importing data from csv file using PySpark

There are two ways to import a csv file: as an RDD, or as a Spark DataFrame (preferred). MLlib is built around RDDs, while ML is generally built around DataFrames.

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc =SparkContext()

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

A SparkConf holds the configuration for a Spark application. It is used to set various Spark parameters as key-value pairs.
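A minimal sketch of that SparkConf-then-SparkContext sequence, assuming pyspark and a Java runtime are installed. The function name make_context, the application name and the local[*] master are made up for illustration and are not from the original post.

```python
def make_context(app_name="csv-import", master="local[*]"):
    """Build a SparkConf, then return the single active SparkContext."""
    # Imported inside the function so the sketch can be defined even on a
    # machine where pyspark is not installed yet.
    from pyspark import SparkConf, SparkContext

    # Key-value configuration: application name and where to run
    conf = SparkConf().setAppName(app_name).setMaster(master)

    # getOrCreate returns the already active context if there is one,
    # instead of failing because only one may be active per JVM.
    return SparkContext.getOrCreate(conf)
```

Calling sc = make_context() gives a context to work with, and sc.stop() releases it before a context with a different configuration is created.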




# Loads data as an RDD of text lines.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")


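# Loads data as a DataFrame. Be careful of indentation and whitespace.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

# header=true uses the first row as column names; DROPMALFORMED skips rows that cannot be parsed
dataframe2 = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("C:/Users/Ajay/Desktop/test/new_sample.csv")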


dataframe2.printSchema()  # comparable to str(dataframe) in R and dataframe.info() in Pandas

Is Python going to be better than R for Big Data Analytics and Data Science? #rstats #python

Until now the R ecosystem of package developers has mostly shrugged off the Big Data question. In a fascinating insight, Hadley Wickham said this in a recent interview. Shockingly, it mimics the FUD that you-know-who has been accused of (source).

5. How do you respond when you hear the phrase ‘big data’? Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

I think there are two particularly important transition points:

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of ram, so even small data is big!

* From one computer to many computers.

R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredibly slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.

2. Big data problems that are actually lots and lots of small data problems

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model. An example of this type of problem is recommender systems.

Ajay- One of the reasons for the lack of development of R Big Data packages is that it takes money. The private sector in the R ecosystem is a duopoly: Revolution Analytics (acquired by Microsoft) and RStudio (created by Microsoft alum JJ Allaire). Since RStudio as a company actively tries NOT to step into areas Revolution Analytics works in, it has not ventured into Big Data, in my opinion for strategic reasons.

The Revolution Analytics project on RHadoop is actually just one consultant working on it, and it has not been updated in six months.

We interviewed the creator of R Hadoop here

However, Python developers have been actively trying to develop systems for Big Data. The Hadoop ecosystem and the Python ecosystem are much more FOSS friendly, even in enterprise solutions.

This is where Python is innovating over R in Big Data:

  • Blaze: Translates NumPy/Pandas-like syntax to systems like databases.

    Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.

  • Odo: Migrates data between formats.

    Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions.

    odo takes two arguments, a target and a source for a data transfer.

    >>> from odo import odo
    >>> odo(source, target)  # load source into target 
  • Dask.array: Multi-core / on-disk NumPy arrays

    Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.

  • DyND: In-memory dynamic arrays

    DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature.

    The core DyND developer team consists of Mark Wiebe and Irwin Zaid. Much of the funding that made this project possible came through Continuum Analytics and DARPA-BAA-12-38, part of XDATA.

    LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.

    C++ is a first-class target of the library, the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.

    This library is being actively developed together with its Python bindings.

On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-disk by intelligently streaming data from disk and by leveraging all the cores of a modern CPU.

Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:

  • dask.array = numpy + threading
  • dask.bag = map, filter, toolz + multiprocessing
  • dask.dataframe = pandas + threading

Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to set up and use:

conda install dask
or
pip install dask
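As a rough illustration of what "blocked algorithms" means, the sketch below computes a mean one chunk at a time in plain NumPy. It is not dask code and does not use dask's API; blocked_mean and chunk_size are names made up for this example.

```python
import numpy as np

def blocked_mean(arr, chunk_size):
    """Mean of a 1-D array computed one chunk at a time."""
    total = 0.0
    count = 0
    for start in range(0, len(arr), chunk_size):
        block = arr[start:start + chunk_size]  # only this slice is used per step
        total += block.sum()
        count += block.size
    return total / count

data = np.arange(1_000_000, dtype=np.float64)
result = blocked_mean(data, chunk_size=100_000)

# The chunked result agrees with the ordinary all-at-once mean
print(result, np.isclose(result, data.mean()))
```

With an on-disk array such as np.memmap in place of data, only one block would have to be held in memory at a time, which is the fits-on-disk idea described above.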

When open source fights, closed source wins. When the Jedi fight, the Sith Lords will win.

So will R people rise to the Big Data challenge, or will they bury their heads in the sand like an ostrich or a kiwi? Will Python people learn from R design philosophies and try to incorporate more of them without reinventing the wheel?

Converting code from one language to another automatically?

How I wish there was some kind of automated conversion tool – that would convert a CRAN R package into a standard Python package which is pip installable

Machine learning for more machine learning anyone?

Top 15 functions for Analytics in Python #python #rstats #analytics

Here is a list of the top fifteen functions for analysis in Python:

  1. import (imports a particular package or library in Python)
  2. getcwd (from the os library) – gets the current working directory
  3. chdir (from os) – changes the working directory
  4. listdir (from os) – lists files in the specified directory
  5. read_csv (from pandas) – reads in a csv file
  6. objectname.info (like proc contents in SAS or str in R, it describes the object called objectname)
  7. objectname.columns (like proc contents in SAS or names in R, it gives the variable names of the object called objectname)
  8. objectname.head (like head in R, it prints the first few rows of the object called objectname)
  9. objectname.tail (like tail in R, it prints the last few rows of the object called objectname)
  10. len (length; for a data frame, the number of rows)
  11. objectname.ix[rows] (if rows is a list of numbers, this gives those rows (or index labels) of the object called objectname; .ix has since been removed from pandas, where objectname.loc[rows] does the same for index labels)
  12. groupby – groups by a categorical variable
  13. crosstab – cross-tabulation between two categorical variables
  14. describe – exploratory summary of numerical variables
  15. corr – correlation between numerical variables
In [1]:
import pandas as pd #importing packages
import os as os
In [2]:
os.getcwd() #current working directory
In [3]:
os.chdir('/home/ajay/Downloads') #changes the working directory
In [4]:
In [5]:
os.listdir(a) #lists all the files in a directory
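A self-contained sketch of the three os calls above. It uses a temporary directory and two made-up file names (a.csv, b.csv), so it does not depend on the /home/ajay/Downloads path from the session.

```python
import os
import tempfile

start_dir = os.getcwd()                      # remember where we started

with tempfile.TemporaryDirectory() as tmp:   # scratch directory for the example
    # create two empty files so listdir has something to show
    for name in ("a.csv", "b.csv"):
        open(os.path.join(tmp, name), "w").close()

    os.chdir(tmp)                            # change the working directory
    files = sorted(os.listdir(os.getcwd()))  # list files in the current directory
    print(files)
    os.chdir(start_dir)                      # go back before the directory is removed
```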

In [105]:
#note header =0 means we take the first row as a header (default) else we can specify header=None
In [106]:
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 3.9+ MB
In [36]:
diamonds.head()
   carat      cut  color  clarity  depth  table  price     x     y     z
0   0.23    Ideal      E      SI2   61.5     55    326  3.95  3.98  2.43
1   0.21  Premium      E      SI1   59.8     61    326  3.89  3.84  2.31
2   0.23     Good      E      VS1   56.9     65    327  4.05  4.07  2.31
3   0.29  Premium      I      VS2   62.4     58    334  4.20  4.23  2.63
4   0.31     Good      J      SI2   63.3     58    335  4.34  4.35  2.75
In [37]:
diamonds.tail(10)
       carat        cut  color  clarity  depth  table  price     x     y     z
53930   0.71    Premium      E      SI1   60.5     55   2756  5.79  5.74  3.49
53931   0.71    Premium      F      SI1   59.8     62   2756  5.74  5.73  3.43
53932   0.70  Very Good      E      VS2   60.5     59   2757  5.71  5.76  3.47
53933   0.70  Very Good      E      VS2   61.2     59   2757  5.69  5.72  3.49
53934   0.72    Premium      D      SI1   62.7     59   2757  5.69  5.73  3.58
53935   0.72      Ideal      D      SI1   60.8     57   2757  5.75  5.76  3.50
53936   0.72       Good      D      SI1   63.1     55   2757  5.69  5.75  3.61
53937   0.70  Very Good      D      SI1   62.8     60   2757  5.66  5.68  3.56
53938   0.86    Premium      H      SI2   61.0     58   2757  6.15  6.12  3.74
53939   0.75      Ideal      D      SI2   62.2     55   2757  5.83  5.87  3.64
In [38]:
diamonds.columns
Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')
In [92]:
b=len(diamonds) #this is the total population size
In [93]:
import numpy as np
In [98]:
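rows = np.random.choice(diamonds.index.values, int(0.0001*b)) #the sample size must be an integer (5 here)
sampled_df = diamonds.loc[rows] #.loc selects by index label (.ix has been removed from pandas)
print(rows)
[45653  7503 47794 12017 46125]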
In [99]:
sampled_df
       carat      cut  color  clarity  depth  table  price     x     y     z
45653   0.25    Ideal      H       IF   61.4     57    525  4.05  4.08  2.49
7503    1.05  Premium      G      SI2   61.3     58   4241  6.55  6.60  4.03
47794   0.71    Ideal      J      VS2   62.4     54   1899  5.72  5.76  3.58
12017   1.00  Premium      F      SI1   59.8     59   5151  6.55  6.49  3.90
46125   0.51    Ideal      F      VS1   61.7     54   1744  5.14  5.17  3.18
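In current pandas a random subset like the one above can also be drawn with DataFrame.sample. The sketch below uses a small made-up frame in place of the diamonds data. Note one difference: np.random.choice draws with replacement by default, while sample draws distinct rows unless replace=True is passed.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the diamonds data, with a non-trivial index
df = pd.DataFrame({"carat": np.linspace(0.2, 2.0, 50),
                   "price": np.arange(50) * 100},
                  index=np.arange(1000, 1050))

# Draw 5 distinct rows; random_state makes the draw repeatable
sampled = df.sample(n=5, random_state=0)

# frac plays the role of the 0.0001*b fraction above (here 10% of 50 rows)
sampled_frac = df.sample(frac=0.1, random_state=0)

print(sampled.index.tolist())
print(len(sampled_frac))
```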
In [108]:
diamonds.describe()
              carat         depth         table         price             x             y             z
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157      5.734526      3.538734
std        0.474011      1.432621      2.234491   3989.439738      1.121761      1.142135      0.705699
min        0.200000     43.000000     43.000000    326.000000      0.000000      0.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000      4.720000      2.910000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000      5.710000      3.530000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000      6.540000      4.040000
max        5.010000     79.000000     95.000000  18823.000000     10.740000     58.900000     31.800000
In [109]:
In [110]:
diamonds.groupby('cut').count()
           carat  color  clarity  depth  table  price      x      y      z
Fair        1610   1610     1610   1610   1610   1610   1610   1610   1610
Good        4906   4906     4906   4906   4906   4906   4906   4906   4906
Ideal      21551  21551    21551  21551  21551  21551  21551  21551  21551
Premium    13791  13791    13791  13791  13791  13791  13791  13791  13791
Very Good  12082  12082    12082  12082  12082  12082  12082  12082  12082
In [114]:
diamonds.groupby('cut').mean()
              carat      depth      table        price         x         y         z
Fair       1.046137  64.041677  59.053789  4358.757764  6.246894  6.182652  3.982770
Good       0.849185  62.365879  58.694639  3928.864452  5.838785  5.850744  3.639507
Ideal      0.702837  61.709401  55.951668  3457.541970  5.507451  5.520080  3.401448
Premium    0.891955  61.264673  58.746095  4584.257704  5.973887  5.944879  3.647124
Very Good  0.806381  61.818275  57.956150  3981.759891  5.740696  5.770026  3.559801
In [115]:
diamonds.groupby('cut').median()
           carat  depth  table   price      x     y     z
Fair        1.00   65.0     58  3282.0  6.175  6.10  3.97
Good        0.82   63.4     58  3050.5  5.980  5.99  3.70
Ideal       0.54   61.8     56  1810.0  5.250  5.26  3.23
Premium     0.86   61.4     59  3185.0  6.110  6.06  3.72
Very Good   0.71   62.1     58  2648.0  5.740  5.77  3.56
In [117]:
pd.crosstab(diamonds.cut, diamonds.color)
color         D     E     F     G     H     I     J
Fair        163   224   312   314   303   175   119
Good        662   933   909   871   702   522   307
Ideal      2834  3903  3826  4884  3115  2093   896
Premium    1603  2337  2331  2924  2360  1428   808
Very Good  1513  2400  2164  2299  1824  1204   678
In [121]:
diamonds.corr()
           carat      depth      table      price          x          y          z
carat   1.000000   0.028224   0.181618   0.921591   0.975094   0.951722   0.953387
depth   0.028224   1.000000  -0.295779  -0.010647  -0.025289  -0.029341   0.094924
table   0.181618  -0.295779   1.000000   0.127134   0.195344   0.183760   0.150929
price   0.921591  -0.010647   0.127134   1.000000   0.884435   0.865421   0.861249
x       0.975094  -0.025289   0.195344   0.884435   1.000000   0.974701   0.970772
y       0.951722  -0.029341   0.183760   0.865421   0.974701   1.000000   0.952006
z       0.953387   0.094924   0.150929   0.861249   0.970772   0.952006   1.000000
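The session above depends on a diamonds csv file. As a compact sketch that runs without it, the same functions can be tried on a small made-up frame (the values below are invented for illustration). Recent pandas versions raise an error when mean() or corr() meet text columns, so the numeric columns are selected explicitly here.

```python
import pandas as pd

# Small made-up frame standing in for the diamonds file
df = pd.DataFrame({
    "cut":   ["Ideal", "Good", "Ideal", "Premium", "Good", "Ideal"],
    "color": ["E", "E", "J", "I", "J", "E"],
    "carat": [0.23, 0.21, 0.31, 0.29, 0.40, 0.50],
    "price": [326, 326, 335, 334, 500, 700],
})

print(len(df))               # number of rows
print(df.columns.tolist())   # variable names
print(df.head(2))            # first rows
print(df.describe())         # summary of the numeric columns

counts = df.groupby("cut").size()                     # rows per category
means = df.groupby("cut")[["carat", "price"]].mean()  # group means, numeric columns only
table = pd.crosstab(df["cut"], df["color"])           # two-way frequency table
corr = df[["carat", "price"]].corr()                  # correlation, numeric columns only
```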

Polyglots for Data Science #python #sas #r #stats #spss #matlab #julia #octave

In the future I think analysts need to be polyglots- you will need to know more than one language for crunching data.

SAS, Python, R, Julia, SPSS, Matlab- Pick Any Two 😉 or Any Three.

No, you can’t count C or Java as a statistical  language 🙂 🙂

Efforts to promote Polyglots in Statistical Software are-

1) R for SAS and SPSS Users (free or book)

2) R for Stata Users (book)

3) SAS and R (blog and book)

4) Using Python and R together

Probably we need a Python and R for Data Analysis book- just like we have for SAS and R books.

5) Matlab   and R

Reference ( ) includes Python

5) Octave and R

package includes Matlab


6) Julia and python

  • PyPlot uses the Julia PyCall package to call Python’s matplotlib directly from Julia

7) SPSS and Python is here

8) SPSS and R is as below

  • The Essentials for R for Statistics versions 22, 21, 20, and 19 are available here.
  • This link will take you to the SourceForge site where the Version 18 Essentials and Plugins are hosted.


9) Using R from Clojure – Incanter

Use embedded R from Clojure and Incanter

NumFocus- The Python Statistical Community

I really liked the mature design and foundation of this charitable organization. While it is similar to FOAS in many ways, I like the projects. They are excellent projects, some of which I think should be featured in the Journal of Statistical Software (since there is a separate R Journal), unless it wants to be overtly R focused.


In the same manner, I think some non-Python projects should try to reach out to NumFocus (if it does not want to be so PyFocus-ed).

Here it is NumFocus

NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.

Money donated through NumFOCUS goes to sponsor things like:

  • Coding sprints (food and travel)
  • Technical fellowships (sponsored students and mentors to work on code)
  • Equipment grants (to developers and projects)
  • Conference attendance for students (to PyData, SciPy, and other conferences)
  • Fees for continuous integration and other software engineering tools
  • Documentation development
  • Web-page hosting and bandwidth fees for projects

Core Projects


NumPy is the fundamental package needed for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


SciPy is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.


matplotlib is a 2D plotting library for Python that produces high quality figures that can be used in various hardcopy and interactive environments. matplotlib is compatible with Python scripts and the Python and IPython shells.


IPython is a high quality open source Python shell that includes tools for high level and interactive parallel computing.


SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

Other Projects


Cython is a language based on Pyrex that makes writing C extensions for Python as easy as writing them in Python itself. Cython supports calling C functions and declaring C types on variables and class attributes, allowing the compiler to generate very efficient C code from Cython code.


pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features a Pythonic interface combined with C / Cython extensions for the performance-critical parts of the code. This makes it a fast, yet extremely easy to use tool for very large amounts of data.


scikit-image is a free, high-quality, peer-reviewed, volunteer-produced collection of algorithms for image processing.


scikit-learn is a module designed for scientific Python that provides accessible solutions to machine learning problems.


Statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics and estimation of statistical models.


Spyder is an interactive development environment for Python that features advanced editing, interactive testing, debugging and introspection capabilities, as well as a numerical computing environment made possible through the support of IPython, NumPy, SciPy, and matplotlib.


Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Associated Projects

NumFOCUS is currently looking for representatives to enable us to promote the following projects. For information contact us at:


Sage is an open source mathematics software system that combines existing open-source packages into a Python-based interface.


NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.


Python(x,y) is free scientific and engineering development software used for numerical computations, and analysis and visualization of data using the Python programming language.


How to help your government keep the world safe using statistics #rstats #python #sas

Big Data for Big Brother. Now playing. At a computer near you. How to help water the tree of liberty using statistics?

Use R



Use Python




or use SAS software

SAS/CIA from the last paragraph of
