Change Python Version for Jupyter Notebook

Sometimes package dependencies force analysts and developers to use an older version of Python. Here are three ways to do it.

  1. Use conda to downgrade the Python version (if Anaconda is already installed):

conda install python=3.5.0

Hat tip- http://chris35wills.github.io/conda_python_version/

https://docs.anaconda.com/anaconda/faq#how-do-i-get-the-latest-anaconda-with-python-3-5

  2. Download the latest version of Anaconda and then create a Python 3.5 environment.

To create the new environment for Python 3.5, run the following in your Terminal window or an Anaconda Prompt:

conda create -n py35 python=3.5 anaconda


  3. Uninstall Anaconda and install an older version from https://repo.continuum.io/archive/ (download the most recent Anaconda that included Python 3.5 by default, Anaconda 4.2.0).
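Whichever option you pick, the new environment still has to be registered as a kernel before Jupyter Notebook can use it. A minimal sketch, assuming the ipykernel package and the py35 environment name from option 2:

# activate the new environment first (on Windows: activate py35)
source activate py35
# install the kernel machinery inside the environment and register it with Jupyter
conda install ipykernel
python -m ipykernel install --user --name py35 --display-name "Python 3.5 (py35)"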

Importing data from a csv file using PySpark

There are two ways to import a csv file: as an RDD or as a Spark DataFrame (preferred). MLlib is built around RDDs, while ML is generally built around DataFrames. See https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc = SparkContext()

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one.

A SparkConf is the configuration for a Spark application, used to set various Spark parameters as key-value pairs.
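A minimal sketch of building a configured SparkContext from a SparkConf (the app name "csv-import-demo" and master "local[*]" are placeholder choices, not from the original code):

# only one SparkContext may be active per JVM, so stop the bare one created above first
sc.stop()

conf = SparkConf().setAppName("csv-import-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)  # confirm the new context is up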

dir(SparkContext)

['PACKAGE_EXTENSIONS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_active_spark_context',
 '_dictToJavaMap',
 '_do_init',
 '_ensure_initialized',
 '_gateway',
 '_getJavaStorageLevel',
 '_initialize_context',
 '_jvm',
 '_lock',
 '_next_accum_id',
 '_python_includes',
 '_repr_html_',
 'accumulator',
 'addFile',
 'addPyFile',
 'applicationId',
 'binaryFiles',
 'binaryRecords',
 'broadcast',
 'cancelAllJobs',
 'cancelJobGroup',
 'defaultMinPartitions',
 'defaultParallelism',
 'dump_profiles',
 'emptyRDD',
 'getConf',
 'getLocalProperty',
 'getOrCreate',
 'hadoopFile',
 'hadoopRDD',
 'newAPIHadoopFile',
 'newAPIHadoopRDD',
 'parallelize',
 'pickleFile',
 'range',
 'runJob',
 'sequenceFile',
 'setCheckpointDir',
 'setJobGroup',
 'setLocalProperty',
 'setLogLevel',
 'setSystemProperty',
 'show_profiles',
 'sparkUser',
 'startTime',
 'statusTracker',
 'stop',
 'textFile',
 'uiWebUrl',
 'union',
 'version',
 'wholeTextFiles']

# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(data)

pyspark.rdd.RDD
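At this point the RDD is just raw lines of text, so splitting each line into fields is up to you. A rough sketch, assuming a comma-delimited file with a header row (adjust the delimiter to match your file):

header = data.first()  # the first line of the file
rows = data.filter(lambda line: line != header) \
           .map(lambda line: line.split(","))  # split each remaining line into fields
rows.take(2)  # peek at the first two parsed records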
# Loads data as a Spark DataFrame. Be careful of indentation and whitespace in the multi-line builder below.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

dataframe2 = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(dataframe2)

pyspark.sql.dataframe.DataFrame
dataframe2.printSchema()  # same as str(dataframe) in R or dataframe.info() in pandas
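A few quick follow-up checks on the DataFrame, roughly the pandas/R equivalents of head() and summary(); the output obviously depends on your file:

dataframe2.show(5)  # first five rows, like head()
dataframe2.columns  # column names, like names() in R
dataframe2.describe().show()  # summary statistics for the numeric columns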

Is Python going to be better than R for Big Data Analytics and Data Science? #rstats #python

Until now the R ecosystem of package developers has mostly shrugged off the Big Data question. In a fascinating insight, Hadley Wickham said this in a recent interview (shockingly, it mimics the FUD that you-know-who has been accused of). Source:

https://peadarcoyle.wordpress.com/2015/08/02/interview-with-a-data-scientist-hadley-wickham/

5. How do you respond when you hear the phrase ‘big data’? Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

I think there are two particularly important transition points:

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of ram, so even small data is big!

* From one computer to many computers.

R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredibly slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.

2. Big data problems that are actually lots and lots of small data problems

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model. An example of this type of problem is recommender systems.

Ajay- One of the reasons for the non-development of R Big Data packages is that it takes money. The private sector in the R ecosystem is a duopoly: Revolution Analytics (acquired by Microsoft) and RStudio (created by Microsoft alum JJ Allaire). Since RStudio as a company actively tries NOT to step into areas Revolution Analytics works in, it has not ventured into Big Data, in my opinion for strategic reasons.

Revolution Analytics' RHadoop project actually has just one consultant working on it (https://github.com/RevolutionAnalytics/RHadoop), and it has not been updated in six months.

We interviewed the creator of R Hadoop here https://decisionstats.com/2014/07/10/interview-antonio-piccolboni-big-data-analytics-rhadoop-rstats/

Python developers, however, have been actively developing systems for Big Data. The Hadoop ecosystem and the Python ecosystem are much more FOSS-friendly, even in enterprise solutions.

This is where Python is innovating over R in Big Data-

http://blaze.pydata.org/en/latest/

  • Blaze: Translates NumPy/Pandas-like syntax to systems like databases.

    Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.

  • Odo: Migrates data between formats.

    Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions. http://odo.pydata.org/en/latest/perf.html

    odo takes two arguments, a source and a target, for a data transfer (a concrete sketch follows this list).

    >>> from odo import odo
    >>> odo(source, target)  # load source into target 
  • Dask.array: Multi-core / on-disk NumPy arrays

    Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.

  • DyND: In-memory dynamic arrays

    DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature.  https://github.com/libdynd/libdynd

    The core DyND developer team consists of Mark Wiebe and Irwin Zaid. Much of the funding that made this project possible came through Continuum Analytics and DARPA-BAA-12-38, part of XDATA.

    LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.

    C++ is a first-class target of the library; the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.

    This library is being actively developed together with its Python bindings.
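As promised in the Odo bullet above, here is a concrete sketch; the file name diamonds.csv and the SQLite URI are illustrative, not taken from the Odo docs:

import pandas as pd
from odo import odo

df = odo("diamonds.csv", pd.DataFrame)  # CSV file -> in-memory pandas DataFrame
odo("diamonds.csv", "sqlite:///diamonds.db::diamonds")  # CSV file -> a table in a SQLite database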

http://dask.pydata.org/en/latest/

On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-disk by intelligently streaming data from disk and by leveraging all the cores of a modern CPU.

Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:

  • dask.array = numpy + threading
  • dask.bag = map, filter, toolz + multiprocessing
  • dask.dataframe = pandas + threading

Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to set up and use:

conda install dask
or
pip install dask
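For example, a minimal dask.dataframe sketch (the file name is illustrative; any larger-than-memory CSV works the same way):

import dask.dataframe as dd

df = dd.read_csv("diamonds.csv")  # lazy, partitioned dataframe; nothing is loaded yet
print(df.groupby("cut").price.mean().compute())  # .compute() triggers the actual parallel work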

https://github.com/cloudera/ibis

When open source fights, closed source wins. When the Jedi fight, the Sith Lords will win.

So will R people rise to the Big Data challenge, or will they bury their heads in the sand like an ostrich or a kiwi? Will Python people learn from R's design philosophies and incorporate more of them without reinventing the wheel?

Converting code from one language to another automatically?

How I wish there were some kind of automated conversion tool that would convert a CRAN R package into a standard, pip-installable Python package.

Machine learning for more machine learning anyone?

Top 15 functions for Analytics in Python #python #rstats #analytics

Here is a list of the top fifteen functions for analysis in Python:

  1. import (imports a particular package or library in Python)
  2. getcwd (from the os library) – gets the current working directory
  3. chdir (from os) – changes the directory
  4. listdir (from os) – lists files in the specified directory
  5. read_csv (from pandas) – reads in a csv file
  6. objectname.info (like proc contents in SAS or str in R; it describes the object called objectname)
  7. objectname.columns (like proc contents in SAS or names in R; it gives the variable names of the object called objectname)
  8. objectname.head (like head in R; it prints the first few rows of the object called objectname)
  9. objectname.tail (like tail in R; it prints the last few rows of the object called objectname)
  10. len (length)
  11. objectname.ix[rows] (if rows is a list of numbers, this gives those rows, by index, of the object called objectname)
  12. groupby – groups by a categorical variable
  13. crosstab – cross tab between two categorical variables
  14. describe – exploratory data analysis of numerical variables
  15. corr – correlation between numerical variables
In [1]:
import pandas as pd #importing packages
import os as os
In [2]:
os.getcwd() #current working directory
Out[2]:
'/home/ajay/Desktop'
In [3]:
os.chdir('/home/ajay/Downloads') #changes the working directory
In [4]:
os.getcwd()
Out[4]:
'/home/ajay/Downloads'
In [5]:
a=os.getcwd()
os.listdir(a) #lists all the files in a directory

In [105]:
diamonds=pd.read_csv("diamonds.csv")
# note: header=0 means we take the first row as the header (default); otherwise we can specify header=None
In [106]:
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 3.9+ MB
In [36]:
diamonds.head()
Out[36]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
In [37]:
diamonds.tail(10)
Out[37]:
carat cut color clarity depth table price x y z
53930 0.71 Premium E SI1 60.5 55 2756 5.79 5.74 3.49
53931 0.71 Premium F SI1 59.8 62 2756 5.74 5.73 3.43
53932 0.70 Very Good E VS2 60.5 59 2757 5.71 5.76 3.47
53933 0.70 Very Good E VS2 61.2 59 2757 5.69 5.72 3.49
53934 0.72 Premium D SI1 62.7 59 2757 5.69 5.73 3.58
53935 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
In [38]:
diamonds.columns
Out[38]:
Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')
In [92]:
b=len(diamonds) #this is the total population size
print(b)
53940
In [93]:
import numpy as np
In [98]:
rows = np.random.choice(diamonds.index.values, int(0.0001*b))  # sample size must be an integer
print(rows)
sampled_df = diamonds.ix[rows]
[45653  7503 47794 12017 46125]
In [99]:
sampled_df
Out[99]:
carat cut color clarity depth table price x y z
45653 0.25 Ideal H IF 61.4 57 525 4.05 4.08 2.49
7503 1.05 Premium G SI2 61.3 58 4241 6.55 6.60 4.03
47794 0.71 Ideal J VS2 62.4 54 1899 5.72 5.76 3.58
12017 1.00 Premium F SI1 59.8 59 5151 6.55 6.49 3.90
46125 0.51 Ideal F VS1 61.7 54 1744 5.14 5.17 3.18
In [108]:
diamonds.describe()
Out[108]:
carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000
In [109]:
cut=diamonds.groupby("cut")
In [110]:
cut.count()
Out[110]:
carat color clarity depth table price x y z
cut
Fair 1610 1610 1610 1610 1610 1610 1610 1610 1610
Good 4906 4906 4906 4906 4906 4906 4906 4906 4906
Ideal 21551 21551 21551 21551 21551 21551 21551 21551 21551
Premium 13791 13791 13791 13791 13791 13791 13791 13791 13791
Very Good 12082 12082 12082 12082 12082 12082 12082 12082 12082
In [114]:
cut.mean()
Out[114]:
carat depth table price x y z
cut
Fair 1.046137 64.041677 59.053789 4358.757764 6.246894 6.182652 3.982770
Good 0.849185 62.365879 58.694639 3928.864452 5.838785 5.850744 3.639507
Ideal 0.702837 61.709401 55.951668 3457.541970 5.507451 5.520080 3.401448
Premium 0.891955 61.264673 58.746095 4584.257704 5.973887 5.944879 3.647124
Very Good 0.806381 61.818275 57.956150 3981.759891 5.740696 5.770026 3.559801
In [115]:
cut.median()
Out[115]:
carat depth table price x y z
cut
Fair 1.00 65.0 58 3282.0 6.175 6.10 3.97
Good 0.82 63.4 58 3050.5 5.980 5.99 3.70
Ideal 0.54 61.8 56 1810.0 5.250 5.26 3.23
Premium 0.86 61.4 59 3185.0 6.110 6.06 3.72
Very Good 0.71 62.1 58 2648.0 5.740 5.77 3.56
In [117]:
pd.crosstab(diamonds.cut, diamonds.color)
Out[117]:
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
In [121]:
diamonds.corr()
Out[121]:
carat depth table price x y z
carat 1.000000 0.028224 0.181618 0.921591 0.975094 0.951722 0.953387
depth 0.028224 1.000000 -0.295779 -0.010647 -0.025289 -0.029341 0.094924
table 0.181618 -0.295779 1.000000 0.127134 0.195344 0.183760 0.150929
price 0.921591 -0.010647 0.127134 1.000000 0.884435 0.865421 0.861249
x 0.975094 -0.025289 0.195344 0.884435 1.000000 0.974701 0.970772
y 0.951722 -0.029341 0.183760 0.865421 0.974701 1.000000 0.952006
z 0.953387 0.094924 0.150929 0.861249 0.970772 0.952006 1.000000
 

Polyglots for Data Science #python #sas #r #stats #spss #matlab #julia #octave

In the future I think analysts will need to be polyglots: you will need to know more than one language for crunching data.

SAS, Python, R, Julia, SPSS, Matlab – Pick Any Two 😉 or Any Three.

No, you can’t count C or Java as a statistical language 🙂 🙂

Efforts to promote polyglots in statistical software include:

1) R for SAS and SPSS Users (free or book)

2) R for Stata Users (book)

3) SAS and R (blog and book)

4) Using Python and R together (see the rpy2 sketch at the end of this list)

Probably we need a Python and R for Data Analysis book- just like we have for SAS and R books.

5) Matlab and R

Reference (http://mathesaurus.sourceforge.net/matlab-python-xref.pdf ) includes Python

6) Octave and R

package http://cran.r-project.org/web/packages/RcppOctave/vignettes/RcppOctave.pdf includes Matlab

reference http://cran.r-project.org/doc/contrib/R-and-octave.txt

7) Julia and Python

  • PyPlot uses the Julia PyCall package to call Python’s matplotlib directly from Julia

8) SPSS and Python is here

9) SPSS and R is as below

  • The Essentials for R for Statistics versions 22, 21, 20, and 19 are available here.
  • This link will take you to the SourceForge site where the Version 18 Essentials and Plugins are hosted.

     

10) Using R from Clojure – Incanter

Use embedded R from Clojure and Incanter http://github.com/jolby/rincanter
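On point 4 above (using Python and R together), the most common bridge is rpy2; a minimal sketch, assuming R and the rpy2 package are installed:

import rpy2.robjects as robjects

print(robjects.r("summary(rnorm(100))"))  # run an R expression and bring the result back
r_sum = robjects.r["sum"]  # grab an R function object
print(r_sum(robjects.IntVector([1, 2, 3])))  # call it on a Python-built R vector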