Top 15 functions for Analytics in Python #python #rstats #analytics

Here is a list of the top fifteen functions for analysis in Python:

  1. import (imports a particular package or library in Python)
  2. getcwd (from os) – gets the current working directory
  3. chdir (from os) – changes the working directory
  4. listdir (from os) – lists the files in the specified directory
  5. read_csv (from pandas) – reads in a csv file
  6. objectname.info (like proc contents in SAS or str in R, it describes the object called objectname)
  7. objectname.columns (like proc contents in SAS or names in R, it lists the variable names of the object called objectname)
  8. objectname.head (like head in R, it prints the first few rows of the object called objectname)
  9. objectname.tail (like tail in R, it prints the last few rows of the object called objectname)
  10. len (length, which for a data frame is the number of rows)
  11. objectname.ix[rows] (if rows is a list of numbers, this gives those rows (or index values) of the object called objectname; .ix was removed in pandas 1.0, so use objectname.loc[rows] for labels or objectname.iloc[rows] for positions)
  12. groupby – groups by a categorical variable
  13. crosstab – cross tabulation between two categorical variables
  14. describe – exploratory summary statistics for numerical variables
  15. corr – correlation between numerical variables
In [1]:
import pandas as pd #importing packages
import os
In [2]:
os.getcwd() #current working directory
Out[2]:
'/home/ajay/Desktop'
In [3]:
os.chdir('/home/ajay/Downloads') #changes the working directory
In [4]:
os.getcwd()
Out[4]:
'/home/ajay/Downloads'
In [5]:
a=os.getcwd()
os.listdir(a) #lists all the files in a directory

In [105]:
diamonds=pd.read_csv("diamonds.csv")
#note header =0 means we take the first row as a header (default) else we can specify header=None
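To make the header note concrete, here is a minimal sketch that reads the same text twice, once with the default header and once with header=None. The three-line CSV string is made up for illustration (it is not the diamonds file), and io.StringIO just stands in for a file on disk.

```python
import io

import pandas as pd

# Tiny stand-in for a CSV file (hypothetical data, not diamonds.csv).
csv_text = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n"

# Default (header=0): the first row becomes the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is treated as data and the columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist())
print(no_header.columns.tolist())
print(len(with_header), len(no_header))
```

With header=None the original header line ends up as the first data row, which is why the second frame has one more row than the first.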
In [106]:
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 3.9+ MB
In [36]:
diamonds.head()
Out[36]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
In [37]:
diamonds.tail(10)
Out[37]:
carat cut color clarity depth table price x y z
53930 0.71 Premium E SI1 60.5 55 2756 5.79 5.74 3.49
53931 0.71 Premium F SI1 59.8 62 2756 5.74 5.73 3.43
53932 0.70 Very Good E VS2 60.5 59 2757 5.71 5.76 3.47
53933 0.70 Very Good E VS2 61.2 59 2757 5.69 5.72 3.49
53934 0.72 Premium D SI1 62.7 59 2757 5.69 5.73 3.58
53935 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
In [38]:
diamonds.columns
Out[38]:
Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')
In [92]:
b=len(diamonds) #this is the total population size
print(b)
53940
In [93]:
import numpy as np
In [98]:
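rows = np.random.choice(diamonds.index.values, int(0.0001*b)) #sample size must be an integer
print(rows)
sampled_df = diamonds.loc[rows] #.loc selects rows by index label (.ix has been removed from pandas)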
[45653  7503 47794 12017 46125]
In [99]:
sampled_df
Out[99]:
carat cut color clarity depth table price x y z
45653 0.25 Ideal H IF 61.4 57 525 4.05 4.08 2.49
7503 1.05 Premium G SI2 61.3 58 4241 6.55 6.60 4.03
47794 0.71 Ideal J VS2 62.4 54 1899 5.72 5.76 3.58
12017 1.00 Premium F SI1 59.8 59 5151 6.55 6.49 3.90
46125 0.51 Ideal F VS1 61.7 54 1744 5.14 5.17 3.18
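The random sample above can also be drawn without going through NumPy at all. Below is a rough sketch of both routes on a made-up frame (the column values are invented, and the 0.5% sampling fraction and the seed 42 are arbitrary choices of mine). Note that I draw without replacement here, whereas np.random.choice draws with replacement by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diamonds data frame.
df = pd.DataFrame({
    "carat": np.linspace(0.2, 2.0, 1000),
    "price": np.arange(1000),
})

# The sample size has to be an integer, so round the fraction of the population.
n = max(1, round(0.005 * len(df)))

# Route 1: pick index labels with NumPy, then select those rows by label with .loc.
rng = np.random.default_rng(42)
rows = rng.choice(df.index.values, size=n, replace=False)
sampled_loc = df.loc[rows]

# Route 2: let pandas do the sampling directly.
sampled_df = df.sample(n=n, random_state=42)

print(sampled_loc)
print(sampled_df)
```

DataFrame.sample also accepts frac= instead of n= if you would rather state the fraction directly.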
In [108]:
diamonds.describe()
Out[108]:
carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000
In [109]:
cut=diamonds.groupby("cut")
In [110]:
cut.count()
Out[110]:
carat color clarity depth table price x y z
cut
Fair 1610 1610 1610 1610 1610 1610 1610 1610 1610
Good 4906 4906 4906 4906 4906 4906 4906 4906 4906
Ideal 21551 21551 21551 21551 21551 21551 21551 21551 21551
Premium 13791 13791 13791 13791 13791 13791 13791 13791 13791
Very Good 12082 12082 12082 12082 12082 12082 12082 12082 12082
In [114]:
cut.mean()
Out[114]:
carat depth table price x y z
cut
Fair 1.046137 64.041677 59.053789 4358.757764 6.246894 6.182652 3.982770
Good 0.849185 62.365879 58.694639 3928.864452 5.838785 5.850744 3.639507
Ideal 0.702837 61.709401 55.951668 3457.541970 5.507451 5.520080 3.401448
Premium 0.891955 61.264673 58.746095 4584.257704 5.973887 5.944879 3.647124
Very Good 0.806381 61.818275 57.956150 3981.759891 5.740696 5.770026 3.559801
In [115]:
cut.median()
Out[115]:
carat depth table price x y z
cut
Fair 1.00 65.0 58 3282.0 6.175 6.10 3.97
Good 0.82 63.4 58 3050.5 5.980 5.99 3.70
Ideal 0.54 61.8 56 1810.0 5.250 5.26 3.23
Premium 0.86 61.4 59 3185.0 6.110 6.06 3.72
Very Good 0.71 62.1 58 2648.0 5.740 5.77 3.56
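A note on the grouped summaries above: in pandas 2.0 and later, calling mean or median on a grouped frame that still holds text columns (such as color and clarity) raises an error instead of silently dropping them. Here is a minimal sketch of one way around that, using a small invented frame rather than the real diamonds data.

```python
import pandas as pd

# Small made-up frame with one grouping column, one text column and two numeric columns.
df = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Ideal", "Good", "Ideal"],
    "color": ["E", "E", "J", "D", "F", "D"],
    "price": [300, 400, 500, 350, 600, 450],
    "carat": [0.2, 0.3, 0.5, 0.25, 0.6, 0.35],
})

by_cut = df.groupby("cut")

# Restrict the aggregation to numeric columns so the text column is skipped.
means = by_cut.mean(numeric_only=True)

# Several statistics for a single column in one call.
price_stats = by_cut["price"].agg(["count", "mean", "median"])

print(means)
print(price_stats)
```

The agg call is handy when you want count, mean and median side by side instead of in three separate outputs.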
In [117]:
pd.crosstab(diamonds.cut, diamonds.color)
Out[117]:
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
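Raw counts like the ones above are often easier to read with totals or as row shares. This is a small sketch of the margins and normalize options of pd.crosstab, again on a made-up frame rather than the diamonds file.

```python
import pandas as pd

# Made-up categorical data for illustration.
df = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Ideal", "Good", "Ideal"],
    "color": ["E", "E", "J", "D", "E", "D"],
})

# margins=True adds an "All" row and column holding the totals.
counts = pd.crosstab(df["cut"], df["color"], margins=True)

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(df["cut"], df["color"], normalize="index")

print(counts)
print(shares)
```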
In [121]:
diamonds.corr()
Out[121]:
carat depth table price x y z
carat 1.000000 0.028224 0.181618 0.921591 0.975094 0.951722 0.953387
depth 0.028224 1.000000 -0.295779 -0.010647 -0.025289 -0.029341 0.094924
table 0.181618 -0.295779 1.000000 0.127134 0.195344 0.183760 0.150929
price 0.921591 -0.010647 0.127134 1.000000 0.884435 0.865421 0.861249
x 0.975094 -0.025289 0.195344 0.884435 1.000000 0.974701 0.970772
y 0.951722 -0.029341 0.183760 0.865421 0.974701 1.000000 0.952006
z 0.953387 0.094924 0.150929 0.861249 0.970772 0.952006 1.000000
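The correlation matrix above only covers the numeric columns. In pandas 2.0 and later, corr no longer drops text columns on its own, so a frame like diamonds needs them filtered out first. A minimal sketch, assuming a small invented frame, that should behave the same way across pandas versions:

```python
import pandas as pd

# Made-up frame with one text column and three numeric columns.
df = pd.DataFrame({
    "cut": ["Fair", "Good", "Ideal", "Premium"],
    "carat": [0.2, 0.4, 0.6, 0.8],
    "price": [100, 200, 300, 400],
    "depth": [61.0, 62.5, 60.0, 63.0],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```

In this toy data price is an exact linear function of carat, so their correlation comes out as 1 (up to floating point rounding).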
 

Markdown with R Commander #rstats

Just training and fiddling with spatial analytics

Ajay Ohri

2015-03-15

> library(maptools)
> library(raster)
> adm <- getData('GADM', country='IND', level=2)
> mahadm=adm[adm$NAME_1=="Maharashtra",]
> head(mahadm,20)
      PID ID_0 ISO NAME_0 ID_1      NAME_1 ID_2         NAME_2 NL_NAME_2                    VARNAME_2   TYPE_2 ENGTYPE_2
306 17478  105 IND  India   21 Maharashtra  306     Ahmednagar                             Ahmadnagar District  District
307 17479  105 IND  India   21 Maharashtra  307          Akola                                        District  District
308 17480  105 IND  India   21 Maharashtra  308       Amravati           Amaravati, Amraoti, Amaraoti District  District
309 17481  105 IND  India   21 Maharashtra  309     Aurangabad                                        District  District
310 17482  105 IND  India   21 Maharashtra  310       Bhandara                                        District  District
311 17483  105 IND  India   21 Maharashtra  311            Bid                     Bir|Beed|Bhir|Bidh District  District
312 17484  105 IND  India   21 Maharashtra  312        Buldana                                        District  District
313 17485  105 IND  India   21 Maharashtra  313     Chandrapur                                 Chanda District  District
314 17486  105 IND  India   21 Maharashtra  314          Dhule                  Dhulia, West Khandesh District  District
315 17487  105 IND  India   21 Maharashtra  315    Garhchiroli                                        District  District
316 17488  105 IND  India   21 Maharashtra  316        Gondiya                                        District  District
317 17489  105 IND  India   21 Maharashtra  317 Greater Bombay                                        District  District
318 17490  105 IND  India   21 Maharashtra  318        Hingoli                                        District  District
319 17491  105 IND  India   21 Maharashtra  319        Jalgaon                          East Khandesh District  District
320 17492  105 IND  India   21 Maharashtra  320          Jalna                                        District  District
321 17493  105 IND  India   21 Maharashtra  321       Kolhapur                                        District  District
322 17494  105 IND  India   21 Maharashtra  322          Latur                Kulaba, Kolaba, Kolabad District  District
323 17495  105 IND  India   21 Maharashtra  323         Nagpur                                        District  District
324 17496  105 IND  India   21 Maharashtra  324         Nanded                                 Nander District  District
325 17497  105 IND  India   21 Maharashtra  325      Nandurbar                                        District  District
> mahadm$pop=as.factor(sample(1:10,34,T))
> mahadm$pop2=as.factor(sample(1:10,34,T))
> mahadm$pop3=as.factor(sample(1:10,34,T))
> head(mahadm,20)
      PID ID_0 ISO NAME_0 ID_1      NAME_1 ID_2         NAME_2 NL_NAME_2                    VARNAME_2   TYPE_2 ENGTYPE_2
306 17478  105 IND  India   21 Maharashtra  306     Ahmednagar                             Ahmadnagar District  District
307 17479  105 IND  India   21 Maharashtra  307          Akola                                        District  District
308 17480  105 IND  India   21 Maharashtra  308       Amravati           Amaravati, Amraoti, Amaraoti District  District
309 17481  105 IND  India   21 Maharashtra  309     Aurangabad                                        District  District
310 17482  105 IND  India   21 Maharashtra  310       Bhandara                                        District  District
311 17483  105 IND  India   21 Maharashtra  311            Bid                     Bir|Beed|Bhir|Bidh District  District
312 17484  105 IND  India   21 Maharashtra  312        Buldana                                        District  District
313 17485  105 IND  India   21 Maharashtra  313     Chandrapur                                 Chanda District  District
314 17486  105 IND  India   21 Maharashtra  314          Dhule                  Dhulia, West Khandesh District  District
315 17487  105 IND  India   21 Maharashtra  315    Garhchiroli                                        District  District
316 17488  105 IND  India   21 Maharashtra  316        Gondiya                                        District  District
317 17489  105 IND  India   21 Maharashtra  317 Greater Bombay                                        District  District
318 17490  105 IND  India   21 Maharashtra  318        Hingoli                                        District  District
319 17491  105 IND  India   21 Maharashtra  319        Jalgaon                          East Khandesh District  District
320 17492  105 IND  India   21 Maharashtra  320          Jalna                                        District  District
321 17493  105 IND  India   21 Maharashtra  321       Kolhapur                                        District  District
322 17494  105 IND  India   21 Maharashtra  322          Latur                Kulaba, Kolaba, Kolabad District  District
323 17495  105 IND  India   21 Maharashtra  323         Nagpur                                        District  District
324 17496  105 IND  India   21 Maharashtra  324         Nanded                                 Nander District  District
325 17497  105 IND  India   21 Maharashtra  325      Nandurbar                                        District  District
    pop pop2 pop3
306   5    2    6
307   8    5    1
308   2    4    4
309   2    6   10
310   8    7    5
311   9   10    1
312   8    9    4
313   4    9    3
314   1    8    7
315   9    5    7
316   7    8    4
317   9    8    6
318   1   10    3
319   1    8    6
320   8    6   10
321   8    4    8
322   4    9    2
323   5    3    5
324   4    8   10
325   1    5    1
> par(mfrow=c(3,1))
> plot(mahadm,col=mahadm$pop)
> plot(mahadm,col=mahadm$pop2)
> plot(mahadm,col=mahadm$pop3)

Comprehensive Learning Path in R

I have built a comprehensive learning path for professionals, students and researchers at http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-r-data-science/

Rather than simply putting together a list of resources, I have tried to create a structured path that is agnostic to any one source and instead draws on the best sources for each step or phase in the analytics workflow.

There are links to resources by Hadley Wickham, Revolution and Data Camp, as well as videos, live projects, slideshares and tutorials, arranged in a systematic manner.

Have a look and let me know how this can be made better.

LeaRning Path on R – Step by Step Guide to Learn Data Science on R

Interviewed on my analytics adventures

I just got interviewed rather extensively at http://www.analyticsvidhya.com/blog/2015/02/interview-expert-ajay-ohri-founder-decisionstats-com/

Interview with Industry expert – Ajay Ohri

Kunal: You started your data science career long before most people had heard of it and before it became one of the hottest fields around. What were the challenges that you faced during the initial stages of your professional career?

Ajay: Cool question man. Yeah, it used to be called business analytics, then data analytics and now it's data science. What will they call it next?

Initial challenges: R was raw (this was 2007), SAS was expensive, and even Open Office was not as good as it is now. Getting a pipeline of work, finding leads for clients, converting leads to contracts and chasing people to pay me after the work was done were the initial challenges.

You can read the rest of the interview at http://www.analyticsvidhya.com/blog/2015/02/interview-expert-ajay-ohri-founder-decisionstats-com/

Writing for Adaptive Systems

I have been advising Adaptive Systems Inc for the past few months. You can see their profile at http://adaptivesystemsinc.com/

Basically I am helping with making analytics actionable. It seemed a logical next step after my writing (more on that later) to test whether my research and opinions work in the real world of consulting as well.

As part of that I have written a few articles, and I will be doing software reviews as well.

Some of the articles I have written are:

An Adaptive Approach for Handling Messy Big Data

In this article I advocate a pragmatic and heterogeneous approach, rather than a dogmatic one, to handling Big Data.

Predictive Analytics: Moving beyond the buzzword to the action

In which I discuss analyzing the ROI on analytics software itself, or analyzing software analytics itself.

Selecting the Right Software for Data Analytics and Data Integration

In which I try and formulate a guide to help you in the brave new world of Big Data brand clutter where every software vendor is claiming to be the best and the fastest.

 

Ajay Ohri in Wired – An App Store for Algorithms

I got featured in a Wired article recently. This is one of my very old ideas: basically an App Store for Algorithms.

I briefly advised two startups in this space (but no longer do).

Klint Finley took time to shoot me some questions and you can read the final article here.

http://www.wired.com/2014/08/algorithmia/

I have now been featured in Wired and ReadWrite Web, and a member of Star Trek has reblogged me on Tumblr! Geek heaven, and I owe it all to the readers of Decisionstats.com!