Summer School in Analytics in Delhi

A comprehensive summer program is being offered by DecisionStats.org. It will cover multiple languages for analytics, including Python, SAS and R, and will also equip you with social media skills, web analytics and social media analytics.

It is classroom-based training, aimed only at students who can attend classes in Hauz Khas Village, Delhi.

Here is the full message:

We will conduct a summer workshop in analytics. It will be a vigorous paid certificate program. After the program, we may offer internships to some of you.

Kindly fill in this form and also forward it to your peers. Please share it with your social media network and with anyone you feel could benefit from analytics training.

http://bit.ly/hkvsummer

Interns for DecisionStats – a cutting edge analytics firm

We have the annual summer internship back at DecisionStats. This year we especially need Graphic Design Interns and people who want to be Data Scientists.

So apply at info@decisionstats.org or through the link below.

1) We now have a separate arm for Training and Consulting at http://decisionstats.org. Basically, we have hived off that business separately. We also have a new office in Hauz Khas Village.

2) Last year (the first year of internships) our intern Chandan from IIT KGP made this (http://www.slideshare.net/ajayohri/decisionstatscom-data-science-virtual-internship) and this (http://www.slideshare.net/ajayohri/python-for-r-users). He had no knowledge of either R or Python before he began.

3) Preference will be given to people who can come to the office rather than telecommute.

http://internshala.com/internship/detail/multiple-profiles-management-graphic-design-internship-in-delhi-at-decisionstats1429240830

About Decisionstats (http://decisionstats.com):

Data Science and Analytics Website that deals in cutting edge research, consulting, writing and speaking assignments

About the Internship:

The communication intern will proofread, edit and write content, including blog posts and social media. The intern will be given on-the-job training in social media, web analytics and search engine optimization, as well as an understanding of digital business. The only requirements are learnability, truthfulness and a good command of English.

The graphic design intern will create and edit graphics, including icons, logos, posters and infographics. The intern will be given on-the-job training in designing in a real-time environment, web analytics and search engine optimization, as well as an understanding of digital business. The only requirements are learnability, truthfulness and a good command of design.

The management intern will create and edit schedules and assist in coordination. The intern will be given on-the-job training in managing in a start-up environment, web analytics and search engine marketing, as well as an understanding of digital business. The only requirements are learnability, truthfulness, passion and good management skills.

The data science intern will create and edit data science research and assist in writing. The intern will be given on-the-job training in data science and analytics. The only requirements are learnability, truthfulness, and a passion for writing code and hacking problems on the fly.

# of Internships available: 4

Who can apply:

The internships require people who are serious about their careers, can devote the agreed-upon hours per week and meet deadlines. Preference will be given to candidates from established institutes and with a strong prior academic record.

Streams:

Analytics, Design, Engineering Management, English, Humanities, Management, Engineering

Writing on SAS for R Users

It might seem counterintuitive for me to write on the SAS language for R users, for the following reasons:

1) I have already written 2 books on R for Springer. Clearly R is my weapon of choice for data analysis.

2) R has been quite lucrative for me in my writing. It has positioned me as one of the earliest R trainers in India. I started the R curriculum for Jigsaw Analytics, Edureka and WeekendR, which means thousands of people have viewed content written by me, watched a video of me speaking on R, or been trained on pedagogy derived from my original work.

3) I have spoken on R at colleges like LSR, DSE, DCE-DTU, VIT, MS Ramiah and IIT Delhi.

4) I am currently writing “Python for R users” for Wiley

However, I am writing SAS for R users because:

1) They are fundamentally different languages aimed at different audiences.

2) I realize students in the West are now trained in R in colleges, but a lot of corporates still use SAS because the switching cost in terms of business disruption is high. The benefit of analytics is much greater than the expensive annual fee (which is a few basis points at best of the Total Cost of Ownership).

3) Existing books on both SAS and R are not updated for newer packages (basically Hadley and Dirk are making packages faster than people can write about them).

4) India's outsourcing industry hires many students and needs polyglots who know both the SAS and R languages. Ergo a new book.

5) I am bored and I need a challenge. Plus I always get more hugs and love from SAS Institute than from some package creators.

The fundamental differences between R and SAS remain:

1) R is object oriented and SAS is not

2) SAS is much easier to learn and R is not

3) While R refers to parts of objects through $ and [[ ]], SAS uses the CLASS and VAR statements as parameters to various procs (functions).

4) SAS bundling of modules can be confusing to people used to downloading R packages.

Accordingly, a student of mine has been working under my direction here: https://welcomedata.wordpress.com/category/sas/

We intend to create a proposal for Wiley soon. What do you think? What would you like to read?

Top 15 functions for Analytics in Python #python #rstats #analytics

Here is a list of top ~~ten~~ fifteen functions for analysis in Python

import (imports a particular package or library in Python)
getcwd (from os) - get current working directory
chdir (from os) - change directory
listdir (from os) - list files in the specified directory

read_csv(from pandas) reads in a csv file

objectname.info (like proc contents in SAS or str in R , it describes the object called objectname)
objectname.columns (like proc contents in SAS or names in R , it describes the object variable names of the object called objectname)
objectname.head (like head in R , it prints the first few rows in the object called objectname)
objectname.tail (like tail in R , it prints the last few rows in the object called objectname)
len (length)

objectname.ix[rows] (if rows is a list of numbers, this returns those rows (by index) of the object called objectname; .ix was removed in pandas 1.0, where objectname.loc[rows] does the same job for index labels)

groupby -group by a categorical variable

crosstab -cross tab between two categorical variables

describe - exploratory summary statistics for numerical variables
corr – correlation between numerical variables

In [1]:

import pandas as pd #importing packages
import os

In [2]:

os.getcwd() #current working directory

Out[2]:

'/home/ajay/Desktop'

In [3]:

os.chdir('/home/ajay/Downloads') #changes the working directory

In [4]:

os.getcwd()

Out[4]:

'/home/ajay/Downloads'

In [5]:

a=os.getcwd()
os.listdir(a) #lists all the files in a directory

In [105]:

diamonds=pd.read_csv("diamonds.csv")
#note: header=0 (the default) takes the first row as the header; otherwise specify header=None
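To make the header note concrete, here is a minimal sketch using an in-memory CSV. The three-line text is made up for illustration and is not taken from diamonds.csv; `csv_text`, `with_header` and `no_header` are hypothetical names.

```python
import io
import pandas as pd

# Tiny made-up CSV text standing in for a file on disk (illustration only)
csv_text = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n"

# Default (header=0): the first row supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first row is treated as data and columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns), len(with_header))
print(list(no_header.columns), len(no_header))
```

With header=None the frame has one more row, because the line of names is read as data.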

In [106]:

diamonds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 3.9+ MB

In [36]:

diamonds.head()

Out[36]:

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75

In [37]:

diamonds.tail(10)

Out[37]:

	carat	cut	color	clarity	depth	table	price	x	y	z
53930	0.71	Premium	E	SI1	60.5	55	2756	5.79	5.74	3.49
53931	0.71	Premium	F	SI1	59.8	62	2756	5.74	5.73	3.43
53932	0.70	Very Good	E	VS2	60.5	59	2757	5.71	5.76	3.47
53933	0.70	Very Good	E	VS2	61.2	59	2757	5.69	5.72	3.49
53934	0.72	Premium	D	SI1	62.7	59	2757	5.69	5.73	3.58
53935	0.72	Ideal	D	SI1	60.8	57	2757	5.75	5.76	3.50
53936	0.72	Good	D	SI1	63.1	55	2757	5.69	5.75	3.61
53937	0.70	Very Good	D	SI1	62.8	60	2757	5.66	5.68	3.56
53938	0.86	Premium	H	SI2	61.0	58	2757	6.15	6.12	3.74
53939	0.75	Ideal	D	SI2	62.2	55	2757	5.83	5.87	3.64

In [38]:

diamonds.columns

Out[38]:

Index([u'carat', u'cut', u'color', u'clarity', u'depth', u'table', u'price', u'x', u'y', u'z'], dtype='object')

In [92]:

b=len(diamonds) #this is the total population size
print(b)

In [93]:

import numpy as np

In [98]:

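rows = np.random.choice(diamonds.index.values, int(0.0001*b)) #the sample size must be an integer
print(rows)
sampled_df = diamonds.loc[rows] #.ix was removed in pandas 1.0; .loc selects rows by index label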

[45653  7503 47794 12017 46125]

In [99]:

sampled_df

Out[99]:

	carat	cut	color	clarity	depth	table	price	x	y	z
45653	0.25	Ideal	H	IF	61.4	57	525	4.05	4.08	2.49
7503	1.05	Premium	G	SI2	61.3	58	4241	6.55	6.60	4.03
47794	0.71	Ideal	J	VS2	62.4	54	1899	5.72	5.76	3.58
12017	1.00	Premium	F	SI1	59.8	59	5151	6.55	6.49	3.90
46125	0.51	Ideal	F	VS1	61.7	54	1744	5.14	5.17	3.18
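One thing to keep in mind with the sampling step above: np.random.choice draws with replacement by default, so the same row can show up twice. Below is a rough sketch of two ways to sample without replacement. The `toy` frame is made-up data standing in for diamonds, and the seed is there only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the diamonds data (illustration only)
toy = pd.DataFrame({
    "carat": np.linspace(0.2, 2.0, 100),
    "price": np.arange(100) * 50 + 300,
})

n = len(toy) // 20  # 5% of the rows, as an integer (5 here)

# Option 1: pick index labels without replacement, then select by label
rng = np.random.default_rng(42)
rows = rng.choice(toy.index.values, size=n, replace=False)
sampled_a = toy.loc[rows]

# Option 2: let pandas do the sampling directly (without replacement by default)
sampled_b = toy.sample(n=n, random_state=42)

print(len(sampled_a), len(sampled_b))
```

DataFrame.sample is usually the shorter route; the first option is closer to the cell above.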

In [108]:

diamonds.describe()

Out[108]:

	carat	depth	table	price	x	y	z
count	53940.000000	53940.000000	53940.000000	53940.000000	53940.000000	53940.000000	53940.000000
mean	0.797940	61.749405	57.457184	3932.799722	5.731157	5.734526	3.538734
std	0.474011	1.432621	2.234491	3989.439738	1.121761	1.142135	0.705699
min	0.200000	43.000000	43.000000	326.000000	0.000000	0.000000	0.000000
25%	0.400000	61.000000	56.000000	950.000000	4.710000	4.720000	2.910000
50%	0.700000	61.800000	57.000000	2401.000000	5.700000	5.710000	3.530000
75%	1.040000	62.500000	59.000000	5324.250000	6.540000	6.540000	4.040000
max	5.010000	79.000000	95.000000	18823.000000	10.740000	58.900000	31.800000

In [109]:

cut=diamonds.groupby("cut")

In [110]:

cut.count()

Out[110]:

	carat	color	clarity	depth	table	price	x	y	z
cut
Fair	1610	1610	1610	1610	1610	1610	1610	1610	1610
Good	4906	4906	4906	4906	4906	4906	4906	4906	4906
Ideal	21551	21551	21551	21551	21551	21551	21551	21551	21551
Premium	13791	13791	13791	13791	13791	13791	13791	13791	13791
Very Good	12082	12082	12082	12082	12082	12082	12082	12082	12082

In [114]:

cut.mean()

Out[114]:

	carat	depth	table	price	x	y	z
cut
Fair	1.046137	64.041677	59.053789	4358.757764	6.246894	6.182652	3.982770
Good	0.849185	62.365879	58.694639	3928.864452	5.838785	5.850744	3.639507
Ideal	0.702837	61.709401	55.951668	3457.541970	5.507451	5.520080	3.401448
Premium	0.891955	61.264673	58.746095	4584.257704	5.973887	5.944879	3.647124
Very Good	0.806381	61.818275	57.956150	3981.759891	5.740696	5.770026	3.559801

In [115]:

cut.median()

Out[115]:

	carat	depth	table	price	x	y	z
cut
Fair	1.00	65.0	58	3282.0	6.175	6.10	3.97
Good	0.82	63.4	58	3050.5	5.980	5.99	3.70
Ideal	0.54	61.8	56	1810.0	5.250	5.26	3.23
Premium	0.86	61.4	59	3185.0	6.110	6.06	3.72
Very Good	0.71	62.1	58	2648.0	5.740	5.77	3.56
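The count, mean and median calls above can also be combined into one table for a single column with agg. This is a small sketch on made-up data (the `toy` frame is hypothetical, not the diamonds file):

```python
import pandas as pd

# Made-up data with the same column names as diamonds (illustration only)
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good", "Good"],
    "price": [300, 500, 400, 600, 800],
})

# One row per cut, one column per statistic of price
summary = toy.groupby("cut")["price"].agg(["count", "mean", "median"])
print(summary)
```

Selecting the column before aggregating keeps text columns out of the calculation.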

In [117]:

pd.crosstab(diamonds.cut, diamonds.color)

Out[117]:

color	D	E	F	G	H	I	J
cut
Fair	163	224	312	314	303	175	119
Good	662	933	909	871	702	522	307
Ideal	2834	3903	3826	4884	3115	2093	896
Premium	1603	2337	2331	2924	2360	1428	808
Very Good	1513	2400	2164	2299	1824	1204	678
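Raw counts are hard to compare across cuts of very different sizes. As a sketch, the normalize argument of crosstab turns each row into shares. The `toy` frame below is made-up data for illustration only.

```python
import pandas as pd

# Made-up data with the same column names as diamonds (illustration only)
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Good", "Good"],
    "color": ["D", "E", "E", "D", "D"],
})

# Plain counts, as in the cell above
counts = pd.crosstab(toy["cut"], toy["color"])

# Row shares: each row sums to 1
shares = pd.crosstab(toy["cut"], toy["color"], normalize="index")

print(counts)
print(shares)
```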

In [121]:

diamonds.corr()

Out[121]:

	carat	depth	table	price	x	y	z
carat	1.000000	0.028224	0.181618	0.921591	0.975094	0.951722	0.953387
depth	0.028224	1.000000	-0.295779	-0.010647	-0.025289	-0.029341	0.094924
table	0.181618	-0.295779	1.000000	0.127134	0.195344	0.183760	0.150929
price	0.921591	-0.010647	0.127134	1.000000	0.884435	0.865421	0.861249
x	0.975094	-0.025289	0.195344	0.884435	1.000000	0.974701	0.970772
y	0.951722	-0.029341	0.183760	0.865421	0.974701	1.000000	0.952006
z	0.953387	0.094924	0.150929	0.861249	0.970772	0.952006	1.000000
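The output above comes from an older pandas, which dropped the text columns (cut, color, clarity) on its own. In newer releases (2.0 and later, as far as I know) corr() raises an error on text columns instead, so a safer sketch is to select the numeric columns first. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up data: price is an exact linear function of carat (illustration only)
toy = pd.DataFrame({
    "carat": [0.2, 0.4, 0.6, 0.8],
    "price": [300, 700, 1100, 1500],
    "cut": ["Ideal", "Good", "Ideal", "Fair"],
})

# Keep only numeric columns, then compute pairwise correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```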

Random Thoughts on Cryptography

Some random thoughts while taking a walk in the park:

1) The inevitability of interception- Sooner or later, encrypted messages will be captured.

2) The cost of decryption- Decryption is inevitable. All the coder can do is increase the cost (time, money and computation) to the enemy

3) Signal/Noise- Introducing multiple algorithms to create random noise messages can increase the cost of decryption but reduce the probability of interception. The technologically weaker player should introduce more noise to distort the signal/noise ratio, knowing the messages are being intercepted anyway (especially electronic, radio or digital ones).

4) Intercepted flag- Interception takes time. Flags to capture interception shall help coders in knowing which messages have been intercepted and which not. This of course can be manipulated by the interceptor.

5) Turing is not God- You can use pictures, use Navajo slang, poetry code in the same message. Maybe change the code from binary to something else.

6) Kill all the decryptographers- Focusing on the personnel of the enemy can help increase the cost of decryption.
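The signal/noise idea in point 3 can be sketched in a few lines. This is not a scheme from the post itself; it borrows the spirit of Rivest's "chaffing and winnowing", where real packets carry a valid authentication tag and decoys carry a random one. The function names are hypothetical, the shared key is assumed to have been exchanged beforehand, and the real message is left readable only for clarity (in practice it would be encrypted so that it looks like the noise).

```python
import hashlib
import hmac
import secrets

def make_packet(key, message):
    """Real packet: the message plus a valid HMAC-SHA256 tag."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message, tag

def make_noise_packet(length):
    """Noise packet: random bytes with a random tag that will not verify."""
    return secrets.token_bytes(length), secrets.token_bytes(32)

def winnow(key, packets):
    """Keep only the messages whose tag verifies under the shared key."""
    real = []
    for message, tag in packets:
        expected = hmac.new(key, message, hashlib.sha256).digest()
        if hmac.compare_digest(tag, expected):
            real.append(message)
    return real

key = secrets.token_bytes(32)  # shared secret, assumed exchanged beforehand
packets = [make_packet(key, b"meet at dawn")]
packets += [make_noise_packet(12) for _ in range(9)]  # 9 decoys per real message
secrets.SystemRandom().shuffle(packets)

recovered = winnow(key, packets)
print(recovered)
```

The receiver discards the noise cheaply, while an interceptor without the key has to treat every packet as potentially real.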

(yawns and shrugs)

Talking on Social Media And Social Media Analytics

Over the past seven years, words written by me have hit millions of views. Here is a short talk I gave recently at a workshop in Hauz Khas Village.

Installing Ipython Notebook on Ubuntu 12

I ran into a series of errors before I finally managed to make IPython run on my Ubuntu 12. Notice that I am adding some extras such as mathjax and pandoc, but that is just for a smoother install. The trouble point was the package pyzmq, which was fixed by using the --upgrade option as well as installing python-dev.

sudo apt-get install python-pip
sudo apt-get install python-dev

sudo pip install --upgrade "ipython[all]"
sudo pip install invoke
sudo pip install jinja2

sudo pip install --upgrade pyzmq

sudo python -m IPython.external.mathjax
sudo apt-get install pandoc
sudo pip install tornado jsonschema

ipython notebook
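If the notebook still refuses to start, a quick way to see which piece is missing is to ask Python whether each module can be found. This is a rough sketch assuming Python 3.4 or later (importlib.util.find_spec is not available on Python 2); `REQUIRED` and `check_modules` are hypothetical names, and the list holds the import names of the packages installed above.

```python
import importlib.util

# Import names of the packages installed above (pyzmq is imported as "zmq")
REQUIRED = ["IPython", "zmq", "tornado", "jinja2", "jsonschema"]

def check_modules(names):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_modules(REQUIRED).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING is the package to reinstall with pip.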

