Christmas Carol: The Best Software (BI-Stats-Analytics)

  1. There is no best software; each product is just optimized for different constraints and for the tangible and intangible needs of its users. (Image below, citation: support.sas.com)
  2. Price in products is defined as Demand divided by Supply. Sometimes this is Expected Demand over Expected Supply (see oil prices). Everyone grumbles over prices, but we pay what we think is fair. (Citation: http://bm2.genes.nig.ac.jp/RGM2/index.php?ctv=Survival)
  3. Prices in services are defined by value creation as well: Value = Benefit divided by Cost. Benefits are tangible, as in how much money the software saves in fraud, as well as intangible, as in how easy it is to start using JMP versus R Commander. Costs are tangible too: how much we have to pay from our cheque book for an annual, perpetual or one-time license, a maintenance contract or application support. Intangible costs are how long I have to hold the phone while talking to customer support, and how much time it takes me to find the best solution on the website on my own without a salesperson bothering me with frequent calls. (Citation: http://academic.udayton.edu/gregelvers/psy216/spss/graphs.htm#tukey)
  4. All salespeople (especially in the software industry) spam you with frequent calls, email reminders and claims that their company is the best company ever, with the best software in the history of mankind. That is their job: they are pushed by sales quotas and pulled by their own enthusiasm to sell more to the same customer. If you ever bought three licenses and found out at the end of the year that you needed just two, forgive the salesman. As Arthur Miller said, "All Salesmen are Dreamers". (Citation of STATA graph below: http://www.ats.ucla.edu/stat/Stata/library/GraphExamples/code/grbartall.htm)
  5. Technology moves faster than you can say Jackie Robinson, and it is getting faster. Research and Development (R and D) will always move slower than the speed at which Marketing thinks it can move. See http://www.dilbert.com for more insights on this. You either build a billion-dollar in-house lab (like Palo Alto, remember?), or you go for total outsourcing (like semiconductors and open source do), or you go for a mix and match. (Citation: http://people.sc.fsu.edu/~burkardt/html/matlab_graphics/matlab_graphics.html)

Based on the above parameters, the best statistical software for 2009 continues to be the software that uses a mixture of Genetic Algorithms, Time Series Based Regression and Sampling: it is the software that runs in the head of the statistician, the mathematician and the customer, the BRAIN.

That’s the best software ever.

(Citation – Hugh of http://gapingvoid.com/ )

Happy Hols

Best Internet Site of 2009

Here is the best internet site of 2009.
It basically shows how many jobs have been created per dollar spent.
Funded by the debt of American Treasuries
sold to Chinese.

Remember the Chinese Opium Wars.
Well, the Chinese are hooked on American Treasuries, and they probably need a warship with an admiral to open their markets and currency. Oui!

Well anyway the website is called http://Recovery.gov

Born in the USA?

Here is some econometric searching I did, using Google Public Data, Wolfram Alpha and the Bureau of Labor Statistics.

United States

United States – Monthly Data

Data Series                        May 2009   June 2009   July 2009   Aug 2009   Sept 2009   Oct 2009
Unemployment Rate (1)              9.4        9.5         9.4         9.7        9.8         10.2
Change in Payroll Employment (2)   -303       -463        -304        -154       (P) -219    (P) -190
Average Hourly Earnings (3)        18.53      18.54       18.59       18.66      (P) 18.67   (P) 18.72
Consumer Price Index (4)           0.1        0.7         0.0         0.4        0.2         0.3
Producer Price Index (5)           0.2        1.7         (P) -1.0    (P) 1.7    (P) -0.6    (P) 0.3
U.S. Import Price Index (6)        1.7        2.7         (R) -0.6    (R) 1.5    (R) 0.2     (R) 0.7

Footnotes
(1) In percent, seasonally adjusted. Annual averages are available for Not Seasonally Adjusted data.
(2) Number of jobs, in thousands, seasonally adjusted.
(3) For production and nonsupervisory workers on private nonfarm payrolls, seasonally adjusted.
(4) All items, U.S. city average, all urban consumers, 1982-84=100, 1-month percent change, seasonally adjusted.
(5) Finished goods, 1982=100, 1-month percent change, seasonally adjusted.
(6) All imports, 1-month percent change, not seasonally adjusted.
(R) Revised
(P) Preliminary
United States – Quarterly Data

Data Series                 3rd Qtr 2008   4th Qtr 2008   1st Qtr 2009   2nd Qtr 2009   3rd Qtr 2009
Employment Cost Index (1)   0.6            0.6            0.3            0.4            0.4
Productivity (2)            -0.1           0.8            0.3            6.9            9.5

Footnotes
(1) Compensation, all civilian workers, quarterly data, 3-month percent change, seasonally adjusted.
(2) Output per hour, nonfarm business, quarterly data, percent change from previous quarter at annual rate, seasonally adjusted.
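
To make the monthly figures above a little easier to digest, here is a small Python sketch that adds up the payroll changes and the move in the unemployment rate from May to October 2009. The numbers are copied from the BLS series above (September and October are preliminary); the variable names are my own and purely illustrative.

```python
# Monthly BLS figures quoted above (May to Oct 2009).
# Payroll change is in thousands of jobs; Sept and Oct are preliminary.
payroll_change_thousands = {
    "May": -303, "June": -463, "July": -304,
    "Aug": -154, "Sept": -219, "Oct": -190,
}
unemployment_rate_pct = {
    "May": 9.4, "June": 9.5, "July": 9.4,
    "Aug": 9.7, "Sept": 9.8, "Oct": 10.2,
}

# Net change in payroll employment over the six months, in thousands of jobs.
total_payroll_change = sum(payroll_change_thousands.values())

# Change in the unemployment rate between the first and last month shown,
# in percentage points (rounded to one decimal to avoid float noise).
rate_change_points = round(
    unemployment_rate_pct["Oct"] - unemployment_rate_pct["May"], 1
)

print(f"Net payroll change, May-Oct 2009: {total_payroll_change} thousand jobs")
print(f"Unemployment rate change: {rate_change_points} percentage points")
```

In other words, the six months shown add up to roughly 1.6 million payroll jobs lost.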

Also included are the average salaries of teachers and the average hourly wages in some offshoring-prone industries:

http://www.bls.gov/oes/2008/may/oes_nat.htm#b25-0000

http://www.bls.gov/oes/2008/may/oes_nat.htm#b11-0000

and

http://www.google.com/publicdata?ds=usunemployment&met=unemployment_rate&idim=state:ST370000:ST540000:ST510000&tdim=true

WHAT THEY PAY TEACHERS (MAY 2008)

Education, Training, and Library Occupations: Wage Estimates

Code     Occupation Title                                  Employment (1)  Median Hourly  Mean Hourly  Mean Annual (2)  Mean RSE (3)
25-0000  Education, Training, and Library Occupations      8,451,250       $21.26         $23.30       $48,460          0.5 %
25-1011  Business Teachers, Postsecondary                  69,690          (4)            (4)          $77,340          1.0 %
25-1021  Computer Science Teachers, Postsecondary          32,520          (4)            (4)          $74,050          1.0 %
25-1022  Mathematical Science Teachers, Postsecondary      45,710          (4)            (4)          $68,130          0.9 %
25-1031  Architecture Teachers, Postsecondary              6,430           (4)            (4)          $75,450          1.9 %
25-1032  Engineering Teachers, Postsecondary               32,070          (4)            (4)          $90,070          1.1 %
25-1041  Agricultural Sciences Teachers, Postsecondary     10,000          (4)            (4)          $77,770          1.6 %
25-1042  Biological Science Teachers, Postsecondary        51,930          (4)            (4)          $83,270          2.7 %

WHAT THEY PAY THEMSELVES

Management Occupations: Wage Estimates

Code     Occupation Title                                  Employment (1)  Median Hourly  Mean Hourly  Mean Annual (2)  Mean RSE (3)
11-0000  Management Occupations                            6,152,650       $42.15         $48.23       $100,310         0.2 %
11-1011  Chief Executives                                  301,930         $76.23         $77.13       $160,440         0.5 %
11-1021  General and Operations Managers                   1,697,690       $44.02         $51.91       $107,970         0.2 %
11-1031  Legislators                                       64,650          (4)            (4)          $37,980          1.1 %

and JOBS PRONE TO SHORTAGE /OFFSHORING

Computer and Mathematical Science Occupations: Wage Estimates

Code     Occupation Title                                  Employment (1)  Median Hourly  Mean Hourly  Mean Annual (2)  Mean RSE (3)
15-0000  Computer and Mathematical Science Occupations     3,308,260       $34.26         $35.82       $74,500          0.3 %
15-1011  Computer and Information Scientists, Research     26,610          $47.10         $48.51       $100,900         1.1 %
15-1021  Computer Programmers                              394,230         $33.47         $35.32       $73,470          0.6 %
15-1031  Computer Software Engineers, Applications         494,160         $41.07         $42.26       $87,900          0.4 %
15-1032  Computer Software Engineers, Systems Software     381,830         $44.44         $45.44       $94,520          0.5 %
15-1041  Computer Support Specialists                      545,520         $20.89         $22.29       $46,370          0.3 %
15-1051  Computer Systems Analysts                         489,890         $36.30         $37.90       $78,830          0.4 %
15-1061  Database Administrators                           115,770         $33.53         $35.05       $72,900          0.8 %
15-1071  Network and Computer Systems Administrators       327,850         $31.88         $33.45       $69,570          0.3 %
15-1081  Network Systems and Data Communications Analysts  230,410         $34.18         $35.50       $73,830          0.4 %
15-1099  Computer Specialists, All Other                   191,780         $36.13         $36.54       $76,000          0.5 %
15-2011  Actuaries                                         18,220          $40.77         $46.14       $95,980          1.4 %
15-2021  Mathematicians                                    2,770           $45.75         $45.65       $94,960          1.7 %
15-2031  Operations Research Analysts                      60,860          $33.17         $35.68       $74,220          0.8 %
15-2041  Statisticians                                     20,680          $34.91         $35.96       $74,790          1.5 %
15-2091  Mathematical Technicians                          1,100           $18.46         $20.24       $42,100          2.7 %
15-2099  Mathematical Science Occupations, All Other       6,600           $26.44         $31.55       $65,630          4.3 %

 

UNEMPLOYED IN THE USA (above)

BY STATE (below)

16 million people out of work. Give or take a million.

How can America pay 5.6 million people UNEMPLOYMENT BENEFITS

Keep another 10 million unemployed,

another 10 million only partially employed.

and still claim aggregate cost savings from offshoring jobs?

The Big Data Event- Why am I here?

I am here braving New York’s cold weather as I prepare for this evening’s events. If you follow this blog closely (including the poems), it is a welcome change. New York is a nice city: people are friendly if you ask them nicely, and the bus is a great way to watch the city. Best of all, I like the crowds, which I have grown used to while living in India.

Why Am I here?

Because the topics discussed here are so cutting edge that I cannot find anyone at university willing to teach me Hadoop and MapReduce and, at the same time, teach me how to do statistics on them (as in, how do we do a K-means clustering on a 1 terabyte dataset?).
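
For the curious, here is a rough idea of how that K-means question can be framed in map and reduce terms. This is a single-machine toy sketch in plain Python, not Hadoop code and not anything presented at the event: each "map" call handles one chunk of the data and emits per-centroid partial sums, and the "reduce" step merges those partial sums into new centroids. All function names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def map_chunk(chunk, centroids):
    """Map step: for one chunk, emit {centroid index: (sum vector, count)}."""
    partial = {}
    for point in chunk:
        k = nearest(point, centroids)
        sums, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([s + x for s, x in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce step: merge partial sums and return the updated centroids."""
    totals = {}
    for partial in partials:
        for k, (sums, count) in partial.items():
            t_sums, t_count = totals.get(k, ([0.0] * len(sums), 0))
            totals[k] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
    new_centroids = []
    for k, old in enumerate(centroids):
        if k in totals:
            sums, count = totals[k]
            new_centroids.append(tuple(s / count for s in sums))
        else:
            # No points were assigned to this centroid, so keep it unchanged.
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans_step(chunks, centroids):
    """One K-means iteration over data that arrives as separate chunks."""
    return reduce_partials((map_chunk(c, centroids) for c in chunks), centroids)


# Tiny made-up example: two chunks standing in for two data blocks.
chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
updated = kmeans_step(chunks, [(1.0, 1.0), (9.0, 9.0)])
print(updated)
```

The point of the design is that only the small per-centroid sums and counts travel between the map and reduce steps, never the raw rows, which is what makes the idea plausible at terabyte scale. You would repeat the step until the centroids stop moving.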

I asked the organizers what makes the event special (every event promises special mojo, after all).

This is what they said:

What is the unique value proposition of the event that will help developers and both current and potential customers?

The essence of the event is to explore new innovations in massively-parallel processing data warehousing technology and how it can help companies gain more insight from their data.  Applications include fraud detection, behavioral targeting, social network analysis, better predictions/forecasting, bioinformatics, etc.  We are exploring how MapReduce and Hadoop can be integrated into the enterprise IT system to help evolve data warehousing/BI/data mining

and to put it even more nicely:

“The industry’s first big data event, Big Data Summit ’09, being held this evening in New York City, will showcase Hadoop’s fit with MPP data warehouses. Aster Data will be presenting alongside Colin White, President and Founder of BI Research, Mike Brown of comScore Inc., and Jonathan Goldman, who represents LinkedIn.”

That’s good enough for me to drop into the Roosevelt Hotel on East 45th Street at around 6 pm for some reluctant networking (read: beers). Five years ago, while working for GE, I used to run queries using SAS on a 147 million row database (the size of the DB) and wait 3 hours for them to come back. Today that much data fits very snugly on my laptop. How soon will we have terabyte-level personal computing and petabyte-level business computing, and what challenges will that pose to standard statistical assumptions and to the syncing of hardware and software? Big, big data is an interesting area to watch.

Interview Professor John Fox Creator R Commander

Here is an interview with Prof John Fox, creator of the very popular R language based GUI, RCmdr.

Ajay- Describe your career in science from your high school days to the science books you have written. What do you think can be done to increase interest in science in young people?

John Fox- I’m a sociologist and social statistician, so I don’t have a career in science, as that term is generally understood. I was interested in science as a child, however: I attended a science high school in New York City (Brooklyn Tech), and when I began university in 1964 at New York’s City College, I started in engineering. I moved subsequently through majors in philosophy and psychology, before finishing in sociology — had I not graduated in 1968 I probably would have moved on to something else. I took a statistics course during my last year as an undergraduate and found it fascinating. I enrolled in the sociology graduate program at the University of Michigan, where I specialized in social psychology and demography, and finished with a PhD in 1972 when I was 24 years old. I became interested in computers during my first year in graduate school, where I initially learned to program in Fortran. I also took quite a few courses in statistics and math.

I haven’t written any science books, but I have written and edited a number of books on social statistics, including, most recently, Applied Regression Analysis and Generalized Linear Models, Second Edition (Sage, 2008).

I’m afraid that I don’t know how to interest young people in science. Science seemed intrinsically interesting to me when I was young, and still does.

Ajay- What prompted you to create R Commander? How would you describe R Commander as a tool, say for users of other languages who want to learn R but are afraid of the syntax?

John- I originally programmed the R Commander so that I could use R to teach introductory statistics courses to sociology undergraduates. I previously taught this course with Minitab or SPSS, which were programs that I never used for my own work. I waited for someone to come up with a simple, portable, easily installed point-and-click interface to R, but nothing appeared on the horizon, and so I decided to give it a try myself.

I suppose that the R Commander can ease users into writing commands, inasmuch as the commands are displayed, but I suspect that most users don’t look at them. I think that serious prospective users of R should be encouraged to use the command-line interface along with a script editor of some sort. I wouldn’t exaggerate the difficulty of learning R: I came to R — actually S then — after having programmed in perhaps a dozen other languages, most recently at that point Lisp, and found the S language particularly easy to pick up.

Ajay- I particularly like the R Cmdr plugins. Is it possible for anyone to extend R Commander with a customized plug-in package?

John- That’s the basic idea, though the plug-in author has to be able to program in R and must learn a little Tcl/Tk.

Ajay- Have you thought of using the R Commander GUI on Amazon EC2, thus making R high performance computing available on demand (similar to Zementis model deployment using Amazon EC2)? What are your views on the future of statistical computing?

John- I’m not sure whether or how an interface like the Rcmdr, which is Tcl/Tk-based, can be adapted to cloud computing. I also don’t feel qualified to predict the future of statistical computing.

I think that R is where the action is for the near future.

Ajay- What are the best ways of using R Commander as a teaching tool? (I noticed the help is a bit outdated.)

John- Is the help a bit outdated? My intention is that the R Commander should be largely self-explanatory. Most people know how to use point-and-click interfaces. In the basic courses for which it is principally designed, my goals are to teach the essential ideas of statistical reasoning and some skills in data analysis. In this kind of course, statistical software should facilitate the basic goals of the course.

As I said, for serious data analysis, I believe that it’s a good idea to encourage use of the command-line interface.

Ajay- What are your views on R being recognized by SAS Institute for its IML product? Do you think there can be a middle way for open source and proprietary software to coexist?

John- I imagine that R is a challenge for producers of proprietary software like SAS, partly because R development moves more quickly, but also because R is giving away something that SAS and other vendors of proprietary statistical software are selling. For example, I once used SAS quite a bit but don’t anymore. I also have the sense that for some time SAS has directed its energies more toward business uses of its software than toward purely statistical applications.

Ajay- Do people in the R Core team recognize the importance of a GUI? What does the rest of the R community feel? What has the feedback of users been to you? Any plans to seek corporate sponsors for R Commander? (Rattle, an R language data mining GUI, has a version called Rstat at http://www.informationbuilders.com/products/webfocus/predictivemodeling.html while the free version and code is at rattle.togaware.com.)

John- I feel that the R Commander GUI has been generally positively received, both by members of R Core who have said something about it to me and by others in the R community. Of course, a nice feature of the R package system is that people can simply ignore packages in which they have no interest. I noticed recently that a Journal of Statistical Software paper that I wrote several years ago on the Rcmdr package has been downloaded nearly 35,000 times.

Because I wouldn’t expect many students using the Rcmdr package in a course to read that paper, I expect that the package is being used fairly widely.

Ajay- What does John Fox do for fun or as a hobby?

John- I’m tempted to say that much of my work is fun — particularly doing research, writing programs, and writing papers and books. I used to be quite a serious photographer, but I haven’t done that in years, and the technology of photography has changed a great deal. I run and swim for exercise, but that’s not really fun. I like to read and to travel, but who doesn’t?

Biography-

Prof John Fox is a giant in his chosen fields. He has edited or authored 13 books, written chapters for 12 more, and published almost 49 journal articles. He is also editor in chief of the R News newsletter. You can read more about Dr Fox at http://socserv.mcmaster.ca/jfox/

On R Cmdr-

R Cmdr has substantially lowered the hygiene factor for people wanting to learn R: they begin with the GUI and later transition to customization using the command line. It is so simple in its design that even undergraduates have started basic data analysis with R Cmdr after just one class. You can read more on it at http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/Getting-Started-with-the-Rcmdr.pdf

ROC Curve

The ROC curve is a nice modeling concept to know, as it will be used in practice in nearly all classification models, irrespective of the specific technique and irrespective of the statistical software.

We use Wikipedia for referring to easy-to-implement statistics rather than crusty, thick books which seem prohibitively dense and opaque to outsiders.

This is how the terms behind the ROC curve are defined.

                          actual value
                          p                 n                 total
prediction     p’         True Positive     False Positive    P’
outcome        n’         False Negative    True Negative     N’
               total      P                 N

true positive (TP)
eqv. with hit
true negative (TN)
eqv. with correct rejection
false positive (FP)
eqv. with false alarm, Type I error
false negative (FN)
eqv. with miss, Type II error
true positive rate (TPR)
eqv. with hit rate, recall, sensitivity
TPR = TP / P = TP / (TP + FN)
false positive rate (FPR)
eqv. with false alarm rate, fall-out
FPR = FP / N = FP / (FP + TN)
accuracy (ACC)
ACC = (TP + TN) / (P + N)
specificity (SPC)
SPC = TN / (FP + TN) = 1 - FPR
positive predictive value (PPV)
eqv. with precision
PPV = TP / (TP + FP)
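
The formulas above translate almost line for line into code. Below is a minimal Python sketch, assuming binary labels coded as 1 (positive) and 0 (negative) and a model that outputs a score per case; the function names and the toy numbers are made up for illustration. The second function sweeps the classification threshold over the scores and collects one (FPR, TPR) point per threshold, which is what gets plotted as the ROC curve.

```python
def rates(tp, fp, fn, tn):
    """Summary rates from the four confusion-matrix counts defined above."""
    p = tp + fn  # all actual positives
    n = fp + tn  # all actual negatives
    return {
        "TPR": tp / p,               # hit rate, recall, sensitivity
        "FPR": fp / n,               # false alarm rate, fall-out
        "ACC": (tp + tn) / (p + n),  # accuracy
        "SPC": tn / n,               # specificity, equal to 1 - FPR
        "PPV": tp / (tp + fp),       # precision
    }


def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold over the scores.

    A case is predicted positive when its score is >= the threshold.
    Assumes labels contains at least one 1 and at least one 0.
    """
    p = sum(labels)
    n = len(labels) - p
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n, tp / p))
    return points


# Toy example with invented counts and scores.
example_rates = rates(tp=40, fp=10, fn=20, tn=30)
example_curve = roc_points([0.9, 0.8, 0.7, 0.4], [1, 0, 1, 0])
print(example_rates)
print(example_curve)
```

A curve that hugs the top-left corner (high TPR at low FPR) indicates a better model; the diagonal from (0, 0) to (1, 1) is what random guessing would give.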

Here is a good Java-enabled page to calculate the ROC curve.

http://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html

And in case anyone asks, ROC stands for Receiver Operating Characteristic.