January 2010 – Page 2 – DECISION STATS

MapReduce Patent Granted

After 5 years of third-party validation and almost 10 years of internal validation at Google, the fastest way to crunch data belongs to the people who created it first: Google Inc.

From

http://www.google.com/patents/about?id=XLfIAAAAEBAJ

Citations

Patent Number	Title	Issue date
4876643	Parallel searching system having a master processor for controlling plural slave processors for independently processing respective search requests	Oct 24, 1989
5345584	System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices	Sep 6, 1994
5414849	Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method	May 9, 1995
5414899	Pivot structure from a lock handle	May 16, 1995
5471622	Run-time system having nodes for identifying parallel tasks in a logic program and searching for available nodes to execute the parallel tasks	Nov 28, 1995
5590319	Query processor for parallel processing in homogenous and heterogenous databases	Dec 31, 1996
5806059	Database management system and method for query process for the same	Sep 8, 1998
5819251	System and apparatus for storage retrieval and analysis of relational and non-relational data	Oct 6, 1998
5870743	Method and apparatus for parallelizing operations that create a table	Feb 9, 1999
5884299	Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations	Mar 16, 1999
5884303	Parallel searching technique	Mar 16, 1999
5920854	Real-time document collection search engine with phrase indexing	Jul 6, 1999
5956704	Method and apparatus for parallelizing operations that insert data into an existing data container	Sep 21, 1999
5963954	Method for mapping an index of a database into an array of files	Oct 5, 1999
6006224	Crucible query system	Dec 21, 1999
6026394	System and method for implementing parallel operations in a database management system	Feb 15, 2000
6182061	Method for executing aggregate queries, and computer system	Jan 30, 2001
6226635	Layered query management	May 1, 2001
6256621	Database management system and query operation therefor, including processing plural database operation requests based on key range of hash code	Jul 3, 2001
6301574	System for providing business information	Oct 9, 2001
6366904	Machine-implementable method and apparatus for iteratively extending the results obtained from an initial query in a database	Apr 2, 2002
6408292	Method of and system for managing multi-dimensional databases using modular-arithmetic based address data mapping processes on integer-encoded business dimensions	Jun 18, 2002
6556988	Database management apparatus and query operation therefor, including processing plural database operation requests based on key range of hash code	Apr 29, 2003
6567806	System and method for implementing hash-based load-balancing query processing in a multiprocessor database system	May 20, 2003
6741992	Flexible rule-based communication system and method for controlling the flow of and access to information between computer users	May 25, 2004
6910070	Methods and systems for asynchronous notification of database events	Jun 21, 2005
6961723	System and method for determining relevancy of query responses in a distributed network search mechanism	Nov 1, 2005
6983322	System for discrete parallel processing of queries and updates	Jan 3, 2006
7099871	System and method for distributed real-time search	Aug 29, 2006
7103590	Method and system for pipelined database table functions	Sep 5, 2006
7146365	Method, system, and program for optimizing database query execution	Dec 5, 2006
7430549	Optimized SQL code generation	Sep 30, 2008
7433863	SQL code generation for heterogeneous environment	Oct 7, 2008

Claims

What is claimed is:

1. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;

allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;

after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:

for each data record in at least a subset of the group of the stored data records allocated to the respective process:

creating a parsed representation of the data record;

applying a procedural language query to the parsed representation of the data record to extract one or more values, wherein the procedural language query is applied independently to each parsed representation; and

applying a respective emit operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions;

in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and

combining the produced aggregated data to produce output data.

2. The method of claim 1, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions.

3. The method of claim 2, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

4. The method of claim 1, wherein the applying the procedural language query to the parsed representation of the data record to extract the one or more values and the applying the respective emit operator to at least one of the one or more values to add the corresponding information to the respective intermediate data structure are performed independently for each data record.

5. The method of claim 1, wherein the parsed representation of the data record comprises a key-value pair.

6. The method of claim 1, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the extracted one or more values.

7. The method of claim 6, wherein the aggregating information from the subset of the intermediate data structures to produce the aggregated data combines the extracted one or more values having the same index values.

8. The method of claim 1, wherein

when applying the procedural language query to the parsed representation produces a plurality of values, applying the respective emit operator to each of the produced plurality of values to add corresponding information to the respective intermediate data structure.

9. The method of claim 1, wherein the second plurality of processes are executing in parallel.

10. The method of claim 1, wherein the allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel is application independent, and the procedural language query is application dependent.

11. The method of claim 1, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

12. The method of claim 1, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of indices is dynamically generated in accordance with the extracted one or more values.

13. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;

allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;

after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:

for each data record in at least a subset of the group of stored data records allocated to the respective process:

creating a parsed representation of the data record;

applying a procedural language query to the parsed representation of the data record to extract one or more values; and

applying a respective operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure;

in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and

combining the produced aggregated data to produce output data.

14. A computer system with one or more processors and memory for analyzing data records, wherein the data records are stored in one or more data centers, the computer system comprising:

a first plurality of processes operating in parallel, each of which is allocated a group of stored data records to process;

each respective process of the first plurality of processes including instructions for:

creating a parsed representation of each data record in at least a subset of the group of stored data records allocated to the respective process after the group of stored data records is allocated to the respective process;

applying a procedural language query to the parsed representation of each stored data record in at least the subset of the group of stored data records allocated to the respective process to produce one or more values; and

applying one or more emit operators to each of the one or more produced values to add corresponding information to an intermediate data structure; and

at least one aggregating process for aggregating information from a plurality of the intermediate data structures to produce output data.

15. The system of claim 14, wherein the at least one aggregating process for aggregating information comprises a second plurality of processes operating in parallel, wherein each respective process of the second plurality of processes operating in parallel includes instructions for aggregating information from the plurality of the intermediate data structures to produce the output data.

16. The system of claim 14, wherein the intermediate data structure comprises a table.

17. The system of claim 15, wherein at least one process of the second plurality of processes operating in parallel includes instructions for combining the output data to produce aggregated output data.

18. The system of claim 14, wherein each of the one or more emit operators implements one of a predefined set of application-independent statistical information processing functions.

19. The system of claim 18, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

20. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values include instructions for applying the procedural language query independently to each data record.

21. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values and instructions for applying the one or more emit operators to each of the one or more produced values to add the corresponding information to the intermediate data structure include instructions for applying the procedural language query and the one or more emit operators independently to each data record.

22. The system of claim 14, wherein the at least one aggregating process for aggregating information is configured to aggregate, in each respective process of a second plurality of processes, the information from the plurality of the intermediate data structures to produce the output data.

23. The system of claim 14, wherein each parsed representation of each data record comprises a key-value pair.

24. The system of claim 14, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the produced values.

25. The system of claim 24, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data includes instructions for combining the one or more produced values having the same index values.

26. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each stored data record include instructions for applying the one or more emit operators to each of a plurality of produced values to add corresponding information to the intermediate data structure.

27. The system of claim 14, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data comprises a second plurality of processes executing in parallel.

28. The system of claim 14, wherein the system is configured such that the allocation of stored data records to each respective process of the first plurality of processes is application independent, and wherein the procedural language query is application dependent.

29. The system of claim 14, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

30. The system of claim 14, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of the indices is dynamically generated in accordance with the one or more produced values.
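To make the claim language a little more concrete, here is a minimal Python sketch of the pipeline described in claim 1: records are split into groups, each worker parses its records, runs a query to extract values, and "emits" them into an intermediate table, after which the tables are aggregated by index. This is only an illustration under my own assumptions, not Google's implementation: threads stand in for the patent's parallel processes, the `key=value` record format is invented, and the names `parse`, `query`, `first_phase`, `aggregate` and `analyze` are hypothetical. The two emit operators shown (counting distinct values and keeping a maximum) are taken from the list in claim 3.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log records; the "key=value" layout is an assumption for illustration.
RECORDS = [
    "user=alice bytes=120",
    "user=bob bytes=300",
    "user=alice bytes=80",
    "user=carol bytes=50",
    "user=bob bytes=10",
    "user=alice bytes=700",
]


def parse(record):
    """Create a parsed representation of one record (a dict of key-value pairs)."""
    return dict(field.split("=", 1) for field in record.split())


def query(parsed):
    """Application-dependent step: extract the values of interest from one record."""
    return parsed["user"], int(parsed["bytes"])


def first_phase(group):
    """One worker of the first phase: parse, query, emit into intermediate tables."""
    counts = Counter()  # emit operator: count occurrences of distinct values
    maxima = {}         # emit operator: maximum value seen per index
    for record in group:
        user, nbytes = query(parse(record))
        counts[user] += 1
        maxima[user] = max(nbytes, maxima.get(user, nbytes))
    return counts, maxima


def aggregate(intermediates):
    """Second phase: combine entries that share the same index value."""
    total_counts, total_max = Counter(), {}
    for counts, maxima in intermediates:
        total_counts.update(counts)
        for user, value in maxima.items():
            total_max[user] = max(value, total_max.get(user, value))
    return dict(total_counts), total_max


def analyze(records, workers=3):
    """Allocate groups of records to workers, then aggregate their tables."""
    groups = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        intermediates = list(pool.map(first_phase, groups))
    return aggregate(intermediates)


counts, maxima = analyze(RECORDS)
print(counts)
print(maxima)
```

The split into groups does not depend on what the query does, which mirrors the distinction in claims 10 and 28 between application-independent allocation and an application-dependent query.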

China bans Chinese Food for Googleplex

This is a direct result of Google's stand on principles (see below). No Google for China means no Chinese food for Googlers. But seriously.

http://googleblog.blogspot.com/2010/01/new-approach-to-china.html

In mid-December, we detected a highly sophisticated and targeted attack on our corporate infrastructure originating from China that resulted in the theft of intellectual property from Google. However, it soon became clear that what at first appeared to be solely a security incident–albeit a significant one–was something quite different.

First, this attack was not just on Google. As part of our investigation we have discovered that at least twenty other large companies from a wide range of businesses–including the Internet, finance, technology, media and chemical sectors–have been similarly targeted. We are currently in the process of notifying those companies, and we are also working with the relevant U.S. authorities.

Second, we have evidence to suggest that a primary goal of the attackers was accessing the Gmail accounts of Chinese human rights activists. Based on our investigation to date we believe their attack did not achieve that objective. Only two Gmail accounts appear to have been accessed, and that activity was limited to account information (such as the date the account was created) and subject line, rather than the content of emails themselves.

Third, as part of this investigation but independent of the attack on Google, we have discovered that the accounts of dozens of U.S.-, China- and Europe-based Gmail users who are advocates of human rights in China appear to have been routinely accessed by third parties. These accounts have not been accessed through any security breach at Google, but most likely via phishing scams or malware placed on the users’ computers.

Algorithms and Ads: No Free Lunches and Hill Climbing

From http://www.no-free-lunch.org/

More formally, where
d = training set;
m = number of elements in training set;
f = ‘target’ input-output relationships;
h = hypothesis (the algorithm’s guess for f made in response to d); and
C = off-training-set ‘loss’ associated with f and h (‘generalization error’)
all algorithms are equivalent, on average, by any of the following measures of risk: E(C|d), E(C|m), E(C|f,d), or E(C|f,m).

How well you do is determined by how ‘aligned’ your learning algorithm P(h|d) is with the actual posterior, P(f|d).

Wolpert’s result, in essence, formalizes Hume, extends him and calls the whole of science into question.
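A toy check of the "all algorithms are equivalent, on average" statement can be written in a few lines of Python. The sketch below is my own illustration, not code from no-free-lunch.org: it assumes a four-point input space, binary labels, zero-one loss, and equal weight on every possible target f, and it compares two hypothetical learners (`always_zero` and `majority`) on the points outside the training set.

```python
from itertools import product

X = [0, 1, 2, 3]   # tiny input space (an assumption for illustration)
TRAIN_X = [0, 1]   # inputs that appear in the training set d
TEST_X = [2, 3]    # off-training-set inputs


def always_zero(train):
    """Learner A: ignore the data and always predict 0."""
    return lambda x: 0


def majority(train):
    """Learner B: predict the most common training label (ties go to 1)."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess


def average_ots_error(learner):
    """Average off-training-set error over every possible target f: X -> {0, 1}."""
    errors = []
    for labels in product([0, 1], repeat=len(X)):
        f = dict(zip(X, labels))            # one candidate target function
        d = [(x, f[x]) for x in TRAIN_X]    # training set drawn from f
        h = learner(d)                      # the learner's hypothesis
        errors.append(sum(h(x) != f[x] for x in TEST_X) / len(TEST_X))
    return sum(errors) / len(errors)


err_a = average_ots_error(always_zero)
err_b = average_ots_error(majority)
print(err_a, err_b)
```

Both averages come out identical because, once every f is weighted equally, the labels on the unseen points are independent of anything in d. A learner only gains an edge when its guesses line up with a non-uniform P(f|d), which is the "alignment" point made in the quote.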

Bing Ad

Make Bing your decision engine

Google Ad

_null_

From http://en.wikipedia.org/wiki/Hill_climbing

Hill climbing is a mathematical optimization technique which belongs to the family of local search. It is relatively simple to implement, making it a popular first choice. Although more advanced algorithms may give better results, in some situations hill climbing works just as well.

Hill climbing can be used to solve problems that have many solutions, some of which are better than others. It starts with a random (potentially poor) solution, and iteratively makes small changes to the solution, each time improving it a little. When the algorithm cannot see any improvement anymore, it terminates. Ideally, at that point the current solution is close to optimal, but it is not guaranteed that hill climbing will ever come close to the optimal solution.

For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained.

Hill climbing is used widely in artificial intelligence, for reaching a goal state from a starting node. Choice of next node and starting node can be varied to give a list of related algorithms.
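The traveling salesman example in the quote can be sketched in Python as follows. This is a minimal illustration under my own assumptions: the city coordinates are made up, the "small change" is a swap of two positions in the visiting order, and the names `tour_length` and `hill_climb` are hypothetical. It stops at the first tour that no single swap can shorten, which may be only a local optimum.

```python
import itertools
import math
import random

# Hypothetical city coordinates, chosen only for illustration.
CITIES = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 2), (5, 1), (4, 5), (2, 6)]


def tour_length(order, cities=CITIES):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(
        math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def hill_climb(cities=CITIES, seed=0):
    """Start from a random tour and keep swapping two cities while that helps."""
    rng = random.Random(seed)
    order = list(range(len(cities)))
    rng.shuffle(order)                      # random, potentially poor, start
    best = tour_length(order, cities)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]      # try a small change
            candidate = tour_length(order, cities)
            if candidate < best - 1e-12:
                best = candidate                          # keep the improvement
                improved = True
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best


order, length = hill_climb()
print(order, round(length, 3))
```

Running it with different `seed` values gives different starting tours and sometimes different final tours, which is the usual reason hill climbing is restarted several times in practice.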

Bing Ad for Hill Climbing-

Climbing at Amazon

Buy books at Amazon.com and save. Qualified orders over $25 ship free

Amazon.com/books

Google Ad for Hill Climbing Algorithm

_null_

A year after Google's "Kill Bill" OS announcements and Ballmer's "let's buy our way outta here", there still seems to be more sense in sticking to Google's ad algorithms. Unless you want to climb Microsoft's online hills only to find there is no free lunch in their ad rates and offers.

Like the free and virus-prone browser.

Dude, Where’s my Water!

A recent extract from the “independent” Times of India – privately owned and indeed the World’s largest newspaper in English

http://timesofindia.indiatimes.com/india/West-uses-glacier-theory-to-flog-India-on-climate-change/articleshow/5482652.cms

NEW DELHI: IPCC’s admission of getting its facts on Himalayan glaciers completely wrong has again brought out concerns about the use of science, and pseudo-science, to put pressure on India to take stronger action on climate change or to put greater responsibility for the climate crisis on it.

The ‘2035 demise’ date drawn by IPCC in its fourth assessment report for Himalayan glaciers was used very often to demand that India should take greater action to reduce its emissions in order to protect people from catastrophes like glacial melts and floods. Similarly, a ‘premature’ release of information on the so-called Asian Brown Cloud was used by several western NGOs and governments to pin the blame on the melting of glaciers and other climate change impacts on pollution from burning firewood and cow dung in India.

I had earlier pointed out the same on January 5, based on my proximity to Oak Ridge, TN and some data (see https://decisionstats.wordpress.com/2010/01/05/climate-die-oxide/):

1) What is the expected date of melting of the glaciers in the Himalayas, thus affecting sacred rivers like the Ganges and also causing floods in densely populated Asia? How would nation states with shareable resources like water react to the disputes, dams, hydroelectricity and floods?

2) How would you count per capita CO2 consumption? Assume a factory in China emits 3 tonnes of CO2 every year but exports all its products to the USA on an Indian cargo ship. Travel contributes another 1 tonne of CO2, including air travel, visits, etc.

As of now this will be counted as 3 tonnes for China, 1 tonne for India and X tonnes for the USA. What is wrong with these assumptions?
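The arithmetic behind that question can be laid out in a short Python sketch. It uses only the numbers from the example above and contrasts two bookkeeping rules, counting emissions where they are produced versus where the goods are consumed. The labels "producer" and "consumer", the function name `attribute`, and the choice to book the whole 1 tonne of travel to the Indian carrier are my own simplifying assumptions, not an official accounting standard.

```python
# Emissions from the example, in tonnes of CO2 per year.
# Each entry: (where it is emitted, who consumes the product, tonnes).
FLOWS = [
    ("China", "USA", 3.0),  # factory output, all exported to the USA
    ("India", "USA", 1.0),  # shipping and travel, booked here to the Indian carrier
]


def attribute(flows, by):
    """Sum emissions per country, either by 'producer' or by 'consumer'."""
    column = {"producer": 0, "consumer": 1}[by]
    totals = {}
    for flow in flows:
        country = flow[column]
        totals[country] = totals.get(country, 0.0) + flow[2]
    return totals


production_based = attribute(FLOWS, "producer")
consumption_based = attribute(FLOWS, "consumer")
print(production_based)
print(consumption_based)
```

Under the first rule the USA's share from this factory is zero; under the second it carries all 4 tonnes. Per capita figures inherit whichever rule is chosen.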

Indeed, I gave a presentation to senior Times Group people on using data, which is available on my LinkedIn profile with the Google Docs presentation at

http://linkedin.com/in/ajayohri

Who is correct, the Indians or the Cowboys? See the NYT article:

http://www.nytimes.com/2010/01/05/science/earth/05satellite.html

The nation’s top scientists and spies are collaborating on an effort to use the federal government’s intelligence assets — including spy satellites and other classified sensors — to assess the hidden complexities of environmental change. They seek insights from natural phenomena like clouds and glaciers, deserts and tropical forests.

It is no coincidence that this comes close on the heels of India's national security function being totally revamped:

http://timesofindia.indiatimes.com/india/Narayanans-exit-gives-full-control-of-internal-security-to-Chidambaram/articleshow/5474408.cms

The exit of M K Narayanan as national security advisor has set the stage for a significant re-ordering of UPA-2’s power structure with home minister P Chidambaram set to gain fuller control of internal security reducing the role of the next NSA to foreign policy.

Debate and discussion between the freest and the largest democracies are welcome steps.

But who is right?

Are climate change negotiations also a proxy for negotiations on terror cooperation? As I have pointed out, the Sikhs and the Indians remain the only forces to have been in Kabul: the Sikhs in recent history (late 18th-19th century; source: A Brief History of Sikhs) and the Indians in ancient history (8th century AD), while Churchill’s memoirs in Young Winston talk of the stellar role of the Indian Army in Afghanistan or NWFP. Remember, we have been here before: the Bush Administration negotiated and failed to get Indian troops in Iraq in 2004 over lack of monetary negotiations. The Indians turned out to be right on true costs!

Are the Chinese or the Americans using India’s insecurities as a proxy?

PS, on movies: why was Paani, the documentary by Shekhar Kapur (director of the Oscar-nominated Elizabeth), stopped due to funding issues?

How can ice melting at the North Pole lead to a lack of water? Do water projections take into account that rainwater harvesting has been low in India, and that ancient Indian religion is okay with the Saraswati as one disappeared river? If the Ganges dries up, the people in India may riot, or may just blame it on sin and build smaller rainwater dams.

Dude, Where’s my water? When is it gonna go?

R for Stats : Updated

Here is the new website for statistical analysis using the free analytical software called R, which is enabled for cloud computing as well: see http://bit.ly/OhriCloud

or http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

for the R tutorial on running it on Amazon’s EC2 pay-per-demand RAM.

It is called R 4 stats or simply http://www.r4stats.com/

Hosted on Google’s updated Google Sites platform, it offers a preview of the update to Bob’s earlier runaway hit, R for SAS and SPSS Users, as well as his upcoming work, R for Stata Users.

In Bob’s own words:

I have substantially expanded the table that compares SAS and SPSS
add-on modules to somewhat equivalent R packages. This new version is
at:
http://r4stats.com/add-on-modules
and I would very much appreciate any feedback you might have on it.

The site http://r4stats.com is the replacement to
http://RforSASandSPSSusers.com and includes the support files for both
“R for SAS and SPSS Users” and the new “R for Stata Users”, due out in
March from Springer.

Topic	SAS Product	SPSS Product	R Package
Advanced Models	SAS/STAT	IBM SPSS Advanced Statistics	R, MASS, many others
Association Analysis	Enterprise Miner	IBM SPSS Association	arules, arulesNBMiner, arulesSequences
Basics	Base SAS	IBM SPSS Statistics Base	R
Bootstrapping	SAS/STAT	IBM SPSS Bootstrapping	BootCL, BootPR, boot, bootRes, BootStepAIC, bootspecdens, bootstrap, FRB, gPdtest, meboot, multtest, pvclust, rqmcmb2, scaleboot, simpleboot
Classification Analysis	Enterprise Miner	IBM SPSS Classification	rattle, see the neural networks and trees entries in this table.
Conjoint Analysis	SAS/STAT: PROC TRANSREG	IBM SPSS Conjoint	homals, psychoR, bayesm
Correspondence Analysis	SAS/STAT: PROC CORRESP	IBM SPSS Categories	ade4, cocorresp, FactoMineR, homals, made4, MASS, psychoR, PTAk, vegan
Custom Tables	Base SAS, PROC REPORT, PROC SQL, PROC TABULATE, Enterprise Reporter	IBM SPSS Custom Tables	reshape
Data Access	SAS/ACCESS	SPSS Data Access Pack	DBI, foreign, Hmisc: sas.get, sasxport.get, RODBC
Data Collection	SAS/FSP	IBM SPSS Data Collection Family	RSQLite, and the other open source programs MySQL or PostgreSQL are popular among R users for this purpose.
Data Mining	Enterprise Miner	IBM SPSS Modeler (formerly Clementine)	arules, FactoMineR, rattle, various functions
Data Mining, In-database Processing	SAS In-Database Initiative with Teradata	IBM SPSS Modeler	PL/R
Data Preparation	Various procedures	IBM SPSS Data Preparation, various commands	dprep, plyr, reshape, sqldf, various functions
Developer Tools	SAS/AF, SAS/FSP, SAS Integration Technologies, SAS/TOOLKIT	IBM SPSS Statistics Developer, IBM SPSS Statistics Programmability Extension	StatET, R links to most popular compilers, scripting languages, and databases.
Direct Marketing	Nothing quite like it	IBM SPSS Direct Marketing	Nothing quite like it
Exact Tests	SAS/STAT various	IBM SPSS Exact Tests	coin, elrm, exactLoglinTest, exactmaxsel, and options in many others
Excel Integration	SAS Enterprise BI Server	IBM SPSS Advantage for Excel 2007	RExcel
Forecasting	SAS/ETS	IBM SPSS Forecasting	Over 40 packages that do time series are described at the Task View link above under Time Series.
Forecasting, Automated	Forecast Server	IBM SPSS Forecasting	forecast
Genetics	JMP Genomics	None	http://www.bioconductor.org
Geographic Information Systems	SAS/GIS, SAS/GRAPH	None (Maps is defunct)	maps, mapdata, mapproj, GRASS via spgrass6, RColorBrewer, see Spatial in Task Views at link at top
Graphical user interfaces	Enterprise Guide, IML Studio, SAS/ASSIST, Analyst, Insight	IBM SPSS Statistics Base	Deducer, JGR, R Commander, pmg, rattle, many others at http://www.sciviews.org/_rgui/
Graphics, Interactive	SAS/IML Studio, SAS/INSIGHT, JMP	None	GGobi via rggobi, iPlots, latticist, playwith
Graphics, Static	SAS/GRAPH	SPSS Base, Graphics Production Language	ggplot2, gplots, graphics, grid, gridBase, hexbin, lattice, plotrix, scatterplot3d, vcd, vioplot, geneplotter, Rgraphics
Graphics, Template Builder	Doesn’t use Grammar of Graphics model that forms the core of IBM SPSS Viz Designer or R’s ggplot2	IBM SPSS Viz Designer	Doesn’t use templates, but this GUI for ggplot2 http://www.stat.ucla.edu/~jeroen/ggplot2.html works similarly to IBM SPSS Viz Designer.
Guided Analytics	SAS/LAB	None	None
Matrix/linear Algebra	SAS/IML Studio	IBM SPSS Matrix	R, matlab, Matrix, sparseM
Missing Values Imputation	SAS/STAT: PROC MI	IBM SPSS Missing Values	amelia, Hmisc: aregImpute, EMV, rms (replaces Design): fit.mult.impute, mice, mitools, mvnmle, VIM
Neural Networks	Enterprise Miner	IBM SPSS Neural Networks	AMORE, grnnR, neuralnet, nnet, rattle
Operations Research	SAS/OR	None	glpk, linprog, LowRankQP, TSP
Power Analysis	SAS Power and Sample Size Application, SAS/STAT: PROC POWER, PROC GLMPOWER	SamplePower	asypow, powerpkg, pwr, MBESS
Quality Control	SAS/QC	IBM SPSS Statistics Base	qcc, spc
Regression Models	SAS/STAT	IBM SPSS Regression	R, Hmisc, lasso, VGAM, pda, rms (replaces Design)
Sampling, Complex	SAS/STAT: PROC SURVEY SELECT, SURVEYMEANS, etc.	IBM SPSS Complex Samples	pps, sampfling, sampling, spsurvey, survey
Segmentation Analysis	Enterprise Miner	IBM Modeler Segmentation	cluster, rattle, som, see CRAN Task Views under Cluster for over 70 packages
Server Version	SAS for your particular server	IBM SPSS Statistics Server, IBM SPSS Modeler Server	rapache, R(D)COM Server, Rserve, StatET
Structural Equation Modeling	SAS/STAT: PROC CALIS	Amos	OpenMX, sem
Text Analysis/Mining	Text Miner	IBM SPSS Text Analytics, IBM SPSS Text Analysis for Surveys	Rstem, las, tm
Trees, Decision, Classification or Regression	Enterprise Miner	IBM SPSS Decision Trees, IBM SPSS AnswerTree, IBM SPSS Modeler (formerly Clementine)	ada, adabag, BayesTree, boost, GAMboost, gbev, gbm, maptree, mboost, mvpart, party, pinktoe, quantregForest, rpart, rpart.permutation, randomForest, rattle, tree

All SAS and SPSS product names are registered trademarks of their respective companies.

Disclaimer: Bob Muenchen and I work for the same university. While we do often have interesting conflicts, his interview was one of the earliest on this blog.

See- http://sites.google.com/site/r4statistics/interview

3 Idiots: Insight to Indian Engineer Campus Life

Ever wondered what makes Indian engineers so, ahem, hard-working? Or just in the mood to sample a Bollywood movie? Here is 2009’s best movie, an all-time top grosser from the Oscar-nominated Aamir Khan.

It’s called 3 Idiots, and it follows the adventures of 3-5 engineering students as they face academic and peer pressure challenges. Awesome. It is loosely based on Chetan Bhagat’s book about 3 IIT friends.

Here is a preview of the video-

(Note the students praying for good grades).


Topic	SAS Product	SPSS Product	R Package
Advanced Models	SAS/STAT	IBM SPSS Advanced Statistics	R, MASS, many others
Association Analysis	Enterprise Miner	IBM SPSS Association	arules, arulesNBMiner, arulesSequences
Basics	Base SAS	IBM SPSS Statistics Base	R
Bootstrapping	SAS/STAT	IBM SPSS Bootstrapping	BootCL, BootPR, boot, bootRes, BootStepAIC, bootspecdens, bootstrap, FRB, gPdtest, meboot, multtest, pvclust, rqmcmb2, scaleboot, simpleboot
Classification Analysis	Enterprise Miner	IBM SPSS Classification	rattle, see the neural networks and trees entries in this table.
Conjoint Analysis	SAS/STAT: PROC TRANSREG	IBM SPSS Conjoint	homals, psychoR, bayesm
Correspondence Analysis	SAS/STAT: PROC CORRESP	IBM SPSS Categories	ade4, cocorresp, FactoMineR, homals, made4, MASS, psychoR, PTAk, vegan
Custom Tables	Base SAS, PROC REPORT, PROC SQL, PROC TABULATE, Enterprise Reporter	IBM SPSS Custom Tables	reshape
Data Access	SAS/ACCESS	SPSS Data Access Pack	DBI, foreign, Hmisc: sas.get, sasxport.get, RODBC
Data Collection	SAS/FSP	IBM SPSS Data Collection Family	RSQLite; the open source databases MySQL and PostgreSQL are also popular among R users for this purpose.
Data Mining	Enterprise Miner	IBM SPSS Modeler (formerly Clementine)	arules, FactoMineR, rattle, various functions
Data Mining, In-database Processing	SAS In-Database Initiative with Teradata	IBM SPSS Modeler	PL/R
Data Preparation	Various procedures	IBM SPSS Data Preparation, various commands	dprep, plyr, reshape, sqldf, various functions
Developer Tools	SAS/AF, SAS/FSP, SAS Integration Technologies, SAS/TOOLKIT	IBM SPSS Statistics Developer, IBM SPSS Statistics Programmability Extension	StatET, R links to most popular compilers, scripting languages, and databases.
Direct Marketing	Nothing quite like it	IBM SPSS Direct Marketing	Nothing quite like it
Exact Tests	SAS/STAT various	IBM SPSS Exact Tests	coin, elrm, exactLoglinTest, exactmaxsel, and options in many others
Excel Integration	SAS Enterprise BI Server	IBM SPSS Advantage for Excel 2007	RExcel
Forecasting	SAS/ETS	IBM SPSS Forecasting	Over 40 packages that do time series are described at the Task View link above under Time Series.
Forecasting, Automated	Forecast Server	IBM SPSS Forecasting	forecast
Genetics	JMP Genomics	None	http://www.bioconductor.org
Geographic Information Systems	SAS/GIS, SAS/GRAPH	None (Maps is defunct)	maps, mapdata, mapproj, GRASS via spgrass6, RColorBrewer, see Spatial in Task Views at link at top
Graphical user interfaces	Enterprise Guide, IML Studio, SAS/ASSIST, Analyst, Insight	IBM SPSS Statistics Base	Deducer, JGR, R Commander, pmg, rattle, many others at http://www.sciviews.org/_rgui/
Graphics, Interactive	SAS/IML Studio, SAS/INSIGHT, JMP	None	GGobi via rggobi, iPlots, latticist, playwith
Graphics, Static	SAS/GRAPH	SPSS Base, Graphics Production Language	ggplot2, gplots, graphics, grid, gridBase, hexbin, lattice, plotrix, scatterplot3d, vcd, vioplot, geneplotter, Rgraphics
Graphics, Template Builder	Doesn’t use Grammar of Graphics model that forms the core of IBM SPSS Viz Designer or R’s ggplot2	IBM SPSS Viz Designer	Doesn’t use templates, but this GUI for ggplot2 http://www.stat.ucla.edu/~jeroen/ggplot2.html works similarly to IBM SPSS Viz Designer.
Guided Analytics	SAS/LAB	None	None
Matrix/linear Algebra	SAS/IML Studio	IBM SPSS Matrix	R, matlab, Matrix, SparseM
Missing Values Imputation	SAS/STAT: PROC MI	IBM SPSS Missing Values	Amelia, Hmisc: aregImpute, EMV, rms (replaces Design): fit.mult.impute, mice, mitools, mvnmle, VIM
Neural Networks	Enterprise Miner	IBM SPSS Neural Networks	AMORE, grnnR, neuralnet, nnet, rattle
Operations Research	SAS/OR	None	glpk, linprog, LowRankQP, TSP
Power Analysis	SAS Power and Sample Size Application, SAS/STAT: PROC POWER, PROC GLMPOWER	SamplePower	asypow, powerpkg, pwr, MBESS
Quality Control	SAS/QC	IBM SPSS Statistics Base	qcc, spc
Regression Models	SAS/STAT	IBM SPSS Regression	R, Hmisc, lasso, VGAM, pda, rms (replaces Design)
Sampling, Complex	SAS/STAT: PROC SURVEYSELECT, SURVEYMEANS, etc.	IBM SPSS Complex Samples	pps, sampfling, sampling, spsurvey, survey
Segmentation Analysis	Enterprise Miner	IBM Modeler Segmentation	cluster, rattle, som, see CRAN Task Views under Cluster for over 70 packages
Server Version	SAS for your particular server	IBM SPSS Statistics Server, IBM SPSS Modeler Server	rapache, R(D)COM Server, Rserve, StatET
Structural Equation Modeling	SAS/STAT: PROC CALIS	Amos	OpenMx, sem
Text Analysis/Mining	Text Miner	IBM SPSS Text Analytics, IBM SPSS Text Analysis for Surveys	Rstem, lsa, tm
Trees, Decision, Classification or Regression	Enterprise Miner	IBM SPSS Decision Trees, IBM SPSS AnswerTree, IBM SPSS Modeler (formerly Clementine)	ada, adabag, BayesTree, boost, GAMboost, gbev, gbm, maptree, mboost, mvpart, party, pinktoe, quantregForest, rpart, rpart.permutation, randomForest, rattle, tree
