SpiderPic- Better Photos for Marketing Collateral

Here is a new search engine for comparing prices and lots of stock images ( the kind professional photographers use). For Prices as low as $1-3 you can test,evaluate and buy very beautiful images to really zing up your marketing collateral as well as client presentations. Here is an example of querying for the word analysis and the images that come in.

Compare these results with the free results in Google Image Search and you can see that it is definitely worth a dekko given the very low prices that are checked here and the multiple websites that are listed.

Google Image Search (free) for the word Analysis

Spider Pic- Options to buy Images for as low as 1$ for your presentation

The Old Man and the C

At 32, I am probably too old to do this- but I am finally learning C
from here-
http://www.cprogramming.com/

Another tutorial from CS at Stanford is also quite good

This is document #101, Essential C, in the Stanford CS Education Library. This and other educational materials are available for free at http://cslibrary.stanford.edu/. This article is free to be used, reproduced, excerpted, retransmitted, or sold so long as this notice is clearly reproduced at its beginning.

New Blogging Schedule

Given existing demands on health , wealth and time- I am planning to blog every Friday at the minimum and also send a summary on the Linkedin Group.

Here are some ways to better connect to my writing-

1 Join the LinkedIn Group http://www.linkedin.com/groups?about=&gid=54257

2 Follow on Twitter http://twitter.com/decisionstats

3 Interact on Facebook http://www.facebook.com/Decisionstats

I will also be trying to find better ways to optimize my writing soon. Thanks for reading this!

MapReduce Patent Granted

After 5 years of third party validation and almost 10 years of Google Internal Validation, the fastest way to crunch data belongs to the people who created it first, Google Inc

From

http://www.google.com/patents/about?id=XLfIAAAAEBAJ

Citations

Patent Number Title Issue date
4876643 Parallel searching system having a master processor for controlling plural slave processors for independently processing respective search requests Oct 24, 1989
5345584 System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices Sep 6, 1994
5414849 Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method May 9, 1995
5414899 Pivot structure from a lock handle May 16, 1995
5471622 Run-time system having nodes for identifying parallel tasks in a logic program and searching for available nodes to execute the parallel tasks Nov 28, 1995
5590319 Query processor for parallel processing in homogenous and heterogenous databases Dec 31, 1996
5806059 Database management system and method for query process for the same Sep 8, 1998
5819251 System and apparatus for storage retrieval and analysis of relational and non-relational data Oct 6, 1998
5870743 Method and apparatus for parallelizing operations that create a table Feb 9, 1999
5884299 Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations Mar 16, 1999
5884303 Parallel searching technique Mar 16, 1999
5920854 Real-time document collection search engine with phrase indexing Jul 6, 1999
5956704 Method and apparatus for parallelizing operations that insert data into an existing data container Sep 21, 1999
5963954 Method for mapping an index of a database into an array of files Oct 5, 1999
6006224 Crucible query system Dec 21, 1999
6026394 System and method for implementing parallel operations in a database management system Feb 15, 2000
6182061 Method for executing aggregate queries, and computer system Jan 30, 2001
6226635 Layered query management May 1, 2001
6256621 Database management system and query operation therefor, including processing plural database operation requests based on key range of hash code Jul 3, 2001
6301574 System for providing business information Oct 9, 2001
6366904 Machine-implementable method and apparatus for iteratively extending the results obtained from an initial query in a database Apr 2, 2002
6408292 Method of and system for managing multi-dimensional databases using modular-arithmetic based address data mapping processes on integer-encoded business dimensions Jun 18, 2002
6556988 Database management apparatus and query operation therefor, including processing plural database operation requests based on key range of hash code Apr 29, 2003
6567806 System and method for implementing hash-based load-balancing query processing in a multiprocessor database system May 20, 2003
6741992 Flexible rule-based communication system and method for controlling the flow of and access to information between computer users May 25, 2004
6910070 Methods and systems for asynchronous notification of database events Jun 21, 2005
6961723 System and method for determining relevancy of query responses in a distributed network search mechanism Nov 1, 2005
6983322 System for discrete parallel processing of queries and updates Jan 3, 2006
7099871 System and method for distributed real-time search Aug 29, 2006
7103590 Method and system for pipelined database table functions Sep 5, 2006
7146365 Method, system, and program for optimizing database query execution Dec 5, 2006
7430549 Optimized SQL code generation Sep 30, 2008
7433863 SQL code generation for heterogeneous environment Oct 7, 2008

Claims

What is claimed is:1. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;
allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;
after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:
for each data record in at least a subset of the group of the stored data records allocated to the respective process:
creating a parsed representation of the data record;
applying a procedural language query to the parsed representation of the data record to extract one or more values, wherein the procedural language query is applied independently to each parsed representation; and
applying a respective emit operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions;
in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and
combining the produced aggregated data to produce output data.

2. The method of claim 1, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions.

3. The method of claim 2, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

4. The method of claim 1, wherein the applying the procedural language query to the parsed representation of the data record to extract the one or more values and the applying the respective emit operator to at least one of the one or more values to add the corresponding information to the respective intermediate data structure are performed independently for each data record.

5. The method of claim 1, wherein the parsed representation of the data record comprises a key-value pair.

6. The method of claim 1, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the extracted one or more values.

7. The method of claim 6, wherein the aggregating information from the subset of the intermediate data structures to produce the aggregated data combines the extracted one or more values having the same index values.

8. The method of claim 1, wherein

when applying the procedural language query to the parsed representation produces a plurality of values, applying the respective emit operator to each of the produced plurality of values to add corresponding information to the respective intermediate data structure.

9. The method of claim 1, wherein the second plurality of processes are executing in parallel.

10. The method of claim 1, wherein the allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in Parallel is application independent, and the procedural language query is application dependent.

11. The method of claim 1, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

12. The method of claim 1, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of indices is dynamically generated in accordance with the extracted one or more values.

13. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;
allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;
after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:
for each data record in at least a subset of the group of stored data records allocated to the respective process:
creating a parsed representation of the data record;
applying a procedural language query to the parsed representation of the data record to extract one or more values; and
applying a respective operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure;
in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and
combining the produced aggregated data to produce output data.

14. A computer system with one or more processors and memory for analyzing data records, wherein the data records are stored in one or more data centers, the computer system comprising:

a first plurality of processes operating in parallel, each of which is allocated a group of stored data records to process;
each respective process of the first plurality of processes including instructions for:
creating a parsed representation of each data record in at least a subset of the group of stored data records allocated to the respective process after the group of stored data records is allocated to the respective process;
applying a procedural language query to the parsed representation of each stored data record in at least the subset of the group of stored data records allocated to the respective process to produce one or more values; and
applying one or more emit operators to each of the one or more produced values to add corresponding information to an intermediate data structure; and
at least one aggregating process for aggregating information from a plurality of the intermediate data structures to produce output data.

15. The system of claim 14, wherein the at least one aggregating process for aggregating information comprises a second plurality of processes operating in parallel, wherein each respective process of the second plurality of processes operating in parallel includes instructions for aggregating information from the plurality of the intermediate data structures to produce the output data.

16. The system of claim 14, wherein the intermediate data structure comprises a table.

17. The system of claim 15, wherein at least one process of the second plurality of processes operating in parallel includes instructions for combining the output data to produce aggregated output data.

18. The system of claim 14, wherein each of the one or more emit operators implements one of a predefined set of application-independent statistical information processing functions.

19. The system of claim 18, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

20. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values include instructions for applying the procedural language query independently to each data record.

21. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values and instructions for applying the one or more emit operators to each of the one or more produced values to add the corresponding information to the intermediate data structure include instructions for applying the procedural language query and the one or more emit operators independently to each data record.

22. The system of claim 14, wherein the at least one aggregating process for aggregating information is configured to aggregate, in each respective process of a second plurality of processes, the information from the plurality of the intermediate data structures to produce the output data.

23. The system of claim 14, wherein each parsed representation of each data record comprises a key-value pair.

24. The system of claim 14, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the produced values.

25. The system of claim 24, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data includes instructions for combining the one or more produced values having the same index values.

26. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each stored data record include instructions for applying the one or more emit operators to each of a plurality of produced values to add corresponding information to the intermediate data structure.

27. The system of claim 14, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data comprises a second plurality of processes executing in parallel.

28. The system of claim 14, wherein the system is configured such that the allocation of stored data records to each respective process of the first plurality of processes is application independent, and wherein the procedural language query is application dependent.

29. The system of claim 14, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

30. The system of claim 14, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of the indices is dynamically generated in accordance with the one or more produced values.

China bans Chinese Food for Googleplex

This is a direct result of Google ‘s stand on principles (see below). No Google for China means no Chinese food for Googlers. But seriously.

http://googleblog.blogspot.com/2010/01/new-approach-to-china.html

In mid-December, we detected a highly sophisticated and targeted attack on our corporate infrastructure originating from China that resulted in the theft of intellectual property from Google. However, it soon became clear that what at first appeared to be solely a security incident–albeit a significant one–was something quite different.

First, this attack was not just on Google. As part of our investigation we have discovered that at least twenty other large companies from a wide range of businesses–including the Internet, finance, technology, media and chemical sectors–have been similarly targeted. We are currently in the process of notifying those companies, and we are also working with the relevant U.S. authorities.

Second, we have evidence to suggest that a primary goal of the attackers was accessing the Gmail accounts of Chinese human rights activists. Based on our investigation to date we believe their attack did not achieve that objective. Only two Gmail accounts appear to have been accessed, and that activity was limited to account information (such as the date the account was created) and subject line, rather than the content of emails themselves.

Third, as part of this investigation but independent of the attack on Google, we have discovered that the accounts of dozens of U.S.-, China- and Europe-based Gmail users who are advocates of human rights in China appear to have been routinely accessed by third parties. These accounts have not been accessed through any security breach at Google, but most likely via phishing scams or malware placed on the users’ computers.

Algorithms and Ads: No Free Lunches and Hill Climbing

From http://www.no-free-lunch.org/

More formally, where
d = training set;
m = number of elements in training set;
f = ‘target’ input-output relationships;
h = hypothesis (the algorithm’s guess for f made in response to d); and
C = off-training-set ‘loss’ associated with f and h (‘generalization error’)
all algorithms are equivalent, on average, by any of the following measures of risk: E(C|d), E(C|m), E(C|f,d), or E(C|f,m).

How well you do is determined by how ‘aligned’ your learning algorithm P(h|d) is with the actual posterior, P(f|d).

Wolpert’s result, in essence, formalizes Hume, extends him and calls the whole of science into question.

Bing Ad

Make Bing your decision engine

Google Ad

_null_

From http://en.wikipedia.org/wiki/Hill_climbing

hill climbing is a mathematical optimization technique which belongs to the family of local search. It is relatively simple to implement, making it a popular first choice. Although more advanced algorithms may give better results, in some situations hill climbing works just as well.

Hill climbing can be used to solve problems that have many solutions, some of which are better than others. It starts with a random (potentially poor) solution, and iteratively makes small changes to the solution, each time improving it a little. When the algorithm cannot see any improvement anymore, it terminates. Ideally, at that point the current solution is close to optimal, but it is not guaranteed that hill climbing will ever come close to the optimal solution.

For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained.

Hill climbing is used widely in artificial intelligence, for reaching a goal state from a starting node. Choice of next node and starting node can be varied to give a list of related algorithms.

Bing Ad for Hill Climbing-

Climbing at Amazon

Buy books at Amazon.com and save. Qualified orders over $25 ship free

Amazon.com/books

Google Ad for Hill Climbing Algorithm

_null_

A year after Google’s Kill Bill OS announcements and Ballmer’s lets buy our way outta here- there seem still more sense to stick to Google ‘s ad algols. Unless you want to climb Microsoft’s online hills only to find there is no free lunch in their ad rates and offers.

Like the free and virus prone browser.