MapReduce Patent Granted

After 5 years of third party validation and almost 10 years of Google Internal Validation, the fastest way to crunch data belongs to the people who created it first, Google Inc

From

http://www.google.com/patents/about?id=XLfIAAAAEBAJ

Citations

Patent Number Title Issue date
4876643 Parallel searching system having a master processor for controlling plural slave processors for independently processing respective search requests Oct 24, 1989
5345584 System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices Sep 6, 1994
5414849 Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method May 9, 1995
5414899 Pivot structure from a lock handle May 16, 1995
5471622 Run-time system having nodes for identifying parallel tasks in a logic program and searching for available nodes to execute the parallel tasks Nov 28, 1995
5590319 Query processor for parallel processing in homogenous and heterogenous databases Dec 31, 1996
5806059 Database management system and method for query process for the same Sep 8, 1998
5819251 System and apparatus for storage retrieval and analysis of relational and non-relational data Oct 6, 1998
5870743 Method and apparatus for parallelizing operations that create a table Feb 9, 1999
5884299 Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations Mar 16, 1999
5884303 Parallel searching technique Mar 16, 1999
5920854 Real-time document collection search engine with phrase indexing Jul 6, 1999
5956704 Method and apparatus for parallelizing operations that insert data into an existing data container Sep 21, 1999
5963954 Method for mapping an index of a database into an array of files Oct 5, 1999
6006224 Crucible query system Dec 21, 1999
6026394 System and method for implementing parallel operations in a database management system Feb 15, 2000
6182061 Method for executing aggregate queries, and computer system Jan 30, 2001
6226635 Layered query management May 1, 2001
6256621 Database management system and query operation therefor, including processing plural database operation requests based on key range of hash code Jul 3, 2001
6301574 System for providing business information Oct 9, 2001
6366904 Machine-implementable method and apparatus for iteratively extending the results obtained from an initial query in a database Apr 2, 2002
6408292 Method of and system for managing multi-dimensional databases using modular-arithmetic based address data mapping processes on integer-encoded business dimensions Jun 18, 2002
6556988 Database management apparatus and query operation therefor, including processing plural database operation requests based on key range of hash code Apr 29, 2003
6567806 System and method for implementing hash-based load-balancing query processing in a multiprocessor database system May 20, 2003
6741992 Flexible rule-based communication system and method for controlling the flow of and access to information between computer users May 25, 2004
6910070 Methods and systems for asynchronous notification of database events Jun 21, 2005
6961723 System and method for determining relevancy of query responses in a distributed network search mechanism Nov 1, 2005
6983322 System for discrete parallel processing of queries and updates Jan 3, 2006
7099871 System and method for distributed real-time search Aug 29, 2006
7103590 Method and system for pipelined database table functions Sep 5, 2006
7146365 Method, system, and program for optimizing database query execution Dec 5, 2006
7430549 Optimized SQL code generation Sep 30, 2008
7433863 SQL code generation for heterogeneous environment Oct 7, 2008

Claims

What is claimed is:1. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;
allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;
after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:
for each data record in at least a subset of the group of the stored data records allocated to the respective process:
creating a parsed representation of the data record;
applying a procedural language query to the parsed representation of the data record to extract one or more values, wherein the procedural language query is applied independently to each parsed representation; and
applying a respective emit operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions;
in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and
combining the produced aggregated data to produce output data.

2. The method of claim 1, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions.

3. The method of claim 2, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

4. The method of claim 1, wherein the applying the procedural language query to the parsed representation of the data record to extract the one or more values and the applying the respective emit operator to at least one of the one or more values to add the corresponding information to the respective intermediate data structure are performed independently for each data record.

5. The method of claim 1, wherein the parsed representation of the data record comprises a key-value pair.

6. The method of claim 1, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the extracted one or more values.

7. The method of claim 6, wherein the aggregating information from the subset of the intermediate data structures to produce the aggregated data combines the extracted one or more values having the same index values.

8. The method of claim 1, wherein

when applying the procedural language query to the parsed representation produces a plurality of values, applying the respective emit operator to each of the produced plurality of values to add corresponding information to the respective intermediate data structure.

9. The method of claim 1, wherein the second plurality of processes are executing in parallel.

10. The method of claim 1, wherein the allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in Parallel is application independent, and the procedural language query is application dependent.

11. The method of claim 1, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

12. The method of claim 1, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of indices is dynamically generated in accordance with the extracted one or more values.

13. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;
allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;
after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:
for each data record in at least a subset of the group of stored data records allocated to the respective process:
creating a parsed representation of the data record;
applying a procedural language query to the parsed representation of the data record to extract one or more values; and
applying a respective operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure;
in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and
combining the produced aggregated data to produce output data.

14. A computer system with one or more processors and memory for analyzing data records, wherein the data records are stored in one or more data centers, the computer system comprising:

a first plurality of processes operating in parallel, each of which is allocated a group of stored data records to process;
each respective process of the first plurality of processes including instructions for:
creating a parsed representation of each data record in at least a subset of the group of stored data records allocated to the respective process after the group of stored data records is allocated to the respective process;
applying a procedural language query to the parsed representation of each stored data record in at least the subset of the group of stored data records allocated to the respective process to produce one or more values; and
applying one or more emit operators to each of the one or more produced values to add corresponding information to an intermediate data structure; and
at least one aggregating process for aggregating information from a plurality of the intermediate data structures to produce output data.

15. The system of claim 14, wherein the at least one aggregating process for aggregating information comprises a second plurality of processes operating in parallel, wherein each respective process of the second plurality of processes operating in parallel includes instructions for aggregating information from the plurality of the intermediate data structures to produce the output data.

16. The system of claim 14, wherein the intermediate data structure comprises a table.

17. The system of claim 15, wherein at least one process of the second plurality of processes operating in parallel includes instructions for combining the output data to produce aggregated output data.

18. The system of claim 14, wherein each of the one or more emit operators implements one of a predefined set of application-independent statistical information processing functions.

19. The system of claim 18, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

20. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values include instructions for applying the procedural language query independently to each data record.

21. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values and instructions for applying the one or more emit operators to each of the one or more produced values to add the corresponding information to the intermediate data structure include instructions for applying the procedural language query and the one or more emit operators independently to each data record.

22. The system of claim 14, wherein the at least one aggregating process for aggregating information is configured to aggregate, in each respective process of a second plurality of processes, the information from the plurality of the intermediate data structures to produce the output data.

23. The system of claim 14, wherein each parsed representation of each data record comprises a key-value pair.

24. The system of claim 14, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the produced values.

25. The system of claim 24, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data includes instructions for combining the one or more produced values having the same index values.

26. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each stored data record include instructions for applying the one or more emit operators to each of a plurality of produced values to add corresponding information to the intermediate data structure.

27. The system of claim 14, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data comprises a second plurality of processes executing in parallel.

28. The system of claim 14, wherein the system is configured such that the allocation of stored data records to each respective process of the first plurality of processes is application independent, and wherein the procedural language query is application dependent.

29. The system of claim 14, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

30. The system of claim 14, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of the indices is dynamically generated in accordance with the one or more produced values.

Author: Ajay Ohri

http://about.me/ajayohri

Leave a comment