MapReduce Patent Granted

After 5 years of third party validation and almost 10 years of Google Internal Validation, the fastest way to crunch data belongs to the people who created it first, Google Inc

From

http://www.google.com/patents/about?id=XLfIAAAAEBAJ

Citations

Patent Number Title Issue date

4876643 Parallel searching system having a master processor for controlling plural slave processors for independently processing respective search requests Oct 24, 1989

5345584 System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices Sep 6, 1994

5414849 Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method May 9, 1995

5414899 Pivot structure from a lock handle May 16, 1995

5471622 Run-time system having nodes for identifying parallel tasks in a logic program and searching for available nodes to execute the parallel tasks Nov 28, 1995

5590319 Query processor for parallel processing in homogenous and heterogenous databases Dec 31, 1996

5806059 Database management system and method for query process for the same Sep 8, 1998

5819251 System and apparatus for storage retrieval and analysis of relational and non-relational data Oct 6, 1998

5870743 Method and apparatus for parallelizing operations that create a table Feb 9, 1999

5884299 Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations Mar 16, 1999

5884303 Parallel searching technique Mar 16, 1999

5920854 Real-time document collection search engine with phrase indexing Jul 6, 1999

5956704 Method and apparatus for parallelizing operations that insert data into an existing data container Sep 21, 1999

5963954 Method for mapping an index of a database into an array of files Oct 5, 1999

6006224 Crucible query system Dec 21, 1999

6026394 System and method for implementing parallel operations in a database management system Feb 15, 2000

6182061 Method for executing aggregate queries, and computer system Jan 30, 2001

6226635 Layered query management May 1, 2001

6256621 Database management system and query operation therefor, including processing plural database operation requests based on key range of hash code Jul 3, 2001

6301574 System for providing business information Oct 9, 2001

6366904 Machine-implementable method and apparatus for iteratively extending the results obtained from an initial query in a database Apr 2, 2002

6408292 Method of and system for managing multi-dimensional databases using modular-arithmetic based address data mapping processes on integer-encoded business dimensions Jun 18, 2002

6556988 Database management apparatus and query operation therefor, including processing plural database operation requests based on key range of hash code Apr 29, 2003

6567806 System and method for implementing hash-based load-balancing query processing in a multiprocessor database system May 20, 2003

6741992 Flexible rule-based communication system and method for controlling the flow of and access to information between computer users May 25, 2004

6910070 Methods and systems for asynchronous notification of database events Jun 21, 2005

6961723 System and method for determining relevancy of query responses in a distributed network search mechanism Nov 1, 2005

6983322 System for discrete parallel processing of queries and updates Jan 3, 2006

7099871 System and method for distributed real-time search Aug 29, 2006

7103590 Method and system for pipelined database table functions Sep 5, 2006

7146365 Method, system, and program for optimizing database query execution Dec 5, 2006

7430549 Optimized SQL code generation Sep 30, 2008

7433863 SQL code generation for heterogeneous environment Oct 7, 2008

Claims

What is claimed is:1. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;

allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;

after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:

for each data record in at least a subset of the group of the stored data records allocated to the respective process:

creating a parsed representation of the data record;

applying a procedural language query to the parsed representation of the data record to extract one or more values, wherein the procedural language query is applied independently to each parsed representation; and

applying a respective emit operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions;

in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and

combining the produced aggregated data to produce output data.

2. The method of claim 1, wherein the respective emit operator implements one of a predefined set of application-independent statistical information processing functions.

3. The method of claim 2, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

4. The method of claim 1, wherein the applying the procedural language query to the parsed representation of the data record to extract the one or more values and the applying the respective emit operator to at least one of the one or more values to add the corresponding information to the respective intermediate data structure are performed independently for each data record.

5. The method of claim 1, wherein the parsed representation of the data record comprises a key-value pair.

6. The method of claim 1, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the extracted one or more values.

7. The method of claim 6, wherein the aggregating information from the subset of the intermediate data structures to produce the aggregated data combines the extracted one or more values having the same index values.

8. The method of claim 1, wherein

when applying the procedural language query to the parsed representation produces a plurality of values, applying the respective emit operator to each of the produced plurality of values to add corresponding information to the respective intermediate data structure.

9. The method of claim 1, wherein the second plurality of processes are executing in parallel.

10. The method of claim 1, wherein the allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in Parallel is application independent, and the procedural language query is application dependent.

11. The method of claim 1, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

12. The method of claim 1, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of indices is dynamically generated in accordance with the extracted one or more values.

13. A computer-implemented method of analyzing data records, comprising:

storing the data records in one or more data centers;

allocating groups of the stored data records to respective processes of a first plurality of processes executing in parallel;

after allocating the groups of the stored data records to the respective processes of the first plurality of processes executing in parallel, in each respective process of the first plurality of processes:

for each data record in at least a subset of the group of stored data records allocated to the respective process:

creating a parsed representation of the data record;

applying a procedural language query to the parsed representation of the data record to extract one or more values; and

applying a respective operator to at least one of the extracted one or more values to add corresponding information to a respective intermediate data structure;

in each process of a second plurality of processes, aggregating information from a subset of the intermediate data structures to produce aggregated data; and

combining the produced aggregated data to produce output data.

14. A computer system with one or more processors and memory for analyzing data records, wherein the data records are stored in one or more data centers, the computer system comprising:

a first plurality of processes operating in parallel, each of which is allocated a group of stored data records to process;

each respective process of the first plurality of processes including instructions for:

creating a parsed representation of each data record in at least a subset of the group of stored data records allocated to the respective process after the group of stored data records is allocated to the respective process;

applying a procedural language query to the parsed representation of each stored data record in at least the subset of the group of stored data records allocated to the respective process to produce one or more values; and

applying one or more emit operators to each of the one or more produced values to add corresponding information to an intermediate data structure; and

at least one aggregating process for aggregating information from a plurality of the intermediate data structures to produce output data.

15. The system of claim 14, wherein the at least one aggregating process for aggregating information comprises a second plurality of processes operating in parallel, wherein each respective process of the second plurality of processes operating in parallel includes instructions for aggregating information from the plurality of the intermediate data structures to produce the output data.

16. The system of claim 14, wherein the intermediate data structure comprises a table.

17. The system of claim 15, wherein at least one process of the second plurality of processes operating in parallel includes instructions for combining the output data to produce aggregated output data.

18. The system of claim 14, wherein each of the one or more emit operators implements one of a predefined set of application-independent statistical information processing functions.

19. The system of claim 18, wherein the application-independent statistical information processing functions comprise one or more of the following: a function for counting occurrences of distinct values, a maximum value function, a minimum value function, a statistical sampling function, a function for identifying values that occur most frequently, and a function for estimating a total number of unique values.

20. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values include instructions for applying the procedural language query independently to each data record.

21. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each data record in at least the subset of the group of stored data records allocated to the respective process to produce the one or more values and instructions for applying the one or more emit operators to each of the one or more produced values to add the corresponding information to the intermediate data structure include instructions for applying the procedural language query and the one or more emit operators independently to each data record.

22. The system of claim 14, wherein the at least one aggregating process for aggregating information is configured to aggregate, in each respective process of a second plurality of processes, the information from the plurality of the intermediate data structures to produce the output data.

23. The system of claim 14, wherein each parsed representation of each data record comprises a key-value pair.

24. The system of claim 14, wherein the intermediate data structure comprises a table having at least one index whose index values comprise unique values of the produced values.

25. The system of claim 24, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data includes instructions for combining the one or more produced values having the same index values.

26. The system of claim 14, wherein the instructions for applying the procedural language query to the parsed representation of each stored data record include instructions for applying the one or more emit operators to each of a plurality of produced values to add corresponding information to the intermediate data structure.

27. The system of claim 14, wherein the at least one aggregating process for aggregating the information from the plurality of intermediate data structures to produce the output data comprises a second plurality of processes executing in parallel.

28. The system of claim 14, wherein the system is configured such that the allocation of stored data records to each respective process of the first plurality of processes is application independent, and wherein the procedural language query is application dependent.

29. The system of claim 14, wherein the data records comprise one or more of the following types of data records: log files, transaction records, and documents.

30. The system of claim 14, wherein the intermediate data structure is a table having a plurality of indices, wherein each of the plurality of the indices is dynamically generated in accordance with the one or more produced values.

Author: Ajay Ohri

http://about.me/ajayohri View all posts by Ajay Ohri

Patent Number	Title	Issue date
4876643	Parallel searching system having a master processor for controlling plural slave processors for independently processing respective search requests	Oct 24, 1989
5345584	System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices	Sep 6, 1994
5414849	Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method	May 9, 1995
5414899	Pivot structure from a lock handle	May 16, 1995
5471622	Run-time system having nodes for identifying parallel tasks in a logic program and searching for available nodes to execute the parallel tasks	Nov 28, 1995
5590319	Query processor for parallel processing in homogenous and heterogenous databases	Dec 31, 1996
5806059	Database management system and method for query process for the same	Sep 8, 1998
5819251	System and apparatus for storage retrieval and analysis of relational and non-relational data	Oct 6, 1998
5870743	Method and apparatus for parallelizing operations that create a table	Feb 9, 1999
5884299	Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations	Mar 16, 1999
5884303	Parallel searching technique	Mar 16, 1999
5920854	Real-time document collection search engine with phrase indexing	Jul 6, 1999
5956704	Method and apparatus for parallelizing operations that insert data into an existing data container	Sep 21, 1999
5963954	Method for mapping an index of a database into an array of files	Oct 5, 1999
6006224	Crucible query system	Dec 21, 1999
6026394	System and method for implementing parallel operations in a database management system	Feb 15, 2000
6182061	Method for executing aggregate queries, and computer system	Jan 30, 2001
6226635	Layered query management	May 1, 2001
6256621	Database management system and query operation therefor, including processing plural database operation requests based on key range of hash code	Jul 3, 2001
6301574	System for providing business information	Oct 9, 2001
6366904	Machine-implementable method and apparatus for iteratively extending the results obtained from an initial query in a database	Apr 2, 2002
6408292	Method of and system for managing multi-dimensional databases using modular-arithmetic based address data mapping processes on integer-encoded business dimensions	Jun 18, 2002
6556988	Database management apparatus and query operation therefor, including processing plural database operation requests based on key range of hash code	Apr 29, 2003
6567806	System and method for implementing hash-based load-balancing query processing in a multiprocessor database system	May 20, 2003
6741992	Flexible rule-based communication system and method for controlling the flow of and access to information between computer users	May 25, 2004
6910070	Methods and systems for asynchronous notification of database events	Jun 21, 2005
6961723	System and method for determining relevancy of query responses in a distributed network search mechanism	Nov 1, 2005
6983322	System for discrete parallel processing of queries and updates	Jan 3, 2006
7099871	System and method for distributed real-time search	Aug 29, 2006
7103590	Method and system for pipelined database table functions	Sep 5, 2006
7146365	Method, system, and program for optimizing database query execution	Dec 5, 2006
7430549	Optimized SQL code generation	Sep 30, 2008
7433863	SQL code generation for heterogeneous environment	Oct 7, 2008

Citations

Claims

Please share:

Related

Author: Ajay Ohri

Leave a comment Cancel reply