Learning Hadoop

Curious about learning Hadoop, a hot resume skill?

Try

http://www.cloudera.com/hadoop-training/#certification

Cloudera Certification for Hadoop

Cloudera Certification establishes you as a trusted and valuable resource for those working with Hadoop. Whether your company is just looking into the technology or your customers are asking for help, Cloudera Certification demonstrates your ability to solve problems using Hadoop.

  • Consultants, developers and technical leaders can use Cloudera Certification to demonstrate their experience with Hadoop.
  • Employers can use Cloudera Certification to identify candidates for new jobs or internal promotions, as well as ensure team members share a common knowledge base.
  • Customers can reduce risk by relying on contractors and suppliers who retain current Cloudera Certification for their personnel.

If you’d like to obtain Cloudera Certification for Developers or Administrators, see:

http://www.cloudera.com/hadoop-training/#certification

Certifications in Analytics and Business Intelligence

I sometimes get a chat message on Twitter or Facebook asking for help with a specific data issue. More often than not, the question is something like: how do I get started in BI, BA, or data work in general? So here is a list of certifications that I think are quite nice as starting points, or even as CV multipliers.


1) Google’s Certifications

http://www.google.com/intl/en/adwords/professionals/

2) SAS Certifications

Quite well established, and easily one of the best-structured certification programs in the industry.

http://support.sas.com/certify/index.html

3) SPSS

The SPSS certification began last year, and it provides a valuable skill set for both your practice and your resume. It is also useful to have a second statistical software skill apart from SAS.

http://www.spss.com/certification/

At this point I would like you to pause and think about whether the above certifications are useful or cost-effective for you, as they are broad, general qualifications in statistical platforms and in applying them to web analytics (a key area for business analytics).

For more specialized certifications, here are some more:

1) Microsoft SQL Server

http://www.microsoft.com/learning/en/us/certification/cert-sql-server.aspx

2) TDWI Certification

http://tdwi.org/pages/certification/index.aspx

3) IBM

Not sure how up to date these are, so caveat emptor!

http://www.redbooks.ibm.com/abstracts/sg245747.html

If you are knowledgeable about IBM’s Business Intelligence solutions and the fundamental concepts of DB2 Universal Database, and you are capable of performing the intermediate and advanced skills required to design, develop, and support Business Intelligence applications

Also IBM Cognos Certifications

http://www-01.ibm.com/software/data/education/cognos-cert.html

4) MicroStrategy

http://www.microstrategy.com/education/Certification/

5) Oracle

Includes the all-new Sun certifications as well.

http://certification.oracle.com/

and http://blogs.oracle.com/certification/

6) SAP Certifications

http://www.sap.com/services/education/certification/index.epx

7) Cloudera’s Hadoop Certification

http://www.cloudera.com/developers/learn-hadoop/hadoop-certification/

These are some Business Intelligence and Business Analytics related certifications that I assembled into a list. Many other programs were either too specific to software development or did not have a certification for general usage (like many R trainings or company-specific tool trainings). Please feel free to add any suggestions.

Hive Tutorial: Cloud Computing

Here is a nice Hive tutorial video from Cloudera. I wonder what would happen if they put a real analytical system (like R, SPSS, JMP or SAS), and not just basic analytics and reporting, on a big database system like Hadoop (including some text-mined data from legacy company documents).

Unlike Oracle or other database systems, Hadoop is free now and should remain so for the reasonably near future (as MySQL used to be before it was acquired by big fish Sun, which was in turn acquired by the bigger Oracle).

Citation-

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
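As a rough illustration of the small-table case above, here is a minimal Python sketch that assembles a Hive command line with both settings applied before the query. It assumes the `hive` command-line client and its `-e` option for passing a quoted query string. The table name `small_table` and the helper `build_hive_command` are hypothetical, made up for this example. The sketch only builds and prints the command; it does not run Hive, and Hadoop may treat the map task count as a hint rather than a hard limit.

```python
import shlex


def build_hive_command(query, map_tasks=1, reduce_tasks=1):
    """Return an argv list that runs `query` through the Hive CLI.

    The two `set` statements mirror the settings quoted above.
    This helper is hypothetical and not part of Hive itself.
    """
    statements = [
        f"set mapred.map.tasks={map_tasks}",
        f"set mapred.reduce.tasks={reduce_tasks}",
        # Drop any trailing semicolon so statements join cleanly.
        query.strip().rstrip(";"),
    ]
    script = "; ".join(statements) + ";"
    # `hive -e` takes the whole script as a single argument.
    return ["hive", "-e", script]


if __name__ == "__main__":
    # `small_table` is a placeholder name for a table with ~100 rows.
    cmd = build_hive_command("SELECT COUNT(*) FROM small_table")
    # Print a shell-safe version of the command instead of executing it.
    print(shlex.join(cmd))
```

To actually run the query, the returned list could be passed to `subprocess.run` on a machine where Hive is installed.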