Think Big Analytics

I came across this lovely analytics company. Think Big Analytics. and I really liked their lovely explanation of the whole she-bang big data etc stuff. Because Hadoop isnt rocket science and can be made simpler to explain and deploy.

Check them out yourself at http://www.thinkbiganalytics.com/resources_reference

Also they have an awesome series of lectures coming up-

check out

http://www.eventbrite.com/org/1740609570

Up and Running with Big Data: 3 Day Deep-Dive

Over three days, explore the Big Data tools, technologies and techniques which allow organisations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you’ll learn how to use the flexible map/reduce programming model to scale your analytics, use Hadoop with Elastic MapReduce, write queries with Hive, develop real world data flows with Pig and understand the operational needs of a production data platform

Day 1:

  • MapReduce concepts
  • Hadoop implementation:  Jobtracker, Namenode, Tasktracker, Datanode, Shuffle & Sort
  • Introduction to Amazon AWS and EMR with console and command-line tools
  • Implementing MapReduce with Java and Streaming

Day 2:

  • Hive Introduction
  • Hive Relational Operators
  • Hive Implementation to MapReduce
  • Hive Partitions
  • Hive UDFs, UDAFs, UDTFs

Day 3:

  • Pig Introduction
  • Pig Relational Operators
  • Pig Implementation to MapReduce
  • Pig UDFs
  • NoSQL discussion

Introducing Radoop

Thats Right- This is Radoop and it is

Hadoop meats Rapid Miner=Radoop

 

 

http://prezi.com/bin/preziloader.swf

http://prezi.com/dxx7m50le5hr/radoop-presentation-at-rcomm-2011/

 

Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

#Rstats gets into Enterprise Cloud Software

Defense Agencies of the United States Departme...
Image via Wikipedia

Here is an excellent example of how websites should help rather than hinder new customers take a demo of the software without being overwhelmed by sweet talking marketing guys who dont know the difference between heteroskedasticity, probability, odds and likelihood.

It is made by Zementis (Dr Michael Zeller has been a frequent guest here) and Revolution Analytics is still the best shot in Enterprise software for #Rstats

Now if only Revo could get into the lucrative Department of Energy or Department of Defense business- they could change the world AND earn some more revenue than they have been doing. But seriously.

Check out http://deployr.revolutionanalytics.com/zementis/ and play with it. or better still mash it with some data viz and ROC curves.- or extend it with some APIS 😉

New book on BigData Analytics and Data mining using #Rstats with a GUI

Joseph Marie Jacquard
Image via Wikipedia

I am hoping to put this on my pre-ordered or Amazon Wish list. The book the common people who wanted to do data mining with , but were unable to ask aloud they didnt know much.  It is written by the seminal Australian authority on data mining Dr Graham Williams whom I interviewed here at https://decisionstats.com/2009/01/13/interview-dr-graham-williams/

Data Mining for the masses using an ergonomically designed Graphical User Interface.

Thank you Springer. Thank you Dr Graham Williams

http://www.springer.com/statistics/physical+%26+information+science/book/978-1-4419-9889-7

Data Mining with Rattle and R

Data Mining with Rattle and R

The Art of Excavating Data for Knowledge Discovery

Series: Use R

Williams, Graham

1st Edition., 2011, XX, 409 p. 150 illus. in color.

  • Softcover, ISBN 978-1-4419-9889-7

    Due: August 29, 2011

    54,95 €
  • Encourages the concept of programming with data – more than just pushing data through tools, but learning to live and breathe the data
  • Accessible to many readers and not necessarily just those with strong backgrounds in computer science or statistics
  • Details some of the more popular algorithms for data mining, as well as covering model evaluation and model deployment

Data mining is the art and science of intelligent data analysis. By building knowledge from information, data mining adds considerable value to the ever increasing stores of electronic data that abound today. In performing data mining many decisions need to be made regarding the choice of methodology, the choice of data, the choice of tools, and the choice of algorithms.

Throughout this book the reader is introduced to the basic concepts and some of the more popular algorithms of data mining. With a focus on the hands-on end-to-end process for data mining, Williams guides the reader through various capabilities of the easy to use, free, and open source Rattle Data Mining Software built on the sophisticated R Statistical Software. The focus on doing data mining rather than just reading about data mining is refreshing.

The book covers data understanding, data preparation, data refinement, model building, model evaluation,  and practical deployment. The reader will learn to rapidly deliver a data mining project using software easily installed for free from the Internet. Coupling Rattle with R delivers a very sophisticated data mining environment with all the power, and more, of the many commercial offerings.

Content Level » Research

Keywords » Data mining

Related subjects » Physical & Information Science

Related- https://decisionstats.com/2009/01/13/interview-dr-graham-williams/

So which software is the best analytical software? Sigh- It depends

 

Graph of typical Operating System placement on...
Image via Wikipedia

 

Here is the software matrix that I am trying to develop for analytical software- It should help as a tentative guide for software purchases- it’s independent so unbiased (hopefully)- and it will try and bring as much range or sensitivity as possible. The list (rather than matrix) is of the format-

Type 0f analysis-

  • Data Visualization (Reporting with Pivot Ability to aggregate, disaggregate)
  • Reporting without Pivot Ability
  • Regression -Logistic Regression for Propensity or Risk Models
  • Regression- Linear for Pricing Models
  • Hypothesis Testing
  • A/B Scenario Testing
  • Decision Trees (CART, CHAID)
  • Time Series Forecasting
  • Association Analysis
  • Factor Analysis
  • Survey (Questionnaires)
  • Clustering
  • Segmentation
  • Data Manipulation

Dataset Size-

  • small dataset (upto X mb)
  • big dataset (upto Y gb)
  • enterprise class production BigData datasets (no limit)

Pricing of Software that can be used-

Ease of using Software

  • GUI vs Non GUI
  • Software that require not much extensive training
  • Software that require extensive training

Installation, Customization, Maintainability (or Support) for Software

  • Installation Dependencies- Size- Hardware (costs and  efficiencies)
  • Customization provided for specific use
  • Support Channels (including approximate Turn Around Time)

Software

  • Software I have used personally
  • SAS (Base, Stat,Enterprise,Connect,ETS) WPS KXEN SPSS (Base,Trends),Revolution R,R,Rapid Miner,Knime,JMP,SQL SERVER,Rattle, R Commander,Deducer
  • Software I know by reputation- SAS Enterprise Miner etc etc

Are there any other parameters for judging software?  let me know at http://twitter.com/decisionstats

Big Data Management and Advanced Analytics

Here is a new list for the top 10 considerations for Big Data Management and using Advanced Analytics -courtesy AsterData.

Source-

http://www.asterdata.com/wp_10_considerations/index.php?ref=decisionstats

“There are ten strong reasons why competitive organizations are turning to new data management solutions to handle their growing data volumes and evolving analytic needs. This new platform – a ‘data-analytics server’ – merges data storage and data analytics into one single system to conquer the big data challenge.

Big data storage is handled by a massively parallel database architecture; big data analytics is handled by an integrated analytics engine, so that analytics run fully in-database yielding ultra high performance on large data sets. The analytics engine leverages the powerful analytics framework MapReduce. The results are cost-effective, scalable data storage, ultra high performance and richer data analysis.”

Major considerations include:
Cost-effective, scalable data management – what are the requirements?
Advanced analytic queries – what’s meant by advanced analytics & how easy is it?
Running rich, diverse workloads – key factors for high concurrency & performance