Topic Models

Some stuff on Topic Models-

http://en.wikipedia.org/wiki/Topic_model

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model

David M Blei’s page on Topic Models-

http://www.cs.princeton.edu/~blei/topicmodeling.html

The topic models mailing list is a good forum for discussing topic modeling.

In R,

Some resources I compiled on Slideshare based on the above- Continue reading “Topic Models”

Revolution Webinar Series #Rstats

Revolution Analytics Webinar-

 

Featured Webinar
David Champagne REGISTER NOW
Presenter David Champagne
CTO, Revolution Analytics
Date Tuesday, December 20th
Time 11:00AM – 11:30AM Pacific 
Click here for the webinar time in your local time zone

Big Data Starts with R

Traditional IT infrastructure is simply unable to meet

the demands of the new “Big Data Analytics” landscape.   Many enterprises are turning to the “R” statistical programming language and Hadoop (both open source projects) as a potential solution. This webinar will introduce the statistical capabilities of R within the Hadoop ecosystem.  We’ll cover:

  • An introduction to new packages developed by Revolution Analytics to facilitate interaction with the data stores HDFS and HBase so that they can be leveraged from the R environment
  • An overview of how to write Map Reduce jobs in R using Hadoop
  • Special considerations that need to be made when working with R and Hadoop.

We’ll also provide additional resources that are available to people interested in integrating R and Hadoop.

 

Upcoming Webinars
Wed, Dec 14th
11:00AM – 11:30AM PT
Revolution R Enterprise – 100% R and MoreR users already know why the R language is the lingua franca of statisticians today: because it’s the most powerful statistical language in the world. Revolution Analytics builds on the power of open source R, and adds performance, productivity and integration features to create Revolution R Enterprise. In this webinar, author and blogger David Smith will introduce the additional capabilities of Revolution R Enterprise.
 Archived Webinars-
Revolution Webinar: New Features in Revolution R Enterprise 5.0 (including RevoScaleR) to Support Scalable Data AnalysisRevolution R Enterprise 5.0 is Revolution Analytics’ scalable analytics platform.  At its core is Revolution Analytics’ enhanced Distribution of R, the world’s most widely-used project for statistical computing.  In this webinar, Dr. Ranney will discuss new features and show examples of the new functionality, which extend the platform’s usability, integration and scalability

 

Graphs in Statistical Analysis

One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F J Anscombe in http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

It has probably the most elegant introduction to an advanced statistical analysis paper that I have ever seen-

1. Usefulness of graphs

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:

(1) numerical calculations are exact, but graphs are rough;

(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;

(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Of course the dataset makes it very very interesting for people who dont like graphical analysis too much.

From http://en.wikipedia.org/wiki/Anscombe%27s_quartet

 The x values are the same for the first three datasets.

Anscombe’s Quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

For all four datasets:

Property Value
Mean of x in each case 9 exact
Variance of x in each case 11 exact
Mean of y in each case 7.50 (to 2 decimal places)
Variance of y in each case 4.122 or 4.127 (to 3 d.p.)
Correlation between x and y in each case 0.816 (to 3 d.p.)
Linear regression line in each case y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)
But see the graphical analysis –
While R has always been great in emphasizing graphical analysis, thanks in part due to work by H Wickham and others, SAS products and  language has also modified its approach at http://www.sas.com/technologies/analytics/statistics/datadiscovery/
 SAS Visual Data Discovery combines top-selling SAS products (Base SASSAS/STAT® and SAS/GRAPH®), along with two interfaces (SAS® Enterprise Guide® for guided tasks and batch analysis and JMP® software for discovery and exploratory analysis).
 and  ODS Statistical Graphs at
While ODS Statistical graphs is still not as smooth as say R’s GGPLOT2 http://tinyurl.com/ggplot2-book, it still is a progressive step
Pretty graphs make for better decisions too !

 

 

Free Machine Learning at Stanford

One of the cornerstones of the technology revolution, Stanford now offers some courses for free via distance learning. One of the more exciting courses is of course- machine learning

 

 

http://jan2012.ml-class.org/

About The Course

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

The Instructor

Professor Andrew Ng is Director of the Stanford Artificial Intelligence Lab, the main AI research organization at Stanford, with 20 professors and about 150 students/post docs. At Stanford, he teaches Machine Learning, which with a typical enrollment of 350 Stanford students, is among the most popular classes on campus. His research is primarily on machine learning, artificial intelligence, and robotics, and most universities doing robotics research now do so using a software platform (ROS) from his group.

 

  1. When does the class start?The class will start in January 2012 and will last approximately ten weeks.
  2. What is the format of the class?The class will consist of lecture videos, which are broken into small chunks, usually between eight and twelve minutes each. Some of these may contain integrated quiz questions. There will also be standalone quizzes that are not part of video lectures, and programming assignments.
  3. Will the text of the lectures be available?We hope to transcribe the lectures into text to make them more accessible for those not fluent in English. Stay tuned.
  4. Do I need to watch the lectures live?No. You can watch the lectures at your leisure.
  5. Can online students ask questions and/or contact the professor?Yes, but not directly There is a Q&A forum in which students rank questions and answers, so that the most important questions and the best answers bubble to the top. Teaching staff will monitor these forums, so that important questions not answered by other students can be addressed.
  6. Will other Stanford resources be available to online students?No.
  7. How much programming background is needed for the course?The course includes programming assignments and some programming background will be helpful.
  8. Do I need to buy a textbook for the course?No.
  9. How much does it cost to take the course?Nothing: it’s free!
  10. Will I get university credit for taking this course?No.Interested in learning machine learning-

    Well here is the website to enroll http://jan2012.ml-class.org/

Webinar: Using R within Oracle #rstats

Webinar: Using R within Oracle — Nov 30, noon EST

==========================================
Oracle now supports the R open source statistical programming language. Come to this webinar to learn more about using R within an Oracle environment.

— URL for TechCast: https://stbeehive.oracle.com/bconf/confDetails?confID=334B:3BF0:owch:38893C00F42F38A1E0404498C8A6612B0004075AECF7&guest=true&confKey=608880
— Web Conference ID: 303397
— Web Conference Key: 608880
— Dialup:             1-866-682-4770      , ID 5548204, passcode 1234

After a steady rise in the past few years, in 2010 the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other (http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html).

Several analytic tool vendors have added R-integration to their software. However, Oracle is the largest company to throw their weight behind R. On October 3, Oracle unveiled their integration of R: Oracle R Enterprise (http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html) as part of their Oracle Big Data Appliance announcement (http://www.oracle.com/us/corporate/press/512001).

Oracle R Enterprise allows users to perform statistical analysis with advanced visualization on data stored in Oracle Database. Oracle R Enterprise enables scalable R solutions, while facilitating production deployment of R scripts and Hadoop based solutions, as well as integration of R results with Oracle BI Publisher and OBIEE dashboards.

This TechCast introduces the various Oracle R Enterprise components and features, along with R script demonstrations that interface with Oracle Database.

TechCast presenter: Mark Hornick, Senior Manager, Oracle Advanced Analytics Development.
This TechCast is part of the ongoing TechCasts series coordinated by Oracle BIWA: The BI, Warehousing and Analytics SIG (http://www.oracleBIWA.org).

Creating Pages on Google Plus for some languages

So I decided to create Pages on Google Plus for my favorite programming languages.

a programming language that lets you work more quickly and integrate your systems more effectively

Add to circles

  –  Comment  –  Share

Ajay Ohri

Ajay Ohri's profile photo

Ajay Ohri  –
Ajay Ohri shared a Google+ page with you.
Structured Query Language
Leading statistical language since 1960’s especially in sociology and market research
The leading statistical language in the world
The leading statistical language since 1970’s

Python

https://plus.google.com/107930407101060924456/posts

 

These are in accordance with Google’s Policies http://www.google.com/intl/en/+/policy/pagesterm.html  Continue reading “Creating Pages on Google Plus for some languages”

Preview- Google Cloud SQL

From –http://code.google.com/apis/sql/

What is Google Cloud SQL?

Google Cloud SQL is web service that allows you to create, configure, and use relational databases with your App Engine applications. It is a fully-managed service that maintains, manages, and administers your databases, allowing you to focus on your applications and services.

By offering the capabilities of a MySQL database, the service enables you to easily move your data, applications, and services into and out of the cloud. This allows for high data portability and helps in faster time-to-market because you can quickly leverage your existing database (using JDBC and/or DB-API) in your App Engine application.

Here is where you can get an invite to the beta only Google Cloud SQL

Sign up for Limited Preview

Google Cloud SQL is available to a limited number of users. To sign up for the service:

  1. Visit the Google APIs Console. The console opens the All services pane.
  2. Find the SQL Service line in the Services table and click Request access…
  3. Fill out the enrollment form.
  4. Our team will review your enrollment information and respond by email to the address associated with your Google Account.
  5. Follow the link in the email to view the Terms of Service. Please read these carefully before accepting.
  6. Sign up for the google-cloud-sql-announce group to receive important announcements and product news. (NOTE- Members: 384)
and after all that violence and double talk, a walk in the clouds with SQL.
1. There are three kinds of instances in the beta view
2. Wait for the Instance to be created note- the Design of the Interface uptil now is much better than Amazon’s.  
Note you need to have an appspot application from Google Apps and can choose between the Python and Java versions. Quite clearly there is a play for other languages too. I think GO is also supported.
3. You can import your data from your Google Storage bucket
4. I am not that hot at coding or maybe the interface was too pretty. Anyways- the log tells me that import of the text file has failed from Google Storage to Google Cloud SQL 
5. Incidentally the Google Cloud Storage interface is also much better than the Amazon GUI for transferring data- Note I was using the classical statistical dataset Boston Housing Data as the test case. 
6. The SQL prompt is the weakest part of the design process of the Interphase. There is no Query builder and the SELECT FROM WHERE prompt is slightly amusing/ insulting . I mean guys either throw in a fully fledged GUI for query builder similar to the MYSQL Workbench , than create a pretty white command prompt.
7. You can also export your data back to your Google Storage bucket 
These are early days, and I am trying to see if there is a play for some cloud kind of ODBC action between R, Prediction API , and the cloud SQL… so try it out yourself at http://code.google.com/apis/sql/ and see if there is any juice you can build  here.