Topic Models

Some notes on topic models:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations that constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model.
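The generative story described above can be made concrete with a small simulation. What follows is only a rough sketch using the Python standard library: the vocabulary, the two hand-written topics, the `alpha` values, and all function names are made up for illustration. In full LDA the topic-word distributions are themselves drawn from a Dirichlet prior and then inferred from data; here they are fixed by hand so that only the per-document part of the process is shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topic_word_probs, vocabulary, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words, assignments = [], []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the document's mixture.
        z = sample_categorical(theta, rng)
        # 3. Pick the word itself from the chosen topic's word distribution.
        w = sample_categorical(topic_word_probs[z], rng)
        words.append(vocabulary[w])
        assignments.append(z)
    return theta, words, assignments


# Hypothetical toy setup: a "biology" topic and a "sports" topic.
vocabulary = ["gene", "dna", "cell", "ball", "team", "score"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 only emits biology words
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # topic 1 only emits sports words
]

rng = random.Random(0)
theta, words, assignments = generate_document(
    topic_word_probs, vocabulary, alpha=[0.5, 0.5], n_words=8, rng=rng
)
print("topic mixture:", [round(t, 3) for t in theta])
print("words:", words)
print("topic of each word:", assignments)
```

Fitting LDA to real documents runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.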

David M Blei’s page on Topic Models-

The topic models mailing list is a good forum for discussing topic modeling.

In R,

Some resources I compiled on Slideshare based on the above.

Automatically creating tags for big blogs with WordPress

I use the simple-tags plugin in WordPress for automatically creating and posting tags. I am hoping this makes the site easier to navigate. Since I had not been a very efficient tagger before, this plugin can be really useful for anyone who needs to create tags for more than 100 (or 1,000) posts, especially on WordPress-based blog aggregators.



The plugin is available here –

Simple Tags is the successor of the Simple Tagging plugin. This is THE perfect tool to manage your WP terms for any taxonomy.

It was written with this philosophy: best performance, more security, and a lot of new functions.

This plugin is developed on WordPress 3.3, with the constant WP_DEBUG set to TRUE.

  • Administration
  • Tags suggestion from Yahoo! Term Extraction API, OpenCalais, Alchemy, Zemanta, Tag The Net, Local DB with AJAX request
    • Compatible with TinyMCE, FCKeditor, WYMeditor and QuickTags
  • Tag management (rename, delete, merge, search and add tags, edit tag IDs)
  • Mass-edit tags (more than 50 posts at once)
  • Auto-link tags in post content
  • Auto tags!
  • Type-ahead input tags / Autocompletion Ajax
  • Click tags
  • Possibility to tag pages (not only posts) and include them inside the tags results
  • Easy configuration (in WP admin)!

The above plugin can also be combined with an RSS auto-post blog aggregator plugin (read instructions here) to create SEO-optimized blog aggregation or curation.


Information Ladder for Analytics

One diagram that analytics providers use very commonly in marketing and sales, yet hardly ever credit to its author, is the Information Ladder.

The information ladder is a diagram created by education professor Norman Longworth to describe the stages in human learning. According to the ladder, a learner moves through the following progression to construct “wisdom” at the highest level from “data” at the lowest level:

Data →
    Information →
        Knowledge →
            Understanding →
                Insight →
                    Wisdom

Whereas the first two steps can be scientifically exactly defined, the upper parts belong to the domain of psychology and philosophy.

I sometimes think the information ladder and especially the latter two parts are underutilized, under-quantified as metrics and rarely understood completely by the wise men in analytics and information display.

Some visual versions are below


Funnily enough, it is one of the rare concepts first inspired by poetry.

The earliest formalized distinction between wisdom, knowledge, and information may have been made by poet and playwright T.S. Eliot:

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?


Business Analytics Projects

In my view, analytics projects go through four broad phases:

  • Business Problem Phase: What needs to be done?
  1. Increase revenues
  2. Cut costs
  3. Investigate unusual events
  4. Project timelines
  • Technical Problem Phase: Technical problems in project execution
  1. Data availability, data quality, and data augmentation costs
  2. Statistical (technique-based approach): hypothesis formulation, sampling, iterations
  3. Programming (tool-based approach): analytics platform coding (input, formats, processing)
  • Technical Solution Phase: Problem solving using the tools and skills available
  1. Data cleaning, outlier treatment, and missing value imputation
  2. Statistical (technique-based approach): error minimization, model validation, confidence levels
  3. Programming (tool-based approach): analytics platform coding (output, display, graphs)
  • Business Solution Phase: Put it all together in a Word document, presentation, and/or spreadsheet
  1. Finalized forecasts, models, and data strategies
  2. Improvements in existing processes
  3. Control and monitoring of analytical results post-implementation
  4. Legal and compliance guidelines for execution
  5. (Internal or external) client satisfaction and expectation management
  6. Audience feedback based on presenting the final deliverable to a broader audience
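To make the first step of the Technical Solution Phase a little more concrete, here is a rough sketch of missing value imputation and outlier treatment in plain Python. The revenue figures, function names, and the choice of median imputation with IQR-based capping are all assumptions picked for illustration; they are common defaults, not the only way (or necessarily the right way) to clean a given dataset.

```python
import statistics


def impute_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical monthly revenue series with two gaps and one suspicious spike.
monthly_revenue = [120.0, 115.0, None, 130.0, 125.0, 980.0, 118.0, None, 122.0]

filled = impute_missing_with_median(monthly_revenue)
cleaned = cap_outliers_iqr(filled)

print("after imputation:", filled)
print("after outlier capping:", cleaned)
```

Whether a spike like the one above is an error to be capped or an "unusual event" worth investigating is exactly the kind of question the Business Problem Phase should settle before any cleaning is done.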

December Snowflakes R 2.14.1

Almost missed this one due to Christmas-

R 2.14.1 is out, and so are binaries

so download them here (winduh users!)

David S sums it all up here

This update makes a few small improvements (such as the ability to accurately count the number of available cores for parallel processing on Solaris and Windows, and improved support of grayscale Postscript and PDF graphics export) and fixes a few minor bugs (such as a correction to BIC calculations in the presence of zero-weight observations).

Binaries are here-

Prof Peter D speaks here-

Changes in recent versions are here-

Major Changes-

Direct support for high-performance computing in R starts with release 2.14.0.



2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud.

2) SAS Institute released version 9.3 of SAS, one of the major analytics packages in industry use.

3) IBM acquired many companies in analytics and high tech. Again. However, the expected benefits from Cognos-SPSS integration have yet to show up as a spectacular change in market share.

Selected IBM acquisitions in 2011:

  • Emptoris Inc., December 2011
  • Cúram Software Ltd., December 2011
  • DemandTec, December 2011
  • Platform Computing, October 2011
  • Q1 Labs, October 2011
  • Algorithmics, September 2011
  • i2, August 2011
  • Tririga, March 2011


4) SAP promised a lot with SAP HANA. Again, there were no major oohs and aahs in terms of market share fluctuations within analytics.

5) Amazon continued to lower prices of cloud computing and offer more options.

6) Google continues to dilly-dally with its analytics and cloud-based APIs. I do not expect all the APIs in the Google APIs suite to survive and be viable in the enterprise software space. This includes Google Cloud Storage, Cloud SQL, and the Prediction API. Some of the location-based and translation-based APIs may have interesting spin-offs that could be very lucrative commercially.

7) Microsoft did... hmm, I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago, very little excitement has come from MS plans in data mining. The plugins for cloud-based data mining from Excel remain promising, while Azure remains a stealth-mode starter.

8) Revolution Analytics promised us a GUI and didn't deliver (not yet, anyway 🙂). But it did reveal much better enterprise software: Revolution R 5.0 is one of the strongest enterprise offerings in the R/statistical computing space, and R’s memory handling problem is now more an issue of perception than of substance, thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper and impede effective analytics usage. So does rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.