Machine learning – DECISION STATS

Principal Component Analysis (PCA) Explained: A Powerful Dimensionality Reduction Technique

Principal Component Analysis (PCA) is one of the most widely used unsupervised machine learning techniques for dimensionality reduction. It transforms a dataset containing many correlated features into a smaller set of uncorrelated principal components, allowing machine learning models to train faster while preserving as much information as possible.

The primary objective of PCA is to address the curse of dimensionality by reducing the number of input variables without significantly sacrificing the underlying structure of the data. Instead of selecting existing features, PCA creates entirely new variables called principal components, each representing a weighted combination of the original features.

PCA identifies the directions of maximum variance in the dataset. The first principal component (PC1) captures the largest amount of variance, while each subsequent component captures the maximum remaining variance under the constraint that it is orthogonal to the previous components. These principal components are mathematically computed as the eigenvectors of the covariance matrix, with their corresponding eigenvalues indicating the amount of variance explained.
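To make the eigenvector description concrete, here is a minimal sketch in plain Python. It assumes a small made-up dataset, and it uses power iteration to find PC1 purely for brevity; real libraries typically use a full eigendecomposition or SVD instead. The function names are illustrative, not part of any library.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Power iteration: approximate the largest eigenvalue of a symmetric
    matrix and its unit-length eigenvector (the direction of PC1)."""
    d = len(cov)
    # Assumption: the start vector is not orthogonal to PC1 (true for this toy data).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Hypothetical toy data: two strongly correlated features.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
cov = covariance_matrix(data)
variance_pc1, pc1 = first_principal_component(cov)

# Total variance is the trace of the covariance matrix.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio_pc1 = variance_pc1 / total_variance
print(pc1, explained_ratio_pc1)
```

Because the two toy features move almost in lockstep, nearly all of the variance lands on the first component, which is exactly the redundancy PCA is meant to exploit.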

An important step before applying PCA is feature scaling. Since PCA is based on variance, variables measured on different scales can disproportionately influence the principal components. Standardizing the data using techniques such as StandardScaler ensures that each feature contributes equally to the analysis.
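As a rough illustration of what standardization does, the sketch below rescales each column to zero mean and unit variance in plain Python. The data values and the `standardize` helper are made up for this example. It divides by the population standard deviation, which is also what scikit-learn's StandardScaler does by default.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores).
    Assumes no column is constant, otherwise the division would fail."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Population standard deviation: divide by n, not n - 1.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

# Hypothetical features on very different scales: height in metres, income in dollars.
raw = [[1.60, 30000.0], [1.75, 52000.0], [1.82, 41000.0], [1.68, 75000.0]]
scaled = standardize(raw)

# After scaling, both columns have mean 0 and variance 1.
column_means = [sum(row[j] for row in scaled) / len(scaled) for j in range(2)]
column_variances = [sum(row[j] ** 2 for row in scaled) / len(scaled)
                    for j in range(2)]
print(column_means, column_variances)
```

Without this step, the income column would dominate the covariance matrix simply because its numbers are larger.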

Choosing the appropriate number of principal components is a critical part of PCA. This is commonly done by analyzing the explained variance ratio or using a scree plot, which helps determine how many components retain a desired percentage of the original information while minimizing dimensionality.
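One common way to turn this into a rule is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes the eigenvalues are already known; the values and the `components_for_variance` helper are hypothetical.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return (k, retained), where k is the smallest number of components
    whose cumulative explained variance ratio reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k, running / total
    return len(ordered), running / total

# Hypothetical eigenvalues of a covariance matrix with six features.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
k, retained = components_for_variance(eigenvalues, target=0.85)
print(k, retained)
```

The threshold itself is a judgment call: 0.85 is used here only for illustration, and a scree plot of the same eigenvalues is often inspected alongside it.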

Principal Component Analysis is widely used for data visualization, noise reduction, feature extraction, image compression, financial analysis, bioinformatics, and as a preprocessing step for many machine learning algorithms. By reducing redundant information, PCA often improves computational efficiency and helps mitigate overfitting in downstream models.

Model effectiveness is typically evaluated by examining the explained variance ratio, cumulative explained variance, and the performance of downstream machine learning models trained on the transformed features.

Although PCA is highly effective for reducing dimensionality and removing redundancy, it has certain limitations. It captures only linear relationships, can reduce model interpretability because principal components are combinations of original features, and discards some information whenever components are dropped. Nevertheless, PCA remains one of the most important preprocessing techniques in machine learning and data science, especially when working with high-dimensional datasets.

https://docs.google.com/presentation/d/e/2PACX-1vQtPEGQgnz9rZuztOyCzMSOPFimIRoA51pwROl4kEhxjWCN9UUQZ49CHk-U-QRS0Q/pub?start=true&loop=true&delayms=10000

Naive Bayes Explained: A Fast and Powerful Machine Learning Classifier

Naive Bayes is one of the simplest and fastest machine learning classification algorithms, widely used for text analysis, spam filtering, sentiment analysis, and document classification. It is based on Bayes’ Theorem, which calculates the probability of an event occurring based on prior knowledge and observed evidence.

What makes Naive Bayes unique is its “naive” assumption that all input features are independent of one another given the target class. Although this assumption is rarely true in real-world data, the algorithm often delivers surprisingly accurate results, especially for high-dimensional datasets such as text.

The model works by learning the probability of each class (prior probability) and the likelihood of each feature occurring within that class. It then combines these probabilities to predict the most likely class for new data. To avoid assigning zero probability to unseen features, Naive Bayes uses Laplace smoothing (alpha), making the model more robust.
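The sketch below shows these steps for the multinomial variant in plain Python, under stated assumptions: a tiny made-up corpus, illustrative function names, and log-probabilities to avoid underflow. It is meant to show where the prior, the likelihoods, and the smoothing term alpha enter, not to replace a library implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        # alpha is added to every count so unseen words never get probability 0.
        denominator = total_words + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denominator)
                for w in vocabulary
            },
        }
    return model

def predict(model, words):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in words:
            # Words outside the training vocabulary are skipped.
            if w in params["log_likelihood"]:
                score += params["log_likelihood"][w]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus.
documents = [["win", "money", "now"], ["cheap", "money", "offer"],
             ["meeting", "at", "noon"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(documents, labels)

spam_guess = predict(model, ["money", "offer", "now"])
# "cheap" never appears in ham, yet smoothing keeps ham in the running.
mixed_guess = predict(model, ["cheap", "meeting"])
print(spam_guess, mixed_guess)
```

In the second prediction, the word "cheap" has a zero count in the ham class; without alpha the ham score would collapse to zero probability regardless of the other words.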

There are three common variants of Naive Bayes:

Gaussian Naive Bayes – Best suited for continuous numerical data.
Multinomial Naive Bayes – Ideal for word counts and text classification tasks.
Bernoulli Naive Bayes – Designed for binary features, where only the presence or absence of a feature matters.

Naive Bayes is widely applied in real-world scenarios such as email spam detection, sentiment analysis of customer reviews, news article categorization, recommendation systems, and support ticket classification. Its exceptional speed, low computational cost, and effectiveness with limited training data make it an excellent baseline model for many machine learning projects.

Model performance is typically evaluated using metrics such as Precision, Recall, F1-Score, and the Confusion Matrix, which help measure classification accuracy beyond simple percentage correctness.
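For a binary problem, all of these metrics fall out of the four cells of the confusion matrix. The following is a minimal sketch with made-up labels and a hypothetical `binary_metrics` helper.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix counts plus precision, recall, F1 and accuracy
    for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical spam-filter results: 3 spam and 5 ham messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = binary_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

Here accuracy is 0.75 while precision and recall for the spam class are both 2/3, which shows how accuracy alone can look better than the classifier's performance on the class of interest.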

While Naive Bayes is highly efficient and scalable, it has limitations. The independence assumption can reduce accuracy when features are strongly correlated, and its predicted probabilities are not always well-calibrated. Despite these drawbacks, it remains one of the most reliable and practical algorithms for text classification and other probabilistic learning tasks.

Overall, Naive Bayes is an excellent choice when speed, simplicity, and strong baseline performance are important, particularly for natural language processing and large-scale text analytics.

https://docs.google.com/presentation/d/e/2PACX-1vRqJzv08KLO4xdT_Egxn6-dymbwq5mdayB6MOSRV6t6wi_HqhCHMZmHiSFb7WNgeg/pub?start=false&loop=false&delayms=10000

Writing on APIs for Programmable Web

I have been writing freelance articles on APIs for Programmable Web. Here is an updated list of the articles; many of these would be of interest to analytics users. Note: some of these are interviews, and they are in bold. Note to regular readers: I keep updating this list, and with each update I bring it to the front page, then let newer blog posts slide it down!

Scoreoid Aims to Gamify the World Using APIs January 27th, 2014

Plot.ly’s Plot to Visualize More Data January 22nd, 2014

LumenData’s Acquisition of Algorithms.io is a Win-Win January 8th, 2014

Yactraq API Sees Huge Growth in 2013 January 6th, 2014

Scrape.it Describes a Better Way to Extract Data December 20th, 2013

Exclusive Interview: App Store Analytics API December 4th, 2013

APIs Enter 3d Printing Industry November 29th, 2013

PW Interview: José Luis Martinez of Textalytics November 6th, 2013

PW Interview Simon Chan PredictionIO November 5th, 2013

PW Interview: Scott Gimpel Founder and CEO FantasyData.com October 23rd, 2013

PW Interview Brandon Levy, cofounder and CEO of Stitch Labs October 8th, 2013

PW Interview: Jolo Balbin Co-Founder Text Teaser September 18th, 2013

PW Interview: Bob Bickel CoFounder Redline13 July 29th, 2013

PW Interview: Brandon Wirtz CTO Stremor.com July 4th, 2013

PW Interview: Andy Bartley, CEO Algorithms.io June 4th, 2013

PW Interview: Francisco J Martin, CEO BigML.com 2013/05/30

PW Interview: Tal Rotbart Founder- CTO, SpringSense 2013/05/28

PW Interview: Jeh Daruwala CEO Yactraq API, Behavorial Targeting for videos 2013/05/13

PW Interview: Michael Schonfeld of Dwolla API on Innovation Meeting the Payment Web 2013/05/02

PW Interview: Stephen Balaban of Lamda Labs on the Face Recognition API 2013/04/29

PW Interview: Amber Feng, Stripe API, The Payment Web 2013/04/24

PW Interview: Greg Lamp and Austin Ogilvie of Yhat on Shipping Predictive Models via API 2013/04/22

Google Mirror API documentation is open for developers 2013/04/18

PW Interview: Ricky Robinett, Ordr.in API, Ordering Food meets API 2013/04/16

PW Interview: Jacob Perkins, Text Processing API, NLP meets API 2013/04/10

Amazon EC2 On Demand Windows Instances -Prices reduced by 20% 2013/04/08

Amazon S3 API Requests prices slashed by half 2013/04/02

PW Interview: Stuart Battersby, Chatterbox API, Machine Learning meets Social 2013/04/02

PW Interview: Karthik Ram, rOpenSci, Wrapping all science APIs 2013/03/20

Viralheat Human Intent API- To buy or not to buy 2013/03/13

Interview Tammer Kamel CEO and Founder Quandl 2013/03/07

YHatHQ API: Calling Hosted Statistical Models 2013/03/04

Quandl API: A Wikipedia for Numerical Data 2013/02/25

Amazon Redshift API is out of limited preview and available! 2013/02/18

Windows Azure Media Services REST API 2013/02/14

Data Science Toolkit Wraps Many Data Services in One API 2013/02/11

Diving into Codeacademy’s API Lessons 2013/01/31

Google APIs finetuning Cloud Storage JSON API 2013/01/29

Interview Hilary Mason Chief Scientist bitly 2013/01/28

Interview: Viralheat CEO Raj Kadam on API Growth 2013/01/22

Google Compute API – Affordable Computing at Google Scale 2013/01/17

2012

Ergast API Puts Car Racing Fans in the Driver’s Seat 2012/12/05
Springer APIs- Fostering Innovation via API Contests 2012/11/20
Statistically programming the web – Shiny,HttR and RevoDeploy API 2012/11/19
Google Cloud SQL API- Bigger ,Faster and now Free 2012/11/12
A Look at the Web’s Most Popular API -Google Maps API 2012/10/09
Cloud Storage APIs for the next generation Enterprise 2012/09/26
Last.fm API: Sultan of Musical APIs 2012/09/12
Socrata Data API: Keeping Government Open 2012/08/29
BigML API Gets Bigger 2012/08/22
Bing APIs: the Empire Strikes Back 2012/08/15
Google Cloud SQL: Relational Database on the Cloud 2012/08/13
Google BigQuery API Makes Big Data Analytics Easy 2012/08/05
Your Store in The Cloud -Google Cloud Storage API 2012/08/01
Predict the future with Google Prediction API 2012/07/30
The Romney vs Obama API 2012/07/27

API Evangelist (programmableweb.com) July 4th, 2013

Why Online Education

1) Huge variety of courses from the best professors in the world, for example Gamification, Machine Learning, or Human-Computer Interaction on Coursera

2) They are free (calling them free is a mistake)! Time is not free.

Also, signature courses at Coursera now offer credible tracks for $39, and they come with more support.

Why do you as a student need support? Because sometimes you get stuck, and sometimes you need human interaction to stay motivated.

3) Coursera- I love these things:

Can run the course faster, at 1.75 times speed (because seriously, I get distracted otherwise)

Can turn on multiple-language CC (captions); reading is so much faster

Best feature: in-video quizzes

Largest number of courses

Free!

Codecademy-

Makes learning fun

Makes it easy to learn a language

I wish someone could mash more of Coursera content with Codecademy gamification and teach hacking and data sciences to the next generation of hackers!!

The rest of the websites are good, but I stick to Coursera and Codecademy!

5) Education empowers! Every person who learns R or JMP through a free MOOC will create more value for themselves, their customers, and their society and country than if they had remained uneducated because they could not afford the training.

Interview Pranay Agrawal Co-Founder Fractal Analytics

Here is an interview with Pranay Agrawal, Executive Vice President- Global Client Development, Fractal Analytics – one of India’s leading analytics services providers and one of the pioneers in analytics services delivery.

Ajay- Describe Fractal Analytics’ journey as a startup to a pioneer in the Predictive Analytics Services industry. What were some of the key turning points in the field of analytics that you have noticed during these times?

Pranay- In 2000, Fractal Analytics started as a pure-play analytics services company in India with a focus on financial services. Five years later, we expanded our operations to the United States and opened new verticals. Today, we have the widest global footprint among analytics providers and have experience handling data and a deep understanding of consumer behavior in over 150 countries. We have matured from an analytics service organization to a productized analytics services firm, specializing in consumer goods, retail, financial services, insurance and technology verticals.
We are at the forefront of a massive inflection point with Big Data Analytics at the center. We have witnessed the transformation of analytics within our clients from a cost center to the most critical division that drives competitive advantage. Advances are quickly converging in computer science, artificial intelligence, machine learning and game theory, changing how analytics is consumed by B2B and B2C companies. Companies that use analytics well are poised to excel in innovation, customer engagement and business performance.

Ajay- What are analytical tools that you use at Fractal Analytics? Are there any trends in analytical software usage that you have observed?

Pranay- We are tool-agnostic, so we serve our clients on whatever platforms they need to ensure they can quickly and effectively operationalize the results we deliver. We use R, SAS, SPSS, SpotFire, Tableau, Xcelsius, Webfocus, Microstrategy and Qlikview. We are seeing increased adoption of open source platforms such as R and of specialized dashboard tools like Tableau and Qlikview, plus an entire spectrum of emerging tools that process, manage and extract information from Big Data and support Hadoop and NoSQL data structures.

Ajay- What are Fractal Analytics plans for Big Data Analytics?

Pranay- We see our clients being overwhelmed by the increasing complexity of the data. While they are all excited by the possibilities of Big Data, the on-the-ground struggle to realize its full potential continues. The analytics paradigm is changing in the context of Big Data. Our solutions focus on making things super-simple for our clients while delivering the analytics sophistication that Big Data makes possible.
Let’s take our Customer Genomics solution for retailers as an example. Retailers are collecting information about shopper behavior through every transaction. Retailers want to transform their business to make it more customer-centric but do not know how to go about it. Our Customer Genomics solution uses advanced machine learning algorithms to label every shopper across more than 80 different dimensions. Retailers use these labels to identify which products they should deep-discount depending on what price-sensitive shoppers buy. Armed with this intelligence, they are transforming the way they plan their assortment, planogram and targeted promotions.

We are also building harmonization engines using Concordia to enable real-time updates of Customer Genomics based on every direct, social, or shopping transaction. This will further bridge the gap between marketing actions and consumer behavior to drive loyalty, market share and profitability.

Ajay- What are some of the key things that differentiate Fractal Analytics from the rest of the industry? How are you different?

Pranay- We are one of the pioneering pure-play analytics firms, with over a decade of experience consulting with Fortune 500 companies. What clients most appreciate about working with us includes:

Experience managing structured and unstructured Big Data (volume, variety) with a deep understanding of consumer behavior in more than 150 countries
Advanced analytics leveraging supervised machine-learning platforms
Proprietary products, for example: Concordia for data harmonization, Customer Genomics for consumer insights and personalized marketing, Pincer for pricing optimization, Eavesdrop for social media listening, Medley for assortment optimization in the retail industry and Known Value Item for retail stores
Deep industry expertise that enables us to leverage cross-industry knowledge to solve a wide range of marketing problems
The lowest attrition rates in the industry and a very selective hiring process, which make us a great place to work

Ajay- What are some of the initiatives that you have taken to ensure employee satisfaction and happiness?

Pranay- We believe happy employees create happy customers. We are building a great place to work by taking a personal interest in grooming people. Our people are highly engaged as evidenced by 33% new hire referrals and the highest Glassdoor ratings in our industry.
We recognize the accomplishments and contributions made through many programs such as:

FractElite – where peers nominate and defend the best of us
Recognition board – where anyone can write a visible thank you
Value cards – where anyone can acknowledge great role model behavior in one or more values
Townhall – a quarterly all hands where we announce anniversaries and FractElite awards, with an open forum to ask questions
Employee engagement surveys – to measure and report out on satisfaction programs
Open access to managers and leadership team – to ensure we understand and appreciate each person’s unique goals and ambitions, coach for high performance, and laud their success

Ajay- How happy are Fractal Analytics customers quantitatively? What is your retention rate- and what plans do you have for 2013?

Pranay- As consultants, delivering value with great service is critical to our growth, which has nearly doubled in the last year. Most of our clients have been with us for over five years and we are typically considered a strategic partner.
We conduct client satisfaction surveys during and after each project to measure our performance and identify opportunities to serve our clients better. In 2013, we will continue partnering with our clients to define additional process improvements from applying best practice in engagement management to building more advanced analytics and automated services to put high-impact decisions into our clients’ hands faster.

About–

Pranay Agrawal - Pranay co-founded Fractal Analytics in 2000 and heads client engagement worldwide. He has an MBA from the Indian Institute of Management (IIM) Ahmedabad and a Bachelor’s in Accounting from Bangalore University, and is a Certified Financial Risk Manager from GARP. He is also available online at http://www.linkedin.com/in/pranayfractal

Fractal Analytics is a provider of predictive analytics and decision sciences to the financial services, insurance, consumer goods, retail, technology, pharma and telecommunication industries. Fractal Analytics helps companies compete on analytics and in understanding, predicting and influencing consumer behavior. Over 20 Fortune 500 financial services, consumer packaged goods, retail and insurance companies partner with Fractal to make better data-driven decisions and institutionalize analytics inside their organizations.

Fractal sets up analytical centers of excellence for its clients to tackle tough big data challenges, improve decision management, help understand, predict & influence consumer behavior, increase marketing effectiveness, reduce risk and optimize business results.

Interview John Myles White, Machine Learning for Hackers

Here is an interview with one of the younger researchers and rock stars of the R Project, John Myles White, co-author of Machine Learning for Hackers.

Ajay- What inspired you guys to write Machine Learning for Hackers? What has been the public response to the book? Are you planning to write a second edition or a next book?

John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.

We once said that Machine Learning for Hackers is supposed to be a chemistry set for Machine Learning and I still think that’s the right description: it’s meant to get readers excited about Machine Learning and hopefully expose them to enough ideas and tools that they can start to explore on their own more effectively. It’s like a warmup for standard academic books like Bishop’s.

The public response to the book has been phenomenal. It’s been amazing to see how many people have bought the book and how many people have told us they found it helpful. Even friends with substantial expertise in statistics have said they’ve found a few nuggets of new information in the book, especially regarding text analysis and social network analysis — topics that Drew and I spend a lot of time thinking about, but are not thoroughly covered in standard statistics and Machine Learning undergraduate curricula.

I hope we write a second edition. It was our first book and we learned a ton about how to write at length from the experience. I’m about to announce later this week that I’m writing a second book, which will be a very short eBook for O’Reilly. Stay tuned for details.

Ajay- What are the key things that a potential reader can learn from this book?

John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.

Ajay- Describe your journey as a science student up to your Ph.D. What are your current research interests, and what initiatives have you taken with them?

John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?

Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?

John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.

The changes that are required in academia to prepare students for this kind of work are pretty numerous, but the most obvious required change is that quantitative people need to be learning how to program properly, which is rare in academia, even in many CS departments. Writing one-off programs that no one will ever have to reuse and that only work on toy data sets doesn’t prepare you for working with huge amounts of messy data that exhibit shifting patterns. If you need to learn how to program seriously before you can do useful work, you’re not very valuable to companies who need employees that can hit the ground running. The companies that have done best in building up data teams, like LinkedIn, have learned to train people as they come in since the proper training isn’t typically available outside those companies.

Of course, on the flipside, the people who do know how to program well need to start learning more about theory and need to start to have a better grasp of basic mathematical models like linear and logistic regressions. Lots of CS students seem not to enjoy their theory classes, but theory really does prepare you for thinking about what you can learn from data. You may not use automata theory if you work at Foursquare, but you will need to be able to reason carefully and analytically. Doing math is just like lifting weights: if you’re not good at it right now, you just need to dig in and get yourself in shape.

About-

John Myles White is a Ph.D. student in the Princeton Psychology Department, where he studies human decision-making both theoretically and experimentally. Along with the political scientist Drew Conway, he is the author of a book published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. He is also working with Mark Hansen on a book for laypeople about exploratory data analysis. John is the lead maintainer for several R packages, including ProjectTemplate and log4r.

(TIL he has played in several rock bands!)


You can read more in his own words at his blog at http://www.johnmyleswhite.com/about/

He can be contacted via social media on Google Plus at https://plus.google.com/109658960610931658914 or on Twitter at twitter.com/johnmyleswhite/

Machine Learning to Translate Code from different programming languages

Google Translate has been a pioneer in using machine learning to translate between human languages (and so has the awesome Google Transliterate).

I wonder if they can expand it to programming languages and not just human languages.

Issues in ~~converting~~ translating programming language code

1) Paths that refer to stored objects

2) Object names should remain the same and not be translated

3) Many functions have multiple uses, so translating a function is sometimes not straightforward

I think all these issues are doable, solvable and, more importantly, profitable.

I look forward to the day an iOS developer can convert his code to Android app code by a simple upload and download.
