Better Math is the solution to fake news and social media manipulation

Oft repeated lies become the truth

“Repeat a lie often enough and it becomes the truth” is a law of propaganda often attributed to the Nazi Joseph Goebbels. The 2016 election showed how critically timed social media news can fatally damage a candidacy, with little or no remedy in law. (It is not illegal to spread rumours in a foreign country.)

Yet tech majors can help. With browser-embedded plugins, they can analyze and scrutinize the sentiment, the polarity, and the word associations of a web page. With the added advantage of IP address lookups, social media can be a fair place for news, just as credit bureaus are fair places for credit ratings.

Unless you like Russian Hackers for the people, of the people, by the people.
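
As a rough illustration of the sentiment and polarity scoring mentioned above, here is a minimal Python sketch. The word list and its weights are hypothetical and far too small for real use; an actual plugin would presumably rely on a validated lexicon or a trained model.

```python
import re

# Hypothetical toy lexicon for illustration only; the words and weights are
# assumptions, not taken from any real sentiment resource.
POLARITY = {
    "hoax": -1.0, "fraud": -1.0, "corrupt": -1.0, "scandal": -0.8,
    "honest": 1.0, "trusted": 0.8, "verified": 0.8, "success": 0.6,
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def polarity_score(text):
    """Return the mean polarity of lexicon words found in text (0.0 if none)."""
    hits = [POLARITY[w] for w in tokenize(text) if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

def loaded_word_ratio(text):
    """Share of tokens that carry any polarity, a crude 'emotional load' signal."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for w in tokens if w in POLARITY) / len(tokens)

page = "Candidate exposed in corrupt fraud scandal, sources say"
print(polarity_score(page), loaded_word_ratio(page))
```

A strongly negative score combined with a high loaded-word ratio could be one signal, among others, that a page deserves closer scrutiny.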



Working with Cloudera’s VM and Python and R

  1. Download Cloudera VM from
  2. Boot it up using VMware using instructions from and  (after download from
    1. Select File > Open.
    2. In the file selection window, find and select the virtual machine package or configuration file for the virtual machine to open. Virtual machine package files have the extension .vmwarevm. Virtual machine configuration files have the extension .vmx. You can view a file’s extension by selecting File > Get info.
    3. Click the Open button. VMware opens the virtual machine and powers it on.

  3. Download putty from (seriously dude)
  4. Log in to the Cloudera VM using PuTTY, entering the VM's IP address for connecting
    1. Username and Password – cloudera
  5. Install R using – sudo yum install R
  6. For Python see latest version at
  7. cd /opt
  8. sudo wget
  9. bash
  10. Accept all conditions!
  11. Type jupyter notebook to launch Python in a notebook
  12. For RStudio
    1. See download link from
    2.  sudo wget
    3. sudo bash
    4. yum install rstudio-1.1.383-x86_64.rpm
  13. For RStudio Server (a better alternative, since RStudio didn't work above)
    1. instructions from 
    2. $ wget
      $ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.383-x86_64.rpm
    3. Open http://localhost:8787/ in a browser inside the VM and use cloudera as both username and password
    4. Install packages as needed 🙂
    5. To check RStudio sessions, type this at the command line

sudo rstudio-server active-sessions 
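
Once the steps above are done, a quick sanity check from the Jupyter notebook can confirm which of the installed commands are visible. This is only a sketch: the command names are taken from the steps above, and rstudio-server may live in a directory that is not on a regular user's PATH, so "not found" does not necessarily mean the install failed.

```python
import shutil
import sys

def find_tools(names):
    """Map each command name to its full path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}

# Commands installed in the steps above; adjust the list to your setup.
tools = find_tools(["R", "jupyter", "rstudio-server"])
print("Python", sys.version.split()[0])
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```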


Hat tip –


Installing xgboost in Windows 10 for Python

Install dependencies

!pip install numpy scipy scikit-learn pandas

!pip install deap update_checker tqdm stopit

Install xgboost

C:\Users\KOGENTIX>git clone --recursive

Download DLL from


and put it in xgboost/python-package folder

C:\Users\KOGENTIX\xgboost>cd python-package


Change Environment Variables so it finds xgboost dll
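
To check whether Python can actually find and load xgboost after these steps, a small test like the following may help. It is a sketch, not part of the official install procedure; the helper name xgboost_version is made up for illustration.

```python
import importlib
import importlib.util

def xgboost_version():
    """Return the installed xgboost version string, or None if it cannot be loaded."""
    if importlib.util.find_spec("xgboost") is None:
        return None  # package is not installed in this environment
    try:
        module = importlib.import_module("xgboost")
    except Exception as exc:
        # A missing or unloadable native library (the DLL) surfaces here
        print("xgboost found but failed to load:", exc)
        return None
    return getattr(module, "__version__", "unknown")

print("xgboost version:", xgboost_version())
```

If this prints None with a load error, the environment variable pointing at the DLL is a likely place to look.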













How AI will be the future of e-commerce


E-commerce, or electronic commerce, has grown rapidly in the past decade, leveraging the internet to deliver a wide variety of goods and services. Players include Amazon, Flipkart and Alibaba, which sell a wide variety of products, and Pepper Fry, which sells furniture. Eventually electronic commerce is expected to eclipse traditional brick-and-mortar enterprises.

E-commerce is efficient in multiple ways. It can save on inventory by using warehouses for dispatch and logistics instead of storing goods in showrooms. It can use the data captured by online analytics software to better forecast demand for individual stock keeping units (SKUs). Lastly, it offers room for faster experimentation in interfaces, including things like recommendation engines (i.e. those who bought this book also bought these other books). The online data captured from the customer clickstream can be used to refine pricing and discounts, which are critical in a very competitive market.

Ecommerce and Big Data

A large number of customers come to an electronic commerce site every day, every hour. They click on certain links, follow certain pages, post reviews, view (but don’t purchase), and finally purchase items. This continuous stream of data, called the clickstream, adds up to very large volume and velocity, and the different behaviors create huge variety as well (i.e. some customers view one page and buy, while some view twenty pages before buying). This data is like crude oil: it needs to be refined before the business can act on the insights.

Consider, for example, association analysis or recommendation engines. Past data will be a huge sparse matrix (a matrix where most entries are 0) whose column headers are a huge variety of goods (let's say book titles). By looking at which book titles sell well together, the final book page can carry a section such as “people who viewed this also viewed that” or “people who bought this also bought that”. This in turn will trigger impulse purchases by future customers.
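
The "also bought" idea can be sketched in a few lines of Python by counting how often pairs of titles share a basket. The baskets and titles below are hypothetical, and storing only the pairs that actually co-occur is one simple way to represent the sparse matrix; a production system would work on far larger data with different tooling.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical baskets: each set holds the titles bought in one order.
baskets = [
    {"Book A", "Book B", "Book C"},
    {"Book A", "Book B"},
    {"Book B", "Book C"},
    {"Book A", "Book B"},
]

def co_purchase_counts(baskets):
    """Count how often each pair of titles appears in the same basket.

    Only pairs that actually co-occur are stored, which is a sparse
    representation of the item-by-item matrix.
    """
    counts = defaultdict(Counter)
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_bought(title, counts, top_n=2):
    """Return up to top_n titles most often bought together with title."""
    return [other for other, _ in counts[title].most_common(top_n)]

counts = co_purchase_counts(baskets)
print(also_bought("Book A", counts))
```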

The math behind this is simple, but it cannot be done on static data; it has to be done on rapidly changing data. Thus it needs big data, machine learning and automated interface design together. In addition, we can do A/B testing on the interface to keep it in sync with customer flow.
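
One common way to read an A/B test on an interface is a two-proportion z-test on conversion rates. The sketch below assumes that approach and uses made-up visitor numbers; it also assumes both variants have some conversions and some non-conversions, otherwise the standard error would be zero.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical numbers: layout A converts 200 of 2000 visitors, layout B 260 of 2000.
z, p = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two layouts is unlikely to be chance alone, which is the basis for rolling out the better variant.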

Behind the scenes, data will be stored in a distributed manner on HBase and MySQL, processed using MapReduce and Spark, with MLlib, R and scikit-learn used for machine learning. Big data also helps identify problem pages, where either the search rank is low or the bounce rate is high (customers leave soon after reaching the page).
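
To make the bounce rate idea concrete, here is a small single-machine Python sketch. The session data and the 0.6 threshold are hypothetical; at real clickstream scale the same logic would more likely run as a Spark or MapReduce job.

```python
from collections import defaultdict

# Hypothetical sessions: each list is the ordered pages one visitor viewed.
sessions = [
    ["/home", "/book/1", "/cart"],
    ["/book/2"],
    ["/book/2"],
    ["/book/2", "/cart"],
    ["/home"],
]

def bounce_rates(sessions):
    """Return {landing_page: share of sessions that viewed only that page}."""
    landings = defaultdict(int)
    bounces = defaultdict(int)
    for pages in sessions:
        if not pages:
            continue  # skip empty sessions
        landing = pages[0]
        landings[landing] += 1
        if len(pages) == 1:
            bounces[landing] += 1
    return {page: bounces[page] / landings[page] for page in landings}

def problem_pages(rates, threshold=0.6):
    """Pages whose bounce rate is at or above the threshold, worst first."""
    flagged = (page for page, rate in rates.items() if rate >= threshold)
    return sorted(flagged, key=lambda page: rates[page], reverse=True)

rates = bounce_rates(sessions)
print(problem_pages(rates))
```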

E-commerce and Analytics and Machine Learning

Just like the association analysis example above, e-commerce uses analytics in a wide variety of ways. The following are some of them:

  1. Inventory Forecasts- this uses prior data and calculates future purchases. Mostly it uses time series data but for probability of purchase/non-purchase it can use classification algorithms like Naive Bayes as well.
  2. Logistics Optimization- Instead of having showrooms, eCommerce has websites, but it does have warehouses. Delivering goods on time and in good condition is critical to brand reputation and customer loyalty. This leads to optimization of routes for trucks and delivery boys, a modern application of the classic transportation problem in operations research.
  3. Analysis of customer behaviour- mouse movement (heatmaps) or eye tracking to improve the web page interface. (see heatmap of a webpage from )
  4. Dynamic pricing (or discounts)- Ecommerce depends on discounts and promotions. It takes less than a minute for a customer to open a rival ecommerce site and compare prices (something which was not the case in traditional brick and mortar). By using algorithms for dynamic pricing based on customer behaviour (tracked through cookies), ecommerce tries to tweak profits. Too much discount and profit is lost. Too little discount and the potential customer is lost. So dynamic pricing based on prior behaviour is the key. This can be done using linear models like regressors.
  5. Classification of a large number of images of stock keeping units (SKUs), generating tags for a wide range of data (say a few tags for a computer followed by “read more” product info), and arranging the webpage accordingly. This can be done using deep learning as well as topic modeling for tags.
  6. Search results page for different keywords to give the most relevant result- This can use Big Data technologies like Solr
  7. Classifying reviews into spam/not spam and looking at sentiment – using Text Mining
  8. Classifying sellers and discarding sellers who are supplying low quality products using reviews.
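
As one illustration from the list above, spam/not-spam review classification can be sketched with a tiny multinomial Naive Bayes classifier using add-one smoothing. The training reviews and labels are hypothetical, and a real system would use a library such as scikit-learn or MLlib on far more data.

```python
import math
from collections import Counter

# Hypothetical labelled reviews for illustration only.
train = [
    ("great book fast delivery", "ok"),
    ("good quality and good price", "ok"),
    ("click here to win free money", "spam"),
    ("free gift click this link", "spam"),
]

def fit(examples):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def predict(text, model):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict("win a free gift", model))
print(predict("good book", model))
```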

E-commerce and AI

Both Analytics and Machine Learning are subsets of Artificial Intelligence. Using AI we can build better discounts (by using xgboost regression rather than OLS regression), better web pages (faster A/B testing or using eyeball studies), better routes for trucks and delivery boys (and even drones), and better prediction of customer needs (shown as prompts). The key thrust will of course be on deep learning and TensorFlow sitting on top of Hadoop big data for better eCommerce insights. Indeed, the company with the best AI system will dwarf the competition, since everything else in the eCommerce world can be copied easily apart from AI-based analytics.
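
For reference, the OLS baseline for discounts could look like the sketch below: fit units sold against discount, then search for the discount with the highest predicted profit. The sales history, list price and unit cost are all hypothetical, and a gradient-boosted model such as xgboost would replace fit_ols when the relationship is not linear.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical history: discount offered (%) and units sold that week.
discount = [0, 5, 10, 15, 20]
units = [100, 120, 140, 160, 180]
intercept, slope = fit_ols(discount, units)

PRICE, COST = 500.0, 275.0  # hypothetical list price and unit cost

def predicted_units(d):
    """Units the fitted line predicts at discount d (percent)."""
    return intercept + slope * d

def profit(d):
    """Predicted profit at discount d: predicted units times unit margin."""
    return predicted_units(d) * (PRICE * (1 - d / 100) - COST)

# Too much discount erodes margin, too little loses volume; search in between.
best_discount = max(range(0, 41), key=profit)
print(best_discount, profit(best_discount))
```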

(Created as part of a blogathon contest by Lymbyc, a company that creates Epoch, AI based platform, and a virtual data scientist at )