Interview Anup Purohit CIO YES BANK #Datathon #Datadriven #YESBANK #datascience #hackathons

Ajay- What is your take on the importance of being ‘data driven’?


Anup- At the expense of sounding cliched, we believe that Data is one of the most important assets we have, which doesn’t get reflected on our balance sheet. As a bank we are in the business of customer service, therefore our ability to provide a seamless experience to every customer depends on our ability to collect, store and analyze relevant data.

  • This has become even more important in the last 1-2 years with increased talk of providing ‘banking-as-a-service’, essentially integrating banking/financial services with every aspect of life, making the availability of data management and analytics tools essential
  • An interesting point here is that, now that we think of it, most industries, especially banks, have always had these data points and even at times used them. The catch, though, is that their use was limited to customer onboarding. While KYC literally spells Know Your Customer, this data was almost never used beyond compliance and risk analytics. This data, combined with data collected during the lifecycle of the customer and the additional computing ability available today, is a real differentiator, and hence being ‘data driven’ is a necessity and not a choice.

Ajay- Tell us a bit more about how the outlook on being data led or data driven has changed over time

Anup- For us, the start of this journey of really becoming data driven came at a very opportune moment. If you go back to 2003/04, when we started out building probably the country’s only greenfield bank, our focus was largely on the corporate segment while starting to build a retail franchise. Being the nth entrant in a highly competitive industry, technology was always going to be a differentiator for us. Yet both the data technologies available and the data we were collecting were limited, in their advancement and size respectively. Change came about 4-5 years back on 3 fronts:



    • First, we moved far deeper into the retail segment with a greater push on building a granular retail bank, which meant that the volume of data and the sources of data increased manifold
    • Second, this was interestingly also the time when the so-called Big Data technologies like distributed file systems, commodity servers and cloud compute really began to hit their stride
    • Third, the sources of collecting data almost tripled, and this is an understated aspect. For example, today we talk about voice analytics and the rise of Alexa/Siri among others, but since most customer service centers had IVR, voice data was available even then. Similarly, optical technologies like OCR also began to find their footing then, so image data, especially for signature verification etc., began to take off. With these sources, the need for investing in compute was greater than ever
  • This was like a trifecta, bringing a rare situation where the demand and supply graphs for us were rising at the same time, and it made our decision to invest far more into data management, security and analytics a tad easier

Ajay- What specific business needs / opportunities led to your investment in Big Data technologies ?

Anup- As I said earlier, the 3 factors (the rise in data volumes, the variety of sources, and the exponential growth in technology availability) meant that the traditional database management technologies we were extensively using were nearing obsolescence.



    • However, we still faced an interesting dilemma: the newer distributed file systems, cognitive and cloud computing were being used at that time only by technology-intensive industries. The financial services industry in particular was largely playing the ‘wait and watch’ game, which to an extent made sense – the idea being to wait for the technology and people expertise in big data to mature and then be fast followers.
    • While deliberating, we started looking at our global peers, some of whom were also our early investors and partners, and we realized that they had already taken the leap and were almost 4 years ahead of the curve. But among industry peers in India there were still no early/first movers. However, among other industries, some like e-commerce had moved beyond RDBMS and invested heavily in their machine learning capabilities.
    • Sensing an opportunity, we reached out to many of these organizations and tried to understand their stories and motivations. 2 clear learnings emerged:


  • While it’s true that any technology takes time to develop, big data, ML systems and statistical learning were already fairly mature, and the very nature of machine learning means that the true value is unlocked only once the machine truly understands the nuances specific to your industry and your customer set – offsetting the value of being ‘fast followers’


    • Also, in our cross-industry discussions we clearly learned that all customer service industries are essentially similar, and the extent of success depends on a term that is very commonly used but rarely understood in banking – KYC – Know Your Customer – the better you know them, the more integrated and customized your services are
  • It was clear then that we needed to invest and ‘go big’ on a far more nimble, scalable and flexible solution, which led us to Hadoop. Leveraging commodity servers as opposed to specialized hardware results in quantum cost savings on the infrastructure alone that is required to analyze these large data sets
  • It also set the foundation for our 3 priorities in data management – data security, which is of paramount importance, data management, and data-based decision-making. Of these, I would want to stress Data Management a lot: well-sourced, cleaned and managed data lays the foundation for any machine learning tool – and it’s essentially your data stacks which determine whether your machine learning software becomes Skynet or Flintstones


Ajay- What does your team look like and how do you interact with business ?


Anup- I’m a sports enthusiast, so on this let me give you a football analogy, especially since I hear a lot about the rising importance of stacking teams with data scientists.

  • In football, whatever formation you play essentially there are 3 parts, defense, midfield and attack. Our data management and analytics teams are similarly aligned, though it has little to do with our love for sports
  • The Data Management, Security, Sourcing team are our defenders, setting the rules of data security and sharing, making sure that we have a robust core from which we can build outwards
  • Then there is the midfield, which links defense and offense: the risk analytics team, a centralized core team of 10-15 focusing on optimizing data extraction, standardization, recycling/feedback loops and learning
  • Our offense, Business Analytics, is responsible for interacting with all functions of YES BANK to create innovative solutions that aid business success. This team is spread both vertically and horizontally, with dedicated business intelligence teams for each function and a specialized analytics strategy team layered on top to collaborate with the decentralized teams, identify valuable analytics content and promote it across the bank

Ajay- Why did YES Bank invest in Datathon ? How does it fit into your overall strategy ?



Anup- To keep pace with the fast changing landscape of data technology, it’s important to also have a broad ‘outside-in’ view through an ecosystem of leaders and learners who can guide us through our data native transformation.

  • YES Datathon is our initiative to crowdsource ideas from a pool of talented data learners and practitioners. We received an overwhelming response from 1700+ teams, making our ecosystem 6000+ strong. We are currently in the last phase of Datathon, where the Top 50 teams have access to curated and anonymized YES Bank data to select a problem statement of their choice and create a PoC
  • While going through their submissions, I realized ideas need not always come from within office walls. It has been interesting to see how we can learn and adapt data practices from across industries to ensure our service continues to remain the best in the industry.
  • While YES Datathon is a step to build an engaged community of data experts, engineers and scientists to maximize our data management and analytics practices – the long term approach is to build a team of experts within the bank: Team TechTONic, a CORE team of 100 business & technology experts who will drive digitization & digitalization of the Bank’s technology.


Anup Purohit has been the CIO of Yes Bank since 2015, and has a record of over 23 years of accomplishments in IT management across global and multi-cultural environments.

Data Science Training can be inexpensive and free

IS it TOUGH to be a DATA SCIENTIST? NO , it is not

Data Science is Not Rocket Science. But once a data scientist you have to keep learning every day.

Master R and Python basics along with statistics basics.

Then learn Machine Learning.

  • Text Mining Basic and Topic Modeling.
  • Time Series.

  • Then learn Deep Learning, ANN, CNN, RNN , LSTM.

  • Computer Vision.

  • Speech Recognition.

  • Chatbots.

  • Blockchain.

You can learn this from the internet for free. Don't get confused or insecure into paying lacs of rupees or thousands of dollars to institutes that give you certificates that are not recognized by corporations.


Intro to R

Intro to Python

Intro to Machine Learning

Here is one more free “kernel”, but in colab format:–essential-machine-learning-and-exploratory-data-analysis-with-python-and-jupyter-notebook

Is KAGGLE a website only for super human data scientists? NO NO NO

You can be a kaggler very easily-

1) Understand how kernels function, especially the input file and the output submission. It is best to use the Notebook method, not the script method, of running code

2) Have basic knowledge of EDA and Data Viz in either R or Python (if you don't know that EDA means exploratory data analysis, you can start learning from Kaggle KERNELS itself)

3) Have basic knowledge of Machine Learning Algorithms (and how to apply ) and how to compare Area under Curve (AUC)

4) Deep Learning is advanced and for Python preferably

5) Practice one hour a day. Kaggle is like a gym for the brain. If you do this for a year, see where your career zooms.
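
To make point 3 a little more concrete, here is a minimal sketch of comparing two models by AUC using only the Python standard library. The labels and scores are made-up illustrative values, and `auc` is a hypothetical helper name, not part of any library; in practice you would normally use a library routine such as scikit-learn's `roc_auc_score`.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    This equals the probability that a randomly chosen positive example
    gets a higher score than a randomly chosen negative one
    (ties count as half a win).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up validation labels and predicted probabilities from two models
y_true = [0, 0, 1, 1, 0, 1]
model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
model_b = [0.5, 0.6, 0.4, 0.7, 0.3, 0.2]

auc_a = auc(y_true, model_a)
auc_b = auc(y_true, model_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

An AUC of 0.5 is what random guessing gives and 1.0 is a perfect ranking, so the model with the higher AUC on the same validation data ranks positives above negatives more reliably.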

And one more thing: cross-post your code on GitHub

I am sure there are better kernels, but you can find them yourself, and best of all they are free. Tip: the number of votes often points to a better, more popular kernel


R Basics Here and

basic statistics

and free SAS learning from SAS itself 

Interview Questions

Python basics here and


(Free )Kaggle kernel + IBM Cognitive + edx + Kaggle contest + hackathon > certificate from paid private company ???

40 hours to gain a certificate for X dollars versus 40 hours on Kaggle for free. Which will give you better skills? What will get you a job: skills or certificates?

When we interview data scientist freshers we always have a coding round as the first step. Certificates from private institutes don't matter, regardless of how long or how expensive they are

I have been asked why I write these articles on free resources on data science, what is my agenda and why not let things be.

Well, short answer, if you charge thousands of dollars for content which can be free, and force young people in debt and indulge in predatory pricing, then someone needs to expose these merchants of data science certificates

Someone asked why I charge for my 3 data science books. I write books, publisher sells them and gives me 13% of royalty.  The books are 1/10th of price of a course.

Most importantly, I write books for academic credentials and because I love writing (as seen by my extensive blogging). Writing books is a great way to share knowledge in my opinion, but it takes a long time, so writing a blog tutorial, Kaggle kernel or GitHub code is faster.

I still do guest lectures, but in all cases I am not responsible for students paying too much, and I balance this by evangelizing free resources that would-be students are completely unaware of.

These free resources are often updated more than the curriculum of courses by institutes and they are often easy to understand

As the man said- Money for nothing and my MTV


1) What prompted you to make Hyreo?

The concept of Hyreo took shape in our mind as an outcome of the recruiting challenges we faced on a daily basis. All aspects of recruiting are very labor-intensive, and predictability of outcome at each stage was quite limited. The amount of time spent in sourcing, validating and assessing candidates was very high and hence pretty expensive. The same challenges existed in companies of all sizes. Hyreo took shape in our mind as a possible solution to address some of the recruiting challenges we saw around us. We are trying to leverage smart technology and automation to improve the way candidate sourcing, assessment and engagement are carried out. We also felt that the opportunity was quite large, since globally the recruitment model and process is fairly standard, with limited or minor changes. Availability of technologies including Open NLP and others also helped us decide on building Hyreo as a potential solution to these recruiting problems.

2) In your two-year journey as an entrepreneur with Hyreo, name some learnings and some turning points.

A few learnings from our entrepreneurial journey:

  1. Customers are the most important factor impacting everything – employees, investors & partners
  2. Partner as much as possible rather than build everything in-house, and create a ‘win-win’ for all parties
  3. Be prepared for rejection, it is unavoidable
  4. Hire slow but fire fast
  5. Entrepreneur knows more about the product than investor, customer or media
  6. Marketing is more important than one might think. Place it early in the lifecycle and use it effectively
  7. Create evangelists and supporters of the cause early in the game, but never on equity

3) Specifically, which need is Hyreo trying to address and solve?

Hyreo is disrupting the way companies ‘Discover’ and ‘Engage’ with talent. Hyreo leverages smart technology to automate the process of disseminating job information to prospective candidates, understand their interest level and subject proficiency, and keep the candidates engaged and up to date on the latest status 24/7. Built as a SaaS solution with chatbot technology, the platform is able to integrate with legacy systems or exist as a stand-alone system. Hyreo is built in a modular fashion such that customers can choose the product based on specific needs. By using the platform, companies are able to reduce overall recruiting effort by 50% and overall cost by 40%, with substantial improvement in candidate experience and hence talent brand.

4) What are some of the other innovations you see in the HR space

All aspects of HR and human capital management are being disrupted, with legacy processes being challenged by newer technology including machine learning and AI-based systems. Some of the areas where we see interesting innovation and proven merit include:

  1. Employee engagement: Be it answering employee queries or addressing issues of the employees, innovative technology solutions including Chatbots are being deployed
  2. Candidate reference checks are being automated to ensure the cycle time and the overall effort is reduced considerably
  3. Digital Learning platforms including micro learning platforms
  4. Intelligent interviewing platforms


5) What are some of the obstacles you see to HR innovations.

The journey has just begun and the initial inertia opposing the change has drastically reduced. There is a lot of exciting new technology in the market now, and it will take time for all stakeholders to evaluate options and adopt best practices. Some of the areas we should look at:

  1. HR should be a CEO's function, and the focus should be not just on improving process; the mindset should be to invest in success
  2. There is a need for re-branding HR as a growth catalyst rather than a growth support function
  3. Need more investments in HR Tech space

About Hyreo-


Missing Value Imputation and Dealing With Outliers

These are an important part of data pre-processing and these are rarely taught in DONKEY ACADEMY who charge you a lot to give you a certificate that doesn’t give you a job.

So okay after that violence and double talk (from Dire Straits) here is how you deal with outliers

1) Replace outliers or missing values with the mean or median, chosen based on the distribution. For example, if age<20 or age>80 then age=median(age)

2) Replace them by capping at upper and lower limits, e.g. an age distribution of 1-120 for bank customers can be capped like this: if age<20 then age=20, if age>80 then age=80

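3) Use the mice package for imputation in R, or a Python equivalent such as the MICE implementation in statsmodels or IterativeImputer in scikit-learn. A simpler group-wise version of the same idea: if males have a median age of 50 and females have a median age of 45, replace all missing male ages with 50 and all missing female ages with 45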

4) Use outlierTest in the car package in R

This is barely the tip of the iceberg in dealing with missing values and outliers.
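
The first three approaches above can be sketched in plain Python as follows. This is a minimal illustration only: the function names, the 20/80 limits as defaults and the sample records are assumptions made for the example, and the group-wise median fill is the simple version described in point 3, not the full MICE algorithm.

```python
from statistics import median


def replace_with_median(values, low=20, high=80):
    """Approach 1: replace missing (None) or out-of-range values with the median.

    Assumption: the median is taken over the in-range values only,
    so the outliers themselves do not influence it.
    """
    valid = [v for v in values if v is not None and low <= v <= high]
    med = median(valid)
    return [v if v is not None and low <= v <= high else med for v in values]


def cap(values, low=20, high=80):
    """Approach 2: cap values at the lower and upper limits (missing stays None)."""
    return [v if v is None else min(max(v, low), high) for v in values]


def impute_by_group(records, group_key="gender", value_key="age"):
    """Approach 3 (simple version): fill missing values with the group median."""
    groups = {}
    for r in records:
        if r[value_key] is not None:
            groups.setdefault(r[group_key], []).append(r[value_key])
    medians = {g: median(vals) for g, vals in groups.items()}
    return [
        {**r, value_key: medians[r[group_key]] if r[value_key] is None else r[value_key]}
        for r in records
    ]


# Made-up example data
ages = [25, 5, 42, None, 119, 60]
print(replace_with_median(ages))
print(cap(ages))

customers = [
    {"gender": "M", "age": 48},
    {"gender": "M", "age": 52},
    {"gender": "M", "age": None},
    {"gender": "F", "age": 45},
    {"gender": "F", "age": None},
]
print(impute_by_group(customers))
```

With pandas the same ideas are usually written with `fillna`, `clip` and `groupby(...).transform("median")`; the plain-Python version is used here only to keep the example free of dependencies.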

#machinelearning #algorithms #pythonprogramminglanguage #analytics #datascience #python #rstats