
Category Archives: Interviews

Cloud Computing for Christmas

My second book, R for Cloud Computing: An Approach for Data Scientists, is now ready for sale as an ebook. The softcover should be available within a month. Some of you have already booked an online review copy. It has taken me two years to write this book, and as always I welcome all feedback on how to be a better writer.

I would especially like to thank Hannah Bracken of Springer Publishing for this.

I dedicate this book to my seven-year-old son, Kush.

http://www.springer.com/statistics/computational+statistics/book/978-1-4939-1701-3


Everything that is good in me comes from your love, Kush.

Interview Christoph Waldhauser Founder, CEO at KDSS #rstats #googleplus

Here is an interview with Christoph Waldhauser, a noted researcher and the founder and CEO of KDSS (K Data Science Solutions), an R-based analytics advisory firm. In a generous and insightful interview, Christoph talks about perceptions around open source, academia versus startups, Europe versus North America for technology, and his own journey through it all.
Ajay Ohri (AO)- Describe your career in science. At what point did you decide to become involved in open source projects, including R?
Christoph Waldhauser (CW)- When I did my second course on quantitative social science, the software we used was SPSS. At that time, the entire social science curriculum was built around that package. There were no student versions available, only a number of computer labs on campus that had SPSS installed. I had previously switched from Windows to Linux to cut licensing costs (as a student you are constantly short on money) and to try something new. So I was willing to try the same with R. In the beginning I was quite lost, having to work with survey-weighted data and having had only rudimentary exposure to Perl before that. That was a time long before RStudio, and I started out with Emacs and ESS. As you might imagine, I landed in a world full of metaphorical pain. But in due time I was able to replicate all of the things we did in class. Even more, I was able to do them much more efficiently. And while my colleagues were only interpreting SPSS output, I was understanding where those numbers came from. This epiphany guided me for the remainder of my scientific career. For instance, instead of Word I'd use LaTeX (back then a thing unheard of at my department), and I even made free/libre/open source software the focus of my research for my master's thesis.
Continuing down that path, I eventually started working at WU Vienna University of Economics and Business, which brought me to one of the centers of R development. There, most of the daily work revolved around Free Software. The people there had a great impact on my understanding of Free Software and its influence on how we do research.
AO- Describe your work in social media analytics including package creation for Google Plus and your publication on Twitter
CW- Social media analytics is a very exciting field. The majority of research focuses on listening to the garden hose of social media data, that is, analyzing the communication revolving around certain keywords or communities, for instance, linking real-world events to the #euromaidan hashtag in Ukraine. I went down a different path: instead of looking at what all users have to say on a certain topic, I investigate how a certain user or class of users communicates across all the topics they bring up. So instead of following a top-down approach, I chose to go bottom-up.
Starting with the smallest building blocks of a social network has a number of advantages and leads eventually to Google+. The reason behind this is that the utility of social media for Google is different from that for, say, Twitter or Facebook. While classical social media is used to engage followers, say a lottery connected to the Facebook page of a brand, Google+ has an additional purpose: enlisting users to help Google produce better, more accurate search results. With this in mind, focus shifts naturally from many users on one topic to how a single user can use Google+ to optimize the message they get across and manage the search terms they are associated with.
This line of argument has fueled my research on Google+ and Twitter: which messages resonate most with the followers of a certain user? We know that each follower aims at resharing and liking messages she deems interesting. What precisely is interesting to a follower depends on her tastes. Whether that eventually leads to a reshare also depends on other factors like time of day and chance. For this, I've created a simulation framework that is centered on the preferences of individual users and their decision to reshare a social media message. In analyses of Twitter and Google+ content, we've found interesting patterns. For instance, there are significant differences in the types of messages that are popular among followers of the US Democrats' and Republicans' Twitter accounts. I'm currently investigating whether these observations can also be found in Europe. In the world of brand marketing, we've found significant differences in the wording of messages between localized Google+ pages. For instance, different mixtures of emotions in BMW's German and US Google+ pages are key to increased reshare rates.
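To make the idea concrete, here is a minimal sketch in R of what such a preference-centered reshare simulation could look like. All names and the one-dimensional preference model are illustrative assumptions for this post, not Christoph's actual framework.

# Toy reshare simulation: each follower has a taste on one topic axis and
# reshares a message when it lies close enough to that taste, plus chance.
set.seed(42)
n_followers    <- 1000
follower_taste <- runif(n_followers)             # assumed preference per follower
message_topic  <- 0.7                            # position of one message on the same axis
affinity       <- 1 - abs(follower_taste - message_topic)
p_reshare      <- plogis(5 * (affinity - 0.8))   # chance still plays a role
reshared       <- rbinom(n_followers, size = 1, prob = p_reshare)
mean(reshared)                                   # expected reshare rate for this message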
AO- What are some of the cultural differences in implementing projects in academia and in startups?
 
CW- This is a very hard question, mainly because there is no one academia. Broadly speaking, quantitative academia can always be broken down into two classes that I like to refer to as science vs. engineering. Within this framework, science seeks to understand why something is happening, often at the cost of practical implementability. The engineering mindset, on the other hand, focuses on producing working solutions and is less concerned about understanding why something behaves the way it does. Take for instance neural networks, which are currently enjoying somewhat of a renaissance due to deep learning approaches. In science, neural networks are not really popular because they are black boxes. You can use a neural network to produce great classifiers and use them to filter, say, the pictures of cats out of a stack of pictures. But even then it is not clear what factors are the defining essence of a cat. So while engineers might be happy to have found a useful classifier, scientists will not be content. This focus on understanding precludes many technological options for science. For instance, big data analyses aim at finding patterns that hold for most cases, but accept it if the patterns don't apply to every case. This is fine for engineering, but science would require theoretical explanations for each case that did not match the pattern. To me, this "rigor" has few practical benefits.
In startups, there is little place for science. The largest part of startups is financed by some sort of borrowed capital, and those lenders are interested only in a return on their investment, not in insights or enlightenment. So, to me, there are few differences between academic engineering and startups. But I find that startups that want to do science proper will have a very hard time getting off to a good start. That is not to say it's impossible, just more difficult.
AO- What do you think of the open access publishing movement as represented by http://arxiv.org/ and JSS? What are some of the costs and benefits for researchers that prevent wholesale adoption of the open access system, and how can these be addressed?
 
CW- I think it is important to differentiate open access and preprints like arxiv.org. Open access merely means that articles are accessible without paying for accessing them. As most research that is published has been paid for by the taxpayer anyway, it should also be freely accessible to them. Keeping information behind paywalls is a moral choice, and I think it is self-evident which side we as a scientific community should choose. I’d also question the argument of publishing houses that their services are costly. Which services? Copy-editing? Marketing stunts at conferences? I fail to see how these services are important to academia.
Turning to preprints, one must note that academic publishing is currently plagued by two flaws. One is the lack of transparency that leads to poor reviews. The other is academia's use of publications as a quantitative indicator of academic success. This has led to a vast increase in results being submitted for publication: a publishing house that previously had to review hundreds of articles is now facing thousands. Therefore, it is not uncommon today for authors to have to wait multiple years until a final decision is made by the editors. And the longer the backlog of articles gets, the lower the quality of the reviews becomes. This is unacceptable.
Preprints are a way around this deadlock. Findings can be accessed by fellow researchers even before a formal review has been completed. In an ideal world with impeccable review quality, this would lead to a watering down of the quality of research being available. This certainly poses a risk, but today's reviews are far from flawless. More often than not, reviews fail to discover obvious flaws in research design, and reviewers barely ever check whether the data actually lead to the results published. So, whether relying on preprints or reviewed articles, researchers always need to use common sense anyway: if five independent research groups come to the same conclusion, the papers are likely to be solid. This heuristic is somewhat similar to Wikipedia: it might not always be correct, but most of the time it is.
AO- What are some of the differences you have encountered in the funding, research, and development ecosystem, both in academia and in tech startups, between Europe and North America?
 
CW- Living and working in a country that is increasingly affected by the aftershocks of the Great Recession of 2007-2009 has left me somewhat disillusioned. In the face of the economic problems in Europe at the moment, most public funding has come to a full stop. Private capital is still somewhat available, but here too risk management has led to an increased focus on business plans and marketability. As pointed out above, this is less of a problem for engineering approaches (even though writing convincing business plans is challenging to scientists and engineers alike). But it is outright deadly to science. From what I see, North America has a different tradition. There, engineering generates so much revenue that part of that revenue goes back to science. An attitude we certainly lack in Europe is what Tim O'Reilly terms the makers' mindset. We could use some more of that.
AO- In enterprise software, people often pay more for software that is worse than open source alternatives. What are your thoughts on this?
 
CW- I've just had an interesting discussion with the head of a credit risk unit in a major bank. The unit is switching from SAS to SPSS for modeling credit risk. R, an equally capable or perhaps even superior free software solution, was not even considered. The rationale behind that is simple: in case the software is faulty, there is a company to blame and hold liable. Free software in general does not have software companies that back it in that way. So this appears to be the reason behind the psychological barrier to using free software. But I think it is a false security. Suppose every bank in the world uses either SAS or SPSS for credit risk modeling, and at some point a fatal flaw is discovered in one of those two packages. This flaw is then likely to affect most of the banks that chose it. So within 24 hours of that flaw being discovered, the company backing the product will have to file for bankruptcy. It is somewhat ironic that people responsible for credit risk management don't see that the high correlation introduced by all banks relying on the same software company does not mitigate but greatly inflates their risk.
For example, some years ago a SAS executive said she'd feel more comfortable flying in a plane that had been developed using closed source rather than open source software, because closed source would provide increased quality. That line of argument has been thoroughly refuted. However, there is some truth in the idea that an investor might be more comfortable putting money into an aircraft company that relies on closed source software, for reasons of liability. Should a plane go down because of a closed source software bug, then the software company and not the aircraft company could be held liable, so any lawsuits against the aircraft company would be redirected to the software company. But at the end of the day, again, the software company will go out of business, leaving the aircraft company with the damage nonetheless.
So the trade-off between poorly designed or implemented, expensive closed source software and superior free software comes down to questions of liability. But the truth is that this is a false sense of security. I would therefore always argue in favor of free software.
KDSS is a state-of-the-art data science advisory firm. You can see more of Christoph's work at https://github.com/tophcito

Interview Heiko Miertzsch eoda #rstats #startups

This is an interview with Heiko Miertzsch, founder of eoda (http://www.eoda.de/en/). eoda is a cutting-edge startup; recently it launched a few innovative products that made me sit up and pay attention. In this interview, Heiko Miertzsch, the founder of eoda, talks about the startup journey and the vision of analytics.

DecisionStats (DS)- Describe the journey of your startup eoda. What made you choose R as the platform for your software and training? Name a few turning points and milestones in your corporate journey.

Heiko Miertzsch (HM)- eoda was founded in 2010 by Oliver and me. We both have strong backgrounds in analytics and the information technology industry, so we observed the market for a while before starting the business. We saw two trends: first, a lot of new technologies and tools for data analysis were appearing, and second, open source seemed to be becoming more and more important for several reasons, just to name one, the ease of sharing experience and code in a broad and professional community. Disruptive forces seemed to be changing the market, and we just didn't want to back the wrong horse.

From the beginning we tested R, and we were enthusiastic. We started choosing R for our projects, software development, and services, and built up a training program for R. We already believed in 2010 that R had a successful future: it was more flexible than other statistical languages, more powerful in terms of functionality, it could be integrated into an existing environment, and much more.

DS- You make both Software products and services. What are the challenges in marketing both?

HM- We do even more: we provide consulting, training, individual software, customized software, and services. It is pure fun for us to go to our customers and say, "Hey, we can help you solve your analytical problems, no matter what kind of service you want to buy, what kind of infrastructure you use, whether you want to learn about forest trees or buy a SaaS solution to predict your customers' revenues." In a certain way we don't see barriers between these delivery models, because we use analytics as our basis. First of all we focus on the customer's analytical problem, and then we find the ideal solution together with the customer.

DS- Describe your software tableR. How does it work, what is the pricing, and what is the benefit to the user? Name a few real-life examples of usage, if available.

HM- Today the process of collecting data, analyzing it, and presenting the results is characterized by a heterogeneous software environment with many tools, file formats, and manual processing steps. tableR supports the entire process, from designing a questionnaire, sharing a structured format with the CAXI software, and importing the data, to doing the analysis and producing the table report, in one single solution. The base report comes with just one click, and if you want to go into more detail you can enhance your analysis with your own R code.

tableR is in a closed beta at the moment, and the open beta will start in the next few weeks.

(It is available at http://www.eoda.de/en/tableR.html)

DS- Describe your software translateR (http://www.eoda.de/en/translateR.html). How does it work, what is the pricing, and what is the benefit to the user? Name a few real-life examples of usage, if available.

HM- Many companies have realized the advantages of the open source programming language R. translateR allows a fast and inexpensive migration to R, currently from SPSS code.

The manual migration of complex SPSS® scripts has always been tedious and error-prone. translateR helps here, and the task of translating by hand becomes a thing of the past. The beta test of translateR will also start in the next few weeks.
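To illustrate the kind of translation involved, here is a hand-written example of a small piece of SPSS syntax and one possible R equivalent. This is only a sketch of the task translateR automates, not output from translateR itself, and the file and variable names are made up.

# SPSS:
#   FREQUENCIES VARIABLES=gender.
#   REGRESSION /DEPENDENT income /METHOD=ENTER age gender.
# One possible R equivalent:
dat <- read.csv("survey.csv")                    # hypothetical input file
table(dat$gender)                                # frequency table
summary(lm(income ~ age + gender, data = dat))   # linear regression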

DS- How do you think we can use R on the cloud for better analytics?

HM- Well, R seems to bring together the best "data scientists" in the world, with all their different focuses on different methods, vertical knowledge, technical experience, and more. The cloud is a great workplace: it holds the data, a lot of data, and it offers a technical platform with computing power. If we succeed in bringing these two aspects together, we can provide a lot of knowledge to solve a lot of problems, with individual and global impact.

DS- What advantages and disadvantages does working on the cloud give to a R user?

HM- In terms of R, I don't see any aspects other than those of using the cloud in general.

DS- Startup life can be hectic. What do you do to relax?

HM- Oliver and I both have families, so eoda is our time to relax, just fun. I guess we do the same typical things as others: Oliver plays soccer and goes running. I like any kind of endurance sport and go climbing, the first to give me some thought-free space, the second to train focusing on a concrete target.

About-

translateR is the new service from the Germany-based R specialist eoda, which helps users translate SPSS® code to R automatically. translateR is developed in cooperation with the University of Kassel and financially supported by the LOEWE program of the state of Hesse. translateR will be available as a cloud service and as a desktop application.

eoda offers consulting, software development and training for analytical and statistical questions. eoda is focused on R and specializes in integrating R into existing software environments.

 

Interview Tobias Verbeke Open Analytics #rstats #startups

Here is an interview with Tobias Verbeke, Managing Director of Open Analytics (http://www.openanalytics.eu/). Open Analytics is doing cutting edge work with R in the enterprise software space.
Ajay- Describe your career journey including your involvement with Open Source and R. What things enticed you to try R?

Tobias- I discovered the Free Software Foundation while still at university and spent wonderful evenings configuring my GNU/Linux system and reading RMS essays. For the statistics classes proprietary software was proposed, and that was obviously not an option, so I started tackling all problems using R, which was at the time (around 2000) still an underdog together with pspp (a command-line SPSS clone) and xlispstat. From that moment on, I decided that R was the hammer and all problems to be solved were nails ;-) In my early career I worked as a statistician / data miner for a general consulting company, which gave me the opportunity to bring R into Fortune 500 companies and learn what was needed to support its use in an enterprise context. In 2008 I founded Open Analytics to put these lessons into practice, and we started building tools to support the data analysis process using R. The first big project was Architect, which started as an Eclipse-based R IDE but has more and more evolved into an IDE for data science more generally. In parallel we started working on infrastructure to automate R-based analyses and to plug R (and therefore statistical logic) into larger business processes, and soon we had a tool suite to cover the needs of industry.

Ajay- What is RSB all about? What needs does it satisfy, and who can use it?

Tobias- RSB stands for the R Service Bus; it is communication middleware and a work manager for R jobs. It allows users to trigger R jobs and receive their results using a plethora of protocols such as RESTful web services, e-mail protocols, sftp, folder polling, etc. The idea is to enable people to push a button (or software to make a request) and have them receive automated R-based analysis results or reports for their data.
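As a rough illustration of the "push a button" idea, here is how a client might submit a job to a service like RSB over a RESTful interface from R using the httr package. The URL, endpoint path, and payload fields are made-up placeholders, not RSB's documented API; consult the RSB documentation for the real protocol.

library(httr)

rsb_url <- "http://rsb.example.com/rsb/api/rest/jobs"   # hypothetical endpoint

# Submit a data file for an R-based analysis and read back the job descriptor.
resp <- POST(rsb_url,
             body = list(applicationName = "riskReport",
                         dataFile = upload_file("portfolio.csv")),
             encode = "multipart")
stop_for_status(resp)
job <- content(resp)                                    # descriptor returned by the server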

Ajay- What is your vision, and what have been the challenges and learnings so far in the project?

Tobias- RSB started when automating toxicological analyses in the pharmaceutical industry in collaboration with Philippe Lamote. Together with David Dossot, an exceptional software architect in Vancouver, we decided to cleanly separate concerns, namely to separate the integration layer (RSB) from the statistical layer (R) and, likewise, from the application layer. As a result, any arbitrary R code can be run via RSB, and any client application can interact with RSB as long as it can talk one of the many supported protocols. This fundamental design principle makes us different from alternative solutions where statistical logic and integration logic are always somehow interwoven, which results in maintenance and integration headaches. One of the challenges has been to keep focus on the core task of automating statistical analyses and not deviate into features that would turn RSB into a tool for interaction with an R session, which deserves an entirely different approach.

Ajay- Computing seems to be moving to a heterogeneous cloud, server, and desktop model. What do you think about R and cloud computing, current and future?

Tobias- From a freedom perspective, cloud computing and the SaaS model are often a step backwards, but in our own practice we obviously follow our customers' needs and offer RSB hosting from our data centers as well. Also, our other products, e.g. the R IDE Architect, are ready for the cloud and for use on servers via Architect Server. As far as R itself is concerned in relation to cloud computing, I foresee its use increasing. At Open Analytics we see an increasing demand for R-based statistical engines that power web applications living in the cloud.

Ajay- You recently released RSB version 6. What are the new features? What is the planned roadmap going forward?

Tobias- RSB 6.0 is all about large-scale production environments and strong security. It kicked off on a project where RSB was responsible for spitting out 8,500 predictions per second. Such large-scale production deployments of RSB motivated the development of a series of features. First of all, RSB was made lightning fast: we achieved a full round trip from REST call to prediction in 7 ms on the mentioned use case. In order to allow for high throughput, RSB also gained a synchronous API (RSB 5.0 had an asynchronous API only). Another new feature is the availability of client-side connection pooling to the pool manager of R processes that are ready to serve RSB. Besides speed, this type of production environment also needs monitoring and resilience in case of issues. For the monitoring, we made sure that everything is in place for monitoring and remotely querying not only the RSB application itself, but also the pool of R processes managed by RServi.

 

(Note from Ajay- RJ is an open source library providing tools and interfaces to integrate R into Java applications. The RJ project also provides a pool of R engines, easy to set up and manage via a web interface or JMX. One or more clients can borrow the R engines (called RServi); see http://www.walware.de/it/rj/ and https://github.com/walware/rj-servi)
Also, we now allow users to define node validation strategies to be able to check that R nodes are still functioning properly. If not, the nodes are killed and new nodes are started and added to the pool. In terms of security, we are now able to cover a very wide spectrum of authentication and authorization. We have machines up and running using OpenID, basic HTTP authentication, LDAP, SSL client certificates, etc., to serve everyone from the individual user who is happy with OpenID authentication for his RSB app to large investment banks with very strong security requirements. The next step is to provide tighter integration with Architect, such that people can release new RSB applications without leaving the IDE.

Ajay- How does the startup ecosystem in Europe compare with, say, the SF Bay Area? What are some of the good things and not-so-great things?

Tobias- I do not feel qualified to answer such a question, since I founded a single company in Antwerp, Belgium. That being said, Belgium is great! :-)

Ajay- How can we popularize STEM education using MOOCs, training, etc.?

Tobias- Free software. Free as in beer and as in free speech!

Ajay- Describe the open source ecosystem in general, and the R ecosystem in particular, for Europe. How does it compare with other locations, in your opinion?

Tobias- Open source is probably a global ecosystem and crosses oceans very easily. Dries Buytaert started off Drupal in Belgium and now operates from the US, interacting with a global community. From a business perspective, there are as many open source models as there are open source companies. I noticed that the major US R companies (Revolution Analytics and RStudio) cherished the open source philosophy initially, but both drifted into models combining open source and proprietary components. At Open Analytics, there are only open source products, and enterprise customers have access to exactly the same functionality as a student may have in a developing country. That being said, I don't believe this is a matter of geography; it has more to do with the origins and different strategies of the companies.

Ajay- What do you do for work-life balance and de-stressing when not shipping code?

Tobias- In a previous life the athletics track helped keep my hands off the keyboard. Currently, my children find very effective ways to achieve similar goals.

About-

Open Analytics is a consulting company specializing in statistical computing using open technologies. You can read more about it at http://www.openanalytics.eu

Interview Louis Bajuk-Yorgan TIBCO Enterprise Runtime for R (TERR) #rstats

Here is an interview with Louis Bajuk-Yorgan from TIBCO. TIBCO, which was the leading commercial vendor of S-PLUS, the precursor of the R language, makes a commercial enterprise version of R called TIBCO Enterprise Runtime for R (TERR). Louis also presented recently at useR! 2014: http://user2014.stat.ucla.edu/abstracts/talks/54_Bajuk-Yorgan.pdf


DecisionStats(DS)- How is TERR different from Revolution Analytics or Oracle R? How is it similar?

Louis Bajuk-Yorgan (Lou)- TERR is unique, in that it is the only commercially-developed alternative R interpreter. Unlike other vendors, who modify and extend the open source R engine, we developed TERR from the ground up, leveraging our 20+ years of experience with the closely-related S-PLUS engine.
Because of this, we were able to architect TERR to be faster, more scalable, and handle memory much more efficiently than the open source R engine. Other vendors are constrained by the limitations of the open source R engine, especially around memory management.
Another important difference is that TERR can be licensed to customers and partners for tight integration into their software, which delivers a better experience for their customers. Other vendors typically integrate loosely with open source R, keeping R at arm’s length to protect their IP from the risk of contamination by R’s GPL license. They often force customers to download, install and configure R separately, making for a much more difficult customer experience.
Finally, TIBCO provides full support for the TERR engine, giving large enterprise customers the confidence to use it in their production environments. TERR is integrated in several TIBCO products, including Spotfire and Streambase, enabling customers to take models developed in TERR and quickly integrate them into BI and real-time applications.

DS- How much of R is TERR compatible with?

Lou- We regularly test TERR with a wide variety of R packages, and extend TERR to greater R coverage over time. We are currently compatible with ~1800 CRAN packages, as well as many Bioconductor packages. The full list of compatible CRAN packages is available at the TERR Community site at tibcommunity.com.

DS- Describe TIBCO Cloud Compute Grid. What are its applications for data science?

Lou- TIBCO Cloud Compute Grid leverages the TIBCO GridServer architecture, which has been used by major Wall Street firms to run massively parallel applications across tens of thousands of individual nodes. TIBCO CCG brings this robust platform to the cloud, enabling anyone to run massively parallel jobs on their Amazon EC2 account. The platform is ideal for big-computation jobs, such as Monte Carlo simulation and risk calculations. More information can be found at the TIBCO Cloud Marketplace at https://marketplace.cloud.tibco.com/.
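The "Big Computation" jobs mentioned here are embarrassingly parallel. As a minimal local sketch of the pattern, the following R code splits a Monte Carlo risk estimate across workers using the base parallel package; it only illustrates the idea and is not an example of the TIBCO CCG submission mechanism, and the return model is an arbitrary assumption.

library(parallel)

# Estimate the 99% value-at-risk of a toy 1M position by simulation.
simulate_losses <- function(n) {
  returns <- rnorm(n, mean = 0.0005, sd = 0.02)   # assumed daily return model
  -returns * 1e6                                  # losses in currency units
}

cl <- makeCluster(4)                              # 4 local workers stand in for grid nodes
clusterSetRNGStream(cl, 123)
losses <- unlist(parLapply(cl, rep(250000, 4), simulate_losses))
stopCluster(cl)

quantile(losses, 0.99)                            # Monte Carlo VaR estimate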

DS- What advantages does TIBCO's rich history with the S project give it for the R project?

Lou- Our 20+ years of experience with S-PLUS gave us unique knowledge of the commercial applications of the S/R language, deep experience with architecting, extending, and maintaining a commercial S language engine, strong ties to the R community, and a rich trove of algorithms we could apply in developing the TERR engine.

DS- Describe some benchmarks of TERR against open source R.

Lou- While the speed of individual operations will vary, overall TERR is roughly 2-10x faster than open source R when applied to small data sets, and 10-100x faster when applied to larger data sets. This is because TERR's efficient memory management enables it to handle larger data more reliably and stay more linear in performance as data sizes increase.

DS-  TERR is not open source. Why is that?

Lou- While open sourcing TERR is an option we continue to consider, we've decided to initially focus our energy and time on building the best S/R language engine possible. Running a successful, vibrant open source project is a significant undertaking to do well, and if we choose to do so, we will invest accordingly.
Instead, for now we've decided to make a Developer Edition of TERR freely available, so that the R community at large can still benefit from our work on TERR. The Developer Edition is available at tap.tibco.com.

DS- What is TIBCO like as a place to work for potential data scientists?

Lou- Great! I have worked in this space for nearly 18 years, in large part because I get the opportunity to work with customers in many different industries (such as life sciences, financial services, energy, consumer packaged goods, etc.) who are trying to solve valuable and interesting problems.
We have an entire team of data scientists, called the Industry Analytics Group, who work on these sorts of problems for our customers, and we are always looking for more Data Scientists to join that team.

DS- How is TIBCO giving back to the R community globally? What are its plans for the community?

Lou- As mentioned above, we make a free Developer Edition of TERR available. In addition, we've been sponsors of useR! for several years, we contribute feedback to the R Core team as we develop TERR, and we often open source packages that we develop for TERR so that they can be used with open source R as well. This has included packages ported from S-PLUS (such as sjdbc) and new packages (such as tibbrConnector).

DS- As a sixth-time attendee of useR!, describe the evolution of the R ecosystem as you have observed it.

Lou- It has been fascinating to see how the R community has grown and evolved over the years. The useR! conference at UCLA this year was the largest ever (700+ attendees), with more commercial sponsors than ever before (including enterprise heavyweights like TIBCO, Teradata, and Oracle, smaller analytic vendors like RStudio, Revolution, and Alteryx, and new companies like plot.ly). What really struck me, however, was the nature of the attendees. There were far more attendees from commercial companies this year, many of whom were R users. More so than in the past, there were many people who simply wanted to learn about R.
About-
Lou Bajuk-Yorgan leads Predictive Analytics product strategy at TIBCO Spotfire, including the development of the new TIBCO Enterprise Runtime for R. With a background in Physics and Atmospheric Sciences, Lou was a Research Scientist at NASA JPL before focusing on analytics and BI software 16 years ago. An avid cyclist, runner and gamer, Lou frequently speaks and tweets (@LouBajuk) about the importance of Predictive Analytics for the most valuable business challenges.

Interview Jan de Leeuw Founder JSS

Here is an interview with one of the greatest statisticians and educators of this generation, Prof. Jan de Leeuw. In this exclusive and freewheeling interview, Prof. de Leeuw talks about the evolution of technology, education, and statistics, and generously shares nuggets of knowledge of interest to present and future statisticians.


DecisionStats(DS)- You have described the UCLA Department of Statistics as your magnum opus. Name a couple of turning points in your career which helped in this creation.

Jan de Leeuw (JDL) -From about 1980 until 1987 I was head of the Department of Data Theory at Leiden University. Our work there produced a large number of dissertations which we published using our own publishing company. I also became president of the Psychometric Society in 1987. These developments resulted in an invitation from UCLA to apply for the position of director of the interdepartmental program in social statistics, with a joint appointment in Mathematics and Psychology. I accepted the offer in 1987. The program eventually morphed into the new Department of Statistics in 1998.

DS- Describe your work with the Gifi software and nonlinear multivariate analysis.

JDL- I started working on NLMVA and MDS in 1968, while I was a graduate student researcher in the new Department of Data Theory. Joe Kruskal and Doug Carroll invited me to spend a year at Bell Labs in Murray Hill in 1973-1974. At that time I also started working with Forrest Young and his student Yoshio Takane. This led to the sequence of "alternating least squares" papers, mainly in Psychometrika. After I returned to Leiden we set up a group of young researchers, supported by NWO (the Dutch equivalent of the NSF) and by SPSS, to develop a series of Fortran programs for NLMVA and MDS.
In 1980 the group had grown to about 10-15 people, and we gave a successful postgraduate course on the "Gifi methods", which eventually became the 1990 Gifi book. By the time I left Leiden most people in the group had gone on to do other things, although I continued to work in the area with some graduate students from Leiden and Los Angeles. Then around 2010 I worked with Patrick Mair, a visiting scholar at UCLA, to produce the R packages smacof, anacor, homals, aspect, and isotone. Also see https://www.youtube.com/watch?v=u733Mf7jX24
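As a small usage sketch, this is how one of those packages, smacof, is typically called for a two-dimensional MDS fit; the data set and options here are illustrative, so check the package documentation for the current arguments.

library(smacof)

# Multidimensional scaling of dissimilarities between US states.
d   <- dist(scale(USArrests))     # Euclidean distances on standardized variables
fit <- smacofSym(d, ndim = 2)     # SMACOF majorization in two dimensions
fit$stress                        # badness-of-fit measure
plot(fit)                         # configuration plot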

DS- You have presided over almost five decades of changes in statistics. Can you describe the effect of changes in computing and statistical languages over the years, and some learnings from these changes?

JDL- I started in 1968 with PL/I. Card decks had to be flown to Paris to be compiled and executed on the IBM/360 mainframes. Around the same time APL came up and satisfied my personal development needs, although of course APL code was difficult to communicate. It was even difficult to understand your own code after a week. We had APL symbol balls on the Selectric typewriters and APL symbols on the character terminals. The basic model was there: you develop in an interpreted language (APL) and then for production you use a compiled language (FORTRAN). Over the years APL was replaced by XLISP and then by R. Fortran was largely replaced by C; I never switched to C++ or Java. We discouraged our students from using SAS or SPSS or MATLAB. UCLA Statistics promoted XLISP-STAT for quite a long time, but eventually we had to give it up. See http://www.stat.ucla.edu/~deleeuw/janspubs/2005/articles/deleeuw_A_05.pdf.

(In 1998 the UCLA Department of Statistics, which had been one of the major users of Lisp-Stat, and one of the main producers of Lisp-Stat code, decided to switch to S/R. This paper discusses why this decision was made, and what the pros and the cons were. )

Of course the WWW came up in the early nineties and we used a lot of CGI and PHP to write instructional software for browsers.

Generally, there has never been a computational environment like R: so integrated with statistical practice and development, and so enormous, accessible, and democratic. I must admit I personally still prefer to use R as originally intended: as a convenient wrapper around, and command line interface for, compiled libraries and code. But it is also great for rapid prototyping, and in that role it has changed the face of statistics.
The fact that you cannot really propose statistical computations without providing R references and in many cases R code has contributed a great deal to reproducibility and open access.

DS- Do Big Data and Cloud Computing, in the era of the data deluge, require a new focus on creativity in statistics, or just better application in industry of statistical computing over naive models?
JDL- I am not really active in Big Data and Cloud Computing, mostly because I am more of a developer than a data analyst. That is of course a luxury.
The data deluge has been there for a long time (sensors in the road surface, satellites, weather stations, air pollution monitors, EEGs, MRIs), but until fairly recently there were no tools, in either hardware or software, to attack these data sets. Of course big data sets have changed the face of statistics once again, because in the context of big data the emphasis on optimality and precise models becomes laughable. What I see in the area is a lot of computer science, a lot of fads, a lot of ad hoc work, and not much of a general rational approach. That may be unavoidable.
DS- What is your biggest failure in professional life
JDL- I decided in 1963 to major in psychology, mainly because I wanted to discover big truths about myself. About a year later I discovered that psychology and philosophy do not produce big truths, and that my self was not a very interesting object of study anyway. I switched to physics for a while, and minored in math, but by that time I already had a research assistant job, was developing software, and was not interested any more in going to lectures and doing experiments. In a sense I dropped out. It worked out fairly well, but it sometimes gives rise to imposter syndrome.
DS- If you had to do it all over again, what are the things you would really look forward to doing.

JDL- I really don’t know how to answer this. A life cannot be corrected, repeated, or relived.

DS- What motivated you to start the Journal of Statistical Software and push for open access?
JDL- That's basically in the useR! 2014 presentation. See http://gifi.stat.ucla.edu/janspubs/2014/notes/deleeuw_mullen_U_14.pdf
DS- How can we make departments of Statistics and departments of Computer Science work more closely together for a better, industry-relevant syllabus, especially in data mining, business analytics, and statistical computing?
JDL- That's hard. The cultures are very different: CS is so much more aggressive and self-assured, as well as having more powerful tools and better trained students. We have tried joint appointments but they do not work very well. There are some interdisciplinary programs, but generally CS dominates and provides the basic keywords such as neural nets, machine learning, data mining, cloud computing, and so on. One problem is that in many universities statistics is the department that teaches the introductory statistics courses, and not much more. Statistics is forever struggling to define itself, to fight silly battles about foundations, and to try to control the centrifugal forces that do statistics outside statistics departments.
DS- What are some of the work habits that have helped you be more productive in your writing and research
JDL- Well, if one is not brilliant, one has to be a workaholic. It's hard on the family. I decided around 1975 that my main calling was to gather and organize groups of researchers with varying skills and interests, and not to publish as much as possible. That helped.
Jan de Leeuw (born December 19, 1945) is a Dutch statistician, and Distinguished Professor of Statistics and Founding Chair of the Department of Statistics, University of California, Los Angeles. He is known as the founding editor and editor-in-chief of the Journal of Statistical Software, as well as editor-in-chief of the Journal of Multivariate Analysis.

Interview Antonio Piccolboni Big Data Analytics RHadoop #rstats

Here is an interview with Antonio Piccolboni, a consultant on big data analytics who has most notably worked on the RHadoop project for Revolution Analytics. Here he tells us about writing better code and the projects he has been involved with.
 
DecisionStats(DS)- Describe your career journey from being a computer science student to one of the principal creators of RHadoop. What motivated you, and what challenges did you overcome? What were the turning points? (You have 3500+ citations. What are most of those citations regarding?)

Antonio (AP)- I completed my undergrad in CS in Italy. I liked research and industry didn’t seem so exciting back then, both because of the lack of a local industry and the Microsoft monopoly, so I entered the PhD program.
After a couple of false starts I focused on bioinformatics. I was very fortunate to get involved in an international collaboration and that paved the way for a move to the United States. I wanted to work in the US as an academic, but for a variety of reasons that didn’t work out.
Instead I briefly joined a new proteomics department in a mass spectrometry company, then a research group doing transcriptomics, also in industry, but largely grant-funded. That’s the period when I accumulated most of my citations.
After several years there, I realized that bioinformatics was not offering the opportunities I was hoping for and that I was missing out on great changes that were happening in the computer industry, in particular Hadoop. So after much deliberation I took the plunge and worked first for a web ratings company and then a social network, where I took the role of what is now called a "data scientist", using the statistical skills that I acquired during the first part of my career. After taking a year off to work on my own idea I became a freelancer, with Revolution Analytics as one of my clients, and I became involved in RHadoop.
As you can see there were several turning points. It seems to me one needs to seek a balance of determination and flexibility, both mental and financial, to explore different options, while trying to make the most of each experience. Also, be at least aware of what happens outside your specialty area. Finally, the mandatory statistical warning: any generalizations from a single career are questionable at best.

 

DS-What are the top five things you have learnt for better research productivity and code output in your distinguished career as a computer scientist.
AP-1. Keep your code short. Conciseness in code seems to correlate with a variety of desirable properties, like testability and maintainability. There are several aspects to it and I have a linkblog about this (asceticprogrammer.info). If I had said “simple”, different people would have understood different things, but when you say “short” it’s clear and actionable, albeit not universally accepted.
2. Test your code. Since proving code correct is unfeasible for the vast majority of projects, development is more like an experimental science, where you assemble programs and then corroborate that they have the desired properties via experiments. Testing can have many forms, but no testing is not an option. (A small example of such a check follows this list.)
3. Many seem to think that programming is an abstract activity somewhere in between mathematics and machines. I think a developer's constituency is people, be they the millions using a social network or the handful using a specialized API. So I try to understand how people interact with my work, what they try to achieve, what their background is, and so forth.
4. Programming is a difficult activity, meaning that failure happens even to the best and brightest. Learning to take risk into account and mitigate it is very important.
5. Programs are dynamic artifacts. For each line of code, one may not only ask if it is correct but for how long, as assumptions shift, or how often it will be executed. For a feature, one could wonder how many will use it, and how many additional lines of code will be necessary to maintain it.
6. Bonus statistical suggestion: check the assumptions. Academic statistics has an emphasis on theorems and optimality, bemoaned already by Tukey over sixty years ago. Theorems are great guides for data analysis, but rely on assumptions being met, and, when they are not, consequences can be unpredictable. When you apply the most straightforward, run of the mill test or estimator, you are responsible for checking the assumptions, or otherwise validating the results. “It looked like a normal distribution” won’t cut it when things go wrong.
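(A minimal example of the kind of automated check point 2 above refers to, using the testthat package; the function under test is invented purely for illustration.)

library(testthat)

# Function under test (illustrative).
trimmed_mean <- function(x, trim = 0.1) mean(x, trim = trim, na.rm = TRUE)

test_that("trimmed_mean ignores NAs and resists outliers", {
  expect_equal(trimmed_mean(c(1, 2, 3, NA)), 2)
  expect_lt(trimmed_mean(c(rep(1, 9), 1000)), 2)
})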

 

DS- Describe the RHadoop project, especially the newer plyrmr package. How was the journey to create it?
AP-Hadoop is for data and R is for statistics, to use slogans, so it’s natural to ask the question of how to combine them, and RHadoop is one possible answer.
We selected a few important components of Hadoop and provided an R API. plyrmr is an offshoot of rmr, which is an API to the mapreduce system. While rmr has enjoyed some success, we received feedback that a simplified API would enable even more people to directly access and analyze the data. Again based on feedback, we decided to focus on structured data, equivalent to an R data frame. We tried to reduce the role of user-defined functions as parameters to be fed into the API, and when custom functions are needed they are simpler. Grouping and regrouping the data is fundamental to mapreduce. While in rmr the programmer has to process two data structures, one for the data itself and the other describing the grouping, plyrmr uses a very familiar SQL-like "group" function.
Finally, we added a layer of delayed evaluation that allows certain optimizations to be performed automatically and encourages reuse by reducing the cost of abstraction. We found enough commonalities with the popular package plyr that we decided to use it as a model, hence the tribute in the name. This lowers the cognitive burden for a typical user.
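For context, here is a minimal sketch of the style of the lower-level rmr API that plyrmr simplifies, written from memory of the rmr2 package's commonly documented interface; treat the exact function names and arguments as assumptions and check the RHadoop documentation before relying on them.

library(rmr2)
rmr.options(backend = "local")            # run the MapReduce job locally for testing

small <- to.dfs(keyval(1:10, (1:10)^2))   # put a tiny key-value data set on the DFS

# Group values by whether the key is even or odd, then sum each group.
res <- mapreduce(
  input  = small,
  map    = function(k, v) keyval(k %% 2, v),
  reduce = function(k, vv) keyval(k, sum(vv))
)
from.dfs(res)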

 

DS- Hue is an example of making interfaces easier for users of Hadoop, as are sandboxes and video trainings. How can we make it easier to create better interfaces to software like RHadoop et al.?
AP- It’s always a trade-off between power and ease of use, however I believe that the ability to express analyses in a repeatable and communicable way is fundamental to science and necessary to business and one of the key elements in the success of R. I haven’t seen a point and click GUI that satisfies these requirements yet, albeit it’s not inconceivable. For me, the most fruitful effort is still on languages and APIs. While some people write their own algorithms, the typical data analyst needs a large repertoire of algorithms that can be applied to specific problems. I see a lot of straightforward adaptations of sequential algorithms or parallel algorithms that work at smaller scales, and I think that’s the wrong direction. Extreme data sizes call for algorithms that work within stricter memory, work and communication constraints than before. On the other hand, the abundance of data, at least in some cases, offers the option of using less powerful or efficient statistics. It’s a trade off whose exploration has just started.

 

DS- What do you do to maintain work-life balance and manage your time?
 
AP- I think becoming a freelancer affords me a flexibility that employed work generally lacks. I can put in more or fewer hours depending on competing priorities and can move them around other needs, like being with family in the morning or going for a bike ride while it’s sunny.  I am not sure I manage my time incredibly well, but I try to keep track of where I spend it at least by broad categories, whether I am billing it to a client or not. “If you can not measure it, you can not improve it”, a quote usually attributed to Lord Kelvin.

 
DS- What do you think is the future of R as an enterprise and research software in terms of computing on mobile, desktop, cloud and how do you see things evolve from here

AP- One of the most interesting things that are happening right now is the development of different R interpreters. A successful language needs at least two viable implementations in my opinion. None of the alternatives is ready for prime time at the moment, but work is ongoing. Some implementations are experimental but demonstrate technological advances that can be then incorporated into the other interpreters. The main challenge is transitioning the language and the community to the world of parallel and distributed programming, which is a hardware-imposed priority. RHadoop is meant to help with that, for the largest data sets. Collaboration and publishing on the web is being addressed by many valuable tools and it looks to me the solutions exist already and it’s more a problem of adoption.  For the enterprise, there are companies offering training, consulting, licensing,  centralized deployments,  database APIs, you name it. It would be interesting to see touch interfaces applied to interactive data visualization, but while there is progress on the latter, touch on desktop is limited to a single platform and R doesn’t run on mobile, so I don’t see it as an imminent development.

 

About-
Antonio Piccolboni is an experienced data scientist (see FlowingData Radar on this emerging role) with industrial and academic backgrounds, currently working as an independent consultant on big data analytics. His clients include Revolution Analytics. His other recent work is on social network analysis (hi5) and web analytics (Quantcast). You can contact him via http://piccolboni.info/about.html or his LinkedIn profile.