Interview Jan de Leeuw Founder JSS

Here is an interview with one of the greatest statisticians and educators of his generation, Prof. Jan de Leeuw. In this exclusive and freewheeling interview, Prof. de Leeuw talks about the evolution of technology, education and statistics, and generously shares nuggets of knowledge of interest to present and future statisticians.


DecisionStats (DS)- You have described the UCLA Department of Statistics as your magnum opus. Name a couple of turning points in your career that helped in this creation.

Jan de Leeuw (JDL)- From about 1980 until 1987 I was head of the Department of Data Theory at Leiden University. Our work there produced a large number of dissertations, which we published using our own publishing company. I also became president of the Psychometric Society in 1987. These developments resulted in an invitation from UCLA to apply for the position of director of the interdepartmental program in social statistics, with a joint appointment in Mathematics and Psychology. I accepted the offer in 1987. The program eventually morphed into the new Department of Statistics in 1998.

DS- Describe your work with the Gifi software and nonlinear multivariate analysis.

JDL- I started working on NLMVA and MDS in 1968, while I was a graduate student researcher in the new Department of Data Theory. Joe Kruskal and Doug Carroll invited me to spend a year at Bell Labs in Murray Hill in 1973-1974. At that time I also started working with Forrest Young and his student Yoshio Takane. This led to the sequence of “alternating least squares” papers, mainly in Psychometrika. After I returned to Leiden we set up a group of young researchers, supported by NWO (the Dutch equivalent of the NSF) and by SPSS, to develop a series of Fortran programs for NLMVA and MDS.
By 1980 the group had grown to about 10-15 people, and we gave a successful postgraduate course on the “Gifi methods”, which eventually became the 1990 Gifi book. By the time I left Leiden most people in the group had gone on to do other things, although I continued to work in the area with some graduate students from Leiden and Los Angeles. Then around 2010 I worked with Patrick Mair, a visiting scholar at UCLA, to produce the R packages smacof, anacor, homals, aspect, and isotone. Also see https://www.youtube.com/watch?v=u733Mf7jX24
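For readers who want to try these methods, here is a minimal sketch of fitting a two-dimensional MDS solution with the smacof package mentioned above, using R's built-in eurodist distance matrix as stand-in data (the dataset choice and plotting calls are illustrative, not from the interview):

library(smacof)  # one of the packages mentioned above

# Fit a 2-dimensional smacof (metric MDS) configuration to the
# built-in matrix of road distances between European cities.
fit <- smacofSym(eurodist, ndim = 2)

summary(fit)  # stress value and convergence information
plot(fit)     # configuration plot: cities positioned in 2D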

DS- You have presided over almost five decades of changes in statistics. Can you describe the effect of changes in computing and statistical languages over the years, and some lessons from these changes?

JDL- I started in 1968 with PL/I. Card decks had to be flown to Paris to be compiled and executed on the IBM/360 mainframes. Around the same time APL came up and satisfied my personal development needs, although of course APL code was difficult to communicate. It was even difficult to understand your own code after a week. We had APL symbol balls on the Selectric typewriters and APL symbols on the character terminals. The basic model was there — you develop in an interpreted language (APL) and then for production you use a compiled language (FORTRAN). Over the years APL was replaced by XLISP and then by R. Fortran was largely replaced by C; I never switched to C++ or Java. We discouraged our students from using SAS or SPSS or MATLAB. UCLA Statistics promoted XLISP-STAT for quite a long time, but eventually we had to give it up. See http://www.stat.ucla.edu/~deleeuw/janspubs/2005/articles/deleeuw_A_05.pdf.

(In 1998 the UCLA Department of Statistics, which had been one of the major users of Lisp-Stat, and one of the main producers of Lisp-Stat code, decided to switch to S/R. This paper discusses why this decision was made, and what the pros and cons were.)

Of course the WWW came up in the early nineties and we used a lot of CGI and PHP to write instructional software for browsers.

Generally, there has never been a computational environment like R — so integrated with statistical practice and development, and so enormous, accessible and democratic. I must admit I personally still prefer to use R as originally intended: as a convenient wrapper around, and command line interface for, compiled libraries and code. But it is also great for rapid prototyping, and in that role it has changed the face of statistics.
The fact that you cannot really propose statistical computations without providing R references and in many cases R code has contributed a great deal to reproducibility and open access.
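To make the interpreted-prototype / compiled-production split concrete, here is a small sketch (mine, not Jan's) contrasting a pure-R prototype with the compiled C routine R already wraps:

cumsum_R <- function(x) {
  # Pure-R prototype: quick to write and inspect, but interpreted
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

x <- runif(1e6)
stopifnot(all.equal(cumsum_R(x), cumsum(x)))  # identical results
system.time(cumsum_R(x))  # interpreted loop: noticeably slower
system.time(cumsum(x))    # cumsum() is a compiled C primitive wrapped by R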

DS- Do Big Data and Cloud Computing, in the era of the data deluge, require a new focus on creativity in statistics, or just better industrial application of statistical computing over naive models?
JDL- I am not really active in Big Data and Cloud Computing, mostly because I am more of a developer than a data analyst. That is of course a luxury.
The data deluge has been there for a long time (sensors in the road surface, satellites, weather stations, air pollution monitors, EEGs, MRIs) but until fairly recently there were no tools, in either hardware or software, to attack these data sets. Of course big data sets have changed the face of statistics once again, because in the context of big data the emphasis on optimality and precise models becomes laughable. What I see in the area is a lot of computer science, a lot of fads, a lot of ad-hoc work, and not much of a general rational approach. That may be unavoidable.
DS- What is your biggest failure in professional life?
JDL- I decided in 1963 to major in psychology, mainly because I wanted to discover big truths about myself. About a year later I discovered that psychology and philosophy do not produce big truths, and that my self was not a very interesting object of study anyway. I switched to physics for a while, and minored in math, but by that time I already had a research assistant job, was developing software, and was not interested any more in going to lectures and doing experiments. In a sense I dropped out. It worked out fairly well, but it sometimes gives rise to imposter syndrome.
DS- If you had to do it all over again, what are the things you would really look forward to doing?

JDL- I really don’t know how to answer this. A life cannot be corrected, repeated, or relived.

DS- What motivated you to start the Journal of Statistical Software and push for open access?
JDL- That’s basically in the useR! 2014 presentation. See http://gifi.stat.ucla.edu/janspubs/2014/notes/deleeuw_mullen_U_14.pdf
DS- How can we make departments of Statistics and departments of Computer Science work more closely together on industry-relevant syllabi, especially in data mining, business analytics and statistical computing?
JDL- That’s hard. The cultures are very different — CS is so much more aggressive and self-assured, as well as having more powerful tools and better trained students. We have tried joint appointments but they do not work very well. There are some interdisciplinary programs, but generally CS dominates and provides the basic keywords such as neural nets, machine learning, data mining, cloud computing and so on. One problem is that in many universities statistics is the department that teaches the introductory statistics courses, and not much more. Statistics is forever struggling to define itself, to fight silly battles about foundations, and to try to control the centrifugal forces that do statistics outside statistics departments.
DS- What are some of the work habits that have helped you be more productive in your writing and research?
JDL- Well, if one is not brilliant, one has to be a workaholic. It’s hard on the family. I decided around 1975 that my main calling was to gather and organize groups of researchers with varying skills and interests — and not to publish as much as possible. That helped.
Jan de Leeuw (born December 19, 1945) is a Dutch statistician, and Distinguished Professor of Statistics and Founding Chair of the Department of Statistics, University of California, Los Angeles. He is known as the founding editor and editor-in-chief of the Journal of Statistical Software, as well as editor-in-chief of the Journal of Multivariate Analysis.

Interview Antonio Piccolboni Big Data Analytics RHadoop #rstats

Here is an interview with Antonio Piccolboni, a consultant on big data analytics who has most notably worked on the RHadoop project for Revolution Analytics. Here he tells us about writing better code and the projects he has been involved with.
DecisionStats (DS)- Describe your career journey from being a computer science student to one of the principal creators of RHadoop. What motivated you, and what challenges did you overcome? What were the turning points? (You have 3,500+ citations. What are most of those citations for?)

Antonio (AP)- I completed my undergrad in CS in Italy. I liked research and industry didn’t seem so exciting back then, both because of the lack of a local industry and the Microsoft monopoly, so I entered the PhD program.
After a couple of false starts I focused on bioinformatics. I was very fortunate to get involved in an international collaboration and that paved the way for a move to the United States. I wanted to work in the US as an academic, but for a variety of reasons that didn’t work out.
Instead I briefly joined a new proteomics department in a mass spectrometry company, then a research group doing transcriptomics, also in industry, but largely grant-funded. That’s the period when I accumulated most of my citations.
After several years there, I realized that bioinformatics was not offering the opportunities I was hoping for and that I was missing out on great changes that were happening in the computer industry, in particular Hadoop. So after much deliberation I took the plunge and worked first for a web ratings company and then a social network, where I took on the role of what is now called a “data scientist”, using the statistical skills that I acquired during the first part of my career. After taking a year off to work on my own idea I became a freelancer, Revolution Analytics became one of my clients, and I got involved in RHadoop.
As you can see there were several turning points. It seems to me one needs to seek a balance of determination and flexibility, both mental and financial, to explore different options, while trying to make the most of each experience. Also, be at least aware of what happens outside your specialty area. Finally, the mandatory statistical warning: any generalizations from a single career are questionable at best.

 

DS- What are the top five things you have learnt for better research productivity and code output in your distinguished career as a computer scientist?
AP-1. Keep your code short. Conciseness in code seems to correlate with a variety of desirable properties, like testability and maintainability. There are several aspects to it and I have a linkblog about this (asceticprogrammer.info). If I had said “simple”, different people would have understood different things, but when you say “short” it’s clear and actionable, albeit not universally accepted.
2. Test your code. Since proving code correct is infeasible for the vast majority of projects, development is more like an experimental science, where you assemble programs and then corroborate that they have the desired properties via experiments. Testing can take many forms, but no testing is not an option. (A small testing sketch follows this list.)
3. Many seem to think that programming is an abstract activity somewhere in between mathematics and machines. I think a developer’s constituency is people, be they the millions using a social network or the handful using a specialized API. So I try to understand how people interact with my work, what they try to achieve, what their background is and so forth.
4. Programming is a difficult activity, meaning that failure happens even to the best and brightest. Learning to take risk into account and mitigate it is very important.
5. Programs are dynamic artifacts. For each line of code, one may not only ask if it is correct but for how long, as assumptions shift, or how often it will be executed. For a feature, one could wonder how many will use it, and how many additional lines of code will be necessary to maintain it.
6. Bonus statistical suggestion: check the assumptions. Academic statistics has an emphasis on theorems and optimality, bemoaned already by Tukey over sixty years ago. Theorems are great guides for data analysis, but they rely on assumptions being met, and, when they are not, consequences can be unpredictable. When you apply the most straightforward, run-of-the-mill test or estimator, you are responsible for checking the assumptions, or otherwise validating the results. “It looked like a normal distribution” won’t cut it when things go wrong.
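As a concrete illustration of point 2, here is a minimal testing sketch using the testthat package (a common R testing framework; the function under test is hypothetical):

library(testthat)

# A hypothetical function under test
trimmed_mean <- function(x, trim = 0.1) mean(x, trim = trim, na.rm = TRUE)

test_that("trimmed_mean handles ordinary and edge-case input", {
  expect_equal(trimmed_mean(c(1, 2, 3)), 2)
  expect_equal(trimmed_mean(c(1, 2, 3, NA)), 2)  # NAs are dropped
  expect_true(is.nan(trimmed_mean(numeric(0))))  # empty input yields NaN
})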

 

DS- Describe the RHadoop project, especially the newer plyrmr package. What was the journey to create it?
AP- Hadoop is for data and R is for statistics, to use slogans, so it’s natural to ask how to combine them, and RHadoop is one possible answer.
We selected a few important components of Hadoop and provided an R API. plyrmr is an offshoot of rmr, which is an API to the mapreduce system. While rmr has enjoyed some success, we received feedback that a simplified API would enable even more people to directly access and analyze the data. Again based on feedback, we decided to focus on structured data, equivalent to an R data frame. We tried to reduce the role of user-defined functions as parameters to be fed into the API, and when custom functions are needed they are simpler. Grouping and regrouping the data is fundamental to mapreduce. While in rmr the programmer has to process two data structures, one for the data itself and the other describing the grouping, plyrmr uses a very familiar SQL-like “group” function.
Finally, we added a layer of delayed evaluation that allows certain optimizations to be performed automatically and encourages reuse by reducing the cost of abstraction. We found enough commonalities with the popular package plyr that we decided to use it as a model, hence the tribute in the name. This lowers the cognitive burden for a typical user.
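As a rough illustration of the grouping model described above, here is a group-and-aggregate sketch in the rmr idiom (argument names follow my recollection of the rmr2 API and its local backend; check the package documentation before relying on the details):

library(rmr2)
rmr.options(backend = "local")  # run without a Hadoop cluster, for testing

# Put some keyed data into the (local) distributed file system:
# 100 values spread across 5 groups.
data <- to.dfs(keyval(sample(1:5, 100, replace = TRUE), runif(100)))

# Map emits (group, value) pairs; mapreduce regroups by key and the
# reduce step aggregates each group, as described above.
group.means <- mapreduce(
  input  = data,
  map    = function(k, v) keyval(k, v),
  reduce = function(k, vv) keyval(k, mean(vv))
)

from.dfs(group.means)  # read the per-group means back into R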

 

DS- Hue is an example of making it easier for users to work with Hadoop, as are sandboxes and video trainings. How can we create better interfaces to software like RHadoop et al?
AP- It’s always a trade-off between power and ease of use. However, I believe that the ability to express analyses in a repeatable and communicable way is fundamental to science and necessary to business, and it is one of the key elements in the success of R. I haven’t seen a point-and-click GUI that satisfies these requirements yet, albeit it’s not inconceivable. For me, the most fruitful effort is still on languages and APIs. While some people write their own algorithms, the typical data analyst needs a large repertoire of algorithms that can be applied to specific problems. I see a lot of straightforward adaptations of sequential algorithms, or of parallel algorithms that work at smaller scales, and I think that’s the wrong direction. Extreme data sizes call for algorithms that work within stricter memory, work and communication constraints than before. On the other hand, the abundance of data, at least in some cases, offers the option of using less powerful or efficient statistics. It’s a trade-off whose exploration has just started.

 

DS- What do you do to maintain work-life balance and manage your time?
 
AP- I think becoming a freelancer affords me a flexibility that employed work generally lacks. I can put in more or fewer hours depending on competing priorities and can move them around other needs, like being with family in the morning or going for a bike ride while it’s sunny.  I am not sure I manage my time incredibly well, but I try to keep track of where I spend it at least by broad categories, whether I am billing it to a client or not. “If you can not measure it, you can not improve it”, a quote usually attributed to Lord Kelvin.

 
DS- What do you think is the future of R as enterprise and research software in terms of computing on mobile, desktop and cloud, and how do you see things evolving from here?

AP- One of the most interesting things happening right now is the development of different R interpreters. A successful language needs at least two viable implementations, in my opinion. None of the alternatives is ready for prime time at the moment, but work is ongoing. Some implementations are experimental but demonstrate technological advances that can then be incorporated into the other interpreters. The main challenge is transitioning the language and the community to the world of parallel and distributed programming, which is a hardware-imposed priority. RHadoop is meant to help with that, for the largest data sets. Collaboration and publishing on the web are being addressed by many valuable tools, and it looks to me like the solutions already exist and it’s more a problem of adoption. For the enterprise, there are companies offering training, consulting, licensing, centralized deployments, database APIs, you name it. It would be interesting to see touch interfaces applied to interactive data visualization, but while there is progress on the latter, touch on the desktop is limited to a single platform and R doesn’t run on mobile, so I don’t see it as an imminent development.

 

About
Antonio Piccolboni is an experienced data scientist (see FlowingData Radar on this emerging role) with industrial and academic backgrounds, currently working as an independent consultant on big data analytics. His clients include Revolution Analytics. His other recent work is on social network analysis (hi5) and web analytics (Quantcast). You can contact him via http://piccolboni.info/about.html or his LinkedIn profile.

Interview Vivian Zhang co-founder SupStat

Here is an interview with Vivian Zhang, CTO and co-founder of SupStat, an interesting startup in the R ecosystem. In this interview Vivian talks of the startup journey, helping spread R in China and New York City, and managing meetups, conferences and a training business with balance and excellence.


DecisionStats (DS)- Describe the story behind the creation of SupStat Inc and the journey so far, along with milestones and turning points. What is your vision for SupStat, what do you want it to accomplish, and how?

Vivian Zhang (VZ)-

Creation:

SupStat was born in 2012 out of the collaboration of 60+ individuals (statisticians, computer engineers, mathematicians, professors, graduate students and talented data geniuses) who met through a well-known non-profit organization in China, Capital of Statistics. The SupStat team met through various collaborations on R packages and analytic work. In 2013, SupStat became involved in the New York City data science community through hosting the NYC Open Data Meetup, and soon began offering formal courses through the NYC Data Science Academy. SupStat offers consulting services in the areas of R development, data visualization, and big data solutions. We are experienced with many technologies and languages including R, Python, Hadoop, Spark, Node.js, etc. Courses offered include Data Science with R (Beginner, Intermediate), Data Science with Python (Beginner, Intermediate), and Hadoop (Beginner, Intermediate), as well as many targeted courses on R packages and data visualization tools.

Allen and I, the two co-founders, have been passionate about data mining since a young age (we talked about it back in 1997). With industry experience as Chief Data Scientist/Senior Analyst and a spirit of entrepreneurship, we started the firm by gathering all the top R/Hadoop/D3.js programmers we knew.

Milestones of SupStat:

June 2012, Established in Beijing

July 2012, Offered an intensive R bootcamp in Beijing to 50+ college students

June 2013, Established in NYC

Nov 2013, Launched our NYC training brand: NYC Data Science Academy

Jan 2014, Became a premium partner of Revolution Analytics in China

Feb 2014, Became a training and reseller partner of RStudio in the US and China

April 2014, Became the exclusive reseller partner of Transwarp in the US; started to offer R built-in and professional services for Hadoop/Spark

May 2014, Organized and sponsored the R conference in Beijing; the NYC Open Data Meetup reached 1800+ members in one year

Jun 2014, Sponsored the UCLA R conference (Vivian was a panelist for the talk on women R programmers)

The major turning point was in November, 2013, when we decided to start our NYC office and launched the NYC Data Science Academy.

Our Mission:

We are committed to helping our clients make distinctive, lasting and substantial improvements in their performance, sales, and client and employee satisfaction by fully utilizing data. We are a value-driven firm. For us this means:

  • Solving the hardest problems

  • Utilizing state-of-the-art data science to help our clients succeed

  • Applying a structured problem-solving approach where all options are considered, researched, and analyzed carefully before recommendations are made

Our Vision: Be a firm that attracts exceptional people to meet the growing demand for data analysis and visualization.

Future goals:

With engaged clients, we want to share the excitement, unlimited potential and methodologies of using data to create business value. We want to be the go-to firm when people think of getting data analytic training, consulting, and big data products.

With top data scientists, we want to be the home for those who want different data challenges all the time. We promote their open data/demo work in the community and  expand the impact of the analytic tools and methodologies they developed. We connect the best ones to build the strongest team.

With new college students and young professionals, we want to help them succeed and be able to handle real-world problems right away through our short-term, intensive training programs and internship programs. Through our rich experience, we have tailored our training program to solve some of the critical problems people face in their workplace.

Through our partnerships we want to spread the best technologies between the US and China. We want to close the gap and bring solutions and offerings to the clients we serve. We are on the frontline, picking the best products for our clients.

We are glad we have the opportunity to do what we love and are good at, and will continue to enjoy doing it with a growing and energetic team.

DS -What is the state of open source statistical software in China? How have you contributed to R in China and how do you see the future of open source evolve there?

VZ- People love R and embrace R. In May 2014, we helped to organize and sponsor the R conference in Beijing, with 1400 attendees. See our blog post for more details: http://www.r-bloggers.com/the-7th-china-r-conference-in-beijing/

We have helped organize two R conferences in China in the past year, spring in Beijing and winter in Shanghai. And we will do a summer R conference in Guangzhou this year. That’s three R conferences in one year!

DS- Describe some of your work with your partners in helping sell and support R in China and the USA.

VZ- Revolution Analytics and RStudio are very respected in the R community. We are glad to work with and learn from them through collaboration.

With Revolution, we provide proof-of-concept and professional services, including analytics and visualization. We also sell Revolution products and licenses in China. With RStudio, we sell RStudio Server Pro and Shiny and promote training programs around those products in NYC. We plan to sell these products in China starting this summer. With Transwarp, we offer the best R analytic and parallel computing experience through the Hadoop/Spark ecosystem.

DS- You have done many free workshops in multiple cities. What has been the response so far?

VZ- Let us first talk about what happened in NYC.

I went to a few meetups before I started my own meetup group. Most of the presentations/talks were awesome, but they were not constructed and delivered in a way that attendees could learn and apply the technology right away. Most of the time, those events didn’t offer source code or technical details in the slides.

When I started my own group, my goal was “whatever cool stuff we show you, we will teach you how to do it.” The majority of the events were designed as hands-on workshops, while we hosted a few high-profile speakers’ talks from time to time (including the chief data scientist for the Obama campaign).

My workshops cover a wide range of topics, including R, Python, Hadoop, D3.js, data processing, Tableau, location data queries, open data, etc. People are super excited and keep saying “oh wow oh wow”, “never thought that I could do it”, “it is pretty cool.” Soon our attendees started giving back to the group by teaching their skills and fun projects, offering free conference rooms, and sponsoring pizzas.

We are glad we have built a community of shared experience and passion for data science. And I think this is a very unique thing we can do in NYC (due to the fact that everything is within about a half-hour subway ride). We host events 2-3 times per week and have attracted 1900 members in one year.

In other cities such as Shanghai and Beijing, we do free workshops for college students and scholars every month. We promise to travel to any college within 24 hours' distance by train from Beijing. Through partnerships with Capital of Statistics and DataUnion, we have hosted entrepreneur sharing events with devoted speeches and lightning talks.

In NYC, we typically see 15 to 150 people per event. U.S. sponsors have included McKinsey & Company, Thoughtworks, and others. Our Beijing monthly tech event sees over 500 attendees and attracts event co-hosts including Baiyu, Youku and others.

DS- What are some interesting projects of SupStat that you would like to showcase?

VZ- Let me start with one interesting open data project on Citibike data done by our team. The blog post, slides and meetup videos can be found at http://nycdatascience.com/meetup/nyc-open-data-project-ii-my-citibike/

Citibike provides a public bike service with many bike stations in NYC. People want to take a bike from a station with at least one available bike, and when they get to their destination, they want to return their bike to a station with at least one available slot. Our goal was to predict where to rent and where to return Citibikes. We showed the complete workflow, including data scraping, cleaning, manipulation, processing, modeling, and turning algorithms into a product.

We built a program to scrape data and save it to our database automatically. Using this data we applied models from time series theory and machine learning to predict bike numbers at all the stations. Based on the models, we built a website for the Citibike system. This application helps Citibike users arrange their trips better. We also showed a few tricks, such as how to set up a cron job on Linux, Windows and Mac machines, and how to get around RAM limitations on servers with PostgreSQL.
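Here is a hedged sketch of the scrape-and-store step in that workflow. The feed URL and field names reflect the Citibike station feed of that era and may have changed; the team used PostgreSQL, but RSQLite stands in below so the example is self-contained:

library(jsonlite)
library(DBI)
library(RSQLite)

# Fetch the current station status (URL and field names are illustrative).
feed <- fromJSON("https://www.citibikenyc.com/stations/json")
stations <- feed$stationBeanList[, c("id", "stationName",
                                     "availableBikes", "availableDocks")]
stations$scraped_at <- as.character(Sys.time())

# Append this snapshot to a local table; swap in RPostgreSQL for the
# PostgreSQL setup mentioned above.
con <- dbConnect(SQLite(), "citibike.sqlite")
dbWriteTable(con, "stations", stations, append = TRUE)
dbDisconnect(con)

# Scheduled from cron, e.g. a snapshot every 5 minutes:
#   */5 * * * * Rscript scrape_citibike.R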

We’ve done other projects in China using R to solve problems ranging from telecommunications data caching to cardiovascular disease prevention. Each of these projects has required a unique combination of statistical knowledge and data science tools, with R being the backbone of the solution. The commercial cases can be found at our website: http://www.supstat.com/consulting/

About-

SupStat is a statistical consulting company specializing in statistical computing and graphics using state-of-the-art technologies.

VIVIAN S. ZHANG Co-founder & CTO, NYC, Beijing and Shanghai Office

Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She gained expertise in data analysis and data management with various statistical analytical tools and programming languages as a Senior Analyst and Biostatistician at Memorial Sloan-Kettering Cancer Center and as a Scientific Programmer at Brown University. She is the co-founder of SupStat, the NYC Data Science Academy and the NYC Open Data meetup. She likes to portray herself as a programmer, data-holic and visualization evangelist.

You can read more about SupStat at http://www.supstat.com/team/

Interview Torch Browser

Here is an interview with the team behind the Torch browser, which embeds BitTorrent within the open-source Chromium browser and is compatible with all Chrome downloads. With a million new users in 2014, Torch is lighting up the Internet at http://www.torchbrowser.com/

Ajay- Why did you create Torch, and how did you create it? Your startup story:
Torch- The Torch browser was a project born out of the demand for easy one-click media and file sharing. Rather than bogging down one’s browser with multiple file sharing, torrenting and media sharing programs or extensions, we wanted to streamline these concepts into a one-click, user-friendly experience, and eventually to integrate this simplicity into a one-step solution that gives the user instant access to their desired media or file share. For this reason we have advanced Torch to include features such as no-wait torrenting (once a torrent starts to download, it can already be played), drag-and-drop for simple and quick searching, instant audio extraction, a file download accelerator, Torch Music, and much more.

What has been user feedback and stats:
The user feedback has been extremely positive. You can see this by following our Facebook page at http://facebook.com/torchbrowser, where we proudly boast over 1MM likes.
We constantly take polls requesting user feedback and feature requests. Even though one cannot please everyone, we do our best to develop Torch according to the majority’s requests. If enough people want it, we will build it, because ultimately it’s the user’s browser, not ours.

How do you intend to monetize:
This is a valid and common question. Currently our focus is on filling the needs of our users, developing for multiple platforms, adding features and improving overall system performance. At the same time, we are discussing possible future premium services that will generate revenues.

Any plans to embed Tor in Torch, to enable a relay server through a single click in the browser? And what about Chrome plugins like MafiaaFire:
All Chromium plugins available at the Chrome web store are compatible with Torch. The issue of bypassing unwarranted blocks is under serious consideration by our team. We are all for complete and public access to all things internet. If it’s on the net, and it’s intended to be shared, then it should be accessible to all. Censorship and unjustified blocks are not things that we feel comfortable with, and we don’t believe that our users need to tolerate them.

What legal concerns if any did you encounter or plan for during the product design phase:
There are no legal concerns. Like other browsers and similar applications, Torch is a general purpose tool for browsing, sharing and downloading internet content.

Do you collect user data:
Torch does not collect any personal data. The only data collected by Torch is non-personally identifiable and used for technical and functional purposes only. You can read more concerning this in our privacy policy at: http://torchbrowser.com/privacy

Your views on NSA spying internationally:
This question is a bit vague and beyond the scope of our development, given that system security should be managed via personal OS firewall settings. However, unwarranted intrusions into one’s personal domain are not something that makes anyone feel comfortable, nor should they be tolerated. With that said, we are evaluating the benefits of including added browsing protection in a future build.

Again, thank you for the opportunity, and we hope that this has been helpful.

The Torch Team.

Writing on APIs for Programmable Web

I have been writing freelance on APIs for Programmable Web. Here is an updated list of the articles; many of these would be of interest to analytics users. Note: some of these are interviews, and they are in bold. Note to regular readers: I keep updating this list, and at each update bring it to the front page, then allow the blog postings to slide it down!

Scoreoid Aims to Gamify the World Using APIs January 27th, 2014

Plot.ly’s Plot to Visualize More Data January 22nd, 2014

LumenData’s Acquisition of Algorithms.io is a Win-Win January 8th, 2014

Yactraq API Sees Huge Growth in 2013  January 6th, 2014

Scrape.it Describes a Better Way to Extract Data December 20th, 2013

Exclusive Interview: App Store Analytics API December 4th, 2013

APIs Enter 3d Printing Industry November 29th, 2013

PW Interview: José Luis Martinez of Textalytics November 6th, 2013

PW Interview Simon Chan PredictionIO November 5th, 2013

PW Interview: Scott Gimpel Founder and CEO FantasyData.com October 23rd, 2013

PW Interview Brandon Levy, cofounder and CEO of Stitch Labs October 8th, 2013

PW Interview: Jolo Balbin Co-Founder Text Teaser  September 18th, 2013

PW Interview:Bob Bickel CoFounder Redline13 July 29th, 2013

PW Interview : Brandon Wirtz CTO Stremor.com   July 4th, 2013

PW Interview: Andy Bartley, CEO Algorithms.io  June 4th, 2013

PW Interview: Francisco J Martin, CEO BigML.com 2013/05/30

PW Interview: Tal Rotbart Founder- CTO, SpringSense 2013/05/28

PW Interview: Jeh Daruwala CEO Yactraq API, Behavorial Targeting for videos 2013/05/13

PW Interview: Michael Schonfeld of Dwolla API on Innovation Meeting the Payment Web  2013/05/02

PW Interview: Stephen Balaban of Lamda Labs on the Face Recognition API  2013/04/29

PW Interview: Amber Feng, Stripe API, The Payment Web 2013/04/24

PW Interview: Greg Lamp and Austin Ogilvie of Yhat on Shipping Predictive Models via API   2013/04/22

Google Mirror API documentation is open for developers   2013/04/18

PW Interview: Ricky Robinett, Ordr.in API, Ordering Food meets API    2013/04/16

PW Interview: Jacob Perkins, Text Processing API, NLP meets API   2013/04/10

Amazon EC2 On Demand Windows Instances -Prices reduced by 20%  2013/04/08

Amazon S3 API Requests prices slashed by half  2013/04/02

PW Interview: Stuart Battersby, Chatterbox API, Machine Learning meets Social 2013/04/02

PW Interview: Karthik Ram, rOpenSci, Wrapping all science API 2013/03/20

Viralheat Human Intent API- To buy or not to buy 2013/03/13

Interview Tammer Kamel CEO and Founder Quandl 2013/03/07

YHatHQ API: Calling Hosted Statistical Models 2013/03/04

Quandl API: A Wikipedia for Numerical Data 2013/02/25

Amazon Redshift API is out of limited preview and available! 2013/02/18

Windows Azure Media Services REST API 2013/02/14

Data Science Toolkit Wraps Many Data Services in One API 2013/02/11

Diving into Codeacademy’s API Lessons 2013/01/31

Google APIs finetuning Cloud Storage JSON API 2013/01/29

2012
Ergast API Puts Car Racing Fans in the Driver’s Seat 2012/12/05
Springer APIs- Fostering Innovation via API Contests 2012/11/20
Statistically programming the web – Shiny,HttR and RevoDeploy API 2012/11/19
Google Cloud SQL API- Bigger ,Faster and now Free 2012/11/12
A Look at the Web’s Most Popular API -Google Maps API 2012/10/09
Cloud Storage APIs for the next generation Enterprise 2012/09/26
Last.fm API: Sultan of Musical APIs 2012/09/12
Socrata Data API: Keeping Government Open 2012/08/29
BigML API Gets Bigger 2012/08/22
Bing APIs: the Empire Strikes Back 2012/08/15
Google Cloud SQL: Relational Database on the Cloud 2012/08/13
Google BigQuery API Makes Big Data Analytics Easy 2012/08/05
Your Store in The Cloud -Google Cloud Storage API 2012/08/01
Predict the future with Google Prediction API 2012/07/30
The Romney vs Obama API 2012/07/27

Interview: Linkurious aims to simplify graph databases

Here is an interview with a really interesting startup, Linkurious, and its co-founders Sebastien Heymann (also co-founder of Gephi) and Jean Villedieu. They are hoping to make graph databases easier to use and thus spur on their usage.

DecisionStats (DS)- How did you come up with the idea of setting up your startup?

Linkurious (L)- A lot of businesses are struggling to understand the connections within their data. Who are the persons connected to this financial transaction? What happens to the telecommunication network if this antenna fails? Who is the most influential person in this community? There are a lot of questions that involve a deep understanding of graphs. Most business intelligence and data visualization tools are not adapted to these questions, because they have a hard time handling queries about connections and because their interfaces are not suited to network visualization.
I noticed this because I co-founded a graph visualization software called Gephi a few years ago. It quickly became a reference and the software was downloaded 250k times last year. It really helped people understand the connections in their data in a new way.
In 2013, this success inspired me to found Linkurious. The idea is to provide a solution that’s easy to use to democratize graph visualization.

What does it mean?
We want to help people understand the connection in their data. Linkurious is really easy to use and optimized for the exploration of graphs.
You can install it in minutes. Then it gives you a search interface through which you can query the data. What’s special about our software is that the result of your search is represented as a graph that you can explore dynamically. Unlike Gephi or other graph visualization tools, Linkurious only shows you a limited subset of your data and not the whole graph. The goal here is to focus on what the user is looking for and help them find an answer faster.
In order to do that, Linkurious also comes with the ability to filter nodes or color them according to their properties. This way, it’s much faster to understand the data.

DS- How do you support packages from Python and R, and other languages like Julia? What is Linkurious based on?

L- Linkurious is largely based on a stack of open-source technologies. We rely on Neo4j, the leading graph database, to store and access the data. Neo4j can handle really large datasets; this means that our users can access the information much faster than with a traditional SQL database. Neo4j also comes with a query language that allows “smart search”, locating nodes and relationships based on rules like “what’s the shortest path between these 2 nodes?” or “who among the close network of this person has been to London and loves sushi?”. That’s the kind of thing that Facebook delivers via Graph Search, and it’s exciting to see these technologies applied in the business world.
We also use Nodejs, Sigmajs and ElasticSearch.
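For the curious, here is a hedged sketch of what the kind of “smart search” described above looks like from R, sending a Cypher shortest-path query to Neo4j's transactional HTTP endpoint (Neo4j 2.x-era API; the URL, node labels and property names are illustrative):

library(httr)

query <- "
  MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
        p = shortestPath((a)-[*..10]-(b))
  RETURN p
"

# POST the Cypher statement to the transactional endpoint of a local
# Neo4j server and read back the matched path, if any.
res <- POST(
  "http://localhost:7474/db/data/transaction/commit",
  body   = list(statements = list(list(statement = query))),
  encode = "json"
)
content(res)  # parsed JSON with the result rows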

DS- Name a few case studies where enterprises have used graph analysis to great benefit.

L- There really are a lot of use cases for graph visualization, and we are learning about them almost every day. There are well-known applications connected to security. For example, graph databases are great for identifying suspicious patterns across a variety of data sources. People using false identities to defraud banks tend to share addresses, phone numbers or names. Without graphs, it’s hard to see how they are connected, and they tend to remain undetected until it’s too late. Graph visualization can be triggered by alert systems. Then analysts can investigate the data and decide whether the alert should be escalated or not.
In the telecom industry, you can use graphs to map your network, identify weak links, and assess the potential impact of a failure (i.e. impact analysis). Graph visualization helps people understand this information and better manage the network.

We also have clients in the logistics, health and consulting industries. Every data-oriented industry needs data visualization tools, and graphs offer powerful ways to ask new questions and reveal unforeseen information.

DS- What are some of the challenges with creating, sustaining and maintaining a cutting-edge technology startup in Europe and France?

L- There are a lot of challenges with creating and sustaining a startup. I think the bigger ones are not necessarily location-related. The main issue is to build something people want. It’s certainly been our biggest challenge. We’ve used a lean startup approach to ship a prototype of our product as fast as we could. The first version of Linkurious was buggy and didn’t generate much interest from customers. But we did get feedback from a few people who really liked it. Since then, we’ve been focusing on them to develop our vision of Linkurious. We are pleased with the results; I think we are on the right path, but it’s really a journey.
As for the more location-related challenges, I think France usually gets a bad rep for not being startup-friendly. Our experience has been quite the contrary. There are administrative annoyances, but we also benefit from generous benefits, access to great engineers and a burgeoning startup ecosystem!

About

The mission of Linkurious is to help users access and navigate graph databases in a simple manner so they can make sense of their data.

Some of their interesting solutions are here.

Interview Anne Milley JMP

An interview with noted analytics thought leader Anne Milley from JMP. Anne talks about statistics, cloud computing, the culture of JMP, globalization and analytics in general.

DecisionStats (DS)- How was 2013 as a year for statistics in general, and for JMP in particular?

Anne Milley (AM)- I’d say the first-ever International Year of Statistics (Statistics2013) was a great success! We hope to carry some of that momentum into 2014. We are fans of the UK’s 10-year GetStats campaign—they are in the third year, and it seems to be going really well. JMP had a very good year as well, with worldwide double-digit growth again. We are pleased to have launched version 11 of JMP and JMP Pro last year at our annual Discovery Summit user conference.

DS-  Any cloud computing plans for JMP?

AM- We are exploring options, but with memory and storage still so incredibly cheap on the desktop, the responsiveness of local, in-memory computing on Macs or Windows operating systems remains compelling. John Sall said it best in a blog post he wrote in December.  It is our intention to have a public cloud offering in 2014.

DS- Describe the company culture and environment in the JMP division. Any global plans?

AM- John Sall’s passion to bring interactive, intuitive data visualization and analysis on the desktop continues. There is a strong commitment in the JMP division to speeding the statistical discovery process and making it fun. It’s a powerfully motivating factor to work in an environment where that passion and purpose are shared, and where we get to interact with customers who are also passionate users of JMP, many of whom use JMP and SAS together.

While a majority of JMP personnel are in Cary, North Carolina, almost half the staff are contributing from other states and countries. JMP is sold everywhere we have SAS offices (in 59 countries). JMP has localized versions in seven languages, and we keep getting requests for more.

DS- You have been a SAS Institute veteran for 15 years now. What are some of the ups and downs you remember as milestones in the field of analytics?

AM- The most exciting milestone is that analytics has been getting more attention in the last few years, thanks to a combination of factors. Analytics is a very inclusive term (statistics, optimization, data mining, machine learning, data science, etc.), but statistics is the main discipline we draw on when we are trying to make informed decisions in the face of uncertainty. In the early days of data mining, there was a tension between statisticians and data miners/machine learners, but we now have a richer set of methods (with more solid theoretic underpinnings) with which to analyze data and make better decisions. We have better ways to automate parts of the model-building process as well, which is important with ever-wider data. In the early days of data mining, I remember many reacting with “Why spend so much time dredging through opportunistically collected data, when statistics has so much more to offer, like design of experiments?” There is still some merit to that, and maybe we will see the pendulum swing back to doing more with information-rich data.

DS- What are your top three forecasts for analytics technology in 2014?

AM- My perspective may be different than others on what’s trending in analytics technology, but as we try to do more with more data, here are my top three picks:

  • We will continue to innovate new ways to visualize data and statistical output to capitalize on our high visual bandwidth. (Examples of some of our recent innovations can be found on the JMP Blog.)

  • We will continue to see innovative ways to create more analytic bandwidth and democratize analytics—for example, more quickly build and deploy analytic applications and interactive visualizations for others to use.

  • We will see more integration with commonly used analytical tools and infrastructure to help analysts be more productive.

DS-  How do you maintain work-life balance?

AM- I enjoy what I do and the great people I work with; that is part of what motivates me each day and is added to the long list of things for which I’m grateful. Outside of work, I enjoy spending time with family, regular exercise, organic gardening and other creative pursuits.

DS- As a senior technology management person working for the past 15 years, do you think technology is a better employer for women than it was in the 1990s? What steps can be taken to improve this?

AM- I certainly see more support for women in technology with various women-in-technology organizations and programs around the world. And I also see more encouragement for girls and young women to get more exposure to science, technology, engineering, math, and statistics and consider the career options knowledge of these areas could bring. But there is more to do. I would like to add statistics to the STEM list explicitly since many still consider statistics a branch of math and don’t appreciate that statistics is the science/language of science. (Florence Nightingale said that statistics is “the most important science in the whole world.”) This year, we will see the first Women in Statistics Conference “enticing, elevating, and empowering careers of women in statistics.” There are several organizations and programs out there advocating for women in science, engineering, statistics and math, which is great. The resources such organizations provide for networking, mentoring, career development and making role models more visible are important in raising awareness on what the impediments are and how to overcome them. We should all read Sheryl Sandberg’s re-release of Lean In for Graduates (due out in April). Thank you for asking this question!

About

Anne Milley, Sr. Director, Analytic Strategy, JMP

Anne oversees product management and analytic strategy in JMP Product Marketing. She is a contributing faculty member for the International Institute of Analytics.