
Category Archives: Analytics

Using Windows Azure Machine Learning as a service with R #rstats

A Brief Tutorial I wrote by playing with the software at manage.windowsazure.com

Interview: Vivian Zhang, co-founder of SupStat

Here is an interview with Vivian Zhang, CTO and co-founder of SupStat, an interesting startup in the R ecosystem. In this interview Vivian talks about the startup journey, helping spread R in China and New York City, and managing Meetups, conferences and a training business with balance and excellence.


DecisionStats (DS)- Describe the story behind the creation of SupStat Inc and the journey so far, along with milestones and turning points. What is your vision for SupStat, what do you want it to accomplish, and how?

Vivian Zhang(VZ) -

Creation:

SupStat was born in 2012 out of the collaboration of 60+ individuals (statisticians, computer engineers, mathematicians, professors, graduate students and talented data geniuses) who met through a well-known non-profit organization in China, Capital of Statistics. The SupStat team met through various collaborations on R packages and analytic work. In 2013, SupStat became involved in the New York City data science community through hosting the NYC Open Data Meetup, and soon began offering formal courses through the NYC Data Science Academy. SupStat offers consulting services in the areas of R development, data visualization, and big data solutions. We are experienced with many technologies and languages including R, Python, Hadoop, Spark, Node.js, etc. Courses offered include Data Science with R (Beginner, Intermediate), Data Science with Python (Beginner, Intermediate), and Hadoop (Beginner, Intermediate), as well as many targeted courses on R packages and data visualization tools.

Allen and I, the two co-founders, have been passionate about Data Mining since a young age (we talked about it back in 1997). With industry experience as Chief Data scientist/Senior Analyst and a spirit of entrepreneurship, we started the firm by gathering all the top R/Hadoop/D3.js programmers we knew.

Milestones of SupStat:

June 2012, Established in Beijing

July 2012, Offered R intensive bootcamp in Beijing to 50+ college students

June 2013, Established in NYC

Nov 2013, Launched our NYC training brand: NYC Data Science Academy

Jan 2014, Became premium partner of Revolution Analytics in China

Feb 2014, Became training and reseller partner of RStudio in US and China

April 2014, Became exclusive reseller partner of Transwarp in US; started to offer R built-in and professional services for Hadoop/Spark

May 2014, Organized and sponsored R conference in Beijing; NYC Open Data Meetup had 1800+ members in one year

June 2014, Sponsored UCLA R conference (Vivian was a panelist for the female R programmer talk)

The major turning point was in November, 2013, when we decided to start our NYC office and launched the NYC Data Science Academy.

Our Mission:

We are committed to helping our clients make distinctive, lasting and substantial improvement in their performance, sales, clients and employee satisfaction by fully utilizing data. We are a value-driven firm. For us this means:

  • Solving the hardest problems

  • Utilizing state-of-the-art data science to help our clients succeed

  • Applying a structured problem-solving approach where all options are considered, researched, and analyzed carefully before recommendations are made

Our Vision: Be a firm that attracts exceptional people to meet the growing demand for data analysis and visualization.

Future goals:

With engaged clients, we want to share the excitement, unlimited potential and methodologies of using data to create business value. We want to be the go-to firm when people think of getting data analytic training, consulting, and big data products.

With top data scientists, we want to be the home for those who want different data challenges all the time. We promote their open data/demo work in the community and  expand the impact of the analytic tools and methodologies they developed. We connect the best ones to build the strongest team.

With new college students and young professionals, we want to help them succeed and be able to handle real world problems right away through our short-term, intensive training programs and internship programs. Through our rich experience, we have tailored our training program to solve some of the critical problems people face in their workplace.

Through our partnerships we want to spread the best technologies between the US and China. We want to close the gap and bring solutions and offerings to clients we serve. We are at the frontline to pick what is the best product for our clients.

We are glad we have the opportunity to do what we love and are good at, and will continue to enjoy doing it with a growing and energetic team.

DS- What is the state of open source statistical software in China? How have you contributed to R in China and how do you see the future of open source evolving there?

VZ- People love R and embrace R. In May 2014, we helped to organize and sponsor the R conference in Beijing, with 1,400 attendees. See our blog post for more details: http://www.r-bloggers.com/the-7th-china-r-conference-in-beijing/

We have helped organize two R conferences in China in the past year, Spring in Beijing and Winter in Shanghai. And we will do a Summer R conference in Guangzhou this year. That’s three R conferences in one year!

DS- Describe some of your work with your partners in helping sell and support R in China and the USA.

VZ- Revolution Analytics and RStudio are very respected in the R community. We are glad to work and learn from them through collaboration.

With Revolution, we provide proof-of-concept and professional services, including analytics and visualization. We also sell Revolution products and licenses in China. With RStudio, we sell RStudio Server Pro and Shiny and promote training programs around those products in NYC. We plan to sell these products in China starting this summer. With Transwarp, we offer the best R analytic and parallelization experience through the Hadoop/Spark ecosystem.

DS- You have done many free workshops in multiple cities. What has been the response so far?

VZ- Let us first talk about what happened in NYC.

I went to a few meetups before I started my own meetup group. Most of the presentations and talks were awesome, but they were not constructed and delivered in a way that let attendees learn and apply the technology right away. Most of the time, those events didn’t offer source code or technical details in the slides.

When I started my own group, my goal was “whatever cool stuff we showed you, we will teach you how to do it.” The majority of the events were designed as hands-on workshops, while we hosted a few high-profile speakers’ talks from time to time (including the chief data scientist for the Obama Campaign).

My workshops cover a wide range of topics, including R, Python, Hadoop, D3.js, data processing, Tableau, location data query, open data, etc. People are super excited and keep saying “oh wow oh wow”, “never thought that I could do it”, “it is pretty cool.” Soon our attendees started giving back to the group by teaching their skills and fun projects, offering free conference rooms, and sponsoring pizzas.

We are glad we have built a community that shares experience and a passion for data science. And I think this is a very unique thing we can do in NYC (because everything is within about a half-hour subway ride). We host events 2-3 times per week and have attracted 1900 members in one year.

In other cities such as Shanghai and Beijing, we do free workshops for college students and scholars every month. We promise to go to any college within 24 hours of Beijing by train. Through partnerships with Capital of Statistics and DataUnion, we hosted entrepreneur sharing events with devoted speeches and lightning talks.

In NYC, we typically see 15 to 150 people per event. U.S. sponsors have included McKinsey & Company, Thoughtworks, and others. Our Beijing monthly tech event sees over 500 attendees and has attracted event co-hosts including Baiyu, Youku and others.

DS- What are some interesting projects of SupStat that you would like to showcase?

VZ- Let me start with one interesting open data project on Citibike data done by our team. The blog post, slides and meetup videos can be found at http://nycdatascience.com/meetup/nyc-open-data-project-ii-my-citibike/

Citibike provides a public bike service. There are many bike stations in NYC. People want to take a bike from a station with at least one available bike. And when they get to the destination, they want to return their bike to a station with at least one available slot. Our goal was to predict where to rent and where to return Citibikes. We showed the complete workflow including data scraping, cleaning, manipulation, processing, modeling, and making algorithms into a product.

We built a program to scrape data and save it to our database automatically. Using this data, we applied models from time series theory and machine learning to predict the number of bikes at every station. Based on the models, we built a website for the Citibike system. This application helps Citibike users plan their trips better. We also showed a few tricks, such as how to set up cron jobs on Linux, Windows and Mac machines, and how to get around RAM limitations on servers with PostgreSQL.
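To make the prediction step a little more concrete, here is a minimal sketch in base R. It is not SupStat's model: the station history below is simulated, and the "model" is only an hour-of-day average, which is about the simplest seasonal baseline one could try before reaching for proper time series or machine learning methods. The names `history`, `hourly_mean` and `predict_bikes` are made up for this illustration.

```r
# Hypothetical history for one station: 7 days of hourly bike counts.
# Real data would come from the scraped database instead.
hours <- rep(0:23, times = 7)
bikes <- round(15 + 10 * cos(2 * pi * hours / 24))
history <- data.frame(hour = hours, bikes = bikes)

# Seasonal baseline: average number of available bikes for each hour of day.
hourly_mean <- tapply(history$bikes, history$hour, mean)

# Predict the number of available bikes at a given hour (0-23).
predict_bikes <- function(hour) {
  unname(hourly_mean[as.character(hour)])
}

# A station is worth visiting if at least one bike is expected.
has_bike <- function(hour) predict_bikes(hour) >= 1

predict_bikes(8)
has_bike(8)
```

The same idea extends to predicting free slots for returning a bike, by running it on slot counts instead of bike counts.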

We’ve done other projects in China using R to solve problems ranging from Telecommunications data caching to Cardiovascular disease prevention. Each of these projects has required a unique combination of statistical knowledge and data science tools, with R being the backbone of the solution. The commercial cases can be found at our website: http://www.supstat.com/consulting/

About-

SupStat is a statistical consulting company specialized in statistical computing and graphics using state-of-the-art technologies.

VIVIAN S. ZHANG Co-founder & CTO, NYC, Beijing and Shanghai Office

Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She obtained expertise in data analysis and data management using various statistical analytical tools and programming languages as a Senior Analyst and Biostatistician at Memorial Sloan-Kettering Cancer Center and Scientific Programmer at Brown University. She is the co-founder of SupStat, the NYC Data Science Academy, and the NYC Open Data Meetup. She likes to portray herself as a programmer, data-holic, and visualization evangelist.

You can read more about SupStat at http://www.supstat.com/team/

Latest Interview – Rapid Miner CEO Ingo Mierswa

Here is an interview I did with the CEO of Rapid Miner, Ingo Mierswa. Ingo, who is something of a prodigy and genius with multi-lingual capabilities and a stellar academic and business record, talks about navigating the journey of an open source startup.

http://www.kdnuggets.com/2014/06/interview-ingo-mierswa-rapidminer-analytics-turning-points.html

Popularized by Michael (Monty) Widenius, one of the founders of MySQL and an investor in RapidMiner, business source is a commercial software license model that offers many of the benefits of open source, but with a built-in time delay on users being able to access new versions of our products.

 

Related-

  1. Guide to Data Science Cheat Sheets 2014/05/12
  2. Book Review: Data Just Right 2014/04/03
  3. Exclusive Interview: Richard Socher, founder of etcML, Easy Text Classification Startup 2014/03/31
  4. Trifacta – Tackling Data Wrangling with Automation and Machine Learning 2014/03/17
  5. Paxata automates Data Preparation for Big Data Analytics 2014/03/07
  6. etcML Promises to Make Text Classification Easy  2014/03/05
  7. Wolfram Breakthrough Knowledge-based Programming Language – what it means for Data Science? 2014/03/02

10 for 10 – Packt lowers cost of books for students and researchers alike

The high cost of textbooks and science books is an open scandal. Despite this, publishers are barely profitable, and the ecosystem is ripe for disruption.

Packt is one such player. I have reviewed many books for them (in return I get ebooks and books, some of which I give to my students).

Now they have an intriguing offer.

As you are aware, this month, Packt is celebrating 10 years of success with over 2000 Titles in its Library. To celebrate this huge milestone, we have come up with an exciting opportunity for collaboration which you might be interested in.

Packt is offering all of its eBooks and Videos at just $10 each. This campaign is specifically aimed towards thanking all our customers for their support and opening up our comprehensive range of titles just for $10 each. This promotion covers every title and customers can stock up on as many copies as they like until July 5th. I hope you find this as a great opportunity to explore what’s new and maintain your personal and professional development.

Interested? You can see http://www.packtpub.com/10years

Disclosure- The author was offered 2 free ebooks as part of this campaign on social media. Books are one thing he is willing to blog for ;)

Analysing Google Plus posts using R language #rstats

Here is a short post on retrieving information from the Google+ API using R, and then analysing it.

To create an API key:

  1. Go to the Google Developers Console.
  2. Create or select a project.
  3. In the sidebar on the left, select APIs & auth.
  4. In the displayed list of APIs, find the Google+ API and set its status to ON.
  5. In the sidebar on the left, select Credentials.
  6. Create an API key by clicking Create New Key. Select the appropriate kind of key: Server key. Then click Create.

from- https://developers.google.com/+/api/oauth

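And the R code:

# install.packages("plusser")   # run once if the package is not installed
library(plusser)
help(plusser)
library(RCurl)
# Point RCurl at its bundled CA certificates so HTTPS calls to the API work
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
# Register the API key created in the steps above (use your own key here)
setAPIkey('AIzaSyBtYqDsAtzp4FOS7FGbrc_n6mD-uJIOvcQ')
# Retrieve the public profile of a Google+ user or page
myProfile <- harvestProfile("+AjayOhri", parseFun = parseProfile)
str(myProfile)
# Retrieve one page of public posts for the same user
myposts <- harvestPage("+AjayOhri", parseFun = parsePost, results = 1, nextToken = NULL, cr = 1)
str(myposts)
head(myposts)
plot(myposts$ti, myposts$nC) # number of comments
plot(myposts$ti, myposts$nP) # number of likes or plus 1
plot(myposts$ti, myposts$nR) # number of reshares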


You can also see the Rpubs document here: http://rpubs.com/decisionstats2/plusser

Now you can do text analysis and sentiment analysis on myposts$msg and do social media analysis on what makes people like what kind of content.
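As a rough starting point for that text analysis, here is a small base R sketch. Everything in it is an assumption for illustration: the `msgs` vector stands in for myposts$msg, and the stopword list and the tiny positive/negative lexicon are invented, not taken from any package. A real analysis would use a proper lexicon or a text mining package.

```r
# Hypothetical stand-in for myposts$msg
msgs <- c("R makes data analysis fun",
          "Data visualization with R and ggvis",
          "Learning R for data science")

# Split each message into lower-case words
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words[nzchar(words)]
}

# Word frequencies across all messages, ignoring a few common words
word_freq <- function(texts, stopwords = c("and", "for", "the", "with", "a", "of")) {
  words <- unlist(lapply(texts, tokenize))
  words <- words[!(words %in% stopwords)]
  sort(table(words), decreasing = TRUE)
}

# Toy sentiment score: positive words minus negative words per message
sentiment_score <- function(texts,
                            positive = c("fun", "great", "awesome"),
                            negative = c("boring", "bad")) {
  sapply(texts, function(txt) {
    w <- tokenize(txt)
    sum(w %in% positive) - sum(w %in% negative)
  }, USE.NAMES = FALSE)
}

freq <- word_freq(msgs)
scores <- sentiment_score(msgs)
head(freq)
scores
```

With real posts, the scores could then be plotted against myposts$nP to see whether tone has any relationship with plus ones.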


For better results, use a google plus id (page or person) which has a lot of PUBLIC posts!

 

ggvis is awesomeness personified #rstats

 

Hu ha! Latest sexy software from our man Dr Hadley Wickham and his ninjas at RStudio. Now YOU can make a Business Intelligence software for FREE. How good is it? Time will tell if someone can use it to give Tableau Software and Qlikview a run for their money.

Seriously- I would like to see ONE implementation of RHadoop and Shiny with ggplot2 and d3

(Big data analytics indeed ;) )

from

———————————-

http://ggvis.rstudio.com/

ggvis is a data visualization package for R which lets you:

  • Declaratively describe data graphics with a syntax similar in spirit to ggplot2.
  • Create rich interactive graphics that you can play with locally in Rstudio or in your browser.
  • Leverage shiny’s infrastructure to publish interactive graphics usable from any browser (either within your company or to the world).

The goal is to combine the best of R (e.g. every modelling function you can imagine) and the best of the web (everyone has a web browser). Data manipulation and transformation are done in R, and the graphics are rendered in a web browser, using Vega. For RStudio users, ggvis graphics display in a viewer panel, which is possible because RStudio is a web browser.
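For a feel of the syntax, here is a minimal sketch along the lines of the examples in the ggvis documentation, using the built-in mtcars data. It assumes a ggvis version with the newer (0.3 and later) API, and it only builds the plot if the package is installed. The slider controls the smoothing span, which is the kind of interactivity described above.

```r
# Sketch only: build the plot if ggvis is available, otherwise leave p as NULL.
p <- NULL
if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  # Scatter plot of weight against miles per gallon
  p <- ggvis(mtcars, ~wt, ~mpg)
  p <- layer_points(p)
  # Smoothed trend line whose span is controlled by an interactive slider
  p <- layer_smooths(p, span = input_slider(0.5, 1, value = 1))
}

# Printing the object renders it in the RStudio viewer or a browser,
# so only do that in an interactive session.
if (interactive() && !is.null(p)) print(p)
```

The documentation usually writes the same thing as a pipeline with %>%, which reads more naturally once there are several layers.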

Please note that the API has changed significantly between ggvis 0.1 and 0.3. Documentation for the old version is here.


Great Way to learn Git easily

A great way to learn Git easily is here: https://try.github.io/

This is a much better designed code school project than the one for R

http://tryr.codeschool.com/

However, Swirl is a great way to learn R in an interactive way. Its only drawback is that it needs to be integrated with something like http://www.r-fiddle.org/#/ for a true automated browser-only version.

Why do I favor automated elearning solutions now? Because teaching the same thing again and again can be boring for the teacher, and videos can be boring for the students. Note how the potential student is given positive reinforcement to boost his morale, something any good teacher knows.
