So many R Packages Everywhere, which one do I use? #rstats

Some thoughts on R Packages

  • CRAN is no longer the sole repository for many useful R packages. This includes R Forge, Google Code and increasingly Github
  • CRAN lacks the flexibility and social aspect of Github.
  • CRAN Views is the only thing that lists subject wide listing of R packages. The categorization is however done more on methods than on use cases or business domains.
  • Multiple R packages for the same thing. Which one do I use? Only Stack Overflow helps with that. No rating , no recommendation system
  • The packages suggested by R package feature needs better and automatic association analysis . Right now it is manual and dependent on package author and maintainer.
  • Quis custodiet ipsos custodes? Who guards the guardians of R packages. In an era of cyber security, we need better transparency on security measures within R packages especially given the international nature of the project.  I am very sure I ( or anyone) can create R code to communicate discretely especially on Windows

  • I would rather not install anything on my local machine, and read the package directly from the CRAN . CRAN was designed in an era of low bandwidth- this needs to be upgraded.
  • Note I am refraining respectfully from the atrocious nature of aesthetics in the home website. Many statisticians feel no use of making R user friendly. My professors at U tenn (from which I dropped out in 2 sems) were horrified when I took courses in graphic design as I wanted to know more on the A and B, which make the A/B testing of statistical design. Now that I am getting older, I get horrified by the lack of HTML, CSS and JQuery by some of the brightest programmers in this project.
  • Please comment below.

 

Jetstrap for builiding websites with Twitter Bootstrap

Twitter Bootstrap is a free collection of tools for creating websites and web applications. It contains HTML and CSS-based design templates for typography, forms, buttons, charts, navigation and other interface components, as well as optional JavaScript extensions.

It is the most popular project in GitHub[2] and is used by NASA and MSNBC among others.

———————-

If you like me, hate to get down and dirty in HTML, CSS , JQuery ( not mentioning the excellent Code Academy HTML/CSS tutorials and  JQuery Track ) and want to create a pretty simple website for yourself- Jetstrap helps you build the popular Twitter Bootstrap design (very minimalistic) for websites.

And it’s free! And click and point and paste your content- and awesome CSS, HTML. Allows you to download the HTML to paste in your existing site!

2

Here is one I created in 5 minutes!

123

So lose your old website! Because not every website needs WordPress!

Try Jetstrap for Bootstrap!

Interview Alain Chesnais Chief Scientist Trendspottr.com

Here is a brief interview with Alain Chesnais ,Chief Scientist  Trendspottr.com. It is a big honor to interview such a legend in computer science, and I am grateful to both him and Mark Zohar for taking time to write these down.
alain_chesnais2.jpg

Ajay-  Describe your career from your student days to being the President of ACM (Association of Computing Machinery http://www.acm.org/ ). How can we increase  the interest of students in STEM education, particularly in view of the shortage of data scientists.
 
Alain- I’m trying to sum up a career of over 35 years. This may be a bit long winded…
I started my career in CS when I was in high school in the early 70’s. I was accepted in the National Science Foundation’s Science Honors Program in 9th grade and the first course I took was a Fortran programming course at Columbia University. This was on an IBM 360 using punch cards.
The next year my high school got a donation from DEC of a PDP-8E mini computer. I ended up spending a lot of time in the machine room all through high school at a time when access to computers wasn’t all that common. I went to college in Paris and ended up at l’Ecole Normale Supérieure de Cachan in the newly created Computer Science department.
My first job after finishing my graduate studies was as a research assistant at the Centre National de la Recherche Scientifique where I focused my efforts on modelling the behaviour of distributed database systems in the presence of locking. When François Mitterand was elected president of France in 1981, he invited Nicholas Negroponte and Seymour Papert to come to France to set up the Centre Mondial Informatique. I was hired as a researcher there and continued on to become director of software development until it was closed down in 1986. I then started up my own company focusing on distributed computer graphics. We sold the company to Abvent in the early 90’s.
After that, I was hired by Thomson Digital Image to lead their rendering team. We were acquired by Wavefront Technologies in 1993 then by SGI in 1995 and merged with Alias Research. In the merged company: Alias|wavefront, I was director of engineering on the Maya project. Our team received an Oscar in 2003 for the creation of the Maya software system.
Since then I’ve worked at various companies, most recently focusing on social media and Big Data issues associated with it. Mark Zohar and I worked together at SceneCaster in 2007 where we developed a Facebook app that allowed users to create their own 3D scenes and share them with friends via Facebook without requiring a proprietary plugin. In December 2007 it was the most popular app in its category on Facebook.
Recently Mark approached me with a concept related to mining the content of public tweets to determine what was trending in real time. Using math similar to what I had developed during my graduate studies to model the performance of distributed databases in the presence of locking, we built up a real time analytics engine that ranks the content of tweets as they stream in. The math is designed to scale linearly in complexity with the volume of data that we analyze. That is the basis for what we have created for TrendSpottr.
In parallel to my professional career, I have been a very active volunteer at ACM. I started out as a member of the Paris ACM SIGGRAPH chapter in 1985 and volunteered to help do our mailings (snail mail at the time). After taking on more responsibilities with the chapter, I was elected chair of the chapter in 1991. I was first appointed to the SIGGRAPH Local Groups Steering Committee, then became ACM Director for Chapters. Later I was successively elected SIGGRAPH Vice Chair, ACM SIG Governing Board (SGB) Vice Chair for Operations, SGB Chair, ACM SIGGRAPH President, ACM Secretary/Treasurer, ACM Vice President, and finally, in 2010, I was elected ACM President. My term as ACM President has just ended on July 1st. Vint Cerf is our new President. I continue to serve on the ACM Executive Committee in my role as immediate Past President.
(Note- About ACM
ACM, the Association for Computing Machinery www.acm.org, is the world’s largest educational and scientific computing society, uniting computing educators, researchers and professionals to inspire dialogue, share resources and address the field’s challenges. )
Ajay- What sets Trendspotter apart from other startups out there in terms of vision in trying to achieve a more coherent experience on the web.
 
Alain- The Basic difference with other approaches that we are aware of is that we have developed an incremental solution that calculates the results on the fly as the data streams in. Our evaluators are based on solid mathematical foundations that have proven their usefulness over time. One way to describe what we do is to think of it as signal processing where the tweets are the signal and our evaluators are like triggers that tell you what elements of the signal have the characteristics that we are filtering for (velocity and acceleration). One key result of using this approach is that our unit cost per tweet analyzed does not go up with increased volume. Using more traditional data analysis approaches involving an implicit sort would imply a complexity of N*log(N), where N is the volume of tweets being analyzed. That would imply that the cost per tweet analyzed would go up with the volume of tweets. Our approach was designed to avoid that, so that we can maintain a cap on our unit costs of analysis, no matter what volume of data we analyze.
Ajay- What do you think is the future of big data visualization going to look like? What are some of the technologies that you are currently bullish on?
Alain- I see several trends that would have deep impact on Big Data visualization. I firmly believe that with large amounts of data, visualization is key tool for understanding both the structure and the relationships that exist between data elements. Let’s focus on some of the key things that are pushing in this direction:
  • the volume of data that is available is growing at a rate we have never seen before. Cisco has measured an 8 fold increase in the volume of IP traffic over the last 5 years and predicts that we will reach the zettabyte of data over IP in 2016
  • more of the data is becoming publicly available. This isn’t only on social networks such as Facebook and twitter, but joins a more general trend involving open research initiatives and open government programs
  • the desired time to get meaningful results is going down dramatically. In the past 5 years we have seen the half life of data on Facebook, defined as the amount of time that half of the public reactions to any given post (likes, shares., comments) take place, go from about 12 hours to under 3 hours currently
  • our access to the net is always on via mobile device. You are always connected.
  • the CPU and GPU capabilities of mobile devices is huge (an iPhone has 10 times the compute power of a Cray-1 and more graphics capabilities than early SGI workstations)
Put all of these observations together and you quickly come up with a massive opportunity to analyze data visually on the go as it happens no matter where you are. We can’t afford to have to wait for results. When something of interest occurs we need to be aware of it immediately.
Ajay- What are some of the applications we could use Trendspottr. Could we predict events like Arab Spring, or even the next viral thing.
 
Alain- TrendSpottr won’t predict what will happen next. What it *will* do is alert you immediately when it happens. You can think of it like a smoke detector. It doesn’t tell that a fire will take place, but it will save your life when a fire does break out.
Typical uses for TrendSpottr are
  • thought leadership by tracking content that your readership is interested in via TrendSpottr you can be seen as a thought leader on the subject by being one of the first to share trending content on a given subject. I personally do this on my Facebook page (http://www.facebook.com/alain.chesnais) and have seen my klout score go up dramatically as a result
  • brand marketing to be able to know when something is trending about your brand and take advantage of it as it happens.
  • competitive analysis to see what is being said about two competing elements. For instance, searching TrendSpottr for “Obama OR Romney” gives you a very good understanding about how social networks are reacting to each politician. You can also do searches like “$aapl OR $msft OR $goog” to get a sense of what is the current buzz for certain hi tech stocks.
  • understanding your impact in real time to be able to see which of the content that you are posting is trending the most on social media so that you can highlight it on your main page. So if all of your content is hosted on common domain name (ourbrand.com), searching for ourbrand.com will show you the most active of your site’s content. That can easily be set up by putting a TrendSpottr widget on your front page

Ajay- What are some of the privacy guidelines that you keep in  mind- given the fact that you collect individual information but also have government agencies as potential users.

 
Alain- We take privacy very seriously and anonymize all of the data that we collect. We don’t keep explicit records of the data we collected through the various incoming streams and only store the aggregate results of our analysis.
About
Alain Chesnais is immediate Past President of ACM, elected for the two-year term beginning July 1, 2010.Chesnais studied at l’Ecole Normale Supérieure de l’Enseignement Technique and l’Université de Paris where he earned a Maîtrise de Mathematiques, a Maitrise de Structure Mathématique de l’Informatique, and a Diplôme d’Etudes Approfondies in Compuer Science. He was a high school student at the United Nations International School in New York, where, along with preparing his International Baccalaureate with a focus on Math, Physics and Chemistry, he also studied Mandarin Chinese.Chesnais recently founded Visual Transitions, which specializes in helping companies move to HTML 5, the newest standard for structuring and presenting content on the World Wide Web. He was the CTO of SceneCaster.com from June 2007 until April 2010, and was Vice President of Product Development at Tucows Inc. from July 2005 – May 2007. He also served as director of engineering at Alias|Wavefront on the team that received an Oscar from the Academy of Motion Picture Arts and Sciences for developing the Maya 3D software package.

Prior to his election as ACM president, Chesnais was vice president from July 2008 – June 2010 as well as secretary/treasurer from July 2006 – June 2008. He also served as president of ACM SIGGRAPH from July 2002 – June 2005 and as SIG Governing Board Chair from July 2000 – June 2002.

As a French citizen now residing in Canada, he has more than 20 years of management experience in the software industry. He joined the local SIGGRAPH Chapter in Paris some 20 years ago as a volunteer and has continued his involvement with ACM in a variety of leadership capacities since then.

About Trendspottr.com

TrendSpottr is a real-time viral search and predictive analytics service that identifies the most timely and trending information for any topic or keyword. Our core technology analyzes real-time data streams and spots emerging trends at their earliest acceleration point — hours or days before they have become “popular” and reached mainstream awareness.

TrendSpottr serves as a predictive early warning system for news and media organizations, brands, government agencies and Fortune 500 companies and helps them to identify emerging news, events and issues that have high viral potential and market impact. TrendSpottr has partnered with HootSuite, DataSift and other leading social and big data companies.

Saving Output in R for Presentations

While SAS language has a beautifully designed ODS (Output Delivery System) for saving output from certain analysis in excel files (and html and others), in R one can simply use the object, put it in a write.table and save it a csv file using the file parameter within write.table.

As a business analytics consultant, the output from a Proc Means, Proc Freq (SAS) or a summary/describe/table command (in R) is to be presented as a final report. Copying and pasting is not feasible especially for large amounts of text, or remote computers.

Using the following we can simple save the output  in R

 

> getwd()
[1] “C:/Users/KUs/Desktop/Ajay”
> setwd(“C:\Users\KUs\Desktop”)

#We shifted the directory, so we can save output without putting the entire path again and again for each step.

#I have found the summary command most useful for initial analysis and final display (particularly during the data munging step)

nams=summary(ajay)

# I assigned a new object to the analysis step (summary), it could also be summary,names, describe (HMisc) or table (for frequency analysis),
> write.table(nams,sep=”,”,file=”output.csv”)

Note: This is for basic beginners in R using it for business analytics dealing with large number of variables.

 

pps: Note

If you have a large number of files in a local directory to be read in R, you can avoid typing the entire path again and again by modifying the file parameter in the read.table and changing the working directory to that folder

 

setwd(“C:/Users/KUs/Desktop/”)
ajayt1=read.table(file=”test1.csv”,sep=”,”,header=T)

ajayt2=read.table(file=”test2.csv”,sep=”,”,header=T)

 

and so on…

maybe there is a better approach somewhere on Stack Overflow or R help, but this will work just as well.

you can then merge the objects created ajayt1 and ajayt2… (to be continued)

Cloud Computing – can be evil

Cloud Computing can be evil because-

1) Most browsers are owned by for profit corporations . Corporations can be evil, sometimes

And corporations can go bankrupt. You can back up data locally, but try backing up a corporation.

2) The content on your web page can be changed using translator extensions . This has interesting ramifications as in George Orwell. You may not be even aware of subtle changes introduced in your browser in the way it renders the html or some words using keywords from a browser extension app.

Imagine a new form of language called Politically Correct Truthspeak, and that can be in English but using machine learning learn to substitute politically sensitive words with Govt sanctioned words.

3) Your DNS and IP settings can be redirected using extensions. This means if a Govt passes a law- you can be denied the websites using just the browser not even the ISP.

Thats an extreme scenario for a authoritative govt creating its own version of Mafiaafire Redirector.

So how to keep the cloud computer honest?Move some stuff to the desktop

How to keep desktop computing efficient?Use some more cloud computing

It is not an OR but an AND function in which some computing can be local, some shared and some in the cloud.

Si?

Interview Michal Kosinski , Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski , leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source

Ajay-  Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?  

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

 

My more recent research performed during my Mphil course in Psychometrics at Cambridge University followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I’ve quickly realized how empowering and convenient this tool was.

 

I believe that a course in R should be obligatory for all students that are likely to come close to any data analysis in their careers. It is really empowering – once you got the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs, each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch an publish them online. And all this without having to hire software developers.

Ajay -Why did you choose R as the background for Concerto? What other languages and platforms did you consider. Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay -What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to R engine with a built-in MySQL database and easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such us my favourite Survey Monkey), but has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal – I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at the Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

You can read more about Concerto at http://code.google.com/p/concerto-platform/ and http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

R Concerto- Computer Adaptive Tests

A really nice use for R is education

http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto: R-Based Online Adaptive Testing Platform

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source. If you have any questions, you feel like generously supporting the project, or you want to develop a commerical test on the platform, feel free to email Michal Kosinski.

We rely as much as possible on popular open source packages in order to maximize the safety and reliability of the system, and to ensure that its elements are kept up-to-date.

Why choose Concerto?

  • Simple to use: Check our Step-by-Step tutorial to see how to create a test in minutes.
  • Flexibility: You can use the R engine to apply virtually any IRT or CAT models.
  • Scalability: Modular design, MySQL tables, and low system requirements allow the testing of thousands for pennies.
  • Reliability: Concerto relies on popular, constantly updated, and reliable elements used by millions of users world-wide.
  • Elegant feedback and items: The flexibility of the HTML layer and the power of R allow you to use (or generate on the fly!) polished multi-media items, as well as feedback full of graphs and charts generated by R for each test taker.
  • Low costs: It’s free and open-source!

Demonstration tests:

 Concerto explained:

Get Concerto:

Before installing concerto you may prefer to test it using a demo account on our server.Email Michal Kosinski in order to get demo account.

Training in Concerto:

Next session 9th Dec 2011: book early!

Commercial tests and Concerto:

Concerto is an open-source project so anyone can use it free of charge, even for commercial purposes. However, it might be faster and less expensive to hire our experienced team to develop your test, provide support and maintenance, and take responsibility for its smooth and reliable operation. Contact us!

 

%d bloggers like this: