Interview Michal Kosinski, Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski, leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm:

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source.

Ajay- Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

My more recent research, performed during my MPhil course in Psychometrics at Cambridge University and followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I quickly realized how empowering and convenient this tool was.

I believe that a course in R should be obligatory for all students who are likely to come close to any data analysis in their careers. It is really empowering – once you have got the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs; each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch and publish them online. And all this without having to hire software developers.

Ajay- Why did you choose R as the background for Concerto? What other languages and platforms did you consider? Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay- What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to the R engine with a built-in MySQL database and an easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such as my favourite Survey Monkey), but it has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal – I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

Data Documentation Initiative

Here is a nice initiative for standardizing data documentation in the social sciences (which can be quite a relief to legions of analysts).

http://www.ddialliance.org/what

Benefits of DDI

http://ddi.icpsr.umich.edu/ddi-at-work/benefits

The DDI facilitates:

Interoperability. Codebooks marked up using the DDI specification can be exchanged and transported seamlessly, and applications can be written to work with these homogeneous documents.

Richer content. The DDI was designed to encourage the use of a comprehensive set of elements to describe social science datasets as completely and as thoroughly as possible, thereby providing the potential data analyst with broader knowledge about a given collection.

Single document – multiple purposes. A DDI codebook contains all of the information necessary to produce several different types of output, including, for example, a traditional social science codebook, a bibliographic record, or SAS/SPSS/Stata data definition statements. Thus, the document may be repurposed for different needs and applications. Changes made to the core document will be passed along to any output generated.

On-line subsetting and analysis. Because the DDI markup extends down to the variable level and provides a standard uniform structure and content for variables, DDI documents are easily imported into on-line analysis systems, rendering datasets more readily usable for a wider audience.

Precision in searching. Since each of the elements in a DDI-compliant codebook is tagged in a specific way, field-specific searches across documents and studies are enabled. For example, a library of DDI codebooks could be searched to identify datasets covering protest demonstrations during the 1960s in specific states or countries.
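
To make the "single document, multiple purposes" and "precision in searching" points a bit more concrete, here is a minimal Python sketch. It does not use the real DDI XML schema: the `codebook` dictionary, its keys, and the helper names `spss_definitions` and `search_by_field` are hypothetical stand-ins I made up for illustration. The idea is only that one structured description can be turned into SPSS-style label statements and can also be searched field by field.

```python
# Hypothetical, simplified stand-in for a codebook. These keys are NOT DDI
# element names; they only mimic the idea of tagged, variable-level metadata.
codebook = {
    "title": "Hypothetical Protest Survey",
    "time_period": "1960-1969",
    "geography": ["France", "Italy"],
    "variables": [
        {"name": "age", "label": "Age of respondent", "values": {}},
        {"name": "attend", "label": "Attended a demonstration",
         "values": {0: "No", 1: "Yes"}},
    ],
}


def spss_definitions(cb):
    """Render variable and value labels as SPSS-style definition statements."""
    labels = [f"{v['name']} '{v['label']}'" for v in cb["variables"]]
    lines = ["VARIABLE LABELS", "  " + "\n  /".join(labels) + "."]
    for v in cb["variables"]:
        if v["values"]:
            pairs = " ".join(f"{code} '{text}'" for code, text in v["values"].items())
            lines.append(f"VALUE LABELS {v['name']} {pairs}.")
    return "\n".join(lines)


def search_by_field(codebooks, field, term):
    """Return titles of codebooks whose given field contains the search term."""
    term = term.lower()
    hits = []
    for cb in codebooks:
        value = cb.get(field, "")
        values = value if isinstance(value, list) else [value]
        if any(term in str(v).lower() for v in values):
            hits.append(cb["title"])
    return hits


print(spss_definitions(codebook))
print(search_by_field([codebook], "geography", "italy"))
```

A real DDI codebook carries far more detail, but the same principle applies: because every piece of metadata sits in a known field, both the export and the search can be written once and reused across studies.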

Ohri’s Johari Window

An empty Johari window, with the “Rooms” arranged clockwise, starting with Room 1 at the top left

A Johari window is a cognitive psychological tool created by Joseph Luft and Harry Ingham in 1955 in the United States, used to help people better understand their interpersonal communication and relationships. It is used primarily in self-help groups and corporate settings as a heuristic exercise.

When performing the exercise, the subject is given a list of 56 adjectives and picks five or six that they feel describe their own personality. Peers of the subject are then given the same list, and each picks five or six adjectives that describe the subject. These adjectives are then mapped onto a grid.

A Johari window consists of the following 56 adjectives used as possible descriptions of the participant. In alphabetical order they are:

able
accepting
adaptable
bold
brave
calm
caring
cheerful
clever
complex
confident

dependable
dignified
energetic
extroverted
friendly
giving
happy
helpful
idealistic
independent
ingenious

intelligent
introverted
kind
knowledgeable
logical
loving
mature
modest
nervous
observant
organized

patient
powerful
proud
quiet
reflective
relaxed
religious
responsive
searching
self-assertive
self-conscious

sensible
sentimental
shy
silly
smart
spontaneous
sympathetic
tense
trustworthy
warm
wise
witty
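
The mapping onto the grid can be sketched in a few lines of Python. This is my own minimal illustration, not something from the original description: the function name `johari_window` is made up, and the quadrant names (open, hidden, blind, unknown) follow the commonly used labels for the four rooms. An adjective picked by both the subject and at least one peer goes into the open room, one picked only by the subject is hidden, one picked only by peers is a blind spot, and anything nobody picked is unknown.

```python
def johari_window(adjectives, self_picks, peer_picks):
    """Sort adjectives into the four Johari quadrants.

    adjectives: the full list offered to everyone
    self_picks: adjectives the subject chose for themselves
    peer_picks: one list of chosen adjectives per peer
    """
    self_set = set(self_picks)
    peer_set = set().union(*peer_picks)  # anything at least one peer chose
    return {
        "open": sorted(self_set & peer_set),      # known to self and to others
        "hidden": sorted(self_set - peer_set),    # known to self only
        "blind": sorted(peer_set - self_set),     # known to others only
        "unknown": sorted(set(adjectives) - self_set - peer_set),
    }


# Tiny example with a shortened adjective list
window = johari_window(
    adjectives=["able", "bold", "calm", "kind"],
    self_picks=["able", "bold"],
    peer_picks=[["bold", "calm"]],
)
print(window)
```

With the full list of 56 adjectives the unknown room will usually be the largest, since each person only picks five or six.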

Open Source Compiler for SAS language/ GNU -DAP

I am still testing this out.

But if you know a bit more about make and .compile in Ubuntu, check out

http://www.gnu.org/software/dap/

I loved the humorous introduction:

Dap is a small statistics and graphics package based on C. Version 3.0 and later of Dap can read SBS programs (based on the utterly famous, industry standard statistics system with similar initials – you know the one I mean)! The user wishing to perform basic statistical analyses is now freed from learning and using C syntax for straightforward tasks, while retaining access to the C-style graphics and statistics features provided by the original implementation. Dap provides core methods of data management, analysis, and graphics that are commonly used in statistical consulting practice (univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses).

Anyone familiar with the basic syntax of C programs can learn to use the C-style features of Dap quickly and easily from the manual and the examples contained in it; advanced features of C are not necessary, although they are available. (The manual contains a brief introduction to the C syntax needed for Dap.) Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables.

I wrote Dap to use in my statistical consulting practice because the aforementioned utterly famous, industry standard statistics system is (or at least was) not available on GNU/Linux and costs a bundle every year under a lease arrangement. And now you can run programs written for that system directly on Dap! I was generally happy with that system, except for the graphics, which are all but impossible to use, but there were a number of clumsy constructs left over from its ancient origins.

http://www.gnu.org/software/dap/#Sample output

Unbalanced ANOVA

Crossed, nested ANOVA

Random model, unbalanced

Mixed model, balanced

Mixed model, unbalanced

Split plot

Latin square

Missing treatment combinations

Linear regression

Linear regression, model building

Ordinal cross-classification

Stratified 2×2 tables

Loglinear models

Logit model for linear-by-linear association

Logistic regression

Sounds too good to be true. GNU Dap joins WPS Workbench and Dulles Open’s Carolina as the third SAS language compiler (besides the now defunct BASS software); see http://en.wikipedia.org/wiki/SAS_language#Controversy

Also see http://en.wikipedia.org/wiki/DAP_(software)

Dap was written to be a free replacement for SAS, but users are assumed to have a basic familiarity with the C programming language in order to permit greater flexibility. Unlike R it has been designed to be used on large data sets.

It has been designed to cope with very large data sets, even when the size of the data exceeds the size of the computer’s memory.
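
To see why reading one line at a time lets a program handle data larger than memory, here is a small Python sketch. It is not Dap code and says nothing about how Dap is implemented internally; the function name `streaming_mean_sd` is my own, and I am using Welford's one-pass update for the mean and variance as an example of a statistic that only needs the current line plus a few running totals.

```python
import io
import math


def streaming_mean_sd(lines):
    """Return (count, mean, sample standard deviation) from one pass over lines.

    Only three running numbers are kept in memory, no matter how many
    lines the input has.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, sd


# Any iterable of lines works; an open file object is read lazily, line by line.
data = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
print(streaming_mean_sd(data))
```

Passing an open file handle instead of the `io.StringIO` object gives the same result without ever loading the whole file.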

Google Books Ngram Viewer

Here is a terrific data visualization from Google based on their digitized books collection. How does it work? Basically, you can test the frequency of various words across time periods from the 1700s to 2010.

For example, you can compare the frequency of kung fu vs yoga, or pizza versus hot dog. The underlying dataset covers millions, even billions, of words.

Here is my yoga vs kung fu vs judo graph.

http://ngrams.googlelabs.com/info

What’s all this do?

When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years. Let’s look at a sample graph:

This shows trends in three ngrams from 1950 to 2000: “nursery school” (a 2-gram or bigram), “kindergarten” (a 1-gram or unigram), and “child care” (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”? Of all the unigrams, what percentage of them are “kindergarten”? Here, you can see that use of the phrase “child care” started to rise in the late 1960s, overtaking “nursery school” around 1970 and then “kindergarten” around 1973. It peaked shortly after 1990 and has been falling steadily since.

(Interestingly, the results are noticeably different when the corpus is switched to British English.)
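
The y-axis calculation described above can be sketched in Python on a toy corpus. This is only my illustration of the idea, not Google's code or data format: the function names `ngrams` and `ngram_share`, the whitespace tokenization, and the two tiny "years" of text are all assumptions. For each year, the share of a phrase is its count divided by the count of all n-grams of the same length in that year.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as space-joined strings) in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_share(texts_by_year, phrase):
    """Percentage of all same-length n-grams in each year that equal phrase."""
    n = len(phrase.split())
    shares = {}
    for year, texts in sorted(texts_by_year.items()):
        counts = Counter()
        for text in texts:
            # n-grams are counted within each text, never across two texts
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        shares[year] = 100.0 * counts[phrase.lower()] / total if total else 0.0
    return shares


# Toy corpus, invented for the example
corpus = {
    1960: ["the nursery school opened", "child care is rare"],
    1990: ["child care and child care costs"],
}
print(ngram_share(corpus, "child care"))
print(ngram_share(corpus, "kindergarten"))
```

Note that a bigram is compared only against other bigrams and a unigram only against other unigrams, which is why curves for phrases of different lengths sit on different scales.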

Corpora

Below are descriptions of the corpora that can be searched with the Google Books Ngram Viewer. All of these corpora were generated in July 2009; we will update these corpora as our book scanning continues, and the updated versions will have distinct persistent identifiers.

Informal corpus name	Persistent identifier	Description
American English	googlebooks-eng-us-all-20090715	Same filtering as the English corpus but further restricted to books published in the United States.
British English	googlebooks-eng-gb-all-20090715	Same filtering as the English corpus but further restricted to books published in Great Britain.

American Decline- Why outsourcing doesn’t make sense

Here is a celebrated graphic from an American journalist using data from the U.S. Department of Labor’s Bureau of Labor Statistics. It is a good example of using time as a dimension for animation, and heat maps for geography-enabled visualizations.

----------

According to the U.S. Department of Labor’s Bureau of Labor Statistics, there are nearly 31 million people currently unemployed – that’s including those involuntarily working part time and those who want a job, but have given up on trying to find one. In the face of the worst economic upheaval since the Great Depression, millions of Americans are hurting. “The Decline: The Geography of a Recession,” as created by labor writer LaToya Egwuekwe, serves as a vivid representation of just how much. Watch the deteriorating transformation of the U.S. economy from January 2007 – approximately one year before the start of the recession – to the most recent unemployment data available today. Original link: http://www.latoyaegwuekwe.com/geographyofarecession.html. For more information, email latoya.egwuekwe@yahoo.com

----------

31 million unemployed. Does a US corporation seriously think that it can build everything OUTSIDE America and SELL INSIDE America? Or that it is okay for intellectual property to continue to be stolen as long as labor is cheap?

Shame on you if you outsourced your neighbour’s jobs- or would rather hire in a geography where they steal your intellectual property.

This Christmastime – May the Ghost of the Unemployed Family Christmases visit you in your sleep instead.
