Interview BigML.com

Here is an interview with Charlie Parker, head of large scale online algorithms at http://bigml.com

Ajay-  Describe your own personal background in scientific computing, and how you came to be involved with machine learning, cloud computing and BigML.com

Charlie- I am a machine learning Ph.D. from Oregon State University. Francisco Martin (our founder and CEO), Adam Ashenfelter (the lead developer on the tree algorithm), and I were all studying machine learning at OSU around the same time. We all went our separate ways after that.

Francisco started Strands and turned it into a 100+ million dollar company building recommender systems. Adam worked for CleverSet, a probabilistic modeling company that was eventually sold to Cisco, I believe. I worked for several years in the research labs at Eastman Kodak on data mining, text analysis, and computer vision.

When Francisco left Strands to start BigML, he brought in Justin Donaldson, who is a brilliant visualization guy from Indiana, and an ex-Googler named Jose Ortega, who is responsible for most of our data infrastructure. They pulled in Adam and me a few months later. We also have Poul Petersen, a former Strands employee, who manages our herd of servers. He is a wizard and makes everyone else’s life much easier.

Ajay- You use Clojure for the back end of BigML.com. Are there any other languages and packages you are considering? What makes Clojure such a good fit for cloud computing?

Charlie- Clojure is a great language because it offers you all of the benefits of Java (extensive libraries, cross-platform compatibility, easy integration with things like Hadoop, etc.) but has the syntactical elegance of a functional language. This makes our code base small and easy to read as well as powerful.

We’ve had occasional issues with speed, but that just means writing the occasional function or library in Java. As we build towards processing data at the Terabyte level, we’re hoping to create a framework that is language-agnostic to some extent. So if we have some great machine learning code in C, for example, we’ll use Clojure to tie everything together, but the code that does the heavy lifting will still be in C. For the API and Web layers, we use Python and Django, and Justin is a huge fan of HaXe for our visualizations.

Ajay- Current support is for Decision Trees. When can we see SVM, K Means Clustering and Logit Regression?

Charlie- Right now we’re focused on perfecting our infrastructure and giving you new ways to put data in the system, but expect to see more algorithms appearing in the next few months. We want to make sure they are as beautiful and easy to use as the trees are. Without giving too much away, the first new thing we will probably introduce is an ensemble method of some sort (such as Boosting or Bagging). Clustering is a little further away but we’ll get there soon!

Ajay- How can we use the BigML.com API using R and Python?

Charlie- We have a public github repo for the language bindings: https://github.com/bigmlcom/io Right now, there are only bash scripts but that should change very soon. The Python bindings should be there in a matter of days, and the R bindings in probably a week or two. Clojure and Java bindings should follow shortly after that. We’ll have a blog post about it each time we release a new language binding. http://blog.bigml.com/

Ajay-  How can we predict large numbers of observations using a Model  that has been built and pruned (model scoring)?

Charlie- We are in the process of refactoring our backend right now for better support for batch prediction and model evaluation. This is something that is probably only a few weeks away. Keep your eye on our blog for updates!

Ajay-  How can we export models built in BigML.com for scoring data locally?

Charlie- This is as simple as a call to our API. https://bigml.com/developers/models The call gives you a JSON object representing the tree that is roughly equivalent to a PMML-style representation.
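
As a rough illustration of what local scoring from such a JSON tree could look like, here is a minimal Python sketch. The node layout used here (the `predicate`, `children` and `output` keys, and the operator strings) is a hypothetical structure invented for this example, not BigML's actual model format; the real field names are in the API documentation linked above.

```python
import json
import operator

# Comparison operators a split might use (hypothetical encoding).
OPS = {"<=": operator.le, ">": operator.gt, "=": operator.eq}

def predict(node, row):
    """Follow matching predicates down the tree and return the leaf output."""
    for child in node.get("children", []):
        pred = child["predicate"]
        if OPS[pred["op"]](row[pred["field"]], pred["value"]):
            return predict(child, row)
    # No child matched (or this is a leaf): answer with this node's output.
    return node["output"]

# A tiny hand-written tree standing in for an exported model (made-up data).
TREE_JSON = """
{"output": "setosa", "children": [
  {"predicate": {"field": "petal_length", "op": "<=", "value": 2.45},
   "output": "setosa"},
  {"predicate": {"field": "petal_length", "op": ">", "value": 2.45},
   "output": "versicolor", "children": [
     {"predicate": {"field": "petal_width", "op": ">", "value": 1.75},
      "output": "virginica"}]}
]}
"""

tree = json.loads(TREE_JSON)
rows = [{"petal_length": 1.4, "petal_width": 0.2},
        {"petal_length": 5.1, "petal_width": 2.0},
        {"petal_length": 4.0, "petal_width": 1.2}]

# Batch scoring is just a loop over the rows.
predictions = [predict(tree, r) for r in rows]
```

Because the whole model is plain data, scoring a large batch needs nothing more than the loop in the last line.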

About-

You can read about Charlie Parker at http://www.linkedin.com/pub/charles-parker/11/85b/4b5 and the rest of the BigML team at https://bigml.com/team

Easter Eggs in #Rstats

Yes.

Cite-http://en.wikipedia.org/wiki/Easter_egg_(media)

A virtual Easter egg is an intentional hidden message, in-joke, or feature in a work such as a computer program, web page, video game, movie, book, or crossword. The term was coined, according to Warren Robinett, by Atari after they were pointed to the secret message left by Robinett in the game Adventure. It draws a parallel with the custom of the Easter egg hunt observed in many Western nations as well as the last Russian imperial family’s tradition of giving elaborately jeweled egg-shaped creations by Carl Fabergé which contained hidden surprises.

In R.

Cite-http://stackoverflow.com/questions/7910270/are-there-any-easter-eggs-in-base-r-or-in-major-packages

I like this

just type

example(readline)

and these two

on 32 bit R type

memory.limit(4096)

and on any version try four question marks

Perhaps the prettiest eggs are the demos in animation package.

But there is magic in asking for help on internal functions in R

Just type-

?.Internal

and you get the sobering thought that you probably are an R Muggle

Call an Internal Function

Description

.Internal performs a call to an internal code which is built in to the R interpreter.

Only true R wizards should even consider using this function, and only R developers can add to the list of internal functions.

Usage

 .Internal(call)

Arguments

call a call expression

See Also

.Primitive, .External (the nearest equivalent available to users).

I liked that I could see the actual internal functions in svn at http://svn.r-project.org/R/trunk/src/main/names.c

The opening of the internals document floored me.

It must have been a curious year in 2003-4 when the copyright of R was held (briefly it seems) by the R Foundation and also by the R Development Core Team. (which sounds better?)

*  R : A Computer Language for Statistical Data Analysis
 *  Copyright (C) 1995, 1996  Robert Gentleman and Ross Ihaka
 *  Copyright (C) 1997--2012  The R Development Core Team
 *  Copyright (C) 2003, 2004  The R Foundation

My contribution

R help discourages the for loop

Try ??for or ?for

You go into a loop till you hit Escape

If you want more-just write
 .Internal(inspect(ls())) at the end of your  R program.

Google introduces Google Play

Some nice new features from the big G men from Mountain View. Google Play- for movies, games, apps, music and books. Nice to see entertainment is back on Google’s priority list.

See this to read more

https://play.google.com/about/

When will I get Google Play?

About Google Play

Q: What is Google Play?
A: Google Play is a new digital content experience from Google where you can find your favorite music, movies, books, and Android apps and games. It’s your entertainment hub: you can access it from the web or from your Android device or even TV, and all your content is instantly available across all of these devices.

Q: What is your strategy with Google Play?
A: Our goal with Google Play is to bring together all your favorite content in one place that you can access across your devices. Specifically, digital content is fundamental to the mobile experience, so bringing all of this content together in one place for users makes the Android platform even more compelling. We’re also simplifying digital content for Google users – you can go to the Google Play website on your desktop and purchase and experience the latest movies, music and books. With Google Play, we’re giving you a simpler way to get your digital content.

Q: What will the experience be for users? What will happen to my existing account?
A: All content and apps in your existing account will remain in your account, but will transition to Google Play. On your device, the Android Market app icon will become the Google Play store icon. You’ll see “Play Store.” For the movies, books and music apps, you’ll begin to see Play versions of these as well, such as “Play Music,” and “Play Movies.”

Q: When will I get Google Play? What markets is this available in?
A: We’ll be rolling out Google Play globally starting today. On the web, Google Play will be live today. On devices, it will take a few days for the Android Market app to update to the Google Play Store app. The music, books and movies apps will also receive an update today.
Around the globe, Google Play will include Android apps and games. In countries where we have already launched music, books or movies, you will see those categories available in Google Play, too.

Q: I live outside the US. When will I get the books, music or movies verticals? I only see Android apps and games?
A: We want to bring different content categories to as many countries as possible. We’ve already launched movies and books in several countries outside the U.S. and will continue to do so over time, but we don’t have a specific timeline to share.

Q: What types of content are available in my country?

  • Paid Apps: Available in these countries
  • Movies: Available in US, UK, Canada, and Japan
  • eBooks: Available in US, UK, Canada, and Australia
  • Music: Available in US

Q: Does this mean Google Music and the Google eBookstore will cease to exist? What about my account?
A: Both Google Music and the Google eBookstore are now part of Google Play. Your music and your books, including anything you bought, are still there, available to you in Google Play and accessible through your Google account.

Q: Where did my Google eBooks books go? Will I still have access to them?
A: Your books are now part of Google Play. Your books are still there, available to you in your Google Play library and accessible through your Google account.

Q: I don’t use an Android phone, can I still use Google Play?
A: Yes. Google Play is available on any computer with a modern browser at play.google.com. On the web, you can browse and buy books, movies and music. You can read books on the Google Play web reader, listen to music on your computer or watch movies online. Your digital content is all stored in the cloud, so you can access from anywhere using your Google Account.
We’ve also created ways to experience your music and books on other platforms such as the Google Books iOS app.

Q: Why do I not see Google Play yet on my device?
A: Please see our help center article on this here.

Q: How can I contact Google Play consumer support?
A: You can call or email our team here.

Cloud Computing – can be evil

Cloud Computing can be evil because-

1) Most browsers are owned by for-profit corporations. Corporations can be evil, sometimes.

And corporations can go bankrupt. You can back up data locally, but try backing up a corporation.

2) The content on your web page can be changed using translator extensions. This has interesting ramifications, as in George Orwell. You may not even be aware of subtle changes a browser extension introduces in the way your browser renders the HTML, or of words it swaps out based on keywords.

Imagine a new form of language called Politically Correct Truthspeak: it can be in English, but it uses machine learning to learn to substitute politically sensitive words with Govt-sanctioned words.

3) Your DNS and IP settings can be redirected using extensions. This means that if a Govt passes a law, you can be denied access to websites using just the browser, without even involving the ISP.

That’s an extreme scenario of an authoritarian govt creating its own version of MAFIAAFire Redirector.

So how to keep the cloud computer honest? Move some stuff to the desktop.

How to keep desktop computing efficient? Use some more cloud computing.

It is not an OR but an AND function in which some computing can be local, some shared and some in the cloud.

Si?

How to use Bit Torrents

I really liked the software qBittorrent, available from http://www.qbittorrent.org/ I think bit torrents should be the default way of sharing huge content, especially software downloads. For protecting intellectual property there should be much better codes and software keys than presently available.

The qBittorrent project aims to provide a Free Software alternative to µtorrent. Additionally, qBittorrent runs and provides the same features on all major platforms (Linux, Mac OS X, Windows, OS/2, FreeBSD).

qBittorrent is based on Qt4 toolkit and libtorrent-rasterbar.

qBittorrent v2 Features

  • Polished µTorrent-like User Interface
  • Well-integrated and extensible Search Engine
    • Simultaneous search in most famous BitTorrent search sites
    • Per-category-specific search requests (e.g. Books, Music, Movies)
  • All Bittorrent extensions
    • DHT, Peer Exchange, Full encryption, Magnet/BitComet URIs, …
  • Remote control through a Web user interface
    • Nearly identical to the regular UI, all in Ajax
  • Advanced control over trackers, peers and torrents
    • Torrents queueing and prioritizing
    • Torrent content selection and prioritizing
  • UPnP / NAT-PMP port forwarding support
  • Available in ~25 languages (Unicode support)
  • Torrent creation tool
  • Advanced RSS support with download filters (inc. regex)
  • Bandwidth scheduler
  • IP Filtering (eMule and PeerGuardian compatible)
  • IPv6 compliant
  • Sequential downloading (aka “Download in order”)
  • Available on most platforms: Linux, Mac OS X, Windows, OS/2, FreeBSD

So if you are new to BitTorrent, here is a brief tutorial.
Some terminology from

Tracker

A tracker is a server that keeps track of which seeds and peers are in the swarm.

Seed

Seed is used to refer to a peer who has 100% of the data. When a leech obtains 100% of the data, that peer automatically becomes a Seed.

Peer

A peer is one instance of a BitTorrent client running on a computer on the Internet to which other clients connect and transfer data.

Leech

Leech is a term with two meanings. Primarily, leech (or leeches) refers to a peer (or peers) who has a negative effect on the swarm by having a very poor share ratio (downloading much more than they upload, creating a ratio less than 1.0).
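
The share ratio mentioned above is simple arithmetic: bytes uploaded divided by bytes downloaded. A minimal Python sketch follows; the function names are made up for this illustration, and the 1.0 threshold is taken from the definition above rather than from any particular client.

```python
def share_ratio(uploaded_bytes, downloaded_bytes):
    """Return uploaded / downloaded; infinity if nothing was downloaded yet."""
    if downloaded_bytes == 0:
        return float("inf")
    return uploaded_bytes / downloaded_bytes

def is_leech(uploaded_bytes, downloaded_bytes, threshold=1.0):
    """A peer counts as a leech here when its ratio falls below the threshold."""
    return share_ratio(uploaded_bytes, downloaded_bytes) < threshold
```

For example, a peer that downloaded 1000 MB and uploaded 500 MB has a ratio of 0.5 and would be flagged by `is_leech`.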

1) Download and install the software from http://www.qbittorrent.org/
2) If you want to search for new files, you can use the built-in search features.
3) If you want to CREATE new torrents, go to Tools - Torrent Creator.
4) For sharing content, just seed the torrent you just created. What is seeding? Hey, did you read the terminology at the beginning?
5) Additionally –
From

Trackers: Below are some popular public trackers. They are servers which help peers to communicate.

Here are some good trackers you can use:

http://open.tracker.thepiratebay.org/announce
http://www.torrent-downloads.to:2710/announce
http://denis.stalker.h3q.com:6969/announce
udp://denis.stalker.h3q.com:6969/announce
http://www.sumotracker.com/announce

and

Super-seeding

When a file is new, much time can be wasted because the seeding client might send the same file piece to many different peers, while other pieces have not yet been downloaded at all. Some clients, like ABC, Vuze, BitTornado, TorrentStorm, and µTorrent have a “super-seed” mode, where they try to only send out pieces that have never been sent out before, theoretically making the initial propagation of the file much faster. However the super-seeding becomes less effective and may even reduce performance compared to the normal “rarest first” model in cases where some peers have poor or limited connectivity. This mode is generally used only for a new torrent, or one which must be re-seeded because no other seeds are available.
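
To make the two piece-selection ideas concrete, here is a simplified Python sketch. It is only an illustration of the principles described above, not how qBittorrent or any other client implements them; the function names and the data layout (a dict mapping peer names to sets of piece indices) are assumptions made for this example.

```python
from collections import Counter

def rarest_first(needed, peer_pieces):
    """Pick the needed piece held by the fewest peers (ties go to the lowest index)."""
    counts = Counter()
    for pieces in peer_pieces.values():
        counts.update(pieces)
    # Only pieces that at least one peer actually has can be requested.
    available = [p for p in needed if counts[p] > 0]
    if not available:
        return None
    return min(available, key=lambda p: (counts[p], p))

def super_seed_offer(num_pieces, already_sent):
    """Super-seed idea: offer the lowest-numbered piece never sent out before."""
    for piece in range(num_pieces):
        if piece not in already_sent:
            return piece
    return None  # every piece has been sent at least once

# Example swarm: piece 0 is common, pieces 2 and 3 are rare.
swarm = {"a": {0, 1, 2}, "b": {0, 1}, "c": {0, 3}}
next_piece = rarest_first({0, 1, 2, 3}, swarm)
```

Real super-seeding also tracks which peer received which piece and waits to see it spread before offering that peer another one; the sketch leaves that bookkeeping out.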

Note- you use this tutorial and any or all steps at your own risk. I am not legally responsible for any mishaps you get into. Please be responsible while being an efficient bit torrenter. That means respecting individual property rights.

Interview Michal Kosinski, Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski, leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source.

Ajay-  Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?  

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

My more recent research, performed during my MPhil course in Psychometrics at Cambridge University, followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I quickly realized how empowering and convenient this tool was.

I believe that a course in R should be obligatory for all students that are likely to come close to any data analysis in their careers. It is really empowering – once you have the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs, each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch and publish them online. And all this without having to hire software developers.

Ajay- Why did you choose R as the background for Concerto? What other languages and platforms did you consider? Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay- What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to the R engine with a built-in MySQL database and an easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such as my favourite Survey Monkey), but has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal – I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at the Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

You can read more about Concerto at http://code.google.com/p/concerto-platform/ and http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Moving from OpenDNS to Google DNS

It is best to use a DNS resolution service to avoid targeted attacks on your machine, especially if you use the browser a lot. It is also quite fast, and it takes 2 minutes to set up, even for non-geeks.

I was getting slower browsing speeds on OpenDNS http://www.opendns.com/

so I switched to Google DNS (though I am not sure how people in Iran and China, who have a much greater need for DNS verification services, will get secure resolution of DNS)

http://code.google.com/speed/public-dns/

What is Google Public DNS?

Google Public DNS is a free, global Domain Name System (DNS) resolution service, that you can use as an alternative to your current DNS provider.

To try it out:

  • Configure your network settings to use the IP addresses 8.8.8.8 and 8.8.4.4 as your DNS servers or
  • Read our configuration instructions.

New! For IPv6 addresses, see our configuration instructions.

If you decide to try Google Public DNS, your client programs will perform all DNS lookups using Google Public DNS.

Why does DNS matter?

The DNS protocol is an important part of the web’s infrastructure, serving as the Internet’s phone book: every time you visit a website, your computer performs a DNS lookup. Complex pages often require multiple DNS lookups before they start loading, so your computer may be performing hundreds of lookups a day.
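
To see what one of those lookups looks like on the wire, here is a minimal Python sketch that builds a standard DNS query for an A record and, optionally, sends it over UDP to 8.8.8.8. The function names and the fixed query ID are choices made for this illustration; the packet layout follows the standard DNS message format. Parsing the response is left out to keep the sketch short.

```python
import socket
import struct

GOOGLE_DNS = ("8.8.8.8", 53)  # resolver address quoted in the text above

def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack(">HH", 1, 1)

def lookup_raw(hostname, server=GOOGLE_DNS, timeout=3.0):
    """Send the query over UDP and return the raw response bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), server)
        data, _ = sock.recvfrom(512)
        return data

packet = build_query("example.com")
```

Calling `lookup_raw("example.com")` needs network access; in everyday use the operating system's resolver does all of this for you once 8.8.8.8 and 8.8.4.4 are set as your DNS servers.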

Why should you try Google Public DNS?

By using Google Public DNS you can: