Webscraping using iMacros

The noted Diamonds dataset in the ggplot2 package of R is actually culled from the website http://www.diamondse.info/diamond-prices.asp

However it has ~55000 diamonds, while the whole Diamonds search engine has almost ten times that number. Using iMacros – a Google Chrome Plugin, we can scrape that data (or almost any data). The iMacros chrome plugin is available at  https://chrome.google.com/webstore/detail/cplklnmnlbnpmjogncfgfijoopmnlemp while notes on coding are at http://wiki.imacros.net

Imacros makes coding as easy as recording macro and the code is automatcially generated for whatever actions you do. You can set parameters to extract only specific parts of the website, and code can be run into a loop (of 9999 times!)

Here is the iMacros code-Note you need to navigate to the web site http://www.diamondse.info/diamond-prices.asp before running it

VERSION BUILD=5100505 RECORDER=CR
FRAME F=1
SET !EXTRACT_TEST_POPUP NO
SET !ERRORIGNORE YES
TAG POS=6 TYPE=TABLE ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:paginate_enabled_next
SAVEAS TYPE=EXTRACT FOLDER=* FILE=test+3

 

 

 

 

 

 

 

 

 

and voila- all the diamonds you need to analyze!

The returning data can be read using the standard delimiter data munging in the language of SAS or R.

More on IMacros from

https://chrome.google.com/webstore/detail/cplklnmnlbnpmjogncfgfijoopmnlemp/details

Description

Automate your web browser. Record and replay repetitious work

If you encounter any problems with iMacros for Chrome, please let us know in our Chrome user forum at http://forum.iopus.com/viewforum.php?f=21

Our forum is also the best place for new feature suggestions :-)
----

iMacros was designed to automate the most repetitious tasks on the web. If there’s an activity you have to do repeatedly, just record it in iMacros. The next time you need to do it, the entire macro will run at the click of a button! With iMacros, you can quickly and easily fill out web forms, remember passwords, create a webmail notifier, and more. You can keep the macros on your computer for your own use, use them within bookmark sync / Xmarks or share them with others by embedding them on your homepage, blog, company Intranet or any social bookmarking service as bookmarklet. The uses are limited only by your imagination!

Popular uses are as web macro recorder, form filler on steroids and highly-secure password manager (256-bit AES encryption).


FaceBook IPO- Who hacked whom?

Some thoughts on the FB IPO-

1) Is Zuck reading emails on his honeymoon? Where is he?

2) In 3 days FB lost 34 billion USD in market valuation. Thats enough to buy AOL,Yahoo, LinkedIn and Twitter (combined)

3) People are now shorting FB based on 3-4 days of trading performance. Maybe they know more ARIMA !

4) Who made money on the over-pricing in terms on employees who sold on 1 st day, financial bankers who did the same?

5) Who lost money on the first three days due to Nasdaq’s problems?

6) What is the exact technical problem that Nasdaq had?

7) The much deplored FaceBook Price/Earnings ratio (99) is still comparable to AOL’s (85) and much less than LI (620!). see http://www.google.com/finance?cid=296878244325128

8) Maybe FB can stop copying Google’s ad model (which Google invented) and go back to the drawing table. Like a FB kind of Paypal

9) There are more experts on the blogosphere than experts in Wall Street.

10) No blogger is willing to admit that they erred in the optimism on the great white IPO hope.

I did. Mea culpa. I thought FB is a good stock. I would buy it still- but the rupee tanked by 10% since past 1 week against the dollar.

 

I am now waiting for Chinese social network market to open with IPO’s. Thats walled gardens within walled gardens of Jade and Bamboo.

Related- Art Work of Another 100 billion dollar company (2006)

Visualizing Bigger Data in R using Tabplot

The amazing tabplot package creates the tableplot feature for visualizing huge chunks of data. This is a great example of creative data visualization that is resource lite and extremely fast in a first look at the data. (note- The tabplot package is being used and table plot function is being used . The TABLEPLOT package is different and is NOT being used here).

library(ggplot2)
data(diamonds)
library(tabplot)
tableplot(diamonds)
system.time(tableplot(diamonds))

visualizing a 50000 row by 10 variable dataset in 0.7 s is fast !!

click on screenshot to see it

and some say R is slow 😉

 

Note I used a free Windows Amazon EC2 Instance for it-

See screenshot for hardware configuration

 

the best thing is there is a handy GTK GUI for this package. You can check it out at

 

 

New RCommander with ggplot #rstats

 

My favorite GUI (or one of them) R Commander has a relatively new plugin called KMGGplot2. Until now Deducer was the only GUI with ggplot features , but the much lighter and more popular R Commander has been a long champion in people wanting to pick up R quickly.

 

http://cran.r-project.org/web/packages/RcmdrPlugin.KMggplot2/

RcmdrPlugin.KMggplot2: Rcmdr Plug-In for Kaplan-Meier Plot and Other Plots by Using the ggplot2 Package

 

As you can see by the screenshot- it makes ggplot even easier for people (like R  newbies and experienced folks alike)

 

This package is an R Commander plug-in for Kaplan-Meier plot and other plots by using the ggplot2 package.

Version: 0.1-0
Depends: R (≥ 2.15.0), stats, methods, grid, Rcmdr (≥ 1.8-4), ggplot2 (≥ 0.9.1)
Imports: tcltk2 (≥ 1.2-3), RColorBrewer (≥ 1.0-5), scales (≥ 0.2.1), survival (≥ 2.36-14)
Published: 2012-05-18
Author: Triad sou. and Kengo NAGASHIMA
Maintainer: Triad sou. <triadsou at gmail.com>
License: GPL-2
CRAN checks: RcmdrPlugin.KMggplot2 results

 

----------------------------------------------------------------
NEWS file for the RcmdrPlugin.KMggplot2 package
----------------------------------------------------------------

----------------------------------------------------------------

Changes in version 0.1-0 (2012-05-18)

 o Restructuring implementation approach for efficient
   maintenance.
 o Added options() for storing package specific options (e.g.,
   font size, font family, ...).
 o Added a theme: theme_simple().
 o Added a theme element: theme_rect2().
 o Added a list box for facet_xx() functions in some menus
   (Thanks to Professor Murtaza Haider).
 o Kaplan-Meier plot: added confidence intervals.
 o Box plot: added violin plots.
 o Bar chart for discrete variables: deleted dynamite plots.
 o Bar chart for discrete variables: added stacked bar charts.
 o Scatter plot matrix: added univariate plots at diagonal
   positions (ggplot2::plotmatrix).
 o Deleted the dummy data for histograms, which is large in
   size.

----------------------------------------------------------------

Changes in version 0.0-4 (2011-07-28)

 o Fixed "scale_y_continuous(formatter = "percent")" to
   "scale_y_continuous(labels = percent)" for ggplot2
   (>= 0.9.0).
 o Fixed "legend = FALSE" to "show_guide = FALSE" for
   ggplot2 (>= 0.9.0).
 o Fixed the DESCRIPTION file for ggplot2 (>= 0.9.0) dependency.

----------------------------------------------------------------

Changes in version 0.0-3 (2011-07-28; FIRST RELEASE VERSION)

 o Kaplan-Meier plot: Show no. at risk table on outside.
 o Histogram: Color coding.
 o Histogram: Density estimation.
 o Q-Q plot: Create plots based on a maximum likelihood estimate
   for the parameters of the selected theoretical distribution.
 o Q-Q plot: Create plots based on a user-specified theoretical
   distribution.
 o Box plot / Errorbar plot: Box plot.
 o Box plot / Errorbar plot: Mean plus/minus S.D.
 o Box plot / Errorbar plot: Mean plus/minus S.D. (Bar plot).
 o Box plot / Errorbar plot: 95 percent Confidence interval
   (t distribution).
 o Box plot / Errorbar plot: 95 percent Confidence interval
   (bootstrap).
 o Scatter plot: Fitting a linear regression.
 o Scatter plot: Smoothing with LOESS for small datasets or GAM
   with a cubic regression basis for large data.
 o Scatter plot matrix: Fitting a linear regression.
 o Scatter plot matrix: Smoothing with LOESS for small datasets
   or GAM with a cubic regression basis for large data.
 o Line chart: Normal line chart.
 o Line chart: Line char with a step function.
 o Line chart: Area plot.
 o Pie chart: Pie chart.
 o Bar chart for discrete variables: Bar chart for discrete
   variables.
 o Contour plot: Color coding.
 o Contour plot: Heat map.
 o Distribution plot: Normal distribution.
 o Distribution plot: t distribution.
 o Distribution plot: Chi-square distribution.
 o Distribution plot: F distribution.
 o Distribution plot: Exponential distribution.
 o Distribution plot: Uniform distribution.
 o Distribution plot: Beta distribution.
 o Distribution plot: Cauchy distribution.
 o Distribution plot: Logistic distribution.
 o Distribution plot: Log-normal distribution.
 o Distribution plot: Gamma distribution.
 o Distribution plot: Weibull distribution.
 o Distribution plot: Binomial distribution.
 o Distribution plot: Poisson distribution.
 o Distribution plot: Geometric distribution.
 o Distribution plot: Hypergeometric distribution.
 o Distribution plot: Negative binomial distribution.

Happy $100 Billion to Mark Zuckerberg Productions !

Heres to an expected $100 billion market valuation to the latest Silicon Valley Legend, Facebook- A Mark Zuckerberg Production.

Some milestones that made FB what it is-

1) Beating up MySpace, Ibibo, Google Orkut combined

2) Smart timely acquisitions from Friend feed , to Instagram

3) Superb infrastructure for 900 million accounts, fast interface rollouts, and a policy of never deleting data. Some of this involved creating new technology like Cassandra. There have been no anti-trust complaints against FB’s behavior particularly as it simply stuck to being the cleanest interface offering a social network

4) Much envied and copied features like Newsfeed, App development on the FB platform, Social Gaming as revenue streams

5) Replacing Google as the hot techie employer, just like Google did to Microsoft.

6) An uncanny focus, including walking away from a billion dollars from Yahoo,resisting Google, Apple’s Ping, imposing design changes unilaterally, implementing data sharing only with flexible partners  and strategic investors (like Bing)

FB has made more money for more people than any other company in the past ten years. Here’s wishing it an even more interesting next ten years! With 900 million users if they could integrate a PayPal like system, or create an alternative to Adsense for content creators, they could create an all new internet economy – one which is more open than the Google dominated internet ; 0

 

BigML meets R #rstats

I am just checking the nice new R package created by BigML.com co-founder Justin Donaldson. The name of the new package is bigml, which can confuse a bit since there do exist many big suffix named packages in R (including biglm)

The bigml package is available at CRAN http://cran.r-project.org/web/packages/bigml/index.html

I just tweaked the code given at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/ to include the ssl authentication code at http://www.brocktibert.com/blog/2012/01/19/358/

so it goes

> library(bigml)
Loading required package: RJSONIO
Loading required package: RCurl
Loading required package: bitops
Loading required package: plyr
> setCredentials(“bigml_username”,”API_key”)

# download the file needed for authentication
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

# set the curl options
curl <- getCurlHandle()
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
package = "RCurl"),
ssl.verifypeer = FALSE))
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

> iris.model = quickModel(iris, objective_field = ‘Species’)

Of course there are lots of goodies added here , so read the post yourself at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/

Incidentally , the author of this R package (bigml) Justin Donalsdon who goes by name sudojudo at http://twitter.com/#!/sudojudo has also recently authored two other R packages including tsne at  http://cran.r-project.org/web/packages/tsne/index.html (tsne: T-distributed Stochastic Neighbor Embedding for R (t-SNE) -A “pure R” implementation of the t-SNE algorithm) and a GUI toolbar http://cran.r-project.org/web/packages/sculpt3d/index.html (sculpt3d is a GTK+ toolbar that allows for more interactive control of a dataset inside the RGL plot window. Controls for simple brushing, highlighting, labeling, and mouseMode changes are provided by point-and-click rather than through the R terminal interface)

This along with the fact the their recently released python bindings for bigml.com was one of the top news at Hacker News- shows bigML.com is going for some traction in bringing cloud computing, better software interfaces and data mining together!