Visualizing Bigger Data in R using Tabplot

The amazing tabplot package provides the tableplot() function for visualizing huge chunks of data. It is a great example of creative data visualization that is light on resources and extremely fast for a first look at the data. (Note: this post uses the tabplot package and its tableplot() function. The TABLEPLOT package is different and is NOT used here.)

library(ggplot2)
data(diamonds)
library(tabplot)
tableplot(diamonds)
system.time(tableplot(diamonds))

Visualizing a dataset of about 54,000 rows by 10 variables in 0.7 s is fast!


and some say R is slow 😉
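For the curious, here is a rough base-R sketch of the idea behind a tableplot as I understand it: sort the rows by one column, cut them into equal-sized bins, and summarize every numeric column per bin. This is only an illustration and not tabplot's actual code. The function name bin_means and the toy data are made up for this example, and tabplot itself also handles categorical columns, which this sketch ignores.

```r
# Sketch only: bin_means() is a hypothetical helper, not part of tabplot.
bin_means <- function(df, sort_col, n_bins = 100) {
  # Sort rows by the chosen column, largest values first
  ord <- order(df[[sort_col]], decreasing = TRUE)
  sorted <- df[ord, , drop = FALSE]
  # Assign each sorted row to one of n_bins contiguous, roughly equal-sized bins
  bin <- cut(seq_len(nrow(sorted)), breaks = n_bins, labels = FALSE)
  # Average every numeric column within each bin
  num_cols <- names(sorted)[sapply(sorted, is.numeric)]
  aggregate(sorted[num_cols], by = list(bin = bin), FUN = mean)
}

# Toy data standing in for a large table (made-up values)
set.seed(1)
toy <- data.frame(price = rexp(5000, rate = 1 / 4000),
                  carat = runif(5000, min = 0.2, max = 3))
res <- bin_means(toy, "price", n_bins = 10)
print(res)
```

Because each bin is reduced to a handful of numbers, the plot only has to draw one bar per bin and column, which is presumably why it stays quick on bigger tables.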

 

Note: I used a free Windows Amazon EC2 instance for this.

See screenshot for hardware configuration

 

The best thing is that there is a handy GTK GUI for this package. You can check it out at

 

 

Web Analytics using R , Google Analytics and TS Forecasting

This is a continuation of the previous post on using Google Analytics.

Now that we have downloaded and plotted the data, we try to fit a time series model to the website data to forecast future traffic.

Some observations-

1) Google Analytics has no predictive analytics; it offers only descriptive analytics and data visualization (including the recent social analytics). However, you can easily add basic time series (TS) functions by using R with the GA API.

Why do people look at website analytics? To know today’s traffic and derive insights for the future.

2) Web data clearly follows a 7-day peak-and-trough pattern from weekly effects (weekdays versus weekends). A similar cycle shows up in hourly data, and this can be used to smooth historic web data for future forecasts.

3) On an advanced level, any hugely popular viral post can be treated as a level shift (not drift) and dampened accordingly.

Test and Control!

Similarly, using ARIMAX, we can factor in the quantity and tags of posts as X regressor variables.
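To make the ARIMAX idea concrete, here is a minimal sketch on made-up data. It uses base R's arima() with its xreg argument as a stand-in (regression with AR(1) errors), which is an assumption on my part and not the code from this series. The variable names (visits, posts, dow), the simulated traffic and the assumed two posts a day in the forecast week are all hypothetical.

```r
# Sketch only: simulated daily traffic, not real Google Analytics data.
set.seed(42)
n <- 280                                    # 40 weeks of made-up daily data
dow <- factor(rep(1:7, length.out = n))     # day of week, 1..7
posts <- rpois(n, lambda = 2)               # hypothetical posts published per day
weekly <- c(0, 20, 30, 30, 20, -40, -60)[as.integer(dow)]  # weekday/weekend effect
noise <- as.numeric(arima.sim(list(ar = 0.5), n = n, sd = 20))
visits <- 500 + weekly + 50 * posts + noise # each post adds about 50 visits

# X regressors: number of posts plus day-of-week dummies (first day is the baseline)
X <- cbind(posts = posts, model.matrix(~ dow)[, -1])
fit <- arima(visits, order = c(1, 0, 0), xreg = X)
print(round(coef(fit), 2))

# Forecast the next week, assuming 2 posts a day (an assumption, not data)
future_dow <- factor(1:7, levels = 1:7)
newX <- cbind(posts = rep(2, 7), model.matrix(~ future_dow)[, -1])
fc <- predict(fit, n.ahead = 7, newxreg = newX)
print(round(fc$pred))
```

The coefficient on posts should come out close to the 50 used in the simulation, which is the kind of sanity check worth doing before trusting the same model on real traffic.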

And now the code (don't laugh at the simplicity please, I am just tinkering and playing with data here!).

You need to copy and paste the code at the bottom of this post, http://www.decisionstats.com/using-google-analytics-with-r/, if you want to download your GA data first.

Note: I am using the lubridate, forecast and timeSeries packages in this section.

#Plotting the Traffic
plot(ga.data$data[,2],type="l")

library(timeSeries)
library(forecast)

#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)
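# Daily data with a weekly cycle, so set frequency = 7
# (frequency() of a plain vector is 1, which makes decompose() fail later)
dataset2 <- ts(dataset1$ga.visitors,start=0,frequency = 7)
str(dataset2)
head(dataset2)

# Note I am splitting the data into test and control here.
# Subsetting with [ ] drops the ts attributes, so rebuild them.
ts.test=ts(dataset2[1:200],frequency = 7)
ts.control=ts(dataset2[201:275],frequency = 7)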

fitets=ets(ts.test)
plot(fitets)
testets=ets(ts.control,model=fitets)
accuracy(testets)
plot(testets)
spectrum(ts.test,method='ar')
decompose(ts.test)

library("TTR")
# Simple moving average over every 7 days. This can be 24 hours for hourly data,
# 30 days for daily data in a month-to-month comparison, or 12 months for annual data.
bb=SMA(dataset2,n=7)
# Web analytics needs smoothing over every 7 days, as traffic is related to
# weekdays/weekends and to the same time last week.
head(dataset2,40)
head(bb,40)

par(mfrow=c(2,1))
plot(bb,type="l",main="Using Seven Day Moving Average for Web Visitors")
plot(dataset2,main="Original Data")
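If you would rather not load TTR, a trailing 7-day moving average can be sketched with base R's stats::filter(). As far as I know this gives the same numbers as SMA(x, n = 7): NA for the first six values, then the mean of the latest seven. The helper name trailing_ma and the visitor counts below are made up for illustration.

```r
# Sketch only: trailing_ma() is a hypothetical helper, visits is made-up data.
trailing_ma <- function(x, n = 7) {
  # sides = 1 uses only the current and past values (a trailing window)
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Two made-up weeks of daily visitors with a weekend dip
visits <- c(120, 135, 150, 145, 130, 80, 70,
            125, 140, 155, 150, 135, 85, 75)
ma7 <- trailing_ma(visits, n = 7)
print(ma7)
```

Changing n to 24 or 30 gives the hourly or monthly windows mentioned in the comments above.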


Though I still wonder why the GA R code/package could not run on the cloud (why does it need to be downloaded?). Cloud computing, Gs?

Also how about adding some MORE predictive analytics to Google Analytics, chaps!

To be continued-

auto.arima() and forecasts!!!

cross validations!!!

and adapting the idiosyncratic periods and cycles  of web analytics to time series !!

Top 5 XKCD on Data Visualization

By request, an analysis of the top 5 XKCDs on data visualization. Statisticians and data scientists, take note:

1) DOT PLOT

 

2)  LINE PLOTS

3) FLOW CHARTS

4) PIE CHARTS and 5) BAR GRAPHS

I am not going into the really big graphs, of course, like the Star Wars plot data visualization at http://xkcd.com/657/ or the Money Chart at http://xkcd.com/980/, because I don't believe in data visualization to show off, but in keeping it simple 🙂

Now I gotta find me a software that can write my blog for me 🙂

Analytics for Cyber Conflict

 

The emerging use of Analytics and Knowledge Discovery in Databases for Cyber Conflict and Trade Negotiations

 

This blog post is the first in a series of articles on cyber conflict and the use of analytics for targeting in both offense and defense in conflict situations.

 

It covers knowledge discovery in four kinds of databases (chosen because of their perceived importance, sensitivity and criticality to the functioning of the geopolitical economic system):

  1. Databases on unique identifiers, including next-generation biometric databases connected to government initiatives and banking, and current-generation databases of identifiers like government-issued documents made available online
  2. Databases on financial details. This includes not only traditional financial service providers but also online databases with payment details collected by retail corporations selling products, like Sony’s PlayStation Network, Microsoft’s Xbox and
  3. Databases on contact details, including marketing databases and contact details collected by offline businesses
  4. Databases on social behavior, primarily collected by online businesses like Facebook and other social media platforms.

It examines the role of

  1. voluntary privacy safeguards and government regulations,

  2. weak cryptographic security of databases,

  3. weakness in balancing marketing (maximized data) with privacy (minimized data),

  4. and lastly, ownership patterns in database-owning corporations.

A small distinction between cyber crime and cyber conflict is that cyber crime focuses on stealing data, intellectual property and information primarily to maximize economic gains, while cyber conflict focuses on stealing information and also on disrupting the effective working of database-backed systems in order to gain notional competitive advantages in economics as well as geopolitics. Cyber terrorism is basically cyber conflict by non-state agents or by designated terrorist states, as defined by the regulations of the “target” entity. A cyber attack is an offensive action against cyber-infrastructure (like the Stuxnet worm that disabled uranium enrichment centrifuges in Iran). Cyber attacks and cyber terrorism are out of the scope of this paper; we will concentrate on cyber conflicts involving databases.

Some examples are given here-

Types of Knowledge Discovery in –

1) Databases on Unique Identifiers- including biometric databases.

Unique identifiers, or primary keys, for identifying people are critical for any intensive knowledge discovery program. The unique identifier generated must be extremely secure, and not liable to reverse engineering of the cryptographic hash function.

For biometric databases, an interesting possibility could be determining ethnic identity from biometric information, and also mapping relatives. The biometric information currently collected is fingerprint data, iris data and facial data. A further feature could be adding voice data to biometric databases.

This is subject to obvious privacy safeguards.

For example, Google recently unveiled facial recognition to unlock Android 4.0 mobiles, only to find out that the security feature could easily be bypassed by using a photo of the owner.

 

 

Example of Biometric Databases

In Afghanistan, more than 2 million Afghans have contributed iris, fingerprint and facial data to a biometric database. In India, 121 million people have already been enrolled in the largest biometric database in the world. More than half a million customers of the Tokyo Mitsubishi Bank are already using biometric verification at ATMs.

Examples of Breached Online Databases

In 2011, Sony’s PlayStation Network (PSN) lost the data of 77 million customers, including personal information and credit card information. Additionally, the data of 24 million customers was lost by Sony Online Entertainment. The websites of open source platforms like SourceForge, WineHQ and Kernel.org were also broken into in 2011. Even retailers like McDonald’s and Walgreens reported database breaches.

 

The role of cyber conflict arises in the following cases-

  1. Databases are online for access and authentication by proper users. Databases can be breached remotely by non-owners (or “perpetrators”) with a much lower chance of intruder identification, detection and penalization by regulators or law enforcers (or “protectors”) than in offline modes of intellectual property theft.

  2. Databases are valuable to external agents (or “sponsors”) who subsidize the perpetrators (with finance, technology, information, motivation) for intellectual property theft. Databases contain information that can be used to disrupt the functioning of a particular economy or corporation (or “primary targets”), or for further chain or domino effects in accessing other data (or “secondary targets”).

  3. Loss of data is more expensive to database owners than the enhanced cost of security.

  4. Loss of data is more disruptive to the people whose data is contained within the database (or “customers”).

So the roles played by different people for these kinds of databases consist of:

1) Customers- who are in the database

2) Owners -who own the database. They together form the primary and secondary targets.

3) Protectors- who help customers and owners secure the databases.

and

1) Sponsors- who benefit from the theft or disruption of the database

2) Perpetrators- who execute the actual theft and disruption in the database

The use of topic models such as LDA for data reduction on text is well known, as is the use of data visualization, including visualization tied to GPS-based location data, for investigative purposes. But the increasing complexity of data generation and the growing sophistication of machine-learning-driven data processing make this an interesting area to watch.

 

 

The next article in this series will cover the kinds of algorithms that are currently used or being proposed for cyber conflict, the role of non-state agents, and the precautions that practitioners of knowledge discovery in databases can take to avoid breaches of security, ethics, and regulation.

Citations-

  1. Michael A. Vatis, Cyber Attacks During the War on Terrorism: A Predictive Analysis, Institute for Security Technology Studies, Dartmouth College.
  2. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases.

Graphs in Statistical Analysis

One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F. J. Anscombe, available at http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

It has probably the most elegant introduction to an advanced statistical analysis paper that I have ever seen-

1. Usefulness of graphs

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:

(1) numerical calculations are exact, but graphs are rough;

(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;

(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Of course, the dataset makes it very, very interesting for people who don't like graphical analysis too much.

From http://en.wikipedia.org/wiki/Anscombe%27s_quartet

 The x values are the same for the first three datasets.

Anscombe’s Quartet
     I              II            III            IV
   x      y      x      y      x      y      x      y
10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
 8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
 9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
 6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
 4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
 7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
 5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

For all four datasets:

Property                                   Value
Mean of x in each case                     9 (exact)
Variance of x in each case                 11 (exact)
Mean of y in each case                     7.50 (to 2 decimal places)
Variance of y in each case                 4.122 or 4.127 (to 3 d.p.)
Correlation between x and y in each case   0.816 (to 3 d.p.)
Linear regression line in each case        y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)
But see the graphical analysis –
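You can check both the numbers and the pictures yourself, since the quartet ships with base R as the anscombe data frame (columns x1 to x4 and y1 to y4). The sketch below recomputes the summary statistics and then draws the four scatter plots with their fitted lines. The helper name summarise_pair is my own, made up for this example.

```r
# anscombe is part of base R's datasets package
data(anscombe)

# Hypothetical helper: summary statistics for the i-th x/y pair
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
}

stats <- t(sapply(1:4, summarise_pair))
rownames(stats) <- c("I", "II", "III", "IV")
print(round(stats, 3))

# Nearly identical numbers, four very different pictures
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", rownames(stats)[i]),
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x))
}
par(op)
```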
While R has always been great at emphasizing graphical analysis, thanks in part to work by H. Wickham and others, the SAS products and language have also modified their approach; see http://www.sas.com/technologies/analytics/statistics/datadiscovery/
 SAS Visual Data Discovery combines top-selling SAS products (Base SAS®, SAS/STAT® and SAS/GRAPH®), along with two interfaces (SAS® Enterprise Guide® for guided tasks and batch analysis and JMP® software for discovery and exploratory analysis).
 and ODS Statistical Graphs at
While ODS Statistical Graphs is still not as smooth as, say, R’s ggplot2 (http://tinyurl.com/ggplot2-book), it is still a progressive step.
Pretty graphs make for better decisions too!

 

 

Amazing IBM Tech Trends 2011 report

I was reading the amazing Tech Trends 2011 report by IBM at https://www.ibm.com/developerworks/mydeveloperworks/files/app/person/060001TJG2/file/110ccd08-25d9-4932-9bcc-c583868c9f31

What really amazed me were the distortions introduced in the data visualization, even in the lengths of the graphs.

See below: my notes are in black font, and they refer to the length of the weird green bar(?). This, I think, is one of the worst graphs I have seen this year.

 

 

Updated Interview Elissa Fink -VP Tableau Software

Here is an interview with Elissa Fink, VP Marketing of that new wonderful software called Tableau that makes data visualization so nice and easy to learn and work with.

Elissa Fink, VP, Marketing

Ajay- Describe your career journey from high school to 20-plus years in marketing. What are the various trends that you have seen come and go in marketing?

Elissa- I studied literature and linguistics in college and didn’t discover analytics until my first job selling advertising for the Wall Street Journal. Oddly enough, the study of linguistics is not that far from decision analytics: they both are about taking a structured view of information and trying to see and understand common patterns. At the Journal, I was completely captivated analyzing and comparing readership data. At the same time, the idea of using computers in marketing was becoming more common. I knew that the intersection of technology and marketing was going to radically change things – how we understand consumers, how we market and sell products, and how we engage with customers. So from that point on, I’ve always been focused on technology and marketing, whether it’s working as a marketer at technology companies or applying technology to marketing problems for other types of companies.  There have been so many interesting trends. Taking a long view, a key trend I’ve noticed is how marketers work to understand, influence and motivate consumer behavior. We’ve moved marketing from where it was primarily unpredictable, qualitative and aimed at talking to mass audiences, where the advertising agency was king. Now it’s a discipline that is more data-driven, quantitative and aimed at conversations with individuals, where the best analytics wins. As with any trend, the pendulum swings far too much to either side causing backlashes but overall, I think we are in a great place now. We are using data-driven analytics to understand consumer behavior. But pure analytics is not the be-all, end-all; good marketing has to rely on understanding human emotions, intuition and gut feel – consumers are far from rational so taking only a rational or analytical view of them will never explain everything we need to know.

Ajay- Do you think technology companies are still dominated by men? How have you seen diversity evolve over the years? What initiatives has Tableau taken for both hiring and retaining great talent?

Elissa- The thing I love about the technology industry is that its key success metrics – inventing new products that rapidly gain mass adoption in pursuit of making profit – are fairly objective. There’s little subjective nature to the counting of dollars collected selling a product and dollars spent building a product. So if a female can deliver a better product and bigger profits faster and better, then that female is going to get the resources, jobs, power and authority to do exactly that. That’s not to say that the technology industry is gender-blind, race-blind, etc. It isn’t – technology is far from perfect. For example, the industry doesn’t have enough diversity in positions of power. But I think overall, in comparison to a lot of other industries, it’s pretty darn good at giving people with great ideas the opportunities to realize their visions regardless of their backgrounds or characteristics.

At Tableau, we are very serious about bringing in and developing talented people – they are the key to our growth and success. Hiring is our #1 initiative so we’ve spent a lot of time and energy both on finding great candidates and on making Tableau a place that they want to work. This includes things like special recruiting events, employee referral programs, a flexible work environment, fun social events, and the rewards of working for a start-up. Probably our biggest advantage is the company itself – working with people you respect on amazing, cutting-edge products that delight customers and are changing the world is all too rare in the industry but a reality at Tableau. One of our senior software developers put it best when he wrote “The emphasis is on working smarter rather than longer: family and friends are why we work, not the other way around. Tableau is all about happy, energized employees executing at the highest level and delivering a highly usable, high quality, useful product to our customers.” People who want to be at a place like that should check out our openings at http://www.tableausoftware.com/jobs.

Ajay- What are the most notable features in Tableau’s latest edition? What are the principal software products that compete with Tableau Software’s products, and how would you say Tableau compares with them?

Elissa- Tableau 6.1 will be out in July and we are really excited about it for 3 reasons.

First, we’re introducing our mobile business intelligence capabilities. Our customers can have Tableau anywhere they need it. When someone creates an interactive dashboard or analytical application with Tableau and it’s viewed on a mobile device, an iPad in particular, the viewer will have a native, touch-optimized experience. No trying to get your fingertips to act like a mouse. And the author didn’t have to create anything special for the iPad; she just creates her analytics the usual way in Tableau. Tableau knows the dashboard is being viewed on an iPad and presents an optimized experience.

Second, we’ve taken our in-memory analytics engine up yet another level. Speed and performance are better, and now people can update data incrementally and rapidly. Introduced in 6.0, our data engine makes any data fast in just a few clicks. We don’t run out of memory like other applications. So if I build an incredible dashboard on my 8-gig RAM PC and you try to use it on your 2-gig RAM laptop, no problem.

And, third, we’re introducing more features for the international markets – including French and German versions of Tableau Desktop along with more international mapping options.  It’s because we are constantly innovating particularly around user experience that we can compete so well in the market despite our relatively small size. Gartner’s seminal research study about the Business Intelligence market reported a massive market shift earlier this year: for the first time, the ease-of-use of a business intelligence platform was more important than depth of functionality. In other words, functionality that lots of people can actually use is more important than having sophisticated functionality that only specialists can use. Since we focus so heavily on making easy-to-use products that help people rapidly see and understand their data, this is good news for our customers and for us.

Ajay- Cloud computing is the next big thing, with everyone having a cloud version of their software. So how would you run cloud versions of Tableau Server (say, deploying it on Amazon EC2 or a private cloud)?

Elissa- In addition to the usual benefits espoused about Cloud computing, the thing I love best is that it makes data and information more easily accessible to more people. Easy accessibility and scalability are completely aligned with Tableau’s mission. Our free product Tableau Public and our product for commercial websites Tableau Digital are two Cloud-based products that deliver data and interactive analytics anywhere. People often talk about large business intelligence deployments as having thousands of users. With Tableau Public and Tableau Digital, we literally have millions of users. We’re serving up tens of thousands of visualizations simultaneously – talk about accessibility and scalability!  We have lots of customers connecting to databases in the Cloud and running Tableau Server in the Cloud. It’s actually not complex to set up. In fact, we focus a lot of resources on making installation and deployment easy and fast, whether it’s in the cloud, on premise or what have you. We don’t want people to have to spend weeks or months on massive roll-out projects. We want it to be minutes, hours, maybe a day or 2. With the Cloud, we see that people can get started and get results faster and easier than ever before. And that’s what we’re about.

Ajay- Describe some of the latest awards that Tableau has been winning. Also, how is Tableau helping universities address the shortage of Business Intelligence and Big Data professionals?

Elissa-Tableau has been very fortunate. Lately, we’ve been acknowledged by both Gartner and IDC as the fastest growing business intelligence software vendor in the world. In addition, our customers and Tableau have won multiple distinctions including InfoWorld Technology Leadership awards, Inc 500, Deloitte Fast 500, SQL Server Magazine Editors’ Choice and Community Choice awards, Data Hero awards, CODiEs, American Business Awards among others. One area we’re very passionate about is academia, participating with professors, students and universities to help build a new generation of professionals who understand how to use data. Data analysis should not be exclusively for specialists. Everyone should be able to see and understand data, whatever their background. We come from academic roots, having been spun out of a Stanford research project. Consequently, we strongly believe in supporting universities worldwide and offer 2 academic programs. The first is Tableau For Teaching, where any professor can request free term-length licenses of Tableau for academic instruction during his or her courses. And, we offer a low-cost Student Edition of Tableau so that students can choose to use Tableau in any of their courses at any time.

Elissa Fink, VP Marketing, Tableau Software

 

Elissa Fink is Tableau Software’s Vice President of Marketing. With 20+ years helping companies improve their marketing operations through applied data analysis, Elissa has held executive positions in marketing, business strategy, product management, and product development. Prior to Tableau, Elissa was EVP Marketing at IXI Corporation, now owned by Equifax. She has also served in executive positions at Tele Atlas (acquired by TomTom), TopTier Software (acquired by SAP), and Nielsen/Claritas. Elissa also sold national advertising for the Wall Street Journal. She’s a frequent speaker and has spoken at conferences including the DMA, the NCDM, Location Intelligence, the AIR National Forum and others. Elissa is a graduate of Santa Clara University and holds an MBA in Marketing and Decision Systems from the University of Southern California.

Elissa first discovered Tableau late one afternoon at her previous company. Three hours later, she was still “at play” with her data. “After just a few minutes using the product, I was getting answers to questions that were taking my company’s programmers weeks to create. It was instantly obvious that Tableau was on a special mission with something unique to offer the world. I just had to be a part of it.”

To know more – read at http://www.tableausoftware.com/

and existing data viz at http://www.tableausoftware.com/learn/gallery

Storm seasons: measuring and tracking key indicators
What’s happening with local real estate prices?
How are sales opportunities shaping up?
Identify your best performing products
Applying user-defined parameters to provide context
Not all tech companies are rocket ships
What’s really driving the economy?
Considering factors and industry influencers
The complete orbit along the inside, or around a fixed circle
How early do you have to be at the airport?
What happens if sales grow but so does customer churn?
What are the trends for new retail locations?
How have student choices changed?
Do patients who disclose their HIV status recover better?
Closer look at where gas prices swing in areas of the U.S.
U.S. Census data shows more women of greater age
Where do students come from and how does it affect their grades?
Tracking customer service effectiveness
Comparing national and local test scores
What factors correlate with high overall satisfaction ratings?
Fund inflows largely outweighed outflows well after the bubble
Which programs are competing for federal stimulus dollars?
Oil prices and volatility
A classic candlestick chart
How do oil, gold and CPI relate to the GDP growth rate?