Tag Archives: syntax
Variations on the line graph include fan charts for time series, which join a line chart of historic data with ranges of future projections. Another common variation is to fit a linear regression or trend line between the two variables and superimpose it on the graph.
The slope of the line chart shows the rate of change at that particular point, and can also be used to highlight areas of discontinuity or irregular change between two variables.
A basic line graph is created by first using the plot() function to plot the points and then the lines() function to draw the lines between the points.
> str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
> plot(cars)
> lines(cars, type="o", pch=20, lty=2, col="green")
> title(main="Example Automobiles", col.main="blue", font.main=2)
An example of a time series forecasting graph, or fan chart, is at http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=51
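The trend-line variation mentioned above can be sketched on the same cars data. This is a minimal example, assuming an ordinary least-squares fit with lm() is the kind of trend line wanted; the object name fit is only illustrative.

```r
# Scatter of the built-in cars data, joined by a dashed line
plot(cars, pch = 20, main = "Speed vs stopping distance")
lines(cars, lty = 2, col = "green")

# Fit a simple linear regression (dist explained by speed) and superimpose it
fit <- lm(dist ~ speed, data = cars)   # 'fit' is an illustrative name
abline(fit, col = "blue", lwd = 2)

# The slope of the trend line is the coefficient on speed
coef(fit)
```

abline() accepts the fitted model directly, so no manual prediction step is needed for a straight line.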
Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction.
Protovis is free and open source, and is a Stanford project. It has been used in the web interface R Node (which I will talk about later).
While Protovis is designed for custom visualization, it is still easy to create many standard chart types. These simpler examples serve as an introduction to the language, demonstrating key abstractions such as quantitative and ordinal scales, while hinting at more advanced features, including stack layout.
Many charting libraries provide stock chart designs, but offer only limited customization; Protovis excels at custom visualization design through a concise representation and precise control over graphical marks. These examples, including a few recreations of unusual historical designs, demonstrate the language’s expressiveness.
Try Protovis today :) http://vis.stanford.edu/protovis/
One of the most frustrating things I had to do while working as a financial business analyst was working with date-time formats in Base SAS. The syntax was simple enough, and SAS was quite good at handling queries to the Oracle database that the client was using, but remembering the different types of formats in the SAS language was a challenge (there was date9., date6, mmddyy and so on).
Date and time variables are particularly important in the financial industry, as almost everything is a variable derived from time (which varies) while other inputs are mostly constants. This includes interest as well as late fees and finance fees.
In R, date and time are handled quite simply.
Use the strptime(x, format) function to convert a character string into a date-time object.
For example, if the variable dob is "01/04/1977", then strptime(dob, "%d/%m/%Y") converts it into a date-time object,
and if the same date is written as "01Apr1977", then
strptime(dob, "%d%b%Y") does the same.
When troubleshooting date and time conversions, remember to write the format codes
%d, %b, %m and %Y in exactly the same order as they appear in the original string. If there are any delimiters such as "-" or "/", enter them at exactly the same positions in the format argument of strptime.
Sys.time() gives you the current date-time, while the function difftime(time1, time2) gives you the time interval between two date-times (say, if you have two columns of date-time variables).
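A minimal sketch of these calls follows. The tz = "UTC" argument, the second date and the object names are assumptions added for illustration, so that the result does not depend on the local time zone.

```r
# tz = "UTC" is an assumption so the results do not depend on the local time zone
dob1 <- strptime("01/04/1977", format = "%d/%m/%Y", tz = "UTC")

# %b matches the abbreviated month name in the current locale (English assumed here)
dob2 <- strptime("01Apr1977", format = "%d%b%Y", tz = "UTC")

# A second, hypothetical date to measure an interval against
due <- strptime("01/05/1977", format = "%d/%m/%Y", tz = "UTC")
gap <- difftime(due, dob1, units = "days")
gap

# Current date-time of the session
now <- Sys.time()
now
```

If the format does not match the string, strptime() returns NA rather than raising an error, so it is worth checking for NA after converting a whole column.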
What are the various formats for inputs in date time?
- %a: Abbreviated weekday name in the current locale. (Also matches full name on input.)
- %A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
- %b: Abbreviated month name in the current locale. (Also matches full name on input.)
- %B: Full month name in the current locale. (Also matches abbreviated name on input.)
- %c: Date and time. Locale-specific on output, "%a %b %e %H:%M:%S %Y" on input.
- %d: Day of the month as decimal number (01–31).
- %H: Hours as decimal number (00–23).
- %I: Hours as decimal number (01–12).
- %j: Day of year as decimal number (001–366).
- %m: Month as decimal number (01–12).
- %M: Minute as decimal number (00–59).
- %p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales.
- %S: Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
- %U: Week of the year as decimal number (00–53) using Sunday as the first day of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
- %w: Weekday as decimal number (0–6, Sunday is 0).
- %W: Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
- %x: Date. Locale-specific on output, "%y/%m/%d" on input.
- %X: Time. Locale-specific on output, "%H:%M:%S" on input.
- %y: Year without century (00–99). Values 00 to 68 are prefixed by 20 and 69 to 99 by 19. That is the behaviour specified by the 2004 POSIX standard, but it does also say 'it is expected that in a future version the default century inferred from a 2-digit year will change'.
- %Y: Year with century.
- %z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC.
- %Z: (output only.) Time zone as a character string (empty if not available).
Also read the helpful documentation (?strptime in R), especially on time zones, leap seconds and time differences.
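A few of these codes can be tried with format() and as.Date(). This is a small sketch; the timestamp and the object names are hypothetical, and only locale-independent codes are used.

```r
x <- as.POSIXct("2011-01-15 14:30:00", tz = "UTC")  # hypothetical timestamp

format(x, "%d/%m/%Y")   # day/month/year with century
format(x, "%H:%M")      # hours and minutes on the 24-hour clock
format(x, "%j")         # day of the year
format(x, "%w")         # weekday number, Sunday is 0

# Two-digit years: 00 to 68 are read as 20xx, 69 to 99 as 19xx
y68 <- as.Date("01/04/68", format = "%d/%m/%y")
y69 <- as.Date("01/04/69", format = "%d/%m/%y")
format(c(y68, y69), "%Y")
```

The two-digit year rule is a common source of surprises with dates of birth, so a four-digit %Y is safer wherever the source data allows it.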
Analyzing data comes with many challenges. In the case of business analytics data, these challenges or constraints can have a marked effect on the quality and timeliness of the analysis, as well as on the expected versus actual payoff from the analytical results.
Challenges of Analytical Data Processing-
1) Data Formats- Reading in the complete data without losing any part (or metadata) and without adding superfluous details (which increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and to well-documented, easily searchable syntax and languages.
Other costs weigh more: the cost of additional data augmentation (should we pay for additional credit bureau data to be appended?) and the time needed to store and process the data (every column needed for analysis adds as many values as the dataset has rows, which becomes a time-consuming problem if you are considering an extra 100 variables with a few million rows). Above all, business relevance and quality guidelines ensure that basic data input and massaging take up a considerable part of the whole analytical project timeline.
2) Data Quality- Perfect data exists in a perfect world. The price of perfect information is one that a business will mostly never budget or wait for. Delivering inferences and results based on summaries of data that has missing, invalid or outlier values embedded within it makes the role of the analyst just as important as whichever tool is chosen to remove outliers, replace missing values or treat invalid data.
3) Project Scope-
How much data? How much analytical detail versus high-level summary? What are the timelines for delivery as well as for refresh of the data analysis? Which checks (statistical as well as business) are needed?
How easy is it to load and implement the new analysis in the existing information technology infrastructure? These are some of the outer parameters that can limit your analytical project scope, your analytical tool choice and your processing methodology.
4) Output Results vis-a-vis stakeholder expectation management-
Stakeholders like to see results, not constraints, hypotheses, assumptions, p-values or chi-square values. Output results need to be streamlined into a decision management process to justify the investment of human time and effort in an analytical project; choosing, training on and navigating the complexities and constraints of analytical tools are a subset of this. Optimum use of graphical display is part of aligning results to a form more palatable to stakeholders, provided the graphics are done well.
E.g. Marketing wants to get more sales, so it needs a clear campaign to target certain customers via specific channels with specified collateral. To support this business judgement, business analytics needs to validate, cross-validate and sometimes invalidate the business decision making with clear, transparent methods and processes.
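The data quality point above (missing values and outliers) can be illustrated with a short sketch. The vector is made up, and median imputation and the 1.5 x IQR rule are only one possible choice, assumed here for illustration.

```r
# Hypothetical numeric column with one missing value and one extreme value
balance <- c(12, 15, NA, 14, 13, 250, 16)

n_missing <- sum(is.na(balance))                 # how many values are missing
med <- median(balance, na.rm = TRUE)             # median of the observed values
filled <- ifelse(is.na(balance), med, balance)   # replace missing values with the median

# Flag values more than 1.5 * IQR outside the quartiles
q <- quantile(filled, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- filled < q[1] - fence | filled > q[2] + fence
which(is_outlier)                                # positions of the flagged values
```

Whether a flagged value is removed, capped or kept is a business decision as much as a statistical one, which is the point made in the list above.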
Given a dataset, the basic analytical steps that an analyst will do with R are as follows. This is meant as a note for analysts at a beginner level with R.
Package-specific syntax
update.packages() #This updates all installed packages
install.packages("package1") #This installs a package locally, a one-time event
library(package1) #This loads a specified package in the current R session, which needs to be done every R session
CRAN -> LOCAL HARD DISK -> R SESSION is the top-to-bottom hierarchy of package storage and invocation.
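The install-once, load-every-session pattern can be wrapped in a small helper. This is a sketch only; ensure_package is a hypothetical name, not a function from R or any package, and it assumes a CRAN mirror is configured if an install is actually needed.

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # CRAN -> local hard disk (one-time)
  }
  library(pkg, character.only = TRUE)   # local hard disk -> R session (every session)
  invisible(TRUE)
}

# "stats" ships with R, so this call only loads it
ensure_package("stats")
```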
ls() #This lists all objects or datasets currently active in the R session
> names(assetsCorr) #This gives the names of variables within a dataframe
 [1] "AssetClass" "LargeStocksUS" "SmallStocksUS"
 [4] "CorporateBondsUS" "TreasuryBondsUS" "RealEstateUS"
 [7] "StocksCanada" "StocksUK" "StocksGermany"
[10] "StocksSwitzerland" "StocksEmergingMarkets"
> str(assetsCorr) #gives complete structure of dataset
'data.frame': 12 obs. of 11 variables:
$ AssetClass : Factor w/ 12 levels "CorporateBondsUS",..: 4 5 2 6 1 12 3 7 11 9 ...
$ LargeStocksUS : num 15.3 16.4 1 0 0 ...
$ SmallStocksUS : num 13.49 16.64 0.66 1 0 ...
$ CorporateBondsUS : num 9.26 6.74 0.38 0.46 1 0 0 0 0 0 ...
$ TreasuryBondsUS : num 8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 ...
$ RealEstateUS : num 10.6 17.32 0.08 0.59 0.35 ...
$ StocksCanada : num 10.25 19.78 0.56 0.53 -0.12 ...
$ StocksUK : num 10.66 13.63 0.81 0.41 0.24 ...
$ StocksGermany : num 12.1 20.32 0.76 0.39 0.15 ...
$ StocksSwitzerland : num 15.01 20.8 0.64 0.43 0.55 ...
$ StocksEmergingMarkets: num 16.5 36.92 0.3 0.6 0.12 ...
> dim(assetsCorr) #gives dimensions observations and variable number
[1] 12 11
str(dataset) gives the structure of the dataset (note that the structure includes both the names of the variables within the dataset and the dimensions of the dataset)
head(dataset,n1) gives the first n1 rows of the dataset while
tail(dataset,n2) gives the last n2 rows of the dataset, where n1 and n2 are numbers and dataset is the name of the object (here a data frame) being considered
summary(dataset) gives you a brief summary of all variables while
describe(dataset) gives a detailed description of the variables (describe() is not in base R; it comes from an add-on package such as Hmisc)
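These inspection commands can be tried on a dataset that ships with R. The sketch below uses the built-in mtcars data frame in place of the assetsCorr data above, and leaves out describe() because it needs an add-on package.

```r
data(mtcars)        # built-in example data frame

names(mtcars)       # variable names
str(mtcars)         # structure: names, types and dimensions
dim(mtcars)         # number of observations and variables
head(mtcars, 3)     # first 3 rows
tail(mtcars, 2)     # last 2 rows
summary(mtcars)     # brief summary of every variable
```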
Simple graphics can be produced with functions such as hist() and plot().
As you can see in the above cases, there are multiple ways to get even a basic analysis of data in R. However, most of the syntax commands are intuitively understood (like hist for histogram, t.test for t test, plot for plot).
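A minimal sketch of such commands on the built-in mtcars data follows. The choice of variables and the object name res are illustrative only.

```r
hist(mtcars$mpg)                     # histogram of one variable
plot(mtcars$wt, mtcars$mpg)          # scatter plot of two variables
boxplot(mpg ~ cyl, data = mtcars)    # box plots by group

# Welch two-sample t test of mpg between automatic (am = 0) and manual (am = 1) cars
res <- t.test(mpg ~ am, data = mtcars)
res$p.value
```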
For detailed analysis throughout the scope of a project, a business analytics user is advised to use multiple GUIs and multiple packages. Even for highly specific and specialized analytical tasks, it is worth checking for a GUI that incorporates the required package.
I was searching for some basic syntax in R (basically cross tabs and density plots) and I came across the Quick-R site.
It's really a nice site for R beginners and anyone trying to remember some syntax.
R syntax can be very simple: a histogram is just hist(), a boxplot is just boxplot() and a t test is just t.test(dataset).
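Cross tabs are just as short. The following is a sketch in base R on the built-in mtcars data, not code taken from the Quick-R site; the object name tab is illustrative.

```r
# Cross tabulation of cylinders by gears
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
tab

addmargins(tab)               # add row and column totals
prop.table(tab, margin = 1)   # row proportions
```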
Here is an example from the site-
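# Simple Histogram
hist(mtcars$mpg)
# Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=12, col="red")
# Add a Normal Curve (Thanks to Peter Dalgaard)
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram with Normal Curve")
# x values for the curve and the normal density at those values
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
# rescale the density to the count scale of the histogram
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)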
Histograms can be a poor method for determining the shape of a distribution because they are so strongly affected by the number of bins used.
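The effect of the bin count can be checked without drawing anything. This is a small sketch with illustrative object names; note that R treats a single number given as breaks as a suggestion, so the actual number of bins may differ from it.

```r
x <- mtcars$mpg

# Same data, two different suggested numbers of bins; plot = FALSE only computes
h_coarse <- hist(x, breaks = 5,  plot = FALSE)
h_fine   <- hist(x, breaks = 20, plot = FALSE)

length(h_coarse$counts)   # number of bins actually used
length(h_fine$counts)
h_coarse$counts           # counts per bin
h_fine$counts
```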
KERNEL DENSITY PLOTS
Kernel density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)) where x is a numeric vector.
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
# Filled Density Plot
d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")
COMPARING GROUPS VIA KERNEL DENSITY
The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups. The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.
# Compare MPG distributions for cars with
# 4,6, or 8 cylinders
library(sm)
attach(mtcars)
# create value labels
cyl.f <- factor(cyl, levels= c(4,6,8),
labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
# plot densities
sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")
# add legend via mouse click (needs an interactive graphics device)
colfill <- c(2:(2+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)
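Where the sm package is not installed, or a mouse click for the legend is not convenient, a similar comparison can be sketched in base R alone. This is not the Quick-R code; the object names and colours are illustrative choices.

```r
# One kernel density estimate of mpg per cylinder group
groups <- split(mtcars$mpg, mtcars$cyl)
dens <- lapply(groups, density)

# Common axis ranges so that all curves fit on one plot
xr <- range(sapply(dens, function(d) range(d$x)))
yr <- range(sapply(dens, function(d) range(d$y)))

# Empty plot with the right ranges, then one line per group
plot(xr, yr, type = "n", xlab = "Miles Per Gallon", ylab = "Density",
     main = "MPG Distribution by Car Cylinders")
cols <- c("red", "darkgreen", "blue")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topright", legend = paste(names(dens), "cylinder"), col = cols, lwd = 2)
```

Placing the legend with a keyword such as "topright" avoids the interactive locator() step, which helps when the script is run in batch mode.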
It is not as exhaustive as http://cran.r-project.org/doc/manuals/R-intro.html
but it is much simpler and easier to follow.
The site was created by Robert I. Kabacoff, Ph.D.,
and he is working on a book called "R in Action". As the site says:
I have received numerous requests for a hardcopy version of this site, so over the past year I have been writing a book that takes the material here and significantly expands upon it. If you are interested, early access is available.
If you have not been to that website, I recommend it highly (though the tagline or logo of "R for SAS/SPSS/Stata Users" seems a bit familiar): http://www.statmethods.net/index.html
So I saw the news at the NY R Meetup and decided to have a go at the Prediction API package (which first started off as a blog post at
1) My OS was Ubuntu 10.10 Netbook.
Ubuntu has a slight glitch, plus a workaround, for installing the RCurl package on which the Google Prediction API package depends: for RCurl to install, you first need to install the Ubuntu package libcurl4-gnutls-dev.
Once you have installed that using Synaptic,
simply start R.
2) Install the packages rjson and RCurl using install.packages and choosing a CRAN mirror.
Since GooglePredictionAPI is not yet on CRAN
3) Download that package from
4) You need to copy this downloaded package to your "first library" folder.
When you start R, simply run
and that's the folder into which you copy the GooglePredictionAPI package you downloaded.
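One way to list the library folders in base R is .libPaths(). Treating its first entry as the "first library" folder meant here is an assumption, and first_lib is an illustrative name.

```r
.libPaths()                    # all library folders known to this R session

first_lib <- .libPaths()[1]    # assumed to be the "first library" folder
first_lib
```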
5) Now the following line works
- Under R prompt,
> install.packages("googlepredictionapi_0.1.tar.gz", repos=NULL, type="source")
6) Uploading data to Google Storage using the GUI (rather than gsutil)
Just go to https://sandbox.google.com/storage/
and that's the Google Storage manager.
Notes on Training Data-
Use a csv file
The first column is the score column (like 1,0 or prediction score)
There are no headers, so delete the headers from the data file and move the dependent variable to the first column (Note: I used data from the kaggle contest for R package recommendation at
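The layout described in these notes (score first, no header row) can be produced from R before uploading. The data frame, column names and file location below are made up for illustration.

```r
# Hypothetical training data: 'score' is the dependent variable
train <- data.frame(x1 = c(5.1, 4.9, 6.2),
                    x2 = c("a", "b", "a"),
                    score = c(1, 0, 1))

# Move the dependent variable to the first column
train <- train[, c("score", setdiff(names(train), "score"))]

# Write a comma-separated file with no header row and no row names
out_file <- file.path(tempdir(), "training_data.csv")
write.table(train, out_file, sep = ",", row.names = FALSE, col.names = FALSE)

readLines(out_file)   # inspect the lines that would be uploaded
```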
7) The good stuff:
Once you type in the basic syntax, the first time it will ask for your Google credentials (email and password).
It then starts showing you the time elapsed for training.
Now you can disconnect and go off (actually I got disconnected by accident and came back in after about 5 minutes, so this is only my guess at what happened and why; don't blame me, test it for yourself)
and when you come back (hopefully before the token expires) you can see the status of your request (see below).
> library(rjson)
> library(RCurl)
Loading required package: bitops
> library(googlepredictionapi)
> my.model <- PredictionApiTrain(data="gs://numtraindata/training_data")
The request for training has sent, now trying to check if training is completed
Training on numtraindata/training_data: time:2.09 seconds
Training on numtraindata/training_data: time:7.00 seconds
Note that I changed the format of the URL where my data is located. Go to your Google Storage manager and right-click on the file name to get the link address (https://sandbox.google.com/storage/numtraindata/training_data.csv),
then change it to gs://numtraindata/training_data (that helps avoid a syntax error).
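The same rewrite can be done in R with two substitutions. This sketch assumes every link address has the same prefix and a .csv suffix as in the example above; the object names are illustrative.

```r
url <- "https://sandbox.google.com/storage/numtraindata/training_data.csv"

# Replace the web prefix with gs:// and drop the .csv extension
gs_path <- sub("^https://sandbox\\.google\\.com/storage/", "gs://", url)
gs_path <- sub("\\.csv$", "", gs_path)
gs_path
```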
8) Following the fairly high-level instructions at https://code.google.com/p/google-prediction-api-r-client/, you could also try this on a local file:
## Load googlepredictionapi and dependent libraries
library(rjson)
library(RCurl)
library(googlepredictionapi)
## Make a training call to the Prediction API against data in the Google Storage.
## Replace MYBUCKET and MYDATA with your data.
my.model <- PredictionApiTrain(data="gs://MYBUCKET/MYDATA")
## Alternatively, make a training call against training data stored locally as a CSV file.
## Replace MYPATH and MYFILE with your data.
my.model <- PredictionApiTrain(data="MYPATH/MYFILE.csv")
At the time of writing my data was still getting trained, so I will keep you posted on what happens.