## PiCloud gives away 20 free compute hours per month

Announcement from PiCloud (and this is in addition to the 5 free hours that a beginner account gets):

http://www.picloud.com/

Starting this month, all users will get 20 c1 core hours' worth of credits each and every month. If you ran out of your original 5 core hour credits, you can come back and play around some more! If you have minimal computing needs, this means that you can now use PiCloud regularly without even having to enter a credit card. Looking for more? Don't forget, we're giving away $500 worth of credits as part of our Academic Research Program. Applications are due this Thursday, October 27th.

## Ten steps to analysis using R

I am listing a set of basic R functions that allow you to start the task of business analytics, or analyzing a dataset (data.frame). I am doing this both as a reference for myself and for anyone who wants to learn R quickly.

I am not covering data import functions, because data manipulation is a separate baby altogether. Instead, I assume you have a dataset ready for analysis and list the top R commands you would need to analyze it.

For anyone who thought R was too hard to learn- here are ten functions to get you started with R

1) str(dataset) gives you the structure of the dataset

2) names(dataset) gives you the names of the variables

3) mean(dataset) returns the mean of the numeric variables

4) sd(dataset) returns the standard deviation of the numeric variables

5) summary(dataset) gives the quartiles, median and mean of the variables

That about gives me the basic stats I need for a dataset.

```
> data(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0

> mean(faithful)
eruptions   waiting
 3.487783 70.897059
> sd(faithful)
eruptions   waiting
 1.141371 13.594974
```

6) I can do a basic frequency analysis of a particular variable using the table command and the $ operator (similar to dataset.variablename in other statistical languages)

```
> table(faithful$waiting)

43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1
```
or I can do a frequency analysis of the whole dataset using

```
> table(faithful)
         waiting
eruptions 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67
    1.6    0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    1.667  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
    1.7    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    1.733  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
```

..... output truncated
7) plot(dataset) helps plot the dataset (for a data frame, this gives a scatterplot of each pair of variables)

8) hist(dataset$variable) is better for looking at the histogram of a single variable, e.g.

hist(faithful$waiting)

9) boxplot(dataset) gives box-and-whisker plots of the variables

10) The tenth function for a beginner would be cor(dataset$var1, dataset$var2), or cor(dataset) for the whole correlation matrix

```
> cor(faithful)
          eruptions   waiting
eruptions 1.0000000 0.9008112
waiting   0.9008112 1.0000000
```
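The graphics functions in steps 7-9 can all be tried on the same built-in faithful dataset - a minimal sketch (each call draws to the default graphics device):

```r
# Load the built-in Old Faithful dataset
data(faithful)

# 7) plot() on a data frame: scatterplot of the variable pairs
plot(faithful)

# 8) hist() on a single numeric variable
hist(faithful$waiting)

# 9) boxplot() draws one box-and-whisker plot per numeric column
boxplot(faithful)
```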

I am assuming that as a beginner you would use the list of GUIs at http://rforanalytics.wordpress.com/graphical-user-interfaces-for-r/ to import and export data. I will deal with ten steps to data manipulation in R in another post.

## Challenges of Analyzing a dataset (with R)

Analyzing data can have many challenges associated with it. In the case of business analytics data, these challenges or constraints can have a marked effect on the quality and timeliness of the analysis as well as the expected versus actual payoff from the analytical results.

Challenges of Analytical Data Processing-

1) Data Formats- Reading in complete data without losing any part (or metadata), or adding in superfluous details (that increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and well-documented, easily searchable syntax and language.

The costs of additional data augmentation (should we pay for additional credit bureau data to be appended?), the time spent storing and processing the data (every extra column adds as many values as there are rows in the dataset, which can inflate processing time if you are considering an extra 100 variables across a few million rows), but above all business relevance and quality guidelines, will ensure that basic data input and massaging take up a considerable part of the whole analytical project timeline.

2) Data Quality- Perfect data exists in a perfect world. The price of perfect information is one that business will mostly never budget or wait for. Delivering inferences and results based on summaries of data that has missing, invalid, and outlier values embedded within it makes the role of the analyst just as important as whichever tool is chosen to remove outliers, replace missing values, or treat invalid data.
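As a sketch of what such treatment can look like in R (the vector, the median imputation and the 1.5 * IQR outlier rule are all illustrative assumptions here - the right treatment is always a business call):

```r
# Illustrative vector with missing and extreme values
x <- c(12, 15, NA, 14, 200, 13, NA, 16)

# Replace missing values with the median of the observed data
x[is.na(x)] <- median(x, na.rm = TRUE)

# Flag outliers beyond 1.5 * IQR from the quartiles (Tukey's rule)
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * diff(q)
high <- x > q[2] + fence
low  <- x < q[1] - fence

# Winsorize: cap the flagged values at the fences
x[high] <- q[2] + fence
x[low]  <- q[1] - fence
```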

3) Project Scope-

How much data? How much Analytical detail versus High Level Summary? Timelines for delivery as well as refresh of data analysis? Checks (statistical as well as business)?

How easy is it to load and implement the new analysis in the existing Information Technology infrastructure? These are some of the outer parameters that can limit your analytical project scope, your analytical tool choice, and your processing methodology.

4) Output results vis-a-vis stakeholder expectation management-

Stakeholders like to see results, not constraints, hypotheses, assumptions, p-values, or chi-square values. Output results need to be streamlined to a decision management process to justify the investment of human time and effort in an analytical project; the choice of, training in, and navigation of analytical tool complexities and constraints are a subset of this. Optimum use of graphical display is part of presenting results in a more palatable form to stakeholders, provided the graphics are done well.

E.g. Marketing wants more sales, so they need a clear campaign to target certain customers via specific channels with specified collateral. To give their business judgement a basis, business analytics needs to validate, cross-validate and sometimes invalidate this business decision making with clear, transparent methods and processes.

Given a dataset- the basic analytical steps that an analyst will do with R are as follows. This is meant as a note for analysts at a beginner level with R.

Package-specific syntax

```
update.packages() # This updates all packages
install.packages(package1) # This installs a package locally, a one-time event
library(package1) # This loads a specified package into the current R session; it needs to be done every R session
```

CRAN________LOCAL HARD DISK_________R SESSION is the top-to-bottom hierarchy of package storage and invocation.

ls() #This lists all objects or datasets currently active in the R session

```
> names(assetsCorr)  # This gives the names of variables within a dataframe
 [1] "AssetClass"            "LargeStocksUS"         "SmallStocksUS"
 [4] "CorporateBondsUS"      "TreasuryBondsUS"       "RealEstateUS"
 [7] "StocksCanada"          "StocksUK"              "StocksGermany"
[10] "StocksSwitzerland"     "StocksEmergingMarkets"

> str(assetsCorr) # gives the complete structure of the dataset
'data.frame':   12 obs. of  11 variables:
 $ AssetClass           : Factor w/ 12 levels "CorporateBondsUS",..: 4 5 2 6 1 12 3 7 11 9 ...
 $ LargeStocksUS        : num  15.3 16.4 1 0 0 ...
 $ SmallStocksUS        : num  13.49 16.64 0.66 1 0 ...
 $ CorporateBondsUS     : num  9.26 6.74 0.38 0.46 1 0 0 0 0 0 ...
 $ TreasuryBondsUS      : num  8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 ...
 $ RealEstateUS         : num  10.6 17.32 0.08 0.59 0.35 ...
 $ StocksCanada         : num  10.25 19.78 0.56 0.53 -0.12 ...
 $ StocksUK             : num  10.66 13.63 0.81 0.41 0.24 ...
 $ StocksGermany        : num  12.1 20.32 0.76 0.39 0.15 ...
 $ StocksSwitzerland    : num  15.01 20.8 0.64 0.43 0.55 ...
 $ StocksEmergingMarkets: num  16.5 36.92 0.3 0.6 0.12 ...

> dim(assetsCorr) # gives the dimensions: number of observations and of variables
[1] 12 11
```

str(dataset) gives the structure of the dataset (note that str gives both the names of the variables within the dataset and its dimensions)

head(dataset, n1) gives the first n1 rows of a dataset, while
tail(dataset, n2) gives the last n2 rows, where n1 and n2 are numbers and dataset is the name of the object (here, the data frame being considered)

summary(dataset) gives you a brief summary of all variables, while

library(Hmisc)
describe(dataset) gives a detailed description of the variables
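A quick sketch of these inspection functions on the built-in faithful dataset (the describe() lines are commented out since they assume the Hmisc package is installed):

```r
# Load the built-in Old Faithful dataset
data(faithful)

# First and last 3 rows of the data frame
head(faithful, 3)
tail(faithful, 3)

# Brief summary: quartiles, median, mean of every variable
summary(faithful)

# Hmisc's describe() adds missing/distinct counts and frequencies
# (assumes install.packages("Hmisc") has been run once)
# library(Hmisc)
# describe(faithful)
```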

Simple graphics can be produced by

hist(dataset1)
and
plot(dataset1)

As you can see in the above cases, there are multiple ways to get even basic analysis of data in R- however, most of the syntax commands are intuitively understood (like hist for histogram, t.test for a t test, plot for a plot).

For detailed analysis throughout the scope of a project, a business analytics user is advised to use multiple GUIs and multiple packages. Even for highly specific and specialized analytical tasks, it is worth checking for a GUI that incorporates the required package.
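For instance, those intuitively named commands can be tried directly on the built-in faithful dataset (the null value of 70 in the t test is purely an illustrative assumption):

```r
# Load the built-in Old Faithful dataset
data(faithful)

# t.test: one-sample t test of mean waiting time against an
# illustrative null value of 70 minutes
t.test(faithful$waiting, mu = 70)

# hist for histogram, plot for a plot - the names mean what they say
hist(faithful$eruptions)
plot(faithful)
```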

## Oracle Open World/ RODM package

From the press release, here comes Oracle OpenWorld. They also put on an excellent rock concert as part of it.

.NET and Windows @ Oracle Develop and Oracle OpenWorld 2010

Oracle Develop will again feature a .NET track for Oracle developers. Oracle Develop is suited for all levels of .NET developers, from beginner to advanced. It covers introductory Oracle .NET material, new features, deep-dive application tuning, and includes three hours of hands-on labs to apply what you learned from the sessions.

To register, go to Oracle Develop registration site.

Oracle OpenWorld will include several sessions on using the Oracle Database on Windows and .NET.

Session schedules and locations for Windows and .NET sessions at Oracle Develop and OpenWorld are now available.

With ODAC 11.2.0.1.2, developers can connect to Oracle Database versions 9.2 and higher from Visual Studio 2010 and .NET Framework 4. ODAC components support the full framework, as well as the new .NET Framework Client Profile.

Learn about Oracle’s beta and production plans to support Microsoft Entity Framework with Oracle Database.

Also worth reading is this article-

Data Mining Using the RODM Package

By Casimir Saternos

Some excerpts-

Open R and enter the following command.

`> library(RODM)`

This command loads the RODM library as well as the dependent RODBC package. The next step is to make a database connection.

`> DB <- RODM_open_dbms_connection(dsn="orcl", uid="dm", pwd="dm")`

Subsequent commands use the DB object (an instance of the RODBC class) to connect to the database. The DSN specified in the command is the Data Source Name you used earlier during the ODBC connection configuration. You can view the actual R code being executed by a command by simply typing the function name (without parentheses).

`> RODM_open_dbms_connection`
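This trick is not specific to RODM - typing any R function's name without parentheses prints its definition. A small sketch with a base function:

```r
# Typing a bare function name prints its source
sd

# The body can also be captured programmatically
print(body(sd))
```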

And here is how to make a model in Oracle and R-

```
> numrows <- length(orange_data[,1])
> orange_data.rows <- length(orange_data[,1])
> orange_data.id <- matrix(seq(1, orange_data.rows), nrow=orange_data.rows, ncol=1, dimnames=list(NULL, c("CASE_ID")))
> orange_data <- cbind(orange_data.id, orange_data)
```

This adjustment to the data frame then needs to be propagated to the database. You can confirm the change using the sqlColumns function, as listed earlier.

```
> RODM_create_dbms_table(DB, "orange_data")
> sqlColumns(DB, 'orange_data')$COLUMN_NAME
```

```
> glm <- RODM_create_glm_model(
    database = DB,
    data_table_name = "orange_data",
    case_id_column_name = "CASE_ID",
    target_column_name = "circumference",
    model_name = "GLM_MODEL",
    mining_function = "regression")
```

Information about this model can then be obtained by analyzing the value returned from the model, stored in the variable named glm.

```
> glm$model.model_settings
> glm$glm.globals
> glm$glm.coefficients
```

Once you have a model, you can apply the model to a new set of data. To begin, create or retrieve sample data in the same format as the training data.

```
> query <- ('select 999 case_id, 1 tree, 120 age,
  32 circumference from dual')
> orange_test <- sqlQuery(DB, query)
> RODM_create_dbms_table(DB, "orange_test")
```

Finally, the model can be applied to the new data set and the results analyzed.

```
> results <- RODM_apply_model(database = DB,
    data_table_name = "orange_test",
    model_name = "GLM_MODEL",
    supplemental_cols = "circumference")
```

When your session is complete, you can clean up objects that were created (if you like) and you should close the database connection:

```
> RODM_drop_model(database=DB, 'GLM_MODEL')
> RODM_drop_dbms_table(DB, "orange_test")
> RODM_drop_dbms_table(DB, "orange_data")
> RODM_close_dbms_connection(DB)
```

See the full article at http://www.oracle.com/technetwork/articles/datawarehouse/saternos-r-161569.html