R and Hadoop

Here is an exciting project for using R in a cloud computing environment (two of my favorite things). It is called RHIPE.

R and Hadoop Integrated Processing Environment v.0.38


The project website is http://ml.stat.purdue.edu/rhipe/

RHIPE (phonetic spelling: hree-pay) is a Java package that integrates the R environment with Hadoop, the open-source implementation of Google's MapReduce. Using RHIPE it is possible to code map-reduce algorithms in R, e.g.:

m <- function(key, val){
  # Tokenize the line and count each distinct word
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc) <- NULL; names(cln) <- NULL
  # Emit one (word, count) pair per distinct word
  return(sapply(seq_along(wc), function(r) list(key=cln[r], value=wc[[r]]), simplify=FALSE))
}
r <- function(key, value){
  # Sum the counts emitted for this word across all maps
  value <- do.call("rbind", value)
  return(list(list(key=key, value=sum(value))))
}
rhmr(map=m, reduce=r, combiner=TRUE, input.folder="X", output.folder="Y")
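Since m and r are ordinary R functions, they can be exercised locally before submitting a job. The harness below is a sketch of my own, not RHIPE code: the sample line is made up, and the grouping step stands in for Hadoop's shuffle.

```r
# Local copies of the map and reduce functions from the example above
m <- function(key, val) {
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc) <- NULL; names(cln) <- NULL
  sapply(seq_along(wc), function(r) list(key = cln[r], value = wc[[r]]),
         simplify = FALSE)
}
r <- function(key, value) {
  value <- do.call("rbind", value)
  list(list(key = key, value = sum(value)))
}

# One input record: key = byte offset into the file, value = the line
pairs <- m(0, "the quick fox jumped over the lazy dog the end")

# Group the emitted values by key, as the Hadoop shuffle would
keys    <- sapply(pairs, `[[`, "key")
grouped <- split(lapply(pairs, `[[`, "value"), keys)

# Run the reduce once per key and collect the word counts
out    <- lapply(names(grouped), function(k) r(k, grouped[[k]])[[1]])
counts <- setNames(sapply(out, `[[`, "value"), sapply(out, `[[`, "key"))
counts["the"]  # 3: "the" occurs three times in the line
```

The same two functions, unchanged, are what rhmr would ship to the cluster; only the driver around them is simulated here.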
rhlapply packages the user's request into an R object. This is serialized and sent to the RHIPE server. The RHIPE server picks the object apart, creating a job request that Hadoop can understand. Each element of the provided list is processed by the user's function during the map stage of MapReduce. The results are returned and, if the output is to a file, serialized and written to a Hadoop SequenceFile; the values can be read back into R using the rhsq* functions.

rhlapply

rhlapply <- function( list.object,
                    func,
                    configure=expression(),
                    output.folder='',
                    shared.files=c(),
                    hadoop.mapreduce=list(),
                    verbose=T,
                    takeAll=T)
list.object
This can be either a list or a single scalar. In the case of the former, the function given by func is applied to each element of list.object. In the case of a scalar, the function is applied to the list 1:n, where n is the value of the scalar.
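As a sketch of this scalar-vs-list behaviour, the argument handling can be emulated locally with lapply. The helper names below are mine, not part of RHIPE:

```r
# Hypothetical local emulation of rhlapply's list.object handling.
# A scalar n is expanded to the list 1:n; a list is used as-is.
normalise.list.object <- function(list.object) {
  if (!is.list(list.object) && length(list.object) == 1 && is.numeric(list.object))
    as.list(seq_len(list.object))
  else
    as.list(list.object)
}

local.rhlapply <- function(list.object, func)
  lapply(normalise.list.object(list.object), func)

local.rhlapply(3, function(x) x^2)              # applied to the list 1:3
local.rhlapply(list(10, 20), function(x) x + 1) # applied element-wise
```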
func
A function that takes one parameter: an element of the list.
configure
A configuration expression to run before func is executed; it is executed once for every JVM. If you need variables or data frames, save them with rhsave or rhsave.image, use rhput to copy the file to the DFS, and then use shared.files, e.g.

config = expression({
              library(lattice)
              load("mydataset.Rdata")
})
output.folder
Any file created by the function is stored in output.folder, which is deleted first. If not given, the files created will not be copied. For side-effect files to be copied, create them in tmp, e.g. pdf("tmp/x.pdf") (note: no leading slash). The directory will contain a slew of part* files, as many as there are maps. These contain the binary key-value pairs.
shared.files
The function or the preload expression might require the presence of resource files, e.g. *.Rdata files. The user could copy them from the HDFS in the R code, or simply load them from the local working directory if the files were present there. This is the role of shared.files. It is a vector of paths to files on the HDFS, each of which will be copied to the working directory where the R code is run, e.g. c('/tmp/x.Rdata','/foo.tgz'); the first file can then be loaded via load("x.Rdata"). For those familiar with Hadoop terminology, this is implemented via DistributedCache.
hadoop.mapreduce
A list of Hadoop-specific options, e.g.

list(mapreduce.map.tasks=10,mapreduce.reduce.tasks=3)
takeAll
If takeAll is TRUE, the value returned is a list, each entry of which is the return value of the function. The entries are not in order, so element 1 of the returned list is not necessarily the result of func(list.object[[1]]).
verbose
If TRUE, the user will see the job progress in the R console. If FALSE, the web URL of the JobTracker will be displayed. Cancelling the command with CTRL-C will not cancel the job; use rhkill for that.
MapReduce in R

rhmr <- function(map, reduce, input.folder,
                 configure=list(map=expression(), reduce=expression()),
                 close=list(map=expression(), reduce=expression()),
                 output.folder='', combiner=F, step=F,
                 shared.files=c(), inputformat="TextInputFormat",
                 outputformat="TextOutputFormat",
                 hadoop.mapreduce=list(), verbose=T, libjars=c())

Execute map-reduce algorithms from within R. A discussion of the parameters follows.

input.folder
A folder on the DFS containing the files to process. Can be a vector.
output.folder
A folder on the DFS where output will go to.
inputformat
Either TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE (e.g. rhlapply or rhmr). Note that one cannot use arbitrary sequence files; they must have been created via a RHIPE function. Custom input formats are also possible: download the source and look at code/java/RXTextInputFormat.java.
outputformat
Either TextOutputFormat or SequenceFileOutputFormat. In the case of the former, the return value from the mapper or reducer is converted to character and written to disk. The following code is used to convert to character:

paste(key,sep='',collapse=field_separator)
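For instance, with a tab as the field separator (configurable via the rhipejob.textoutput.fieldsep option listed later), a multi-element key collapses to a single delimited string. The key below is made up for illustration:

```r
# Sketch of the TextOutputFormat conversion for a two-element key,
# assuming the field separator has been set to a tab
field_separator <- "\t"
key <- c("2009", "revenue")
line <- paste(key, sep = "", collapse = field_separator)
line  # "2009\trevenue"
```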

Custom output formats are also possible: download the source and look at code/java/RXTextOutputFormat.java.

If custom formats implement their own writables, they must subclass RXWritable or use one of the writables present in RHIPE.

shared.files
same as in rhlapply, see that for documentation.
verbose
If T, the job progress is displayed. If false, then the job URL is displayed.

At any time in the configure, close, map or reduce function/expression, the variable mapred.task.is.map will equal "true" if it is a map task and "false" otherwise (both strings). Also, mapred.iswhat is mapper, reducer or combiner in the respective environments.

configure
A list with either one element (an expression) or two elements, map and reduce, both of which must be expressions. These expressions are called in their respective environments, i.e. the map expression is called during the map configure, and similarly for the reduce expression. The reduce expression is also called for the combiner configure method. If there is only one list element, the expression is used for both the map and the reduce.
close
Same as configure .
map
A function that takes two values, key and value. It should return a list of lists; each list entry must contain two elements, key and value, e.g.

...
ret <- list()
ret[[1]] <-  list(key=c(1,2), value=c('x','b'))
return(ret)

If either key or value is missing, the output is not collected; e.g. return NULL to skip this record. If the input format is TextInputFormat, the value is the entire line and the key is probably useless to the user (it is a number indicating bytes into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.
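A small local sketch of the skip-on-NULL behaviour (a toy map of my own, not from the RHIPE documentation): the map below drops blank lines by returning NULL, so nothing would be collected for them.

```r
# A map that emits nothing for blank lines by returning NULL
m.filter <- function(key, val) {
  if (!nzchar(val)) return(NULL)       # skip this record entirely
  list(list(key = val, value = 1))     # otherwise emit one pair
}

m.filter(0, "")       # NULL: the record is skipped
m.filter(7, "hello")  # a single key-value pair
```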

reduce
Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values (all values emitted from the maps that share the same map output key); if step is TRUE, it takes a single value rather than a list. It must return a list of lists, each element of which must have two elements, key and value. This collects all the values and sends them to the function. If NULL is returned, or the return value does not conform to the above, nothing is collected by the Hadoop collector.
step
If step is TRUE, the reduce function is called for every value corresponding to a key, that is, once for every value.

  • The variable red.status is equal to 1 on the first call.
  • red.status is equal to 0 for every subsequent call, including the last value.
  • The reducer function is called one last time with red.status equal to -1; the value is then NULL. Anything returned at any of these stages is written to disk. The close function is called once every value for a given key has been processed, but returning anything from it has no effect. To assign to the global environment, use the <<- operator.
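To make the calling sequence concrete, here is a local emulation of the protocol. The driver function is my own sketch of what RHIPE does, not RHIPE code: a running total is kept in the global environment via <<-, and the result is emitted on the final red.status == -1 call.

```r
# A step-style reducer: red.status is 1 on the first value, 0 on
# subsequent values, and -1 on a final call where value is NULL.
reduce.step <- function(key, value) {
  if (red.status == 1) total <<- 0           # first value for this key
  if (red.status >= 0) total <<- total + value
  if (red.status == -1)                      # wrap-up: emit the result
    return(list(list(key = key, value = total)))
  NULL
}

# Local driver emulating the calling convention for a single key
drive.key <- function(key, values) {
  for (i in seq_along(values)) {
    red.status <<- if (i == 1) 1 else 0
    reduce.step(key, values[[i]])
  }
  red.status <<- -1
  reduce.step(key, NULL)
}

drive.key("the", list(1, 1, 1))  # key "the" with summed value 3
```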
combiner
TRUE or FALSE, to use the reducer as a combiner. Using a combiner makes the computation more efficient. If combiner is TRUE, the reduce function will be called as a combiner (zero or more times; it may never be called during the combine stage even if combiner is TRUE). The value of mapred.task.is.map is 'true' or 'false' (both strings) depending on whether the combiner is being executed as part of the map stage or the reduce stage respectively.

Whether knowledge of this is useful is something I am not sure of. However, if combiner is TRUE, keep in mind that your reduce function must be able to handle inputs sent from the map as well as inputs sent from the reduce function (itself).
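The word-count reduce from the opening example satisfies this requirement, because a sum of partial sums equals the sum of the raw counts. A quick local check (the partial split across two maps is made up):

```r
# The reduce from the word-count example: sums whatever counts it is given
r <- function(key, value) {
  value <- do.call("rbind", value)
  list(list(key = key, value = sum(value)))
}

raw      <- r("the", list(1, 1, 1, 1))[[1]]$value  # no combiner: raw counts
partial1 <- r("the", list(1, 1))[[1]]$value        # combiner output, map 1
partial2 <- r("the", list(1, 1))[[1]]$value        # combiner output, map 2
final    <- r("the", list(partial1, partial2))[[1]]$value  # reduce stage
raw == final  # TRUE: combining first does not change the answer
```

A reduce that is not associative and commutative in this way (e.g. one that computes a mean of its inputs directly) would give wrong answers with combiner=TRUE.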

libjars
If specifying a custom input/output format, the user might need to specify jar files here.
hadoop.mapreduce
Set RHIPE and Hadoop options via this.

RHIPE Options for mapreduce

Option (default): Explanation

rhipejob.rport (8888): The port on which Rserve runs; it should be the same across all machines.
rhipejob.charsxp.short (0): If 1, RHIPE optimizes serialization for character vectors, which reduces the length of the serialization.
rhipejob.getmapbatches (100): If the mapper/reducer emits several key-value pairs, how many to fetch from Rserve at a time. A higher number reduces the number of network reads (the network reads are to localhost).
rhipejob.outfmt.is.text (1 if TextInputFormat): Must be 1 if the output is textual.
rhipejob.textoutput.fieldsep (' '): The field separator for any text-based output format.
rhipejob.textinput.comment ('#'): In TextInputFormat, lines beginning with this are skipped.
rhipejob.combinerspill (100,000): The combiner is run after collecting at most this many items.
rhipejob.tor.batch (200,000): The number of values for the same key to collate before sending to the reducer. If you have dollops of memory, set this larger; however, make it too large and you hit Java's heap space limit.
rhipejob.max.count.reduce (Java's INT_MAX, about 2 billion): The total number of values for a given key to be collected; note the values are not ordered by any variable.
rhipejob.inputformat.keyclass (default chosen according to TextInputFormat or SequenceFileInputFormat): The full Java class name of the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, the key class must implement RXWritable and its methods.
rhipejob.inputformat.valueclass (default chosen according to TextInputFormat or SequenceFileInputFormat): The full Java class name of the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, the value class must implement RXWritable and its methods.
mapred.input.format.class (either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat): Specify your own input format class here.
rhipejob.outputformat.keyclass (default chosen according to the output format): The full Java class name of the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the key class must implement RXWritable.
rhipejob.outputformat.valueclass (default chosen according to the output format): The full Java class name of the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the value class must implement RXWritable.
mapred.output.format.class (either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileOutputFormat): Specify your own output format class here; provide libjars if required.

Citation: http://ml.stat.purdue.edu/rhipe/index.html

Exciting news for the world of remote computing.

Interview: Peter J Thomas, Award-Winning BI Expert

Here is an in-depth interview with Peter J Thomas, one of Europe's top Business Intelligence experts and influential thought leaders. Peter talks about BI tools, data quality, science careers, cultural transformation, and the key focus areas for BI.

I am a firm believer that the true benefits of BI are only realised when it leads to cultural transformation. -Peter James Thomas

 

Ajay- Describe your early career, from college to the present.

Peter –I was an all-rounder academically, but at the time that I was taking public exams in the 1980s, if you wanted to pursue a certain subject at University, you had to do related courses between the ages of 16 and 18. Because of this, I dropped things that I enjoyed such as English and ended up studying Mathematics, Further Mathematics, Chemistry and Physics. This was not because I disliked non-scientific subjects, but because I was marginally fonder of the scientific ones. In a way it is nice that my current blogging allows me to use language more.

The culmination of these studies was attending Imperial College in London to study for a BSc in Mathematics. Within the curriculum, I was more drawn to Pure Mathematics and Group Theory in particular, and so went on to take an MSc in these areas. This was an intercollegiate course and I took a unit at each of King’s College and Queen Mary College, but everything else was still based at Imperial. I was invited to stay on to do a PhD. It was even suggested that I might be able to do this in two years, given my MSc work, but I decided that a career in academia was not for me and so started looking at other options.

As sometimes happens a series of coincidences and a slice of luck meant that I joined a technology start-up, then called Cedardata, late in 1988; my first role was as a Trainee Analyst / Programmer. Cedardata was one of the first organisations to offer an Accounting system based on a relational database platform; something that was then rather novel, at least in the commercial arena. The RDBMS in question was Oracle version 5, running on VAX VMS – later DEC Ultrix and a wide variety of other UNIX flavours. Our input screens were written in SQL*Forms 2 – later Oracle Forms – and more complex processing logic and reports were in Pro*C; this was before PL/SQL. Obviously this environment meant that I had to become very conversant with SQL*Plus and C itself.

When I joined Cedardata, they had 10 employees, 3 customers and annual revenue of just £50,000 ($80,000). By the time I left the company eight years later, it had grown dramatically to having a staff of 250, over 300 clients in a wide range of industries and sales in excess of £12 million ($20 million). It had also successfully floated on the main London Stock Exchange. When a company grows that quickly the same thing tends to happen to its employees.

Cedardata was probably the ideal environment for me at the time; an organisation that grew rapidly, offering new opportunities and challenges to its employees; that was fiercely meritocratic; and where narrow, but deep, technical expertise was encouraged to be rounded out by developing more general business acumen, a customer-focused attitude and people-management skills. I don’t think that I would have learnt as much, or progressed anything like as quickly in any other type of organisation.

It was also at Cedardata that I had my first experience of the class of applications that later became known as Business Intelligence tools. This was using BusinessObjects 3.0 to write reports, cross-tabs and graphs for a prospective client, the UK Foreign and Commonwealth Office (State Department). The approach must have worked as we beat Oracle Financials in a play-off to secure the multi-million pound account.

During my time at Cedardata, I rose to become an executive and filled a number of roles including Head of Development and also Assistant to the MD / Head of Product Strategy. Spending my formative years in an organisation where IT was the business and where the customer was King had a profound impact on me and has influenced my subsequent approach to IT / Business alignment.

Ajay- How would you convince young people to take maths and science more? What advice would you give to policy makers to promote more maths and science students?

Peter- While I have used little of my Mathematics directly in my commercial career, the approach to problem-solving that it inculcated in me has been invaluable. On arriving at University, it was something of a shock to be presented with Mathematical problems where you couldn’t simply look up the method of solution in a textbook and apply it to guarantee success. Even in my first year I had to grapple with challenges where you had no real clue where to start. Instead what worked, at least most of the time, was immersing yourself in the general literature, breaking down the problem into more manageable chunks, trying different techniques – sometimes quite recherché ones – to make progress, occasionally having an insight that provides a short-cut, but more often succeeding through dogged determination. All of that sounds awfully like the approach that has worked for me in a business context.

Having said that, I was not terribly business savvy as a student. I didn’t take Mathematics because I thought that it would lead to a career, I took it because I was fascinated by the subject. As I mentioned earlier, I enjoyed learning about a wide range of things, but Science seemed to relate to the most fundamental issues. Mathematics was both the framework that underpinned all of the Sciences and also offered its own world where astonishing and beautiful results could be found, independent of any applicability; although it has to be said that there are few branches of Mathematics that have not been applied somewhere or other.

I think you either have this appreciation of Science and Mathematics or you don’t and that this happens early on.

Certainly my interest was supported by my parents and a variety of teachers, but a lot of it arose from simply reading about Cosmology, or Vulcanism, or Palaeontology. I watched a YouTube video of Stephen Jay Gould recently saying that when he was a child in the 1950s all children were “in” to Dinosaurs, but that he actually got to make a career out of it. Maybe all children aren’t “in” to dinosaurs in the same way today; perhaps the mystery and sense of excitement has gone.

In the UK at least there appear to be fewer and fewer people taking Science and Mathematics. I am not sure what is behind this trend. I read pieces that suggest that Science and Maths are viewed as being “hard” subjects, and people opt for “easier” alternatives. I think creative writing is one of the hardest things to do, so I’m not sure where this perspective comes from.

Perhaps some things that don’t help are the twin images of the Scientist as a white-coated boffin and the Mathematician as a chalk-covered recluse, neither of whom have much of a grasp on the world beyond their narrow discipline. While of course there is a modicum of truth in these stereotypes, they are far from being wholly accurate in my experience.

Perhaps Science has fallen off of the pedestal that it was placed on in the 1950s and 1960s. Interest in Science had been spurred by a range of inventions that had improved people’s lives and often made the inventors a lot of money. Science was seen as the way to a better tomorrow, a view reinforced by such iconic developments as the discovery of the structure of DNA, our ever deepening insight about sub-atomic physics and the unravelling of many mysteries of the Universe. These advances in pure science were supported by feats of scientific / engineering achievement such as the Apollo space programme. The military importance of Science was also put into sharp relief by the Manhattan Project; something that also maybe sowed the seeds for later disenchantment and even fear of the area.

The inevitable fallibility of some Scientists and some scientific projects burst the bubble. High-profile problems included the Thalidomide tragedy and the outcry, however ill-informed, about genetically modified organisms. Also the poster child of the scientific / engineering community was laid low by the Challenger disaster. On top of this, living with the scientifically-created threat of mutually-assured destruction probably began to change the degree of positivity with which people viewed Science and Scientists. People arrived at the realisation that Science cannot address every problem; how much effort has gone into finding a cure for cancer for example?

In addition, in today’s highly technological world, the actual nuts and bolts of how things work are often both hidden and mysterious. While people could relatively easily understand how a steam engine works, how many have any idea about how their iPod functions? Technology has become invisible and almost unimportant, until it stops working.

I am a little wary of Governments fixing issues such as these, which are the result of major generational and cultural trends. Often state action can have unintended and perverse results. Society as a whole goes through cycles and maybe at some future point Science and Mathematics will again be viewed as interesting areas to study; I certainly hope so. Perhaps the current concerns about climate change will inspire a generation of young people to think more about technological ways to address this and interest them in pertinent Sciences such as Meteorology and Climatology.

Ajay- How would you rate the various tools within the BI industry, as in a SWOT analysis (briefly and individually)?

Peter- I am going to offer a Politician’s reply to this. The really important question in BI is not which tool is best, but how to make BI projects successful. While many an unsuccessful BI manager may blame the tool or its vendor, this is not where the real issues lie.

I firmly believe that successful BI rests on four mutually reinforcing pillars:

  • understand the questions the business needs to answer,
  • understand the data available,
  • transform the data to meet the business needs and
  • embed the use of BI in the organisation’s culture.

If you get these things right then you can be successful with almost any of the excellent BI tools available in the marketplace. If you get any one of them wrong, then using the paragon of BI tools is not going to offer you salvation.

I think about BI tools in the same way as I do the car market. Not so many years ago there were major differences between manufacturers.

The Japanese offered ultimate reliability, but maybe didn’t often engage the spirit.

The Germans prided themselves on engineering excellence, slanted either in the direction of performance or luxury, but were not quite as dependable as the Japanese.

The Italians offered out-and-out romance and theatre, with mechanical integrity an afterthought.

The French seemed to think that bizarrely shaped cars with wheels as thin as dinner plates were the way forward, but at least they were distinctive.

The Swedes majored on a mixture of safety and aerospace cachet, but sometimes struggled to shift their image of being boring.

The Americans were still in the middle of their love affair with the large and the rugged, at the expense of convenience and value-for-money.

Stereotypically, my fellow-countrymen majored on agricultural charm, or wooden-panelled nostalgia, but struggled with the demands of electronics.

Nowadays, the quality and reliability of cars are much closer to each other. Most manufacturers have products with similar features and performance and economy ratings. If we take financial issues to one side, differences are more likely to related to design, or how people perceive a brand. Today the quality of a Ford is not far behind that of a Toyota. The styling of a Honda can be as dramatic as an Alfa Romeo. Lexus and Audi are playing in areas previously the preserve of BMW and Mercedes and so on.

To me this is also where the market for BI tools is at present. It is relatively mature and the differences between product sets are less than before.

Of course this doesn’t mean that the BI field will not be shaken up by some new technology or approach (in-memory BI or SaaS come to mind). This would be the equivalent of the impact that the first hybrid cars had on the auto market.

However, from the point of view of implementations, most BI tools will do at least an adequate job and picking one should not be your primary concern in a BI project.

Ajay- SAS Institute Chief Marketing Officer Jim Davis (interviewed on this blog) points to the superiority of business analytics over business intelligence, which he regards as an over-hyped term. What numbers, statistics and graphs would you quote, rather than semantics, to help redirect those perceptions?

I myself use SAS, SPSS and R, and find what James Taylor calls Decision Management much better enabled by them than by the simple ETL tools, or the reporting and aggregation tools, in many BI suites.

Peter- I have expended quite a lot of energy and hundreds of words on this subject. If people are interested in my views, which are rather different to those of Jim Davis, then I’d suggest that they read them in a series of articles starting with Business Analytics vs Business Intelligence [URL http://peterthomas.wordpress.com/2009/03/28/business-analytics-vs-business-intelligence/ ].

I will however offer some further thoughts and to do this I’ll go back to my car industry analogy. In a world where cars are becoming more and more comparable in terms of their reliability, features, safety and economy, things like styling, brand management and marketing become more and more important.

As the true differences between BI vendors narrow, expect more noise to be made by marketing departments about how different their products are.

I have no problem in acknowledging SAS as a leader in Business Analytics, too many people I respect use their tools for me to think otherwise. However, I think a better marketing strategy for them would be to stick to the many positives of their own products. If they insist on continuing to trash competitors, then it would make sense for them to do this in a way that couldn’t be debunked by a high school student after ten seconds’ reflection.

Ajay- In your opinion, what is the average RoI that a small, medium or large enterprise gets by investing in a business intelligence platform? What advice would you give to such firms (separately) to help them make up their minds?

Peter- The question is pretty much analogous to “What are the benefits of opening an office in China?” The answer is going to depend on what the company does; what their overall strategy is and how a China operation might complement it; whether their products and services are suitable for the Chinese market; how their costs, quality and features compare to local competitors; and whether they have already cracked markets closer to home.

To put things even more prosaically, “How long is a piece of string?”

Taking to one side the size and complexity of an organisation, BI projects come in all shapes and sizes.

Personally I have led Enterprise-wide, all-pervasive BI projects which have had a profound impact on the company. I have also seen well-managed and successful BI projects targeted on a very narrow and specific area.

The former obviously cost more than the latter, but the benefits are commensurately greater. In fact I would argue that the wider a BI project is spread, the greater its payback. Maybe lessons can be learnt and confidence built in an initial implementation to a small group, but to me the real benefit of BI is realised when it touches everything that a company does.

This is not based on a self-interested boosting of BI. To me if what we want to do is take better business decisions, then the greater number of such decisions that are impacted, the better that this is for the organisation.

Also there are some substantial up-front investments required for BI. These would include: building the BI team; establishing the warehouse and a physical architecture on which to deliver your application. If these can be leveraged more widely, then costs come down.

The same point can be made about the intellectual property that a successful BI team develops. This is one reason why I am a fan of the concept of BI Competency Centres [URL http://peterthomas.wordpress.com/2009/05/11/business-intelligence-competency-centres/ ].

I have been lucky enough to contribute to an organisation turning round from losing hundreds of millions of dollars to recording profits of twice that magnitude. When business managers cite BI as a major factor behind such a transformation, then this is clearly a technology that can be used to dramatic effect.

Nevertheless both estimating the potential impact of BI and measuring its actual effectiveness are non-trivial activities. A number of different approaches can be taken, some of which I cover in my article:

Measuring the benefits of Business Intelligence [URL http://peterthomas.wordpress.com/2009/02/26/measuring-the-benefits-of-business-intelligence/ ]. As ever there is no single recipe for success.

Ajay- Which BI tool/code are you most comfortable with, and what are its salient points?

Peter –Although I have been successful with elements of the IBM-Cognos toolset and think that this has many strong points, not least being relatively user-friendly, I think I’ll go back to my earlier comments about this area being much less important than many others for the success of a BI project.

Ajay -How do you think cloud computing will change BI? What percentage of BI budgets go to data quality and what is eventual impact of data quality on results?

Peter –I think that the jury is still out on cloud computing and BI. By this I do not mean that cloud computing will not have an impact, but rather that it remains unclear what this impact will actually be.

Given the maturity of the market, my suspicion is that the BI equivalent of a Google is not going to emerge from nowhere. There are many excellent BI start-ups in this space and I have been briefed by quite a few of them.

However, I think the future of cloud computing in BI is likely to be determined by how the likes of IBM-Cognos, SAP-BusinessObjects and Oracle-Hyperion embrace the area.

Having said this, one of the interesting things in computing is how easy it is to misjudge the future and perhaps there is a potential titan of cloud BI currently gestating in the garage so beloved of IT mythology.

On data quality, I have never explicitly split out this component of a BI effort. Rather data quality has been an integral part of what we have done. Again I have taken a four-pillared approach:

  • improve how the data is entered;
  • make sure your interfaces aren’t the problem;
  • check how the data has been entered / interfaced;
  • and don’t suppress bad data in your BI.

The first pillar consists of improved validation in front-end systems – something that can be facilitated by the BI team providing master data to them – and also a focus on staff training, stressing the importance to the organisation of accurately recording certain data fields.

The second pillar is more to do with the general IT Architecture and how this relates to the Information Architecture, again master data has a role to play, but so does ensuring that the IT culture is one in which different teams collaborate well and are concerned about what happens to data when it leaves “their” systems.

The third pillar is the familiar world of after-the-fact data quality reports and auditing, something that is necessary, but not sufficient, for success in data quality.

Finally there is what I think can be one of the most important pillars; ensuring that the BI system takes a warts-and-all approach to data. This means that bad data is highlighted, rather than being suppressed. In turn this creates pressure for the problems to be addressed where they arise and creates a virtuous circle.

For those who might be interested in this area, I expand on it more in Using BI to drive improvements in data quality [URL http://peterthomas.wordpress.com/2009/02/11/using-bi-to-drive-improvements-in-data-quality/ ].

Ajay- You are well known in England’s rock climbing and bouldering community. A fun question: what is the similarity between a BI implementation/project and climbing a big boulder?

Peter –I would have to offer two minor clarifications.

First, it is probably my partner who is better known in climbing circles, via her blog [URL http://77jenn.blogspot.com/ ] and the articles and reviews that she has written for the climbing press; though I guess I can take credit for most of the photos and videos.

Second, particularly given the fact that a lot of our climbing takes place in Wales, I should acknowledge the broader UK climbing community and also mention our most mountainous region of Scotland.

Despite what many inhabitants of Sheffield might think to the contrary, there is life beyond Stanage Edge [URL http://en.wikipedia.org/wiki/Stanage ].

I have written about the determination and perseverance that are required to get to the top of a boulder, or indeed to the top of any type of climb [URL http://peterthomas.wordpress.com/2009/03/31/perseverance/ ].

I think those same qualities are necessary for any lengthy, complex project. I am a firm believer that the true benefits of BI are only realised when it leads to cultural transformation. Certainly the discipline of change management has many parallels with rock climbing. You need a positive attitude and a strong belief in your ultimate success, despite the inevitable setbacks. If one approach doesn’t yield fruit then you need to either fine-tune or try something radically different.

I suppose a final similarity is the feeling that you get having completed a climb, particularly if it is at the limit of your ability and has taken a long time to achieve. This is one of both elation and deep satisfaction, but is quickly displaced by a desire to find the next challenge.

This is something that I have certainly experienced in business life and I think the feelings will be familiar to many readers.

Biography-

 

Peter Thomas has led all-pervasive Business Intelligence and Cultural Transformation projects serving the needs of 500+ users in multiple business units and service departments across 13 European and 5 Latin American countries. He has also developed Business Intelligence strategies for operations spanning four continents. His BI work has won two industry awards: “Best Enterprise BI Implementation” from Cognos in 2006 and “Best use of IT in Insurance” from Financial Sector Technology in 2005. Peter speaks about success factors in both Business Intelligence and the associated Change Management at seminars across both Europe and North America, and writes about these areas and many other aspects of business, technology and change on his blog [URL http://peterthomas.wordpress.com ].

Training on R

Here is an interesting training from Revolution Computing

New Training Course from REvolution Computing
High-Performance Computing with R
July 31, 2009 – Washington, DC – Prior to JSM
Time: 9am – 5pm
$600 commercial delegates, $450 government, $250 academic

Click Here to Register Now!

An overview of available HPC technologies for the R language to enable faster, scalable analytics that can take advantage of multiprocessor capability will be presented in a one-day course. This will include a comprehensive overview of REvolution’s recently released R packages foreach and iterators, making parallel programming easier than ever before for R programmers, as well as other available technologies such as RMPI, SNOW and many more. We will demonstrate each technology with simple examples that can be used as starting points for more sophisticated work. The agenda will also cover:

  • Identifying performance problems
  • Profiling R programs
  • Multithreading, using compiled code, GPGPU
  • Multiprocess computing
  • SNOW, MPI, NetWorkSpaces, and more
  • Batch queueing systems
  • Dealing with lots of data

Attendees should have basic familiarity with the R language—we will keep examples elementary but relevant to real-world applications.

This course will be conducted hands-on, classroom style. Computers will not be provided. Registrants are required to bring their own laptops.

For the full agenda Click Here or Click Here to Register Now!

Source: www.revolution-computing.com

Disclaimer- I am NOT commercially related to REvolution, I just love R. I do hope the REvolution chaps spend a tiny bit of time improving the user GUI as well, not just the HPC capabilities.

They recently released some new packages free to the CRAN community as well: 3 new packages for R designed to allow all R users to more quickly handle large, complex sets of data – iterators, foreach and doMC.

* iterators implements the “iterator” data structure familiar to users of languages like Java, C# and Python, to make it easy to program useful sequences – from all the prime numbers to the columns of a matrix or the rows of an external database.

* foreach builds on the “iterators” package to introduce a new way of programming loops in R. Unlike the traditional “for” loop, foreach runs multiple iterations simultaneously, in parallel. This makes loops run faster on a multi-core laptop, and enables distribution of large parallel-processing problems to multiple workstations in a cluster or in the cloud, without additional complicated programming. foreach works with parallel programming backends for R from the open-source and commercial domains.

* doMC is an open source parallel programming backend to enable parallel computation with “foreach” on Unix/Linux machines. It automatically enables foreach and iterator functions to work with the “multicore” package from R Core member Simon Urbanek.
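The three package descriptions above can be sketched in a few lines of R. This is a minimal, illustrative example (it assumes the foreach, iterators and doMC packages are installed from CRAN; doMC only works on Unix/Linux):

```r
# Minimal sketch of the iterators / foreach / doMC trio
library(foreach)
library(iterators)
library(doMC)

registerDoMC(2)  # register a 2-core parallel backend for %dopar%

# iterators: walk over the columns of a matrix one at a time
m <- matrix(1:6, nrow = 2)
col.it <- iter(m, by = "col")
nextElem(col.it)  # the first column

# foreach: loop iterations may run in parallel on the registered cores,
# and .combine folds the per-iteration results into one value
res <- foreach(i = 1:10, .combine = "+") %dopar% i^2
res  # sum of squares of 1..10 = 385
```

If no parallel backend has been registered, the same `%dopar%` loop still runs sequentially (with a warning), so code written this way degrades gracefully on a single core.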

The new packages have been developed by REvolution Computing and released under open source licenses to the R community, making them available to all existing R users.

Citation: http://www.revolution-computing.com/aboutus/news-room/2009/breakthrough-parallel-packages-and-functions-for-r.php

Facebook Text Analytics

Here is a great presentation on Facebook Analytics using text mining.

Citation- Text Analytics Summit 2009 – Roddy Lindsay – “Social Media, Happiness, Petabytes and LOLs”
and here is a presentation on HIVE and HADOOP
HIVE: Data Warehousing & Analytics on Hadoop

Facebook sure looks like a surprisingly nice analytics company to work for! No wonder they have all but swamped the competition.

Interview Jill Dyche Baseline Consulting

Here is an interview with Jill Dyche, co-founder of Baseline Consulting and one of the best Business Intelligence consultants and analysts. Her writing is read by a huge portion of the industry and has influenced many paradigms. She is also the author of e-Data, The CRM Handbook, and Customer Data Integration: Reaching a Single Version of the Truth.

BI tools are not recommended when they’re the first topic in a BI discussion.

Jill Dyche, Baseline Consulting

Ajay- What approximate Return on Investment would you give to various vendors within Business Intelligence?

Jill- You don’t kid around do you, Ajay? In general the answer has everything to do with the problem BI is solving for a company. For instance, we’re working on deploying operational BI at a retailer right now. This new program is giving people in the stores more power to make decisions about promotions and in-store events. The projected ROI is $300,000 per store per year—and the retailer has over 1000 stores. In another example, we’re working with an HMO client on a master data management project that helps it reconcile patient data across hospitals, clinics, pharmacies, and home health care. The ROI could be life-saving. So, as they say in the Visa commercials: Priceless.

Ajay- What impact do you think third-party cloud storage and processing will have on Business Intelligence consulting?

Jill- There’s a lot of buzz about cloud storage for BI, but most of it is coming from the VC community at this point, not from our clients. The trouble with that is that BI systems really need control over their storage. There are companies out there—check out a product called RainStor—that do BI storage in the cloud very well, and are optimized for it. But most “cloud” environments geared to BI are really just hosted offerings that provide clients with infrastructure and processing resources that they don’t have in-house. Where the cloud really has benefits is when it provides significant processing power to companies that can’t build it easily themselves.

Ajay- What top writing tips would you give to young struggling business bloggers, especially in this recession?

Jill- I’d advise bloggers to write like they talk, a standard admonishment by many a professor of Business Writing. So much of today’s business writing—especially in blogs—is stilted, overly-formal, and pedantic. I don’t care if your grammar is accurate; if your writing sounds like the Monroe Doctrine, no one will read it. (Just give me one quote from the Monroe Doctrine. See what I mean?) Don’t use the word “leverage” when you can use the word “use.” Be genuine and conversational. And avoid clichés like the plague.

Ajay- How would you convince young people, especially women, to pursue science careers? Describe your own career journey.

Jill- As much as we need those role models in science, high-tech, and math careers, I’d tell them to only embrace it if they really love it. My career path to high-tech was unconventional and unintentional. I started as a technical writer specializing in relational databases just as they were getting hot. One thing I know for sure is if you want to learn about something interesting, be willing to roll up your sleeves and work with it. My technical writing about databases, and then data warehouses, led to some pretty interesting client work.

Sure I’ve coded SQL in my career, and optimized some pretty hairy WHERE clauses. But the bigger issue is applying that work to business problems. Actually I’m grateful that I wasn’t a very good programmer. I’d still be waiting for that infinite loop to finish running.

Ajay- What are the areas within an enterprise where implementation of BI leads to the most gains. And when are BI tools not recommended?

Jill- The best opportunities for BI are for supporting business growth. And that typically means BI used by sales and marketing. Who’s the next customer and what will they buy? It’s answers to questions like these that can set a company apart competitively and contribute to both the top and bottom lines.

Not to be too heretical, but to answer your second question: BI tools are not recommended when they’re the first topic in a BI discussion. We’ve had several “Don’t go into the light” conversations with clients lately where they are prematurely looking at BI tools rather than examining their overall BI readiness. Companies need to be honest about their development processes, existing skill sets, and their data and platform infrastructures before they start phoning up data visualization vendors. Unfortunately, many people engage BI software vendors way before they’re ready.

Ajay- You and your partner Evan wrote what was really the first book on Master Data Management. But you’d been in the BI and data warehousing world before that. Why MDM?

Jill- We just kept watching what our clients couldn’t pull off with their data warehouses. We saw the effort they were going through to enforce business rules through ETL, and what they were trying to do to match records across different source systems. We also saw the amount of manual effort that went into things like handling survivor records, which leads to a series of conversations about data ownership.

Our book (Customer Data Integration: Reaching a Single Version of the Truth, Wiley) has as much to do with data management and data governance as it does with CDI and MDM. As Evan recently said in his presentation at the TDWI MDM Insight event, “You can’t master your data until you manage your data.” We really believe that, and our clients are starting to put it into practice too.

Ajay- Why did you and Evan choose to focus on customer master data (CDI) rather than a more general book on MDM?

Jill- There were two reasons. The first one was because other master data domains like product and location have their own unique sets of definitions and rules. Even though these domains also need MDM, they’re different and the details around implementing them and choosing vendor products to enable them are different. The second reason was that the vast majority of our clients started their MDM programs with customer data. One of Baseline’s longest legacies is enabling the proverbial “360-degree view” of customers. It’s what we knew.

Ajay- What’s surprised you most about your CDI/MDM clients?

Jill- The extent to which they use CDI and MDM as the context for bringing IT and the business closer together. You’d think BI would be ideal for that, and it is. But it’s interesting how MDM lets companies strip back a lot of the tool discussions and just focus on the raw conversations about definitions and rules for business data. Business people get why data is so important, and IT can help guide them in conversations about streamlining data quality and management. Companies like Dell have used MDM for nothing less than business alignment.

Ajay- Any plan to visit India and China for giving lectures?

Jill- I just turned down a trip to China this fall because I had a schedule conflict, which I’m really bummed about. As far as India is concerned, nothing yet, but if you’re looking for houseguests let me know. (Ajay- Sure, I have a big brand-new house ready – and if I visit the USA, may I be a house guest too?)

About Jill Dyche-

Jill blogs at http://www.jilldyche.com/, where she takes the perpetual challenge of business-IT alignment head on in her trenchant, irreverent style.

Jill Dyché is a partner and co-founder of Baseline Consulting. Her role at Baseline is a combination of best-practice expert, industry gadfly, key client advisor, and all-around thought leader. She is responsible for key client strategies and market analysis in the areas of data governance, business intelligence, master data management, and customer relationship management. Jill counsels boards of directors on the strategic importance of their information investments.

Author

Jill is the author of three books on the business value of IT. Jill’s first book, e-Data (Addison Wesley, 2000) has been published in eight languages. She is a contributor to Impossible Data Warehouse Situations: Solutions from the Experts (Addison Wesley, 2002), and her book, The CRM Handbook (Addison Wesley, 2002), is the bestseller on the topic.

Jill’s work has been featured in major publications such as Computerworld, Information Week, CIO Magazine, the Wall Street Journal, the Chicago Tribune and Newsweek.com. Jill’s latest book, Customer Data Integration (John Wiley and Sons, 2006) was co-authored with Baseline partner Evan Levy, and shows the business breakthroughs achieved with integrated customer data.

Industry Expert

Jill is a featured speaker at industry conferences, university programs, and vendor events. She serves as a judge for several IT best practice awards. She is a member of the Society of Information Management and Women in Technology, a faculty member of TDWI, and serves as a co-chair for the MDM Insight conference. Jill is a columnist for DM Review, and a blogger for BeyeNETWORK and Baseline Consulting.

Interview Alison Bolen SAS.com

My biggest editing soapbox right now is to encourage brevity. We’re so used to writing white papers, brochures and magazine articles that the concept of throwing down 200 words on a topic from your day is a very foreign exercise. –

 

Alison Bolen, Editor-in-Chief, sascom

Here is an interview with Alison Bolen, the editor-in-chief of sascom, the online magazine of the SAS Institute. Alison talks of the challenges in maintaining several of the topmost expert blogs on SAS, Business Analytics and Business Intelligence.

Ajay- Describe your career in the technology writing and publishing area. What advice would you give to young Web publishers and content producers just entering the job market in this recession? Describe your journey within SAS.

Alison- I started at SAS in 1999 as a summer student working as a contributing editor for SAS Communications magazine. Before the end of the year, I came on full time and soon transitioned to writing and editing for the Web. At that time, we were just developing the strategy for the customer support site and e-newsletters. As the first editor for the SAS Tech Report, I led marketing efforts that brought in 15,000 opt-in subscribers within six months. A year later, I switched to writing and editing customer success stories, which I enjoyed doing until I took on the role of Editor-in-Chief for sascom® magazine in 2006. We started our blogging program in 2007, and I’ve been actively involved in coaching SAS bloggers for the past two years.

Outside of SAS, I’ve written for Southwest Hydrology Magazine, the Arizona Daily Star and other regional papers. My bachelor’s degree is in magazine journalism and my master’s degree is in technical and business communications.

If you’re just beginning your career as a writer, start a blog and stick with it. There’s no better way to get daily writing practice, learn the basics of search engine optimization and start to understand what works online.

Ajay- www.SAS.com/Blogs has many, many blogs by experts, RSS feeds and even covers the annual SAS conference with video content. In terms of social media adaptation, what prompts you to stay ahead of the competition in ensuring marketing and technical communications for brand awareness?

What do you think are the basics of setting up a social media presence for a company, regardless of size?

Alison- Social media excites me because you can cut through the clutter and be real. Our new business forecasting blog by Michael Gilliland is a good example. Teaching people how to forecast better is his top priority, not selling software. Our overarching goal for the blogging program is similar: to share and develop expertise.

We’re big advocates of aligning your social media presence with existing marketing goals. We have a few grass-roots teams interested in social media, and we have a director-level Marketing 2.0 Council that our Social Media Manager Dave Thomas leads to determine broad guidelines and strategies. But the overarching concept is to look at the goals of your individual marketing campaigns first, and then determine which social media channels might help you reach those goals.

Most of all, take off your marketing hat when you enter the blog, network or forum. Social media consists of individuals, for the most part, and not companies, so be sure to offer value as a colleague and build relationships.

Ajay- I noticed that SAS.com/ Blogs are almost ad free – even of SAS products – apart from a simple banner of the company. Was this a deliberate decision, and if so, why?

Alison- Yes, most of the SAS blogs were intentionally created to help establish the individual blogger’s expertise – not to promote SAS products or services. One positive side effect is that SAS – by extension – builds credibility as well. But we really do see the blogs as a place to discuss concepts and ideas more than products and tools.

Ajay- What distinguishes good writers on blogs from bad writers on blogs? How about some tips for technical blog writing and especially editing (since many writers need editors more than they realize)?

Alison- The best blog writers know how to simplify and explain even the most mundane, everyday processes. This is true of personal and technical blog writing. If you can look at your life or your work and see what piece of it others would find interesting or want to know more about – and then know how to describe that sliver of yourself clearly – you have what it takes to be a good blogger. Chris Hemedinger does this well on The SAS Dummy blog.

My biggest editing soapbox right now is to encourage brevity. We’re so used to writing white papers, brochures and magazine articles at SAS that the concept of throwing down 200 words on a random topic from your day is a very foreign exercise. You have to learn how to edit your day – not just your writing – to find those topics and distill those thoughts into quick snippets that keep readers interested. And don’t forget it’s okay to have fun!

Ajay- I balance one blog, small consulting assignments and being a stay-at-home dad for an 18-month old. How easy is it for you to balance being editor of sascom, given the huge content your sites create, and three kids? Does working for SAS and its employee-friendly reputation help you do so?

Alison- I couldn’t balance work and kids without a whole lot of help from friends and family, that’s for sure. And the employee-friendly benefits help too. The biggest benefit is the cultural mindset, though, not any individual policy. My boss and my boss’ boss are both working mothers, and they’re balancing the same types of schedules. There’s an understanding about finding a healthy work-life balance that permeates SAS from top to bottom.

Ajay- As a social media consultant it is a weekly struggle for me to convince companies to discontinue registration for normal content (but keep it for special events), use a lot more video tutorials and share content freely across the Web. Above all, convincing busy senior managers to start writing a blog or an article is an exercise in diplomacy itself. How do you convince senior managers to devote time to content creation?

Alison- In a lot of areas, the content is already being created for analyst presentations, press interviews and consulting briefs. It’s really a matter of understanding how to take those existing materials and re-present them in a more personal voice. Not everyone can – or should – do it. You have to decide if you have the voice for it and whether or not it will bring you value beyond what you’re getting through your existing channels.

Ajay- Any plans to visit India and have a SAS India blogathon?

Alison- Alas, not this year.

Maybe I will visit Cary, NC then 🙂


Bio:
Alison Bolen is the Editor of sascom magazine and the sascom voices blog, where SAS experts publish their thoughts on popular and emerging business and technology trends worldwide. Since starting at SAS in 1999, Alison has edited print publications, Web sites, e-newsletters, customer success stories and blogs.

Alison holds a bachelor’s degree in magazine journalism from Ohio University and a master’s degree in technical writing from North Carolina State University.

The Great Driving Challenge- coolest young couples

Here is one of the new startups in India. A batchmate from B-school – to whom I owe too many beers and too few calculus notes – asked me to help him get votes. Treat this as shameless self-promotion, just like http://www.cerebralmastication.com/ ’s moustache and the R-rated R stats profanity on #rstats on Twitter.

Please do vote and read- they are a fun couple. http://www.greatdrivingchallenge.com/application/1245656268196502/

The Great Driving Challenge