July 2009 – Page 3 – DECISION STATS

Wisdom from Elder Research- 10 Top Data Mining Mistakes

This is a great data mining tutorial from John Elder. Visit his site at http://datamininglab.com/

for more great video tutorials- all very lucid, easy to understand and powerful.

Zementis News

From a Zementis Newsletter- interesting advances on the R on the cloud front. Thanks to Rom Ramos for sending this, and I hope Zementis and some one like Google/ Biocep team up so all I need to make a model is some data and a browser. 🙂

The R Journal – A Refereed Journal for the R Project Launches

As a sign of the open source R project for statistical computing gaining momentum, the R newsletter has been transformed into The R Journal, a refereed journal for articles covering topics that are of interest to users or developers of R. As a supporter of the R PMML Package (see blog and video tutorial), we are honored that our article “PMML: An Open Standard for Sharing Models” which emphasizes the importance of the Predictive Model Markup Language (PMML) standard is part of the inaugural issue. If you already develop your models in R, export them via PMML, then deploy and scale your models in ADAPA on the Amazon EC2 cloud. Read the full story.

Integrating Predictive Analytics via Web Services

Predictive analytics will deliver more value and become more pervasive across the enterprise, once we manage to seamlessly integrate predictive models into any business process. In order to execute predictive models on-demand, in real-time or in batch mode, the integration via web services presents a simple and effective way to leverage scoring results within different applications. For most scenarios, the best way to incorporate predictive models into the business process is as a decision service. Query the model(s) daily, hourly, or in real-time, but if at all possible try to design a loosely coupled system following a Service Oriented Architecture (SOA).

Using web services, for example, one can quickly improve existing systems and processes by adding predictive decision models. Following the idea of a loosely coupled architecture, it is even possible to use integration tools like Jitterbit or Microsoft SQL Service Integration Services (SSIS) to embed predictive mode ls that are deployed in ADAPA on the Amazon Elastic Compute Cloud without the need to write any code. Of course, there is also the option to use custom Java code or MS SQL Server SSIS Scripting for which we provide a sample client application. Read the full story.

About ADAPA®:

A fast real-time deployment environment for Predictive Analytics Models – a stand alone scoring engine that reads .xml based PMML descriptions of models and scores streams of data. Developed by Zementis – a fully hosted Software-as-a Service (SaaS) solution on the Amazon Elastic Computing Cloud. It’s easy to use and remarkably inexpensive starting at only $0.99 per instance hour.

Interview John Moore CTO, Swimfish

Here is an interview with John F Moore, VP Engineering and Chief Technology Officer, Swimfish a provider of business solutions and CRM. A well known figure in Technology and CRM circles, John talks of Social CRM, Technology Offshoring, Community Initiatives and his own career.

Too many CRM systems are not usable. They are built by engineers that think of the system as a large database and the systems often look like a database making it difficult to use by the sales, support, and marketing people.

-John F Moore

Ajay – Describe your career journey from college to CTO. What changes in mindset did you undergo along the journey? What advice would you give to young students to take up science careers ?

John- First, I wanted to take time to thank you for the interview offer. I graduated from Boston University in 1988 with a degree in Electrical Engineering. At the time of my graduation I found myself to be very interested in the advanced taking place on the personal computing front by companies like Lotus with their 1-2-3 product. I knew that I wanted to be involved with these efforts and landed my first job in the software space as a Software Quality Engineer working on 1-2-3 for DOS.

I spent the first few years of my career working at Lotus as a developer, a quality engineer, and manager, on products such as Lotus 1-2-3 and Lotus Notes. Throughout those early career years I learned a lot and focused on taking as many classes as possible.

From Lotus I sought out the start-up environment and by early 2000 and joined a startup named Brainshark (http://www.brainshark.com). Brainshark was, and is, focused on delivering an asynchronous communication platform on the web and was one of the early providers of SAAS. In my seven years at Brainshark I learned a lot about delivering an Enterprise class SAAS solution on top of the Microsoft technology stack. The requirements to pass security audits for Fortune 500 companies, the need to match the performance of in-house solutions, resulted in all of us learning a great deal. These were very fun times.

I now work as the VP of Engineering and CTO at Swimfish, a services and software provider of business solutions. We focus on the financial marketplace where we have the founder has a very deep background, but also work within other verticals as well. Our products are focused on the CRM, document management, and mobile product space and are built on the Microsoft technology stack. Our customers leverage both our SAAS and on-premise solutions which require us to build our products to be more flexible than is generally required for a SAAS-only solution.

The exciting thing for me is the sheer amount of opportunities I see available for science/engineering students graduating in the near future. To be prepared for these opportunities, however, it will be important to not just be technically savvy.

Engineering students should also be looking at:

* Business classes. If you want to build cool products they must deliver business value.

* Writing and speaking classes. You must be able to articulate your ideas or no one will be willing to invest in them.

I would also encourage people to take chances, get in over your head as often as possible.You may fail, you may succeed. Either way you will gain experiences that make it all worthwhile.

Ajay- How do you think social media can help with CRM. What are the basic do’s and don’ts for social media CRM in your opinion?

John- You touch upon a subject that I am very passionate about. When I think of Social CRM I think about a system of processes and products that enable businesses to actively engage with customers in a manner that delivers maximum value to all. Customers should be able to find answers to their questions with minimal friction or effort; companies should find the right customers for their products.

Social CRM should deliver on some of these fronts:

* Analyze the web of relationships that exists to define optimal pathways. These pathways will define relationships that businesses can leverage for finding their customers. These pathways will enable customers to quickly find answers to their questions. For example, I needed an answer to a question about SharePoint and project management. I asked the question on Twitter and within 3 minutes had answers from two different people. Not only did I get the answer I needed but I made two new friends who I still talk to today.

* Monitor conversations to gauge brand awareness, identify customers having problems or asking questions. This monitoring should not be stalking; however, it should be used to provide quick responses to customers to benefit the greater community.

* Usability. Too many CRM systems are not usable. They are built by engineers that think of the system as a large database and the systems often look like a database making it difficult to use by the sales, support, and marketing people.

Finally, when I think of social media I think of these properties:

* Social is about relationship building.

* You should always add more value to the community than you take in return.

* Be transparent and honest. People can tell when you’re not.

Ajay- You are involved in some noble causes – like using blog space for out of work techies and separately for Alzheimer’s disease. How important do you think is for people especially younger people to be dedicated to community causes?

John- My mother-in-law was diagnosed with Alzheimer’s disease at the age 57. My wife and I moved into their two-family house to help her through the final years of her life. It is a horrible disease and one that it is easy to be passionate about if you have seen it in action.

My motivation on the job front is very similar. I have seen too many people suffer through these poor economic times and I simply want to do what I can to help people get back to work.

It probably sounds corny, but I firmly believe that we must all do what we can for each other. Business is competitive, but it does not mean that we cannot, or should not, help each other out. I think it’s important for everyone to have causes they believe in. You have to find your passions in life and follow them. Be a whole person and help change the world for the better.

Ajay- Describe your daily challenges as head of Engineering of Swimfish, Inc How important is it for the tech team to be integrated with the business and understand it as well.

John- The engineering team at Swimfish works very closely with the business teams. It is important for the team to understand the challenges our customers are encountering and to build products that help the customer succeed. I am not satisfied with the lack of success that many companies encounter when deploying a CRM solution.

We go as deep as possible to understand the business, the processes currently in use, the disparate systems being utilized, and then the underlying technologies currently in use. Only then do we focus on the solutions and deliver the right solution for that company.

On the product front it is the same. We work closely with customers on the features we are planning to add, trying to ensure that the solutions meet their needs as well as the needs of the other customers in the market that we are hoping to serve.

I do expect my engineers to be great at their core job, that goes without question. However, if they cannot understand the business needs they will not work for me very long.My weeks at Swimfish always provide me with interesting challenges and opportunities.

My typical day involves:

* Checking in with our support team to understand if there are any major issues being encountered by any of our customers.

* Challenging the support team to hit their targets. I love sales as without them I cannot deliver products.

* Checking in with my developers and test teams to determine how each of our projects is doing. We have a daily standup as well, but I try and personally check-in with as many people as possible.

* Most days I spend some time developing, mostly in C#. My current focus area is on our next release of our Milestone Tracking Matrix where I have made major revisions to our user interface.

I also spend time interacting on various social platforms, such as Twitter, as it is critical for me to understand the challenges that people are encountering in their businesses, to keep up with the rapid pace of technology, and just to check-in with friends. Keep it real.

Ajay- What are your views on off shoring work especially science jobs which ultimately made science careers less attractive in the US- at the same time outsourcing companies ( in India) generally pay only 1/3 rd of billing fees to salaries. Do you think concepts like ODesk can help change the paradigm of tech out-sourcing.

John- I have mixed opinions on off-shoring. You should not offshore because of perceived cost savings only. On net you will generally break even, you will not save as much as you might originally think.

I am, however, close to starting a relationship with a good development provider in Costa Rica. The reason for this relationship is not cost based, it is knowledge based. This company has a lot of experience with the primary CRM system that we sell to customers and I have not been successful in finding this experience locally. I will save a lot of money in upfront training on this skill-set; they have done a lot of work in this area already (and have great references). There is real value to our business, and theirs.

Note that Swimfish is already working with a geographically dispersed team as part of the engineering team is in California and part is in Massachusetts. This arrangement has already helped us to better prepare for an offshore relationship and I know we will be successful when we begin.

Ajay- What does John Moore do to have fun when he is not in front of his computer or with a cause.

John- As the father of two teenage daughters I spend a lot of time going to soccer, basketball, and softball games. I also enjoy spending time running, having completed a couple of marathons, and relaxing with a good book. My next challenge will be skydiving as my 17 year old daughter and I are going skydiving when she turns 18.

Brief Bio:

For the last decade I have worked as a senior engineering manager for SAAS applications built upon the Microsoft technology stack. I have established the processes, and hired the teams that delivered hundreds of updates ranging from weekly patches to longer running full feature releases. My background as a hands-on developer combined with my strong QA background has enabled me to deliver high quality software on-time.

You can learn more about me, and my opinions, by reading my blog at http://johnfmoore.wordpress.com/ or joining me on Twitter at http://twitter.com/JohnFMoore

R and Hadoop

Here is an exciting project for using R on the cloud computing environment ( two of my favorite things). It is called RHIPE

R and Hadoop Integrated Processing Environment v.0.38

cloud

The website source is http://ml.stat.purdue.edu/rhipe/

RHIPE(phonetic spelling: hree-pay’ ¹) is a java package that integrates the R environment with Hadoop, the open source implementation of Google’s mapreduce. Using RHIPE it is possible to code map-reduce algorithms in R e.g

m <- function(key,val){
  words <- strsplit(val," +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc)<-NULL; names(cln)<-NULL;
  return(sapply(1:length(wc),function(r) list(key=cln[r],value=wc[[r]]),simplify=F))
}
r <- function(key,value){
  value <- do.call("rbind",value)
  return(list(list(key=key,value=sum(value))))
}
rhmr(map=m,reduce=r,combiner=T,input.folder="X",output.folder="Y")

rhapply packages the user's request into an R vector object. This is serialized and sent to the RHIPE server. The RHIPE server picks apart the object creating a job request that Hadoop can understand. Each element of the provided list is processed by the users function during the Map stage of mapreduce. The results are returned and if the output is to a file, these results are serialized and written to a Hadoop Sequence file, the values can be read back into R using the rhsq* functions.

2 rhlapply

rhlapply <- function( list.object,
                    func,
                    configure=expression(),
                    output.folder='',
                    shared.files=c(),
                    hadoop.mapreduce=list(),
                    verbose=T,
                    takeAll=T)

list.object
 This can either be a list or a single scalar. In case of the former, the function given by func will be applied to each element of list.object. In case of a scalar, the function will be applied to the list 1:n where n is the value of the scalar 
func
 A function that takes one parameter: an element of the list. 
configure
 An configuration expression to run before the func is executed. Executed once for every JVM. If you need variables, data frames, use rhsave or rhsave.image , use rhput to copy it to the DFS and then use shared.files
config = expression({
              library(lattice)
              load("mydataset.Rdata")
})



output.folder
 Any file that is created by the function is stored in the output.folder. This is deleted first. If not given, the files created will not be copied.  For side effect files to be copies create them in tmp e.g pdf("tmp/x.pdf"), note no leading slash.The directory will contain a slew of part* files, as many as there maps. These contain the binary key-value pairs.

shared.files
 The function or the preload expression might require the presence resource files e.g *.Rdata files. The user could copy it from the HDFS in the R code or just load it from the local work directory were the files present. This is the role of shared.files. It is a vector of paths to files on the HDFS, each of these will be copied to the work directory where the R code is run. e.g c('/tmp/x.Rdata','/foo.tgz'), then the first file can be loaded via load("x.Rdata") . For those familiar with Hadoop terminology, this is implemented via DistributedCache . 
hadoop.mapreduce
 a list of Hadoop specific options e.g
list(mapreduce.map.tasks=10,mapreduce.reduce.tasks=3)

takeAll
 if takeALL is true, the value returned is a list each entry the return value of the the function, not in order so element 1 of the returned list is not the result of  func(list.object=1=) . 
verbose
 If True, the user will see the job progress in the R console. If False, the web url to the jobtracker will be displayed. Cancelling the command with CTRL-C will not cancel the job, use rhkill for that. 




Mapreduce in R.
rhmr <- function(map,reduce,input.folder,configure=list(map=expression(),reduce=expression()),
                close=list(map=expression(),reduce=expression())
                 output.folder='',combiner=F,step=F,
                 shared.files=c(),inputformat="TextInputFormat",
                 outputformat="TextOutputFormat",
                 hadoop.mapreduce=list(),verbose=T,libjars=c())
Execute map reduce algorithms from within R. A discussion of the parameters follow.

input.folder
 A folder on the DFS containing the files to process. Can be a vector. 
output.folder
 A folder on the DFS where output will go to. 
inputformat
 Either of TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE(e.g rhlapply or rhmr). Note, one can't use any sequence file, they must have been created via a RHIPE function. Custom Input formats are also possible. Download the source and look at code/java/RXTextInputFormat.java 
outputformat
 Either of TextOutputFormat or SequenceFileOutputFormat. In case of the former, the return value from the mapper or reducer is converted to character and written to disk. The following code is used to convert to character.
paste(key,sep='',collapse=field_separator)
Custom output formats are also possible. Download the source and look at code/java/RXTextOutputFormat.java
If custom formats implement their own writables, it must subclass RXWritable or use one of the writables presents in RHIPE

shared.files
 same as in rhlapply, see that for documentation. 
verbose
 If T, the job progress is displayed. If false, then the job URL is displayed. 

At any time in the configure, close, map or reduce function/expression, the variable mapred.task.is.map will be equal to "true" if it is map task,"false" otherwise (both strings) Also, mapred.iswhat is mapper, reducer, combiner in their respective environments.

configure
 A list with either one element (an expression) or two elements map and reduce both of which must be expressions. These expressions are called in their respective environments, i.e the map expression is called during the map configure and similarly for the  reduce expression. The reduce expression is called for the combiner configure method.If only one list element, the expression is used for both the map and reduce

close
 Same as configure . 
map
 a function that takes two values key and value. Should return a list of lists. Each list entry must contain two elements key and value , e.g
...
ret <- list()
ret[[1]] <-  list(key=c(1,2), value=c('x','b'))
return(ret)
If any of key/value are missing the output is not collected, e.g. return NULL to skip this record. If the input format is a TextInputFormat, the value is the entire line and the key is probably useless to the user( it is a number indicating bytes into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.

reduce
 Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values( all values emitted from the maps that share the same map output key ). If step is True, then not a list. Must return a list of lists each element of which must have two elements key and value.     This collects all the values and sends them to function. If NULL is returned or the return value is not conforming to the above nothing is collected the Hadoop collector. 
step
 If step is TRUE, then the reduce function is called for every value corresponding to a key that is once for every value.

 The variable red.status is equal to 1 on the first call.
 red.status is equal to 0 for every subsequent calls including the last value
 The reducer function is called one last time with red.status equal to -1. The value is NULL.Anything returned at any of these stages is written to disk The close function is called once every value for a given key has been processed, but returning anything  has no effect.  To a assign to the global environment use  the <<- operator.


combiner
 T or F, to use the reducer as a combiner. Using a combiner makes computation more efficient. If combiner is true, the reduce function will be called as a combiner (0 or more times, it may never be called during the combine stage even if combiner is T) .The value of mapred.task.is.map is 'true' or '*'false*' (both strings)  if the combiner is being executed as part of the map stage or reduce stage respectively.
Whether knowledge of this is useful or not is something I'm not sure of. However, if combiner is T , keep in mind,your reduce function must be able to handle inputs sent from the map or inputs sent from the reduce function(itself).

libjars
 If specifying a custom input/output format, the user might need to specify jar files here. 
hadoop.mapreduce
 set RHIPE and Hadoop options via this. 


1.1 RHIPE Options for mapreduce

 





Option
Default
Explanation




rhipejob.rport
8888
The port on which Rserve runs, should be same across all machines


rhipejob.charsxp.short
0
If 1,  RHIPE optimize serialization for character vectors. This reduces the length of the serialization


rhipejob.getmapbatches
100
If the reduce/mapper emits several key,values, how many to get from Rserve at a time. A higher number reduce the number of network reads(the network reads are to localhost)


rhipejob.outfmt.is.text
1 if TextInputFormat
Must be 1 if the output is textual


rhipejob.textoutput.fieldsep
' '
The field separator for any text based output format


rhipejob.textinput.comment
'#'
In the TextInputFormat, lines beginning with this are skipped


rhipejob.combinerspill
100,000
The combiner is run after collecting at most this many items


rhipejob.tor.batch
200,000
Number of values for the same key to collate before sending to the Reducer, if you have dollops of memory, set this larger. However, too large and you hit Java's heap space limit


rhipejob.max.count.reduce
Java's INT_MAX (about 2BN)
the total number of values for a given key to be collected, note the values are not ordered by any variable.


rhipejob.inputformat.keyclass
The default is chosen depending on TextInputFormat or SequenceFileInputFormat
Provide the full Java URL to the keyclass e.g org.saptarshiguha.rhipe.hadoop.RXWritableText, when using a Custom InputFormat  implement RXWritable     and implement the methods


rhipejob.inputformat.valueclass
The default is chosen depending on TextInputFormat or SequenceFileInputFormat
Provide the full Java URL to the valueclass e.g org.saptarshiguha.rhipe.hadoop.RXWritableText when using a Custom InputFormat  implement RXWritable     and implement the methods


mapred.input.format.class
As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat
specify yours here


rhipejob.outputformat.keyclass
The default is chosen depending on TextInputFormat or SequenceFileInputFormat
Provide the full Java URL to the keyclass e.g org.saptarshiguha.rhipe.hadoop.RXWritableText , also the keyclass must implement RXWritable and


rhipejob.outputformat.valueclass
The default is chosen depending on TextInputFormat or SequenceFileInputFormat
Provide the full Java URL to the value e.g org.saptarshiguha.rhipe.hadoop.RXWritableText , also the valueclass must implement RXWritable


mapred.output.format.class
As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat
specify yours here, provide libjars if required




Citation:http://ml.stat.purdue.edu/rhipe/index.html
Great exciting news for the world of computing remotely.

Option	Default	Explanation
rhipejob.rport	8888	The port on which Rserve runs, should be same across all machines
rhipejob.charsxp.short	0	If 1, RHIPE optimize serialization for character vectors. This reduces the length of the serialization
rhipejob.getmapbatches	100	If the reduce/mapper emits several key,values, how many to get from Rserve at a time. A higher number reduce the number of network reads(the network reads are to localhost)
rhipejob.outfmt.is.text	1 if TextInputFormat	Must be 1 if the output is textual
rhipejob.textoutput.fieldsep	' '	The field separator for any text based output format
rhipejob.textinput.comment	'#'	In the TextInputFormat, lines beginning with this are skipped
rhipejob.combinerspill	100,000	The combiner is run after collecting at most this many items
rhipejob.tor.batch	200,000	Number of values for the same key to collate before sending to the Reducer, if you have dollops of memory, set this larger. However, too large and you hit Java's heap space limit
rhipejob.max.count.reduce	Java's INT_MAX (about 2BN)	the total number of values for a given key to be collected, note the values are not ordered by any variable.
rhipejob.inputformat.keyclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the full Java URL to the keyclass e.g `org.saptarshiguha.rhipe.hadoop.RXWritableText`, when using a Custom InputFormat implement RXWritable and implement the methods
rhipejob.inputformat.valueclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the full Java URL to the valueclass e.g `org.saptarshiguha.rhipe.hadoop.RXWritableText` when using a Custom InputFormat implement RXWritable and implement the methods
mapred.input.format.class	As above, the default is either `org.saptarshiguha.rhipe.hadoop.RXTextInputFormat` or `org.apache.hadoop.mapred.SequenceFileInputFormat`	specify yours here
rhipejob.outputformat.keyclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the full Java URL to the keyclass e.g `org.saptarshiguha.rhipe.hadoop.RXWritableText` , also the keyclass must implement `RXWritable` and
rhipejob.outputformat.valueclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the full Java URL to the value e.g `org.saptarshiguha.rhipe.hadoop.RXWritableText` , also the valueclass must implement `RXWritable`
mapred.output.format.class	As above, the default is either `org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat` or `org.apache.hadoop.mapred.SequenceFileInputFormat`	specify yours here, provide libjars if required

Interview Peter J Thomas -Award Winning BI Expert

Here is an in depth interview with Peter J Thomas, one of Europe’s top Business Intelligence expert and influential thought leaders. Peter talks about BI tools, data quality, science careers, cultural transformation and BI and the key focus areas.

I am a firm believer that the true benefits of BI are only realised when it leads to cultural transformation. -Peter James Thomas

Ajay- Describe about your early career including college to the present.

Peter –I was an all-rounder academically, but at the time that I was taking public exams in the 1980s, if you wanted to pursue a certain subject at University, you had to do related courses between the ages of 16 and 18. Because of this, I dropped things that I enjoyed such as English and ended up studying Mathematics, Further Mathematics, Chemistry and Physics. This was not because I disliked non-scientific subjects, but because I was marginally fonder of the scientific ones. In a way it is nice that my current blogging allows me to use language more.

The culmination of these studies was attending Imperial College in London to study for a BSc in Mathematics. Within the curriculum, I was more drawn to Pure Mathematics and Group Theory in particular, and so went on to take an MSc in these areas. This was an intercollegiate course and I took a unit at each of King’s College and Queen Mary College, but everything else was still based at Imperial. I was invited to stay on to do a PhD. It was even suggested that I might be able to do this in two years, given my MSc work, but I decided that a career in academia was not for me and so started looking at other options.

As sometimes happens a series of coincidences and a slice of luck meant that I joined a technology start-up, then called Cedardata, late in 1988; my first role was as a Trainee Analyst / Programmer. Cedardata was one of the first organisations to offer an Accounting system based on a relational database platform; something that was then rather novel, at least in the commercial arena. The RDBMS in question was Oracle version 5, running on VAX VMS – later DEC Ultrix and a wide variety of other UNIX flavours. Our input screens were written in SQL*Forms 2 – later Oracle Forms – and more complex processing logic and reports were in Pro*C; this was before PL/SQL. Obviously this environment meant that I had to become very conversant with SQL*Plus and C itself.

When I joined Cedardata, they had 10 employees, 3 customers and annual revenue of just £50,000 ($80,000). By the time I left the company eight years later, it had grown dramatically to having a staff of 250, over 300 clients in a wide range of industries and sales in excess of £12 million ($20 million). It had also successfully floatated on the main London Stock Exchange. When a company grows that quickly the same thing tends to happen to its employees.

Cedardata was probably the ideal environment for me at the time; an organisation that grew rapidly, offering new opportunities and challenges to its employees; that was fiercely meritocratic; and where narrow, but deep, technical expertise was encouraged to be rounded out by developing more general business acumen, a customer-focused attitude and people-management skills. I don’t think that I would have learnt as much, or progressed anything like as quickly in any other type of organisation.

It was also at Cedardata that I had my first experience of the class of applications that later became known as Business Intelligence tools. This was using BusinessObjects 3.0 to write reports, cross-tabs and graphs for a prospective client, the UK Foreign and Commonwealth Office (State Department). The approach must have worked as we beat Oracle Financials in a play-off to secure the multi-million pound account.

During my time at Cedardata, I rose to become an executive and filled a number of roles including Head of Development and also Assistant to the MD / Head of Product Strategy. Spending my formative years in an organisation where IT was the business and where the customer was King had a profound impact on me and has influenced my subsequent approach to IT / Business alignment.

Ajay- How would you convince young people to take maths and science more? What advice would you give to policy makers to promote more maths and science students?

Peter- While I have used little of my Mathematics directly in my commercial career, the approach to problem-solving that it inculcated in me has been invaluable. On arriving at University, it was something of a shock to be presented with Mathematical problems where you couldn’t simply look up the method of solution in a textbook and apply it to guarantee success. Even in my first year I had to grapple with challenges where you had no real clue where to start. Instead what worked, at least most of the time, was immersing yourself in the general literature, breaking down the problem into more manageable chunks, trying different techniques – sometimes quite recherché ones – to make progress, occasionally having an insight that provides a short-cut, but more often succeeding through dogged determination. All of that sounds awfully like the approach that has worked for me in a business context.

Having said that, I was not terribly business savvy as a student. I didn’t take Mathematics because I thought that it would lead to a career, I took it because I was fascinated by the subject. As I mentioned earlier, I enjoyed learning about a wide range of things, but Science seemed to relate to the most fundamental issues. Mathematics was both the framework that underpinned all of the Sciences and also offered its own world where astonishing an beautiful results could be found, independent of any applicability; although it has to be said that there are few braches of Mathematics that have not be applied somewhere or other.

I think you either have this appreciation of Science and Mathematics or you don’t and that this happens early on.

Certainly my interest was supported by my parents and a variety of teachers, but a lot of it arose from simply reading about Cosmology, or Vulcanism, or Palaeontology. I watched a YouTube of Steven Jay Gould recently saying that when he was a child in the 1950s all children were “in” to Dinosaurs, but that he actually got to make a career out of it. Maybe all children aren’t “in” to dinosaurs in the same way today, perhaps the mystery and sense of excitement has gone.

In the UK at least there appears to be less and less people taking Science and Mathematics. I am not sure what is behind this trend. I read pieces that suggest that Science and Maths are viewed as being “hard” subjects, and people opt for “easier” alternatives. I think creative writing is one of the hardest things to do, so I’m not sure where this perspective comes from.

Perhaps some things that don’t help are the twin images of the Scientist as a white-coated boffin and the Mathematician as a chalk-covered recluse, neither of whom have much of a grasp on the world beyond their narrow discipline. While of course there is a modicum of truth in these stereotypes, they are far from being wholly accurate in my experience.

Perhaps Science has fallen off of the pedestal that it was placed on in the 1950s and 1960s. Interest in Science had been spurred by a range of inventions that had improved people’s lives and often made the inventors a lot of money. Science was seen as the way to a better tomorrow, a view reinforced by such iconic developments as the discovery of the structure of DNA, our ever deepening insight about sub-atomic physics and the unravelling of many mysteries of the Universe. These advances in pure science were supported by feats of scientific / engineering achievement such as the Apollo space programme. The military importance of Science was also put into sharp relief by the Manhattan Project; something that also maybe sowed the seeds for later disenchantment and even fear of the area.

The inevitable fallibility of some Scientists and some scientific projects burst the bubble. High-profile problems included the Thalidomide tragedy and the outcry, however ill-informed, about genetically modified organisms. Also the poster child of the scientific / engineering community was laid low by the Challenger disaster. On top of this, living with the scientifically-created threat of mutually-assured destruction probably began to change the degree of positivity with which people viewed Science and Scientists. People arrived at the realisation that Science cannot address every problem; how much effort has gone into finding a cure for cancer for example?

In addition, in today’s highly technological world, the actual nuts and bolts of how things work are often both hidden and mysterious. While people could relatively easily understand how a steam engine works, how many have any idea about how their iPod functions? Technology has become invisible and almost unimportant, until it stops working.

I am a little wary of Governments fixing issues such as these, which are the result of major generational and cultural trends. Often state action can have unintended and perverse results. Society as a whole goes through cycles and maybe at some future point Science and Mathematics will again be viewed as interesting areas to study; I certainly hope so. Perhaps the current concerns about climate change will inspire a generation of young people to think more about technological ways to address this and interest them in pertinent Sciences such as Meteorology and Climatology.

Ajay-. How would you rate the various tools within the BI industry like in a SWOT analysis (briefly and individually)?

Peter- I am going to offer a Politician’s reply to this. The really important question in BI is not which tool is best, but how to make BI projects successful. While many an unsuccessful BI manager may blame the tool or its vendor, this is not where the real issues lie.

I firmly believe that successful BI rests on four mutually reinforcing pillars:

understand the questions the business needs to answer,
understand the data available,
transform the data to meet the business needs and
embed the use of BI in the organisation’s culture.

If you get these things right then you can be successful with almost any of the excellent BI tools available in the marketplace. If you get any one of them wrong, then using the paragon of BI tools is not going to offer you salvation.

I think about BI tools in the same way as I do the car market. Not so many years ago there were major differences between manufacturers.

The Japanese offered ultimate reliability, but maybe didn’t often engage the spirit.

The Germans prided themselves on engineering excellence, slanted either in the direction of performance or luxury, but were not quite as dependable as the Japanese.

The Italians offered out-and-out romance and theatre, with mechanical integrity an afterthought.

The French seemed to think that bizarrely shaped cars with wheels as thin as dinner plates were the way forward, but at least they were distinctive.

The Swedes majored on a mixture of safety and aerospace cachet, but sometimes struggled to shift their image of being boring.

The Americans were still in the middle of their love affair with the large and the rugged, at the expense of convenience and value-for-money.

Stereotypically, my fellow-countrymen majored on agricultural charm, or wooden-panelled nostalgia, but struggled with the demands of electronics.

Nowadays, the quality and reliability of cars are much closer to each other. Most manufacturers have products with similar features and performance and economy ratings. If we take financial issues to one side, differences are more likely to related to design, or how people perceive a brand. Today the quality of a Ford is not far behind that of a Toyota. The styling of a Honda can be as dramatic as an Alfa Romeo. Lexus and Audi are playing in areas previously the preserve of BMW and Mercedes and so on.

To me this is also where the market for BI tools is at present. It is relatively mature and the differences between product sets are less than before.

Of course this doesn’t mean that the BI field will not be shaken up by some new technology or approach (in-memory BI or SaaS come to mind). This would be the equivalent of the impact that the first hybrid cars had on the auto market.

However, from the point of view of implementations, most BI tools will do at least an adequate job and picking one should not be your primary concern in a BI project.

Ajay- SAS Institute Chief Marketing Officer, Jim Davis (interviewed with this blog) points to the superiority of business analytics rather than business intelligence as an over hyped term. What numbers, statistics and graphs would you quote rather than semantics to help re direct those perceptions?

I myself use SAS,SPSS, R and find the decision management capabilities as James Taylor calls Decision Management much better enabled than by simple ETL tools or reporting and aggregating graphs tools in many BI tools.

Peter- I have expended quite a lot of energy and hundreds of words on this subject. If people are interested in my views, which are rather different to those of Jim Davis, then I’d suggest that they read them in a series of articles starting with Business Analytics vs Business Intelligence [URL http://peterthomas.wordpress.com/2009/03/28/business-analytics-vs-business-intelligence/ ].

I will however offer some further thoughts and to do this I’ll go back to my car industry analogy. In a world where cars are becoming more and more comparable in terms of their reliability, features, safety and economy, things like styling, brand management and marketing become more and more important.

As the true differences between BI vendors narrow, expect more noise to be made by marketing departments about how different their products are.

I have no problem in acknowledging SAS as a leader in Business Analytics, too many people I respect use their tools for me to think otherwise. However, I think a better marketing strategy for them would be to stick to the many positives of their own products. If they insist on continuing to trash competitors, then it would make sense for them to do this in a way that couldn’t be debunked by a high school student after ten seconds’ reflection.

Ajay- In your opinion what is the average RoI that a small, large medium enterprise gets by investing in a business intelligence platform. What advice would you give to such firms (separately) to help them make their minds?

Peter- The question is pretty much analogous to “What are the benefits of opening an office in China?” the answer is going to depend on what the company does; what their overall strategy is and how a China operation might complement this; whether their products and services are suitable for the Chinese market; how their costs, quality and features compare to local competitors; and whether they have cracked markets closer to home already.

To put things even more prosaically, “How long is a piece of string?”

Taking to one side the size and complexity of an organisation, BI projects come in all shapes and sizes.

Personally I have led Enterprise-wide, all-pervasive BI projects which have had a profound impact on the company. I have also seen well-managed and successful BI projects targeted on a very narrow and specific area.

The former obviously cost more than the latter, but the benefits are commensurately greater. In fact I would argue that the wider a BI project is spread, the greater its payback. Maybe lessons can be learnt and confidence built in an initial implementation to a small group, but to me the real benefit of BI is realised when it touches everything that a company does.

This is not based on a self-interested boosting of BI. To me if what we want to do is take better business decisions, then the greater number of such decisions that are impacted, the better that this is for the organisation.

Also there are some substantial up-front investments required for BI. These would include: building the BI team; establishing the warehouse and a physical architecture on which to deliver your application. If these can be leveraged more widely, then costs come down.

The same point can be made about the intellectual property that a successful BI team develops. This is one reason why I am a fan of the concept of BI Competency Centres [URL http://peterthomas.wordpress.com/2009/05/11/business-intelligence-competency-centres/ ].

I have been lucky enough to contribute to an organisation turning round from losing hundreds of millions of dollars to recording profits of twice that magnitude. When business managers cite BI as a major factor behind such a transformation, then this is clearly a technology that can be used to dramatic effect.

Nevertheless both estimating the potential impact of BI and measuring its actual effectiveness are non-trivial activities. A number of different approaches can be taken, some of which I cover in my article:

Measuring the benefits of Business Intelligence [URL http://peterthomas.wordpress.com/2009/02/26/measuring-the-benefits-of-business-intelligence/ ]. As ever there is no single recipe for success.

Ajay-. Which BI tool/ code are you most comfortable with and what are its salient points?

Peter –Although I have been successful with elements of the IBM-Cognos toolset and think that this has many strong points, not least being relatively user-friendly, I think I’ll go back to my earlier comments about this area being much less important than many others for the success of a BI project.

Ajay -How do you think cloud computing will change BI? What percentage of BI budgets go to data quality and what is eventual impact of data quality on results?

Peter –I think that the jury is still out on cloud computing and BI. By this I do not mean that cloud computing will not have an impact, but rather that it remains unclear what this impact will actually be.

Given the maturity of the market, my suspicion is that the BI equivalent of a Google is not going to emerge from nowhere. There are many excellent BI start-ups in this space and I have been briefed by quite a few of them.

However, I think the future of cloud computing in BI is likely to be determined by how the likes of IBM-Cognos, SAP-BusinessObjects and Oracle-Hyperion embrace the area.

Having said this, one of the interesting things in computing is how easy it is to misjudge the future and perhaps there is a potential titan of cloud BI currently gestating in the garage so beloved of IT mythology.

On data quality, I have never explicitly split out this component of a BI effort. Rather data quality has been an integral part of what we have done. Again I have taken a four-pillared approach:

improve how the data is entered;
make sure your interfaces aren’t the problem;
check how the data has been entered / interfaced;
and don’t suppress bad data in your BI.

The first pillar consists of improved validation in front-end systems – something that can be facilitated by the BI team providing master data to them – and also a focus on staff training, stressing the importance to the organisation of accurately recording certain data fields.

The second pillar is more to do with the general IT Architecture and how this relates to the Information Architecture, again master data has a role to play, but so does ensuring that the IT culture is one in which different teams collaborate well and are concerned about what happens to data when it leaves “their” systems.

The third pillar is the familiar world of after-the-fact data quality reports and auditing, something that is necessary, but not sufficient, for success in data quality.

Finally there is what I think can be one of the most important pillars; ensuring that the BI system takes a warts-and-all approach to data. This means that bad data is highlighted, rather than being suppressed. In turn this creates pressure for the problems to be addressed where they arise and creates a virtuous circle.

For those who might be interested in this area, I expand on it more in Using BI to drive improvements in data quality [URL http://peterthomas.wordpress.com/2009/02/11/using-bi-to-drive-improvements-in-data-quality/ ].

Ajay- You are well known with England’s rock climbing and boulder climbing community. A fun question- what is the similarity between a BI implementation/project and climbing a big boulder.

Peter –I would have to offer two minor clarifications.

First it is probably my partner who is better known in climbing circles, via here blog [URL http://77jenn.blogspot.com/ ] and articles and reviews that she has written for the climbing press; though I guess I can take credit for most of the photos and videos.

Second, particularly given the fact that a lot of our climbing takes place in Wales, I should acknowledge the broader UK climbing community and also mention our most mountainous region of Scotland.

Despite what many inhabitants of Sheffield might think to the contrary, there is life beyond Stanage Edge [URL http://en.wikipedia.org/wiki/Stanage ].

I have written about the determination and perseverance that are required to get to the top of a boulder, or indeed to the top of any type of climb [URL http://peterthomas.wordpress.com/2009/03/31/perseverance/ ].

I think those same qualities are necessary for any lengthy, complex project. I am a firm believer that the true benefits of BI are only realised when it leads to cultural transformation. Certainly the discipline of change management has many parallels with rock climbing. You need a positive attitude and a strong belief in your ultimate success, despite the inevitable setbacks. If one approach doesn’t yield fruit then you need to either fine-tune or try something radically different.

I suppose a final similarity is the feeling that you get having completed a climb, particularly if it is at the limit of your ability and has taken a long time to achieve. This is one of both elation and deep satisfaction, but is quickly displaced by a desire to find the next challenge.

This is something that I have certainly experienced in business life and I think the feelings will be familiar to many readers.

Biography-

Peter Thomas has led all-pervasive, Business Intelligence and Cultural Transformation projects serving the needs of 500+ users in multiple business units and service departments across 13 European and 5 Latin American countries. He has also developed Business Intelligence strategies for operations spanning four continents. His BI work has won two industry awards including “Best Enterprise BI Implementation”, from Cognos in 2006 and “Best use of IT in Insurance”, from Financial Sector Technology in 2005. Peter speaks about success factors in both Business Intelligence and the associated Change Management at seminars across both Europe and North America and writes about these areas and many other aspects of business, technology and change on his blog [URL http://peterthomas.wordpress.com ].

Training on R

Here is an interesting training from Revolution Computing

New Training Course from REvolution Computing
High-Performance Computing with R
July 31, 2009 – Washington, DC – Prior to JSM
Time: 9am – 5pm
$600 commercial delegates, $450 government, $250 academic

Click Here to Register Now!

An overview of available HPC technologies for the R language to enable faster, scalable analytics that can take advantage of multiprocessor capability will be presented in a one-day course. This will include a comprehensive overview of REvolution’s recently released R packages foreach and iterators, making parallel programming easier than ever before for R programmers, as well as other available technologies such as RMPI, SNOW and many more. We will demonstrate each technology with simple examples that can be used as starting points for more sophisticated work. The agenda will also cover:

Identifying performance problems

Profiling R programs

Multithreading, using compiled code, GPGPU

Multiprocess computing

SNOW, MPI, NetWorkSpaces, and more

Batch queueing systems

Dealing with lots of data

Attendees should have basic familiarity with the R language—we will keep examples elementary but relevant to real-world applications.

This course will be conducted hands-on, classroom style. Computers will not be provided. Registrants are required to bring their own laptops.

For the full agenda Click Here or Click Here to Register Now!”

Source; www.revolution-computing.com

Disclaimer- I am NOT commerically related to REvolution, just love R. I do hope REvolution chaps do spend tiny bit of time improving the user GUI as well not just for HPC purposes.

They recently released some new packages free to the CRAN community as well

The release of 3 new packages for R designed to allow all R users to more quickly handle large, complex sets of data: iterators, foreach and doMC.

* iterators implements the “iterator” data structure familiar to users of languages like Java,

C# and Python to make it easy to program useful sequences – from all the prime numbers to the columns of a matrix or the rows of an external database.

* foreach builds on the “iterators” package to introduce a new way of programming loops in R. Unlike the traditional “for” loop, foreach runs multiple iterations simultaneously, in parallel. This makes loops run faster on a multi-core laptop, and enables distribution of large parallel-processing problems to multiple workstations in a cluster or in the cloud, without additional complicated programming. foreach works with parallel programming backends for R from the open-source and commercial domains.

* doMC is an open source parallel programming backend to enable parallel computation with “foreach” on Unix/Linux machines. It automatically enables foreach and iterator functions to work with the “multicore” package from R Core member Simon Urbanek

The new packages have been developed by REvolution Computing and released under open source licenses to the R community, enabling all existing R users

citation:

http://www.revolution-computing.com/aboutus/news-room/2009/breakthrough-parallel-packages-and-functions-for-r.php

Please share:

Please share:

Too many CRM systems are not usable. They are built by engineers that think of the system as a large database and the systems often look like a database making it difficult to use by the sales, support, and marketing people.

-John F Moore

Please share:

2 rhlapply

1.1 RHIPE Options for mapreduce

Citation:http://ml.stat.purdue.edu/rhipe/index.html

Great exciting news for the world of computing remotely.

Please share:

I am a firm believer that the true benefits of BI are only realised when it leads to cultural transformation. -Peter James Thomas

Please share:

Please share:

Please share: