DECISION STATS

Interview Eric Siegel, Ph.D., President, Prediction Impact

An interview with Eric Siegel, Ph.D., President of Prediction Impact, Inc. and founding chair of Predictive Analytics World.

Ajay- What does this round of Predictive Analytics World have which was not there in the edition earlier in the year?

Eric- Predictive Analytics World (pawcon.com) – Oct 20-21 in DC delivers a fresh set of 25 vendor-neutral presentations across verticals employing predictive analytics, such as banking, financial services, e-commerce, education, healthcare, high technology, insurance, non-profits, publishing, retail and telecommunications.

PAW features keynote speaker, Stephen Baker, author of The Numerati and Senior writer at BusinessWeek. His keynote is described at www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-2

A strong representation of leading enterprises have signed up to tell their stories — speakers will present how predictive analytics is applied at Aflac, AT&T Bell South, Amway, The Coca-Cola Company, Financial Times, Hewlett-Packard, IRS, National Center for Dropout Prevention, The National Rifle Association, The New York Times, Optus (Australian telecom), PREMIER Bankcard, Reed Elsevier, Sprint-Nextel, Sunrise Communications (Switzerland), Target, US Bank, U.S. Department of Defense, Zurich — plus special examples from Anheuser-Busch, Disney, HSBC, Pfizer, Social Security Administration, WestWind Foundation and others.

To see the entire agenda at a glance: www.predictiveanalyticsworld.com/dc/2009/agenda_overview.php

We’ve added a third workshop, offered the day before (Oct 19), “Hands-On Predictive Analytics.” There’s no better way to dive in than operating real predictive modeling software yourself, hands-on. For more info: www.predictiveanalyticsworld.com/dc/2009/handson_predictive_analytics.php

Ajay- What do academics, corporations and data miners gain from this conference? List four bullet points for the specific gains.

Eric- A. First, PAW’s experienced speakers provide the “how to” of predictive analytics. PAW is a unique conference in its focus on the commercial deployment of predictive analytics, rather than research and development. The core analytical technology is established and proven, valuable as-is without additional R&D — but that doesn’t mean it’s a “cakewalk” to employ it successfully to ensure value is attained. Challenges include data requirements and preparation, integration of predictive models and their scores into existing organizational systems and processes, tracking and evaluating performance, etc. There’s no better way to hone your skills and cultivate an informed plan for your organization’s efforts than hearing how other organizations did it.

B. Second, PAW covers the latest state-of-the-art methods produced by research labs, and how they provide value in commercial deployment. This October, almost all sessions in Track 2 are at the Expert/Practitioner-level. Advanced topics include ensemble models, uplift modeling (incremental modeling), model scoring with cloud computing, predictive text analytics, social network analysis, and more.

PAW’s pre- and post-conference workshops round out the learning opportunities. In addition to the hands-on workshop mentioned above, there is a course covering core methods, “The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes” (www.predictiveanalyticsworld.com/dc/2009/predictive_modeling_methods.php) and a business-level seminar on decision automation and support, “Putting Predictive Analytics to Work” (www.predictiveanalyticsworld.com/dc/2009/predictive_analytics_work.php).

C. Third, the leading predictive analytics software vendors and consulting firms are present at PAW as sponsors and exhibitors, available to provide demos and answer your questions. What do the predictive analytics solutions do, how do they compare, and which is best for you? PAW is the one-stop-shop for selecting the tool or solution most suited to address your needs.

D. Fourth, PAW provides a unique, focused opportunity to network with colleagues and establish valuable contacts in the predictive analytics industry. Mingle, connect and hang out with professionals facing similar challenges (during coffee breaks, at meals, and at the reception).

Ajay- How do you balance the interests of the various competing software products and companies that sponsor such an event?

Eric- As a vendor-neutral event, PAW’s core program of 25 sessions is booked exclusively with enterprise practitioners, thought leaders and adopters, with no predictive analytics software vendors speaking or co-presenting. These sessions provide substantive content with take-aways which provide value that’s independent of any particular software solution — no product pitches! Beyond these 25 sessions are three short sponsor sessions that are demarcated as such, and, despite being branded, generally prove to be quite substantive as well. In this way, PAW delivers a high quality, unbiased program.

Supplementing this vendor-neutral program, the room right next door has an expo where attendees have all the access to software and solution vendors they could want (cf. my answer to the prior question regarding software vendors, above).

Here are a couple more PAW links:

For informative PAW event updates:
www.predictiveanalyticsworld.com/notifications.php

To sign up for the PAW group on LinkedIn, see:
www.linkedin.com/e/gis/1005097

Ajay- Describe your career in science, including the research that you specialize in. How would you motivate students today to go for science careers?

Eric- Well, first off, my work as a predictive analytics consultant, instructor and conference chair is in the application of established technology, rather than the research and development of new or improved methods.

But the Ph.D. next to my name reveals my secret past as an “academic”. Pure research is something I really enjoyed and I kind of feel like I had a brain transplant in order to change to “real world work”. I’m glad I made the change, although I see good sides to both types of work (really, they’re like two entirely different lifestyles).

In my research I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as a kid, that space travel would in fact be a huge pain in the neck, nothing in science has ever seemed nearly as exciting.

Predictive analytics is an endeavor in machine learning. A predictive model is the encoding of a set of rules or patterns or regularities at some level. The model is the thing output by automated, number-crunchin’ analysis and, therefore, is the thing “learned” from the “experience” (data). The “magic” here is the ability of these methods to find a model that performs not only over the historical data on your disk drive, but that will perform equally well for tomorrow’s new situations. That ability to generalize from the data at hand means the system has actually learned something.
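The point about generalization can be illustrated with a holdout check. The following is only a minimal sketch on synthetic data: the feature name (visits), the hidden purchase propensity, and the one-threshold "model" are all hypothetical and are not taken from the interview. The rule is fitted on one part of the data and then evaluated on rows it never saw.

```python
import random

def make_customers(n, seed=0):
    """Synthetic (hypothetical) data: one feature, one binary outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        visits = rng.randint(0, 20)             # feature: site visits last month
        p_buy = min(0.9, 0.05 + 0.04 * visits)  # hidden "true" purchase propensity
        bought = 1 if rng.random() < p_buy else 0
        rows.append((visits, bought))
    return rows

def accuracy(rows, threshold):
    """Share of rows where the rule 'visits >= threshold means buys' is right."""
    hits = sum(1 for visits, bought in rows if (visits >= threshold) == bool(bought))
    return hits / len(rows)

def fit_threshold(train_rows):
    """'Learn' the threshold that is most accurate on the training rows only."""
    return max(range(0, 22), key=lambda t: accuracy(train_rows, t))

rows = make_customers(1000)
train, holdout = rows[:700], rows[700:]  # holdout rows are never used for fitting

threshold = fit_threshold(train)
train_acc = accuracy(train, threshold)
holdout_acc = accuracy(holdout, threshold)
print(f"threshold={threshold} train={train_acc:.3f} holdout={holdout_acc:.3f}")
```

If the holdout accuracy is close to the training accuracy, the rule has plausibly captured a real regularity rather than memorized the historical rows.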

And indeed the ability to learn and apply what’s been learned turns out to provide plenty of business value, as I imagined back in the lab. The output of a predictive model is a predictive score for each individual customer or prospect. The score in turn directly informs the business decision to be taken with that individual customer (to contact or not to contact; with which offer to contact, etc.) – business intelligence just doesn’t get more actionable than that.
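As a rough illustration of a score informing a contact decision, here is a small sketch. The coefficients, feature names, contact cost and response value are all made-up assumptions, and the logistic form is just one common choice, not something stated in the interview.

```python
import math

# Hypothetical coefficients; in practice these would come from a fitted model.
INTERCEPT = -3.0
WEIGHTS = {"visits": 0.15, "past_purchases": 0.8}

def score(customer):
    """Return a predicted response probability between 0 and 1."""
    z = INTERCEPT + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def decide(customer, contact_cost=1.0, value_if_responds=20.0):
    """Contact only when the expected value of contacting exceeds its cost."""
    p = score(customer)
    return "contact" if p * value_if_responds > contact_cost else "do not contact"

customers = [
    {"id": "A", "visits": 0, "past_purchases": 0},
    {"id": "B", "visits": 10, "past_purchases": 2},
]
for c in customers:
    print(c["id"], round(score(c), 3), decide(c))
```

The decision rule is deliberately separate from the scoring function, so the business can change costs and offer values without refitting the model.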

For the impending student, I’d first point out the difference between applied science and research science. Research science is fun in that you have the luxury of abstraction and are usually fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and have an irrefutable impact.

Ajay- In your opinion, what are the top five conferences in analytics and data mining in the world, including PAW?

Eric- KDD – The leading event for research and development of the core methods behind the commercial deployments covered at PAW (“Knowledge Discovery and Data Mining”).

ICML – Another long-standing research conference on machine learning (core data mining).

eMetrics.org – For online marketing optimization and web analytics

Text Analytics Summit – Text mining can leverage “unstructured data” (text) to augment predictive analytics; the chair of this conference is speaking at PAW on just that topic: www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-15

Predictive Analytics World, the business-focused event for predictive analytics professionals, managers and commercial practitioners – focused on the commercial deployment of predictive analytics: pawcon.com

Ajay- Will PAW 2009 have video archives or podcasts for people not able to attend on site?

Eric- While the PAW conferences emphasize in-person participation, we are in the planning stages for future webcasts and other online content. PAW’s “Predictive Analytics Guide” has a growing list of online resources: www.predictiveanalyticsworld.com/predictive_analytics.php

Ajay- How do you think social media marketing can help in these conferences?

Eric- Like most events, PAW leverages social media to spread the word.

But perhaps most pertinent is the other way around: predictive analytics can help social media by increasing relevancy, dynamically selecting the content to which each reader or viewer is most likely to respond.

Ajay- Do you have any plans to take PAW more international? Any plans for a PAW journal for trainings and papers?

Eric- We’re in discussions on these topics, but for now I can only say, stay tuned!

Biography

The president of Prediction Impact, Inc., Eric Siegel is an expert in predictive analytics and data mining and a former computer science professor at Columbia University, where he won awards for teaching, including graduate-level courses in machine learning and intelligent systems – the academic terms for predictive analytics. He has published 13 papers in data mining research and computer science education, has served on 10 conference program committees, has chaired a AAAI Symposium held at MIT, and is the founding chair of Predictive Analytics World.

For more on Predictive Analytics World-

Predictive Analytics World Conference
October 20-21, 2009, Washington, DC
www.predictiveanalyticsworld.com
LinkedIn Group: www.linkedin.com/e/gis/1005097

Interview Gary D. Miner Author and Professor

Here is an interview with Gary Miner, Ph.D., who has been in the data mining business for almost 30 years and is a pioneer in healthcare studies pertaining to Alzheimer’s disease. He is also co-author of the “Handbook of Statistical Analysis and Data Mining Applications”. Gary talks about how he has seen data mining change over the years, health care applications, his book, and lessons from his experience.


Ajay- Describe your career in science, starting from college till today. How would you interest young students in science careers today, in the midst of the recession?

Gary – I knew that I wanted to be in “Science” even before college days, taking all the science and math courses I could in high school. This continued in undergraduate college years at a private college [Hamline University, St. Paul, Minnesota……..older than the State of Minnesota, founded in 1854, and had the first Medical School, later “sold” to the University of Minnesota] as a Biology and Chemistry major, with a minor in education. From there I did an M.S. conducting a “Physiological genetics research project”, and then a Ph.D. at another institution where I worked on Genetic Polymorphisms of Mouse blood enzymes. So through all of this, I had to use statistics to analyze the data. My M.S. was analyzed before the time of even “electronic calculators”, so I used, if you can believe this, a “hand cranked calculator”, rented for one summer, to analyze my M.S. dataset. By the time my Ph.D. thesis data was being analyzed, electronic calculators were available, but the big main-frame computers were on college campuses, so I punched the data into CARDS, walked down the hill to the computing center, and dropped off the stack of cards, to come back the next day to get “reams of output” on large paper [about 15” by 18”, folded in a stack, if anyone remembers those days …]. I then spent about 30 years doing medical research in academic environments with the emphasis on genetics, biochemistry, and proteomics in the areas of mental illness and Alzheimer’s Disease, which became my main area of study, publishing the first book in 1989 on the GENETICS OF ALZHEIMER’S DISEASE.

Today, in my “semi-retirement careers”, one side-line outreach is working with medical residents on their research projects, which I’ve been doing for about 7 or 8 years now. This involves design of the research project, data collection, and most importantly “effective and accurate” analysis of the datasets. I find this a way I can reach out to the younger generation to interest them not only in “science”, but in doing “science correctly”. As you probably know, we are in the arena of the “Dumbing of America”; anti-science, if you wish. I’ve seen this happening for at least 30 years, during the 1980’s, 1990’s, and continuing into this Century. Even the medical residents I get to work with each year have been going “downhill” yearly in their ability to “problem solve”. I believe this is an effect of this “Dumbing of America”.

There are several books coming out on this Dumbing of America this summer; one the first week of June, another on July 12, and another in September [see the attached PPT for slides with the covers of these 3 books]. It is a real problem, as Americans over the past few decades have moved towards “wanting simple answers”, and most things in the “real world”, i.e. reality, are not simple………..that’s where Science comes in.

A recent 2008 study done by the School of Public Health at Ohio University showed that in up to 88% of the published scientific papers in a top respected cancer journal, the statistics were used INCORRECTLY and/or the CONCLUSION was INCORRECT. When my wife and I both did Post-Docs in Psychiatric Epidemiology in 1980-82, basically doing an MPH, the first words out of the mouth of the “Biostats – Epidemiology” professor in the first lecture to the incoming MPH students were “We might as well throw out most of the medical research literature of the past 25 years, as it has either not been designed correctly or statistics have been used incorrectly”!!! ……That caught my attention. And following medical research [and medicine in general] I can tell you that “not much has changed in the past 25 years since then”, and thus that puts us “50 years behind in medical research” and medicine. ANALOGY: If some of our major companies that are successfully using predictive analytics to organize and efficiently run their organizations took on the “mode of operation” of medicine and medical research, they’d be “bankrupt” in 6 months …. That’s what I tell my students.

Ajay- Describe some of the exciting things data mining can do to lower health care costs and provide more people with coverage.

Gary- As mentioned above, my personal feeling is that “medicine / health care” is 50 years “behind the times”, compared to the efficiency needed to successfully survive in this Global Economy; corporations and organizations like Wal-Mart, INTEL, and many of our Pharmaceutical Companies have used data mining / predictive analytics to survive successfully. Wal-Mart especially: Wal-Mart has its own set of data miners, who were writing their own procedures in the early 1990’s ………..before most of us ever heard of data mining; that is why Wal-Mart can go into China today, and open a store in any location, and know almost to 99% accuracy 1) how many checkout stands are needed, 2) what products to stock, 3) where in the store to stock them, and 4) what their profit margin will be. They have done this through very accurate “Predictive Analytics” modeling.

Other “ingrained” USA corporations have NOT grabbed onto this “most accurate” technology [i.e. predictive analytics modeling], and are reaping the “rewards” of impending bankruptcy and disappearance today. Examples in the news, of course, are our 3 big automakers in Detroit. If they had engaged effective data mining / modeling in the late 1990’s they could have avoided their current problems. I see the same for many of our oldest and largest USA Insurance Companies………..they are “middle management fat”, and I’ve seen their ratings go down over the past 10 years from an A rating to even a C rating [for the company in which I have my auto insurance; you might ask me why I stay? …. An agent who is a friend, BUT it is frustrating, and this company’s “mode of operation” is completely “customer un-friendly”.], while new insurance companies have “grabbed” onto modern technology, and are rising stars.

So my influence on the younger generation is to have my students do research and DATA ANALYSIS correctly.

Ajay- Describe your book “HANDBOOK OF STATISTICAL ANALYSIS & DATA MINING APPLICATIONS”. Who would be the target audience for this, and can corporate data miners gain from it as well?

Gary- There are several target audiences: The main audience we were writing for, after our Publisher looked at what “niches” had been un-met in data mining literature, was the professional in smaller and middle sized businesses and organizations who needed to learn about “data mining / predictive analytics” “fast”…..e.g. maybe situations where the company did not have a data analysis group using predictive analytics, but the CEO’s and Professionals in the company knew they needed to learn and start using predictive analytics to “stay alive”. This seemed like potentially a very large audience. The book is oriented so that one does NOT have to start at chapter 1, and read sequentially, but instead can START WITH A TUTORIAL. Working through a tutorial, I’ve found in my 40 years of being in education, is the fastest way for a person to learn something new. And this has been confirmed………..I’ve had newcomers to data mining, who have already gotten the HANDBOOK, write me and say: “I’ve gone through a bunch of tutorials, and am finding that I am really learning ‘how to do this’……..I’ve read other books on ‘theory’, but just didn’t get the ‘hang of it’ from those”. My data mining consultants at StatSoft, who travel and work in “real world” situations every day, and who wrote maybe 1/3 of the tutorials in the HANDBOOK, tell me: “A person can go through the TUTORIALS in the HANDBOOK, and know 70% of what we who are doing predictive analytics consulting every day know !!!”

But there are other audiences: Corporate data miners can find it very useful also, as a “way of thinking as a data miner” can be gained from reading the book, as was expressed by one of the Amazon.com 5-STAR reviews: “What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.”

But we’ve had others who have told us they will use it as an extra textbook in their Business Intelligence and Data Mining courses, because of the “richness” of the tutorials. Here’s a comment on the Amazon reviews from a Head of Business School who has maybe over 100 graduate students doing data mining:

“5.0 out of 5 stars. At last, a useable data mining book”

“This is one of the few, of many, data mining books that delivers what it promises. It promises many detailed examples and cases. The companion DVD has detailed cases and also has a real 90 day trial copy of Statistica. I have taught data mining for over 10 years and I know it is very difficult to find comprehensive cases that can be used for classroom examples and for students to actually mine data. The price of the book is also very reasonable especially when you compare the quantity and quality of the material to the typical intro stat book that usually costs twice as much as this data mining book.

The book also addresses new areas of data mining that are under development. Anyone that really wants to understand what data mining is about will find this book infinitively useful.”

So, I think the HANDBOOK will see use in many college classrooms.

Ajay- A question I never get the answer to is which data mining tool is good for what and not so good for what. Could you help me out with this one? In your opinion, which of the data mining and statistical tools you have used in your 40 years in this profession would you recommend for some uses, and which would you not recommend for others (e.g. SAS, SPSS, KXEN, StatSoft, R, etc.)?

Gary- This is a question I can’t answer well; but my book co-author, Robert Nisbet, Ph.D. can. He has used most of these software packages, and in fact has written 2 reviews over the past 6 years in which most of these have been discussed. I like “cutting edge endeavors”; that has been the modus operandi of my ‘career’, so when I took this “semi-retirement position” as a data mining consultant at StatSoft, I was introduced to DATA MINING, as we started developing STATISTICA Data Miner shortly after I arrived. So most of my experience is with STATISTICA Data Miner, which of course has always been rated NO 1 in all the reviews on data miner software done by Dr. Nisbet – I believe this is primarily due to the fact that STATISTICA was written for the PC from the beginning, and thus does not have any legacy “main frame computer” coding in its history; secondly, StatSoft has been able to move rapidly to make changes as business and government data analysis needs change; and thirdly and most importantly, STATISTICA products have very “open architecture”, “flexibility”, and “customization” with everything “built together / workable together” as one package. And of course the graphical output is second to none – that is how STATISTICA originally got its reputation. So I find no need of any other software, as if I need a new algorithm, I can program it to work with the “off the shelf” STATISTICA Data Miner algorithms, and thus get anything I need with the full graphical and other outputs seamlessly available.

Ask Bob Nisbet to answer this question, as he has the background to do so.

Ajay- In your opinion, what are the most interesting trends to watch out for in 2009-2010 in data mining?

Gary- Things move so rapidly in this 21st century world, that this is difficult to say. Let me answer this with “hindsight”:

In late October, 2008 I wrote the first draft of Chapter 21 for the HANDBOOK. This was the “future directions of data mining”. You can look in that chapter yourself to find the 4 main areas I decided to focus on. One was on “social networking”, and one of the new examples used was TWITTER. At that time, less than one year ago, no one knew if TWITTER was going to amount to much or not; it was a big question. Well, on Jan 15 when the US-AIRWAYS A320 Airbus made an emergency landing in the Hudson River, I got an automatic EMAIL message from CNN [that I subscribe to] telling me that a “plane was down in the Hudson, watch it live” …………I clicked on the live video: The voice from the Helicopter overhead was saying: “We see a plane, half sunk into the water, but no people? What has happened to the people? Are they all dead?………” Well, as it turned out, the CNN Helicopters had spent nearly one hour searching the river for the plane, as had other news agencies. BUT THE “ENTIRE” WORLD ALREADY KNEW !!! … Why? A person on a ferry that was crossing the river close to the crash landing used his iPhone, snapped a photo, uploaded it to TwitPic and sent a TWITTER message, and this was re-tweeted around the world. The world knew in “seconds to minutes”, whereas the traditional NEWS MEDIA was 1 hour late on the scene, when ALL the PEOPLE had been rescued and were on-shore in a warm building within 45 minutes of the landing. THE TRADITIONAL NEWS MEDIA ARRIVED 15 MINUTES AFTER EVERYTHING HAD HAPPENED !!!! ………AT THIS POINT we ALL KNEW that TWITTER was a new phenomenon ……….and it started growing, with 10,00 people an hour joining at one point in the spring of this year, and who knows what the rate is today. TWITTER has become a most important part not only of “social networking” among friends, but for BUSINESS, with companies even sending out “Parts Availability” lists to their dealers, etc.

TWITTER affected Chapter 21…………..I immediately re-wrote Chapter 21, including this first photo of the Hudson Plane crash-landing with all the people standing on the wings. BUT, not the end of this story: By the time the book was about to go to press, TWITTER had decided that “ownership” of uploaded photos resided with the photographer, and the person who took this original US-AIRBUS – PEOPLE ON THE WINGS photo wanted $600 for us to publish it in the HANDBOOK. So, I re-wrote again [the chapter was already “set” in page proofs……….so we had to make the changes directly at the printer]………this time finding another photo uploaded to social media, but in this case the person had “checked” the box to put the photo in public domain.

So TWITTER is one that I predicted would become important, but I’d thought it would be months AFTER the HANDBOOK was released in May, not last January!!!

Other things we presented in Chapter 21 about the “future of data mining” involved “photo / image recognition”, among others. The “Image Recognition”, and more importantly “movement recognition / analysis” for things like Physical Therapy and other medical areas may be slower to evolve and fully develop, but are immensely important. The ability to analyze such “Three-dimensional movement data” is already available in rudimentary form in our version 9 of STATISTICA [just released in June], and anyone could implement it fully with MACROS, but it probably will be some time before it is fully feasible from a business standpoint to develop it with fully automatic “point and click” functionality to make it readily accessible for anyone’s use.

Ajay- What would your advice be to a young statistician just starting his research career?

Gary- Make sure you delve FULLY into the subject areas……….you need to know BOTH the “domain” of the data you are working with, and “correct” methods of data analysis, especially when using the traditional p-value statistics. Today’s culture is too much on “superficiality”………..good data analysis requires “depth” of understanding. One needs to FOCUS ………good FOCUS can’t be done with elaborate “multi-tasking”. Granted, today’s youth [the “Technology-Inherited”] probably have their brains “wired differently” than the “Technology-Immigrants” like myself [i.e. the older generations], but never-the-less, I see ERRORS all over the place in today’s world, from “typos” in magazines and newspapers, to web page paragraphs, links that don’t work, etc etc ……….and I conclude that this is all due to NON-FOCUSED / MULTI-TASKING people. You can’t drive a car / bus / train and TEXT MESSAGE at the same time ……….the scientific tests that have been conducted show that it takes 20 times as long for a TEXT MESSAGING driver to stop as for a driver fully focused on the road, when given a “danger” warning. [Now, maybe this scientific experiment used ALL TECHNOLOGY-IMMIGRANTS as drivers?? If so, the scientific design was “flawed” ……..they should have used BOTH Technology-Immigrants and Technology-Inheritants as participants in the study. Then we’d have 2 independent, or predictor, variables: Age and TEXT MESSAGING…..]

Short Bio-

Professor, 30 years medical research in genetics, DNA, Proteins, Neuropsychology of Schizophrenia and Alzheimer’s Disease……….now semi-retired position as DATA MINING CONSULTANT – SENIOR STATISTICIAN

Rent a TextBook Chegg.com

Here is a great site recommended by NYT (http://www.nytimes.com/2009/07/05/business/05ping.html)

It is called Chegg.com, and it allows you to rent textbooks just like movies and helps cut your textbook expenses in half.

Wisdom from Elder Research- 10 Top Data Mining Mistakes

This is a great data mining tutorial from John Elder. Visit his site at http://datamininglab.com/ for more great video tutorials, all very lucid, easy to understand and powerful.

Zementis News

From a Zementis Newsletter- interesting advances on the R-in-the-cloud front. Thanks to Rom Ramos for sending this, and I hope Zementis and someone like Google/Biocep team up so all I need to make a model is some data and a browser. 🙂

The R Journal – A Refereed Journal for the R Project Launches

As a sign of the open source R project for statistical computing gaining momentum, the R newsletter has been transformed into The R Journal, a refereed journal for articles covering topics that are of interest to users or developers of R. As a supporter of the R PMML Package (see blog and video tutorial), we are honored that our article “PMML: An Open Standard for Sharing Models” which emphasizes the importance of the Predictive Model Markup Language (PMML) standard is part of the inaugural issue. If you already develop your models in R, export them via PMML, then deploy and scale your models in ADAPA on the Amazon EC2 cloud. Read the full story.

Integrating Predictive Analytics via Web Services

Predictive analytics will deliver more value and become more pervasive across the enterprise, once we manage to seamlessly integrate predictive models into any business process. In order to execute predictive models on-demand, in real-time or in batch mode, the integration via web services presents a simple and effective way to leverage scoring results within different applications. For most scenarios, the best way to incorporate predictive models into the business process is as a decision service. Query the model(s) daily, hourly, or in real-time, but if at all possible try to design a loosely coupled system following a Service Oriented Architecture (SOA).

Using web services, for example, one can quickly improve existing systems and processes by adding predictive decision models. Following the idea of a loosely coupled architecture, it is even possible to use integration tools like Jitterbit or Microsoft SQL Server Integration Services (SSIS) to embed predictive models that are deployed in ADAPA on the Amazon Elastic Compute Cloud without the need to write any code. Of course, there is also the option to use custom Java code or MS SQL Server SSIS Scripting for which we provide a sample client application. Read the full story.
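To make the "decision service" idea concrete, here is a minimal sketch of a loosely coupled scoring interface: JSON in, JSON out, with the caller knowing nothing about the model behind it. This is not the ADAPA API; the function names, the JSON field names, the stand-in model and the 0.3 cutoff are all hypothetical, and a real deployment would wrap a handler like this in a web service endpoint.

```python
import json

def score_model(features):
    """Stand-in for a deployed model; the weights here are made up."""
    return 0.1 + 0.02 * features.get("visits", 0)

def handle_request(body):
    """Decision service entry point: JSON request text in, JSON response text out."""
    try:
        features = json.loads(body)["features"]
        if not isinstance(features, dict):
            raise TypeError("features must be an object")
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected JSON with a 'features' object"})
    p = min(1.0, max(0.0, score_model(features)))  # clamp to a valid probability
    decision = "contact" if p >= 0.3 else "skip"   # hypothetical business cutoff
    return json.dumps({"score": round(p, 4), "decision": decision})

print(handle_request('{"features": {"visits": 15}}'))
```

Because callers depend only on the request and response shapes, the model behind score_model can be replaced or retrained without touching the calling applications.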

About ADAPA®:

A fast real-time deployment environment for Predictive Analytics Models – a stand-alone scoring engine that reads .xml based PMML descriptions of models and scores streams of data. Developed by Zementis – a fully hosted Software-as-a-Service (SaaS) solution on the Amazon Elastic Compute Cloud. It’s easy to use and remarkably inexpensive starting at only $0.99 per instance hour.
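As a rough picture of what "reading an XML model description and scoring a stream of records" involves, here is a toy sketch. It is not ADAPA and not a schema-complete PMML file: the fragment below is a simplified, PMML-style regression table without the XML namespace, data dictionary or mining schema a real document carries, and the field names and coefficients are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, PMML-style regression fragment (an assumption, not a full PMML file).
MODEL_XML = """
<PMML version="4.0">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def load_model(xml_text):
    """Parse the intercept and per-field coefficients out of the XML."""
    table = ET.fromstring(xml_text).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefs = {p.get("name"): float(p.get("coefficient"))
             for p in table.findall("NumericPredictor")}
    return intercept, coefs

def score_stream(model, records):
    """Lazily score each incoming record with the linear model."""
    intercept, coefs = model
    for rec in records:
        yield intercept + sum(c * rec[name] for name, c in coefs.items())

model = load_model(MODEL_XML)
records = [{"age": 40, "income": 50000}, {"age": 25, "income": 30000}]
for value in score_stream(model, records):
    print(round(value, 2))
```

The model lives entirely in the XML text, so the scoring code stays the same when a new model is exported from a tool such as R.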

Interview John Moore CTO, Swimfish

Here is an interview with John F Moore, VP Engineering and Chief Technology Officer at Swimfish, a provider of business solutions and CRM. A well known figure in Technology and CRM circles, John talks of Social CRM, Technology Offshoring, Community Initiatives and his own career.

Too many CRM systems are not usable. They are built by engineers that think of the system as a large database and the systems often look like a database making it difficult to use by the sales, support, and marketing people.

-John F Moore

Ajay – Describe your career journey from college to CTO. What changes in mindset did you undergo along the journey? What advice would you give to young students to take up science careers ?

John- First, I wanted to take time to thank you for the interview offer. I graduated from Boston University in 1988 with a degree in Electrical Engineering. At the time of my graduation I found myself very interested in the advances taking place on the personal computing front by companies like Lotus with their 1-2-3 product. I knew that I wanted to be involved with these efforts and landed my first job in the software space as a Software Quality Engineer working on 1-2-3 for DOS.

I spent the first few years of my career working at Lotus as a developer, a quality engineer, and manager, on products such as Lotus 1-2-3 and Lotus Notes. Throughout those early career years I learned a lot and focused on taking as many classes as possible.

From Lotus I sought out the start-up environment and in early 2000 joined a startup named Brainshark (http://www.brainshark.com). Brainshark was, and is, focused on delivering an asynchronous communication platform on the web and was one of the early providers of SaaS. In my seven years at Brainshark I learned a lot about delivering an enterprise-class SaaS solution on top of the Microsoft technology stack. The requirements to pass security audits for Fortune 500 companies and the need to match the performance of in-house solutions resulted in all of us learning a great deal. These were very fun times.

I now work as the VP of Engineering and CTO at Swimfish, a services and software provider of business solutions. We focus on the financial marketplace, where our founder has a very deep background, but also work within other verticals. Our products are focused on the CRM, document management, and mobile product space and are built on the Microsoft technology stack. Our customers leverage both our SaaS and on-premise solutions, which requires us to build our products to be more flexible than is generally required for a SaaS-only solution.

The exciting thing for me is the sheer amount of opportunities I see available for science/engineering students graduating in the near future. To be prepared for these opportunities, however, it will be important to not just be technically savvy.

Engineering students should also be looking at:

* Business classes. If you want to build cool products they must deliver business value.

* Writing and speaking classes. You must be able to articulate your ideas or no one will be willing to invest in them.

I would also encourage people to take chances and get in over their heads as often as possible. You may fail, you may succeed. Either way you will gain experiences that make it all worthwhile.

Ajay- How do you think social media can help with CRM? What are the basic do’s and don’ts for social media CRM in your opinion?

John- You touch upon a subject that I am very passionate about. When I think of Social CRM I think about a system of processes and products that enable businesses to actively engage with customers in a manner that delivers maximum value to all. Customers should be able to find answers to their questions with minimal friction or effort; companies should find the right customers for their products.

Social CRM should deliver on some of these fronts:

* Analyze the web of relationships that exists to define optimal pathways. These pathways will define relationships that businesses can leverage for finding their customers. These pathways will enable customers to quickly find answers to their questions. For example, I needed an answer to a question about SharePoint and project management. I asked the question on Twitter and within 3 minutes had answers from two different people. Not only did I get the answer I needed but I made two new friends who I still talk to today.

* Monitor conversations to gauge brand awareness, identify customers having problems or asking questions. This monitoring should not be stalking; however, it should be used to provide quick responses to customers to benefit the greater community.

* Usability. Too many CRM systems are not usable. They are built by engineers that think of the system as a large database and the systems often look like a database making it difficult to use by the sales, support, and marketing people.

Finally, when I think of social media I think of these properties:

* Social is about relationship building.

* You should always add more value to the community than you take in return.

* Be transparent and honest. People can tell when you’re not.

Ajay- You are involved in some noble causes, like using blog space for out-of-work techies and, separately, for Alzheimer’s disease. How important do you think it is for people, especially younger people, to be dedicated to community causes?

John- My mother-in-law was diagnosed with Alzheimer’s disease at the age of 57. My wife and I moved into their two-family house to help her through the final years of her life. It is a horrible disease and one that is easy to be passionate about if you have seen it in action.

My motivation on the job front is very similar. I have seen too many people suffer through these poor economic times and I simply want to do what I can to help people get back to work.

It probably sounds corny, but I firmly believe that we must all do what we can for each other. Business is competitive, but it does not mean that we cannot, or should not, help each other out. I think it’s important for everyone to have causes they believe in. You have to find your passions in life and follow them. Be a whole person and help change the world for the better.

Ajay- Describe your daily challenges as head of Engineering at Swimfish, Inc. How important is it for the tech team to be integrated with the business and understand it as well?

John- The engineering team at Swimfish works very closely with the business teams. It is important for the team to understand the challenges our customers are encountering and to build products that help the customer succeed. I am not satisfied with the lack of success that many companies encounter when deploying a CRM solution.

We go as deep as possible to understand the business, the processes currently in use, the disparate systems being utilized, and then the underlying technologies currently in use. Only then do we focus on the solutions and deliver the right solution for that company.

On the product front it is the same. We work closely with customers on the features we are planning to add, trying to ensure that the solutions meet their needs as well as the needs of the other customers in the market that we are hoping to serve.

I do expect my engineers to be great at their core job; that goes without question. However, if they cannot understand the business needs they will not work for me very long.

My weeks at Swimfish always provide me with interesting challenges and opportunities.

My typical day involves:

* Checking in with our support team to understand if there are any major issues being encountered by any of our customers.

* Challenging the sales team to hit their targets. I love sales, as without them I cannot deliver products.

* Checking in with my developers and test teams to determine how each of our projects is doing. We have a daily standup as well, but I try to personally check in with as many people as possible.

* Most days I spend some time developing, mostly in C#. My current focus area is on our next release of our Milestone Tracking Matrix where I have made major revisions to our user interface.

I also spend time interacting on various social platforms, such as Twitter, as it is critical for me to understand the challenges that people are encountering in their businesses, to keep up with the rapid pace of technology, and just to check in with friends. Keep it real.

Ajay- What are your views on offshoring work, especially science jobs, which ultimately made science careers less attractive in the US? At the same time, outsourcing companies (in India) generally pay only one third of billing fees as salaries. Do you think concepts like oDesk can help change the paradigm of tech outsourcing?

John- I have mixed opinions on offshoring. You should not offshore because of perceived cost savings only. On net you will generally break even; you will not save as much as you might originally think.

I am, however, close to starting a relationship with a good development provider in Costa Rica. The reason for this relationship is not cost; it is knowledge. This company has a lot of experience with the primary CRM system that we sell to customers and I have not been successful in finding this experience locally. I will save a lot of money in upfront training on this skill set; they have done a lot of work in this area already (and have great references). There is real value to our business, and theirs.

Note that Swimfish is already working with a geographically dispersed team as part of the engineering team is in California and part is in Massachusetts. This arrangement has already helped us to better prepare for an offshore relationship and I know we will be successful when we begin.

Ajay- What does John Moore do to have fun when he is not in front of his computer or with a cause?

John- As the father of two teenage daughters I spend a lot of time going to soccer, basketball, and softball games. I also enjoy running, having completed a couple of marathons, and relaxing with a good book. My next challenge will be skydiving: my 17-year-old daughter and I are going when she turns 18.

Brief Bio:

For the last decade I have worked as a senior engineering manager for SaaS applications built upon the Microsoft technology stack. I have established the processes and hired the teams that delivered hundreds of updates, ranging from weekly patches to longer-running full feature releases. My background as a hands-on developer combined with my strong QA background has enabled me to deliver high-quality software on time.

You can learn more about me, and my opinions, by reading my blog at http://johnfmoore.wordpress.com/ or joining me on Twitter at http://twitter.com/JohnFMoore

R and Hadoop

Here is an exciting project for using R in a cloud computing environment (two of my favorite things). It is called RHIPE.

R and Hadoop Integrated Processing Environment v.0.38

The website source is http://ml.stat.purdue.edu/rhipe/

RHIPE (phonetic spelling: hree-pay’) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google’s MapReduce. Using RHIPE it is possible to code map-reduce algorithms in R, e.g.

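# Map: split the input line into words and emit one (word, count) pair per distinct word
m <- function(key,val){
  words <- strsplit(val," +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc)<-NULL; names(cln)<-NULL;
  # seq_along() also copes with a line that yields no words
  return(sapply(seq_along(wc),function(r) list(key=cln[r],value=wc[[r]]),simplify=FALSE))
}
# Reduce: add up all the counts emitted for one word
r <- function(key,value){
  value <- do.call("rbind",value)
  return(list(list(key=key,value=sum(value))))
}
# Run the job on folder X, writing to folder Y, reusing the reducer as a combiner
rhmr(map=m,reduce=r,combiner=TRUE,input.folder="X",output.folder="Y")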
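To see what the word-count map and reduce functions do without a Hadoop cluster, the flow can be imitated in plain R. This is only a sketch: run_local is a hypothetical helper written for this illustration, not part of RHIPE, and it ignores everything Hadoop adds (distribution, serialization, combiners). It runs the map on each line, groups the emitted values by key, and calls the reduce once per key.

```r
# Map and reduce in the same spirit as the word-count example (plain R only).
m <- function(key, val) {
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  lapply(names(wc), function(w) list(key = w, value = wc[[w]]))
}
r <- function(key, value) {
  list(list(key = key, value = sum(unlist(value))))
}

# Hypothetical local driver: map every line, group values by key, reduce per key.
run_local <- function(lines, map, reduce) {
  emitted <- unlist(lapply(seq_along(lines), function(i) map(i, lines[[i]])),
                    recursive = FALSE)
  keys <- vapply(emitted, function(kv) kv$key, character(1))
  grouped <- split(lapply(emitted, function(kv) kv$value), keys)
  out <- lapply(names(grouped), function(k) reduce(k, grouped[[k]])[[1]])
  setNames(vapply(out, function(kv) kv$value, numeric(1)),
           vapply(out, function(kv) kv$key, character(1)))
}

counts <- run_local(c("the cat sat", "the dog sat down"), m, r)
print(counts)
```

The grouping step in the middle is the part Hadoop performs between the map and reduce stages; the two user-supplied functions never see it.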

rhlapply packages the user's request into an R vector object. This is serialized and sent to the RHIPE server. The RHIPE server picks apart the object, creating a job request that Hadoop can understand. Each element of the provided list is processed by the user's function during the Map stage of mapreduce. The results are returned and, if the output is to a file, these results are serialized and written to a Hadoop sequence file; the values can be read back into R using the rhsq* functions.

2 rhlapply

rhlapply <- function( list.object,
                    func,
                    configure=expression(),
                    output.folder='',
                    shared.files=c(),
                    hadoop.mapreduce=list(),
                    verbose=T,
                    takeAll=T)

list.object
 This can either be a list or a single scalar. In case of the former, the function given by func will be applied to each element of list.object. In case of a scalar, the function will be applied to the list 1:n where n is the value of the scalar 
func
 A function that takes one parameter: an element of the list. 
configure
 A configuration expression to run before func is executed. Executed once for every JVM. If you need variables or data frames, save them with rhsave or rhsave.image, use rhput to copy the file to the DFS, and then use shared.files
config = expression({
              library(lattice)
              load("mydataset.Rdata")
})



output.folder
 Any file that is created by the function is stored in the output.folder. This is deleted first. If not given, the files created will not be copied. For side-effect files to be copied, create them in tmp, e.g. pdf("tmp/x.pdf"); note no leading slash. The directory will contain a slew of part* files, as many as there are maps. These contain the binary key-value pairs.

shared.files
 The function or the preload expression might require the presence of resource files, e.g. *.Rdata files. The user could copy them from the HDFS in the R code, or just load them from the local work directory if the files were present there. This is the role of shared.files. It is a vector of paths to files on the HDFS; each of these will be copied to the work directory where the R code is run. For example, given c('/tmp/x.Rdata','/foo.tgz'), the first file can be loaded via load("x.Rdata"). For those familiar with Hadoop terminology, this is implemented via DistributedCache.
hadoop.mapreduce
 a list of Hadoop specific options e.g
list(mapreduce.map.tasks=10,mapreduce.reduce.tasks=3)

takeAll
 If takeAll is true, the value returned is a list in which each entry is the return value of the function. The entries are not in order, so element 1 of the returned list is not necessarily the result of applying func to the first element of list.object.
verbose
 If True, the user will see the job progress in the R console. If False, the web url to the jobtracker will be displayed. Cancelling the command with CTRL-C will not cancel the job, use rhkill for that. 
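The argument convention for list.object and func can be tried out locally. The sketch below uses local_rhlapply, a hypothetical stand-in written for this illustration; it is not RHIPE code and runs everything in the current R session with lapply. It only mirrors two points from the descriptions above: a scalar n is treated as the list 1:n, and the configure expression runs before func is applied.

```r
# Hypothetical local stand-in for rhlapply (no Hadoop involved).
local_rhlapply <- function(list.object, func, configure = expression()) {
  # A single number n means "apply func to each element of 1:n".
  if (!is.list(list.object) && is.numeric(list.object) && length(list.object) == 1) {
    list.object <- as.list(seq_len(list.object))
  }
  eval(configure)            # run the setup expression once, before any call to func
  lapply(list.object, func)  # one call of func per list element
}

squares <- local_rhlapply(4, function(i) i^2)               # func applied to 1, 2, 3, 4
means   <- local_rhlapply(list(c(1, 2, 3), c(10, 20)), mean) # func applied to each vector
print(unlist(squares))
print(unlist(means))
```

Unlike this local version, the real function does not promise that results come back in input order, so code that relies on position should carry an identifier inside each result.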




Mapreduce in R.
rhmr <- function(map,reduce,input.folder,configure=list(map=expression(),reduce=expression()),
                 close=list(map=expression(),reduce=expression()),
                 output.folder='',combiner=F,step=F,
                 shared.files=c(),inputformat="TextInputFormat",
                 outputformat="TextOutputFormat",
                 hadoop.mapreduce=list(),verbose=T,libjars=c())
Execute map reduce algorithms from within R. A discussion of the parameters follows.

input.folder
 A folder on the DFS containing the files to process. Can be a vector. 
output.folder
 A folder on the DFS where output will go to. 
inputformat
 Either of TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE (e.g. rhlapply or rhmr). Note that one can't use just any sequence file; it must have been created via a RHIPE function. Custom input formats are also possible. Download the source and look at code/java/RXTextInputFormat.java
outputformat
 Either of TextOutputFormat or SequenceFileOutputFormat. In case of the former, the return value from the mapper or reducer is converted to character and written to disk. The following code is used to convert to character.
paste(key,sep='',collapse=field_separator)
Custom output formats are also possible. Download the source and look at code/java/RXTextOutputFormat.java
If custom formats implement their own writables, these must subclass RXWritable or use one of the writables present in RHIPE
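The paste() call quoted above is easy to try on its own. In this small sketch the separator value is the documented default of rhipejob.textoutput.fieldsep (a single space); applying the same conversion to the value as to the key is an assumption made for illustration.

```r
field_separator <- " "   # documented default of rhipejob.textoutput.fieldsep

key   <- c(1, 2)
value <- c("x", "b")

# A vector is flattened to one string, with its elements joined by the separator.
key_text   <- paste(key, sep = "", collapse = field_separator)
value_text <- paste(value, sep = "", collapse = field_separator)  # assumed: same rule for values

print(key_text)
print(value_text)
```

Because every element is joined with the same separator, a key or value that itself contains the separator character cannot be told apart from a multi-element vector when the text is read back.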

shared.files
 same as in rhlapply, see that for documentation. 
verbose
 If T, the job progress is displayed. If false, then the job URL is displayed. 

At any time in the configure, close, map or reduce function/expression, the variable mapred.task.is.map will be equal to "true" if it is a map task and "false" otherwise (both strings). Also, mapred.iswhat is mapper, reducer, or combiner in their respective environments.

configure
 A list with either one element (an expression) or two elements, map and reduce, both of which must be expressions. These expressions are called in their respective environments, i.e. the map expression is called during the map configure and similarly for the reduce expression. The reduce expression is also called for the combiner configure method. If there is only one list element, the expression is used for both the map and the reduce

close
 Same as configure . 
map
 a function that takes two values key and value. Should return a list of lists. Each list entry must contain two elements key and value , e.g
...
ret <- list()
ret[[1]] <-  list(key=c(1,2), value=c('x','b'))
return(ret)
If either key or value is missing, the output is not collected, e.g. return NULL to skip this record. If the input format is TextInputFormat, the value is the entire line and the key is probably useless to the user (it is a number indicating bytes into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.
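As an illustration of the return convention, here is a hypothetical map function for line-oriented text input. The comma-separated record layout and the name map_fields are assumptions made for this sketch. It returns a list of lists for a usable line and NULL for lines that should produce no output.

```r
# Hypothetical map function: key is the byte offset (ignored), value is the whole line.
map_fields <- function(key, value) {
  if (grepl("^#", value)) return(NULL)       # comment line: emit nothing
  fields <- strsplit(value, ",")[[1]]
  if (length(fields) < 2) return(NULL)       # malformed line: emit nothing
  # One emitted record, as a list holding a single key/value list
  list(list(key = fields[1], value = as.numeric(fields[2])))
}

skipped <- map_fields(0, "# header")
emitted <- map_fields(9, "alice,3.5")
print(is.null(skipped))
print(emitted[[1]])
```

Returning NULL early keeps the filtering logic inside the mapper, so no separate pass over the data is needed just to drop bad records.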

reduce
 Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values (all values emitted from the maps that share the same map output key). If step is TRUE, the value is a single value rather than a list. Must return a list of lists, each element of which must have two elements, key and value. RHIPE collects all the values for a key and sends them to the function. If NULL is returned, or the return value does not conform to the above, nothing is collected by the Hadoop collector.
step
 If step is TRUE, the reduce function is called once for every value corresponding to a key, rather than once per key.

 The variable red.status is equal to 1 on the first call.
 red.status is equal to 0 for every subsequent call, including the call for the last value.
 The reducer function is called one last time with red.status equal to -1, and the value is NULL. Anything returned at any of these stages is written to disk. The close function is called once every value for a given key has been processed, but returning anything has no effect. To assign to the global environment, use the <<- operator.
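The calling sequence for step mode can be imitated locally. In this sketch run_step is a hypothetical driver, not RHIPE code: it sets a global red.status to 1, then 0, then -1 as described, and calls the reducer each time. The reducer keeps a running total with <<- and emits a result only on the final call.

```r
# Step-wise reducer: accumulates a running total in the global environment.
step_reduce <- function(key, value) {
  if (red.status == 1) total <<- 0          # first value for this key: reset the total
  if (red.status >= 0) {                    # status 1 or 0: a real value arrived
    total <<- total + value
    return(NULL)                            # emit nothing yet
  }
  list(list(key = key, value = total))      # status -1: all values seen, emit the sum
}

# Hypothetical local driver that mimics the described calling sequence.
run_step <- function(key, values, reduce) {
  for (i in seq_along(values)) {
    red.status <<- if (i == 1) 1 else 0
    reduce(key, values[[i]])
  }
  red.status <<- -1
  reduce(key, NULL)                         # final call: value is NULL
}

result <- run_step("a", list(2, 3, 5), step_reduce)
print(result[[1]])
```

The benefit of this style is memory: the reducer never holds all values for a key at once, only the running total.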


combiner
 T or F, to use the reducer as a combiner. Using a combiner makes computation more efficient. If combiner is true, the reduce function will be called as a combiner (0 or more times; it may never be called during the combine stage even if combiner is T). The value of mapred.task.is.map is 'true' or 'false' (both strings) if the combiner is being executed as part of the map stage or reduce stage respectively.
Whether knowledge of this is useful or not is something I'm not sure of. However, if combiner is T, keep in mind that your reduce function must be able to handle inputs sent from the map or inputs sent from the reduce function (itself).
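The requirement that a reducer used as a combiner must accept its own output can be checked with a small experiment in plain R. The two reducers below are illustrative assumptions, not RHIPE code. Summing gives the same answer whether or not partial results are combined first; averaging does not, so a mean reducer is not safe to reuse as a combiner as written.

```r
sum_reduce  <- function(key, value) list(list(key = key, value = sum(unlist(value))))
mean_reduce <- function(key, value) list(list(key = key, value = mean(unlist(value))))

values <- list(1, 2, 3, 4, 10)

# Helper for this sketch: reduce two chunks separately (the "combine" step),
# then reduce the two partial results.
with_combiner <- function(reduce) {
  part1 <- reduce("k", values[1:2])[[1]]$value
  part2 <- reduce("k", values[3:5])[[1]]$value
  reduce("k", list(part1, part2))[[1]]$value
}

direct_sum    <- sum_reduce("k", values)[[1]]$value
combined_sum  <- with_combiner(sum_reduce)
direct_mean   <- mean_reduce("k", values)[[1]]$value
combined_mean <- with_combiner(mean_reduce)

print(c(direct_sum = direct_sum, combined_sum = combined_sum))
print(c(direct_mean = direct_mean, combined_mean = combined_mean))
```

A common way around this for averages is to emit a (sum, count) pair from the combine step and divide only in the final reduce.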

libjars
 If specifying a custom input/output format, the user might need to specify jar files here. 
hadoop.mapreduce
 set RHIPE and Hadoop options via this. 


1.1 RHIPE Options for mapreduce

 





Option	Default	Explanation
rhipejob.rport	8888	The port on which Rserve runs; should be the same across all machines
rhipejob.charsxp.short	0	If 1, RHIPE optimizes serialization for character vectors. This reduces the length of the serialization
rhipejob.getmapbatches	100	If the reducer/mapper emits several key-value pairs, how many to get from Rserve at a time. A higher number reduces the number of network reads (the network reads are to localhost)
rhipejob.outfmt.is.text	1 if TextInputFormat	Must be 1 if the output is textual
rhipejob.textoutput.fieldsep	' '	The field separator for any text-based output format
rhipejob.textinput.comment	'#'	In the TextInputFormat, lines beginning with this are skipped
rhipejob.combinerspill	100,000	The combiner is run after collecting at most this many items
rhipejob.tor.batch	200,000	Number of values for the same key to collate before sending to the reducer. If you have dollops of memory, set this larger; however, too large and you hit Java's heap space limit
rhipejob.max.count.reduce	Java's INT_MAX (about 2 billion)	The total number of values for a given key to be collected. Note the values are not ordered by any variable
rhipejob.inputformat.keyclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the fully qualified Java class name of the keyclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and implement the methods
rhipejob.inputformat.valueclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the fully qualified Java class name of the valueclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and implement the methods
mapred.input.format.class	As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat	Specify yours here
rhipejob.outputformat.keyclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the fully qualified Java class name of the keyclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. The keyclass must also implement RXWritable
rhipejob.outputformat.valueclass	The default is chosen depending on TextInputFormat or SequenceFileInputFormat	Provide the fully qualified Java class name of the valueclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. The valueclass must also implement RXWritable
mapred.output.format.class	As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileOutputFormat	Specify yours here; provide libjars if required
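As a sketch of how these options might be collected for the hadoop.mapreduce argument, the list below uses option names from the table. The values are arbitrary choices for illustration, and whether RHIPE expects numbers or strings for each option is an assumption not confirmed by the text above.

```r
# Illustrative option list; names come from the options table, values are arbitrary.
opts <- list(
  mapred.reduce.tasks          = 3,      # number of reduce tasks
  rhipejob.textoutput.fieldsep = ",",    # write comma-separated text output
  rhipejob.combinerspill       = 50000   # run the combiner after at most this many items
)

# The list would then be handed to the job, e.g. rhmr(..., hadoop.mapreduce = opts)
str(opts)
```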




Citation:http://ml.stat.purdue.edu/rhipe/index.html
Great exciting news for the world of computing remotely.

