Home » Posts tagged 'SPSS'

Tag Archives: SPSS

Visual Guides to CRISP-DM ,KDD and SEMMA

UPDATED- Here are three great examples of a visualization making a process easy to understand. Please click on the images to read them clearly.

1) It visualizes CRISP-DM and is made by Nicole Leaper (http://exde.wordpress.com/2009/03/13/a-visual-guide-to-crisp-dm-methodology/)

12345

2) KDD -Knowledge Discovery in Databases -visualization by Fayyad whom I have interviewed here at http://www.decisionstats.com/interview-dr-usama-fayyad-founder-open-insights-llc/

and work By Gregory Piatetsky Shapiro interviewed by this website here

http://decisionstats.com/2009/08/13/interview-gregory-piatetsky-kdnuggets-com/

kdd

3) I am also attaching a visual representation of SEMMA from http://www.dataprix.net/en/blogs/respinosamilla/theory-data-mining

metodo-semma

 

Hadoop and Cloud Computing

Just working on the Hadoop on Amazon’s Time Sharing Platform ;)

Hopefully we will see some SAS, or SPSS or R up there soon .

Protected: WPS for Big Data

This content is password protected. To view it please enter your password below:

Interview James G Kobielus IBM Big Data

Here is an interview with  James G Kobielus, who is the Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communication team,for helping with the logistics for this.

Ajay -What are the specific parts of the IBM Platform that deal with the three layers of Big Data -variety, velocity and volume

James-Well first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees either together in a product , or two out of the three or even one of the three aspects. We don’t have separate products for the variety, velocity and volume separately.

Let us define these three layers-Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real time continuous and streaming data.

Variety refers to multi-structure data from structured to unstructured files, managed and stored in a common platform analyzed through common tooling.

For Volume-IBM has a highly scalable Big Data platform. This includes Netezza and Infosphere groups of products, and Watson-like technologies that can support petabytes volume of data for analytics. But really the support of volume ranges across IBM’s Information Management portfolio both on the database side and the advanced analytics side.

For real time Velocity, we have real time data acquisition. We have a product called IBM Infosphere, part of our Big Data platform, that is specifically built for streaming real time data acquisition and delivery through complex event processing. We have a very rich range of offerings that help clients build a Hadoop environment that can scale.

Our Hadoop platform is the most real time capable of all in the industry. We are differentiated by our sheer breadth, sophistication and functional depth and tooling integrated in our Hadoop platform. We are differentiated by our streaming offering integrated into the Hadoop platform. We also offer a great range of modeling and analysis tools, pretty much more than any other offering in the Big Data space.

Attached- Jim’s slides from Hadoop World

Ajay- Any plans for Mahout for Hadoop

Jim- I cant speak about product plans. We have plans but I cant tell you anything more. We do have a feature in Big Insights called System ML, a library for machine learning.

Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza,Cognos,SPSS etc). Is it true that everything that you have in Big Data is acquired or is the famous IBM R and D contributing here . (see a partial list of IBM acquisitions at at http://www.ibm.com/investor/strategy/acquisitions.wss )

Jim- We have developed a lot on our own. We have the deepest R and D of anybody in the industry in all things Big Data.

For example – Watson has Big Insights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see http://www-01.ibm.com/software/data/infosphere/hadoop/ ). A great deal that makes Big Insights so differentiated is that not everything that has been built has been built by the Hadoop community.

We have built additions out of the necessity for security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise ready. That is one example of where we have leveraged open source and we have built our own tools and technologies and layered them on top of the open source code.

Yes of course we have done many strategic acquisitions over the last several years related to Big Data Management and we continue to do so. This quarter we have done 3 acquisitions with strong relevance to Big Data. One of them is Vivisimo (http://www-03.ibm.com/press/us/en/pressrelease/37491.wss ).

Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there,what is relevance of that data to your data science project- to help you answer the question which data should you bring in your Hadoop Cluster.

 We also did Varicent , which is more performance management and we did TeaLeaf , which is a customer experience solution provider where customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done great many acquisitions that have a clear relevance to Big Data.

Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilties to massively parallelize Map Reduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to that IBM had provided for many years under the Smart Analytics Platform (http://www-01.ibm.com/software/data/infosphere/what-is-advanced-analytics/ ) .

There is a differential between Netezza and ISAS.

ISAS was built predominantly in-house over several years . If you go back a decade ago IBM acquired Ascential Software , a product portfolio that was the heart and soul of IBM InfoSphere Information Manager that is core to our big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed to have the best of breed, SPSS provided that and so IBM acquired them.

 Cognos- We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired for various degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors , and we were lacking such a rich tool set in our product in visualization and cubing and so for that reason we acquired Cognos.

There is also Unica – which is a marketing campaign optimization which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.

Ajay- How would you rank order these acquisitions in terms of strategic importance rather than data of acquisition or price paid.

Jim-Think of Big Data as an ecosystem that has components that are fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool the core, or the governance tools the core, or is the hardware platform the core. Everything is critically important. We would love to hear from you what you think have been most important. Each acquisition has helped play a critical role to build the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, the hosting service. I don’t think there is any validity to a rank order system.

Ajay-What are the initiatives regarding open source that Big Data group have done or are planning?

Jim- What we are doing now- We are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages.. We have built BigInsights on Apache Hadoop. We have the closest, most up to date in terms of version number to Apache Hadoop ( Hbase,HDFS, Pig etc) of all commercial distributions with our BigInsights 1.4 .

We have an R library integrated with BigInsights . We have a R library integrated with Netezza Analytics. There is support for R Models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.

Ajay- What are some of the concerns (privacy,security,regulation) that you think can dampen the promise of Big Data.

Jim- There are no showstoppers, there is really a strong momentum. Some of the concerns within the Hadoop space are immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, the lack of standardization for formal sense for Hadoop.

There is no Open Standards Body that declares, ratifies the latest version of Mahout, Map Reduce, HDFS etc. There is no industry consensus reference framework for layering these different sub projects. There are no open APIs. There are no certifications or interoperability standards or organizations to certify different vendors interoperability around a common API or framework.

The lack of standardization is troubling in this whole market. That creates risks for users because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of just magnifying the risk. And it increases the need for the end user to develop their own custom integrated code if they want to move data between platforms, or move map-reduce jobs between multiple distributions.

Also governance is a consideration. Right now Hadoop is used for high volume ETL on multi structured and unstructured data sources, or Hadoop is used for exploratory sand boxes for data scientists. These are important applications that are a majority of the Hadoop deployments . Some Hadoop deployments are stand alone unstructured data marts for specific applications like sentiment analysis like.

Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of truth of system or record data. That day will come but there needs to be out there in the marketplace a broader range of data governance mechanisms , master data management, data profiling products that are mature that enterprises can use to make sure their data inside their Hadoop clusters is clean and is the single version of truth. That day has not arrived yet.

One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data , and that is done very well by Vivisimo for several years.

What we will see is vendors such as IBM will continue to evolve security features inside of our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform so that customers can use data from a bit of Hadoop,some data from a bit of traditional relational data warehouse, maybe some noSQL technology for different roles within a very complex Big Data environment.

That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is when your Big Data deployment has a bit of Hadoop, bit of noSQL, bit of EDW, bit of in-memory , there are no open standards or frameworks for putting it all together for a unified framework not just for interoperability but also for deployment.

There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms by the users/developers writing the queries, by administrators so they can manage data and resources and jobs across all these disparate platforms in a seamless unified way with visual tooling. That grand scenario, the virtualization layer is not there yet in any standard way across the big data market. It will evolve, it may take 5-10 years to evolve but it will evolve.

So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.

About-

You can read more about Jim at http://www.linkedin.com/pub/james-kobielus/6/ab2/8b0 or

follow him on Twitter at http://twitter.com/jameskobielus

You can read more about IBM Big Data at http://www-01.ibm.com/software/data/bigdata/

Revolution R Enterprise 6.0 launched!

Just got the email-more software is good news!

Revolution R Enterprise 6.0 for 32-bit and 64-bit Windows and 64-bit Red Hat Enterprise Linux (RHEL 5.x and RHEL 6.x) features an updated release of the RevoScaleR package that provides fast, scalable data management and data analysis: the same code scales from data frames to local, high-performance .xdf files to data distributed across a Windows HPC Server cluster or IBM Platform Computing LSF cluster.  RevoScaleR also allows distribution of the execution of essentially any R function across cores and nodes, delivering the results back to the user.

Detailed information on what’s new in 6.0 and known issues:
http://www.revolutionanalytics.com/doc/README_RevoEnt_Windows_6.0.0.pdf

and from the manual-lots of function goodies for Big Data

 

  • IBM Platform LSF Cluster support [Linux only]. The new RevoScaleR function, RxLsfCluster, allows you to create a distributed compute context for the Platform LSF workload manager.
  •  Azure Burst support added for Microsoft HPC Server [Windows only]. The new RevoScaleR function, RxAzureBurst, allows you to create a distributed compute context to have computations performed in the cloud using Azure Burst
  • The rxExec function allows distributed execution of essentially any R function across cores and nodes, delivering the results back to the user.
  • functions RxLocalParallel and RxLocalSeq allow you to create compute context objects for local parallel and local sequential computation, respectively.
  • RxForeachDoPar allows you to create a compute context using the currently registered foreach parallel backend (doParallel, doSNOW, doMC, etc.). To execute rxExec calls, simply register the parallel backend as usual, then set your compute context as follows: rxSetComputeContext(RxForeachDoPar())
  • rxSetComputeContext and rxGetComputeContext simplify management of compute contexts.
  • rxGlm, provides a fast, scalable, distributable implementation of generalized linear models. This expands the list of full-featured high performance analytics functions already available: summary statistics (rxSummary), cubes and cross tabs (rxCube,rxCrossTabs), linear models (rxLinMod), covariance and correlation matrices (rxCovCor),
    binomial logistic regression (rxLogit), and k-means clustering (rxKmeans)example: a Tweedie family with 1 million observations and 78 estimated coefficients (categorical data)
    took 17 seconds with rxGlm compared with 377 seconds for glm on a quadcore laptop

     

    and easier working with R’s big brother SAS language

     

    RevoScaleR high-performance analysis functions will now conveniently work directly with a variety of external data sources (delimited and fixed format text files, SAS files, SPSS files, and ODBC data connections). New functions are provided to create data source objects to represent these data sources (RxTextData, RxOdbcData, RxSasData, and RxSpssData), which in turn can be specified for the ‘data’ argument for these RevoScaleR analysis functions: rxHistogramrxSummary, rxCube, rxCrossTabs, rxLinMod, rxCovCor, rxLogit, and rxGlm.


    example, 

    you can analyze a SAS file directly as follows:


    # Create a SAS data source with information about variables and # rows to read in each chunk

    sasDataFile <- file.path(rxGetOption(“sampleDataDir”),”claims.sas7bdat”)
    sasDS <- RxSasData(sasDataFile, stringsAsFactors = TRUE,colClasses = c(RowNum = “integer”),rowsPerRead = 50)

    # Compute and draw a histogram directly from the SAS file
    rxHistogram( ~cost|type, data = sasDS)
    # Compute summary statistics
    rxSummary(~., data = sasDS)
    # Estimate a linear model
    linModObj <- rxLinMod(cost~age + car_age + type, data = sasDS)
    summary(linModObj)
    # Import a subset into a data frame for further inspection
    subData <- rxImport(inData = sasDS, rowSelection = cost > 400,
    varsToKeep = c(“cost”, “age”, “type”))
    subData

 

The installation instructions and instructions for getting started with Revolution R Enterprise & RevoDeployR for Windows: http://www.revolutionanalytics.com/downloads/instructions/windows.php

Interview Alvaro Tejada Galindo, SAP Labs Montreal, Using SAP Hana with #Rstats

Here is a brief interview with Alvaro Tejada Galindo aka Blag who is a developer working with SAP Hana and R at SAP Labs, Montreal. SAP Hana is SAP’s latest offering in BI , it’s also a database and a computing environment , and using R and HANA together on the cloud can give major productivity gains in terms of both speed and analytical ability, as per preliminary use cases.

Ajay- Describe how you got involved with databases and R language.
Blag-  I used to work as an ABAP Consultant for 11 years, but also been involved with programming since the last 13 years, so I was in touch with SQLServer, Oracle, MySQL and SQLite. When I joined SAP, I heard that SAP HANA was going to use an statistical programming language called “R”. The next day I started my “R” learning.

Ajay- What made the R language a fit for SAP HANA. Did you consider other languages? What is your view on Julia/Python/SPSS/SAS/Matlab languages

Blag- I think “R” is a must for SAP HANA. As the fastest database in the market, we needed a language that could help us shape the data in the best possible way. “R” filled that purpose very well. Right now, “R” is not the only language as “L” can be used as well (http://wiki.tcl.tk/17068) …not forgetting “SQLScript” which is our own version of SQL (http://goo.gl/x3bwh) . I have to admit that I tried Julia, but couldn’t manage to make it work. Regarding Python, it’s an interesting question as I’m going to blog about Python and SAP HANA soon. About Matlab, SPSS and SAS I haven’t used them, so I got nothing to say there.

Ajay- What is your view on some of the limitations of R that can be overcome with using it with SAP HANA.

Blag-  I think mostly the ability of SAP HANA to work with big data. Again, SAP HANA and “R” can work very nicely together and achieve things that weren’t possible before.

Ajay-  Have you considered other vendors of R including working with RStudio, Revolution Analytics, and even Oracle R Enterprise.

Blag-  I’m not really part of the SAP HANA or the R groups inside SAP, so I can’t really comment on that. I can only say that I use RStudio every time I need to do something with R. Regarding Oracle…I don’t think so…but they can use any of our products whenever they want.

Ajay- Do you have a case study on an actual usage of R with SAP HANA that led to great results.

Blag-   Right now the use of “R” and SAP HANA is very preliminary, I don’t think many people has start working on it…but as an example that it works, you can check this awesome blog entry from my friend Jitender Aswani “Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps “ (http://allthingsr.blogspot.com/#!/2012/04/big-data-r-and-hana-analyze-200-million.html)

Ajay- Does your group in SAP plan to give to the R ecosystem by attending conferences like UseR 2012, sponsoring meets, or package development etc

Blag- My group is in charge of everything developers, so sure, we’re planning to get more in touch with R developers and their ecosystem. Not sure how we’re going to deal with it, but at least I’m going to get myself involved in the Montreal R Group.

 

About-

http://scn.sap.com/people/alvaro.tejadagalindo3

Name: Alvaro Tejada Galindo
Email: a.tejada.galindo@sap.com
Profession: Development
Company: SAP Canada Labs-Montreal
Town/City: Montreal
Country: Canada
Instant Messaging Type: Twitter
Instant Messaging ID: Blag
Personal URL: http://blagrants.blogspot.com
Professional Blog URL: http://www.sdn.sap.com/irj/scn/weblogs?blog=/pub/u/252210910
My Relation to SAP: employee
Short Bio: Development Expert for the Technology Innovation and Developer Experience team.Used to be an ABAP Consultant for the last 11 years. Addicted to programming since 1997.

http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/overview/index.epx

and from

http://en.wikipedia.org/wiki/SAP_HANA

SAP HANA is SAP AG’s implementation of in-memory database technology. There are four components within the software group:[1]

  • SAP HANA DB (or HANA DB) refers to the database technology itself,
  • SAP HANA Studio refers to the suite of tools provided by SAP for modeling,
  • SAP HANA Appliance refers to HANA DB as delivered on partner certified hardware (see below) as anappliance. It also includes the modeling tools from HANA Studio as well replication and data transformation tools to move data into HANA DB,[2]
  • SAP HANA Application Cloud refers to the cloud based infrastructure for delivery of applications (typically existing SAP applications rewritten to run on HANA).

R is integrated in HANA DB via TCP/IP. HANA uses SQL-SHM, a shared memory-based data exchange to incorporate R’s vertical data structure. HANA also introduces R scripts equivalent to native database operations like join or aggregation.[20] HANA developers can write R scripts in SQL and the types are automatically converted in HANA. R scripts can be invoked with HANA tables as both input and output in the SQLScript. R environments need to be deployed to use R within SQLScript

More blog posts on using SAP and R together

Dealing with R and HANA

http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
R meets HANA

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/01/29/r-meets-hana

HANA meets R

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/01/26/hana-meets-r
When SAP HANA met R – First kiss

http://scn.sap.com/community/developer-center/hana/blog/2012/05/21/when-sap-hana-met-r–first-kiss

 

Using RODBC with SAP HANA DB-

SAP HANA: My experiences on using SAP HANA with R

http://scn.sap.com/community/in-memory-business-data-management/blog/2012/02/21/sap-hana-my-experiences-on-using-sap-hana-with-r

and of course the blog that started it all-

Jitender Aswani’s http://allthingsr.blogspot.in/

 

 

Interview Michal Kosinski , Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski , leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source

Ajay-  Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?  

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

 

My more recent research performed during my Mphil course in Psychometrics at Cambridge University followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I’ve quickly realized how empowering and convenient this tool was.

 

I believe that a course in R should be obligatory for all students that are likely to come close to any data analysis in their careers. It is really empowering – once you got the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs, each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch an publish them online. And all this without having to hire software developers.

Ajay -Why did you choose R as the background for Concerto? What other languages and platforms did you consider. Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay -What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to R engine with a built-in MySQL database and easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such us my favourite Survey Monkey), but has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal - I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at the Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

You can read more about Concerto at http://code.google.com/p/concerto-platform/ and http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Follow

Get every new post delivered to your Inbox.

Join 739 other followers