Automating Regression Models: KXEN

Note: I have used KXEN for modeling internet leads in 2008 and for consumer finance propensity models in 2006. These are my personal views.

KXEN is an extremely useful piece of software and, to my surprise, very much underused in industry. It can be used for building regression models, time series models and clustering models, though I have used it primarily for regression and clustering. KXEN takes the rules that modelers use for adding, dropping and manipulating variables and automates them. It then exports the model as PMML, SAS, SPSS or even SQL code for direct model execution. The modeler tests the validity of the model using a nice variety of graphs. The software uses a proprietary algorithm called SRM (Structural Risk Minimization).
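As a rough illustration of what exporting a model as SQL can look like, here is a minimal Python sketch. It is not KXEN's actual output or API. It assumes a plain logistic regression, and the function name logistic_to_sql, the table leads, the id column and the variables visits and age are all hypothetical.

```python
def logistic_to_sql(intercept, coefficients, table="leads"):
    """Render a fitted logistic regression as a SQL scoring query.

    `coefficients` maps column names to weights. All names are hypothetical
    and the table is assumed to have an `id` column.
    """
    # One "weight * column" term per model variable.
    terms = [f"{weight!r} * {column}" for column, weight in coefficients.items()]
    # Linear predictor: intercept plus the weighted variables.
    linear = " + ".join([repr(intercept)] + terms)
    # The logistic function turns the linear predictor into a 0-1 score.
    return f"SELECT id, 1.0 / (1.0 + EXP(-({linear}))) AS score FROM {table}"


# Made-up coefficients, purely for illustration.
print(logistic_to_sql(-1.5, {"visits": 0.8, "age": -0.02}))
```

The point of such an export is that the database can score every row itself, without the modeling tool being installed where the data lives.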

The latest version is version 5, and it is available here: http://www.kxen.com/index.php?option=com_content&task=view&id=61&Itemid=166

It is one of the best automated modeling tools I have used and been trained in, so if you are interested you can try an evaluation version here. The support team is excellent, and so are the sales staff (they neither oversell nor undermine the competition). It can cut down on both modeling time and the training time needed to learn modeling. Combined with WPS or R to clean data in the initial stage, it is a good solution to rent or buy, especially for cost-savvy businesses.

The only caveat, if any, is that the data needs to be cleaned for formatting and variable type issues before modeling, but that can be easily done.

Software for Creating a Virtual Office

Would you like to work from home and see your kids while they are still kids? Or would you rather commute to the office, wasting fuel and time in traffic?

Here is some software for creating a virtual office from home. If your organization does not have a work-from-home policy, you can refer to the advantages section.

Advantages of a Virtual Office

1) Zero company overheads – Especially leased office space that is priced by the square foot. Save on building costs rather than retrench employees.

2) Greater happiness for employees – Most employees prefer it. It also helps with retention.

3) Lower energy costs – Employees are more energy conscious working on home computers than in brightly lit offices.

4) Lower transportation costs for the employee

5) Lower communication costs for the employer (see below).

6) Much lower IT costs

Disadvantages of a Virtual Office

1) Zero company oversight or supervision – How do you know when the employee is slacking off? Solution: end-of-week reports.

2) Greater risk of data breaches – Not if you use the software below.

3) Loss of the face-to-face feeling – Not if you video conference. And you can always have meetings once a week or once a month.

Infrastructure for a Virtual Office

1) Broadband internet

2) VPN for secure connectivity and daily logs of data in and data out on the computer used. (A good example of a VPN client I have used is SafeNet SoftRemoteLT Version 10.7.5 (04/2006), ©2006 SafeNet, Inc.)

3) VoIP communication, like www.skype.com

4) Collaborative software

Google Sites for documentation: http://sites.google.com

Google Docs (http://docs.google.com/) or OpenOffice (www.openoffice.org) for office productivity software (if you want lower software costs as well)

www.gotomeeting.com or Citrix (www.citrix.com) for demonstrations or virtual meetings

Open source solutions like R (www.r-project.org) with the Rattle GUI (www.togaware.com) for analytical solutions

5) Employee-owned PC (remember that the activity log is enabled) with speakers, headphones and a webcam.

6) All computation can be done on remote servers (Microsoft Windows Terminal Server or, better still, X Windows Server) connected via VPN and Remote Desktop (built into Windows). This gives even better data security than keeping data on an employee’s PC in the office. The remote servers are managed by virtual teams of vendors. For huge data requirements, try Amazon and www.rightscale.com

The cost of a virtual office is virtual (or zero). There are obvious benefits for both cost-cutting employers and employees.

So what’s stopping you from asking your HR manager and your CEO for a virtual office today?

P.S. I don’t know how to create a virtual office with a Mac yet. Watch this space.

Happy Holidays and a Bonus Tech Cartoon

More Analytics in the Cloud

Here is a company, Birst (www.birst.com), that does exactly this: upload data, crunch it and share it.

Other software includes CloudBase, released by Business.com (www.business.com) and available at http://cloudbase.sourceforge.net/

"CloudBase is a data warehouse system for Terabyte and Petabyte scale analytics. It is built on top of Map-Reduce architecture. The current code has been developed to Hadoop‘s map-reduce implementation. CloudBase allows you to query flat log files using ANSI SQL. It comes with JDBC driver so you can use any JDBC database manager application (e.g Squirrel) as front end. CloudBase is developed by Business.com and is released to open source community under GNU General Public License 2.0." 

 

A third product is Vertica, which can be seen here: http://www.vertica.com/cloud

The benefits, quoting the site, are "

Vertica Analytic Database for the Cloud is an on-demand version of Vertica’s blazingly fast, grid-enabled columnar database hosted on Amazon’s Elastic Compute Cloud. The pay-as-you-go offering enables companies to create large, high-performance analytic data marts without upfront data center costs and delays.

Built for the Cloud
Vertica is the only cloud-based analytic database with the following innovations, which enable it to manage terabytes of data faster and more reliably than any other cloud database:

  • “Scale-out” grid architecture – handles changing workloads as elastically as the cloud
  • Aggressive data compression – keeps storage costs low
  • Automatic K-Safety – provides replication, failover and recovery in the cloud

New Business Intelligence Possibilities
Vertica for the Cloud completely changes the economics of BI, making it possible to rapidly
initiate a much broader spectrum of analytic projects and businesses:

  • Ad-hoc and short-lived business analytic projects
  • New analytic Software as a Service (SaaS) businesses
  • Vertica Analytic Database proof of concept projects

Benefits of Vertica for the Cloud:

  • Fastest “Time to Terabyte” – Fully provisioned and ready for loading within minutes
  • Fastest performance – 100x to 1000x faster than other cloud databases
  • Runs 24×7 – Automatic K-Safety makes Vertica the only failure-resilient analytic cloud DBMS
  • Lowest startup cost – No upfront hardware, data center or admin overhead. Just pay for database usage until you’re done, then stop paying for it
  • Painless scalability – Scales seamlessly as data volume changes
  • Smallest footprint – Compresses data up to 90% to lower costs and improve performance
  • Proven platform – Hosted by Amazon, within their proven data center

"

 

But if you want to start experimenting directly with Amazon EC2, costs are not very high. Remember, it is a pay-as-you-go system. For an analytics supplier looking to cut costs, the cloud computing paradigm seems the fastest way to do so.

http://aws.amazon.com/ec2/instance-types/
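To make the pay-as-you-go arithmetic concrete, here is a small Python sketch. The hourly rate and the server price are hypothetical placeholders, not Amazon's actual prices, so plug in the real figures from the pricing page before drawing any conclusions.

```python
def on_demand_cost(hourly_rate_usd, instances, hours):
    """Total pay-as-you-go bill: you pay only for the instance-hours used."""
    return hourly_rate_usd * instances * hours


def breakeven_hours(upfront_cost_usd, hourly_rate_usd):
    """Hours of single-instance rental that cost as much as buying a server."""
    return upfront_cost_usd / hourly_rate_usd


# Hypothetical figures for illustration only.
HOURLY_RATE_USD = 0.10     # assumed price of one instance-hour
SERVER_PRICE_USD = 3000.0  # assumed price of a comparable in-house server

project_cost = on_demand_cost(HOURLY_RATE_USD, instances=4, hours=50)
rental_hours = breakeven_hours(SERVER_PRICE_USD, HOURLY_RATE_USD)
print(f"4 instances for 50 hours: ${project_cost:.2f}")
print(f"Rental hours equal to one server purchase: {rental_hours:.0f}")
```

For short-lived or occasional analytics projects, the first number is the one that matters. The break-even figure only comes into play if the machines would be busy around the clock.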

A Base SAS to Java Compiler

Here is a nice SAS-to-Java compiler. It cuts away at the problems of executing legacy SAS code and of SAS training, and focuses on executing the tasks in Java, thus making them much faster.

It’s available at http://dullesopen.com/, and it’s free for personal and academic use.

I quote from the website "

Carolina Benefits

Converting Base SAS® to Java with Carolina provides two main benefits to enterprises:

  • Savings on license fees. Carolina costs about 70% less than SAS.
  • Performance gains. Carolina-converted code runs significantly faster than the native SAS program.

Additional Benefits

  • Greater flexibility. Java is an industry-standard environment that runs on all platforms. It is much easier to support than the legacy SAS environment it replaces.
  • Better integration. Carolina, as a Java application, supports web services through true J2EE integration.
  • Flawless automated conversion. Eliminate time-consuming, error-prone manual conversion.
  • Simpler contracts. Carolina is licensed in a simple, straightforward fashion.
  • Reduced training costs. Carolina-converted programs can be understood by analysts without training in SAS, and SAS-trained analysts don’t need to learn a new programming language."

Some Business Analytics Questions

My presentation at the Timesgroup office went well. I have uploaded a copy to Google Docs, and it can be viewed here: http://docs.google.com/Presentation?docid=dcvss358_324cfhpg7cv&hl=en

Some questions that came out of the discussion were:

1) Businesses are aligned by the types of products they can sell. What are the top 5 segments of customers (irrespective of product) that come to your website?

2) Cross-sell: what are the top 10 products (irrespective of whether you own them or not) that can be sold to your customer database?

3) Who are the most dissatisfied customers of your competitors?

4) Businesses change every year. How often are reporting systems completely overhauled to reflect the change, say once in six months?

5) How many employees go in for self-funded training once a year?

Overall, it was a great discussion, and hopefully it helps you answer the above questions too.

SQL and Hadoop: What is this cloud thing

Here is a very good, in fact brilliant, post from Joe Hellerstein, a professor of computer science at UC Berkeley: http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html

It explains the difference between the two approaches, relational databases and MapReduce. To quote from the post:

Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares — SQL is ubiquitous in those environments

 

Setting aside the trash talk, the usual cases made for the two technologies can be summarized as follows:

Relational Databases

  • multipurpose: useful for analysis and data update, batch and interactive tasks
  • high data integrity via ACID transactions
  • lots of compatible tools, e.g. for loading, management, reporting, data visualization and mining
  • support for SQL, the most widely-used language for data analysis
  • automatic SQL query optimization, which can radically improve performance
  • integration of SQL with familiar programming languages via connectivity protocols, mapping layers and user-defined functions

MapReduce (Hadoop)

  • designed for large clusters: 1000+ computers
  • very high availability, keeping long jobs running efficiently even when individual computers break or slow down
  • data is accessed in "native format" from a filesystem — no need to transform data into tables at load time
  • no special query language; programmers use familiar languages like Java, Python, and Perl
  • programmers retain control over performance, rather than counting on a query optimizer
  • the open-source Hadoop implementation is funded by corporate donors, and will mature over time as Linux and Apache did

Hadoop is still relatively young, and by all reports much slower and more resource intensive than Google’s MapReduce implementation.

What I liked about the article was that it explained Hadoop in simple terms to corporate SQL types like me.
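For fellow SQL types, the contrast can be sketched in a few lines of Python. This is only a toy running on one machine, not Hadoop itself: the log records and all function names are hypothetical, and SQLite stands in for the relational database. The same page-hit count is computed once in MapReduce style and once with a SQL GROUP BY.

```python
import sqlite3
from collections import defaultdict

# Hypothetical web log: (visitor, page) pairs.
LOG = [("ann", "/home"), ("bob", "/home"), ("ann", "/pricing"), ("cat", "/home")]


def map_phase(records):
    """Map: emit a (key, 1) pair for every record, straight from the raw data."""
    for _visitor, page in records:
        yield page, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def count_with_mapreduce(records):
    return reduce_phase(shuffle(map_phase(records)))


def count_with_sql(records):
    """The relational route: load into a table first, then let the query engine plan it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (visitor TEXT, page TEXT)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", records)
    rows = conn.execute("SELECT page, COUNT(*) FROM hits GROUP BY page").fetchall()
    conn.close()
    return dict(rows)


print(count_with_mapreduce(LOG))
print(count_with_sql(LOG))
```

The SQL version needs a load step and a schema but only one declarative line of logic. The MapReduce version reads the raw records as they are, and the programmer spells out every phase, which is what lets a framework spread the map and reduce steps across many machines.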

It would be interesting to see how Hadoop could be configured on the NVIDIA Tesla supercomputer (at 10,000 USD).

Update: Mathematica is already being modified to run on GPUs rather than CPUs, and there was an interesting discussion on the R-help list today on this.

Mathematica is launching a version that works with NVIDIA GPUs. It is claimed that this would make it ~10-100x faster!
http://www.physorg.com/news146247669.html