The Mommy Track


A new paper quantitatively analyzes the impact of childbearing on women's wages. Summary:

Women [who score in the upper third on a standardized test] have a net 8 percent reduction in pay during the first five years after giving birth

From http://papers.nber.org/papers/w16582

Having a child lowers a woman’s lifetime earnings, but how much depends upon her skill level. In The Mommy Track Divides: The Impact of Childbearing on Wages of Women of Differing Skill Levels (NBER Working Paper No. 16582), co-authors Elizabeth Ty Wilde, Lily Batchelder, and David Ellwood estimate that having a child costs the average high skilled woman $230,000 in lost lifetime wages relative to similar women who never gave birth. By comparison, low skilled women experience a lifetime wage loss of only $49,000.

Using the 1979 National Longitudinal Survey of Youth (NLSY), Wilde et al. divided women into high, medium, and low skill categories based on their Armed Forces Qualification Test (AFQT) scores. The authors use these skill categories, combined with earnings, labor force participation, and family formation data, to chart the labor market progress of women before and after childbirth, from ages 14-to-21 in 1979 through 41-to-49 in 2006, this study’s final sample year.

High scoring and low scoring women differed in a number of ways. While 70-75 percent of higher scoring women work full-time all year prior to their first birth, only 55-60 percent of low scoring women do. As they age, the high scoring women enjoy steeper wage growth than low scoring women; low scoring women’s wages do not change much if they reenter the labor market after they have their first child. Five years after the first birth, about 35 percent of each group is working full-time. However, the high scoring women who are not working full-time are more likely to be working part-time than the low scoring women, who are more likely to leave the workforce entirely.

and

Men’s earning profiles are relatively unaffected by having children although men who never have children earn less on average than those who do. High scoring women who have children late also tend to earn more than high scoring childless women. Their earnings advantage occurs before they have children and narrows substantially after they become mothers.

Highlights from Rexer Survey: R gives best satisfaction


A summary report from the Rexer Analytics annual survey

 

HIGHLIGHTS from the 4th Annual Data Miner Survey (2010):

 

•   FIELDS & GOALS: Data miners work in a diverse set of fields.  CRM / Marketing has been the #1 field in each of the past four years.  Fittingly, “improving the understanding of customers”, “retaining customers” and other CRM goals are also the goals identified by the most data miners surveyed.

 

•   ALGORITHMS: Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners.  However, a wide variety of algorithms are being used.  This year, for the first time, the survey asked about Ensemble Models, and 22% of data miners report using them.
A third of data miners currently use text mining and another third plan to in the future.

 

•   MODELS: About one-third of data miners typically build final models with 10 or fewer variables, while about 28% generally construct models with more than 45 variables.

 

•   TOOLS: After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other.  STATISTICA, which has also been climbing in the rankings, is selected as the primary data mining tool by the most data miners (18%).  Data miners report using an average of 4.6 software tools overall.  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

 

•   TECHNOLOGY: Data Mining most often occurs on a desktop or laptop computer, and frequently the data is stored locally.  Model scoring typically happens using the same software used to develop models.  STATISTICA users are more likely than other tool users to deploy models using PMML.

 

•   CHALLENGES: As in previous years, dirty data, explaining data mining to others, and difficult access to data are the top challenges data miners face.  This year data miners also shared best practices for overcoming these challenges.  The best practices are available online.

 

•   FUTURE: Data miners are optimistic about continued growth in the number of projects they will be conducting, and growth in data mining adoption is the number one “future trend” identified.  There is room to improve:  only 13% of data miners rate their company’s analytic capabilities as “excellent” and only 8% rate their data quality as “very strong”.

 

Please contact us if you have any questions about the attached report or this annual research program.  The 5th Annual Data Miner Survey will be launching next month.  We will email you an invitation to participate.

 

Information about Rexer Analytics is available at www.RexerAnalytics.com. Rexer Analytics continues its impressive journey; see http://www.rexeranalytics.com/Clients.html

My only thought: since most data miners use multiple tools, including free tools as well as paid software, perhaps a pie chart of market share by revenue and by volume would be handy.

Some ideas on comparing diverse data mining projects by data size or complexity would also help.

 

OK Cupid Data Visualization- Flow Chart to your Heart

Quite appropriate for Valentine's Day: OK Cupid remains quite innovative in how it uses data (in this case, questionnaire data).

Protovis- a graphical toolkit for visualization

I just found out about a new data visualization tool called Protovis: http://vis.stanford.edu/protovis/ex/

Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction.

Protovis is free and open-source, and is a Stanford project. It has been used in the web interface R Node (which I will talk about later):

http://squirelove.net/r-node/doku.php

Conventional

While Protovis is designed for custom visualization, it is still easy to create many standard chart types. These simpler examples serve as an introduction to the language, demonstrating key abstractions such as quantitative and ordinal scales, while hinting at more advanced features, including stack layout.

Custom

Many charting libraries provide stock chart designs, but offer only limited customization; Protovis excels at custom visualization design through a concise representation and precise control over graphical marks. These examples, including a few recreations of unusual historical designs, demonstrate the language’s expressiveness.

 

 

Try Protovis today 🙂 http://vis.stanford.edu/protovis/

It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.

R Commander Plugins-20 and growing!

R Commander Extensions: Enhancing a Statistical Graphical User Interface by extending menus to statistical packages

R Commander (see the paper by Prof. J. Fox at http://www.jstatsoft.org/v14/i09/paper) is a well-known and established graphical user interface to the R analytical environment.
While the original GUI was created for a basic statistics course, the enabling of extensions (or plug-ins, see http://www.r-project.org/doc/Rnews/Rnews_2007-3.pdf) has greatly enhanced the possible use and scope of this software. Here we give a list of all known R Commander plugins and their uses, along with brief comments.

  1. DoE – http://cran.r-project.org/web/packages/RcmdrPlugin.DoE/RcmdrPlugin.DoE.pdf
  2. doex
  3. EHESsampling
  4. epack- http://cran.r-project.org/web/packages/RcmdrPlugin.epack/RcmdrPlugin.epack.pdf
  5. Export- http://cran.r-project.org/web/packages/RcmdrPlugin.Export/RcmdrPlugin.Export.pdf
  6. FactoMineR
  7. HH
  8. IPSUR
  9. MAc- http://cran.r-project.org/web/packages/RcmdrPlugin.MAc/RcmdrPlugin.MAc.pdf
  10. MAd
  11. orloca
  12. PT
  13. qcc- http://cran.r-project.org/web/packages/RcmdrPlugin.qcc/RcmdrPlugin.qcc.pdf and http://cran.r-project.org/web/packages/qcc/qcc.pdf
  14. qual
  15. SensoMineR
  16. SLC
  17. sos
  18. survival-http://cran.r-project.org/web/packages/RcmdrPlugin.survival/RcmdrPlugin.survival.pdf
  19. SurvivalT
  20. TeachingDemos

Note that the naming convention for the above plugins is always the prefix “RcmdrPlugin.” followed by the names above.
Also, a plugin must already be installed locally to be visible in R Commander’s list of plugins to load, and R Commander loads the plugin after restarting. Hence it is advisable to load all R Commander plugins at the beginning of the analysis session.
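
The naming convention above can be sketched in R as follows. This is a minimal sketch, not part of R Commander itself: the short names are a subset of the list above, `plugin_packages` and `not_installed` are illustrative variable names, and the availability of individual plugins on CRAN may have changed since this list was compiled.

```r
# Short plugin names from the list above (a subset, for illustration)
plugin_names <- c("DoE", "survival", "qcc", "epack", "Export", "MAc")

# Full package names follow the convention "RcmdrPlugin." + short name
plugin_packages <- paste0("RcmdrPlugin.", plugin_names)

# Which of these are not yet installed in the local library?
not_installed <- setdiff(plugin_packages, rownames(installed.packages()))

# A plugin must be installed before R Commander can list it for loading.
# Uncomment to install the missing ones (a one-time step):
# install.packages(not_installed)

print(plugin_packages)
```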

The notable plugins are:
1) DoE for Design of Experiments- Full factorial designs, orthogonal main effects designs, regular and non-regular 2-level fractional factorial designs, central composite and Box-Behnken designs, latin hypercube samples, and simple D-optimal designs can currently be generated from the GUI. Extensions to cover further latin hypercube designs as well as more advanced D-optimal designs (with blocking) are planned for the future.
2) survival- This package provides an R Commander plug-in for the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves, and testing for differences in survival curves, along with data-management facilities and a variety of tests, diagnostics and graphs.
3) qcc- GUI for Shewhart quality control charts for continuous, attribute and count data; Cusum and EWMA charts; operating characteristic curves; process capability analysis; Pareto charts and cause-and-effect charts; multivariate control charts.
4) epack- an Rcmdr plug-in based on time series functions. It also depends on packages such as tseries, abind, MASS, xts and forecast. It covers Log-Exceptions garch and the following models: Arima, garch, HoltWinters.
5) Export- The package helps users to graphically export Rcmdr output to LaTeX or HTML code, via xtable() or Hmisc::latex(). The plug-in was originally intended to facilitate exporting Rcmdr output to formats other than ASCII text and to provide R novices with an easy-to-use, easy-to-access reference on exporting R objects to formats suited for printed output. The package documentation contains several pointers on creating reports, either by using conventional word processors or LaTeX/LyX.
6) MAc- This is an R Commander plug-in for the MAc package (Meta-Analysis with Correlations). This package enables the user to conduct a meta-analysis in a menu-driven, graphical user interface environment (e.g., SPSS), while having the full statistical capabilities of R and the MAc package. The MAc package itself contains a variety of useful functions for conducting a research synthesis with correlational data. One of the unique features of the MAc package is its integration of user-friendly functions to complete the majority of statistical steps involved in a meta-analysis with correlations. It uses recommended procedures as described in The Handbook of Research Synthesis and Meta-Analysis (Cooper, Hedges, & Valentine, 2009).

A help query for ??Rcmdrplugins returns the following information, which can be quite overwhelming given that about 20 plugins are now available:

RcmdrPlugin.DoE::DoEGlossary
    Glossary for DoE terminology as used in RcmdrPlugin.DoE
RcmdrPlugin.DoE::Menu.linearModelDesign
    RcmdrPlugin.DoE Linear Model Dialog for experimental data
RcmdrPlugin.DoE::Menu.rsm
    RcmdrPlugin.DoE response surface model Dialog for experimental data
RcmdrPlugin.DoE::RcmdrPlugin.DoE-package
    R-Commander plugin package that implements design of experiments facilities from packages DoE.base, FrF2 and DoE.wrapper into the R-Commander
RcmdrPlugin.DoE::RcmdrPlugin.DoEUndocumentedFunctions
    Functions used in menus
RcmdrPlugin.doex::ranblockAnova
    Internal RcmdrPlugin.doex objects
RcmdrPlugin.doex::RcmdrPlugin.doex-package
    Install the DOEX Rcmdr Plug-In
RcmdrPlugin.EHESsampling::OpenSampling1
    Internal functions for menu system of RcmdrPlugin.EHESsampling
RcmdrPlugin.EHESsampling::RcmdrPlugin.EHESsampling-package
    Help with EHES sampling
RcmdrPlugin.Export::RcmdrPlugin.Export-package
    Graphically export objects to LaTeX or HTML
RcmdrPlugin.FactoMineR::defmacro
    Internal RcmdrPlugin.FactoMineR objects
RcmdrPlugin.FactoMineR::RcmdrPlugin.FactoMineR
    Graphical User Interface for FactoMineR
RcmdrPlugin.IPSUR::IPSUR-package
    An IPSUR Plugin for the R Commander
RcmdrPlugin.MAc::RcmdrPlugin.MAc-package
    Meta-Analysis with Correlations (MAc) Rcmdr Plug-in
RcmdrPlugin.MAd::RcmdrPlugin.MAd-package
    Meta-Analysis with Mean Differences (MAd) Rcmdr Plug-in
RcmdrPlugin.orloca::activeDataSetLocaP
    RcmdrPlugin.orloca: A GUI for orloca-package (internal functions)
RcmdrPlugin.orloca::RcmdrPlugin.orloca-package
    RcmdrPlugin.orloca: A GUI for orloca-package
RcmdrPlugin.orloca::RcmdrPlugin.orloca.es
    RcmdrPlugin.orloca.es: Una interfaz grafica para el paquete orloca
RcmdrPlugin.qcc::RcmdrPlugin.qcc-package
    Install the Demos Rcmdr Plug-In
RcmdrPlugin.qual::xbara
    Internal RcmdrPlugin.qual objects
RcmdrPlugin.qual::RcmdrPlugin.qual-package
    Install the quality Rcmdr Plug-In
RcmdrPlugin.SensoMineR::defmacro
    Internal RcmdrPlugin.SensoMineR objects
RcmdrPlugin.SensoMineR::RcmdrPlugin.SensoMineR
    Graphical User Interface for SensoMineR
RcmdrPlugin.SLC::Rcmdr.help.RcmdrPlugin.SLC
    RcmdrPlugin.SLC: A GUI for slc-package (internal functions)
RcmdrPlugin.SLC::RcmdrPlugin.SLC-package
    RcmdrPlugin.SLC: A GUI for SLC R package
RcmdrPlugin.sos::RcmdrPlugin.sos-package
    Efficiently search R Help pages
RcmdrPlugin.steepness::Rcmdr.help.RcmdrPlugin.steepness
    RcmdrPlugin.steepness: A GUI for steepness-package (internal functions)
RcmdrPlugin.steepness::RcmdrPlugin.steepness
    RcmdrPlugin.steepness: A GUI for steepness R package
RcmdrPlugin.survival::allVarsClusters
    Internal RcmdrPlugin.survival Objects
RcmdrPlugin.survival::RcmdrPlugin.survival-package
    Rcmdr Plug-In Package for the survival Package
RcmdrPlugin.TeachingDemos::RcmdrPlugin.TeachingDemos-package
    Install the Demos Rcmdr Plug-In
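
If the full help listing is too much, one alternative is to list only the plugin packages that are installed locally. The following is a minimal sketch that relies on the “RcmdrPlugin.” prefix convention; `installed_rcmdr_plugins` is a hypothetical helper name, not a function provided by Rcmdr.

```r
# Return the installed packages whose names start with "RcmdrPlugin."
# The default argument looks at the local library; a character vector
# can be passed instead to try the function on made-up names.
installed_rcmdr_plugins <- function(pkgs = rownames(installed.packages())) {
  sort(grep("^RcmdrPlugin\\.", pkgs, value = TRUE))
}

# Plugins available on this machine (may be empty)
installed_rcmdr_plugins()

# Help for a single plugin can then be opened with, for example:
# help(package = "RcmdrPlugin.DoE")
```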

 

Challenges of Analyzing a dataset (with R)


Analyzing data can have many challenges associated with it. In the case of business analytics data, these challenges or constraints can have a marked effect on the quality and timeliness of the analysis as well as the expected versus actual payoff from the analytical results.

Challenges of Analytical Data Processing-

1) Data Formats- Reading in complete data without losing any part (or metadata), and without adding superfluous details (that increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and to well-documented, easily searchable syntax and language.

The costs of additional data augmentation (should we pay for additional credit bureau data to be appended?), the time needed to store and process the data (every column needed for analysis adds as many values as the dataset has rows, which can become a time-consuming problem if you are considering an extra 100 variables with a few million rows), and above all business relevance and quality guidelines ensure that basic data input and massaging are a considerable part of the whole analytical project timeline.
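
To put a rough number on the "extra 100 variables with a few million rows" case, here is a back-of-the-envelope sketch. It assumes every value is stored as an R double (8 bytes) and ignores object overhead; `estimate_numeric_gb` is a hypothetical helper, and 3 million rows is only an illustrative reading of "a few million".

```r
# Rough memory needed to hold extra numeric columns in R.
# Assumption: each value is a double (8 bytes); overhead is ignored.
estimate_numeric_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

# "A few million rows" is taken here as 3 million, purely for illustration
extra_gb <- estimate_numeric_gb(n_rows = 3e6, n_cols = 100)
print(round(extra_gb, 2))

# Sanity check of the 8-bytes-per-value assumption on a small vector
print(object.size(numeric(1e6)), units = "Mb")
```

Under these assumptions the extra columns alone need roughly 2.2 GB of memory, before any copies made during processing.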

2) Data Quality- Perfect data exists in a perfect world. The price of perfect information is one that a business will rarely budget for or wait for. Delivering inferences and results based on summaries of data that contains missing, invalid, or outlying values makes the role of the analyst just as important as whichever tool is chosen to remove outliers, replace missing values, or treat invalid data.
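
As an illustration of the three treatments named above, here is a minimal sketch in base R. The rules used (a valid range, capping at 1.5 times the interquartile range, median imputation) are common defaults chosen for this example, not a recommendation from the post; `clean_numeric` and the `ages` data are hypothetical.

```r
# Treat invalid, outlying, and missing values in a numeric vector.
clean_numeric <- function(x, valid_min = -Inf, valid_max = Inf) {
  # 1. Invalid values (outside the valid range) become missing
  x[!is.na(x) & (x < valid_min | x > valid_max)] <- NA

  # 2. Outliers are capped at 1.5 * IQR beyond the quartiles
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x <- pmin(pmax(x, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)

  # 3. Missing values are replaced by the median of the remaining values
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Made-up ages: one missing, one invalid (-4), one outlier (95)
ages <- c(25, 31, NA, -4, 42, 38, 29, 95)
cleaned <- clean_numeric(ages, valid_min = 0, valid_max = 120)
print(cleaned)
```

Which rule is appropriate is a judgement call for the analyst, which is the point of the paragraph above: the tool only applies the rule it is given.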

3) Project Scope-

How much data? How much Analytical detail versus High Level Summary? Timelines for delivery as well as refresh of data analysis? Checks (statistical as well as business)?

How easy is it to load and implement the new analysis in the existing Information Technology infrastructure? These are some of the outer parameters that can limit your analytical project scope, your analytical tool choice, and your processing methodology.

4) Output Results vis-a-vis stakeholder expectation management-

Stakeholders like to see results, not constraints, hypotheses, assumptions, p-values, or chi-square values. Output results need to be streamlined into a decision management process to justify the investment of human time and effort in an analytical project; choosing an analytical tool, training on it, and navigating its complexities and constraints are a subset of that investment. Optimum use of graphical display is part of aligning results into a form more palatable to stakeholders, provided the graphics are done well.

E.g. Marketing wants to get more sales, so it needs a clear campaign to target certain customers via specific channels with specified collateral. To support this business judgement, business analytics needs to validate, cross-validate, and sometimes invalidate the business decision with clear, transparent methods and processes.

Given a dataset, the basic analytical steps that an analyst will perform with R are as follows. This is meant as a note for analysts at a beginner level with R.

Package -specific syntax

update.packages() #This updates all installed packages
install.packages("package1") #This installs a package locally, a one-time event (the package name must be quoted)
library(package1) #This loads a specified package in the current R session, which needs to be done every R session

CRAN________LOCAL HARD DISK_________R SESSION is the top to bottom hierarchy of package storage and invocation.
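
The "install once, load every session" pattern can be sketched as a small helper. This is one common idiom rather than the only way to do it; `load_or_install` is a hypothetical name, and "stats" is used in the example only because it ships with every R installation.

```r
# Install a package only if it is missing, then load it into the session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # CRAN -> local hard disk (one-time event)
  }
  library(pkg, character.only = TRUE)  # local hard disk -> R session
}

# "stats" is already installed, so this only loads it
load_or_install("stats")
```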

ls() #This lists all objects or datasets currently active in the R session

> names(assetsCorr)  #This gives the names of variables within a dataframe
[1] "AssetClass"            "LargeStocksUS"         "SmallStocksUS"
[4] "CorporateBondsUS"      "TreasuryBondsUS"       "RealEstateUS"
[7] "StocksCanada"          "StocksUK"              "StocksGermany"
[10] "StocksSwitzerland"     "StocksEmergingMarkets"

> str(assetsCorr) #gives complete structure of dataset
'data.frame':    12 obs. of  11 variables:
$ AssetClass           : Factor w/ 12 levels "CorporateBondsUS",..: 4 5 2 6 1 12 3 7 11 9 ...
$ LargeStocksUS        : num  15.3 16.4 1 0 0 ...
$ SmallStocksUS        : num  13.49 16.64 0.66 1 0 ...
$ CorporateBondsUS     : num  9.26 6.74 0.38 0.46 1 0 0 0 0 0 ...
$ TreasuryBondsUS      : num  8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 ...
$ RealEstateUS         : num  10.6 17.32 0.08 0.59 0.35 ...
$ StocksCanada         : num  10.25 19.78 0.56 0.53 -0.12 ...
$ StocksUK             : num  10.66 13.63 0.81 0.41 0.24 ...
$ StocksGermany        : num  12.1 20.32 0.76 0.39 0.15 ...
$ StocksSwitzerland    : num  15.01 20.8 0.64 0.43 0.55 ...
$ StocksEmergingMarkets: num  16.5 36.92 0.3 0.6 0.12 ...

> dim(assetsCorr) #gives dimensions observations and variable number
[1] 12 11

str(dataset) #This gives the structure of the dataset (note that str gives both the names of the variables within the dataset and the dimensions of the dataset)

head(dataset,n1) gives the first n1 rows of dataset while
tail(dataset,n2) gives the last n2 rows of a dataset where n1,n2 are numbers and dataset is the name of the object (here a data frame that is being considered)

summary(dataset) gives you a brief summary of all variables while

library(Hmisc)
describe(dataset) gives a detailed description on the variables

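Simple graphics can be given by

hist(Dataset1$variable1) #histogram of one numeric variable; hist() needs a numeric vector, not a whole data frame (variable1 is a placeholder name)
and
plot(Dataset1)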

As you can see in the above cases, there are multiple ways to get even basic analysis of data in R; however, most of the commands are intuitively understood (like hist for histogram, t.test for t test, plot for plot).
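
The commands above can be tried end to end on a dataset that ships with R. The sketch below uses the built-in `mtcars` data frame as a stand-in for your own dataset; the choice of `mtcars` and of the `mpg` and `wt` columns is only for illustration.

```r
# mtcars is a small data frame included with base R
data(mtcars)

names(mtcars)    # variable names
str(mtcars)      # structure: names, types, and dimensions
dim(mtcars)      # number of observations and variables

first_rows <- head(mtcars, 3)  # first 3 rows
last_rows  <- tail(mtcars, 2)  # last 2 rows
print(first_rows)
print(last_rows)

summary(mtcars)  # brief summary of every variable

# Simple graphics: hist() takes one numeric variable
h <- hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Scatter plot of two variables
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
```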

For detailed analysis throughout the scope of a project, a business analytics user is advised to use multiple GUIs and multiple packages. Even for highly specific and specialized analytical tasks, it is worth checking for a GUI that incorporates the required package.