Some statistics on blogs by the people who host them-
From an advertisement placed by Govt of Pakistan in Wall Street Journal,
Only Pakistan= Making sacrifices statistics cannot reflect.
Oh dear! What would the statisticians say?
The ad cites a series of statistics. Almost 22,000 Pakistani civilians have died or been seriously injured in the fight against terrorism, the ad said. The army has lost almost 3,000 soldiers. More than 3.5 million people have been displaced by the fighting and the damage to the economy over the past decade is estimated at $68 billion, it added.
People will quibble with these statistics from a country where reporters often find it difficult to get basic data.
Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )
Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?
Dan- When I was in graduate school studying econometrics at Harvard, a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities. Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants. Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.
Ajay- What are the latest products by Salford Systems? Any future product plans or modifications to work on Big Data analytics, mobile computing and cloud computing?
Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.
The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.
The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?
Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.
In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.
Ajay- I know you have created your own software. But are there other software that you use or liked to use?
Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.
Ajay- Your software is installed at 3500 sites including 400 universities (as per http://www.salford-systems.com/company/aboutus/index.html ). What is the key to managing and keeping so many customers happy?
Dan- First, we have taken great pains to make our software reliable and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us, we try to achieve 24-hour turnaround on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GotoMeeting and other internet-based contact permit real-time interaction.
Ajay- What do you do to relax and unwind?
Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.
Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a PhD in Econometrics at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.
His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.
Continuing my series on basic data manipulation using R, for people who know analytics but are new to R.
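1 Keeping only some variables
Using subset we can keep only the variables we want:
    Sitka89 <- subset(Sitka89, select=c(size,Time,treat))
This will keep only the variables we have selected (size, Time, treat).

2 Dropping some variables
    Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor (assuming the dataset is held as a data frame with a column of that name).

3 Keeping records based on character condition
    Titanic.sub1<-subset(Titanic,Sex=="Male")
Note the double equal-to sign (==), which tests for equality. The built-in Titanic object is a table, so this assumes it has first been converted to a data frame, for example with as.data.frame(Titanic).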
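As a minimal sketch of steps 1 to 3, the following uses only the built-in Titanic table, converted to a data frame first. The object names titanic_df, kept, dropped and males are made up for illustration and are not part of the original examples.

```r
# Titanic ships as a 4-dimensional table; convert it to a data frame first.
# The result has the columns Class, Sex, Age, Survived and Freq.
titanic_df <- as.data.frame(Titanic)

# 1. Keep only some variables
kept <- subset(titanic_df, select = c(Sex, Survived, Freq))

# 2. Drop a variable by assigning NULL to it (done on a copy here)
dropped <- titanic_df
dropped$Age <- NULL

# 3. Keep records that match a character condition (note the ==)
males <- subset(titanic_df, Sex == "Male")

names(kept)
names(dropped)
nrow(males)
```

Working on copies (kept, dropped, males) leaves the original data frame untouched, which makes it easier to compare before and after while learning.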
4 Keeping records based on date/time condition
    subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')

5 Converting Date Time Formats into other formats
If the variable dob is "01/04/1977" then the following will convert it into a date-time object:
    z=strptime(dob,"%d/%m/%Y")
and if the same date is written as 01Apr1977:
    z=strptime(dob,"%d%b%Y")

6 Difference in Date Time Values and Using Current Time
The difftime function gives the difference between two date-time variables:
    difftime(time1, time2, units='secs')
or, in full:
    difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))
For current system date and time values you can use
    Sys.time()
    Sys.Date()
Either value can be put in the difftime function shown above to calculate age or time elapsed.

7 Keeping records based on numerical condition
    Titanic.sub1<-subset(Titanic,Freq >37)
This assumes Titanic is held as a data frame (for example via as.data.frame(Titanic)), since the Freq column only exists in that form. For enhanced usage you can also use the R Commander GUI with the sub menu Data > Active Dataset.

8 Sorting Data
Sorting a data frame in ascending order by a variable:
    AggregatedData<- sort(AggregatedData, by=~ Package)
Sorting a data frame in descending order by a variable:
    AggregatedData<- sort(AggregatedData, by=~ -Installed)
Note that sort() with a by= formula does not work on data frames in base R; it needs an add-on package that supplies a data frame method (Deducer provides one). In base R the same results come from order():
    AggregatedData <- AggregatedData[order(AggregatedData$Package), ]
    AggregatedData <- AggregatedData[order(-AggregatedData$Installed), ]

9 Transforming a Dataset Structure around a single variable
Using the reshape2 package we can use the melt and acast functions:
    library("reshape2")
    tDat.m<- melt(tDat)
    tDatCast<- acast(tDat.m,Subject~Item)
If we choose not to use the reshape2 package, we can use the default reshape method in R. Please do note this takes longer processing time for bigger datasets.
    df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data
Using the scan() function we can type in data in a list.

11 Using diff for lags and cumsum for cumulative sums
We can use the diff function to calculate the difference between two successive values of a variable:
    diff(Dataset$X)
The cumsum function gives the cumulative sum:
    cumsum(Dataset$X)
Both names are lower case; R is case-sensitive, so Diff() and Cumsum() will not be found.
    > x=rnorm(10,20) # This gives 10 normally distributed random numbers with mean 20 (the standard deviation defaults to 1)
    > x
    20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585
    20.02222 18.99311
    > diff(x)
    -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296
    -0.6536232 -1.0291181
    > cumsum(x)
    20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320
    159.22905 179.25128 198.24438
    > diff(x,2) # The full form is diff(x, lag = 1, differences = 1, ...), where differences is the order of differencing; the second argument here sets lag = 2
    -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064
    -1.6827413
Reference: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

12 Merging Data
The Deducer GUI makes it much simpler to merge datasets.
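Looking back at steps 4, 5, 6, 8 and 11, here is a small base R sketch that runs end to end. The visits data frame and its columns (id, when, hits) are hypothetical and exist only for this illustration; sorting is done with order() rather than a package-specific sort method.

```r
# Hypothetical data frame, for illustration only
visits <- data.frame(
  id   = c("a", "b", "c", "d"),
  when = c("01/09/2009", "03/09/2009", "04/09/2009", "10/09/2009"),
  hits = c(5, 12, 7, 3),
  stringsAsFactors = FALSE
)

# Step 5: parse day/month/year text into Date objects
visits$when <- as.Date(visits$when, format = "%d/%m/%Y")

# Step 4: keep rows inside a date window, comparing Date with Date
in_window <- subset(visits,
                    when >= as.Date("2009-09-02") & when <= as.Date("2009-09-04"))

# Step 6: elapsed time between two dates, reported in days
gap <- difftime(as.Date("2009-09-10"), as.Date("2009-09-01"), units = "days")

# Step 8: sort with order(), ascending by id and descending by hits
by_id   <- visits[order(visits$id), ]
by_hits <- visits[order(-visits$hits), ]

# Step 11: successive differences and running totals
hit_changes <- diff(visits$hits)    # 7 -5 -4
hit_totals  <- cumsum(visits$hits)  # 5 17 24 27
```

Converting the text column to Date once, up front, keeps the later comparisons and difftime calls free of repeated as.Date() conversions.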
The simplest syntax for a merge statement is
    totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable
We can use multiple methods for aggregating and by-group processing of variables.
Two functions we explore here are aggregate and tapply. Referring to the R online manual at [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]:
    ## Compute the averages for the variables in 'state.x77', grouped
    ## according to the region (Northeast, South, North Central, West) that
    ## each state belongs to
    aggregate(state.x77, list(Region = state.region), mean)
Using tapply:
    ## tapply(Summary Variable, Group Variable, Function)
Reference [http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]

We can also use specialized packages for data manipulation. For additional by-group processing you can see the doBy package as well as the plyr package.

doBy contains a variety of utilities including:
1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data,
2) General linear contrasts and LSMEANS (least-squares-means also known as population means),
3) HTMLreport for automatic generation of HTML file from R-script with a minimum of markup,
4) various other utilities,
and is available at [http://cran.r-project.org/web/packages/doBy/index.html].
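The merge, aggregate and tapply calls described above can be tried on a small example using base R only. The accounts and owners data frames below are invented for this sketch; only the key names AccountId and Region come from the merge statement in step 12.

```r
# Hypothetical account tables, for illustration only
accounts <- data.frame(AccountId = c(1, 2, 3),
                       Region    = c("East", "West", "East"),
                       Balance   = c(100, 250, 75))
owners   <- data.frame(AccountId = c(1, 2, 4),
                       Region    = c("East", "West", "North"),
                       Owner     = c("Ann", "Bob", "Cy"))

# Step 12: merge on both key columns; by default only rows that
# match in both data frames are kept
merged <- merge(accounts, owners, by = c("AccountId", "Region"))

# Step 13 with aggregate(): group means returned as a data frame
region_means <- aggregate(Balance ~ Region, data = accounts, FUN = mean)

# Step 13 with tapply(): group means returned as a named vector
region_vec <- tapply(accounts$Balance, accounts$Region, mean)
```

aggregate() is convenient when the result feeds another data frame step, while tapply() is handy for a quick named lookup such as region_vec["East"].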
plyr, available at [http://cran.r-project.org/web/packages/plyr/index.html], is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarise data by panels or collapse high-dimensional arrays to simpler summary statistics.
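The break-down, operate, put-back-together pattern can also be written by hand in base R, which may help show what such packages automate. This is only a sketch of the pattern on the built-in mtcars data, not plyr's own implementation; the names pieces, slopes and result are made up.

```r
# Split: one data frame per cylinder count (4, 6 and 8)
pieces <- split(mtcars, mtcars$cyl)

# Apply: fit a small linear model to each piece and keep the slope on wt
slopes <- lapply(pieces, function(d) coef(lm(mpg ~ wt, data = d))[["wt"]])

# Combine: collect the per-group results into one data frame
result <- data.frame(cyl = names(slopes),
                     slope = unlist(slopes),
                     row.names = NULL)
result
```

plyr wraps these three steps into single calls such as ddply(), so the splitting variable and the per-piece function are stated once and the combining is handled for you.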
Here is an interview with Beth Schultz, Editor in Chief, AllAnalytics.com.
AllAnalytics.com ( http://www.allanalytics.com/ ) is the new online community on Predictive Analytics, and it's a bit different in emphasizing quality more than just quantity. Beth is a veteran in tech journalism and communities.
Ajay-Describe your journey in technology journalism and communication. What are the other online communities that you have been involved with?
Beth- I’m a longtime IT journalist, having begun my career covering the telecommunications industry at the brink of AT&T’s divestiture — many eons ago. Over the years, I’ve covered the rise of internal corporate networking; the advent of the Internet and creation of the Web for business purposes; the evolution of Web technology for use in building intranets, extranets, and e-commerce sites; the move toward a highly dynamic next-generation IT infrastructure that we now call cloud computing; and development of myriad enterprise applications, including business intelligence and the analytics surrounding them. I have been involved in developing online B2B communities primarily around next-generation enterprise IT infrastructure and applications. In addition, Shawn Hessinger, our community editor, has been involved in myriad Web sites aimed at creating community for small business owners.
Ajay- Technology geeks get all the money while journalists get a story. Comments please
Beth- Great technology geeks — those being the ones with technology smarts as well as business savvy — do stand to make a lot of money. And some pursue that to all ends (with many entrepreneurs gunning for the acquisition) while others more or less fall into it. Few journalists, at least few tech journalists, have big dollars in mind. The gratification for journalists comes in being able to meet these folks, hear and deliver their stories — as appropriate — and help explain what makes this particular technology geek developing this certain type of product or service worth paying attention to.
Ajay- Describe what you are trying to achieve with the All Analytics community and how it seeks to differentiate itself with other players in this space.
Beth- With AllAnalytics.com, we’re concentrating on creating the go-to site for CXOs, IT professionals, line-of-business managers, and other professionals to share best practices, concrete experiences, and research about data analytics, business intelligence, information optimization, and risk management, among many other topics. We differentiate ourselves by featuring excellent editorial content from a top-notch group of bloggers, access to industry experts through weekly chats, ongoing lively and engaging message board discussions, and biweekly debates.
We’re a new property, and clearly in rapid building mode. However, we’ve already secured some of the industry’s most respected BI/analytics experts to participate as bloggers. For example, a small sampling of our current lineup includes the always-intriguing John Barnes, a science fiction novelist and statistics guru; Sandra Gittlen, a longtime IT journalist with an affinity for BI coverage; Olivia Parr-Rud, an internationally recognized expert in BI and organizational alignment; Tom Redman, a well-known data-quality expert; and Steve Williams, a leading BI strategy consultant. I blog daily as well, and in particular love to share firsthand experiences of how organizations are benefiting from the use of BI, analytics, data warehousing, etc. We’ve featured inside looks at analytics initiatives at companies such as 1-800-Flowers.com, Oberweis Dairy, the Cincinnati Zoo & Botanical Garden, and Thomson Reuters, for example.
In addition, we’ve hosted instant e-chats with Web and social media experts Joe Stanganelli and Pierre DeBois, and this Friday, Aug. 26, at 3 p.m. ET we’ll be hosting an e-chat with Marshall Sponder, Web metrics guru and author of the newly published book, Social Media Analytics: Effective Tools for Building, Interpreting, and Using Metrics. (Readers interested in participating in the chat do need to fill out a quick registration form, available here http://www.allanalytics.com/register.asp . The chat is available here http://www.allanalytics.com/messages.asp?piddl_msgthreadid=241039&piddl_msgid=439898#msg_439898 .)
Experts participating in our biweekly debate series, called Point/Counterpoint, have broached topics such as BI in the cloud, mobile BI and whether an analytics culture is truly possible to build.
Ajay- What are some tips you would like to share with aspiring bloggers about writing tech stories?
Beth- I suppose my best advice is this: Don’t write about technology for technology’s sake. Always strive to tell the audience why they should care about a particular technology, product, or service. How might a reader use it to his or her company’s advantage, and what are the potential benefits? Improved productivity, increased revenue, better customer service? Providing anecdotal evidence goes a long way toward delivering that message, as well.
Ajay- What are the other IT world websites that have made a mark on the internet?
Beth- I’d be remiss if I didn’t give a shout out to UBM TechWeb sites, including InformationWeek, which has long charted the use of IT within the enterprise; Dark Reading, a great source for folks interested in securing an enterprise’s information assets; and Light Reading, which takes the pulse of the telecom industry.
Beth Schultz has more than two decades of experience as an IT writer and editor. Most recently, she brought her expertise to bear writing thought-provoking editorial and marketing materials on a variety of technology topics for leading IT publications and industry players. Previously, she oversaw multimedia content development, writing and editing for special feature packages at Network World. Beth has a keen ability to identify business and technology trends, developing expertise through in-depth analysis and early-adopter case studies. Over the years, she has earned more than a dozen national and regional editorial excellence awards for special issues from American Business Media, American Society of Business Press Editors, Folio.net, and others.
1) More Presentation Templates
2) More HTML 5 clipart
3) Online LaTeX (LyX) GUI (or a Chrome Extension)
4) Online Speech to Text dictation (or a Chrome Extension)
5) Online Screen Capture software for audio and video editing (or a Chrome Extension)
6) Some sharing of usage and statistics with world tech community
7) An on-site, in-house version for enterprise software customers (?)
8) An easy to make HTML5 editor using just the browser
Seriously http://googledocs.blogspot.com/ needs to be challenged more.