R releases new version 2.9.2

What is new in 2.9.2 (technical details, not marketing spit and shine),

and what didn't work in 2.9.1 (shockingly, bugs are fixed openly!):

NEW FEATURES

    o   install.packages(NULL) now lists packages only once even if they
        occur in more than one repository (as the latest compatible
        version of those available will always be downloaded).
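
        A minimal interactive sketch of the change (the two repository
        URLs here are illustrative assumptions, not taken from the
        release notes):

            ## two repositories that may both offer the same package
            options(repos = c(CRAN  = "http://cran.r-project.org",
                              extra = "http://www.stats.ox.ac.uk/pub/RWin"))
            install.packages(NULL)   # interactive picker: each package is
                                     # now listed once; the latest compatible
                                     # version is what gets downloaded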

    o   approxfun() and approx() now accept a 'rule' of length two, for
        easy specification of different interpolation rules on left and
        right.

        They no longer segfault for invalid zero-length specifications
        of 'yleft', 'yright', or 'f'.
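
        A small sketch of the two-sided 'rule' (the printed results are
        what the documented semantics imply, not output copied from a
        session):

            x <- 1:5
            y <- x^2
            ## rule 1 (return NA) to the left of the data range,
            ## rule 2 (carry the boundary y-value) to the right
            f <- approxfun(x, y, rule = c(1, 2))
            f(0)    # NA  : left of the data, rule 1
            f(10)   # 25  : y[5] extended rightwards, rule 2
            f(2.5)  # 6.5 : ordinary linear interpolation inside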

    o   seq_along(x) is now equivalent to seq_len(length(x)) even where
        length() has an S3/S4 method; previously it (intentionally)
        always used the default method for length().
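
        A sketch of the difference, using a made-up S4 class purely to
        give length() a non-default method:

            setClass("pair", representation(a = "numeric", b = "numeric"))
            setMethod("length", "pair", function(x) 2L)
            p <- new("pair", a = 1, b = 2)
            length(p)     # 2, from the S4 method
            seq_along(p)  # 1 2 -- now matches seq_len(length(p));
                          # formerly the default length() was used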

    o   PCRE has been updated to version 7.9 (for bug fixes).

    o   agrep() uses 64-bit ints where available on 32-bit platforms
        and so may do a better job with complex matches.
        (E.g. PR#13789, which failed only on 32-bit systems.)

DEPRECATED & DEFUNCT

    o   R CMD Rd2txt is deprecated, and will be removed in 2.10.0.
        (It is just a wrapper for R CMD Rdconv -t txt.)

    o   tools::Rd_parse() is deprecated and will be removed in 2.10.0
        (which will use only Rd version 2).

BUG FIXES

    o   parse_Rd() still did not handle source reference encodings
        properly.

    o   The C utility function PrintValue no longer attempts to print
        attributes for CHARSXPs as those attributes are used
        internally for the CHARSXP cache.  This fixes a segfault when
        calling it on a CHARSXP from C code.

    o   PDF graphics output was producing two instances of anything
        drawn with the symbol font face. (Report from Baptiste Auguie.)

    o   length(x) <- newval and grep() could cause memory corruption.
        (PR#13837)

    o   If model.matrix() was given too large a model, it could crash
        R. (PR#13838, fix found by Olaf Mersmann.)

    o   gzcon() (used by load()) would re-open an open connection,
        leaking a file descriptor each time. (PR#13841)

    o   The checks for inconsistent inheritance reported by setClass()
        now detect inconsistent superclasses and give better warning
        messages.

    o   print.anova() failed to recognize the column labelled
        P(>|Chi|) from a Poisson/binomial GLM anova as a p-value
        column in order to format it appropriately (and as a
        consequence it gave no significance stars).

    o   A missing PROTECT caused rare segfaults during calls to
        load().  (PR#13880, fix found by Bill Dunlap.)

    o   gsub() in a non-UTF-8 locale with a marked UTF-8 input
        could in rare circumstances overrun a buffer and so segfault.

    o   R CMD Rdconv --version was not working correctly.

    o   Missing PROTECTs in nlm() caused "random" errors. (PR#13381 by
        Adam D.I. Kramer, analysis and suggested fix by Bill Dunlap.)

    o   Some extreme cases of pbeta(log.p = TRUE) are more accurate
        (finite values < -700 rather than -Inf).  (PR#13786)

        pbeta() now reports on more cases where the asymptotic
        expansions lose accuracy (the underlying TOMS708 C code was
        ignoring some of these, including the PR#13786 example).

    o   new.env(hash = TRUE, size = NA) now works the way it has been
        documented to for a long time.
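
        For example (a minimal sketch; 29 is the documented default
        size, and env.profile() is used here only to inspect it):

            e <- new.env(hash = TRUE, size = NA)  # NA now means "use the default"
            env.profile(e)$size                   # 29, the documented default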

    o   tcltk::tk_choose.files(multi = TRUE) produces better-formatted
        output with filenames containing spaces.  (PR#13875)

    o   R CMD check --use-valgrind did not run valgrind on the package
        tests.

    o   The tclvalue() and the print() and as.xxx methods for class
        "tclObj" crashed R with an invalid object -- seen with an
        object saved from an earlier session.

    o   R CMD BATCH garbled the option -d <debugger> (useful for
        valgrind, although --debugger=valgrind always worked).

    o   INSTALL with LazyData and Encoding declared in DESCRIPTION
        might have left options("encoding") set for the rest of the
        package installation.

And from www.r-project.org, the rest of the updated news:
  • R version 2.9.2 was released on 2009-08-24. The source code will first become available in this directory, and eventually via all of CRAN. Binaries will arrive in due course (see download instructions above).
  • The first issue of The R Journal is now available.
  • The R Foundation has been awarded four slots for R projects in the Google Summer of Code 2009.
  • DSC 2009, The 6th workshop on Directions in Statistical Computing, has been held at the Center for Health and Society, University of Copenhagen, Denmark, July 13-14, 2009.
  • useR! 2009, the R user conference, was held at Agrocampus Rennes, France, July 8-10, 2009.
  • useR! 2010, the R user conference, will be held at NIST, Gaithersburg, Maryland, USA, July 21-23, 2010.
  • We have started to collect information about local UseR Groups in the R Wiki.

Citation – http://www.r-project.org

Book Review (short): Data Driven – Profiting from Your Most Important Business Asset, by Tom Redman

Once in a while comes a book that squeezes a lot of common sense into easy-to-execute paradigms, adds some flavours of anecdote, and adds penetrating insights as the topping. Data Driven by Tom Redman is such a book, and it may rightly be called the successor to Davenport's now-epic tome Competing on Analytics.

Data Driven is divided into three parts:

1) Data Quality – including the opportunity costs of bad data management.

2) Putting Data and Information to Work.

3) Creating a Management System for Data and Information.

At 218 pages not including the appendix, this is an easy read for anyone who needs to refresh their mental batteries with data-hygiene perspectives. With terrific wisdom and easy-to-communicate language and paradigms, it should mark another important chapter in bringing data quality to the forefront, rather than the back burner, of Business Intelligence and Business Analytics. All the trillion-dollar algorithms and software in the world are useless without data quality. Read this book and it will show you how to use the most important, valuable, and underused asset: data.

Best of DecisionStats – Modeling and Text Mining, Part 3

Here are some of the top articles by number of views, in an area I love: modeling and text mining.

1) Karl Rexer – Rexer Analytics

http://www.decisionstats.com/2009/06/09/interview-karl-rexer-rexer-analytics/

Karl produces one of the most respected surveys capturing emerging trends in data mining and technology. He was also one of the most enthusiastic people I have interviewed, and I am thankful for his help in getting me some more interviews.

2) Gregory Piatetsky-Shapiro

One of the earliest and easily the best knowledge discoverers of all time, Gregory produces http://www.kdnuggets.com, and its newsletter is easily the one must-read newsletter in the field. Gregory was doing data mining while the Google boys were still debating whether or not to drop out of Stanford.

The Top DecisionStats Articles – Part 2: Business Intelligence and Data Quality

I am a self-confessed novice at business intelligence. I understand the broad concepts, understand reporting tools, and definitely forecasting tools. But the whole systems view still baffles me. Fortunately, I have been learning from some of the best writers in this field. Here, in order of circulation, are the top Business Intelligence articles.

Business Intelligence


1) Jill Dyche

http://www.decisionstats.com/2009/06/30/interview-jill-dyche-baseline-consulting/

Jill is a fabulously wise and experienced person with a great writing style. Her answers were some of the most educative I have seen in BI writing.

2) Peter Thomas

http://www.decisionstats.com/2009/07/02/peter-james-thomas-bi/

The best of British BI is epitomized by Peter Thomas, and he is truly a European giant in the field. His worst weakness is a tendency to disappear when Test cricket is on, but that is eminently understandable. I can relate to the cricket as well.

3) Karen Lopez

http://www.decisionstats.com/2009/07/28/interview-karen-lopez/

Karen gives excellent insight into creating mock-ups, or data models, before actual implementation. She has worked on this for three decades, and her wisdom is clearly visible here.

Data Quality

Data quality is such an overlooked and easy-to-fix issue that I believe any BI vendor that builds the best, most robust data-quality architecture will gain the maximum Pareto-like benefits from the results. Curiously, competing BI vendors will often compete on price, graphics appeal, and so on, but the simple Garbage In, Garbage Out rule is something they should consider. The data quality interviews gave me an important tutorial in these aspects of data management.

1) Jim Harris

http://www.decisionstats.com/tag/jim-harris/

Jim is a one-man army when it comes to evangelizing data quality, and his OCDQ blog is widely read and cited.

2) Steve Sarsfield

http://www.decisionstats.com/2009/08/13/interview-steve-sarsfield-author-the-data-governance-imperative/

His excellent book is the one must-read item that people in cost-cutting corporations should buy, especially if they are considering going down the Davenport competing-on-analytics route.

(To be continued –

Part 3: Modeling and Text Mining

Part 4: Social Media

Part 5: Humour and Poetry)

The Top DecisionStats Articles – Part 1: Analytics

I was just looking at my web analytics numbers and we seem to have crossed some milestones.

The site has now gotten more than 50,000 views since being launched in Dec 2007.

Thank you everyone for your help with this. More importantly, the quality of comments has been fabulous. Since I am out of ideas for the rest of the week, here is a best-of collection of posts: the favourite articles as measured by page views. I have personal favourites as well, but these are the ranks purely by page views.

Top 5 Interviews

1) Interviews with SAS Institute leaders – I have generally found great professionalism in SAS Institute people. This is surprising because, coming from an open-source background, I often saw SAS looked at as a big brother. I find that more of a perception and less of a reality, as the company continues to innovate.

a) With John Sall, co-founder of SAS Institute – This is really the biggest interview I did in terms of the person involved. To my surprise (I wasn't expecting John to say yes), the interview was really frank, and it came very fast. The answers seemed to be written by John himself.

Quote – "Quantitative fields can be fairly resistant to recession" – John Sall.

http://www.decisionstats.com/2009/07/28/interview-john-sall-jmp/

b) Interview with Anne Milley, Director of Product Marketing, SAS Institute – This is a favourite because it came very soon after the NYTimes article on R. One of my personal opinions is that the difference between great and good leaders is often that great leaders are humble enough to learn and then build on their strengths. It ran in two parts, and I was really appreciative of the in-depth answers that Anne wrote.

Quotes –

"Analytics continues to be our middle name."

"Customers vote with the cheque book."


Interview: Gregory Piatetsky, KDnuggets.com

Here is an interview with Gregory Piatetsky, founder and editor of KDnuggets (www.kdnuggets.com), the oldest and biggest independent industry website for data mining and analytics.


Ajay- Please describe your career in science and the challenges and rewards that came with it: research, degrees, teaching, etc.


Gregory-
I was born in Moscow, Russia, and went to a top math high school in Moscow. A unique challenge for me was that my father was one of the leading mathematicians in the Soviet Union. While I liked math (and still do), I quickly realized while still in high school that I would never be as good as my father, and a math career was not for me.

Fortunately, I discovered computers and really liked the process of programming and solving applied problems. At that time (the late 1970s) computers were not very popular and it was not clear that one could make a career in computers. However, I was very lucky that I was able to pursue what I liked and find demand for my skills.

I got my MS in 1979 and PhD in 1984 in Computer Science from New York University.
I was interested in AI (perhaps thanks to a lot of science fiction I read as a kid), but found a job in databases, so I was looking for ways to combine them.

In 1984 I joined GTE Labs, where I worked on research in databases and AI, and in 1989 I started the first project on Knowledge Discovery in Data. To help convince my management that there would be a demand for this thing called "data mining" (GTE management did not see much future for it), I also organized an AAAI workshop on the topic.

I thought "data mining" was not a sexy enough name, so I called it "Knowledge Discovery in Data", or KDD. Since 1989 I have been working on KDD and data mining in all their aspects – more on my page www.kdnuggets.com/gps.html

Ajay- How would you encourage a young science entrepreneur in this recession?

Gregory- Many great companies were started or grew in a recession, e.g.
http://www.insidecrm.com/features/businesses-started-slump-111108/

A recession may be compared to a brush fire that removes dead wood and allows new trees to grow.

Ajay- What prompted you to set up KDnuggets? Any reasons for the name (kNowledge Discovery Nuggets)? Describe some key milestones for this iconic website for data mining people.

Gregory- After the third KDD workshop in 1993, I started a newsletter to connect the roughly 50 people who had attended the workshop, and possibly others interested in data mining and KDD. The idea was that it would have short items or "nuggets" of information. Also, at that time a popular metaphor for data miners was gold miners looking for gold "nuggets". So I wanted a newsletter with "nuggets" – short, valuable items about knowledge discovery. Thus the name KDnuggets.

In 1994 I created a website on data mining at GTE, and in 1997, after I left GTE, I moved it to the current domain name, www.kdnuggets.com.

In 1999 I was working for a startup which provided data mining services to the financial industry. However, because of Y2K issues, all the banks froze their systems in the second half of 1999 and we had very little work (and our salaries were reduced as well). I decided I would try to get some ads, and was able to get companies like SPSS and Megaputer to advertise.

Since 2001 I have been an independent consultant, and KDnuggets is only part of what I do. I also do data mining consulting, and actively participate in SIGKDD (Director 1998-2005, Chair 2005-2009).

Some people think that KDnuggets is a large company, with a publisher, webmaster, editor, ad salesperson, billing department, and so on. KDnuggets indeed has all these functions, but it is all me and my two cats.

Ajay- I am impressed by the fact that KDnuggets is almost a dictionary or encyclopedia for data mining. But apart from advertising you have not gone totally commercial: many features of your newsletter remain ad-free, you maintain a minimalistic look, and you do not take sponsorship aligned with one big vendor. What is your vision for KDnuggets in the years to come, to keep it truly independent?

Gregory- My vision for KDnuggets is to be a comprehensive resource for the data mining community, and I really enjoyed maintaining such a resource completely non-commercially for the first 7-8 years. However, when I became self-employed I could not run KDnuggets without any income, so I selectively introduced ads, and only those relevant to data mining.

I like to think of KDnuggets as a Craigslist for the data mining community.

I certainly realize the importance of social media and Web 2.0 (interested people can follow my tweets at twitter.com/kdnuggets) and plan to add more social features to KDnuggets.

Still, just as Wikipedia and Facebook do not make the New York Times obsolete, I think there is room and a need for an edited website, especially for such a nerdy and not very social group as data miners.

Ajay- What is the worst mistake or error you have made in writing or publishing? What is the biggest triumph or high moment in KDnuggets history?

Gregory- My biggest mistake is probably in choosing the name KDnuggets – in retrospect, I could have used a shorter and easier-to-spell domain name, but in 1997 I never expected that I would still be publishing www.KDnuggets.com 12 years later.

Ajay- Who are your favourite data mining students (having known so many people)? What qualities do you think set a data mining person apart from people in other sciences?

Gregory- I was only an adjunct professor for a short time, so I did not really have data mining students, but I was privileged enough to know many current data mining leaders when they were students.  Among more recent students, I am very impressed with Jure Leskovec, who just finished his PhD and got the best KDD dissertation award.

Ajay- What does Gregory Piatetsky do for fun when he is not informing the world about analytics and knowledge discovery?

Gregory- I enjoy travelling with my family, and in the summer I like biking and windsurfing. I also read a lot, and am currently in the middle of reading Proust (which I periodically dilute with other, lighter books).

Ajay- What are your favourite blogs and websites to read? Any plans to visit India?

Gregory- I visit many blogs on www.kdnuggets.com/websites/blogs.html

and I especially like:
– Matthew Hurst's blog Data Mining: Text Mining, Visualization, and Social Media
– Occam's Razor by Avinash Kaushik, examining web analytics
– Juice Analytics, blogging about analytics and visualization
– Geeking with Greg, exploring the future of personalized information

I also like your website, decisionstats.com, and plan to visit it more frequently.

I have visited many countries, but not yet India – I am waiting for the right occasion!

Biography

(http://www.kdnuggets.com/gps.html)

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, web mining, and business analytics. Gregory is considered one of the founders of the data mining and knowledge discovery field. He has edited or co-edited many collections on data mining and knowledge discovery, including two best-selling books: Knowledge Discovery in Databases (AAAI/MIT Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press, 1996), and has over 60 publications in the areas of data mining, artificial intelligence, and database research.

Gregory is the founder of the Knowledge Discovery in Databases (KDD) conference series. He organized and chaired the first three KDD workshops in 1989, 1991, and 1993. He then served as Chair of the KDD Steering Committee and guided the conversion of the KDD workshops into the leading international conferences on data mining. He was also General Chair of the KDD-98 conference.

Interview: Tasso Argyros, CTO, Aster Data Systems

Here is an interview with Tasso Argyros, the CTO and co-founder of Aster Data Systems (www.asterdata.com). Aster Data Systems is one of the first DBMS vendors to tightly integrate SQL with MapReduce.


Ajay- Maths and science student numbers the world over are in major decline. What would you recommend to young students looking for careers in science?

[TA]- My father is a professor of Mathematics and I spent a lot of my college time studying advanced math. What I would say to new students is that math is not a way to get a job, it's a way to learn how to think. As such, a math education can lead to success in any discipline that requires intellectual ability. As long as they take the time to specialize at some point – via postgraduate education or a job where they can learn a new discipline from smart people – they won't regret the investment.

Ajay- Describe your career in science, particularly your time at Stanford. What made you think of starting up Aster Data? How important is it for a team, rather than an individual, to begin a startup? Could you describe the moment when your team came together?

[TA]- While at Stanford I became very familiar with the world of startups through my advisor, David Cheriton (who was an angel investor in VMware and Google and founder of two successful companies). My research was about processing large amounts of data on large, low-cost computer farms. A year into my research it became obvious that this approach had huge processing-power advantages and was superior to anything else I could see in the marketplace. I then happened to meet my other two co-founders, Mayank Bawa and George Candea, who were looking at a similar technical problem from the database and reliability perspectives, respectively.

I distinctly remember George walking into my office one day (I barely knew him back then) and saying "I want to talk to you about startups and the future" – the rest has become history.

Ajay- How would you describe your product, Aster nCluster Cloud Edition, to somebody who does not know anything beyond traditional server/data-warehouse technologies? Could you rate it against some known vendors and give the usage level at which the total cost of ownership of Aster Data becomes cheaper than, say, an Oracle, SAP, or Microsoft data-warehousing solution?

[TA]- Aster allows businesses to reduce data-analytics TCO in two interesting ways. First, it has a much lower hardware cost than any traditional DW technology because it uses commodity servers or cloud infrastructure like Amazon EC2. Secondly, Aster has implemented a lot of innovations that simplify the (previously tedious and expensive) management of the system, which includes scaling the system elastically up or down as needed – so customers are not paying for capacity they don't need at a given point in time.

But cutting costs is one side of the equation; what makes me even more excited is the ability to make a business more profitable, competitive, and efficient by analyzing more data at greater depth. We have customers that have cut their costs and increased their customers and revenue by using Aster to analyze their valuable (and usually underutilized) data. If you have data – and you think you're not taking full advantage of it – Aster can help.

Ajay- I always have this one favourite question: when can I analyze 100 gigabytes of data using just a browser and some statistical software like R, or the advanced forecasting software that is available? Describe some of Aster Data's work in enhancing the analytical capabilities of big data.

Can I run R (free, open source) on an on-demand basis on an Aster Data solution? How much would it cost me to crunch 100 GB of data and build segmentations and models with, say, 50 hours of processing time per month?

[TA]- One of Aster's big innovations is to allow analytical applications like R to be embedded in the database via our SQL/MapReduce framework. We actually have customers right now who are using R to do advanced analytics over terabytes of data. 100 GB is actually on the lower end of what our software can handle, and as such the cost would not be significant.

Ajay- What do people at Aster Data do when they are not making complex software?

[TA]- A lot of Asterites love to travel around the world – we are, after all, a very diverse company. We also love coffee, Indian food, and international and US sports like soccer, cricket, cycling, and football!

Ajay- Name some products that compete with Aster Data, and where Aster Data products are more suitable from a TCO viewpoint. Name specific areas where you would not recommend your own products.

[TA]- We go up against products like the Oracle database, Teradata, and IBM DB2. If you need to do analytics over hundreds of GBs or terabytes of data, our price/performance ratio would be orders of magnitude better.

Ajay- How do you convince a famous and experienced VC like Sequoia Capital to invest in a start-up? (e.g. I could do with some financing for server costs.)

[TA]- You need to convince Sequoia of three things: (a) that the market you're going after is very large (in the billions of dollars, if you're successful); (b) that your team is the best set of people that could ever come together to solve the particular problem you're trying to solve; and (c) that the technology you've developed gives you an "unfair advantage" over incumbents or new market entrants. Most importantly, you have to smile a lot!

Biography

About Tasso:

Tasso (Tassos) Argyros is the CTO and co-founder of Aster Data Systems, where he is responsible for all product and engineering operations of the company. Tasso was recently recognized as one of BusinessWeek's Best Young Tech Entrepreneurs for 2009 and was an SAP fellow in the Stanford Computer Science department. Prior to Aster, Tasso was pursuing a Ph.D. in the Stanford Distributed Systems Group, with a focus on designing cluster architectures for fast, parallel data processing using large farms of commodity servers. He holds an MSc in Computer Science from Stanford University and a Diploma in Computer and Electrical Engineering from the Technical University of Athens.

About Aster:

Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics – the first DBMS to tightly integrate SQL with MapReduce – providing deep insights on data analyzed on clusters of low-cost commodity hardware.

The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis. Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget.

Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton, Rajeev Motwani and Ron Conway.
