WPS Version 2.5.1 Released – can still run SAS language/data and R

However, this is what Phil Rack, the reseller, is quoting on http://www.minequest.com/Pricing.html:

Windows Desktop Price: $884 on 32-bit Windows and $1,149 on 64-bit Windows.

The Bridge to R is available on Windows platforms and is free to customers who
license WPS through MineQuest, LLC. Companies and organizations outside of North America
may purchase a license for the Bridge to R, which starts at $199 per desktop or $599 per server.

Windows Server Price: $1,903 per logical CPU for 32-bit and $2,474 for 64-bit.

Note that Linux server versions are available but do not yet support the Eclipse IDE and are
command-line only.

WPS sure seems to be going well – but their pricing is no longer fixed, and on the home website you gotta fill in a form. Ditto for the 30-day free evaluation.

http://www.teamwpc.co.uk/products/wps/modules/core

Data File Formats

The WPS Core module presently supports the following data file formats (the source table details read and write support for both uncompressed and compressed data):

  • SD2 (SAS version 6 data set)
  • SAS7BDAT (SAS version 7 data set)
  • SAS7BDAT (SAS version 8 data set)
  • SAS7BDAT (SAS version 9 data set)
  • SASSEQ (SAS version 8/9 sequential file)
  • V8SEQ (SAS version 8 sequential file)
  • V9SEQ (SAS version 9 sequential file)
  • WPD (WPS native data set)
  • WPDSEQ (WPS native sequential file)
  • XPORT (transport format)

Additional access to EXCEL, SPSS and dBASE files is supported by utilising the WPS Engine for DB Files module.

and they have a new product release on Valentine's Day 2011 (oh, these Europeans!)

From the press release at http://www.teamwpc.co.uk/press/wps2_5_1_released

WPS Version 2.5.1 Released 

New language support, new data engines, larger datasets, improved scalability

LONDON, UK – 14 February 2011 – World Programming today released version 2.5.1 of their WPS software for workstations, servers and mainframes.

WPS is a competitively priced, high performance, highly scalable data processing and analytics software product that allows users to execute programs written in the language of SAS. WPS is supported on a wide variety of hardware and operating system platforms and can connect to and work with many types of data with ease. The WPS user interface (Workbench) is frequently praised for its ease of use and flexibility, with the option to include numerous third-party extensions.

This latest version of the software has the ability to manipulate even greater volumes of data, removing the previous 2^31 (2 billion) limit on number of observations.

Complementing the extended data processing capabilities, World Programming has worked hard to boost the performance, scalability and reliability of the WPS software to give users the confidence they need to run heavy workloads whilst delivering maximum value from available computer power.

WPS version 2.5.1 offers additional flexibility with the release of two new data engines for accessing Greenplum and SAND databases. WPS now comes with eleven data engines and can access a huge range of commonly used and industry-standard file-formats and databases.

Support in WPS for the language of SAS continues to expand with more statistical procedures, data step functions, graphing controls and many other language items and options.

WPS version 2.5.1 is available as a free upgrade to all licensed users of WPS.

Summary of Main New Features:

  • Supporting Even Larger Datasets
    WPS is now able to process very large data sets by lifting completely the previous size limit of 2^31 observations.
  • Performance and Scalability Boosted
    Performance and scalability improvements across the board combine to ensure even the most demanding large and concurrent workloads are processed efficiently and reliably.
  • More Language Support
    WPS 2.5.1 continues the expansion of its language support with over 70 new language items, including new Procedures, Data Step functions and many other language items and options.
  • Statistical Analysis
    The procedure support in WPS Statistics has been expanded to include PROC CLUSTER and PROC TREE.
  • Graphical Output
    The graphical output from WPS Graphing has been expanded to accommodate more configurable graphics.
  • Hash Tables
    Support is now provided for hash tables.
  • Greenplum®
    A new WPS Engine for Greenplum provides dedicated support for accessing the Greenplum database.
  • SAND®
    A new WPS Engine for SAND provides dedicated support for accessing the SAND database.
  • Oracle®
    Bulk loading support now available in the WPS Engine for Oracle.
  • SQL Server®
    To enhance existing SQL Server database access, a new SQLSERVR (please note spelling) facility has been added to the ODBC engine.

More Information:

Existing Users should visit www.teamwpc.co.uk/support/wps/release where you can download a readme file containing more information about all the new features and fixes in WPS 2.5.1.

New Users should visit www.teamwpc.co.uk/products/wps where you can explore in more detail all the features available in WPS or request a free evaluation.

and from http://www.teamwpc.co.uk/products/wps/data it seems they are going on the BIG DATA submarine as well:

Data Support 

Extremely Large Data Size Handling

WPS can now handle extremely large data sets, as the previous limit of 2^31 observations has been lifted.

Access Standard Databases

Use I/O Features in WPS Core

  • CLIPBOARD (Windows only)
  • DDE (Windows only)
  • EMAIL (via SMTP or MAPI)
  • FTP
  • HTTP
  • PIPE (Windows and UNIX only)
  • SOCKET
  • STDIO
  • URL

Use Standard Data File Formats

LibreOffice Stable Release launched

The non-Oracle OpenOffice completes an important milestone – from the press release:

The Document Foundation launches LibreOffice 3.3

The first stable release of the free office suite is available for download

The Internet, January 25, 2011 – The Document Foundation launches LibreOffice 3.3, the first stable release of the free office suite developed by the community. In less than four months, the number of developers hacking LibreOffice has grown from less than twenty in late September 2010, to well over one hundred today. This has allowed us to release ahead of the aggressive schedule set by the project.

Not only does it ship a number of new and original features, LibreOffice 3.3 is also a significant achievement for a number of reasons:

– the developer community has been able to build their own and independent process, and get up and running in a very short time (with respect to the size of the code base and the project’s strong ambitions);

– thanks to the high number of new contributors having been attracted into the project, the source code is quickly undergoing a major clean-up to provide a better foundation for future development of LibreOffice;

– the Windows installer, which is going to impact the largest and most diverse user base, has been integrated into a single build containing all language versions, thus reducing the size for download sites from 75 GB to 11 GB, making it easier for us to deploy new versions more rapidly and lowering the carbon footprint of the entire infrastructure.

Caolán McNamara from RedHat, one of the developer community leaders, comments, “We are excited: this is our very first stable release, and therefore we are eager to get user feedback, which will be integrated as soon as possible into the code, with the first enhancements being released in February. Starting from March, we will be moving to a real time-based, predictable, transparent and public release schedule, in accordance with Engineering Steering Committee’s goals and users’ requests”. The LibreOffice development roadmap is available at http://wiki.documentfoundation.org/ReleasePlan

LibreOffice 3.3 brings several unique new features. The 10 most-popular among community members are, in no particular order:

  1. the ability to import and work with SVG files;
  2. an easy way to format title pages and their numbering in Writer;
  3. a more-helpful Navigator Tool for Writer;
  4. improved ergonomics in Calc for sheet and cell management;
  5. and Microsoft Works and Lotus Word Pro document import filters.

In addition, many great extensions are now bundled, providing PDF import, a slide-show presenter console, a much improved report builder, and more besides.

A more-complete and detailed list of all the new features offered by LibreOffice 3.3 is viewable on the following web page: http://www.libreoffice.org/download/new-features-and-fixes/

LibreOffice 3.3 also provides all the new features of OpenOffice.org 3.3, such as new custom properties handling; embedding of standard PDF fonts in PDF documents; new Liberation Narrow font; increased document protection in Writer and Calc; auto decimal digits for “General” format in Calc; 1 million rows in a spreadsheet; new options for CSV import in Calc; insert drawing objects in Charts; hierarchical axis labels for Charts; improved slide layout handling in Impress; a new easier-to-use print interface; more options for changing case; and colored sheet tabs in Calc. Several of these new features were contributed by members of the LibreOffice team prior to the formation of The Document Foundation.

LibreOffice hackers will be meeting at FOSDEM in Brussels on February 5 and 6, and will be presenting their work during a one-day workshop on February 6, with speeches and hacking sessions coordinated by several members of the project.

The home of The Document Foundation is at http://www.documentfoundation.org

The home of LibreOffice is at http://www.libreoffice.org where the download page has been redesigned by the community to be more user-friendly.

*** About The Document Foundation

The Document Foundation has the mission of facilitating the evolution of the OOo Community into a new, open, independent, and meritocratic organization within the next few months. An independent Foundation is a better reflection of the values of our contributors, users and supporters, and will enable a more effective, efficient and transparent community. TDF will protect past investments by building on the achievements of the first decade, will encourage wide participation within the community, and will co-ordinate activity across the community.

*** Media Contacts for TDF

Florian Effenberger (Germany)

Mobile: +49 151 14424108 – E-mail: floeff@documentfoundation.org

Olivier Hallot (Brazil)

Mobile: +55 21 88228812 – E-mail: olivier.hallot@documentfoundation.org

Charles H. Schulz (France)

Mobile: +33 6 98655424 – E-mail: charles.schulz@documentfoundation.org

Italo Vignoli (Italy)

Mobile: +39 348 5653829 – E-mail: italo.vignoli@documentfoundation.org

Computer Education grants from Google


Message from the official Google blog:

http://googleblog.blogspot.com/2011/01/supporting-computer-science-education.html

With programs like Computer Science for High School (CS4HS), we hope to increase the number of CS majors —and therefore the number of people entering into careers in CS—by promoting computer science curriculum at the high school level.

For the fourth consecutive year, we’re funding CS4HS to invest in the next generation of computer scientists and engineers. CS4HS is a workshop for high school and middle school computer science teachers that introduces new and emerging concepts in computing and provides tips, tools and guidance on how to teach them. The ultimate goals are to “train the trainer,” develop a thriving community of high school CS teachers and spread the word about the awe and beauty of computing.

If you’re a university, community college, or technical school in the U.S., Canada, Europe, the Middle East or Africa and are interested in hosting a workshop at your institution, please visit www.cs4hs.com to submit an application for grant funding. Applications will be accepted between January 18, 2011 and February 18, 2011.

In addition to submitting your application, on the CS4HS website you’ll find info on how to organize a workshop, as well as websites and agendas from last year’s participants to give you an idea of how the workshops were structured in the past. There’s also a collection of CS4HS curriculum modules that previous participating schools have shared for future organizers to use in their own programs.

Comparing Bit Torrent Downloaders

Tux, as originally drawn by Larry Ewing
Image via Wikipedia

I personally like UTorrent on Windows and KTorrent on Linux.

While I’m no expert on this, I’ll take anything that gets the data down faster while maximizing my pipe’s efficiency.

I also prefer torrenting to the sudo apt-get method of downloading software, or the zip/unzip, tar/untar, install/make-file routine.

Torrenting is a simpler way of sharing applications, but sadly it is not used much by the stats computing community for sharing downloads.

Also, I think any dashboard or visualization should be sorted (not alphabetically, but numerically or categorically).

SORT THE DASHBOARD - KEEP IT SORTED
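
A tiny R sketch of that idea (the numbers below are made up, purely for illustration): reorder the categories by value before plotting, instead of accepting the default alphabetical order.

    # Hypothetical numbers, purely for illustration: plot categories
    # sorted by value instead of the default alphabetical order.
    downloads <- c(KTorrent = 120, uTorrent = 340, Deluge = 90, Vuze = 210)
    barplot(sort(downloads, decreasing = TRUE),
            main = "Sorted dashboard", las = 2)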

So, after sorting, I am partially recreating the data viz from http://en.wikipedia.org/wiki/Comparison_of_BitTorrent_clients

BitTorrent client | Magnet URI | Super-seeding | Embedded tracker | UPnP | NAT Port Mapping Protocol | NAT traversal | DHT | Peer exchange | Encryption | UDP tracker | LPD
µTorrent | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
BitSpirit | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
BitTorrent 6 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
OneSwarm | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
qBittorrent | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
SoMud | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Vuze (formerly Azureus) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
BitComet | Yes | Yes | Separate download | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
Tixati | Yes | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Partial
Aria2 | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | Yes
Tribler | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | No
Bitflu | Yes | No | No | No | No | No | Yes | Yes | No | Yes | No
Deluge | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Flush | Yes | No | No | Yes | Yes | No | Yes | Yes | No | No | Yes
KTorrent | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Partial
Shareaza | Yes | No | No | Yes | Yes | No | Yes | Yes | No | No | No
Transmission | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes
LimeWire | Partial | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | No
BitTyrant | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No
BitTornado | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No
Torrent Swapper | No | Yes | Yes | Yes | No | No | No | Yes | No | No | No
Localhost | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No
Meerkat Bittorrent Client | No | Yes | No | Yes | Yes | Yes | Yes | No | Yes | No | No
rTorrent | No | Yes | No | No | No | No | Yes | Yes | Yes | Yes | No
TorrentFlux | No | Yes | No | Yes | No | No | No | No | Yes | No | No
TorrentVolve | No | Partial | No | Partial | Partial | Partial | Partial | Partial | Partial | Partial | No
Opera | No | No | Yes | No | No | No | No | Yes | No | No | No
BitTorrent 5 / Mainline | No | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | No
ABC | No | No | Yes | Yes | No | No | No | No | No | No | No
Blog Torrent | No | No | Yes | No | No | No | No | No | No | No | No
MLDonkey | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No
Tomato Torrent | No | No | Yes | No | No | No | Yes | No | No | No | No
Acquisition | No | No | No | No | Yes | No | No | No | No | No | No
Arctic Torrent | No | No | No | No | No | No | No | Yes | No | No | No
BitLet | No | No | No | Yes | No | No | No | No | No | No | No
BitLord | No | No | No | Yes | No | Yes | No | Yes | No | Yes | No
BitThief | No | No | No | No | No | No | No | No | No | No | No
Bits on Wheels | No | No | No | No | No | No | No | No | No | No | No
BTG | No | No | No | Yes | Yes | No | Yes | Yes | Yes | Yes | No
BTPD | No | No | No | No | No | No | No | No | No | No | No
FlashGet | No | No | No | No | No | No | Yes | No | Yes | No | No
Folx | No | No | No | Yes | Yes | No | Yes | Yes | No | Yes | No
Free Download Manager | No | No | No | No | No | No | Yes | Yes | No | No | No
G3 Torrent | No | No | No | No | No | No | No | No | No | No | No
Gnome BitTorrent | No | No | No | No | No | No | No | No | No | No | No
Halite | No | No | No | Yes | Yes | No | Yes | No | Yes | No | No
QTorrent | No | No | No | No | No | No | No | No | No | No | No
Rufus | No | No | No | No | No | No | No | No | No | No | No
SymTorrent | No | No | No | N/A | N/A | N/A | No | No | No | No | No
Tonido Torrent | No | No | No | Yes | Yes | Yes | Yes | No | No | No | No
Torium | No | No | No | Yes | No | No | Yes | No | No | No | No
ZipTorrent | No | No | No | Yes | Yes | No | No | Yes | No | No | No

Interview Luis Torgo Author Data Mining with R

Example of k-nearest neighbour classification
Image via Wikipedia

Here is an interview with Prof Luis Torgo, author of the recent best seller “Data Mining with R: Learning with Case Studies”.

Ajay- Describe your career in science. How do you think more young people can be made interested in science?

Luis- My interest in science only started after I’ve finished my degree. I’ve entered a research lab at the University of Porto and started working on Machine Learning, around 1990. Since then I’ve been involved generally in data analysis topics both from a research perspective as well as from a more applied point of view through interactions with industry partners on several projects. I’ve spent most of my career at the Faculty of Economics of the University of Porto, but since 2008 I’m at the department of Computer Science of the Faculty of Sciences of the same university. At the same time I’ve been a researcher at LIAAD / Inesc Porto LA (www.liaad.up.pt).

I like a lot what I do and like science and the “scientific way of thinking”, but I cannot say that I’ve always thought of this area as my “place”. Most of all I like solving challenging problems through data analysis. If that translates into some scientific outcome then I’m more satisfied, but that is not my main goal, though I’m kind of “forced” to think about that because of the constraints of an academic career.

That does not mean I’m not passionate about science, I just think there are many more ways of “doing science” than what is reflected in the usual “scientific indicators” that most institutions seem to be more and more obsessed about.

Regarding interesting young people in science, that is a hard question that I’m not sure I’m qualified to answer. I do tend to think that young people are more sensible to concrete examples of problems they think are interesting and that science helps in solving, as a way of finding a motivation for facing the hard work they will encounter in a scientific career. I do believe in case studies as a nice way to learn and motivate, and thus my book 😉

Ajay- Describe your new book “Data Mining with R: Learning with Case Studies”. Why did you choose a case-study-based approach? Who is the target audience? What is your favorite case study from the book?

Luis- This book is about learning how to use R for data mining. The book follows a “learn by doing it” approach to data mining instead of the more common theoretical description of the available techniques in this discipline. This is accomplished by presenting a series of illustrative case studies for which all necessary steps, code and data are provided to the reader. Moreover, the book has an associated web page (www.liaad.up.pt/~ltorgo/DataMiningWithR) where all code inside the book is given so that easy copy-paste is possible for the more lazy readers.

The language used in the book is very informal, without many theoretical details on the data mining techniques used. For obtaining these theoretical insights there are already many good data mining books, some of which are referred to in “further readings” sections given throughout the book. The decision to follow this writing style had to do with the intended target audience of the book.

In effect, the objective was to write a monograph that could be used as a supplemental book for practical classes on data mining that exist in several courses, but at the same time that could be attractive to professionals working on data mining in non-academic environments, and thus the choice of this more practically oriented approach.

Regarding my favorite case study, that is a hard question for an author… still, I would probably choose the “Predicting Stock Market Returns” case study (Chapter 3). Not only because I like this challenging problem, but mainly because the case study addresses all aspects of knowledge discovery in a real world scenario and not only the construction of predictive models. It tackles data collection, data pre-processing, model construction, transforming predictions into actions using different trading policies, using business-related performance metrics, implementing a trading simulator for “real-world” evaluation, and laying out grounds for constructing an online trading system.

Obviously, for all these steps there are far too many options for it to be possible to describe/evaluate all of them in a chapter; still, I do believe that for the reader it is important to see the overall picture, and read about the relevant questions on this problem and some possible paths that can be followed at these different steps.

In other words: do not expect to become rich with the solution I describe in the chapter!
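
To make the “predictive models on returns” idea concrete, here is a toy R sketch of my own (it is not code from Chapter 3 of the book and uses simulated returns): fit a simple model of next-day returns on lagged returns and check its directional accuracy out of sample.

    # Toy sketch only -- NOT code from "Data Mining with R"; it just illustrates
    # the idea of modelling next-day returns from lagged returns.
    set.seed(1)
    ret <- rnorm(1000, sd = 0.01)                 # stand-in for daily returns
    df  <- data.frame(y  = ret[3:1000],
                      l1 = ret[2:999],            # return one day back
                      l2 = ret[1:998])            # return two days back
    train <- df[1:800, ]
    test  <- df[801:998, ]

    fit  <- lm(y ~ l1 + l2, data = train)         # simplest possible model
    pred <- predict(fit, newdata = test)

    # a business-flavoured check: how often is the sign of the move right?
    mean(sign(pred) == sign(test$y))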

Ajay- Apart from R, what other data mining software do you use or have used in the past? How would you compare their advantages and disadvantages with R?

Luis- I’ve played around with Clementine, Weka, RapidMiner and Knime, but really only playing with teaching goals, and no serious use/evaluation in the context of data mining projects. For the latter I mainly use R or software developed by myself (either in R or other languages). In this context, I do not think it is fair to compare R with these or other tools as I lack serious experience with them. I can however, tell you about what I see as the main pros and cons of R. The main reason for using R is really not only the power of the tool that does not stop surprising me in terms of what already exists and keeps appearing as contributions of an ever growing community, but mainly the ability of rapidly transforming ideas into prototypes. Regards some of its drawbacks I would probably mention the lack of efficiency when compared to other alternatives and the problem of data set sizes being limited by main memory.

I know that there are several efforts around for solving this latter issue, not only from the community (e.g. http://cran.at.r-project.org/web/views/HighPerformanceComputing.html), but also from the industry (e.g. Revolution Analytics), but I would prefer that at this stage this would be a standard feature of the language so that the “normal” user need not worry about it. But then this is a community effort, and if I’m not happy with the current status, instead of complaining I should do something about it!
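
As one example of those community efforts, here is a minimal sketch (assuming the CRAN bigmemory package, one of the packages listed in that task view) of keeping a large matrix file-backed on disk instead of in RAM:

    # Minimal sketch (assumes the CRAN 'bigmemory' package is installed):
    # a file-backed matrix lives on disk, so its size is not bounded by RAM.
    library(bigmemory)

    x <- filebacked.big.matrix(nrow = 1e6, ncol = 5, type = "double",
                               backingfile = "big.bin",
                               descriptorfile = "big.desc")
    x[1, ] <- rnorm(5)   # reads and writes are indexed like an ordinary matrix
    x[1, ]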

Ajay- Describe your writing habits. How did you set about writing the book – did you write a fixed amount daily, or do you write in bursts, etc.?

Luis- Unfortunately, I write in bursts whenever I find some time for it. This is much more tiring and time consuming as I need to read back material far too often, but I cannot afford dedicating too much consecutive time to a single task. Actually, I frequently tease my PhD students when they “complain” about the lack of time for doing what they have to, that they should learn to appreciate the luxury of having a single task to complete because it will probably be the last time in their professional life!

Ajay- What do you do to relax or unwind when not working?

Luis- For me, the best way to relax from work is by playing sports. When I’m involved in some game I reset my mind and forget about all other things, and this is very relaxing for me. Apart from sports I enjoy a lot spending time with my family and friends. A good and long dinner with friends over a good bottle of wine can do miracles when I’m too stressed with work! Finally, I do love traveling around with my family.

Luis Torgo

Short Bio: Luis Torgo has a degree in Systems and Informatics Engineering and a PhD in Computer Science. He is an Associate Professor of the Department of Computer Science of the Faculty of Sciences of the University of Porto. He is also a researcher of the Laboratory of Artificial Intelligence and Data Analysis (LIAAD) belonging to INESC Porto LA. Luis Torgo has been an active researcher in Machine Learning and Data Mining for more than 20 years. He has led several academic and industrial Data Mining research projects. Luis Torgo has followed the R project almost since its beginning, using it in his research activities. He teaches R at different levels and has given several courses in different countries.

For reading “Data Mining with R” you can visit this site, and also avail of a 20% discount the publishers have generously given (message below):

For more information and to place an order, visit us at http://www.crcpress.com.  Order online and apply 20% Off discount code 907HM at checkout.  CRC is pleased to offer free standard shipping on all online orders!

link to the book page  http://www.crcpress.com/product/isbn/9781439810187

Price: $79.95
Cat. #: K10510
ISBN: 9781439810187
ISBN 10: 1439810184
Publication Date: November 09, 2010
Number of Pages: 305
Availability: In Stock
Binding(s): Hardback 

Interview Ajay Ohri Decisionstats.com with DMR

From-

http://www.dataminingblog.com/data-mining-research-interview-ajay-ohri/

Here is the winner of the Data Mining Research People Award 2010: Ajay Ohri! Thanks to Ajay for giving some time to answer Data Mining Research questions. And all the best to his blog, Decision Stat!

Data Mining Research (DMR): Could you please introduce yourself to the readers of Data Mining Research?

Ajay Ohri (AO): I am a business consultant and writer based out of Delhi, India. I have been working in and around the field of business analytics since 2004, and have worked with some very good and big companies, primarily in financial analytics and outsourced analytics. Since 2007, I have been writing my blog at http://decisionstats.com which now has almost 10,000 views monthly.

All in all, I write about data, and my hobby is also writing (poetry). Both my hobby and my profession stem from my education (a masters in business and a bachelors in mechanical engineering).

My research interests in data mining are interfaces (simpler interfaces to enable better data mining), education (making data mining less complex and accessible to more people and students), and time series and regression (specifically ARIMAX).
In business, my research interests are software marketing strategies (open source, Software as a Service, advertising-supported versus traditional licensing) and the creation of technology and entrepreneurial hubs (like Palo Alto and Research Triangle, or Bangalore, India).

DMR: I know you have worked with both SAS and R. Could you give your opinion about these two data mining tools?

AO: As per my understanding, SAS stands for the SAS language, SAS Institute and the SAS software platform. The terms are used interchangeably by people in industry and academia, but there have been some branding issues on this.
I have not worked much with SAS Enterprise Miner, probably because I could not afford it as a business consultant, and the organizations I worked with did not have a budget for Enterprise Miner.
I have worked alone and in teams with Base SAS, SAS Stat, SAS Access, and SAS ETS – and JMP. Also I worked with SAS BI, but as a user, to extract information.
You could say my use of the SAS platform was mostly in predictive analytics and reporting, but I have a couple of projects under my belt for knowledge discovery and data mining, and pattern analysis. Again, some of my SAS experience is a bit dated, being from almost a year ago.

I really like specific parts of the SAS platform – the interface design of JMP (which is better than Enterprise Guide or Base SAS), and Proc Sort in Base SAS. I guess sequential processing of data makes SAS way faster, though with computing evolving from desktops/servers to even cheaper time-shared cloud computers, I am not sure how long Base SAS and SAS Stat can hold this unique selling proposition.

I dislike the clutter in SAS Stat output, it confuses me with too much information, and I dislike the shoddy graphics in the rendering output of SAS’s graphical engine. It’s shoddy coding work in SAS/Graph, and if JMP can give better graphics, why is legacy source code preventing the SAS platform from doing a better job of it?

I sometimes think the best part of SAS is actually code written by Goodnight and Sall in the 1970s; the latest procs don’t impress me much.

SAS as a company is something I admire, especially for its way of treating employees globally – but it is strange to see the rest of the tech industry not following it. Also I don’t like the over-aggression and the SAS-versus-rest-of-the-analytics/data-mining-world mentality that I sometimes pick up when I deal with industry thought leaders.

Making SAS Enterprise Miner, JMP, and Base SAS available in a completely new web interface priced at per-hour rates is on my wishlist, but I guess I am a bit sentimental here – most data miners I know from the early 2000s did start with SAS as their first bread-earning software. Also, I think SAS needs to be priced better in Business Intelligence – it seems quite cheap in BI compared to Cognos/IBM but expensive in analytical licensing.

If you are a new stats or business student, chances are you may know much more R than SAS today. The shift in education at least has been very rapid, and I guess R is also more of a platform than an analytics or data mining software.

I like a lot of things in R – from graphics, to better data mining packages, to the modular design of the software, but above all I like the can-do, kick-ass spirit of the R community. Lots of young people collaborating with lots of young-to-old professors, and the energy is infectious. Everybody is a CEO in R’s world. The latest data mining algorithms will probably start in R, and then get published in journals.

Which is better for data mining, SAS or R? It depends on your data and your deadline. The golden rule of management and business is: it depends.

Also I have worked with a lot of KXEN, SQL, SPSS.

DMR: Can you tell us more about Decision Stats? You have a traffic of 120′000 for 2010. How did you reach such a success?

AO: I don’t think 120,000 is a success. It’s not a failure. It just happened – the more I wrote, the more people read. In 2007-2008 I used to obsess over traffic. I tried SEO, comments, back linking, and I did some black hat experimental stuff. Some of it worked, some didn’t.

In the end, I started asking questions and interviewing people. To my surprise, senior management is almost always more candid, frank and honest about their views, while middle managers, public relations and marketing folks can be defensive.

Social media helped a bit – Twitter, LinkedIn and Facebook really helped my network of friends, who I suppose acted as informal ambassadors to spread the word.
Again, I was constrained more by necessity than choice – my middle-class finances (I also had a baby son in 2007; my current laptop still has some broken keys :) ), my inability to afford traveling to conferences, and my location – Delhi isn’t really a tech hub.

The more questions I asked around the internet, the more people responded, and I wrote it all down.

I guess I just was lucky to meet a lot of nice people on the internet who took time to mentor and educate me.

I tried building other websites but didn’t succeed, so I guess I really don’t know. I am not a smart coder, not very clever at writing, but I do try to be honest.

Basic economics says pricing is proportional to demand and inversely proportional to supply. Honest and candid opinions have infinite demand and an uncertain supply.

DMR: There is a rumor about an R book you plan to publish in 2011 :-) Can you confirm the rumor and tell us more?

AO: I just signed a contract with Springer for “R for Business Analytics”. R is great software, and there are lots of books for statistically trained people, but I felt like writing a book for MBAs and existing analytics users on how to easily transition to R for analytics.

Like any language, R has its tricks and tweaks, and with a focus on code editors, IDEs, GUIs and web interfaces, R’s famous learning curve can be bent a bit.
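
As a small illustration of the GUI route (not an excerpt from the forthcoming book), the R Commander package gives a menu-driven front end that sidesteps much of the command line at first:

    # One possible GUI on-ramp for beginners (assumes an internet connection
    # for the one-time install); loading the package opens the R Commander window.
    install.packages("Rcmdr")
    library(Rcmdr)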

Making analytics beautiful, and simpler to use, is always a passion for me. With 3000 packages, R can be used for a lot more things, and a lot more simply, than is commonly understood.
The target audience, however, is business analysts, or people working in corporate environments.

Brief Bio-
Ajay Ohri has been working in the field of analytics since 2004, when it was a still-nascent, emerging industry in India. He has worked with the top two Indian outsourcers listed on the NYSE, and with Citigroup on cross-sell analytics, where he helped sell an extra 50,000 credit cards through cross-sell analytics. He was one of the very first independent data mining consultants in India working on analytics products and domestic Indian market analytics. He regularly writes on analytics topics on his website www.decisionstats.com and is currently working on open source analytical tools like R, besides analytical software like SPSS and SAS.

How to balance your online advertising and your offline conscience

Google in 1998, showing the original logo
Image via Wikipedia

I recently found an interesting example of a website that both makes a lot of money and yet is much more efficient than any free or non-profit one. It is called ECOSIA.

If you want to see a website that balances administrative costs and has a transparent way to make the world better, this is a great example.

  • http://ecosia.org/how.php
  • HOW IT WORKS
    You search with Ecosia.
  • Perhaps you click on an interesting sponsored link.
  • The sponsoring company pays Bing or Yahoo for the click.
  • Bing or Yahoo gives the bigger chunk of that money to Ecosia.
  • Ecosia donates at least 80% of this income to support WWF’s work in the Amazon.
  • If you like what we’re doing, help us spread the word!
  • Key facts about the park:

    • World’s largest tropical forest reserve (38,867 square kilometers, or about the size of Switzerland)
    • Home to about 14% of all amphibian species and roughly 54% of all bird species in the Amazon – not to mention large populations of at least eight threatened species, including the jaguar
    • Includes part of the Guiana Shield containing 25% of world’s remaining tropical rainforests – 80 to 90% of which are still pristine
    • Holds the last major unpolluted water reserves in the Neotropics, containing approximately 20% of all of the Earth’s water
    • One of the last tropical regions on Earth vastly unaltered by humans
    • Significant contributor to climatic regulation via heat absorption and carbon storage

     

http://ecosia.org/statistics.php

They claim to have donated 141,529.42 EUR!!!

http://static.ecosia.org/files/donations.pdf

Well, suppose you are the web admin of a very popular website like Wikipedia, etc.

One way to meet server costs is to say openly: hey, I need to balance my costs, so I need some money.

The other way is to use online advertising.

I started mine with Google Adsense.

CPM (cost per mille, i.e. per thousand impressions) gives you a very low conversion compared to contacting an ad sponsor directly.

But it’s a great data experiment, as you can monitor:

which companies are likely to be advertised on your site (assume Google knows more about its algorithms than you ever will),

which formats (banner, text or Flash) have what kind of conversion rates, and

what the expected payoff rates are from various keywords or companies (business intelligence software, predictive analytics software and statistical computing software are similar keywords but have different expected returns, if you remember your eco class).

NOW – based on the above data, you know your minimum baseline to expect from a private advertiser versus a public, crowd-sourced search engine one (like Google or Bing).

Let’s say you have 100,000 page views monthly, and assume one out of 1,000 page views will lead to a click. Say the advertiser will pay you $1 for every click (= 1,000 impressions).

Then your expected revenue is $100. But if your clicks are priced at $2.50 per click, and your click-through rate is now 3 out of 1,000 impressions (both very moderate increases that can be done by basic placement optimization of ad type, graphics, etc.), your new revenue is $750.

Be a good Samaritan – you decide to share some of this with your audience, like 4 Amazon books per month (or 1 free Amazon book per week). That gives you a cost of $200 and leaves you with some $550.
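
The same back-of-the-envelope arithmetic as a small R sketch (all numbers are the hypothetical ones above, not real traffic data):

    # Back-of-the-envelope ad revenue, using the hypothetical figures above.
    views         <- 100000        # monthly page views
    ctr_baseline  <- 1 / 1000      # 1 click per 1,000 impressions
    ctr_optimised <- 3 / 1000      # after basic placement optimization
    cpc_baseline  <- 1.0           # dollars per click
    cpc_optimised <- 2.5

    revenue <- function(v, ctr, cpc) v * ctr * cpc
    revenue(views, ctr_baseline,  cpc_baseline)     # 100
    revenue(views, ctr_optimised, cpc_optimised)    # 750

    # giving away 4 Amazon books a month ($200 in total) leaves roughly 550
    revenue(views, ctr_optimised, cpc_optimised) - 200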

Wait! It doesn’t end there – Adam Smith’s invisible hand moves on.

You say: hmm, let me put $100 towards an annual paper-writing contest of $1,000, donate $200 to One Laptop per Child (or to the Amazon rainforests, or to Haiti, etc.), pay $100 for your upgraded server hosting, and put $350 into online advertising – say $200 for search engines and $150 for Facebook.

Woah!

Month 1 should see more people visiting you for the first time. If you have a good return rate (returning visitors as a %) and a low bounce rate (visits of less than 5 seconds), your traffic should see at least a 20% jump in new arrivals and 5-10% in long-term arrivals. Ignoring bounces, within three months you will have one of the following:

1) an interesting case study on online and social media advertising statistics, tangible motivations for increasing community response, and some good data for study;

2) hopefully, better cost management of your server expenses;

3) very hopefully, a positive cash flow.

You could even set a percentage and share the monthly (or better, annual) figures openly with your readers and advertisers.

Go ahead – change the world!

The key paradigms here are:

sharing your traffic and revenue openly with everyone,

donating to a suitable cause,

helping increase awareness of that cause, and

basing contributions on fixed percentages rather than absolute numbers, to ensure your site and cause are sustained for years.