I use Windows 7 on my laptop (it came pre-installed) and Ubuntu using the VMware Player. What are the advantages of using VMware Player instead of creating a dual-boot system? Well, I can quickly shift from Ubuntu to Windows and back again without restarting my computer every time. This approach lets me use software that runs only on Windows and also run software like Rattle, the R data mining GUI, that is much easier to install on Linux.
However, if your statistical software is on your virtual disk and your data is on your Windows disk, you need a way to move data from Windows to Ubuntu.
Open My Computer and browse to the folder you want to share. Right-click on the folder and select Properties. Open the Sharing tab. Select the radio button to “Share this Folder”. Change the default generated name if you wish; add a description if you wish. Click the Permissions button to modify the security settings for which users can read or write to the share.
On the Linux side, it depends on the distro, the shell, and the window manager.
Well, Ubuntu makes it really easy to configure the Linux steps to move data between Windows and Linux partitions.
VMware makes it easy to share between your Windows (host) and Linux (guest) OS
and step 2
Start the Wizard
When you finish the wizard and share a drive or folder- hey, where do I see my shared ones?
See this folder in Linux- /mnt/hgfs (bingo!)
Hacker HW – Make this folder /mnt/hgfs a shortcut in Places on your Ubuntu startup
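If you would rather pull data out of the shared folder from a script than by hand, a rough Python sketch could look like the one below. It assumes the shares appear under /mnt/hgfs as described above; the function names, the share name and the file name are all made up for illustration and are not part of VMware or Ubuntu.

```python
import shutil
from pathlib import Path


def list_shared_folders(root="/mnt/hgfs"):
    """Return the sorted names of the folders found under the share root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing is mounted here (for example, no folder has been shared yet).
        return []
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())


def copy_from_share(share_name, file_name, dest_dir, root="/mnt/hgfs"):
    """Copy one file from a shared folder into dest_dir and return its new path."""
    source = Path(root) / share_name / file_name
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 also keeps the file's timestamps where the filesystem allows it.
    return Path(shutil.copy2(source, dest))
```

For example, `copy_from_share("data", "sales.csv", "/home/me/projects")` would copy a hypothetical sales.csv from a share called data into the Ubuntu home directory, where R or Rattle can read it.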
Hacker Hw 2-
Upload your VM dark data to Ubuntu One using an anon email
Purge using software XX
Reinstall VM and bring back backup
Note time to do this
-General Sharing in Windows
Just open the Network tab in Ubuntu- see screenshots below-
Windows will now ask your Ubuntu user for login-
Once logged in to Windows from within Ubuntu in VMware, this is what happens
You see a tab called “users on [windows username]-pc” appear on your Ubuntu Desktop (see top right of the screenshot)
If you double click it- you see your Windows path
You can now just click and drag data between your Windows and Linux partitions, just the way you do it in Windows.
So based on this- if you want to build decision trees, artificial neural networks, regression models, and even time series models for zero capital expenditure- you can use Ubuntu and R without compromising on your organization's Windows-only IT policy (there is a shortage of Ubuntu trained IT administrators in the enterprise world).
Revised Installation Procedure for utilizing both Ubuntu /R/Rattle data mining on your Windows PC.
Using VMWare to build a free data mining system in R, as well as isolate your analytics system (thus using both Linux and Windows without overburdening your machine)
Using a virtual partition is slightly better than using a dual boot system. That is because you can keep the specialized operating system (usually Linux) within the main operating system (usually Windows), switch between the two operating systems with a simple command, and use the advantages of both operating systems.
Also you can create project specific disks for enhanced security.
In my experience (limited in the case of the Mac), the operating systems compare as follows-
1) Mac- A robust and aesthetically designed OS, but the higher price and hardware lock-in for Mac remain a disadvantage. Also many stats and analytical software packages just won't work on the Mac
2) Windows- It is cheaper than Mac and easier to use than Linux. Also has the most compatibility with applications (usually when not crashing)
3) Linux- The lightest and most customizable software in the OS class, free to use, and has many lite versions for newbies. Not compatible with mainstream corporate IT infrastructure as of 2011.
That enables me to use Ubuntu as the alternative OS- keeping my Windows 7 for some Windows specific applications. For software like Rattle, the R data mining GUI, it helps to use two operating systems, in view of difficulties with GTK+.
Installing Rattle on Windows 7 is a major pain thanks to backward compatibility issues and version issues of GTK, but it installs on Ubuntu like a breeze- and it is very convenient to switch between the two operating systems.
Cygwin is a Linux-like environment for Windows. It consists of two parts:
A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing substantial Linux API functionality.
A collection of tools which provide Linux look and feel
What Isn’t Cygwin?
Cygwin is not a way to run native Linux apps on Windows. You have to rebuild your application from source if you want it to run on Windows.
Cygwin is not a way to magically make native Windows apps aware of UNIX® functionality, like signals, ptys, etc. Again, you need to build your apps from source if you want to take advantage of Cygwin functionality.
VMware Player is the easiest way to run multiple operating systems at the same time on your PC. With its user-friendly interface, VMware Player makes it effortless for anyone to try out Windows 7, Chrome OS or the latest Linux releases, or create isolated virtual machines to safely test new software and surf the Web.
Additional features in R over other analytical packages-
1) Source code is given to ensure complete custom solutions and embedding for a particular application. Open source code has the advantage that it is extensively peer-reviewed in journals and scientific literature. This means bugs will be found, shared and corrected transparently.
2) Wide literature of training material in the form of books is available for the R analytical platform.
3) Some of the best data visualization tools in analytical software (apart from Tableau Software’s latest version). The extensive data visualization available in R takes the form of a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is that the graphical library of packages in R is more advanced and is rapidly getting more features by the day.
4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.
5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.
6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.
Statistics are numbers that tell (descriptive), advise (prescriptive) or forecast (predictive). Analytics is a decision-making help tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non analytical. Thus ease of making a correct decision separates a good analytical platform from a not so good analytical platform. The distinction is likely to be disputed by people of either background- and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason business analytics is different from statistical analysis is the cost of perfect information (data costs in the real world) and the opportunity cost of delayed and distorted decision-making.
Specific to the following domains, R has the following costs and benefits-
R is free per license and for download
It is one of the few analytical platforms that work on Mac OS
Its results are credibly established both in journals like the Journal of Statistical Software and in the work of the analytical teams at LinkedIn, Google and Facebook.
It has open source code for customization as per GPL
It also has flexible options from commercial vendors like Revolution Analytics (who support 64 bit Windows as well as bigger datasets)
It has interfaces from almost all other analytical software including SAS, SPSS, JMP, Oracle Data Mining and RapidMiner. Existing license holders can thus invoke and use R from within these software packages
Huge library of packages for regression, time series, finance and modeling
High quality data visualization packages
R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
Flexibility in tweaking a standard algorithm by seeing the source code
The RATTLE GUI remains the standard GUI for Data Miners using R. It was created and developed in Australia.
Business Dashboards and Reporting
Business Dashboards and Reporting are an essential piece of Business Intelligence and decision-making systems in organizations. R offers data visualization through ggplot2, and GUIs like Deducer and Red-R can help even non R users create a metrics dashboard.
For online dashboards- R has packages like Rweb, Rserve and rApache- which in combination with data visualization packages offer powerful dashboard capabilities.
R can be combined with MS Excel using the RExcel package, which brings R capabilities into Excel. Thus an MS Excel user with no knowledge of R can use the GUI within the RExcel plug-in to use powerful graphical and statistical capabilities.
Additional factors to consider in your R installation-
There are some more choices awaiting you now-
1) Licensing Choices-Academic Version or Free Version or Enterprise Version of R
2) Operating System Choices-Which Operating System to choose from? Unix, Windows or Mac OS.
3) Operating system sub choice- 32 bit or 64 bit.
4) Hardware choices- Cost-benefit trade-offs for additional hardware for R. Choices between local, cluster and cloud computing.
5) Interface choices- Command Line versus GUI? Which GUI to choose as the default start-up option?
6) Software component choice- Which packages to install? There are almost 3000 packages, some of them are complementary, some are dependent on each other, and almost all are free.
7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing- and how to use existing legacy software and hardware for best complementary results with R.
1) Licensing Choices-
You can choose between two kinds of R installations – one is free and open source from http://r-project.org. The other R installation is commercial and is offered by many vendors including Revolution Analytics. However there are other commercial vendors too.
Windows remains the most widely used operating system on this planet. If you are experienced in Windows based computing and are active on analytical projects- it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimal for Microsoft Windows and the help is extensively documented. However there may be some R packages that would not function well under Windows- if that happens, a multiple operating system setup is your next option.
Enterprise R from Revolution Analytics- Enterprise R has a complete R development environment for Windows including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for its version of R including the use of optimized libraries for faster performance.
The main reason for choosing Mac OS remains its considerable appeal in aesthetically designed software- but Mac OS is not a standard operating system for enterprise systems or for statistical computing. However, open source R claims to be quite optimized for it and can be used by existing Mac users. There seem to be no commercial versions of R available as of now for this operating system.
Red Hat Enterprise Linux
Other versions of Linux
Linux is considered a preferred operating system by R users because it has the same open source credentials, is a much better fit for all R packages, and is customizable for big data analytics.
Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had a marketing agreement with Revolution Analytics for an earlier version of Ubuntu- and many R packages can be installed in a straightforward way as Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for its enterprise module. Other popular versions of Linux include openSUSE.
Multiple operating systems-
Virtualization vs Dual Boot-
You can also choose between a VMware Player virtual machine on your computer that is dedicated to R based computing, or a choice of operating system at the startup or booting of your computer. A software program called Wubi helps with the dual installation of Linux and Windows.
64 bit vs 32 bit – Given a choice between 32 bit and 64 bit versions of the same operating system like Ubuntu Linux, the 64 bit version lets R address much more memory than the roughly 4 GB ceiling of a 32 bit system, which matters because R holds its data in RAM. Processing may also be faster, but the gain depends on the workload and should not be assumed to be a factor of 2. However you need to check whether your current hardware can support 64 bit operating systems and if so- you may want to ask your Information Technology manager to upgrade at least some operating systems in your analytics work environment to 64 bit operating systems.
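As a quick way to check what you are currently running, here is a small Python sketch. It only reports the machine type the operating system announces and the pointer size of the Python interpreter itself; it does not prove that the CPU could run a 64 bit OS, and the function names are my own.

```python
import platform
import struct


def interpreter_bits():
    """Pointer size of the running Python interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def describe_platform():
    """Collect the machine type reported by the OS and the interpreter width."""
    return {
        "machine": platform.machine(),  # e.g. a value such as "x86_64" or "i686"
        "interpreter_bits": interpreter_bits(),
    }


if __name__ == "__main__":
    print(describe_platform())
```

A 32 bit result on hardware you believe to be 64 bit capable is a hint that the installed operating system, not the hardware, is the limit.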
Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.
Hardware costs are a significant cost to an analytics environment and are also remarkably depreciated over a short period of time. You may thus examine your legacy hardware, and your future analytical computing needs- and accordingly decide between the various hardware options available for R.
Other analytical software can charge by number of processors, price server versions higher than workstation versions, and price grid computing extremely high if it is available at all. R, by contrast, is well suited for all kinds of hardware environments with flexible costs. R is memory intensive (it limits the size of data analyzed to the RAM size of the machine unless special formats and/or chunking are used), so the hardware you need depends on the size of the datasets used and the number of concurrent users analyzing them. Thus the defining issue is not R but the size of the data being analyzed.
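To make the idea of chunking concrete, here is a minimal sketch in Python rather than R. It computes the mean of one column of a CSV file while holding only a limited number of rows in memory at a time. The function name, the column name and the chunk size are illustrative assumptions, and R packages that do chunking have their own interfaces.

```python
import csv
from itertools import islice


def chunked_column_mean(path, column, chunk_size=10_000):
    """Mean of a numeric CSV column, reading at most chunk_size rows at a time."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # Pull the next block of rows; only this block is held in memory.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            # Skip empty or missing cells instead of failing on them.
            values = [float(row[column]) for row in chunk
                      if row[column] not in ("", None)]
            total += sum(values)
            count += len(values)
    return total / count if count else float("nan")
```

The same pattern (keep a running total and count, discard each chunk after use) works for any statistic that can be updated incrementally, which is why the size of the file no longer has to fit in RAM.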
Local Computing- This is meant to denote when the software is installed locally. For big data the data to be analyzed would be stored in the form of databases.
Server version- Revolution Analytics has differential pricing for server-client versions, but the open source version is free and is the same for server or workstation use.
Cloud Computing- Cloud computing is defined as the delivery of data, processing, systems via remote computers. It is similar to server-client computing but the remote server (also called cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon and many other vendors provide services on top of it. Google is also coming for data storage in the form of clouds (Google Storage), as well as using machine learning in the form of API (Google Prediction API)
Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the Rmpi and the snow packages, among other packages that help with parallel processing.
How much resources
RAM-Hard Disk-Processors- for workstation computing
Instances or API calls for cloud computing
Software Component Choices
Packages to install
Additional software choices
Additional legacy software
Optimizing your R based computing
Libraries to speed up R
citation- R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
A new report from the Linux Foundation found significant growth trends for enterprise usage of Linux- which should be welcome to software companies that have enabled Linux versions of software, service providers that provide Linux based consulting (note- less competition, lower overheads) and to application creators.
Key Findings from the Report
• 79.4 percent of companies are adding more Linux relative to other operating systems in the next five years.
• More people are reporting that their Linux deployments are migrations from Windows than any other platform, including Unix migrations. 66 percent of users surveyed say that their Linux deployments are brand new (“Greenfield”) deployments.
• Among the early adopters who are operating in cloud environments, 70.3 percent use Linux as their primary platform, while only 18.3 percent use Windows.
• 60.2 percent of respondents say they will use Linux for more mission-critical workloads over the next 12 months.
• 86.5 percent of respondents report that Linux is improving and 58.4 percent say their CIOs see Linux as more strategic to the organization as compared to three years ago.
• Drivers for Linux adoption extend beyond cost: technical superiority is the primary driver, followed by cost and then security.
• The growth in Linux, as demonstrated by this report, is leading companies to increasingly seek Linux IT professionals, with 38.3 percent of respondents citing a lack of Linux talent as one of their main concerns related to the platform.
• Users participate in Linux development in three primary ways: testing and submitting bugs (37.5 percent), working with vendors (30.7 percent) and participating in The Linux Foundation activities (26.0 percent).
and from the report itself-
It is an interesting report (and for some reason in a blue font-making it more like a blue paper than a white paper)