Saving Output in R for Presentations

While SAS language has a beautifully designed ODS (Output Delivery System) for saving output from certain analysis in excel files (and html and others), in R one can simply use the object, put it in a write.table and save it a csv file using the file parameter within write.table.

As a business analytics consultant, the output from a Proc Means, Proc Freq (SAS) or a summary/describe/table command (in R) is to be presented as a final report. Copying and pasting is not feasible especially for large amounts of text, or remote computers.

Using the following we can simple save the output  in R

 

> getwd()
[1] “C:/Users/KUs/Desktop/Ajay”
> setwd(“C:\Users\KUs\Desktop”)

#We shifted the directory, so we can save output without putting the entire path again and again for each step.

#I have found the summary command most useful for initial analysis and final display (particularly during the data munging step)

nams=summary(ajay)

# I assigned a new object to the analysis step (summary), it could also be summary,names, describe (HMisc) or table (for frequency analysis),
> write.table(nams,sep=”,”,file=”output.csv”)

Note: This is for basic beginners in R using it for business analytics dealing with large number of variables.

 

pps: Note

If you have a large number of files in a local directory to be read in R, you can avoid typing the entire path again and again by modifying the file parameter in the read.table and changing the working directory to that folder

 

setwd(“C:/Users/KUs/Desktop/”)
ajayt1=read.table(file=”test1.csv”,sep=”,”,header=T)

ajayt2=read.table(file=”test2.csv”,sep=”,”,header=T)

 

and so on…

maybe there is a better approach somewhere on Stack Overflow or R help, but this will work just as well.

you can then merge the objects created ajayt1 and ajayt2… (to be continued)

Cyber Cold War

I try to write on cyber conflict without getting into the politics of why someone is hacking someone else. I always get beaten by someone in the comments thread when I write on politics.

But recent events have forced me to update my usual “how-to” cyber conflict to “why” cyber conflict. This is because of a terrorist attack in my hometown Delhi.

(updated-

http://www.nytimes.com/2012/02/14/world/middleeast/israeli-embassy-officials-attacked-in-india-and-georgia.html?_r=1&hp

Iran allegedly tried  (as per Israel) to assassinate the wife of Israeli Defence Attache in Delhi using a magnetic bomb, India as she went to school to pick up her kids, somebody else put a grenade in Israeli embassy car in Georgia which was found in time. 

Based on reports , initial work suggests the bomb was much more sophisticated than local terrorists, but the terrorists seemed to have some local recce work done.

India has 0 history of antisemitism but this is the second time Israelis have been targeted since 26/11 Mumbai attacks. India buys 12 % of oil annually from Iran (and refuses to join the oil embargo called by US and Europe)

Cyber Conflict is less painful than conflict, which is inevitable as long as mankind exists. Also the Western hemisphere needs a moon shot (cyber conflict could be the Sputnik like moment) and with declining and aging populations but better technology, Western Hemisphere govts need cyber conflict as they are running out of humans to fight their wars. Eastern govt. are even more obnoxious in using children for conflict propaganda, and corruption.

Last week CIA.gov website went down

This week Iranian govt is allegedly blocking https traffic on eve of Annual Revolution Day (what a coincidence!)

 

Some resources to help Internet users in Iran (or maybe this could be a dummy test for the big one – hacking the great firewall of China)

News from Hacker News-

http://news.ycombinator.com/item?id=3575029

 

I’m writing this to report the serious troubles we have regarding accessing Internet in Iran at the moment. Since Thursday Iranian government has shutted down the https protocol which has caused almost all google services (gmail, and google.com itself) to become inaccessible. Almost all websites that reply on Google APIs (like wolfram alpha) won’t work. Accessing to any website that replies on https (just imaging how many websites use this protocol, from Arch Wiki to bank websites). Also accessing many proxies is also impossible. There are almost no official reports on this and with many websites and my email accounts restricted I can just confirm this based on my own and friends experience. I have just found one report here:

Iran Shut Down Gmail , Google , Yahoo and sites using “Https” Protocol

The reason for this horrible shutdown is that the Iranian regime celebrates 1979 Islamic revolution tomorrow.

I just wanted to let you guys know about this. If you have any solution regarding bypassing this restriction please help!

 

The boys at Tor think they can help-

but its not so elegant, as I prefer creating a  batch file rather than explain coding to newbies. 

this is still getting to better and easier interfaces

https://www.torproject.org/projects/obfsproxy-instructions.html.en

Obfsproxy Instructions

client torrc

Step 1: Install dependencies, obfsproxy, and Tor

 

You will need a C compiler (gcc), the autoconf and autotools build system, the git revision control system, pkg-config andlibtoollibevent-2 and its headers, and the development headers of OpenSSL.

On Debian testing or Ubuntu oneiric, you could do:
# apt-get install autoconf autotools-dev gcc git pkg-config libtool libevent-2.0-5 libevent-dev libevent-openssl-2.0-5 libssl-dev

If you’re on a more stable Linux, you can either try our experimental backport libevent2 debs or build libevent2 from source.

Clone obfsproxy from its git repository:
$ git clone https://git.torproject.org/obfsproxy.git
The above command should create and populate a directory named ‘obfsproxy’ in your current directory.

Compile obfsproxy:
$ cd obfsproxy
$ ./autogen.sh && ./configure && make

Optionally, as root install obfsproxy in your system:
# make install

If you prefer not to install obfsproxy as root, you can instead just modify the Transport lines in your torrc file (explained below) to point to your obfsproxy binary.

You will need Tor 0.2.3.11-alpha or later.


Step 2a: If you’re the client…

 

First, you need to learn the address of a bridge that supports obfsproxy. If you don’t know any, try asking a friend to set one up for you. Then the appropriate lines to your tor configuration file:

UseBridges 1
Bridge obfs2 128.31.0.34:1051
ClientTransportPlugin obfs2 exec /usr/local/bin/obfsproxy --managed

Don’t forget to replace 128.31.0.34:1051 with the IP address and port that the bridge’s obfsproxy is listening on.
 Congratulations! Your traffic should now be obfuscated by obfsproxy. You are done! You can now start using Tor.

For old fashioned tunnel creation under Seas of English Channel-

http://dag.wieers.com/howto/ssh-http-tunneling/

Tunneling SSH over HTTP(S)
This document explains how to set up an Apache server and SSH client to allow tunneling SSH over HTTP(S). This can be useful on restricted networks that either firewall everything except HTTP traffic (tcp/80,tcp/443) or require users to use a local (HTTP) proxy.
A lot of people asked why doing it like this if you can just make sshd listen on port 443. Well, that might work if your environment is not hardened like I have seen at several companies, but this setup has a few advantages.

  • You can proxy to anywhere (see the Proxy directive in Apache) based on names
  • You can proxy to any port you like (see the AllowCONNECT directive in Apache)
  • It works even when there is a layer-7 protocol firewall
  • If you enable proxytunnel ssl support, it is indistinguishable from real SSL traffic
  • You can come up with nice hostnames like ‘downloads.yourdomain.com’ and ‘pictures.yourdomain.com’ and for normal users these will look like normal websites when visited.
  • There are many possibilities for doing authentication further along the path
  • You can do proxy-bouncing to the n-th degree to mask where you’re coming from or going to (however this requires more changes to proxytunnel, currently I only added support for one remote proxy)
  • You do not have to dedicate an IP-address for sshd, you can still run an HTTPS site

Related-

http://opensourceandhackystuff.blogspot.in/2012/02/captive-portal-security-part-1.html

and some crypto for young people

http://users.telenet.be/d.rijmenants/en/onetimepad.htm

 

Me- What am I doing about it? I am just writing poems on hacking at http://poemsforkush.com

Opera Unite- the future of cloud computing browsers

The boys (and ladies) at opera have been busy writing code , while the rest of the coders on the cloud were issuing press releases, attending meetings or just sky diving from the cloud. Judging by the language of apps and extensions, it seems that the  engineers de Vikings et Slavs were busy coding while the Anglo Saxons were busy preparing for IPOs.

I really like the complete anonymity offered by Opera and especially Opera Unite

1) The Adblock option blocks all ads (same as other extensions)

2) The lovely Opera Unite has incredible apps for peer to peer sharing. You can create your own spotify, host your own chat application, transfer files, remote manage your computer. C’est magnifique!

Some really awesome apps on Opera Unite

All these apps can make your own desktop into a remotely managed website- so SOPA is irrelevant even if passed without any protest or non violent protests

(SOPA- an acronym for STOP OBAMA or STOP A (?) , since OBAMA is the one the internet really supports , and he is dependent on that goodwill for fundraising or A is the acronym of a legendary media myth of an imaginary web based organization (imaginary as in iota)

QUOTE

I think it would be a good idea.

 Mahatma Gandhiwhen asked what he thought of Western civilization

Denial of Service Attacks against Hospitals and Emergency Rooms

One of the most frightening possibilities of cyber warfare is to use remotely deployed , or timed intrusion malware to disturb, distort, deny health care services.

Computer Virus Shuts Down Georgia Hospital

A doctor in an Emergency Room depends on critical information that may save lives if it is electronic and comes on time. However this electronic information can be distorted (which is more severe than deleting it)

The electronic system of a Hospital can also be overwhelmed. If there can be built Stuxnet worms on   nuclear centrifuge systems (like those by Siemens), then the widespread availability of health care systems means these can be reverse engineered for particularly vicious cyber worms.

An example of prime area for targeting is Veterans Administration for veterans of armed forces, but also cyber attacks against electronic health records.

Consider the following data points-

http://threatpost.com/en_us/blogs/dhs-warns-about-threat-mobile-devices-healthcare-051612

May 16, 2012, 9:03AM

DHS’s National Cybersecurity and Communications Integration Center (NCCIC) issued the unclassfied bulletin, “Attack Surface: Healthcare and Public Health Sector” on May 4. In it, DHS warns of a wide range of security risks, including that could expose patient data to malicious attackers, or make hospital networks and first responders subject to disruptive cyber attack

http://publicintelligence.net/nccic-medical-device-cyberattacks/

National Cybersecurity and Communications Integration Center Bulletin

The Healthcare and Public Health (HPH) sector is a multi-trillion dollar industry employing over 13 million personnel, including approximately five million first-responders with at least some emergency medical training, three million registered nurses, and more than 800,000 physicians.

(U) A significant portion of products used in patient care and management including diagnosis and treatment are Medical Devices (MD). These MDs are designed to monitor changes to a patient’s health and may be implanted or external. The Food and Drug Administration (FDA) regulates devices from design to sale and some aspects of the relationship between manufacturers and the MDs after sale. However, the FDA cannot regulate MD use or users, which includes how they are linked to or configured within networks. Typically, modern MDs are not designed to be accessed remotely; instead they are intended to be networked at their point of use. However, the flexibility and scalability of wireless networking makes wireless access a convenient option for organizations deploying MDs within their facilities. This robust sector has led the way with medical based technology options for both patient care and data handling.

(U) The expanded use of wireless technology on the enterprise network of medical facilities and the wireless utilization of MDs opens up both new opportunities and new vulnerabilities to patients and medical facilities. Since wireless MDs are now connected to Medical information technology (IT) networks, IT networks are now remotely accessible through the MD. This may be a desirable development, but the communications security of MDs to protect against theft of medical information and malicious intrusion is now becoming a major concern. In addition, many HPH organizations are leveraging mobile technologies to enhance operations. The storage capacity, fast computing speeds, ease of use, and portability render mobile devices an optimal solution.

(U) This Bulletin highlights how the portability and remote connectivity of MDs introduce additional risk into Medical IT networks and failure to implement a robust security program will impact the organization’s ability to protect patients and their medical information from intentional and unintentional loss or damage.

(U) According to Health and Human Services (HHS), a major concern to the Healthcare and Public Health (HPH) Sector is exploitation of potential vulnerabilities of medical devices on Medical IT networks (public, private and domestic). These vulnerabilities may result in possible risks to patient safety and theft or loss of medical information due to the inadequate incorporation of IT products, patient management products and medical devices onto Medical IT Networks. Misconfigured networks or poor security practices may increase the risk of compromised medical devices. HHS states there are four factors which further complicate security resilience within a medical organization.

1. (U) There are legacy medical devices deployed prior to enactment of the Medical Device Law in 1976, that are still in use today.

2. (U) Many newer devices have undergone rigorous FDA testing procedures and come equipped with design features which facilitate their safe incorporation onto Medical IT networks. However, these secure design features may not be implemented during the deployment phase due to complexity of the technology or the lack of knowledge about the capabilities. Because the technology is so new, there may not be an authoritative understanding of how to properly secure it, leaving open the possibilities for exploitation through zero-day vulnerabilities or insecure deployment configurations. In addition, new or robust features, such as custom applications, may also mean an increased amount of third party code development which may create vulnerabilities, if not evaluated properly. Prior to enactment of the law, the FDA required minimal testing before placing on the market. It is challenging to localize and mitigate threats within this group of legacy equipment.

3. (U) In an era of budgetary restraints, healthcare facilities frequently prioritize more traditional programs and operational considerations over network security.

4. (U) Because these medical devices may contain sensitive or privacy information, system owners may be reluctant to allow manufactures access for upgrades or updates. Failure to install updates lays a foundation for increasingly ineffective threat mitigation as time passes.

(U) Implantable Medical Devices (IMD): Some medical computing devices are designed to be implanted within the body to collect, store, analyze and then act on large amounts of information. These IMDs have incorporated network communications capabilities to increase their usefulness. Legacy implanted medical devices still in use today were manufactured when security was not yet a priority. Some of these devices have older proprietary operating systems that are not vulnerable to common malware and so are not supported by newer antivirus software. However, many are vulnerable to cyber attacks by a malicious actor who can take advantage of routine software update capabilities to gain access and, thereafter, manipulate the implant.

(U) During an August 2011 Black Hat conference, a security researcher demonstrated how an outside actor can shut off or alter the settings of an insulin pump without the user’s knowledge. The demonstration was given to show the audience that the pump’s cyber vulnerabilities could lead to severe consequences. The researcher that provided the demonstration is a diabetic and personally aware of the implications of this activity. The researcher also found that a malicious actor can eavesdrop on a continuous glucose monitor’s (CGM) transmission by using an oscilloscope, but device settings could not be reprogrammed. The researcher acknowledged that he was not able to completely assume remote control or modify the programming of the CGM, but he was able to disrupt and jam the device.

http://www.healthreformwatch.com/category/electronic-medical-records/

February 7, 2012

Since the data breach notification regulations by HHS went into effect in September 2009, 385 incidents affecting 500 or more individuals have been reported to HHS, according to its website.

http://www.darkdaily.com/cyber-attacks-against-internet-enabled-medical-devices-are-new-threat-to-clinical-pathology-laboratories-215#axzz1yPzItOFc

February 16 2011

One high-profile healthcare system that regularly experiences such attacks is the Veterans Administration (VA). For two years, the VA has been fighting a cyber battle against illegal and unwanted intrusions into their medical devices

 

http://www.mobiledia.com/news/120863.html

 DEC 16, 2011
Malware in a Georgia hospital’s computer system forced it to turn away patients, highlighting the problems and vulnerabilities of computerized systems.

The computer infection started to cause problems at the Gwinnett Medical Center last Wednesday and continued to spread, until the hospital was forced to send all non-emergency admissions to other hospitals.

More doctors and nurses than ever are using mobile devices in healthcare, and hospitals are making patient records computerized for easier, convenient access over piles of paperwork.

http://www.doctorsofusc.com/uscdocs/locations/lac-usc-medical-center

As one of the busiest public hospitals in the western United States, LAC+USC Medical Center records nearly 39,000 inpatient discharges, 150,000 emergency department visits, and 1 million ambulatory care visits each year.

http://www.healthreformwatch.com/category/electronic-medical-records/

If one jumbo jet crashed in the US each day for a week, we’d expect the FAA to shut down the industry until the problem was figured out. But in our health care system, roughly 250 people die each day due to preventable error

http://www.pcworld.com/article/142926/are_healthcare_organizations_under_cyberattack.html

Feb 28, 2008

“There is definitely an uptick in attacks,” says Dr. John Halamka, CIO at both Beth Israel Deaconess Medical Center and Harvard Medical School in the Boston area. “Privacy is the foundation of everything we do. We don’t want to be the TJX of healthcare.” TJX is the Framingham, Mass-based retailer which last year disclosed a massive data breach involving customer records.

Dr. Halamka, who this week announced a project in electronic health records as an online service to the 300 doctors in the Beth Israel Deaconess Physicians Organization,

Save the Data

Breakdown of political party representation in...
Image via Wikipedia

I just read an online cause here-

http://sunlightfoundation.com/savethedata/

Some of the most important technology programs that keep Washington accountable are in danger of being eliminated. Data.gov, USASpending.gov, the IT Dashboard and other federal data transparency and government accountability programs are facing a massive budget cut, despite only being a tiny fraction of the national budget. Help save the data and make sure that Congress doesn’t leave the American people in the dark.

I wonder why the federal government/ non profit agencies can help create a SPARQL database, and in days of cloud computing, why a tech major cannot donate storage space to it, after all despite US corporate tax rate being high, US technological companies do end up paying a lower rate thanks to tax breaks/routing overseas revenue.

In the new age data is power, and the US has led in its mission to use technology to further its own values even especially in Middle East. The datasets should be made public and transitioned to the private sector/academia for research and re designing for data augmentation with out straining the massive deficit /borrowing/ fighting 3 wars. Of particular interest would be datasets of campaign finances  and donors especially given large number of retail/small donors/internet marketing in elections as it will also help serve as an example of democracy and change. Even countries like China can create a corruption/expense efficiency tracking internal dashboard with restricted rights to help with rural and urban governance.

Comparing Bit Torrent Downloaders

Tux, as originally drawn by Larry Ewing
Image via Wikipedia

I personally like UTorrent on Windows and KTorrent on Linux.

While no experts on this, anything that gets the data down faster while maximizing my pipes efficiency.

I also like Torrenting than  any of the sudo-apt get method of downloading software or the zip unzip,tar untar, install/make file

Torrenting is a simpler way of sharing applications but sadly not used much by the stats computing community to share downloads.

Also I think any dashboard or visualization should be sorted (but not alphabetically but numerically/categorically)

SORT THE DASHBOARD —-KEEP IT SORTED

So I am partially recreating after sorting the data viz from http://en.wikipedia.org/wiki/Comparison_of_BitTorrent_clients

BitTorrent client Magnet URI Super-seeding Embedded tracker UPnP[81] NAT Port Mapping Protocol NAT traversal[82] DHT[83] Peer exchange Encryption UDP tracker LPD
µTorrent Yes Yes[95] Yes[96] Yes[97] Yes Yes[98] Yes[99] Yes[85] Yes[100] Yes Yes[101]
BitSpirit [11] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No
BitTorrent 6 Yes Yes Yes Yes Yes Yes Yes Yes[85] Yes Yes Yes
OneSwarm Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No
qBittorrent Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
SoMud Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Vuze (formerly Azureus) Yes Yes Yes Yes Yes Yes[102] Yes[87] Yes Yes Yes No
BitComet Yes Yes Separate download Yes Yes Yes Yes Yes Yes Yes No
Tixati [43] Yes Yes No Yes No No Yes Yes Yes Yes Partial
Aria2 Yes No Yes No No No Yes Yes Yes Yes Yes
Tribler Yes No Yes Yes Yes No Yes Yes Yes No No
Bitflu Yes No No No No No Yes Yes No Yes No
Deluge Yes No No Yes Yes Yes Yes Yes Yes Yes Yes
Flush Yes No No Yes Yes No Yes Yes No No Yes
KTorrent Yes No No Yes Yes Yes Yes Yes Yes Yes Partial
Shareaza Yes No No Yes Yes No Yes[93] Yes No No No
Transmission Yes No No Yes Yes Yes Yes Yes[94] Yes No Yes
LimeWire Partial Yes Yes Yes Yes No Yes Yes Yes Yes No
BitTyrant No Yes[citation needed] Yes Yes Yes Yes[86] Yes[87] Yes Yes No No
BitTornado No Yes Yes[84] Yes No No No No Yes No No
Torrent Swapper No Yes Yes[84] Yes No No No Yes No No No
Localhost No Yes Yes Yes No Yes Yes [89] No No No No
Meerkat Bittorrent Client No Yes No Yes Yes Yes Yes No Yes No No
rTorrent No Yes No No No No Yes Yes Yes Yes No[92]
TorrentFlux No Yes No Yes No No No No Yes No No
TorrentVolve No Partial [76] No Partial[76] Partial [76] Partial [76] Partial[76] Partial [76] Partial [76] Partial [76] No
Opera No No Yes[90] No No No No Yes[91] No No No
BitTorrent 5 / Mainline No No Yes[84] Yes Yes No Yes Yes Yes No No
ABC No No Yes Yes No No No No No No No
Blog Torrent No No Yes No No No No No No No No
MLDonkey No No Yes Yes Yes No No No No Yes No
Tomato Torrent No No Yes No No No Yes No No No No
Acquisition No No No No Yes No No No No No No
Arctic Torrent No No No No No No No Yes No No No
BitLet No No No Yes No No No No No No No
BitLord No No No Yes No Yes No Yes No Yes No
BitThief No No No No No No No No No No No
Bits on Wheels No No No No No No No No No No No
BTG No No No Yes Yes No Yes Yes Yes Yes No
BTPD No No No No No No No No No No No
FlashGet No No No No No No Yes No Yes No No
Folx No No No Yes Yes No Yes Yes No Yes No
Free Download Manager No No No No No No Yes Yes No No No
G3 Torrent No No No No No No No No No No No
Gnome BitTorrent No No No No No No No No No No No
Halite No No No Yes Yes No Yes No Yes No[88] No
QTorrent No No No No No No No No No No No
Rufus No No No No No No No No No No No
SymTorrent No No No N/A N/A N/A No No No No No
Tonido Torrent No No No Yes Yes Yes Yes No No No No
Torium No No No Yes No No Yes No No No No
ZipTorrent No No No Yes Yes No No Yes No No No

 

 

 

 

Choosing R for business – What to consider?

A composite of the GNU logo and the OSI logo, ...
Image via Wikipedia

Additional features in R over other analytical packages-

1) Source Code is given to ensure complete custom solution and embedding for a particular application. Open source code has an advantage that is extensively peer- reviewed in Journals and Scientific Literature.  This means bugs will found, shared and corrected transparently.

2) Wide literature of training material in the form of books is available for the R analytical platform.

3) Extensively the best data visualization tools in analytical software (apart from Tableau Software ‘s latest version). The extensive data visualization available in R is of the form a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is because the graphical library of packages in R is more advanced as well as rapidly getting more features by the day.

4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.

5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.

 

6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.

Statistics are numbers that tell (descriptive), advise ( prescriptive) or forecast (predictive). Analytics is a decision-making help tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non analytical. Thus ease of making a correct decision separates a good analytical platform from a not so good analytical platform. The distinction is likely to be disputed by people of either background- and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason between business analytics is different from statistical analysis is the cost of perfect information (data costs in real world) and the opportunity cost of delayed and distorted decision-making.

Specific to the following domains R has the following costs and benefits

  • Business Analytics
    • R is free per license and for download
    • It is one of the few analytical platforms that work on Mac OS
    • It’s results are credibly established in both journals like Journal of Statistical Software and in the work at LinkedIn, Google and Facebook’s analytical teams.
    • It has open source code for customization as per GPL
    • It also has a flexible option for commercial vendors like Revolution Analytics (who support 64 bit windows) as well as bigger datasets
    • It has interfaces from almost all other analytical software including SAS,SPSS, JMP, Oracle Data Mining, Rapid Miner. Existing license holders can thus invoke and use R from within these software
    • Huge library of packages for regression, time series, finance and modeling
    • High quality data visualization packages
    • Data Mining
      • R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
      • Flexibility in tweaking a standard algorithm by seeing the source code
      • The RATTLE GUI remains the standard GUI for Data Miners using R. It was created and developed in Australia.
      • Business Dashboards and Reporting
      • Business Dashboards and Reporting are an essential piece of Business Intelligence and Decision making systems in organizations. R offers data visualization through GGPLOT, and GUI like Deducer and Red-R can help even non R users create a metrics dashboard
        • For online Dashboards- R has packages like RWeb, RServe and R Apache- which in combination with data visualization packages offer powerful dashboard capabilities.
        • R can be combined with MS Excel using the R Excel package – to enable R capabilities to be imported within Excel. Thus a MS Excel user with no knowledge of R can use the GUI within the R Excel plug-in to use powerful graphical and statistical capabilities.

Additional factors to consider in your R installation-

There are some more choices awaiting you now-
1) Licensing Choices-Academic Version or Free Version or Enterprise Version of R

2) Operating System Choices-Which Operating System to choose from? Unix, Windows or Mac OS.

3) Operating system sub choice- 32- bit or 64 bit.

4) Hardware choices-Cost -benefit trade-offs for additional hardware for R. Choices between local ,cluster and cloud computing.

5) Interface choices-Command Line versus GUI? Which GUI to choose as the default start-up option?

6) Software component choice- Which packages to install? There are almost 3000 packages, some of them are complimentary, some are dependent on each other, and almost all are free.

7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing- and how to use existing legacy software and hardware for best complementary results with R.

1) Licensing Choices-
You can choose between two kinds of R installations – one is free and open source from http://r-project.org The other R installation is commercial and is offered by many vendors including Revolution Analytics. However there are other commercial vendors too.

Commercial Vendors of R Language Products-
1) Revolution Analytics http://www.revolutionanalytics.com/
2) XL Solutions- http://www.experience-rplus.com/
3) Information Builder – Webfocus RStat -Rattle GUI http://www.informationbuilders.com/products/webfocus/PredictiveModeling.html
4) Blue Reference- Inference for R http://inferenceforr.com/default.aspx

  1. Choosing Operating System
      1. Windows

 

Windows remains the most widely used operating system on this planet. If you are experienced in Windows based computing and are active on analytical projects- it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimum for Microsoft Windows and the help is extensively documented. However there may be some R packages that would not function well under Windows- if that happens a multiple operating system is your next option.

        1. Enterprise R from Revolution Analytics- Enterprise R from Revolution Analytics has a complete R Development environment for Windows including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for it’s version of R including the use of optimized libraries for faster performance.
      1. MacOS

 

Reasons for choosing MacOS remains its considerable appeal in aesthetically designed software- but MacOS is not a standard Operating system for enterprise systems as well as statistical computing. However open source R claims to be quite optimized and it can be used for existing Mac users. However there seem to be no commercially available versions of R available as of now for this operating system.

      1. Linux

 

        1. Ubuntu
        2. Red Hat Enterprise Linux
        3. Other versions of Linux

 

Linux is considered a preferred operating system by R users due to it having the same open source credentials-much better fit for all R packages and it’s customizability for big data analytics.

Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had an marketing agreement with revolution Analytics for an earlier version of Ubuntu- and many R packages can  installed in a straightforward way as Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for it’s enterprise module. Other versions of Linux popular are Open SUSE.

      1. Multiple operating systems-
        1. Virtualization vs Dual Boot-

 

You can also choose between having a VMware VM Player for a virtual partition on your computers that is dedicated to R based computing or having operating system choice at the startup or booting of your computer. A software program called wubi helps with the dual installation of Linux and Windows.

  1. 64 bit vs 32 bit – Given a choice between 32 bit versus 64 bit versions of the same operating system like Linux Ubuntu, the 64 bit version would speed up processing by an approximate factor of 2. However you need to check whether your current hardware can support 64 bit operating systems and if so- you may want to ask your Information Technology manager to upgrade atleast some operating systems in your analytics work environment to 64 bit operating systems.

 

  1. Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.

Hardware costs are a significant cost to an analytics environment and are also  remarkably depreciated over a short period of time. You may thus examine your legacy hardware, and your future analytical computing needs- and accordingly decide between the various hardware options available for R.
Unlike other analytical software which can charge by number of processors, or server pricing being higher than workstation pricing and grid computing pricing extremely high if available- R is well suited for all kinds of hardware environment with flexible costs. Given the fact that R is memory intensive (it limits the size of data analyzed to the RAM size of the machine unless special formats and /or chunking is used)- it depends on size of datasets used and number of concurrent users analyzing the dataset. Thus the defining issue is not R but size of the data being analyzed.

    1. Local Computing- This is meant to denote when the software is installed locally. For big data the data to be analyzed would be stored in the form of databases.
      1. Server version- Revolution Analytics has differential pricing for server -client versions but for the open source version it is free and the same for Server or Workstation versions.
      2. Workstation
    2. Cloud Computing- Cloud computing is defined as the delivery of data, processing, systems via remote computers. It is similar to server-client computing but the remote server (also called cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon and many other vendors provide services on top of it. Google is also coming for data storage in the form of clouds (Google Storage), as well as using machine learning in the form of API (Google Prediction API)
      1. Amazon
      2. Google
      3. Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the RMpi and the SNOW packages, among other packages that help with parallel processing.
    3. How much resources
      1. RAM-Hard Disk-Processors- for workstation computing
      2. Instances or API calls for cloud computing
  1. Interface Choices
    1. Command Line
    2. GUI
    3. Web Interfaces
  2. Software Component Choices
    1. R dependencies
    2. Packages to install
    3. Recommended Packages
  3. Additional software choices
    1. Additional legacy software
    2. Optimizing your R based computing
    3. Code Editors
      1. Code Analyzers
      2. Libraries to speed up R

citation-  R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing,Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

(Note- this is a draft in progress)

%d bloggers like this: