A classic paper by Donald E. Knuth (creator of TeX) on the information complexity of songs can help music listeners with an interest in analytics. The paper was first published in 1977 (and reprinted in 1984) but is pertinent even today.
I was picking up some funny activity on my web analytics, so to make it easier for readers, here is the entire Decisionstats WordPress XML file, zipped. You can download it, unzip it and then open it in any WordPress reader to read at your leisure.
decisionstats.wordpress.2012-06-14.xml
Have fun
Updated: There seems to be unusual traffic activity on my poetry blog. To make it more convenient for readers, you can download that as a zipped WordPress XML file here:
poemsforkush.wordpress.2012-06-14.pdf
So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for the past two years. Special thanks to my mentors, detractors, readers and students: I owe you a beer!
You can also go here-
http://www.springer.com/statistics/book/978-1-4614-4342-1
Ohri, Ajay
2012, XVI, 300 p. 208 illus., 162 in color.
ISBN 978-1-4614-4342-1
Due: September 30, 2012
R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.
This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.
Content Level » Professional/practitioner
Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining
Related subjects » Business, Economics & Finance – Computational Statistics – Statistics
Some possible electronic disruptions that threaten the electoral cycle currently underway in the United States of America are:
1) Limited denial-of-service attacks (lasting, say, 5-8 minutes) on fundraising websites, flying under the radar of network administrators, to deny the targeted website a small percentage of its funds. Money remains critical to the world’s most expensive political market. Even a 5% drop in online fundraising capacity can cripple a candidate.
2) Limited man-in-the-middle attacks on ground volunteers to disrupt, intercept and manipulate communication flows. Basically, cyber attacks on vulnerable ground volunteers in critical counties and battleground/swing states (like Florida).
3) Electromagnetic disruption of electronic voting machines in critical counties and swing states (like Florida) to disrupt, to manipulate, or to create an impression that some manipulation has been done.
4) Search engine flooding (for search engine de-optimization of rival candidates’ keywords) and social media flooding to disrupt the listening capabilities of sentiment analysis.
5) Selected leaks (including using digital means to create authentic, fake or edited collateral) timed to embarrass rivals or influence voters. These can be geo-coded and mass deployed.
6) Using Internet communications to selectively spam or influence independent or opinionated voters through emails, short messaging service, chat channels and social media.
7) Disruption of the Hillary for President 2016 campaign by hacktivists sympathetic to Anonymous and WikiLeaks.
Just got the email. More software is good news!
Revolution R Enterprise 6.0 for 32-bit and 64-bit Windows and 64-bit Red Hat Enterprise Linux (RHEL 5.x and RHEL 6.x) features an updated release of the RevoScaleR package that provides fast, scalable data management and data analysis: the same code scales from data frames to local, high-performance .xdf files to data distributed across a Windows HPC Server cluster or IBM Platform Computing LSF cluster. RevoScaleR also allows distribution of the execution of essentially any R function across cores and nodes, delivering the results back to the user.
Detailed information on what’s new in 6.0 and known issues:
http://www.revolutionanalytics.com/doc/README_RevoEnt_Windows_6.0.0.pdf
RevoScaleR high-performance analysis functions will now conveniently work directly with a variety of external data sources (delimited and fixed format text files, SAS files, SPSS files, and ODBC data connections). New functions are provided to create data source objects to represent these data sources (RxTextData, RxOdbcData, RxSasData, and RxSpssData), which in turn can be specified for the ‘data’ argument for these RevoScaleR analysis functions: rxHistogram, rxSummary, rxCube, rxCrossTabs, rxLinMod, rxCovCor, rxLogit, and rxGlm.
# Create a SAS data source with information about variables and
# rows to read in each chunk
sasDataFile <- file.path(rxGetOption("sampleDataDir"), "claims.sas7bdat")
sasDS <- RxSasData(sasDataFile, stringsAsFactors = TRUE, colClasses = c(RowNum = "integer"), rowsPerRead = 50)
# Compute and draw a histogram directly from the SAS file
rxHistogram( ~cost|type, data = sasDS)
# Compute summary statistics
rxSummary(~., data = sasDS)
# Estimate a linear model
linModObj <- rxLinMod(cost~age + car_age + type, data = sasDS)
summary(linModObj)
# Import a subset into a data frame for further inspection
subData <- rxImport(inData = sasDS, rowSelection = cost > 400,
                    varsToKeep = c("cost", "age", "type"))
subData
The installation instructions and instructions for getting started with Revolution R Enterprise & RevoDeployR for Windows: http://www.revolutionanalytics.com/downloads/instructions/windows.php
3.87 GB and 3786 packages. That’s what you need to install the whole of R as it stands on CRAN.
( Note- Many IT administrators /Compliance Policies in enterprises forbid installing from the Internet in work offices.
Which is where the analytics,$$, and people are)
As downloaded from the CRAN Mirror at UCLA.
Takes 3 hours to download at 1 mbps (I was on an Amazon Ec2 instance)
See screenshot.
Next question: who in the R project is responsible for deleting old/deprecated/redundant packages if the authors don’t do it?
Many data formats cause data quality problems when imported into your statistical software. By default, statistical software cannot tell that $1,000, 1000%, 1,000 and 1000 are all numbers: it will treat the first three as character variables and only the last as a numeric variable. This issue is further compounded by the numerous ways we can represent date-time variables.
The good thing is that for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can put together a list of handy data quality conversion functions in R for reference.
After much muddling about with converting internet formats (or data used in web analytics), mostly time formats without a date like 00:35:23, into numeric data frame columns, I found that the way to handle date-time conversions in R is
Dataset$Var2 = strptime(as.character(Dataset$Var1), "%H:%M:%S")
The problem with this approach is that you get the value in a date-time format (R adds today’s date by default, so 04:00:45 becomes today’s date followed by 04:00:45), while you are interested only in time durations (4:00:45, or actually just the equivalent in seconds).
This can be handled using the as.difftime function
dataset$Var2 = as.difftime(paste(dataset$Var1), format = "%H:%M:%S")
or, to get purely numeric values in seconds so we can do numeric analysis (like summary)
dataset$Var2 = as.numeric(as.difftime(paste(dataset$Var1), format = "%H:%M:%S", units = "secs"))
(#Maybe there is a more elegant way here, but I don’t know)
This kind of data is what we usually get in web analytics, for average time on site, etc.
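As one possible alternative for the time-on-site case above, here is a minimal sketch that skips date-time objects altogether and just splits the string on the colons. The helper name hms_to_seconds and the sample values are my own illustration, not part of any package, and it assumes the input is always in HH:MM:SS form.

```r
# Hypothetical helper (not from any package): convert "HH:MM:SS" strings
# or factors into a plain numeric count of seconds.
hms_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly three pieces (including NA) becomes NA
    if (length(p) != 3 || anyNA(p)) return(NA_real_)
    sum(as.numeric(p) * c(3600, 60, 1))
  }, numeric(1))
}

# Made-up sample values in the style of a web analytics export
time_on_site <- c("00:35:23", "04:00:45", NA)
secs <- hms_to_seconds(time_on_site)
print(secs)                   # 2123 14445 NA
print(mean(secs, na.rm = TRUE))
```

Because nothing is parsed as a clock time, this also works for durations longer than 24 hours, which a date-time based conversion may not handle.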
And for factor variables:
Dataset$Var2= as.numeric(as.character(Dataset$Var1))
or
Dataset$Var2= as.numeric(paste(Dataset$Var1))
A slight problem: suppose there is data like 1,504. It will be converted to NA instead of 1504.
The way to solve this is to use the nice gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter, you don’t want to replace all the commas in the file, only the ones in that variable.
dataset$Variable2 = as.numeric(paste(gsub(",", "", dataset$Variable)))
Now let’s assume we have data in the form of percentages like 0.00%, 1.23%, 3.5%.
Again we use the gsub function to replace the % sign in the string with nothing.
dataset$Variable2 = as.numeric(paste(gsub("%", "", dataset$Variable)))
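The comma and percent fixes above can be rolled into one small function. This is only a sketch: clean_numeric and its percent_as_fraction argument are names I made up for illustration, and it assumes the only decorations in the column are $, commas and % signs.

```r
# Hypothetical helper: strip "$", "," and "%" from ONE column and return numbers.
# percent_as_fraction is an assumed option: TRUE turns "3.5%" into 0.035,
# FALSE (the default) keeps it as 3.5, as in the gsub lines above.
clean_numeric <- function(x, percent_as_fraction = FALSE) {
  x <- as.character(x)                       # also handles factor input
  is_pct <- grepl("%", x, fixed = TRUE)
  value <- as.numeric(gsub("[$,%]", "", x))  # inside [] the $ is a literal
  ifelse(is_pct & percent_as_fraction, value / 100, value)
}

raw <- c("$1,000", "1,504", "3.5%", "1000")
print(clean_numeric(raw))                              # 1000 1504 3.5 1000
print(clean_numeric(raw, percent_as_fraction = TRUE))  # 1000 1504 0.035 1000
```

Applying it column by column keeps the commas that act as delimiters elsewhere in the file untouched.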
If you simply do the following for a factor variable, it will give you the internal level code, not the value. This can cause silent errors when you read in CSV data, which may come in as character or factor data.
Dataset$Var2= as.numeric(Dataset$Var1)
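A tiny made-up example shows the difference. The values are my own, chosen so that the level codes and the real values do not happen to coincide.

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("200", "10", "3"))

# Levels are sorted as text: "10", "200", "3"
print(levels(f))

# Wrong: returns the internal level codes
codes <- as.numeric(f)
print(codes)                       # 2 1 3

# Right: go through character first
values <- as.numeric(as.character(f))
print(values)                      # 200 10 3

# Equivalent alternative that converts each level only once
values2 <- as.numeric(levels(f))[f]
print(values2)                     # 200 10 3
```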
An additional way is to use substr (to take substrings) and paste (to concatenate) for manipulating string/character variables.
iris$sp = substr(iris$Species, 1, 3)  # reduces the famous iris species names to three characters, without losing any analytical value
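Here is a short sketch of the two functions working together on the built-in iris data. The label format (species code, underscore, row number) is just something I picked for illustration.

```r
# Shorten the species names to their first three characters
sp <- substr(as.character(iris$Species), 1, 3)
print(unique(sp))        # "set" "ver" "vir"

# Concatenate the short code with the row number to build a label
label <- paste(sp, seq_along(sp), sep = "_")
print(head(label, 3))    # "set_1" "set_2" "set_3"
```

The three short codes are still distinct, which is why nothing is lost by abbreviating.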
The other issue is missing values. na.rm=TRUE helps with getting summaries of numeric variables that have missing values, but we need to investigate further how suitable the na.omit functions are for domains that have large amounts of missing data that needs to be treated.
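To make the difference concrete, here is a small sketch on a made-up data frame (the column names and values are my own). na.rm works column by column, while na.omit drops every row that has a missing value anywhere, which can throw away a lot of data.

```r
# Made-up web analytics style data with some missing values
df <- data.frame(visits = c(10, NA, 30, 40),
                 bounce = c(0.5, 0.4, NA, 0.2))

# na.rm = TRUE ignores the missing values of one column at a time
mean_visits <- mean(df$visits, na.rm = TRUE)
print(mean_visits)        # average of 10, 30 and 40

# na.omit drops any row with at least one NA
complete <- na.omit(df)
print(nrow(complete))     # 2 of the 4 rows survive

# Share of missing values per column, worth checking before using na.omit
missing_share <- colMeans(is.na(df))
print(missing_share)
```

Here each column is only 25% missing, yet na.omit discards half the rows, so it is worth checking the missing share first.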