Training Proposal: PySpark for Data Processing

Introduction:
This proposal outlines a 3-day PySpark training program designed for 10 participants. The course aims to equip data professionals with the skills to leverage Apache Spark using the Python API (PySpark) for efficient large-scale data processing[5]. Participants will gain hands-on experience with PySpark, covering fundamental concepts to advanced techniques, enabling them to tackle complex data challenges in real-world scenarios[4][5].

Target Audience:

Individuals with Python programming knowledge interested in big data analysis using Apache Spark[6].
Those familiar with object-oriented programming languages seeking to learn Spark[6].
Big Data Developers and Engineers wanting to utilize Spark with Python[6].
Anyone eager to enter the world of big data, Spark, and Python[6].

Learning Objectives:
Upon completion of this training, participants will be able to:

Understand the fundamentals of PySpark, including the Spark ecosystem and execution processes[5].
Work with Resilient Distributed Datasets (RDDs), including creation, transformations, and actions[5].
Utilize DataFrames for structured data processing, including various DataFrame transformations[5].
Apply advanced data processing techniques using Spark DataFrames[5].
Develop scalable data processing pipelines in PySpark[5].
Understand data capturing with messaging systems like Kafka and Flume, and data loading using Sqoop[1].
Gain comprehensive knowledge of tools within the Spark Ecosystem, such as Spark MLlib, Spark SQL, and Spark Streaming[1].

Course Curriculum:
The 3-day training program will cover the following modules:

Day 1: PySpark Fundamentals

Introduction to Big Data and Apache Spark[4].
Spark architecture and its comparison with Hadoop MapReduce[4].
PySpark installation[2][4].
SparkSession and basic PySpark operations[4].
Overview of Python (Values, Types, Variables, Operands and Expressions, Conditional Statements, Loops, Strings and related operations, Numbers)[1].
Python file I/O functions and writing to the screen[1].

Day 2: RDDs and DataFrames

Understanding Resilient Distributed Datasets (RDDs)[5].
Creating RDDs and performing transformations[5].
RDD actions: collect, reduce, count, foreach, aggregate, and save[5].
Introduction to DataFrames[5].
DataFrame transformations[5].
Basic SQL functions[4].

Day 3: Advanced PySpark Techniques

Advanced data processing with Spark DataFrames[5].
Integration with external data sources like Hive and MySQL[4].
Spark SQL and Spark Streaming[1][2].
Spark MLlib[1][2].
Data capturing with Kafka and Flume[1].
Data loading using Sqoop[1].
Deploying PySpark applications in different modes[4].
Performance optimization techniques[5].

Hands-On Exercises:
Throughout the course, participants will engage in hands-on exercises, including:

Creating basic Python scripts[1].
Working with datasets using RDDs and DataFrames[5].
Implementing data processing pipelines[5].
Integrating PySpark with external data sources[4].
Using Spark MLlib for machine learning tasks[1][2].

Training Methodology:
The training will be delivered through a combination of:

Instructor-led sessions[1].
Interactive discussions[1].
Practical demonstrations[1].
Hands-on exercises[1][5].

Materials Provided:

Comprehensive course notes[1].
Sample code and datasets[6].
Access to a PySpark development environment[5].

Trainer Profile:
The training will be conducted by experienced industry experts with in-depth knowledge of PySpark and big data technologies[1].

Duration:
3 Days

Number of Participants:
10

Cost:

Course Fee: \$575 – \$1,800 per participant[4][5]
Total Cost (for 10 participants): \$5,750 – \$18,000

Benefits of Attending:

Gain practical skills in PySpark development[5].
Learn to process large-scale data efficiently[5].
Understand the Spark ecosystem and its components[1][5].
Enhance career prospects in the field of big data[1].

Certification:
Upon completion of the training, participants will receive a certificate of completion[1].

Conclusion:
This PySpark training program offers a comprehensive and practical approach to learning big data processing with Apache Spark and Python[4][5]. By attending this course, participants will gain the skills and knowledge necessary to tackle complex data challenges and advance their careers in the field of big data[1].

Citations:
[1] https://www.certocean.com/course/python-spark-certification-training-using-pyspark/45
[2] https://www.youtube.com/watch?v=sSkAuTqfBA8
[3] https://github.com/hadrienbdc/pyspark-project-template
[4] https://www.koenig-solutions.com/data-processing-pyspark-training
[5] https://www.koenig-solutions.com/pyspark-training
[6] https://www.projectpro.io/projects/big-data-projects/pyspark-projects
[7] https://spark.apache.org/improvement-proposals.html
[8] https://www.thinkific.com/blog/training-proposal-template/

Importing data from csv file using PySpark

There are two ways to import the csv file: one as an RDD and the other as a Spark DataFrame (preferred). MLlib is built around RDDs, while ML is generally built around DataFrames. See https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc = SparkContext()

A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster. https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

A SparkConf holds the configuration for a Spark application. It is used to set various Spark parameters as key-value pairs.

dir(SparkContext)

['PACKAGE_EXTENSIONS',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__enter__',
'__eq__',
'__exit__',
'__format__',
'__ge__',
'__getattribute__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_active_spark_context',
'_dictToJavaMap',
'_do_init',
'_ensure_initialized',
'_gateway',
'_getJavaStorageLevel',
'_initialize_context',
'_jvm',
'_lock',
'_next_accum_id',
'_python_includes',
'_repr_html_',
'accumulator',
'addFile',
'addPyFile',
'applicationId',
'binaryFiles',
'binaryRecords',
'broadcast',
'cancelAllJobs',
'cancelJobGroup',
'defaultMinPartitions',
'defaultParallelism',
'dump_profiles',
'emptyRDD',
'getConf',
'getLocalProperty',
'getOrCreate',
'hadoopFile',
'hadoopRDD',
'newAPIHadoopFile',
'newAPIHadoopRDD',
'parallelize',
'pickleFile',
'range',
'runJob',
'sequenceFile',
'setCheckpointDir',
'setJobGroup',
'setLocalProperty',
'setLogLevel',
'setSystemProperty',
'show_profiles',
'sparkUser',
'startTime',
'statusTracker',
'stop',
'textFile',
'uiWebUrl',
'union',
'version',
'wholeTextFiles']

# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(data)

pyspark.rdd.RDD

# Loads data. Be careful of indentations and whitespace

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

dataframe2 = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(dataframe2)

pyspark.sql.dataframe.DataFrame

dataframe2.printSchema()  # same as str(dataframe) in R and dataframe.info() in Pandas

Eleven lessons for Startup teams from Ocean’s Eleven

Creating a startup is like giving birth to a baby. To sustain the baby you need a team of concerned, protective, disciplined and skilled people. Skillset and mindset are the operative words. The skills and the minds of the diverse people in the team must gel.

Danny Ocean had an excellent idea to repay his debt to society. He set up a team of skilled hackers and scientists to achieve a common goal. No birds were hurt in Danny’s startup. Even the bad guy won his money in insurance and the French guy lost his money in the sequel. But this is not about Ocean’s Twelve, where clearly Danny and his merry men went overboard and hired a bridge too far. This is about the original play on the original startup. So here are the eleven lessons.

Skillsets should be complementary. The grease monkey and the smooth talker are different skill sets.
Team members should have cross-functional skills. In a crisis you need backups. In times of volatility you need options. Suppose your Hadoop guy calls in sick. Cross-train your tech team in different technologies and cross-train your business team in different skills such as leads and deal origination.
The big guy who lends his name to the team needs to have the charisma to recruit the first few chaps.
You need a smart guy to handle operations, like Rusty, and you need a guy with a lot of money and nothing to lose, like Reuben. Ops and Finance are the first few hires you will make, Danny Boy.
The leader needs to have integrity. If that is questioned, the leader needs to be flexible. If the leader has a secret personal reason to do the op in a startup, it goes better if that is known to the top guys in the startup team, but not all the members of the startup team have the same need to know.
You need young men who are hungry and eager to make their mark in the world, like Matt Damon. Young men who are skilled but not yet well known are easier to recruit to your startup team. With a limited budget you need a few good young men in your team.
Diversity happens by karma, not by design. Choose people based on skills and mindsets. Ethnic origins should not play any role, positive or negative, in recruitment decisions. If the startup idea is good, the black guy and the Chinese guy and the Jewish guy and the old man and the young man will play fine. If the idea is bad, not all the liberal books you read can set it right.
Be ruthless when dealing with ruthless situations, but be classy always. Class shall always count. You may think of your startup idea all the time, but it is most probably not your first and not your last job. Most of us don’t get the luxury of dining on the job like Steve Jobs did. Your classy behaviour will help you in your career beyond your startup team and even in your personal life.
Don’t forget the little guys. The little guys help with blueprints, intel, product design and product delivery. Don’t forget them.
Be open to hire more.
Be loyal to the team, and don’t fire your people when you are down. Don’t crap on your team when the shit hits the fan. Take the shit and swallow it and smile back. That’s what a fearless leader does.

Be good to your team, and your team will rob the casinos and get you the girl. Be nice and be kind to your team. When in doubt, watch the DVD again. It’s excellent motivation.

Trying to improve the supply of Data Scientists without ripping off young people

In a previous post, I said that many corporates are trying to benefit from the demand for data science as applied to their sector or company, but not many are doing enough to improve the supply of data scientists.

Anecdotally, many students in India and the USA have argued that many training companies charge exorbitant amounts and make misguided promises, essentially teaching tools and techniques but not the essential analytical mindset for slicing and dicing data, nor enough to reach a balance between the three skills for data scientists: statistics, programming and business perspective.

Added to this, many people building tools for data scientists have not worked in data science consulting themselves but are addicted to one platform or product due to commercial or intellectual compulsions.

Here is what I think could be a supply-side solution to the problem of unmet demand for data scientists hindering the actual benefits of data science to humanity, regardless of commercial or social sector.

Build up a pool of curated best practice training
Get them validated and verified across different business sectors by industry experts
Add hardware or cloud training to software training
Offer them on accessible platforms like mobile, tablet and web
Offer them in accessible languages like Spanish, Swahili, Chinese and Arabic as well
Gamify some of the content to make it interesting, basically start creating data science hackers at an earlier age than just post graduate students
Tie up with industry to offer internships that are fair, balanced and demand equal commitment
Tie in soft skill training for better professionalism
Offer all this for free but use data generated for improving this not only on a human intervention basis but computer adaptive training and testing
Monetize only after you reach a huge scale not prematurely
Make it interactive using videos, 15 minute weekly personalized help on Skype from support, webinars but capture data continuously to drive engagement metrics

Do you want to just make money on the (uncertain) demand for data science, or do you want to make more money on the supply side of data science too?

Is Python going to be better than R for Big Data Analytics and Data Science? #rstats #python

Until now the R ecosystem of package developers has mostly shrugged off the Big Data question. In a fascinating insight, Hadley Wickham said this in a recent interview. Shockingly, it mimics the FUD that you-know-who has been accused of (source below).

https://peadarcoyle.wordpress.com/2015/08/02/interview-with-a-data-scientist-hadley-wickham/

5. How do you respond when you hear the phrase ‘big data’? Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

I think there are two particularly important transition points:

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of ram, so even small data is big!

* From one computer to many computers.

R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredible slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.

2. Big data problems that are actually lots and lots of small data problems

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you fitting a complex model. An example of this type of problem is recommender systems

Ajay- One of the reasons for the non-development of R Big Data packages is that it takes money. The private sector in the R ecosystem is a duopoly: Revolution Analytics (acquired by Microsoft) and RStudio (created by Microsoft alum JJ Allaire). Since RStudio as a company actively tries NOT to step into areas Revolution Analytics works in, it has not ventured into Big Data, in my opinion for strategic reasons.

Revolution Analytics’ RHadoop project is actually just one consultant working on it here https://github.com/RevolutionAnalytics/RHadoop and it has not been updated in six months

We interviewed the creator of R Hadoop here https://decisionstats.com/2014/07/10/interview-antonio-piccolboni-big-data-analytics-rhadoop-rstats/

However Python developers have been trying to actually develop systems for Big Data actively. The Hadoop ecosystem and the Python ecosystem are much more FOSS friendly even in enterprise solutions.

This is where Python is innovating over R in Big Data-

http://blaze.pydata.org/en/latest/

Blaze: Translates NumPy/Pandas-like syntax to systems like databases.

Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.

Odo: Migrates data between formats.

Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions. http://odo.pydata.org/en/latest/perf.html

odo takes two arguments, a source and a target for a data transfer.
```
>>> from odo import odo
>>> odo(source, target)  # load source into target 
```
Dask.array: Multi-core / on-disk NumPy arrays

Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.

DyND: In-memory dynamic arrays

DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature. https://github.com/libdynd/libdynd

The core DyND developer team consists of Mark Wiebe and Irwin Zaid. Much of the funding that made this project possible came through Continuum Analytics and DARPA-BAA-12-38, part of XDATA.

LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.

C++ is a first-class target of the library, the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.

This library is being actively developed together with its Python bindings,

http://dask.pydata.org/en/latest/

On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-disk by intelligently streaming data from disk and by leveraging all the cores of a modern CPU.

Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:

dask.array = numpy + threading
dask.bag = map, filter, toolz + multiprocessing
dask.dataframe = pandas + threading

Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to setup and use:

conda install dask
or
pip install dask
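
To make the "blocked algorithm" idea concrete, here is a toy sketch in plain Python with no dask involved. It computes a mean by streaming fixed-size blocks from a file, so only one block is in memory at a time. This is my own illustration of the general idea, not how dask is implemented; the function names, the file layout (one number per line) and the block size are all assumptions.

```python
import os
import tempfile


def write_numbers(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def iter_blocks(path, block_size):
    """Yield lists of at most block_size floats read from the file."""
    block = []
    with open(path) as f:
        for line in f:
            block.append(float(line))
            if len(block) == block_size:
                yield block
                block = []
    if block:  # last, possibly shorter, block
        yield block


def blocked_mean(path, block_size=1000):
    """Mean of all values, holding only one block in memory at a time."""
    total = 0.0
    count = 0
    for block in iter_blocks(path, block_size):
        total += sum(block)  # per-block partial result
        count += len(block)
    return total / count if count else float("nan")


path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
write_numbers(path, 10_000)
result = blocked_mean(path, block_size=256)
print(result)
```

The per-block partial sums are independent of each other, which is what lets a real library farm the blocks out to several cores instead of looping over them one by one.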

https://github.com/cloudera/ibis

When open source fights, closed source wins. When the Jedi fight, the Sith Lords win.

So will R people rise to the Big Data challenge, or will they bury their heads in the sand like an ostrich or a kiwi? Will Python people learn from R design philosophies and try to incorporate more of them without reinventing the wheel?

Converting code from one language to another automatically?

How I wish there was some kind of automated conversion tool – that would convert a CRAN R package into a standard Python package which is pip installable

Machine learning for more machine learning anyone?

6 weeks Data Scientist Online Courses #rstats

Hosting a 6 weekend live online certification course on Business Analytics with R starting June 1 at Edureka. Check www.edureka.in/r-for-analytics for more details. The course has been designed to offer more open data science than current expensive offerings, which are tech rather than business oriented, but with more support and customization than a MOOC. This is because many business customers don’t care if it is lapply or ddply, or command line or GUI, as long as they get good ROI on the time and money spent in shifting to R from other analytics software.

The making of a R startup Part 1 #rstats

Note- Decisionstats.com has done almost 105 interviews in the field of analytics, technology startups and thought leaders (you can see them here http://goo.gl/m3l31). We have covered some of the R authors (R for SAS and SPSS users, Data Mining using R, Machine Learning for Hackers), and noted R package creators (ggplot2, RCommander, rattle GUI, forecast).

But what we truly enjoy is interviews with startups in the R ecosystem, including founders of Revolution Analytics, Inference for R, RStudio and Cloudnumbers.

The latest startup in the R ecosystem with a promising product is Rapporter.net. It has actually been around for some time, but with the launch of their new product we ask them about the trials and tribulations of creating an open source startup in the data science field.

This is part 1 of the interview with Gergely Daróczi, co-founder of the Rapporter project.

Ajay- Describe the journey of Rapporter till now, and your product plans for 2013.

Greg- The idea of Rapporter presented itself more than 3 years ago while giving statistics, SPSS and R courses at different Hungarian universities and also creating custom statistical reports for a number of companies for a living at the same time.
Long story short, the three Hungarian co-founders faced similar problems in both sectors: students, just like business clients, admired the capabilities of R and the wide variety of tools found on CRAN, but were not at all eager to learn how to use them.
So we tried to make some plans for how to let non-R users also build on the resources of R, and we came up with the idea of an intuitive web interface as an R front-end.

The real development of a helper R package (which later became “rapport”) was started in January 2011 by Aleksandar Blagotić and me in our spare time and rather just for fun, as we had a dream about using “annotated statistical templates” in R after a few conversations on StackOverflow. We also worked on a front-end by means of an Rserve-driven PHP engine with MySQL – to be dropped and completely rewritten later after some trying experiences and serious benchmarking.

We released the “rapport” package to the public at the end of 2011 on GitHub, and after a few weeks on CRAN too. Despite the fact that we did our best to create decent documentation and also some live examples, we somehow forgot to spread the news of the new package to the R community, so “rapport” did not attract any serious attention.

Even so, our enthusiasm for annotated R “templates” did not wane as time passed, so we continued to work on “rapport” by adding new features and also Aleksandar started to fortify his Ruby on Rails skills. We also dropped Rserve with MySQL back-end, and introduced Jeffrey Horner’s awesome RApache with some NoSQL databases.
To be honest, this change resulted in a one-year delay of releasing Rapporter and no ends of headaches on our end, but in the long run, it was a really smart move after all, as we own an easily scalable and a highly available cluster of servers at the moment.

But back to 2012.

As “rapport” got too complex as time passed with newly added features, Aleksandar and I decided to split the package, a move which gave birth to “pander”. At that time “knitr” was getting more and more familiar among R users, so it was a brave move to release “another” similar package, but the roots of “pander” were more than one year old, we used some custom methods not available in “knitr” (like capturing the R object beside the printed output of chunks), we needed tweakable global options instead of chunk options and we really wanted to build on the power of Pandoc – just like before.

So we had a package for converting R objects to Pandoc’s markdown with a general S3 method, and another package to automatically run that and also capture plots and images in a brew-like document with various output formats – like pdf, docx, odt etc.
In the summer, while Aleksandar dealt with the web interface, I worked on some new features in our packages:
• automatic and robust caching of chunks with various options for performance reasons,
• automatically unifying “base”, “lattice” and “ggplot2” images to the same style with user options – like major/minor grid color, font family, color palette, margins etc.
• adding other global options to “pander”, to let our expected clients later personalize their custom report style with a few clicks.

At the same time, we were searching for different options to prevent running malicious code in the parallel R sessions, which might compromise all our users’ sensitive data. Unfortunately no full blown solution existed at that time, and we really wanted to stand clear of running some Java based interpreters in our network.
So I started to create a parser for R commands, which was supposed to filter out malicious R commands before evaluation, and a handful flu got me some spare time to implement “sandboxR” with an open and live “hack my R server” demo, which ended up in a great challenge on my side, but proved to really work after all.
I also had a few conversations with Jeroen Ooms (the author of the awesome OpenCPU), who faced similar problems on his servers and was eager to prevent the issues with the help of AppArmor. The great news of “RAppArmor” did not make “sandboxR” needless (as AppArmor just cannot regulate inner R calls), but we started to evaluate all user specified R commands in a separate hat, which allowed me to make “sandboxR” more permissive with black-filtered functions.
In the middle of the summer, I realized that we had an almost working web application with any number of R workers being able to serve tons of users based on the flexible NoSQL database back-ends, but we had no legal background to release such a service, nor had I any solid financial background to found one – moreover the Rapporter project had already taken a huge amount from my family budget.

As I was against letting venture capital dominate the project, and did not find any accelerator that would take on a project with a maturing, almost market-ready product, a few associates and I decided to found a UK company on our own, having confidence in the future and God.

So we founded Easystats Ltd, the company running rapporter.net, in July, and decided to release the first beta and pretty stable version of the application to the public at the end of September. At that time users could:
• upload and use text or SPSS sav data sets,
• specify more than 20 global options to be applied to all generated reports (like plot themes, table width, date format, decimal mark and number of digits, separators and copula in vectors etc.),
• create reports with the help of predefined statistical “templates”,
• “fork” (clone) any of our templates and modify without restriction, or create new statistical templates from scratch,
• edit the body or remove any part of the reports, resize images with the mouse or even with finger on touch-devices,
• and export reports to pdf, odt or docx formats.

A number of new features were introduced since then:

• OpenBUGS integration with more permissive security profiles,
• custom styles for the exported documents (in LaTeX, docx and odt format) to generate unique and possibly branded reports,
• sharing public or even private reports with anyone by a simple hyperlink, without the need for registering on rapporter.net,
• and integrating templates in any homepage, blog post or even HTML mail, so that anyone can use the power of R with a few clicks, building on the knowledge of template authors and our reliable back-end.

Although 2 years ago I was pretty sure that this job would be finished in a few months and that we would possibly have a successful project in a year or two, now I am certain that a bunch of new features will make Rapporter more and more user-friendly, intuitive and extensible in the next few years.
Currently, we are working hard on a redesigned GUI with the help of a dedicated UX team at last (which was a really important structural change in the life of Rapporter, as we can really assign and split tasks now just like we dreamed of when the project was a two-man show), which is to be finished no later than the first quarter of the year. Besides design issues, this change would also result in some new features, like ordering the templates, data sets and reports by popularity, rating or relevance for the currently active data set; and also letting users alter the style of the resulting reports in a more seamless way.

The next planned tasks for 2013 include:
• a “data transformation” front-end, which would let users to rename and label variables in any uploaded data set, specify the level of measurement, recode/categorize or create new variables with the help of existing ones and/or any R functions,
• edit tables in reports on the fly (change the decimal mark, highlight some elements, rename columns and split tables to multiple pages with a simple click),
• a more robust API to let third-party users temporarily upload data to be used in the analysis,
• option to use multiple data sets in a template and to let users merge or connect data online,
• and some top-secret surprises.

Besides the above tasks, which were made up by us, our team is really interested in any feedback from the users, which might change the above order or add new tasks with higher priority, so be sure to add your two cents on our support page.

And we will have to come up with some account plans with reasonable pricing in 2013 for the hosted service to let us cover the server fees and development expenses. But of course Rapporter will remain free for ever for users with basic needs (like analyzing data sets with only a few hundreds of cases) or anyone in the academic sector, and we also plan to provide an option to run Rapporter “off-site” on any Unix-like environment.

Ajay- What are some of the Big Data use cases I can do with Rapporter?

Greg- Although we have released Rapporter beta only a few months ago, we already heard some pretty promising use-cases from our (potential) clients.

But I must emphasize that at first we are not committed to dealing with Big Data in the sense of user-contributed data sets with billions of cases, but are rather concentrating on providing an intuitive and responsive way of analyzing traditional, survey-like data frames of up to about 100,000 cases.

Anyway, to be on topic: a really promising project of Optimum Dosing Strategies has been using Rapporter’s API for a number of weeks even in 2012 to compute optimal doses for different kind of antibiotics based on Monte-Carlo simulation and Bayesian adaptive feedback among other methods.
This collaboration lets the ID-ODS team develop a powerful calculator with full-blown reports ready to be attached to medical records – without any special technical knowledge on their side, as we maintain the R engine and the integration part, they code in R. This results in pleased clients all over the world, which makes us happy too.

We really look forward to shipping a number of educational templates to be used in real life at several (multilingual) universities from September 2013. These templates would let teachers show customizable and interactive reports to the students with any number of comments and narrative paragraphs, and these introductory statistical modules would provide a free alternative to other desktop software used in education.

In the next few months, a part of our team will focus on spatial analysis templates, which would mean that our users could not just map, but really analyze any of their spatially related data with a few clicks and clear parameters.

Another feature request of a client seems to be a really exciting idea. Currently, Google Analytics and other tracking services provide basic options to view, filter and export the historical data of websites, blogs etc.
Creating an interface between Rapporter and the tracking services to fetch the most recent data is no longer beyond possibility with the help of existing API resources, so our clients could generate annotated usage reports for any specified period of time, without restrictions. Just to emphasize some potential add-ons: using the time-series R packages in the analysis, or creating real-time “dashboards” with optional forecasts about live data.

Of course, you could think of other kinds of live or historical data instead of Google Analytics, as creating a template for, e.g., transaction data or the gas usage of a household could be addressed at any time. And please do not forget about the use cases referenced above in the 3rd question (“[…]Rapporter can help: […]”).

But wait: the beauty of Rapporter is that you could implement all of the above ideas by yourself in our system, even without any help from us.

Ajay- What are some of the things that can be done more easily with Rapporter than with plain vanilla R?

Greg- Rapporter is basically developed for creating reproducible, literate and annotated statistical modules (a.k.a. “templates”), which means that passing a data set and a list of variables with some optional arguments would end up in a full-blown written report with automatically styled tables and charts.

So using Rapporter is like writing “Sweave” or “knitr” documents, but you write the template only once, and then apply that to any number of data sets with a simple click on an intuitive user interface.
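The "write once, apply to many data sets" idea can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not Rapporter's template format or API; the function name `descriptive_template` and the sample data sets are hypothetical.

```python
from statistics import mean, median, stdev


def descriptive_template(data, variable):
    """Hypothetical 'template': turn one numeric variable into a short written report."""
    # Keep only the cases where the variable is present and not missing.
    values = [row[variable] for row in data if row.get(variable) is not None]
    lines = [
        f"Report on '{variable}'",
        f"The variable has {len(values)} valid cases.",
        f"The mean is {mean(values):.2f} and the median is {median(values):.2f}.",
    ]
    # The sample standard deviation needs at least two values.
    if len(values) > 1:
        lines.append(f"The standard deviation is {stdev(values):.2f}.")
    return "\n".join(lines)


# Two made-up survey-like data sets; the same template is applied to both.
survey_a = [{"age": 23}, {"age": 35}, {"age": 41}, {"age": None}]
survey_b = [{"age": 52}, {"age": 60}]

report_a = descriptive_template(survey_a, "age")
report_b = descriptive_template(survey_b, "age")
print(report_a)
print(report_b)
```

The template is defined once and each report is produced by passing a data set and a variable name, which mirrors the workflow described above without claiming anything about how Rapporter implements it.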

Besides this major objective: as Rapporter is running in the cloud, sharing reports and templates (or even data sets) with collaborators or with anyone on the Internet is really easy. Our users can post and share any R code for free and without restrictions, or release the templates with a specified license and/or fees in a secured environment.

This means that Rapporter can help:

scholars to share scientific results or methods with a reproducible and instantly available demo and/or dedicated implementation along with publications,
teachers to create self-explanatory statistical templates which would help the students internalize the subject by practice,
any R developer to share a live and interactive demo of the implemented features of the functions with a few clicks,
businesses to use a statistical platform without restrictions for a reasonable monthly fee instead of expensive and non-portable statistical programs,
governments and national statistical offices to publicize census or other big data with a scientific and reliable analytic tool with annotated and clear reports, while ensuring the anonymity of the respondents by automatically applying custom methods (like data swapping, rounding, micro-aggregation, PRAM, adding noise etc.) to the tables and results, etc.
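Two of the anonymization methods named in the last item, rounding and adding noise, can be sketched as follows. This is a generic illustration in Python, not Rapporter's implementation; the function names, the rounding base of 5, the noise range and the example table are all assumptions.

```python
import random


def round_to_base(count, base=5):
    """Round a cell count to the nearest multiple of `base` (ties round up)."""
    return base * ((count + base // 2) // base)


def add_noise(count, max_shift=2, rng=None):
    """Shift a cell count by a random integer in [-max_shift, max_shift], never below 0."""
    rng = rng or random.Random()
    return max(0, count + rng.randint(-max_shift, max_shift))


# Hypothetical frequency table: respondents per age group.
table = {"18-29": 7, "30-44": 13, "45-64": 2}

# Rounding hides small counts that could identify individual respondents.
rounded = {group: round_to_base(n) for group, n in table.items()}

# A seeded generator keeps the noisy table reproducible for this sketch.
rng = random.Random(42)
noisy = {group: add_noise(n, rng=rng) for group, n in table.items()}

print(rounded)
print(noisy)
```

Rounding is deterministic, while the noise is random; which method and which parameters are appropriate depends on the disclosure rules of the publishing office.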

And of course, do not forget about one of our main objectives: opening up the world of R to non-R users too, with an intuitive, guiding user interface.

(To be continued)

About

Gergely Daróczi is coordinating the development of Rapporter and maintaining its R packages. Besides trying to be active in some open-source projects and on StackOverflow, he is a PhD candidate in sociology and also a lecturer at Corvinus University of Budapest and Pázmány Péter Catholic University in Hungary.

Rapporter is a web application helping you to create comprehensive, reliable statistical reports on any mobile device or PC, using an intuitive user interface.

The application builds on the power of R alongside other technologies and is intended to be used in any browser, doing the heavy computations on the server side. Some might consider Rapporter a customizable graphical user interface to R, running in the cloud.

Currently, Rapporter is under heavy development and only invited alpha testers can access the application. Please sign up for an invitation if you want to have an early-bird insight into Rapporter.
