Tag Archives: java
So I finally got my test plan accepted for a one-month trial of the Oracle Public Cloud at https://cloud.oracle.com/ .
Some initial thoughts: this Java cloud seems more suitable for web apps than for data science (but I have to spend much more time on this).
I really liked the help, documentation, and tutorials; Oracle has invested a lot in them to make the service friendly to enterprise users.
Hopefully the Oracle R Enterprise (ORE) guys can talk to the Oracle Cloud department and get some common use case projects going.
In the meantime, I did a roundup of all R-Java projects.
They include- (more…)
I write on and off about hackers (see http://bit.ly/VWxSvP) and have even written some poetry about them (http://bit.ly/11RznQl). I run into them at meetups, conferences, and online discussions; I have interviewed them, and I have trained some of them (in analytics). Based on a decade of observing hackers and two decades of hanging out with them, here are some thoughts on becoming a better and happier hacker, whether you are a hacker activist or a hacker in enterprise software.
1) Everybody can be a hacker, but you need to know the basic attitude first. Not every Python or Java coder is a hacker. Coding is not hacking. More details here- http://decisionstats.com/2012/02/12/how-to-learn-to-be-a-hacker-easily/
2) Use tools like Coursera, Udacity, and Codecademy to learn new languages. Even if you don't have a natural gift for memorizing syntax, some of it helps. (I forget syntax quite often. I Google it.)
3) Learn tools like Metasploit if you want to learn the lucrative and romantic art of exploit hacking (http://www.offensive-security.com/metasploit-unleashed/Main_Page). The demand for information security is going to be huge. Hackers with jobs are happy hackers.
4) Develop a serious downtime hobby.
Let's face it: your body was not designed to sit in front of a computer for 8 hours. But being a hacker will mean that commitment, and maybe more.
I have recently become a Quora addict, and you can see why it is such a great site. If possible say hello to me there at
My latest favorite question-
What are the most hilarious pie charts?
I am only showing you some of the answers; you can see the rest yourself.
I came across this lovely analytics company, Think Big Analytics, and I really liked their explanation of the whole big data shebang, because Hadoop isn't rocket science and can be made simpler to explain and deploy.
Check them out yourself at http://www.thinkbiganalytics.com/resources_reference
Also they have an awesome series of lectures coming up-
Up and Running with Big Data: 3 Day Deep-Dive
Over three days, explore the Big Data tools, technologies and techniques which allow organisations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you’ll learn how to use the flexible map/reduce programming model to scale your analytics, use Hadoop with Elastic MapReduce, write queries with Hive, develop real world data flows with Pig and understand the operational needs of a production data platform.
- MapReduce concepts
- Hadoop implementation: Jobtracker, Namenode, Tasktracker, Datanode, Shuffle & Sort
- Introduction to Amazon AWS and EMR with console and command-line tools
- Implementing MapReduce with Java and Streaming
- Hive Introduction
- Hive Relational Operators
- Hive Implementation to MapReduce
- Hive Partitions
- Hive UDFs, UDAFs, UDTFs
- Pig Introduction
- Pig Relational Operators
- Pig Implementation to MapReduce
- Pig UDFs
- NoSQL discussion
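Since the outline starts with MapReduce concepts, here is my own minimal sketch of the idea in plain Python (not from the course material, and not Hadoop code). The function names map_phase, shuffle_and_sort and reduce_phase are just labels I made up for the three stages; a real Hadoop job would spread each stage across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group every emitted value under its key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle/sort and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


counts = word_count(["big data is big", "data flows with pig"])
print(counts)
```

The point of the split is that mappers and reducers never share state, which is what lets a framework run them in parallel on different nodes.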
Some slides I liked on cloud computing infrastructure as offered by Amazon, IBM, Google, Windows and Oracle,
including juicy stuff on using a cluster of Apple machines for grid computing and on seasonality forecasting (Yet Another Package For Time Series).
But I kind of liked Sumo too-
Sumo is a fully-functional web application template that exposes an authenticated user’s R session within java server pages.
Sumo: An Authenticating Web Application with an Embedded R Session, by Timothy T. Bergsma and Michael S. Smith
Abstract: Sumo is a web application intended as a template for developers. It is distributed as a Java ‘war’ file that deploys automatically when placed in a Servlet container’s ‘webapps’ directory. If a user supplies proper credentials, Sumo creates a session-specific Secure Shell connection to the host and a user-specific R session over that connection. Developers may write dynamic server pages that make use of the persistent R session and user-specific file space.
and for Apple fanboys-
We created the xgrid package (Horton and Anoke, 2012) to provide a simple interface to this distributed computing system. The package facilitates use of an Apple Xgrid for distributed processing of a simulation with many independent repetitions, by simplifying job submission (or grid stuffing) and collation of results. It provides a relatively thin but useful layer between R and Apple’s ‘xgrid’ shell command, where the user constructs input scripts to be run remotely. A similar set of routines, optimized for parallel estimation of JAGS (just another Gibbs sampler) models is available within the runjags package (Denwood, 2010). However, with the exception of runjags, none of the previously mentioned packages support parallel computation over an Apple Xgrid.
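To make the "many independent repetitions, then collate" pattern concrete, here is a rough sketch in Python. This is not the xgrid package and does not talk to an Apple Xgrid: a local thread pool stands in for the grid, and one_repetition and run_simulation are names I invented for illustration.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def one_repetition(seed):
    """One independent simulation run: the mean of 1,000 uniform draws."""
    rng = random.Random(seed)  # a per-job seed keeps runs independent and reproducible
    return statistics.fmean([rng.random() for _ in range(1000)])


def run_simulation(n_reps, workers=4):
    """Submit every repetition as its own job, then collate results in seed order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_repetition, range(n_reps)))


results = run_simulation(20)
overall_mean = statistics.fmean(results)
print(len(results), round(overall_mean, 3))
```

Because each repetition depends only on its own seed, the jobs can run anywhere and in any order; the collation step just gathers the returned values back into one list.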
Hmm, I guess parallel computing enabled by Wi-Fi on mobile phones would be awesome too! So would anything using iOS. See the rest of the R Journal at http://journal.r-project.org/current.html
Here is an interview with Charlie Parker, head of large scale online algorithms at http://bigml.com
Ajay- Describe your own personal background in scientific computing, and how you came to be involved with machine learning, cloud computing and BigML.com
Charlie- I am a machine learning Ph.D. from Oregon State University. Francisco Martin (our founder and CEO), Adam Ashenfelter (the lead developer on the tree algorithm), and I were all studying machine learning at OSU around the same time. We all went our separate ways after that.
Francisco started Strands and turned it into a 100+ million dollar company building recommender systems. Adam worked for CleverSet, a probabilistic modeling company that was eventually sold to Cisco, I believe. I worked for several years in the research labs at Eastman Kodak on data mining, text analysis, and computer vision.
When Francisco left Strands to start BigML, he brought in Justin Donaldson, who is a brilliant visualization guy from Indiana, and an ex-Googler named Jose Ortega, who is responsible for most of our data infrastructure. They pulled in Adam and me a few months later. We also have Poul Petersen, a former Strands employee, who manages our herd of servers. He is a wizard and makes everyone else’s life much easier.
Ajay- You use Clojure for the back end of BigML.com. Are there any other languages and packages you are considering? What makes Clojure such a good fit for cloud computing?
Charlie- Clojure is a great language because it offers you all of the benefits of Java (extensive libraries, cross-platform compatibility, easy integration with things like Hadoop, etc.) but has the syntactical elegance of a functional language. This makes our code base small and easy to read as well as powerful.
We’ve had occasional issues with speed, but that just means writing the occasional function or library in Java. As we build towards processing data at the Terabyte level, we’re hoping to create a framework that is language-agnostic to some extent. So if we have some great machine learning code in C, for example, we’ll use Clojure to tie everything together, but the code that does the heavy lifting will still be in C. For the API and Web layers, we use Python and Django, and Justin is a huge fan of HaXe for our visualizations.
Ajay- Current support is for Decision Trees. When can we see SVM, K Means Clustering and Logit Regression?
Charlie- Right now we’re focused on perfecting our infrastructure and giving you new ways to put data in the system, but expect to see more algorithms appearing in the next few months. We want to make sure they are as beautiful and easy to use as the trees are. Without giving too much away, the first new thing we will probably introduce is an ensemble method of some sort (such as Boosting or Bagging). Clustering is a little further away but we’ll get there soon!
Ajay- How can we use the BigML.com API from R and Python?
Charlie- We have a public GitHub repo for the language bindings: https://github.com/bigmlcom/io . Right now there are only bash scripts, but that should change very soon. The Python bindings should be there in a matter of days, and the R bindings in probably a week or two. Clojure and Java bindings should follow shortly after that. We’ll have a blog post about it each time we release a new language binding: http://blog.bigml.com/
Ajay- How can we predict large numbers of observations using a Model that has been built and pruned (model scoring)?
Charlie- We are in the process of refactoring our backend right now for better support for batch prediction and model evaluation. This is something that is probably only a few weeks away. Keep your eye on our blog for updates!
Ajay- How can we export models built in BigML.com for scoring data locally?
Charlie- This is as simple as a call to our API. https://bigml.com/developers/models The call gives you a JSON object representing the tree that is roughly equivalent to a PMML-style representation.
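To give a feel for what scoring locally with a JSON tree could look like, here is a small Python sketch of my own. The JSON layout (the field, threshold, left, right and prediction keys) is an assumption invented for this example and is not BigML's actual export format; check their API documentation for the real field names.

```python
import json

# Hypothetical tree layout, made up for this sketch (not the real BigML schema).
MODEL_JSON = """
{
  "field": "age", "threshold": 30,
  "left":  {"prediction": "no"},
  "right": {"field": "income", "threshold": 50000,
            "left":  {"prediction": "no"},
            "right": {"prediction": "yes"}}
}
"""


def predict(node, row):
    """Walk from the root to a leaf, going left when row[field] <= threshold."""
    while "prediction" not in node:
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


tree = json.loads(MODEL_JSON)
rows = [
    {"age": 25, "income": 80000},
    {"age": 40, "income": 60000},
    {"age": 40, "income": 20000},
]
predictions = [predict(tree, row) for row in rows]
print(predictions)
```

Once the tree is a plain dictionary, batch scoring is just a loop over rows, so no round trip to the service is needed for each prediction.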
You can read about Charlie Parker at http://www.linkedin.com/pub/charles-parker/11/85b/4b5 and the rest of the BigML team at