Hack for Change #hackforchange

As part of the National Day of Hacking, I took part in the hackathon in New York. Here are some insights:

A few teams had come prepared, having read the challenge questions in advance. These teams had a head start on time.

Creating something in time was a big challenge (how do you make a product in a single day?).

A hackathon consists of 1) the organizer giving challenge questions, 2) people coming to the venue, 3) forming teams, 4) working together as a team, and 5) presenting results (usually one person per team).

The idea matters most: how relevant it is and how closely it aligns with the hackathon's questions. Creativity rules.

The next most important thing was building a balanced team in which everyone gels well and has complementary skill sets (one front end, one back end, one data scientist in Python, one person who is good at presentation, etc.).

The next important thing was not getting intimidated by other teams and working on your team's idea until the last moment.

The presentation should be given by the person who is best at expressing 1) what you did, 2) how the solution is innovative, and 3) how it is relevant and useful to the challenge.

Lastly, have fun hacking. People who have fun hacking generally tend to be better hackers.


US Congress cedes cyber-war to Executive Branch



Obama Order Sped Up Wave of Cyberattacks Against Iran

Published: June 1, 2012

WASHINGTON — From his first months in office, President Obama secretly ordered increasingly sophisticated attacks on the computer systems that run Iran’s main nuclear enrichment facilities, significantly expanding America’s first sustained use of cyberweapons,



Can the White House declare a cyberwar?

“When we see the results it’s pretty clear they’re doing it without anybody except a very few people knowing about it, much less having any impact on whether it’s happening or not,” said Rep. Jim McDermott (D-Wash.).

McDermott is troubled because “we have given more and more power to the president, through the CIA, to carry out operations, and, frankly, if you go back in history, the reason we have problems with Iran is because the CIA brought about a coup.”




Article. I.

Section 1.

All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.

Section. 8.

The Congress shall have Power

Clause 11: To declare War, grant Letters of Marque and Reprisal, and make Rules concerning Captures on Land and Water;




Obama Wins Nobel Peace Prize

KARL RITTER and MATT MOORE   10/ 9/09 11:02 PM ET


Statement Regarding Barack Obama 

The Law School has received many media requests about Barack Obama, especially about his status as “Senior Lecturer.”

From 1992 until his election to the U.S. Senate in 2004, Barack Obama served as a professor in the Law School. He was a Lecturer from 1992 to 1996. He was a Senior Lecturer from 1996 to 2004, during which time he taught three courses per year.


Using Rapid Miner and R for Sports Analytics #rstats

Rapid Miner has been one of the oldest open source analytics software packages, dating from long before open source or even analytics was considered a fashionable buzzword. The Rapid Miner software has been a pioneer in many areas (like establishing a marketplace for Rapid Miner Extensions), and the Rapid Miner R extension was one of the most promising enablers of using R in an enterprise setting.
The following interview was conducted with a manager of analytics for a sports organization. The sports organization considers analytics a strategic differentiator, hence the name is confidential. No part of the interview has been edited or manipulated.

Ajay- Why did you choose Rapid Miner and R? What were the other software alternatives you considered and discarded?

Analyst- We considered most of the other major players in statistics/data mining or enterprise BI.  However, we found that the value proposition for an open source solution was too compelling to justify the premium pricing that the commercial solutions would have required.  The widespread adoption of R and the variety of packages and algorithms available for it, made it an easy choice.  We liked RapidMiner as a way to design structured, repeatable processes, and the ability to optimize learner parameters in a systematic way.  It also handled large data sets better than R on 32-bit Windows did.  The GUI, particularly when 5.0 was released, made it more usable than R for analysts who weren’t experienced programmers.

Ajay- What analytics do you think Rapid Miner and R are best suited for?

 Analyst- We use RM+R mainly for sports analysis so far, rather than for more traditional business applications.  It has been quite suitable for that, and I can easily see how it would be used for other types of applications.

 Ajay- Any experiences as an enterprise customer? How was the installation process? How good is the enterprise level support?

Analyst- Rapid-I has been one of the most responsive tech companies I’ve dealt with, either in my current role or with previous employers.  They are small enough to be able to respond quickly to requests, and in more than one case, have fixed a problem, or added a small feature we needed within a matter of days.  In other cases, we have contracted with them to add larger pieces of specific functionality we needed at reasonable consulting rates.  Those features are added to the mainline product, and become fully supported through regular channels.  The longer consulting projects have typically had a turnaround of just a few weeks.

Ajay- What challenges, if any, did you face in executing a pure open source analytics bundle?

Analyst- As Rapid-I is a smaller company based in Europe, the availability of training and consulting in the USA isn’t as extensive as for the major enterprise software players, and the time zone differences sometimes slow down the communications cycle.  There were times where we were the first customer to attempt a specific integration point in our technical environment, and with no prior experiences to fall back on, we had to work with Rapid-I to figure out how to do it.  Compared to what traditional software vendors provide, both R and RM tend to have sparse, terse, occasionally incomplete documentation.  The situation is getting better, but still lags behind what the traditional enterprise software vendors provide.

Ajay- What are the things you can do in R, and what are the things you prefer to do in Rapid Miner (comparison for technical synergies)?

Analyst- Our experience has been that RM is superior to R at writing and maintaining structured processes, better at handling larger amounts of data, and more flexible at fine-tuning model parameters automatically.  The biggest limitation we’ve had with RM compared to R is that R has a larger library of user-contributed packages for additional data mining algorithms.  Sometimes we opted to use R because RM hadn’t yet implemented a specific algorithm.  The introduction of the R extension has allowed us to combine the strengths of both tools in a very logical and productive way.

In particular, extending RapidMiner with R helped address RM’s weakness in the breadth of algorithms, because it brings the entire R ecosystem into RM (similar to how Rapid-I implemented much of the Weka library early on in RM’s development).  Further, because the R user community releases packages that implement new techniques faster than the enterprise vendors can, this helps turn a potential weakness into a potential strength.  However, R packages tend to be of varying quality, and are more prone to go stale due to lack of support/bug fixes.  This depends heavily on the package’s maintainer and its prevalence of use in the R community.  So when RapidMiner has a learner with a native implementation, it’s usually better to use it than the R equivalent.

Interview John Myles White, Machine Learning for Hackers

Here is an interview with one of the younger researchers and rock stars of the R Project, John Myles White, co-author of Machine Learning for Hackers.

Ajay- What inspired you guys to write Machine Learning for Hackers? What has been the public response to the book? Are you planning to write a second edition or a next book?

John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.

We once said that Machine Learning for Hackers  is supposed to be a chemistry set for Machine Learning and I still think that’s the right description: it’s meant to get readers excited about Machine Learning and hopefully expose them to enough ideas and tools that they can start to explore on their own more effectively. It’s like a warmup for standard academic books like Bishop’s.
The public response to the book has been phenomenal. It’s been amazing to see how many people have bought the book and how many people have told us they found it helpful. Even friends with substantial expertise in statistics have said they’ve found a few nuggets of new information in the book, especially regarding text analysis and social network analysis — topics that Drew and I spend a lot of time thinking about, but are not thoroughly covered in standard statistics and Machine Learning  undergraduate curricula.
I hope we write a second edition. It was our first book and we learned a ton about how to write at length from the experience. I’m about to announce later this week that I’m writing a second book, which will be a very short eBook for O’Reilly. Stay tuned for details.

Ajay-  What are the key things that a potential reader can learn from this book?

John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.

Ajay- Describe your journey as a science student up till your Ph.D. What are your current research interests and what initiatives have you taken with them?

John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?
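As an aside for readers who have not met the explore-exploit dilemma John mentions: the sketch below uses epsilon-greedy, a standard textbook strategy, on a three-armed bandit. It is not John's model of human behavior, and the payoff probabilities, step count and epsilon value are made-up assumptions chosen only to show the trade-off between trying unknown options and sticking with the best one seen so far.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy strategy.

    true_means: hypothetical payoff probability of each arm (illustrative only).
    Returns (pull counts per arm, estimated payoff per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm, even one that currently looks bad.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda i: estimates[i])

        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print("pulls per arm:", counts)
print("estimated payoffs:", [round(e, 2) for e in estimates])
print("total reward:", total_reward)
```

With a small epsilon the strategy spends most of its pulls on whichever arm currently looks best, while the occasional random pull keeps it from locking onto an early, unlucky estimate.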

Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?

John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.

The changes that are required in academia to prepare students for this kind of work are pretty numerous, but the most obvious required change is that quantitative people need to be learning how to program properly, which is rare in academia, even in many CS departments. Writing one-off programs that no one will ever have to reuse and that only work on toy data sets doesn’t prepare you for working with huge amounts of messy data that exhibit shifting patterns. If you need to learn how to program seriously before you can do useful work, you’re not very valuable to companies who need employees that can hit the ground running. The companies that have done best in building up data teams, like LinkedIn, have learned to train people as they come in since the proper training isn’t typically available outside those companies.
Of course, on the flipside, the people who do know how to program well need to start learning more about theory and need to start to have a better grasp of basic mathematical models like linear and logistic regressions. Lots of CS students seem not to enjoy their theory classes, but theory really does prepare you for thinking about what you can learn from data. You may not use automata theory if you work at Foursquare, but you will need to be able to reason carefully and analytically. Doing math is just like lifting weights: if you’re not good at it right now, you just need to dig in and get yourself in shape.
John Myles White is a Ph.D. student in the Princeton Psychology Department, where he studies human decision-making both theoretically and experimentally. Along with the political scientist Drew Conway, he is the author of a book published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. He is also working with Mark Hansen on a book for laypeople about exploratory data analysis. John is the lead maintainer for several R packages, including ProjectTemplate and log4r.

(TIL he has played in several rock bands!)

You can read more in his own words at his blog at http://www.johnmyleswhite.com/about/
He can be contacted via social media at Google Plus at https://plus.google.com/109658960610931658914 or twitter at twitter.com/johnmyleswhite/

New Amazon Instance: High I/O for NoSQL

Latest from the Amazon Cloud-

hi1.4xlarge instances come with eight virtual cores that can deliver 35 EC2 Compute Units (ECUs) of CPU performance, 60.5 GiB of RAM, and 2 TiB of storage capacity across two SSD-based storage volumes. Customers using hi1.4xlarge instances for their applications can expect over 120,000 4 KB random read IOPS, and as many as 85,000 random write IOPS (depending on active LBA span). These instances are available on a 10 Gbps network, with the ability to launch instances into cluster placement groups for low-latency, full-bisection bandwidth networking.

High I/O instances are currently available in three Availability Zones in US East (N. Virginia) and two Availability Zones in EU West (Ireland) regions. Other regions will be supported in the coming months. You can launch hi1.4xlarge instances as On Demand instances starting at $3.10/hour, and purchase them as Reserved Instances


High I/O Instances

Instances of this family provide very high instance storage I/O performance and are ideally suited for many high performance database workloads. Example applications include NoSQL databases like Cassandra and MongoDB. High I/O instances are backed by Solid State Drives (SSD), and also provide high levels of CPU, memory and network performance.

High I/O Quadruple Extra Large Instance

60.5 GB of memory
35 EC2 Compute Units (8 virtual cores with 4.4 EC2 Compute Units each)
2 SSD-based volumes each with 1024 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Storage I/O Performance: Very High*
API name: hi1.4xlarge

*Using Linux paravirtual (PV) AMIs, High I/O Quadruple Extra Large instances can deliver more than 120,000 4 KB random read IOPS and between 10,000 and 85,000 4 KB random write IOPS (depending on active logical block addressing span) to applications. For hardware virtual machines (HVM) and Windows AMIs, performance is approximately 90,000 4 KB random read IOPS and between 9,000 and 75,000 4 KB random write IOPS. The maximum sequential throughput on all AMI types (Linux PV, Linux HVM, and Windows) per second is approximately 2 GB read and 1.1 GB write.
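To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch that converts the IOPS numbers into throughput and checks the headline totals. Every number comes from the announcement above; the one assumption is that "4 KB" means 4,096 bytes. Nothing here is a measurement.

```python
# Assumption: "4 KB" in the announcement means 4 KiB (4,096 bytes).
BLOCK_SIZE_BYTES = 4 * 1024


def iops_to_mib_per_s(iops, block_size_bytes=BLOCK_SIZE_BYTES):
    """Convert an I/O operations-per-second figure to throughput in MiB/s."""
    return iops * block_size_bytes / (1024 * 1024)


# IOPS figures as quoted in the footnote above.
quoted_iops = {
    "Linux PV random read": 120_000,
    "Linux PV random write (low end)": 10_000,
    "Linux PV random write (high end)": 85_000,
    "HVM/Windows random read": 90_000,
    "HVM/Windows random write (low end)": 9_000,
    "HVM/Windows random write (high end)": 75_000,
}

for label, iops in quoted_iops.items():
    print(f"{label}: {iops:,} IOPS is about {iops_to_mib_per_s(iops):.1f} MiB/s")

# Headline totals from the spec list: 8 cores x 4.4 ECU, 2 volumes x 1024 GB.
total_ecu = 8 * 4.4
total_storage_gb = 2 * 1024
print(f"total ECU: {total_ecu:.1f} (quoted as 35)")
print(f"total instance storage: {total_storage_gb} GB")
```

Even at the top of the quoted range, small random I/O works out to well under the roughly 2 GB per second quoted for sequential reads, which is why the random IOPS numbers matter most for NoSQL workloads.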

Interview James G Kobielus IBM Big Data

Here is an interview with James G Kobielus, who is the Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communication team, for helping with the logistics for this.

Ajay- What are the specific parts of the IBM Platform that deal with the three layers of Big Data: variety, velocity and volume?

James-Well first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees, either all together in a product, or two out of the three, or even one of the three aspects. We don’t have separate products for variety, velocity and volume.

Let us define these three layers-Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real time continuous and streaming data.

Variety refers to multi-structure data from structured to unstructured files, managed and stored in a common platform analyzed through common tooling.

For Volume-IBM has a highly scalable Big Data platform. This includes Netezza and Infosphere groups of products, and Watson-like technologies that can support petabytes volume of data for analytics. But really the support of volume ranges across IBM’s Information Management portfolio both on the database side and the advanced analytics side.

For real time Velocity, we have real time data acquisition. We have a product called IBM Infosphere, part of our Big Data platform, that is specifically built for streaming real time data acquisition and delivery through complex event processing. We have a very rich range of offerings that help clients build a Hadoop environment that can scale.

Our Hadoop platform is the most real time capable of all in the industry. We are differentiated by our sheer breadth, sophistication and functional depth and tooling integrated in our Hadoop platform. We are differentiated by our streaming offering integrated into the Hadoop platform. We also offer a great range of modeling and analysis tools, pretty much more than any other offering in the Big Data space.

Attached- Jim’s slides from Hadoop World

Ajay- Any plans for Mahout for Hadoop?

Jim- I can’t speak about product plans. We have plans but I can’t tell you anything more. We do have a feature in Big Insights called System ML, a library for machine learning.

Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza, Cognos, SPSS etc.)? Is it true that everything that you have in Big Data is acquired, or is the famous IBM R and D contributing here? (See a partial list of IBM acquisitions at http://www.ibm.com/investor/strategy/acquisitions.wss )

Jim- We have developed a lot on our own. We have the deepest R and D of anybody in the industry in all things Big Data.

For example – Watson has Big Insights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see http://www-01.ibm.com/software/data/infosphere/hadoop/ ). A great deal that makes Big Insights so differentiated is that not everything that has been built has been built by the Hadoop community.

We have built additions out of the necessity for security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise ready. That is one example of where we have leveraged open source and we have built our own tools and technologies and layered them on top of the open source code.

Yes of course we have done many strategic acquisitions over the last several years related to Big Data Management and we continue to do so. This quarter we have done 3 acquisitions with strong relevance to Big Data. One of them is Vivisimo (http://www-03.ibm.com/press/us/en/pressrelease/37491.wss ).

Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there and what the relevance of that data is to your data science project, to help you answer the question of which data you should bring into your Hadoop cluster.

We also did Varicent, which is more performance management, and we did TeaLeaf, which is a customer experience solution provider, where customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done a great many acquisitions that have a clear relevance to Big Data.

Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilities to massively parallelize Map Reduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to those that IBM had provided for many years under the Smart Analytics Platform (http://www-01.ibm.com/software/data/infosphere/what-is-advanced-analytics/ ).

There is a differential between Netezza and ISAS.

ISAS was built predominantly in-house over several years. If you go back a decade, IBM acquired Ascential Software, a product portfolio that was the heart and soul of IBM InfoSphere Information Manager, which is core to our Big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed to have the best of breed. SPSS provided that, and so IBM acquired them.

Cognos- We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired to various degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors, and we were lacking such a rich tool set in our product in visualization and cubing, and so for that reason we acquired Cognos.

There is also Unica, which does marketing campaign optimization, which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.

Ajay- How would you rank order these acquisitions in terms of strategic importance rather than date of acquisition or price paid?

Jim-Think of Big Data as an ecosystem that has components that are fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool the core, or the governance tools the core, or is the hardware platform the core? Everything is critically important. We would love to hear from you what you think have been most important. Each acquisition has helped play a critical role to build the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, the hosting service. I don’t think there is any validity to a rank order system.

Ajay- What are the initiatives regarding open source that the Big Data group has done or is planning?

Jim- What we are doing now- We are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages. We have built BigInsights on Apache Hadoop. Of all commercial distributions, our BigInsights 1.4 is the closest and most up to date in terms of version number to Apache Hadoop (HBase, HDFS, Pig etc.).

We have an R library integrated with BigInsights. We have an R library integrated with Netezza Analytics. There is support for R Models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.

Ajay- What are some of the concerns (privacy, security, regulation) that you think can dampen the promise of Big Data?

Jim- There are no showstoppers, there is really a strong momentum. Some of the concerns within the Hadoop space are the immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, and the lack of standardization in any formal sense for Hadoop.

There is no Open Standards Body that declares or ratifies the latest version of Mahout, Map Reduce, HDFS etc. There is no industry consensus reference framework for layering these different sub projects. There are no open APIs. There are no certifications or interoperability standards or organizations to certify different vendors’ interoperability around a common API or framework.

The lack of standardization is troubling in this whole market. That creates risks for users because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of just magnifying the risk. And it increases the need for the end user to develop their own custom integrated code if they want to move data between platforms, or move map-reduce jobs between multiple distributions.

Also governance is a consideration. Right now Hadoop is used for high volume ETL on multi structured and unstructured data sources, or Hadoop is used for exploratory sand boxes for data scientists. These are important applications that are a majority of the Hadoop deployments. Some Hadoop deployments are stand alone unstructured data marts for specific applications like sentiment analysis.

Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of truth of system-of-record data. That day will come, but there needs to be out there in the marketplace a broader range of data governance mechanisms, master data management, and data profiling products that are mature and that enterprises can use to make sure the data inside their Hadoop clusters is clean and is the single version of truth. That day has not arrived yet.

One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data, and that has been done very well by Vivisimo for several years.

What we will see is vendors such as IBM will continue to evolve security features inside of our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform so that customers can use data from a bit of Hadoop, some data from a bit of traditional relational data warehouse, maybe some noSQL technology for different roles within a very complex Big Data environment.

That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is that when your Big Data deployment has a bit of Hadoop, a bit of noSQL, a bit of EDW, and a bit of in-memory, there are no open standards or frameworks for putting it all together in a unified framework, not just for interoperability but also for deployment.

There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms by the users/developers writing the queries, by administrators so they can manage data and resources and jobs across all these disparate platforms in a seamless unified way with visual tooling. That grand scenario, the virtualization layer is not there yet in any standard way across the big data market. It will evolve, it may take 5-10 years to evolve but it will evolve.

So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.


You can read more about Jim at http://www.linkedin.com/pub/james-kobielus/6/ab2/8b0 or

follow him on Twitter at http://twitter.com/jameskobielus

You can read more about IBM Big Data at http://www-01.ibm.com/software/data/bigdata/

Anonymous grows up and matures…Anonanalytics.com

I liked the design, user interfaces and the conceptual ideas behind the latest Anonymous hacktivist websites (much better than the shabby graphic design of Wikileaks, or Friends of Wikileaks, though I guess they have been busy what with Julian’s escapades and Syrian emails).


I disagree (and let us agree to disagree some of the time) with the complete lack of respect for Graphical User Interfaces for tools. If DDoS really took off due to LOIC, why not build a GUI for SQL Injection (or at least for testing the top 25 vulnerabilities as per this list: http://www.sans.org/top25-software-errors/ )?

Shouldn’t Tor be embedded within the next generation of LOIC?

Automated testing tools are used by companies like Adobe (and others), so why not create a simple GUI for the existing tools? I may be completely off track here, but I think hacker education has been a critical misstep (one that has undermined Western democracies’ preparedness for cyber tactics by hostile regimes). How do we create the next generation of hackers through easy tutorials (see Codecademy and build appropriate modules)?

-A slick website to be funded by Bitcoins (Money can buy everything including Mastercard and Visa, but Bitcoins are an innovative step towards an internet economy  currency)

-A collaborative wiki


Seriously dude, why not make this a part of Wikipedia? (I know Jimmy Wales got shifty eyes, but can you trust someone?)

-Analytics for Anonymous (sighs! I should have thought about this earlier)

http://anonanalytics.com/ (can be used to play and bill both sides of corporate espionage and be cyber private investigators)

What We Do

We provide the public with investigative reports exposing corrupt companies. Our team includes analysts, forensic accountants, statisticians, computer experts, and lawyers from various jurisdictions and backgrounds. All information presented in our reports is acquired through legal channels, fact-checked, and vetted thoroughly before release. This is both for the protection of our associates as well as groups/individuals who rely on our work.

-And lastly, creative content for Pinterest.com and Public Relations (what next? Tom Cruise to play Julian Assange in the new movie?)

http://www.par-anoia.net/

Potentially Alarming Research: Anonymous Intelligence Agency

Information is and will be free. Expect it. ~ Anonymous

Links of interest

  • Latest Scientology Mails (Austria)
  • Full FBI call transcript
  • Arrest Tracker
  • HBGary Email Viewer
  • The Pirate Bay Proxy
  • We Are Anonymous – Book
  • To be announced…