New Plotters in Rapid Miner 5.2

I almost missed this because of my vacation and traveling.

Rapid Miner has a tonne of new stuff (Statutory Ethics Declaration- Rapid Miner has been an advertising partner for Decisionstats – see the right margin)

see

http://rapid-i.com/component/option,com_myblog/Itemid,172/lang,en/

Great New Graphical Plotters

and some flashy work

and a great series of educational lectures

A Simple Explanation of Decision Tree Modeling based on Entropies

Link: http://www.simafore.com/blog/bid/94454/A-simple-explanation-of-how-entropy-fuels-a-decision-tree-model

A description of some of the basics of decision trees. It is simple, with hardly any math. I like the plots explaining the basic idea of entropy as a splitting criterion (although we actually calculate gain ratio differently than explained…)
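
To make the entropy idea concrete, here is a minimal Python sketch of entropy and plain information gain on a made-up set of labels. The function names and the toy data are my own assumptions for illustration. This is not how RapidMiner computes its criterion (as noted above, its gain ratio is calculated differently).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (made up): 4 "yes" and 4 "no" labels.
parent = ["yes"] * 4 + ["no"] * 4
# A hypothetical split that separates the classes perfectly.
perfect_split = [["yes"] * 4, ["no"] * 4]
# A hypothetical split that leaves both subsets as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree learner built on this idea would prefer the first split, because it removes all the uncertainty about the label, while the second split removes none.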

Logistic Regression for Business Analytics using RapidMiner

Link: http://www.simafore.com/blog/bid/57924/Logistic-regression-for-business-analytics-using-RapidMiner-Part-2

Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.

Part 1 (Basics): http://www.simafore.com/blog/bid/57801/Logistic-regression-for-business-analytics-using-RapidMiner-Part-1

Deploy Model: http://www.simafore.com/blog/bid/82024/How-to-deploy-a-logistic-regression-model-using-RapidMiner

Advanced Information: http://www.simafore.com/blog/bid/99443/Understand-3-critical-steps-in-developing-logistic-regression-models
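
For readers who want to see what scoring a record with a logistic regression model boils down to, here is a small Python sketch. The coefficients and the intercept are hypothetical numbers, not taken from the linked articles or from any RapidMiner export.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic model: squash the linear score into a probability in (0, 1)."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients, as if read off an already trained model.
coefficients = [0.8, -1.2]
intercept = 0.1

probability = predict_probability([1.0, 0.5], coefficients, intercept)
print(round(probability, 3))
```

A linear score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual default cut-off between the two classes.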

and lastly a new research project for collaborative data mining

http://www.e-lico.eu/

e-LICO Architecture and Components

The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.

  1. Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
  2. Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
  3. Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.

The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.
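
The two preprocessing steps mentioned above can be sketched in Python as follows. This is only an illustration under my own assumptions (mean imputation and min-max scaling, with hypothetical function names). The text does not say which replacement or normalization method the assistant actually uses.

```python
def replace_missing_with_mean(column):
    """Replace None entries with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale values to [0, 1] so no attribute dominates a distance measure."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]


# Made-up attribute with one missing value.
ages = [1.0, None, 3.0]
filled = replace_missing_with_mean(ages)
print(filled)                     # [1.0, 2.0, 3.0]
print(min_max_normalize(filled))  # [0.0, 0.5, 1.0]
```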

You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL.

Of course, Rapid Miner has been one of the most professional open source analytics companies, and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.

http://rapid-i.com/content/view/186/191/lang,en/

Product Map

[Image: Rapid-I Product Overview]

Radoop 0.3 launched- Open Source Graphical Analytics meets Big Data

What is Radoop? Quite possibly an exciting mix of analytics and big data computing.

 

http://blog.radoop.eu/?p=12

What is Radoop?

Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.

We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

 

and what’s new

http://blog.radoop.eu/?p=198

Radoop 0.3 released – fully graphical big data analytics

Today, Radoop had a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions used these days: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.

As Radoop’s promise is to make big data analytics easier, the 0.3 release is also focused on improving the user interface. It has an enhanced breakpointing system which allows you to investigate intermediate results, and it adds dozens of quick fixes, so common process design mistakes get much easier to solve.

There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta mode, but heading towards a public release in Q2 2012. If you would like to get early access, then please apply at the signup page or describe your use case in email (beta at radoop.eu).

Radoop 0.3 (15 February 2012)

  • Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
  • Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
  • Enhanced breakpointing to evaluate intermediate results
  • Dozens of quick fixes for the most common process design errors
  • Improved process design and error reporting
  • New welcome perspective to help in the first steps
  • Many bugfixes and performance improvements

Radoop 0.2.2 (6 December 2011)

  • More Aggregate functions and distinct option
  • Generate ID operator for convenience
  • Numerous bug fixes and improvements
  • Improved user interface

Radoop 0.2.1 (16 September 2011)

  • Set Role and Data Multiplier operators
  • Management panel for testing Hadoop connections
  • Stability improvements for Hive access
  • Further small bugfixes and improvements

Radoop 0.2 (26 July 2011)

  • Three new algorithms: Fuzzy K-Means, Canopy, and Dirichlet clustering
  • Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
  • Significant speed improvements in data transmission and interactive analytics
  • Increased stability and speedup for K-Means
  • More flexible settings for Join operations
  • More meaningful error messages
  • Other small bugfixes and improvements

Radoop 0.1 (14 June 2011)

Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.

Note that Rapid Miner also has a great R extension, so you can use R together with a graphical interface, and big data analytics is now easier and more powerful than ever.


Machine Learning for Hackers – #rstats

I got the incredible and intriguing Machine Learning for Hackers for just $15.99 for an electronic copy from O'Reilly Media. (Deal of the Day!)

It has just been launched this month!!

It is an incredible book, and I really like the way O'Reilly has made it so easy to download e-books.

I am trying to read it while trying to write a whole lot of other stuff, and it seems easy to read and understand even for non-hackers like me. Especially with Stanford delaying its online machine learning course, this is one handy e-book to have to get you started in ML and data science!!

Click the image to see the real deal.

http://shop.oreilly.com/product/0636920018483.do

 

How to find out people who are spamming you

Step 1-

We assume you have Gmail. If you don't have Gmail, you deserve the spam.

Click "Show original" in the drop-down menu of the spammy message.

 

you see a lot of mumbo jumbo

(or you just pick the IP addresses from comment spam)

Step 2-

You pick the IP addresses from the mumbo jumbo above (called headers)

http://en.wikipedia.org/wiki/IP_address

An Internet Protocol address (IP address) is a numerical label assigned to each device (e.g., computer, printer) participating in a computer network that uses the Internet Protocol for communication.[1] An IP address serves two principal functions: host or network interface identification and location addressing.
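
If you would rather not pick the addresses out by eye, Step 2 can be sketched in a few lines of Python using only the standard library. Everything in the sample message is made up (the host names are placeholders and the addresses come from reserved documentation and private ranges), and the function name is my own. Keep in mind that Received headers can be forged, so treat the result as a hint and not as proof.

```python
import email
import ipaddress
import re

# Loose pattern for dotted quads; each candidate is validated afterwards.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Loopback and private ranges say nothing about the outside sender.
SKIPPED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def extract_sender_ips(raw_message):
    """Return unique IPv4 addresses found in the Received headers, in order."""
    message = email.message_from_string(raw_message)
    found = []
    for header in message.get_all("Received", []):
        for candidate in IPV4_CANDIDATE.findall(header):
            try:
                ip = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looked like an address but is not a valid one
            if any(ip in network for network in SKIPPED_NETWORKS):
                continue
            if candidate not in found:
                found.append(candidate)
    return found


# Made-up message as it might appear under "Show original".
SAMPLE = """\
Received: from relay.example.net (relay.example.net [203.0.113.45])
\tby mx.example.com with ESMTP id abc123
Received: from localhost (localhost [127.0.0.1])
\tby relay.example.net with SMTP
Received: from [192.168.1.10] by internal.example.net
From: spammer@example.org
Subject: You have won

body text
"""

print(extract_sender_ips(SAMPLE))  # ['203.0.113.45']
```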

Step 3-

You find out who has that IP address using ARIN

https://www.arin.net/

 

Step 4-

You put those IP addresses in your firewall for your computer

http://technet.microsoft.com/en-us/library/cc733090(v=ws.10).aspx

(or, if you have a self-hosted blog, use the IP Deny Manager in your website's cPanel)

http://www.siteground.com/tutorials/cpanel/ip_deny_manager.htm

Step 5-

 

Communicate to that IP Address using IRC

http://en.wikipedia.org/wiki/Internet_Relay_Chat

Internet Relay Chat (IRC) is a protocol for real-time Internet text messaging (chat) or synchronous conferencing.[1] It is mainly designed for group communication in discussion forums, called channels,[2] but also allows one-to-one communication via private message[3] as well as chat and data transfer,[4] including file sharing.[5]

or use HOIC to test your own firewall better before people spam you

http://gizmodo.com/5883146/what-is-hoic or

http://www.decisionstats.com/occupy-the-internet/

 

Internet Encryption Algorithms are flawed- too little too late!

Some news from a paper I am reading- not surprised that RSA has a problem.

http://eprint.iacr.org/2012/064.pdf

Abstract. We performed a sanity check of public keys collected on the web. Our main goal was to test the validity of the assumption that different random choices are made each time keys are generated. We found that the vast majority of public keys work as intended. A more disconcerting finding is that two out of every one thousand RSA moduli that we collected offer no security.

Our conclusion is that the validity of the assumption is questionable and that generating keys in the real world for “multiple-secrets” cryptosystems such as RSA is significantly riskier than for “single-secret” ones such as ElGamal or (EC)DSA which are based on Diffie-Hellman.

Keywords: Sanity check, RSA, 99.8% security, ElGamal, DSA, ECDSA, (batch) factoring, discrete logarithm, Euclidean algorithm, seeding random number generators, K9.

and

 

99.8% Security. More seriously, we stumbled upon 12720 different 1024-bit RSA moduli that offer no security. Their secret keys are accessible to anyone who takes the trouble to redo our work. Assuming access to the public key collection, this is straightforward compared to more traditional ways to retrieve RSA secret keys (cf. [5,15]). Information on the affected X.509 certificates and PGP keys is given in the full version of this paper, cf. below. Overall, over the data we collected 1024-bit RSA provides 99.8% security at best (but see Appendix A).
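
Why would such moduli offer no security? If two RSA moduli happen to share one prime factor, a plain greatest common divisor computation reveals that prime, and both keys fall. Here is a toy sketch in Python with tiny made-up primes. It compares every pair, which would be far too slow for millions of keys, and it is my own illustration of the idea, not the authors' actual method.

```python
import math
from itertools import combinations


def find_shared_primes(moduli):
    """Map each modulus that shares a prime with another one to its factors."""
    broken = {}
    for a, b in combinations(moduli, 2):
        g = math.gcd(a, b)
        # 1 < g < min(a, b) means g is a proper common factor of both moduli.
        if 1 < g < min(a, b):
            broken[a] = (g, a // g)
            broken[b] = (g, b // g)
    return broken


# Toy moduli built from tiny primes (real RSA primes have hundreds of digits).
n1 = 101 * 103  # shares the prime 101 with n2
n2 = 101 * 107
n3 = 109 * 113  # shares nothing with the others

print(find_shared_primes([n1, n2, n3]))
# {10403: (101, 103), 10807: (101, 107)}
```

The third modulus survives only because its primes were not reused, which is exactly the "different random choices" assumption the paper puts to the test.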

 

However, no algorithm is perfect, and even elliptic curve cryptography (see http://en.wikipedia.org/wiki/Elliptic_curve_cryptography#Fast_reduction_.28NIST_curves.29 ) can in principle be broken by Shor's algorithm on a large enough quantum computer: http://en.wikipedia.org/wiki/Shor%27s_algorithm

Funny thing is, ECC is now used by OpenDNS


http://dnscurve.org/crypto.html

The DNSCurve project adds link-level public-key protection to DNS packets. This page discusses the cryptographic tools used in DNSCurve.

ELLIPTIC-CURVE CRYPTOGRAPHY

DNSCurve uses elliptic-curve cryptography, not RSA.

RSA is somewhat older than elliptic-curve cryptography: RSA was introduced in 1977, while elliptic-curve cryptography was introduced in 1985. However, RSA has shown many more weaknesses than elliptic-curve cryptography. RSA’s effective security level was dramatically reduced by the linear sieve in the late 1970s, by the quadratic sieve and ECM in the 1980s, and by the number-field sieve in the 1990s. For comparison, a few attacks have been developed against some rare elliptic curves having special algebraic structures, and the amount of computer power available to attackers has predictably increased, but typical elliptic curves require just as much computer power to break today as they required twenty years ago.

IEEE P1363 standardized elliptic-curve cryptography in the late 1990s, including a stringent list of security criteria for elliptic curves. NIST used the IEEE P1363 criteria to select fifteen specific elliptic curves at five different security levels. In 2005, NSA issued a new “Suite B” standard, recommending the NIST elliptic curves (at two specific security levels) for all public-key cryptography and withdrawing previous recommendations of RSA.

Some specific types of elliptic-curve cryptography are patented, but DNSCurve does not use any of those types of elliptic-curve cryptography.

No wonder college kids are hacking defense databases easily nowadays!!

Predictive analytics in the cloud : Angoss

I interviewed Angoss in depth here at http://www.decisionstats.com/interview-eberhard-miethke-and-dr-mamdouh-refaat-angoss-software/

Well, they just announced predictive analytics in the cloud.

 

http://www.angoss.com/predictive-analytics-solutions/cloud-solutions/

KnowledgeCLOUD™ solutions deliver predictive analytics in the Cloud to help businesses gain competitive advantage in the areas of sales, marketing and risk management by unlocking the predictive power of their customer data.

KnowledgeCLOUD clients experience rapid time to value and reduced IT investment, and enjoy the benefits of Angoss’ industry leading predictive analytics – without the need for highly specialized human capital and technology.

KnowledgeCLOUD solutions serve clients in the asset management, insurance, banking, high tech, healthcare and retail industries. Industry solutions consist of a choice of analytical modules:

KnowledgeCLOUD for Sales/Marketing

KnowledgeCLOUD solutions are delivered via KnowledgeHUB™, a secure, scalable cloud-based analytical platform together with supporting deployment processes and professional services that deliver predictive analytics to clients in a hosted environment. Angoss industry leading predictive analytics technology is employed for the development of models and deployment of solutions.

Angoss’ deep analytics and domain expertise guarantees effectiveness – all solutions are back-tested for accuracy against historical data prior to deployment. Best practices are shared throughout the service to optimize your processes and success. Finely tuned client engagement and professional services ensure effective change management and program adoption throughout your organization.

For businesses looking to gain a competitive edge and put their data to work, Angoss is the ideal partner.

—-

Hmm. Analytics in the cloud. Reduce hardware costs. Reduce software costs. Increase profitability margins.

Hmmmmm

To my favorite professor in North Carolina, who calls the cloud time sharing: are you listening, Professor?

Self Driving Cars, Geo Coded Ads, End of Privacy

Imagine a world in which your car tracks everywhere you go. Over a period of time, it builds up a database of your driving habits: how long you stay at particular kinds of dining places and entertainment places (ahem!), and the days and times you do it. You can no longer go to massage parlours without your data being checked by your car software admin (read: your home admin).

And that data is mined using machine learning algorithms to give you better ads for pizzas, or a reminder for food every 3 hours, or an ad for beer every Thursday after 8 pm.

Welcome Brave New World!