Page Mathematics

I was looking at the site http://www.google.com/adplanner/static/top1000/index.html

and I saw this list (below), which I copied into a Google Doc at https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtYMMvghK2ytdE9ybmVQeUxMeXdjWlVKYzRlMkxjX0E&output=html.

I then decided to divide page views by users to check the maths.

Facebook is AAAAAmazing! and the Russian social network is amazing too!

or

The maths is wrong! (maybe sampling, maybe virtual page views caused by friendstream refreshes)

but an average of 1,136 page views per unique visitor per month means roughly 36 page views per visitor per day!

Rank  Site              Category         Unique Visitors (users)  Page Views         Views/Visitor
1     facebook.com      Social Networks  880,000,000              1,000,000,000,000  1,136
29    linkedin.com      Social Networks  80,000,000               2,500,000,000      31
38    orkut.com         Social Networks  66,000,000               4,000,000,000      61
40    orkut.com.br      Social Networks  62,000,000               43,000,000,000     694
65    weibo.com         Social Networks  42,000,000               2,800,000,000      67
66    renren.com        Social Networks  42,000,000               3,300,000,000      79
84    odnoklassniki.ru  Social Networks  37,000,000               13,000,000,000     351
90    scribd.com        Social Networks  34,000,000               140,000,000        4
95    vkontakte.ru      Social Networks  34,000,000               48,000,000,000     1,412
and
Rank  Site           Category         Unique Visitors (users)  Page Views         Page Views/Visitor
1     facebook.com   Social Networks  880,000,000              1,000,000,000,000  1,136
2     youtube.com    Online Video     800,000,000              100,000,000,000    125
3     yahoo.com      Web Portals      590,000,000              77,000,000,000     131
4     live.com       Search Engines   490,000,000              84,000,000,000     171
5     msn.com        Web Portals      440,000,000              20,000,000,000     45
6     wikipedia.org  Dict             410,000,000              6,000,000,000      15
7     blogspot.com   Blogging         340,000,000              4,900,000,000      14
8     baidu.com      Search Engines   300,000,000              110,000,000,000    367
9     microsoft.com  Software         250,000,000              2,500,000,000      10
10    qq.com         Web Portals      250,000,000              39,000,000,000     156
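A quick sanity check of the arithmetic (a minimal sketch in Python; the figures are the Facebook row from the table above):

```python
# Reproduce the Views/Visitor column for the Facebook row
page_views = 1_000_000_000_000   # monthly page views, from the table
unique_visitors = 880_000_000    # monthly unique visitors, from the table

views_per_visitor = page_views / unique_visitors
print(f"Views per visitor per month: {views_per_visitor:,.0f}")      # ~1,136
print(f"Views per visitor per day:   {views_per_visitor / 31:.1f}")  # ~36.7
```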

See the complete list at http://www.google.com/adplanner/static/top1000/index.html

Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/).

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember.

Dan- When I was in graduate school studying econometrics at Harvard, a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real-world activities. Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems and Data Resources, Inc., or were heavily involved in business consulting through their own companies or other influential consultancies. Some not involved in private-sector consulting took on substantial roles in government, such as membership on the President's Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector, so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

Ajay- What are the latest products by Salford Systems? Any future product plans or modifications to work on Big Data analytics, mobile computing, and cloud computing.

Dan- Our central set of data mining technologies is CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature-rich logistic regression and linear regression modules. In our latest release, scheduled for January 2012, we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.
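(Not Salford's implementation, but a minimal scikit-learn sketch of the same idea: tracing lasso-style predictor selection and coefficient shrinkage on synthetic wide data, where stronger penalties keep fewer predictors in the model.)

```python
# Generic lasso-path illustration (not Salford's code): synthetic data
# with 1,000 candidate predictors, of which only 10 are informative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=1000,
                       n_informative=10, noise=5.0, random_state=0)

# Each alpha is one point on the complexity/accuracy tradeoff curve:
# larger penalties shrink more coefficients exactly to zero.
alphas, coefs, _ = lasso_path(X, y)
for alpha, coef in zip(alphas[::10], coefs[:, ::10].T):
    print(f"alpha={alpha:10.2f}  predictors kept={int(np.sum(coef != 0))}")
```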

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled Learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000-tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler's control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy, and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.
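(ISLE itself is Salford's implementation of published work by Friedman and Popescu; as a rough, hypothetical sketch of the underlying idea, not Salford's code, one can refit sparse weights over the individual trees of a boosted ensemble, so that most trees drop out and the survivors are reweighted.)

```python
# Rough sketch of ISLE-style ensemble compression (not Salford's code):
# post-process a boosted ensemble by fitting a lasso over per-tree outputs.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                       random_state=0)

gbm = GradientBoostingRegressor(n_estimators=1000, max_depth=3,
                                learning_rate=0.05, random_state=0)
gbm.fit(X, y)

# One column of predictions per tree in the ensemble.
tree_preds = np.column_stack([est[0].predict(X) for est in gbm.estimators_])

# The lasso zeroes out most trees and reweights the rest, compressing
# the 1,000-tree ensemble to a much smaller weighted subset.
post = Lasso(alpha=1.0).fit(tree_preds, y)
kept = int(np.sum(post.coef_ != 0))
print(f"Trees kept after compression: {kept} of {len(gbm.estimators_)}")
```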

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
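(RULEFIT operates on whole TreeNet ensembles; as a toy illustration of the general idea of turning tree paths into readable rules, here is a sketch that walks a single scikit-learn tree and prints one rule per terminal node.)

```python
# Toy rule extraction from one decision tree (illustrative only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)
t = tree.tree_

def print_rules(node=0, path=()):
    if t.children_left[node] == -1:            # terminal node: emit the rule
        label = data.target_names[t.value[node].argmax()]
        print(" AND ".join(path) or "(root)", "->", label)
        return
    name = data.feature_names[t.feature[node]]
    thr = t.threshold[node]
    print_rules(t.children_left[node], path + (f"{name} <= {thr:.2f}",))
    print_rules(t.children_right[node], path + (f"{name} > {thr:.2f}",))

print_rules()
```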

 Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem.

Dan- We only enter competitions involving problems for which our technology is suitable: generally, classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy-to-understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizers' decision to issue a corrected version of the data, and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

Ajay- I know you have created your own software, but are there other software packages that you use or like to use?

 Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

Ajay- Your software is installed at 3,500 sites, including 400 universities, as per http://www.salford-systems.com/company/aboutus/index.html. What is the key to managing and keeping so many customers happy?

Dan- First, we have taken great pains to make our software reliable, and we make every effort to avoid problems related to bugs. Our testing procedures are extensive, and we have experts dedicated to stress-testing software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we help users look after themselves. Should a client need to contact us, we try to achieve 24-hour turnaround on tech support issues and monitor all tech support activity to ensure the timeliness, accuracy, and helpfulness of our responses. WebEx, GoToMeeting, and other internet-based contact tools permit real-time interaction.

 Ajay- What do you do to relax and unwind?

 Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

 Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association, and the American Statistical Association. After earning his Ph.D. at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, and Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan, and Brazil. Steinberg led the teams that won first-place awards in the KDD Cup 2000 and the 2002 Duke/Teradata churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, and computer science journals, and contributes actively to ongoing research and development at Salford.

Interview Eberhard Miethke and Dr. Mamdouh Refaat, Angoss Software

Here is an interview with Eberhard Miethke and Dr. Mamdouh Refaat, of Angoss Software. Angoss is a global leader in delivering business intelligence software and predictive analytics solutions that help businesses capitalize on their data by uncovering new opportunities to increase sales and profitability and to reduce risk.

Ajay-  Describe your personal journey in software. How can we guide young students to pursue more useful software development than just gaming applications.

Mamdouh- I started using computers a long time ago, when they were programmed using punched cards! First in Fortran, then C, later C++, and then the rest. Computers and software were viewed as technical/engineering tools, and that's why we can still see the heavy technical orientation of command languages such as Unix shells and even the Windows command shell. However, with the introduction of database systems and Microsoft Office apps, it was clear that business would be the primary user and field of application for software. My personal journey in software started with scientific applications, then business and database systems, and finally statistical software – which you can think of as a return to the more scientific orientation. However, with the wide acceptance by businesses of statistical methods in fields such as marketing and risk management, it is a fast-growing field in need of a lot of innovation.

Ajay- Angoss makes multiple data mining and analytics products. Could you please introduce us to your product portfolio and the specific data analytics needs they serve.

a- Attached please find our main product flyers for KnowledgeSTUDIO and KnowledgeSEEKER. We have a third product called "StrategyBuilder", which is an add-on to the decision tree modules. This is also described in the flyer.

(see the Angoss Knowledge Studio Product Guide April 2011 and http://www.scribd.com/doc/63176430/Angoss-Knowledge-Seeker-Product-Guide-April2011)

Ajay- The trend in analytics is toward big data and cloud computing, with Hadoop enabling processing of massive data sets on scalable infrastructure. What are your plans for cloud computing, tablet-based, as well as mobile-based computing.

a- This is an area where the plan is still being figured out in all organizations. The current explosion of data collected from mobile phones, text messages, and social websites will need radically new applications that can utilize the data from these sources. Current applications are based on the relational database paradigm designed from the '70s through the '90s of the 20th century.

But data sources are generating data in volumes and formats that challenge this paradigm and will need a set of new tools, and possibly programming languages, to fit these needs. Cloud computing and tablet/mobile computing (which are the same thing in my opinion, just different sizes of device) are technologies that have not been explored in analytics yet.

The approach taken so far by most companies, including Angoss, is to rely on new XML-based standards to represent data structures for the particular models. In this case, it is the PMML (Predictive Model Markup Language) standard, in order to allow interoperability between analytics applications. Standardizing the representation of models is viewed as the first step toward allowing the implementation of these models on emerging platforms, be that the cloud, mobile, or social networking websites.
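(As a hand-written toy fragment, not the output of any Angoss tool: a minimal PMML document for a one-predictor linear regression, just to show what an XML model representation looks like; the authoritative schema lives at dmg.org.)

```python
# Hypothetical, hand-written PMML fragment for a tiny regression model.
pmml = """<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toy_model">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="age" coefficient="0.04"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""
print(pmml)
```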

The second challenge cited above is the rapidly increasing size of the data to be analyzed. Angoss identified this challenge early on and currently offers in-database analytics drivers for several database engines: Netezza, Teradata, and SQL Server.

These drivers allow our analytics products to translate their routines into efficient SQL-based scripts that run in the database engine to exploit its performance as well as the powerful hardware on which it runs. Thus, instead of copying the data to a staging format for analytics, these drivers allow the data to be analyzed “in-place” within the database without moving it.
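(To make the idea concrete, here is a hypothetical sketch, not Angoss's actual driver output, of translating a logistic model into a SQL scoring script that runs where the data lives; the table and column names are made up.)

```python
# Translate logistic-regression coefficients into an in-database SQL script.
# All names below are hypothetical, for illustration only.
intercept = -2.5
coefficients = {"age": 0.04, "num_late_payments": 0.9, "balance": 0.0003}

linear_term = " + ".join(f"{b} * {col}" for col, b in coefficients.items())
sql = f"""SELECT customer_id,
       1.0 / (1.0 + EXP(-({intercept} + {linear_term}))) AS risk_score
FROM customer_accounts;"""
print(sql)
```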

This offers performance, security, and integrity benefits. Performance is improved through the use of well-tuned database engines running on powerful hardware.

Extra security is achieved by not copying the data to other platforms, which could be less secure. And finally, the integrity of the results is vastly improved by making sure that the results are always obtained by analyzing the up-to-date data residing in the database, rather than an older copy that could be obsolete by the time the analysis is concluded.

Ajay- What are the principal competing products to your offerings, and what makes your products special or differentiated in value to them (for each customer segment).

a- There are two major players in today's market that we usually encounter as competitors: SAS and IBM.

SAS offers a data mining workbench in the form of SAS Enterprise Miner, which is closely tied to the SAS data mining methodology known as SEMMA.

On the other hand, IBM has recently acquired SPSS, which offered the Clementine data mining software. IBM has now rebranded Clementine as IBM SPSS Modeler.

In comparison to these products, our KnowledgeSTUDIO and KnowledgeSEEKER offer three main advantages: ease of use; affordability; and ease of integration into existing BI environments.

Angoss products were designed to look and feel like popular Microsoft Office applications. This makes the learning curve very gentle. Typically, an intermediate-level analyst needs only 2-3 days of training to become proficient in the use of the software with all its advanced features.

Another important feature of Angoss software products is their integration with the SAS Base product and with SQL-based database engines. All predictive models generated by Angoss can be automatically translated into SAS and SQL scripts. This allows the generation of scoring code for these common platforms. While the software interface simplifies all the tasks to allow business users to take advantage of the value added by predictive models, the software includes advanced options to allow experienced statisticians to fine-tune their models by adjusting all model parameters as needed.

In addition, Angoss offers a unique product called StrategyBuilder, which allows the analyst to add key performance indicators (KPIs) to predictive models. KPIs such as profitability, market share, and loyalty usually need to be calculated in conjunction with any sales and marketing campaign. Therefore, StrategyBuilder was designed to integrate such KPIs with the results of a predictive model in order to render the appropriate treatment for each customer segment. These results are all integrated into a deployment strategy that can also be translated into execution code in SQL or SAS.

The above competitive features are behind the success of Angoss software products in serving over 4,000 users from over 500 clients worldwide.

Ajay -Describe a major case study where using Angoss software helped save a big amount of revenue/costs by innovative data mining.

a- Rogers Telecommunications Inc. is one of the largest Canadian telecommunications providers, serving over 8.5 million customers, with revenue of 11.1 billion Canadian dollars (2009). In 2008, Rogers engaged Angoss for a period of 18 months to help with the problem of ballooning accounts receivable.

The problem was approached by improving the efficiency of the call centre serving the collections process through a set of predictive models. The first set of models was designed to find accounts likely to default ahead of time, in order to take preventative measures. A second set of models was designed to optimize call centre resources to focus on delinquent accounts likely to pay back most of the outstanding balance. Accounts identified as not likely to pay quickly were good candidates for "early-out" treatment, forwarding them directly to collection agencies. Angoss hosted Rogers' data and provided, at regular intervals, the lists of accounts for each treatment to be deployed by the call centre dialler. As a result, Rogers estimated a 10% improvement in the sums collected.

Biography-

Mamdouh has been active in consulting, research, and training in various areas of information technology and software development for the last 20 years. He has worked on numerous projects with major organizations in North America and Europe in the areas of data mining, business analytics, business analysis, and engineering analysis. He has held several consulting positions for solution providers, including Predict AG in Basel, Switzerland, and ANGOSS Corp. Mamdouh is the Director of Professional Services for the EMEA region of ANGOSS Software. Mamdouh received his PhD in engineering from the University of Toronto and his MBA from the University of Leeds, UK.

Mamdouh is the author of:

"Credit Risk Scorecards: Development and Implementation Using SAS"
"Data Preparation for Data Mining Using SAS" (The Morgan Kaufmann Series in Data Management Systems)

and co-author of:

"Data Mining: Know It All" (Morgan Kaufmann)



Eberhard Miethke works as a senior sales executive for Angoss.


About Angoss-

Angoss is a global leader in delivering business intelligence software and predictive analytics to businesses looking to improve performance across sales, marketing and risk. With a suite of desktop, client-server and in-database software products and Software-as-a-Service solutions, Angoss delivers powerful approaches to turn information into actionable business decisions and competitive advantage.

Angoss software products and solutions are user-friendly and agile, making predictive analytics accessible and easy to use.

Interview Scott Gidley CTO and Founder, DataFlux

Here is an interview with Scott Gidley, CTO and co-founder of leading data quality company DataFlux. DataFlux is part of SAS Institute and in 2011 acquired Baseline Consulting, besides launching the latest version of their Master Data Management product.

Interview Jaime Fitzgerald President Fitzgerald Analytics

Here is an interview with noted analytics expert Jaime Fitzgerald, of Fitzgerald Analytics.

Ajay- Describe your career journey from being a Harvard economist to being a text analytics thought leader.

 Jaime- I was attracted to economics because of the logic, the structured and systematic approach to understanding the world and to solving problems. In retrospect, this is the same passion for logic in problem solving that drives my business today.

About 15 years ago, I began working in consulting and initially took a traditional career path. I worked for well-known strategy consulting firms including First Manhattan Consulting Group, Novantas LLC, Braun Consulting, and for the former Japan-focused division of Deloitte Consulting, which had spun off as an independent entity. I was the only person in their New York City office for whom Japanese was not the first language.

While I enjoyed traditional consulting, I was especially passionate about the role of data, analytics, and process improvement. In traditional strategy consulting, these are important factors, but I had a vision for a “next generation” approach to strategy consulting that would be more transparent, more robust, and more focused on the role that information, analysis, and process plays in improving business results. I often explain that while my firm is “not your father’s consulting model,” we have incorporated key best practices from traditional consulting, and combined them with an approach that is more data-centric, technology-centric, and process-centric.

At the most fundamental level, I was compelled to found Fitzgerald Analytics more than six years ago by my passion for the role information plays in improving results, and ultimately improving lives. In my vision, data is an asset waiting to be transformed into results, including profit as well as other results that matter deeply to people. For example, one of the most fulfilling aspects of our work at Fitzgerald Analytics is our support of non-profits and social entrepreneurs, whom we help increase their scale and their success in achieving their goals.

Ajay- How would you describe analytics as a career option to future students. What do you think are the most essential qualities an analytics career requires.

Jaime- My belief is that analytics will be a major driver of job-growth and career growth for decades. We are just beginning to unlock the full potential of analytics, and already the demand for analytic talent far exceeds the supply.

To succeed in analytics, the most important quality is logic. Many people believe that math or statistical skills are the most important quality, but in my experience, the most essential trait is what I call “ThoughtStyle” — critical thinking, logic, an ability to break down a problem into components, into sub-parts.

Ajay- What are your favorite techniques and methodologies in text analytics. How do you see social media and Big Data analytics as components of text analytics.

Jaime- We do a lot of work for our clients measuring Customer Experience, by which I mean the experience customers have when interacting with our clients. For example, we helped a major brokerage firm measure 12 key "Moments that Matter," including the operational aspects of customer service, customer satisfaction and sentiment, and ultimately customer behavior. Clients care about this a lot, because customer experience drives customer loyalty, which in turn drives customer behavior and customer profitability.

Text analytics plays a key role in these projects because much of our data on customer sentiment comes via unstructured text data. For example, we have access to call center transcripts and notes, to survey responses, and to social media comments.

We use a variety of methods, some of which I’m not in a position to describe in great detail. But at a high level, I would say that our favorite text analytics methodologies are “hybrid solutions” which use a two-step process to answer key questions for clients:

Step 1: Convert unstructured data into key categorical variables (for example, using contextual analysis to flag users who are critical vs. neutral vs. advocates).

Step 2: Link sentiment categories to customer behavior and profitability (for example, linking customer advocacy and loyalty with customer profits as well as referral volume, to define the ROI that clients accrue for customer satisfaction improvements).
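(The actual methods are proprietary, but a toy sketch of the two-step shape might look like the following, with a simple keyword rule standing in for the contextual analysis of Step 1.)

```python
# Toy sketch of the hybrid two-step approach (keyword rule is a stand-in).
comments = [
    ("I love this service, told all my friends", 1200.0),   # (text, profit)
    ("support was slow and unhelpful", 150.0),
    ("it works, nothing special", 400.0),
]

def categorize(text):             # Step 1: unstructured text -> category
    if any(w in text for w in ("love", "great", "friends")):
        return "advocate"
    if any(w in text for w in ("slow", "unhelpful", "bad")):
        return "critic"
    return "neutral"

# Step 2: link the sentiment categories to customer profitability.
profit_by_category = {}
for text, profit in comments:
    profit_by_category.setdefault(categorize(text), []).append(profit)

for category, profits in profit_by_category.items():
    print(category, sum(profits) / len(profits))
```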

Ajay- Describe your consulting company, Fitzgerald Analytics, and some of the work that you have been engaged in.

 Jaime- Our mission is to “illuminate reality” using data and to convert Data to Dollars for our clients. We have a track record of doing this well, with concrete and measurable results in the millions of dollars. As a result, 100% of our clients have engaged us for more than one project: a 100% client loyalty rate.

Our specialties–and most frequent projects–include customer profitability management projects, customer segmentation, customer experience management, balanced scorecards, and predictive analytics. We are often engaged to address high-stakes analytic questions, including issues that help to set long-term strategy. In other cases, clients hire us to help them build their internal capabilities. We have helped build several brand new analytic teams for clients, which continue to generate millions of dollars of profits with their fact-based recommendations.

Our methodology is based on Stephen Covey's principle "begin with the end in mind": the concept of starting with the client's goal and working backwards from there. I often explain that our methods are what you would have gotten if Stephen Covey had been a data analyst…we are applying his principles to the world of data analytics.

Ajay- Analytics requires more and more data while privacy requires the least possible data. What do you think are the guidelines that need to be built in sharing internet browsing and user activity data and do we need regulations just like we do for sharing financial data.

 Jaime- Great question. This is an essential challenge of the big data era. My perspective is that firms who depend on user data for their analysis need to take responsibility for protecting privacy by using data management best practices. Best practices to adequately “mask” or remove private data exist…the problem is that these best practices are often not applied. For example, Facebook’s practice of sharing unique user IDs with third-party application companies has generated a lot of criticism, and could have been avoided by applying data management best practices which are well known among the data management community.

If I were able to influence public policy, my recommendation would be to adopt a core set of simple but powerful data management standards that would protect consumers from perhaps 95% of the privacy risks they face today. The number one standard would be to prohibit sharing of static, personally identifiable user IDs between companies in a manner that creates “privacy risk.” Companies can track unique customers without using a static ID…they need to step up and do that.
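(One possible realization of that idea, sketched under my own assumptions rather than any company's practice: derive a partner-specific pseudonym with a keyed hash and rotate the key, so the identifier shared externally is never a static join key.)

```python
# Non-static pseudonymous IDs via a keyed, rotating hash (illustrative).
import hashlib
import hmac

def pseudonym(user_id: str, partner: str, period_key: bytes) -> str:
    msg = f"{partner}:{user_id}".encode()
    return hmac.new(period_key, msg, hashlib.sha256).hexdigest()[:16]

# The same user looks different to each partner, and the ID changes
# whenever the key is rotated.
print(pseudonym("user-42", "partner-A", b"key-2011-q3"))
print(pseudonym("user-42", "partner-B", b"key-2011-q3"))
print(pseudonym("user-42", "partner-A", b"key-2011-q4"))
```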

Ajay- What are your favorite text analytics software packages to work with.

Jaime- Because much of our work is deeply embedded into client operations and systems, we often use the software our clients already prefer. We avoid recommending specific vendors unless our client requests it. In tandem with our clients and alliance partners, we have particular respect for Autonomy, Open Text, Clarabridge, and Attensity.

Biography-

http://www.fitzgerald-analytics.com/jaime_fitzgerald.html

The Founder and President of Fitzgerald Analytics, Jaime has developed a distinctively quantitative, fact-based, and transparent approach to solving high stakes problems and improving results.  His approach enables translation of Data to Dollars™ using methodologies clients can repeat again and again.  He is equally passionate about the “human side of the equation,” and is known for his ability to link the human and the quantitative, both of which are needed to achieve optimal results.

Experience: During more than 15 years serving clients as a management strategy consultant, Jaime has focused on customer experience and loyalty, customer profitability, technology strategy, information management, and business process improvement.  Jaime has advised market-leading banks, retailers, manufacturers, media companies, and non-profit organizations in the United States, Canada, and Singapore, combining strategic analysis with hands-on implementation of technology and operations enhancements.

Career History: Jaime began his career at First Manhattan Consulting Group, specialists in financial services, and was later a Co-Founder at Novantas, the strategy consultancy based in New York City.  Jaime was also a Manager for Braun Consulting, now part of Fair Isaac Corporation, and for Japan-based Abeam Consulting, now part of NEC.

Background: Jaime is a graduate of Harvard University with a B.A. in Economics.  He is passionate and supportive of innovative non-profit organizations, their effectiveness, and the benefits they bring to our society.

Upcoming Speaking Engagements: Jaime is a frequent speaker on analytics, information management strategy, and data-driven profit improvement. He recently gave keynote presentations on Analytics in Financial Services for The Data Warehousing Institute, the New York Technology Council, and the Oracle Financial Services Industry User Group. A list of Jaime's most interesting presentations on analytics can be found here.

He will be presenting a client case study this fall at Text Analytics World on "New Insights from 'Big Legacy Data': The Role of Text Analytics".

Connecting with Jaime: Jaime can be found on LinkedIn and Twitter. He edits the Fitzgerald Analytics Blog.

Interview Rapid-I – Ingo Mierswa and Simon Fischer

Here is an interview with Dr. Ingo Mierswa, CEO of Rapid-I, and Dr. Simon Fischer, Head of R&D. Rapid-I makes the very popular software RapidMiner – perhaps one of the earliest leading open-source software packages in business analytics and business intelligence. It is quite easy to use and deploy, and with its extensions and innovations (including compatibility with R) it has continued to grow tremendously through the years.

In an extensive interview, Ingo and Simon talk about the algorithms marketplace, extensions, big data analytics, Hadoop, mobile computing, and the use of the graphical user interface in analytics.

Special thanks to Nadja from the Rapid-I communication team for helping coordinate this interview. (Statutory blogging disclosure: Rapid-I is a marketing partner with DecisionStats as per the terms in https://decisionstats.com/privacy-3/)

Ajay- Describe your background in science. What are the key lessons that you have learnt as a scientific researcher, and what advice would you give to new students today.

Ingo: My time as a researcher really was a great experience which has influenced me a lot. I worked at the AI lab of Prof. Dr. Katharina Morik, one of the people who brought machine learning and data mining to Europe. Katharina always believed in what we were doing, encouraged us, and gave us the space to try out new things. Funnily enough, I have never managed to use my own scientific results in any real-life project so far, but I consider this a quite common gap between science and the "real world". At Rapid-I, however, we are still heavily connected to the scientific world and try to combine the best of both worlds: solving existing problems with leading-edge technologies.

Simon: In fact, during my academic career I did not work in the field of data mining at all. I worked in a field some of my colleagues would probably even consider boring, and that is theoretical computer science. To be precise, my research was at the intersection of game theory and network theory. During that time, I learnt a lot of exciting things, none of which had any business use. Still, I consider that a very valuable experience. When we at Rapid-I hire people coming to us right after graduating, I don't care whether they know the latest technology with a fancy three-letter acronym – that will be forgotten more quickly than it came. What matters is the way you approach new problems and challenges. And that is also my recommendation to new students: work on whatever you like, as long as you are passionate about it and it brings you forward.

Ajay- How is the RapidMiner extensions marketplace moving along. Do you think there is scope for people to, say, create algorithms in a platform like R, and then offer those algorithms as apps for sale, just like iTunes or Android apps.

Simon: Well, of course it is not going to be exactly like iTunes or Android apps, because of the more business-oriented character. But in fact there is scope for that, yes. We have talked to several developers, e.g., at our user conference RCOMM, and several people would be interested in such an opportunity. Companies using data mining software need supported software packages, not just something they downloaded from some anonymous server, and that is only possible through a platform like the new Marketplace. Besides that, the marketplace will not only host commercial extensions. It is also meant to be a platform for all the developers who want to publish their extensions to a broader community and make them accessible in a comfortable way. Of course they could just place them on their personal web pages, but who would find them there? From the Marketplace, they are installable with a single click.

Ingo: What I like most about the new Rapid-I Marketplace is the fact that people can now get something back for their efforts. Developing a new algorithm is a lot of work, in some cases even more than developing a nice app for your mobile phone. It is completely accepted that people buy apps from a store for a couple of dollars, and I foresee the same for sharing and selling algorithms instead of apps. Right now, people can already share algorithms and extensions for free; one of the next versions will also support selling those contributions. Let's see what happens next; maybe we will add the option to sell complete RapidMiner workflows or even some data pools…

Ajay- What are the recent features in RapidMiner that support cloud computing, mobile computing, and tablets. How do you think the landscape for Big Data (over 1 TB) is changing, and how is RapidMiner adapting to it.

Simon: These are areas we are very active in. For instance, we have an In-Database-Mining Extension that allows the user to run their modelling algorithms directly inside the database, without ever loading the data into memory. Using analytic databases like Vectorwise or Infobright, this technology can really boost performance. Our data mining server, RapidAnalytics, already offers functionality to send analysis processes into the cloud. In addition to that, we are currently preparing a research project dealing with data mining in the cloud. A second project is targeted towards the other aspect you mention: the use of mobile devices. This is certainly a growing market, of course not for designing and running analyses, but for inspecting reports and results. But even that is tricky: When you have a large screen you can display fancy and comprehensive interactive dashboards with drill downs and the like. On a mobile device, that does not work, so you must bring your reports and visualizations very much to the point. And this is precisely what data mining can do – and what is hard to do for classical BI.

Ingo: Then there is Radoop, which you may have heard of. It uses the Apache Hadoop framework for large-scale distributed computing to execute RapidMiner processes in the cloud. Radoop has been presented at this year’s RCOMM and people are really excited about the combination of RapidMiner with Hadoop and the scalability this brings.

Ajay- Describe the RapidMiner analytics certification program and what steps you are taking to partner with universities.

Ingo: The Rapid-I Certification Program was created to recognize professional users of RapidMiner or RapidAnalytics. The idea is that certified users have demonstrated a deep understanding of the data analysis software solutions provided by Rapid-I and how they are used in data analysis projects. Taking part in the Rapid-I Certification Program offers a lot of benefits for IT professionals as well as for employers: professionals can demonstrate their skills and employers can make sure that they hire qualified professionals. We started our certification program only about 6 months ago, and about 100 professionals have been certified so far.

Simon: During our annual user conference, the RCOMM, we have plenty of opportunities to talk to people from academia. We’re also present at other conferences, e.g. at ECML/PKDD, and we are sponsoring data mining challenges and grants. We maintain strong ties with several universities all over Europe and the world, which is something that I would not want to miss. We are also cooperating with institutes like the ITB in Dublin during their training programmes, e.g. by giving lectures, etc. Also, we are leading or participating in several national or EU-funded research projects, so we are still close to academia. And we offer an academic discount on all our products 🙂

Ajay- Describe the global efforts in making RapidMiner a truly international software product, including the spread of developers, clients, and employees.

Simon: Our clients already are very international. We have a partner network in America, Asia, and Australia, and, while I am responding to these questions, we have a training course in the US. Developers working on the core of RapidMiner and RapidAnalytics, however, are likely to stay in Germany for the foreseeable future. We need specialists for that, and it would be pointless to spread the development team over the globe. That is also owed to the agile philosophy that we are following.

Ingo: Simon is right, Rapid-I is already acting on an international level. Rapid-I now has more than 300 customers from 39 countries in the world, which is a great result for a young company like ours. We are of course very strong in Germany and the rest of Europe, but we also concentrate on more countries by means of our very successful partner network. Rapid-I continues to build this partner network and to recruit dynamic and knowledgeable partners in the future. However, extending and acting globally is definitely part of our strategic roadmap.

Biography

Dr. Ingo Mierswa works as Chief Executive Officer (CEO) of Rapid-I. He has several years of experience in project management, human resources management, consulting, and leadership, including eight years of coordinating and leading the multi-national RapidMiner developer team, with about 30 developers and contributors worldwide. He wrote his PhD thesis, titled "Non-Convex and Multi-Objective Optimization for Numerical Feature Engineering and Data Mining", at the University of Dortmund under the supervision of Prof. Morik.

Dr. Simon Fischer heads research & development at Rapid-I. His interests include game theory and networks, the theory of evolutionary algorithms (e.g., on the Ising model), and theoretical and practical aspects of data mining. He wrote his PhD thesis in Aachen, where he worked on the project "Design and Analysis of Self-Regulating Protocols for Spectrum Assignment" within the excellence cluster UMIC. Before that, he worked on the vtraffic project within DFG Programme 1126, "Algorithms for large and complex networks".

http://rapid-i.com/content/view/181/190/ tells you more about the various types of RapidMiner licensing for enterprise, individual, and developer versions.

(Note from Ajay- to receive an early edition invite to Radoop, click here http://radoop.eu/z1sxe)


Top 25 Errors in Programming that lead to hacker attacks

I am elaborating on an earlier article at https://decisionstats.com/top-25-most-dangerous-software-errors/ based on my continued research into cyber conflict and strategy. My inputs are in italics – the rest is a condensed article for further thought.

This is thus a very useful initiative for the world to follow in upgrading its cyber security.

It is in accordance with the US policy of securing its cyber infrastructure (http://www.whitehouse.gov/the-press-office/remarks-president-securing-our-nations-cyber-infrastructure), and countries like India, and even Europe as well as other nations, could do well to at least benchmark their own security practices in software and digital infrastructure against it. There seems to be much better technical coordination between rogue hackers than patriotic hackers imho 😉


The Department of Homeland Security of the United States of America has just launched a list of the top 25 errors in programming or creating software that increase vulnerability to hacking attacks. The list, which is available at http://cwe.mitre.org/top25/index.html, lays out a methodology for measuring vulnerability called the Common Weakness Scoring System (CWSS), and uses that score to rank the various errors, as well as offering suggestions to eliminate these weaknesses or errors.
Measuring Weaknesses

The importance of a weakness (one that arises due to software bugs) may vary depending on business usage or project implementation, the technologies, operating systems, and computing environments in use, and the risk or threat perception. The Common Weakness Scoring System (CWSS) provides a mechanism for scoring weaknesses and a framework for prioritizing security errors ("weaknesses") that are discovered in software applications.
Identifying Weaknesses
For example, the number 1 weakness is CWE-89: Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection').
The rest of the weaknesses are

RANK SCORE ID NAME
[1] 93.8 CWE-89 Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)
[2] 83.3 CWE-78 Improper Neutralization of Special Elements used in an OS Command (‘OS Command Injection’)
[3] 79.0 CWE-120 Buffer Copy without Checking Size of Input (‘Classic Buffer Overflow’)
[4] 77.7 CWE-79 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)
[5] 76.9 CWE-306 Missing Authentication for Critical Function
[6] 76.8 CWE-862 Missing Authorization
[7] 75.0 CWE-798 Use of Hard-coded Credentials
[8] 75.0 CWE-311 Missing Encryption of Sensitive Data
[9] 74.0 CWE-434 Unrestricted Upload of File with Dangerous Type
[10] 73.8 CWE-807 Reliance on Untrusted Inputs in a Security Decision
[11] 73.1 CWE-250 Execution with Unnecessary Privileges
[12] 70.1 CWE-352 Cross-Site Request Forgery (CSRF)
[13] 69.3 CWE-22 Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
[14] 68.5 CWE-494 Download of Code Without Integrity Check
[15] 67.8 CWE-863 Incorrect Authorization
[16] 66.0 CWE-829 Inclusion of Functionality from Untrusted Control Sphere
[17] 65.5 CWE-732 Incorrect Permission Assignment for Critical Resource
[18] 64.6 CWE-676 Use of Potentially Dangerous Function
[19] 64.1 CWE-327 Use of a Broken or Risky Cryptographic Algorithm
[20] 62.4 CWE-131 Incorrect Calculation of Buffer Size
[21] 61.5 CWE-307 Improper Restriction of Excessive Authentication Attempts
[22] 61.1 CWE-601 URL Redirection to Untrusted Site (‘Open Redirect’)
[23] 61.0 CWE-134 Uncontrolled Format String
[24] 60.3 CWE-190 Integer Overflow or Wraparound
[25] 59.9 CWE-759 Use of a One-Way Hash without a Salt
Details of each weakness are given at http://cwe.mitre.org/top25/index.html#Details
It includes Summary, Weakness Prevalence, Consequences, Remediation Cost, Ease of Detection, Attacker Awareness, and Attack Frequency. In addition, the following sections describe each software vulnerability in detail: Technical Details, Code Examples, Detection Methods, References, Prevention and Mitigation, Related CWEs, and Related Attack Patterns.
Other important software weaknesses are –

[26] CWE-770: Allocation of Resources Without Limits or Throttling
[27] CWE-129: Improper Validation of Array Index
[28] CWE-754: Improper Check for Unusual or Exceptional Conditions
[29] CWE-805: Buffer Access with Incorrect Length Value
[30] CWE-838: Inappropriate Encoding for Output Context
[31] CWE-330: Use of Insufficiently Random Values
[32] CWE-822: Untrusted Pointer Dereference
[33] CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization (‘Race Condition’)
[34] CWE-212: Improper Cross-boundary Removal of Sensitive Data
[35] CWE-681: Incorrect Conversion between Numeric Types
[36] CWE-476: NULL Pointer Dereference
[37] CWE-841: Improper Enforcement of Behavioral Workflow
[38] CWE-772: Missing Release of Resource after Effective Lifetime
[39] CWE-209: Information Exposure Through an Error Message
[40] CWE-825: Expired Pointer Dereference
[41] CWE-456: Missing Initialization
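To make the top-ranked weakness concrete, here is a minimal sketch of CWE-89 (SQL injection) and its standard fix, using Python's built-in sqlite3 module as a stand-in database:

```python
# CWE-89 in miniature: string concatenation vs. a parameterized query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 1), ('bob', 0)")

user_input = "bob' OR '1'='1"    # attacker-controlled value

# VULNERABLE: untrusted input is concatenated into the statement, and
# the injected OR clause makes the query return every row.
rows = conn.execute(
    f"SELECT * FROM users WHERE name = '{user_input}'").fetchall()
print("injected:", rows)

# FIXED: a parameterized query treats the input as data, not as SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print("parameterized:", rows)
```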
Mitigating Weaknesses
Here is the new matrix of mitigations, which also lists the top 25 errors. It shows ways to fix the weaknesses and the relative impact of each mitigation on each weakness.
http://cwe.mitre.org/top25/mitigations.html#MitigationMatrix

Effectiveness ratings include:

  • High: The mitigation has well-known, well-understood strengths and limitations; there is good coverage with respect to variations of the weakness.
  • Moderate: The mitigation will prevent the weakness in multiple forms, but it does not have complete coverage of the weakness.
  • Limited: The mitigation may be useful in limited circumstances, only be applicable to a subset of this weakness type, require extensive training/customization, or give limited visibility.
  • Defense in Depth (DiD): The mitigation may not necessarily prevent the weakness, but it may help to minimize the potential impact when an attacker exploits the weakness.

Within the matrix, the following mitigations are identified:


  • M1: Establish and maintain control over all of your inputs.
  • M2: Establish and maintain control over all of your outputs.
  • M3: Lock down your environment.
  • M4: Assume that external components can be subverted, and your code can be read by anyone.
  • M5: Use industry-accepted security features instead of inventing your own.
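(In practice, M1 usually means whitelist validation of every input before it reaches the rest of the program; a minimal sketch, with a made-up username rule:)

```python
# Minimal sketch of mitigation M1: whitelist-validate all inputs.
import re

ALLOWED_USERNAME = re.compile(r"^[a-z][a-z0-9_]{2,15}$")   # made-up policy

def read_username(raw: str) -> str:
    if not ALLOWED_USERNAME.fullmatch(raw):
        raise ValueError("rejected: username fails the whitelist")
    return raw

print(read_username("alice_01"))        # accepted
print(read_username("bob' OR '1'='1"))  # raises ValueError
```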

The following general practices are omitted from the matrix:

  • GP1: Use libraries and frameworks that make it easier to avoid introducing weaknesses.
  • GP2: Integrate security into the entire software development lifecycle.
  • GP3: Use a broad mix of methods to comprehensively find and prevent weaknesses.
  • GP4: Allow locked-down clients to interact with your software.


M1 M2 M3 M4 M5 CWE
High DiD Mod CWE-22: Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
Mod High DiD Ltd CWE-78: Improper Neutralization of Special Elements used in an OS Command (‘OS Command Injection’)
Mod High Ltd CWE-79: Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)
Mod High DiD Ltd CWE-89: Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)
Mod DiD Ltd CWE-120: Buffer Copy without Checking Size of Input (‘Classic Buffer Overflow’)
Mod DiD Ltd CWE-131: Incorrect Calculation of Buffer Size
High DiD Mod CWE-134: Uncontrolled Format String
Mod DiD Ltd CWE-190: Integer Overflow or Wraparound
High CWE-250: Execution with Unnecessary Privileges
Mod Mod CWE-306: Missing Authentication for Critical Function
Mod CWE-307: Improper Restriction of Excessive Authentication Attempts
DiD CWE-311: Missing Encryption of Sensitive Data
High CWE-327: Use of a Broken or Risky Cryptographic Algorithm
Ltd CWE-352: Cross-Site Request Forgery (CSRF)
Mod DiD Mod CWE-434: Unrestricted Upload of File with Dangerous Type
DiD CWE-494: Download of Code Without Integrity Check
Mod Mod Ltd CWE-601: URL Redirection to Untrusted Site (‘Open Redirect’)
Mod High DiD CWE-676: Use of Potentially Dangerous Function
Ltd DiD Mod CWE-732: Incorrect Permission Assignment for Critical Resource
High CWE-759: Use of a One-Way Hash without a Salt
DiD High Mod CWE-798: Use of Hard-coded Credentials
Mod DiD Mod Mod CWE-807: Reliance on Untrusted Inputs in a Security Decision
High High High CWE-829: Inclusion of Functionality from Untrusted Control Sphere
DiD Mod Mod CWE-862: Missing Authorization
DiD Mod CWE-863: Incorrect Authorization