August 2014 – DECISION STATS

While courts, politicians, activists, spies and even corporate leaders have spoken or ducked on the question of wholesale data collection by the NSA, one group that is both in the thick of the action and conspicuous by its silence is the data scientist community.

While one prominent open source member of the R community spoke out against analyzing the data leaked by WikiLeaks (an admirable stand given his background), no one seems perturbed enough to speak out about analyzing data belonging to fellow citizens and to the rest of the world. (see WHY I WILL NOT ANALYZE THE NEW WIKILEAKS DATA)

Data scientists and the intel communities have long worked together, and one of the SAS Institute’s solid cash cows remains its stranglehold on intel analytics 😉 (see http://www.sas.com/resources/brochure/government-intelligence-community-overview-brochure.pdf). What is perplexing is the deafening silence regarding the violation of the Fourth Amendment rights of American citizens, domestically and abroad (see http://en.wikipedia.org/wiki/Fourth_Amendment_to_the_United_States_Constitution), and the active collusion in it, primarily by data scientists.

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

Are not your email, your social media and your mobile data a part of your person, house, papers and effects? Does Atlas need to shrug, and do data scientists need to say enough is enough to stop this blatant misuse?

No, because the well-funded NSA and DOD budgets will always outweigh the conscience of a few data scientists. There will always be hackers for hire, and the people shall be led by a sheep.

The Numerati, the numerically enabled technological elite of data scientists, are as culpable as the agencies using them. This can be addressed by lawsuits against compliant statisticians and data miners as well, since they are the ones enabling the violation of Fourth Amendment rights. Countries like India have chosen to feed off this data trough, and countries like China have chosen to create their own walled-off internet instead. It is American data scientists alone who can help guide their Congress to sanity. The timing is pertinent as Congress debates amending the Foreign Intelligence Surveillance Act.

“the proposed changes would not touch the agency’s abilities overseas, which are authorized by Executive Order 12333, a Reagan-era presidential directive. The administration has declassified some rules for handling Americans’ messages gathered under the order, but the scope of that collection and other details about how the messages are used has remained unclear.”

(see http://www.nytimes.com/2014/08/14/us/politics/reagan-era-order-on-surveillance-violates-rights-says-departing-aide.html , https://www.documentcloud.org/documents/836235-ussid-sp0018.html and http://www.nytimes.com/2014/07/25/us/politics/senators-bill-is-stricter-on-nsa-than-houses.html )

Are you a data scientist who wants to help out? Help the ACLU educate Congress ( https://www.aclu.org/national-security/fix-fisa-end-warrantless-wiretapping ) on the proper way to dispose of private data and to anonymize the data already collected.

Otherwise the next generations will be born in an age where every move recorded by CC cameras or wearable computing devices will be mined by corporations for ads and governments for threats.

The last word on this was said by a wise old White Man (in an age where Wise Old White Men are no longer fashionable or even correct):

The tree of liberty must be refreshed from time to time with the blood of patriots and tyrants. – Thomas Jefferson

Maybe he was referring to this tree

Here is an interview with Christoph Waldhauser, a noted researcher and the founder and CEO of KDSS K Data Science Solutions, an R-based analytics advisory firm. In a generous and insightful interview, Christoph talks about the perceptions around open source, academia versus startups, Europe versus North America for technology, and his own journey through it all.

Ajay Ohri (AO)- Describe your career in science. At what point did you decide to become involved in open source projects, including R?

Christoph Waldhauser (CW)- When I did my second course on quantitative social science, the software we used was SPSS. At that time, the entire social science curriculum was built around that package. There were no student versions available, only a number of computer labs on campus that had SPSS installed. I had previously switched from Windows to Linux to cut down on licensing costs (as a student you are constantly short on money) and to try something new. So I was willing to try the same with R. In the beginning I was quite lost, having to work with survey-weighted data and with only rudimentary exposure to Perl before that. That was long before RStudio, and I started out with Emacs and ESS. As you might imagine, I landed in a world full of metaphorical pain. But in due time I was able to replicate all of the things we did in class. Even more, I was able to do them much more efficiently. And while my colleagues were only interpreting SPSS output, I understood where those numbers came from. This epiphany then guided me for the remainder of my scientific career. For instance, instead of Word I’d use LaTeX (back then a thing unheard of at my department), and I even put free/libre/open source software at the focus of the research for my master’s thesis.

Continuing down that path, I eventually started working at WU Vienna University of Economics and Business, and that led me to one of the centers of R development. There, most of the everyday work revolved around Free Software. The people there had a great impact on my understanding of Free Software and its influence on how we do research.

AO- Describe your work in social media analytics, including package creation for Google Plus and your publication on Twitter.

CW- Social media analytics is a very exciting field. The majority of research focuses on listening to the garden hose of social media data, that is, analyzing the communication revolving around certain keywords or communities, for instance, linking real-world events to the #euromaidan hashtag in Ukraine. I tread a different path: instead of looking at what all users have to say on a certain topic, I investigate how a certain user or class of users communicates across all the topics they bring up. So instead of following a top-down approach, I chose to go bottom-up.

Starting with the smallest building blocks of a social network has a number of advantages and eventually leads to Google+. The reason is that the utility of social media for Google is different from that for, say, Twitter or Facebook. While classical social media is used to engage followers, say with a lottery connected to the Facebook page of a brand, Google+ has an additional purpose: to enlist users to help Google produce better, more accurate search results. With this in mind, the focus shifts naturally from many users on one topic to how a single user can use Google+ to optimize the message they get across and manage the search terms they are associated with.

This line of argument has fueled my research in Google+ and Twitter: Which messages resonate most with the followers of a certain user? We know that each follower aims at resharing and liking messages she deems interesting. What precisely is interesting to a follower depends on her tastes. And if that will eventually lead to a reshare or not depends also on other factors like time of day and chance. For this, I’ve created a simulation framework that is centered on the preferences of individual users and their decision to reshare a social media message. In analyses of Twitter and Google+ content, we’ve found interesting patterns. For instance, there are significant differences in the types of message that are popular among followers of the US Democrats’ and Republicans’ Twitter accounts. I’m currently investigating if these observations can also be found in Europe. In the world of brand marketing, we’ve found significant differences in the wording of messages between localized Google+ pages. For instance, different mixtures of emotions in BMW’s German and US Google+ pages are key to increased reshare rates.
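To make the idea of a follower-centered reshare simulation concrete, here is a minimal sketch in Python. It is not Christoph's actual framework: the function names, the topic-weight representation of tastes, and the attention factor for time of day are all assumptions made for illustration. Each follower reshares a message with a probability that depends on how well the message matches her tastes and on whether she is active at that hour, and chance decides the rest.

```python
import random


def reshare_probability(message_topics, follower_tastes, hour, active_hours):
    """Probability that one follower reshares one message (hypothetical model).

    message_topics and follower_tastes map topic names to weights in [0, 1].
    """
    # Interest: how strongly the message's topics match the follower's tastes.
    interest = sum(
        weight * follower_tastes.get(topic, 0.0)
        for topic, weight in message_topics.items()
    )
    interest = max(0.0, min(1.0, interest))
    # Time of day: an assumed penalty when the follower is usually offline.
    attention = 1.0 if hour in active_hours else 0.1
    return interest * attention


def simulate_reshares(message_topics, followers, hour, rng):
    """Count reshares of one message among a list of followers."""
    count = 0
    for follower in followers:
        p = reshare_probability(
            message_topics, follower["tastes"], hour, follower["active_hours"]
        )
        # Chance: the follower reshares with probability p.
        if rng.random() < p:
            count += 1
    return count
```

Running simulate_reshares repeatedly for differently worded messages, with follower tastes estimated from real data, would be one way to compare which kinds of message resonate with a given audience.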

AO- What are some of the cultural differences in implementing projects in academia and in startups?

CW- This is a very hard question, mainly due to the fact that there is no one academia. Broadly speaking, quantitative academia can always be broken down into two classes that I like to refer to as science vs. engineering. Within this framework, science is seeking to understand why something is happening, often at the cost of practical implementability. The engineering mindset, on the other hand, focuses on producing working solutions and is less concerned with understanding why something behaves the way it does. Take for instance neural networks, which are currently enjoying somewhat of a renaissance due to deep learning approaches. In science, neural networks are not really popular because they are black boxes. You can use a neural network to produce great classifiers and use them to filter, e.g., the pictures of cats out of a stack of pictures. But even then it is not clear what factors are the defining essence of a cat. So while engineers might be happy to have found a useful classifier, scientists will not be content. This focus on understanding precludes many technological options for science. For instance, big data analyses aim at finding patterns that hold for most cases, but accept that the patterns don’t apply to every case. This is fine for engineering, but science would require theoretical explanations for each case that did not match the pattern. To me, this “rigor” has few practical benefits.

In startups, there is little place for science. Most startups are financed by some sort of borrowed capital. And these lenders are only interested in a return on their investment, not in insights or enlightenment. So, to me, there is little difference between academic engineering and startups. But I find that startups that want to do science proper will have a very hard time getting off to a good start. That is not to say it’s impossible, just more difficult.

AO- What do you think of the open access publishing movement as represented by http://arxiv.org/ and JSS? What are some of the costs and benefits for researchers that prevent wholesale adoption of the open access system, and how can these be addressed?

CW- I think it is important to differentiate between open access and preprints like arxiv.org. Open access merely means that articles are accessible without paying for access. As most research that is published has been paid for by the taxpayer anyway, it should also be freely accessible to them. Keeping information behind paywalls is a moral choice, and I think it is self-evident which side we as a scientific community should choose. I’d also question the argument of publishing houses that their services are costly. Which services? Copy-editing? Marketing stunts at conferences? I fail to see how these services are important to academia.

Turning to preprints, one must note that academic publishing is currently plagued by two flaws. One is the lack of transparency that leads to poor reviews. The other is academia’s use of publications as a quantitative indicator of academic success. This has led to a vast increase in results being submitted for publication: a publishing house that had to review hundreds of articles before is now facing thousands. Therefore, it is not uncommon today for authors to have to wait multiple years until a final decision has been made by the editors. And the longer the backlog of articles gets, the lower the quality of the reviews will become. This is unacceptable.

Preprints are a way around this deadlock. Findings can be accessed by fellow researchers even before a formal review has been completed. In an ideal world with impeccable review quality, this would lead to a watering down of the quality of the research available. This certainly poses a risk, but today’s reviews are far from flawless. More often than not, reviews fail to discover obvious flaws in research design, and reviewers barely ever check whether the data actually lead to the results published. So, whether relying on preprints or on reviewed articles, researchers always need to use common sense anyway: if five independent research groups come to the same conclusion, the papers are likely to be solid. This heuristic is somewhat similar to Wikipedia: it might not always be correct, but most of the time it is.

AO- What are some of the differences that you have encountered in the ecosystem for funding, research, and development, both in academia and in tech startups, in Europe as compared to North America?

CW- Living and working in a country that is increasingly being affected by the aftershocks of the Great Recession of 2007–2009 has left me somewhat disillusioned. In the face of the economic problems in Europe at the moment, most public funding has come to a full stop. Private capital is still somewhat available, but here too risk management has led to an increased focus on business plans and marketability. As pointed out above, this is less of a problem for engineering approaches (even though writing convincing business plans is challenging for scientists and engineers alike). But it is outright deadly to science. From what I see, North America has a different tradition. There, engineering generates so much revenue that part of that revenue goes back to science. An attitude we certainly lack in Europe is what Tim O’Reilly terms the makers’ mindset. We could use some more of that.

AO- In enterprise software, people often pay more for software that is bad compared to software that is open source. What are your thoughts on this?

CW- I’ve just had an interesting discussion with the head of a credit risk unit in a major bank. The unit is switching from SAS to SPSS for modeling credit risk. R, an equally capable or perhaps even superior free software solution, was not even considered. The rationale behind that is simple: in case the software is faulty, there is a company to blame and hold liable. Free software in general does not have software companies that back it in that way. So this appears to be the reason behind the psychological barrier to using free software. But I think it is a false security. Suppose every bank in the world uses either SAS or SPSS for credit risk modeling. And at one point, a fatal flaw in one of those two packages is discovered. This flaw is then likely to affect most of the banks that chose it. So within 24 hours of that flaw being discovered, the company backing the product will have to file for bankruptcy. It is somewhat ironic that people responsible for credit risk management don’t see that the high correlation introduced by all banks relying on the same software company does not mitigate but greatly inflates their risk.
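The correlation argument can be illustrated with a small Monte Carlo sketch in Python. This is an illustration under simplifying assumptions, not a risk model from the interview: the function name, the flaw probability, the even spread of banks across vendors, and the independence of flaws between vendors are all hypothetical. It compares how many banks are hit at once when everyone relies on a couple of vendors versus when each bank's software fails independently.

```python
import random
from statistics import mean, pstdev


def simulate_affected_banks(n_banks, n_vendors, p_flaw, n_trials, seed=0):
    """Return, per trial, how many banks are hit by a fatal software flaw.

    Assumptions (hypothetical): each vendor's package has a fatal flaw with
    probability p_flaw per period, independently of the other vendors, and
    banks are spread evenly across vendors.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        vendor_flawed = [rng.random() < p_flaw for _ in range(n_vendors)]
        # Bank i uses vendor i % n_vendors and is hit if that vendor is flawed.
        affected = sum(vendor_flawed[i % n_vendors] for i in range(n_banks))
        results.append(affected)
    return results


if __name__ == "__main__":
    # Two vendors shared by all banks versus one independent package per bank.
    concentrated = simulate_affected_banks(100, 2, 0.01, 2000)
    diversified = simulate_affected_banks(100, 100, 0.01, 2000)
    for label, runs in (("2 vendors", concentrated), ("100 vendors", diversified)):
        print(label, "mean:", mean(runs), "stdev:", pstdev(runs), "max:", max(runs))
```

The average number of affected banks is governed by p_flaw in both setups; what changes is the spread. With few vendors, a flaw hits a large block of banks simultaneously, which is the correlated tail risk described above.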

For example, some years ago a SAS executive said she’d feel more comfortable flying in a plane that had been developed using closed source rather than open source software, because closed source would provide increased quality. That line of argument has been thoroughly refuted. However, there is some truth in the fact that an investor might be more comfortable putting money in an aircraft company that relies on closed source software, for reasons of liability. Should a plane go down because of a closed source software bug, then the software company and not the aircraft company could be held liable. So any lawsuits against the aircraft company would be redirected to the software company. But at the end of the day, again, the software company will go out of business, leaving the aircraft company with the damage nonetheless.

So the trade-off between poorly designed or implemented, expensive closed source software and superior free software is made due to questions of liability. But the truth is that this is a false sense of security. I would therefore always argue in favor of free software.

About–

KDSS is a bleeding-edge, state-of-the-art data science advisory firm. You can see more of Christoph’s work at https://github.com/tophcito