For the Love of Data: Interview with DataJoy Founder James Allen

Describe how you came up with the idea of setting up DataJoy. What are some of the things that you learnt while creating ShareLaTeX?

The idea for DataJoy came organically as we talked to users of ShareLaTeX about the difficulties in their research workflow, beyond just paper writing. Like with LaTeX, Python and R have a high learning curve for new users. Having to first worry about installing them and getting a working environment set up is a difficult hurdle for people when they just want to start getting a feel for the language itself. Basically we want to let you write and run your first few lines of Python or R as quickly as possible.

There are also the difficulties that people face with collaboration. Getting someone else in a position to be able to run your code can be hard, especially if you use a lot of specialist packages or specific versions. If you’re actively working together on some code, making sure you don’t get in each other’s way is difficult. Version control systems have a very steep learning curve and need your entire team to use them. We think the real-time nature of DataJoy is a nice middle ground that lets everyone work together without fear of overwriting or disrupting a collaborator’s work, but has no learning curve.

With ShareLaTeX, we realised that there is a huge silent majority of students and researchers who may not be very tech-literate but are actively engaged in the academic process. These people just want to achieve their end goal, whether it’s submitting an assignment, writing a paper, or analysing some data. They aren’t posting on Stack Overflow or reading blogs about best practices because they don’t care about the technology, they only care about getting their work done. These are the people who we’ve found that we can help the most.

I can set up an IPython notebook server on Amazon, and also RStudio Server (or just use an AMI which has both). What advantages does DataJoy give me as a data scientist? How is it different from R-fiddle?

Absolutely, and I don’t think DataJoy will ever replace this use-case. If you’re advanced enough in your understanding of your tools and the infrastructure behind them, then setting up a server on Amazon for yourself has a lot of benefits. However, there are a lot more people out there who want the benefits of a cloud environment, but wouldn’t know where to start with setting up their own server and are more focused on the results of their research than on learning how to do so.

Even as someone who does know how to set up such a server though, it’s still an extra piece of infrastructure that you need to manage and support. If you use DataJoy then you can let us do that for you and just focus on your actual data science workflow.

What are some of the ways that you have thought of monetizing this model of creating infrastructure for data scientists?

It’s still very early days and we’re still learning about the needs of different users, but I think there are likely to be 2 or 3 main sources of revenue for DataJoy:

Individual accounts for users looking for more compute resources or more advanced features,
Group and site licenses for teams in enterprises, universities, or teaching settings looking to move their whole team’s workflow to DataJoy,
Onsite installations

Are you thinking of expanding to include things like Spark et al for users?

We’re focusing on Python and R at the moment to make sure that we can provide the best user experience for these languages. However, our long term goal is to make DataJoy language agnostic so that you can bring your favourite language and toolchain and we’ll be able to support it. We have a very flexible infrastructure on the backend and the limitation to Python and R at the moment is to keep things simpler for us and users.

What are some case studies that you want to share?

We’re really excited about how DataJoy is being used in classrooms all over the world. I haven’t asked permission to share these stories publicly, so without naming names, I’m aware of a lecturer who is using DataJoy to run classes in an interactive way that just wasn’t possible before. He can present the lesson as code in DataJoy on the projector, and have all his students logged into the same project on their laptops. Students can fill in chunks of code as the lesson progresses and it appears immediately on the projector and on other students’ screens.

Likewise, another lecturer is using DataJoy as a way to distribute assignments to students and if they get stuck, she can quickly log in directly to the student’s project and help them debug it. This has saved her lots of unnecessary hassle of getting the student to email her the code, and then fighting with possible version mismatches or missing dependencies. Being able to see the problem in exactly the same context as the student has been invaluable.

These cases are really exciting to us because they open up completely new ways of teaching that just weren’t possible before.

Do you intend to make the code for DataJoy open source, or available to users who want to run their own DataJoy server on premise?

Yes, absolutely! ShareLaTeX is already open source and available for users to run individual instances. The DataJoy code base is branched from ShareLaTeX and is still in our open GitHub repositories. The only problem with DataJoy at the moment is that the infrastructure for running Python and R code on our backend is quite tied in to our specific architecture. As soon as we work out how to abstract that so that it can run easily anywhere, we will release DataJoy as an open source project.

What else is on your product roadmap for DataJoy?

At the moment we have two main focuses: Improving DataJoy for teaching, and improving the ease of use for new Python/R users. We want to make it easier for teachers to manage large classes of students and work interactively with them. We also want to make sure that we remove the roadblocks that new Python or R users face, including making error messages more clear, making it ridiculously easy to install any package (even ones that need compiling from source) and providing help, tutorials and examples at the right times.

Describe your own journey as a developer, hacker and entrepreneur. What advice would you give to young people entering data science and devops today?

I came to ShareLaTeX and DataJoy after doing a PhD in theoretical physics at Durham University which I finished in early 2013. I’d always had an interest in programming, and worked as a part-time web developer for a web hosting company while I was an undergraduate at Edinburgh studying maths. As a PhD student, I’d written a prototype LaTeX editor that had a bit of traction, and teamed up with my co-founder Henry to work on ShareLaTeX in 2012. Henry comes from a strong software development background and has helped me mature a lot as a software developer to be able to write and maintain large scale services.

I don’t have much experience doing data science directly, but my advice for all aspects of life would be to reach out and talk to as many people as possible, especially if they are doing interesting or different work from you. Only by getting lots of opinions (sometimes conflicting!) can you start to build up a realistic view of the world. Surrounding yourself with people you can learn from is very important too, and part of this. If you can’t find people in real life, then find good people to listen to online. Of course, always evaluate what they say with a critical eye :).

How would DataJoy enable coding, or even learning to code, on mobile phones?

We’d love to support DataJoy on mobile devices, but they present a number of unique technical challenges. We’ve found that what makes a nice user interface on a PC does not transfer to a tablet/phone very well, and so we’d need to redesign the whole experience. We also have to work with poorer network connections and offline usage. These are problems that we’re excited to tackle because I think it would let people work with Python and R in ways that haven’t been possible before, but for now we’re focused on improving the desktop/laptop experience.

(PS – I love DataJoy, and I have no commercial interests at all in them. I just get a kick from kicking tires in R and Python in a browser WITHOUT any installation hassles.)

https://www.getdatajoy.com/