My journey into data science

September 8, 2019

I started my first full-time data science job about eight months ago, and getting here has certainly been a learning experience! While the experience is still fresh in my mind, I wanted to share my thoughts. This post is a reflection on how I went from a PhD program in Cell and Molecular Biology to an industry data scientist.

What’s a data scientist?

I first heard the term ‘data science’ around 2015. As I wrapped up my first year of graduate school, one of my new lab-mates told me about a program called Insight Data Science. Insight takes newly-minted PhDs and trains them to be data scientists. I never went through the Insight program. What stuck with me, though, was how Insight defined a data scientist: someone that is “better at programming than a typical statistician, and better at statistics than a typical programmer.” Of course, this definition does not really explain what exactly a data scientist does, but learning this definition placed data science on my radar as a growing ‘alt-academic’ career path. (‘Alt-academic’ is a common and unfortunate term within academia — it refers to any career track that does not involve becoming a professor. These days, the vast majority of new PhDs do not become professors.)

Three years earlier, in 2012, Harvard Business Review declared Data Scientist the “sexiest job of the 21st century.” (Which, seven years later, is still cringe-worthy.) I was oblivious to all of this at the time. As the data science hype-train started to take off in 2012, I had just graduated with a B.S. in biochemistry. I went on to work for the American Red Cross Biomedical Services Division in my hometown of Philadelphia where I gained an appreciation for how data-driven the field of molecular biology had become. I decided that I’d go to graduate school to study some combination of biology and computer science. My only explicit goal was to do something more interesting than my lab tech position at the Red Cross. I still had never heard the term data science.

If you ask data scientists who are former academics to define data science, many will tell you that data science is just science. Scientific research has always involved analyzing data, and it has always required statistical skills. More recently, as the volume of data generated by the broader scientific enterprise has increased, it has also come to require programming. New areas of specialization emerged, like computational physics and computational biology. As a computational biologist, I learned how to manage large data sets, work on a supercomputing cluster, clean data from protein databases, visualize data, and build all sorts of statistical models. Many of these skills are also the core skills of a data scientist. Of course, I still had many gaps in my skill set, which I discuss below, but this notion of data-science-as-just-science is part of why you see so many data scientists with PhDs. You don’t need a PhD to be a data scientist, but the process of doing original research during a PhD program promotes the development of data science skills.

Building a data science skill set

About a year or so into my PhD program, I had decided that I was unlikely to stay in academia. Data science seemed like a better option, so I set out to position myself for a career in data science. At first, this was just attending meet-ups and discussion panels, and trying to talk to as many people in the industry as possible. I heard a few common themes:

The laundry list of skills a data scientist needs is overwhelming, but you don’t need every skill listed on a job posting to actually get a job
Machine learning skills, not just statistical modeling skills, were in high demand

Machine learning was an area where I knew I lacked experience. To address this gap, I took two online courses, one through edX and one through Coursera. The first, and probably best machine learning course I’ve taken, was Caltech’s Learning from Data. It’s definitely heavier on theory than most machine learning courses, but it provides a strong foundation on which to learn new machine learning techniques as they appear. The second course I took (or really, set of courses) was Andrew Ng’s Coursera Deep Learning Specialization. (A full review of all the online courses I’ve taken warrants a separate post, but I would now recommend fast.ai over Andrew Ng’s course.)

As I wrapped up these courses in late 2017, three years into my PhD program, I started to look more seriously at data science internship positions. I went to my university’s Science and Technology career fair, resume in hand. I heard a variety of things from employers:

Candidates with your background usually don’t have strong enough software engineering skills
Candidates usually have strong software engineering skills, but don’t have the statistical and machine learning skills needed
We’re looking for more applied data science experience, even for our internship positions
These online courses are nice, but you don’t have enough applied deep learning experience

These last two comments were, to say the least, incredibly disheartening. At the time I had several papers published in academic journals. These papers represented real, applied scientific research and statistical modeling. But my thesis projects did not include deep learning.

A meandering path to a data science internship

To gain applied deep learning experience, I started looking into Kaggle competitions. Most computer vision competitions – the ones that almost exclusively require deep learning – are won by data scientists with specialized hardware. Training deep learning models is orders of magnitude faster on an Nvidia GPU, which I did not have access to. Although there are plenty of cloud providers that will rent you access to an Nvidia GPU, I decided to take the more difficult route of building my own PC specifically for deep learning. I didn’t have a strong rationale for doing this, other than that it seemed fun. After scouring the internet for blog posts and getting a lot of help from the Reddit community, I purchased all of the parts for my own deep learning box. Total cost: $1200.

Building a $1200 PC is a significant investment for a graduate student. Within months, however, it paid off handsomely. I would easily describe building that PC as the best investment I’ve ever made in my data science career, but probably not for the reasons you think. I toiled away at a Humpback Whale Identification Challenge for a few weeks, and never actually submitted a model. Meanwhile, in the spring of 2018, a friend convinced me to attend the SXSW Start-Up Crawl, a networking event where attendees travel to several locations across Austin to meet representatives of start-ups. At the start-up crawl, my future employer, Valkyrie Intelligence, was demonstrating the Mark1, their new custom-built deep learning server. Having just built my own PC specifically for deep learning, I had a great conversation with Valkyrie founder Charlie Burgoyne about the unique hardware requirements for training and deploying deep learning models. I asked Charlie if we could meet for lunch to discuss careers in data science.

A few weeks later, I met Charlie for lunch at an Italian restaurant in downtown Austin. Over chicken parm sandwiches, we discussed the state of the data science industry as a whole. I learned about data science consulting. We discussed the difference between data science generalists and specialists and how the role of a data scientist was evolving. We talked about the value of bringing data-driven, empirical thinking to business problems. We also, of course, discussed my background and interests. I eventually asked if Valkyrie would be open to taking on interns. The short answer was yes.

After a few interviews with other members of the Valkyrie team, I was hired as a paid intern in May of 2018. I worked for Valkyrie part time while I wrapped up my dissertation work. The interview process did not include take-home projects or white-boarding or other coding challenges. We certainly discussed my background and some of the more technical aspects of my research, but at the time Valkyrie did not include coding challenges as part of the hiring process for interns. I only bring the interview process up because I’ve heard many horror stories about the grueling gauntlet of data science technical interviews. The more time I spend in data science, the more I’m convinced that these horror stories are not the norm.

After three months interning at Valkyrie, with my first client project wrapping up, I was offered a full-time position as a data scientist – four months out from my December graduation date. I started full-time at Valkyrie in January of this year.

Moving in all directions at once

It’s easy to spin a cohesive narrative about my journey from graduate student to data scientist in retrospect. At this point in the story, however, I’ve left out a whole lot of details about all of the things that I tried that did not obviously advance my career prospects. At many points I felt lost, overwhelmed, and helpless. I felt like I was running in ten directions at once, having no idea what would stick. Here’s a recap of some of those directions.

Networking events

I went to many, many data science meet-ups and networking events. At all of these events, I met a lot of people, and heard a lot of opinions on what it took to land a job in data science. Many people gave conflicting advice. As an introvert, I found these events exhausting. Eventually, attending networking events did pay off in two important ways: it encouraged me to develop my machine learning skills, and it led to an internship.

Job fairs

I went to one job fair during my job hunt. I talked to many companies who had data science teams, but couldn’t really tell me what those teams did. Sometimes two representatives within the same company would give conflicting advice on whether or not I was qualified for a data science position. I would like to think that the advice to develop applied deep learning experience was helpful, except that I was offered an internship without ever directly demonstrating that experience. Not to mention that deep learning just isn’t used as frequently in data science as the blog-o-sphere and job postings would have you believe. Most companies do not need deep learning.

Blogging

I created this blog under the vague notion that, to get a job in today’s economy, you need to develop your personal brand. You’ll also hear from other data scientists that the best way to demonstrate your data science skills is to blog about them. Well, since you can see that I posted at most 1-2 times a year and never discussed any specific data science side projects, I can’t say that having this blog improved my career prospects. I do think that blogging about side projects is an excellent way to demonstrate your skills. It’s just not the path that I took.

Advice for job seekers

Looking back at my experiences, I still think that I got incredibly lucky at several points on my career development path. I happened to join the laboratory of Claus Wilke, who later became a minor data science Twitter celebrity and published a book on data visualization. I decided to build a PC for deep learning just weeks before, by chance, I met Charlie and the Valkyrie team who had also been building their own deep learning hardware.

What I did learn from this adventure, though, is the incredible value of networking. In the data science job market, companies hire people they know. This is especially true of smaller companies, and of course, this is true of many industries outside of data science. For technically-minded introverts (I’m one of them), networking events can be tiring. But my advice to them is to go anyway. Networking is a skill that you can develop.

Once you’re in the industry, networking becomes even more valuable. I now receive informal job offers at meet-up events, and I have candidates come to me asking about positions at Valkyrie. I don’t have any intention of leaving my current position, but my options are wide open mostly thanks to networking.

Benjamin R Jack