One of the hottest topics in tech right now is our focus for today: the present and the future of data science.
On this episode of CTO Studio, you’ll hear data science downloads from Christopher Keown, who runs a local machine learning and data science meetup in San Diego. You’ll also hear insights from Robert Swisher, the CTO of biproxi, and Alex Balazs, the chief architect at Intuit.
We specifically discuss when you should be adding data scientists to your team, the role of engineering in today’s data science environment, and why data science is actually very personal and very local. Join us for those discussions on episode 61 of CTO Studio!
In this episode, you’ll hear:
- What are black box models? (8:40)
- Where is the disconnect happening in data science right now? (16:05)
- When should you as a CTO be hiring a data scientist? (29:55)
- How much data do you need before you can develop predictions with machine learning? (38:35)
- Have we tapped into the true potential of data science yet? (47:45)
- And so much more!
We jump straight into Chris’ meetup groups: he started the San Diego Machine Learning group in 2017 as he was finishing up graduate school. He didn’t want to go into a career in data science, but he did want to be involved in the field.
Because so much is happening within San Diego, he thought he could bring people in from the community and they could work on data science together. The group was also created for people new to data science: it is a place where they can get some experience from people with more knowledge. And for the more seasoned veterans, the group is a place where they can hone their skills.
When I asked Christopher to tell us what data science is, he defined it from the data scientist’s perspective, which means it is about getting insight from your data. It’s about the story the data is telling. And which story you pursue is driven by your business objectives.
For example, whether your objective is to understand your customers and their behavior better or to increase profits, data science can give you insight into each of those objectives.
Next we transition into talking about Robert’s company, biproxi. biproxi is an end-to-end transaction platform for commercial real estate. They provide tools for the middle-market commercial real estate broker, and help those brokers run professional transactions online like the big firms. They have also just started releasing Zillow-type data for 32 million commercial real estate assets in the U.S., and they are the only company to do this.
Alex asks Chris about his machine learning group: what is the makeup of his attendees? He is asking within a certain context. When he started at Intuit 20 years ago there were make-file engineers, whose only job was to make files. Now we use the term full stack engineer, and that includes front end, back end, dev ops, and owning your own quality. More and more frequently, data and data science are being included in this term too. So the role of the engineer is really transforming in this era as data science becomes more and more real.
On today’s CTO Studio, Christopher weighs in with his thoughts on this, as well as on what kind of people actually attend his machine learning/data science meetup group. He says there are many stages in the data science pipeline from start to finish. The traditional computer scientists come in when more engineering-type questions are raised. They like the black and white problems, such as getting data from point A to point B.
But the part that is really novel is the real insight into the data, the exploratory process, and that is where the data scientists are coming on board. This is also the most popular part of the pipeline right now. His group sees a lot of people who are in their 30s and are going into this career now because it wasn’t available when they were in school.
His group learns through practice: they work on different data sets to see how well they can make predictions based on them. It’s something you can only learn through experience. You can’t just read it in a textbook. You have to do it over and over because every situation is different. By practicing every week in his group, the members develop those skills.
What are some of the data sets they have been experimenting with? The primary platform they use for data exploration is called Kaggle. Kaggle was founded independently of Google and has since been acquired by it. They set up contests, like one Chris’ group is involved in now. This particular contest tries to determine if a comment is toxic or not (on a site like Quora): is it racist, is it sexist, what kind of toxic is it?
There are a lot of aspects that go into answering this question. First, how do you turn a comment into numbers? You can’t do math on words. Then once you have those comments turned into numbers how can you actually make a prediction as to whether or not it is toxic? And how do you know you are making a good prediction, and how do you explain that prediction?
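To make those questions a little more concrete, here is a minimal sketch of one common way to turn comments into numbers (a "bag of words" count) and then predict toxicity with a small logistic regression. This is not the approach Chris' group used in the contest, which the episode doesn't spell out. The four training comments, their labels, and every function name here are made up for illustration, and a real contest entry would use far more data and a proper library.

```python
import math
from collections import Counter

def tokenize(comment):
    """Lowercase a comment and split it into alphabetic words."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in comment.lower())
    return cleaned.split()

def build_vocabulary(comments):
    """Assign each distinct word a column index (sorted so the order is stable)."""
    words = sorted({word for c in comments for word in tokenize(c)})
    return {word: index for index, word in enumerate(words)}

def vectorize(comment, vocabulary):
    """Turn a comment into word counts, one slot per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(comment)).items():
        if word in vocabulary:  # words never seen in training are ignored
            vector[vocabulary[word]] = count
    return vector

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(vectors, labels, epochs=200, learning_rate=0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def toxic_probability(comment, vocabulary, weights, bias):
    """Estimated probability (0 to 1) that a comment is toxic."""
    x = vectorize(comment, vocabulary)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical toy training set: 1 = toxic, 0 = not toxic.
comments = [
    "you are an idiot",
    "what an idiot answer",
    "thank you for the answer",
    "great answer thank you",
]
labels = [1, 1, 0, 0]

vocabulary = build_vocabulary(comments)
vectors = [vectorize(c, vocabulary) for c in comments]
weights, bias = train(vectors, labels)
```

Calling `toxic_probability("thank you", vocabulary, weights, bias)` then gives a number between 0 and 1. How you check whether that number is any good, and how you explain it, are the harder questions raised above.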
One of the challenges with these “black box” models is that you cannot understand their reasoning. Let’s say you are an insurance company and the model you use rejects someone. You need a model that also tells you why this person was rejected so you can tell them.
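For contrast, here is a rough sketch of what a model that can give its reasons might look like. It uses a simple linear risk score, where each input's contribution to the decision can be read off directly. The feature names, weights, and threshold are all hypothetical and hand-picked for illustration. They do not come from the episode or from any real insurer.

```python
import math

# Hypothetical, hand-picked weights for an illustrative linear risk model.
WEIGHTS = {"prior_claims": 0.9, "years_insured": -0.3, "missed_payments": 1.2}
BIAS = -1.0
THRESHOLD = 0.5  # assumed cutoff: reject when estimated risk is at or above this

def contributions(applicant):
    """How much each feature pushes the risk score up (+) or down (-)."""
    return {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}

def risk(applicant):
    """Estimated risk between 0 and 1 from the linear score."""
    z = BIAS + sum(contributions(applicant).values())
    return 1.0 / (1.0 + math.exp(-z))

def decide(applicant):
    """Return (rejected, reasons), with reasons ordered from strongest to weakest."""
    rejected = risk(applicant) >= THRESHOLD
    if not rejected:
        return rejected, []
    ranked = sorted(contributions(applicant).items(), key=lambda item: item[1], reverse=True)
    reasons = [name for name, value in ranked if value > 0]
    return rejected, reasons

applicant = {"prior_claims": 2, "years_insured": 1, "missed_payments": 1}
rejected, reasons = decide(applicant)
```

For this made-up applicant the decision comes back as rejected, with prior claims listed as the strongest reason, followed by missed payments. A black box model gives you the decision without that list, which is exactly the problem described above.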
Christopher also tells us about the typical tools data scientists work with, and Alex talks about the immense opportunity data science makes available to any and all industries today. We also dip our toes into the issue of probabilistic versus deterministic data science predictions. Join us to hear that fascinating discussion and more on this episode of CTO Studio.