V writes (in response to this post):
- Is there a Business Requirement sign-off in DS? At what stage does it come?
- In real-life DS, do the customers want more inference or "black box" methods?
- Do you need to do web scraping to get additional supporting data, in addition to the customer data? I mean, in actual business scenario, how much of web scraping is done?
- In which scenarios would the business separate out Data Engineering and Data Modelling? I assume that it isn't cost-effective to separate the two roles but I may be wrong.
- How do I know I have done my best in creating meaningful features and the model can not be improved further?
Hey V, these are great questions that get into the intersection of data science and business! I'm happy that you're thinking about them.
Yes, there's often a set of requirements. I work with the business to determine the benefit they would like to see (e.g., automating a process 95% of the time, improving revenue, etc.) and the deliverables (e.g., a product categorisation API, a recommender system). This is done early so we don't invest effort in deliverables that don't get used.
This depends. Initially, customers might want something more understandable (e.g., regression, decision trees), though the level of comfort varies across people. As we earn their trust, we get more free rein, including using more black box approaches.
I seldom do web scraping. I find that the effort required to clean up that data is usually not worth it. If I did need to scrape, I would probably use a combination of Selenium and Python libraries (e.g., Beautiful Soup, spaCy).
This usually depends on the size of the overall data team. As teams get larger and more mature, there's usually a tendency to specialise, and that's when the roles are separated. Nonetheless, some teams (such as at Stitch Fix) deliberately keep the generalist data scientist role so that one person works end-to-end.
It's hard to say how good is "good enough". It's an unbounded problem, almost like asking how much fraud caught is good enough. What I usually try to do, though, is time-box it: how much does the additional model performance from better features cost? And then I work from there. Alternatively, you could try to brute force it and perform operations between each pair of features (e.g., add, subtract, multiply, divide), though this might lead to overfitting. Feature statistics help too.
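To make the brute-force idea concrete, here is a minimal sketch in plain Python. The function name `pairwise_features`, the `eps` guard, and the sample `price` and `quantity` columns are all hypothetical, chosen only for illustration. It is one possible way to generate the four arithmetic combinations for every pair of numeric features, not a recommended recipe.

```python
from itertools import combinations


def pairwise_features(rows, eps=1e-9):
    """Add sum, difference, product, and ratio for every pair of features.

    `rows` is a list of dicts mapping feature name -> float. Each returned
    dict holds the original features plus the generated ones.
    """
    if not rows:
        return []
    names = sorted(rows[0])  # fixed order so generated names are stable
    expanded = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(names, 2):
            x, y = row[a], row[b]
            new_row[f"{a}_plus_{b}"] = x + y
            new_row[f"{a}_minus_{b}"] = x - y
            new_row[f"{a}_times_{b}"] = x * y
            # Assumption: fall back to 0.0 when the denominator is ~zero.
            new_row[f"{a}_div_{b}"] = x / y if abs(y) > eps else 0.0
        expanded.append(new_row)
    return expanded


# Made-up example data, for illustration only.
rows = [
    {"price": 10.0, "quantity": 2.0},
    {"price": 4.0, "quantity": 0.0},
]
expanded = pairwise_features(rows)
```

With n original features this adds 4 × n(n−1)/2 new columns, so the feature count grows quadratically. That growth is where the overfitting risk comes from, which is why it is worth checking any gain on held-out data before keeping the generated features.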