A running example
There will be many practical use cases throughout the book, sometimes several in a single chapter. We will also have a running example: building a search engine. This problem is interesting for a number of reasons:
- It is fun
- Businesses in almost any domain can benefit from a search engine
- Many businesses already have text data; often it is not used effectively, and its use can be improved
- Processing text requires a lot of effort, and it is useful to learn to do this effectively
We will try to keep it simple. Even so, this example will let us touch on every technical part of the data science process over the course of the book:
- Data Understanding: Which data can be useful for the problem? How can we obtain this data?
- Data Preparation: Once the data is obtained, how can we process it? If it is HTML, how do we extract text from it? How do we extract individual sentences and words from the text?
- Modeling: Ranking documents by their relevance to a query is a data science problem, and we will discuss how it can be approached.
- Evaluation: The search engine can be tested to see whether it is useful for solving the business problem.
- Deployment: Finally, the engine can be deployed as a REST service or integrated directly into the live system.
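To make the Data Preparation and Modeling steps above a little more concrete, here is a minimal sketch using only the Java standard library. It is not the approach developed in later chapters: the class name `TinySearch` and all of its methods are hypothetical, the regex-based tag stripping is a crude stand-in for a real HTML parser, and the "relevance" score is simply a count of query-term occurrences, chosen only because it fits in a few lines.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy class: illustrates the steps only, not a real search engine.
final class TinySearch {

    // Crude assumption: tags can be removed with a regex.
    // Real-world HTML needs a proper parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Split text into sentences with the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy relevance score: how many document tokens also appear in the query.
    static int score(String query, String document) {
        Set<String> queryTerms = new HashSet<>(tokenize(query));
        int hits = 0;
        for (String t : tokenize(document)) {
            if (queryTerms.contains(t)) {
                hits++;
            }
        }
        return hits;
    }

    // Return a copy of the documents sorted by descending score.
    // List.sort is stable, so ties keep their original order.
    static List<String> rank(String query, List<String> documents) {
        List<String> ranked = new ArrayList<>(documents);
        ranked.sort(Comparator.comparingInt((String d) -> score(query, d)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "<p>Cooking pasta at home</p>",
                "<p>Building <b>search</b> engines in Java</p>",
                "<p>Java basics</p>");
        List<String> texts = new ArrayList<>();
        for (String page : pages) {
            texts.add(stripTags(page));
        }
        for (String doc : rank("java search", texts)) {
            System.out.println(doc);
        }
    }
}
```

Counting matching terms ignores word importance, document length, and word order, which is exactly why ranking is treated as a modeling problem rather than a string-matching one.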
We will obtain and prepare the data in Chapter 2, Data Processing Toolbox, understand the data in Chapter 3, Exploratory Data Analysis, build simple models and evaluate them in Chapter 4, Supervised Machine Learning - Classification and Regression, look at how to process text in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval, see how to apply it to millions of webpages in Chapter 9, Scaling Data Science, and, finally, learn how we can deploy it in Chapter 10, Deploying Data Science Models.