User:MPopov (WMF)/Notes/Text categorization

Consider these two scenarios:

  1. You want to group articles together but you don't know what the groups are ahead of time
  2. You want to categorize some piece of text – for example:
    • which job category a job posting belongs to
    • who wrote a document when the author is unknown
    • whether a user comment is harassment
    • whether an email is spam
    • whether a review is positive or negative

These are examples of problems in natural language processing and in text mining, a combination of statistical analysis, machine learning, and information retrieval. In all of these cases, what you're interested in is a predictive model which, when given an input (data), predicts some output (category).

Terminology

A document is a single unit of analysis and is made up of tokens (usually individual words, but tokens can also be sequences of words called n-grams). A document can be any size:

  • each comment on a Talk page
  • a whole page (either an article or an article's Talk page)
  • each chapter in a book
  • each book in a library

A document may also be made up of smaller documents so that analysis can be performed hierarchically. For example, instead of analyzing the toxicity of a Talk page as a single blob of text, you might analyze the toxicity of individual comments, aggregate those comment-level results into an analysis of individual topics/conversations, and aggregate those in turn into an analysis of the whole page.
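
For instance, here is a minimal sketch of that kind of hierarchical aggregation, assuming per-comment toxicity scores are already available; the topics, scores, and the choice of the mean as the aggregation function are all made up for illustration:

  # Hypothetical per-comment toxicity scores, grouped by conversation topic:
  comments = [
      {"topic": "Requested move", "toxicity": 0.05},
      {"topic": "Requested move", "toxicity": 0.90},
      {"topic": "Sources dispute", "toxicity": 0.10},
  ]

  # Aggregate comment-level scores up to the topic level (here: the mean):
  by_topic = {}
  for comment in comments:
      by_topic.setdefault(comment["topic"], []).append(comment["toxicity"])
  topic_toxicity = {t: sum(s) / len(s) for t, s in by_topic.items()}

  # Aggregate topic-level scores up to the page level:
  page_toxicity = sum(topic_toxicity.values()) / len(topic_toxicity)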

The process of breaking down a document into smaller units of analysis (usually tokens) is called tokenization, and it varies from language to language. Tokenization is useful for calculating term frequencies (counts of how many times each term appeared) and for obtaining embeddings (numerical representations of terms, useful for calculating similarities).
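
As a minimal sketch of tokenization and term frequencies, using scikit-learn's CountVectorizer (version 1.0+); the two example documents are made up, and real wiki text would need language-appropriate tokenization:

  from sklearn.feature_extraction.text import CountVectorizer

  docs = [
      "the cat sat on the mat",
      "the dog sat on the log",
  ]

  # Tokenize into unigrams and bigrams and count term frequencies:
  vectorizer = CountVectorizer(ngram_range=(1, 2))
  term_counts = vectorizer.fit_transform(docs)  # documents-by-terms matrix

  print(vectorizer.get_feature_names_out())  # the extracted tokens/n-grams
  print(term_counts.toarray())               # per-document term frequencies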

Scenarios

Scenario 1: Topic Modeling

If you don't know ahead of time which groups your documents fall into, that's a job for topic modeling. The topics (groups) are a latent (hidden) variable, and you use statistical models to infer each document's membership in those unknown groups.
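
As an illustration, here is a minimal topic-modeling sketch using latent Dirichlet allocation (LDA) from scikit-learn; the documents and the choice of two topics are made up, and in practice the number of topics is something you would tune:

  from sklearn.decomposition import LatentDirichletAllocation
  from sklearn.feature_extraction.text import CountVectorizer

  docs = [
      "goalkeeper saved the penalty in the final",
      "the striker scored twice in the match",
      "parliament passed the budget bill",
      "the senate debated the new law",
  ]

  vectorizer = CountVectorizer(stop_words="english")
  term_counts = vectorizer.fit_transform(docs)

  # Infer 2 latent topics from the term counts:
  lda = LatentDirichletAllocation(n_components=2, random_state=42)
  doc_topics = lda.fit_transform(term_counts)  # per-document topic mixtures

  # Inspect the top terms in each inferred topic:
  terms = vectorizer.get_feature_names_out()
  for i, weights in enumerate(lda.components_):
      top_terms = [terms[j] for j in weights.argsort()[::-1][:3]]
      print(f"topic {i}:", top_terms)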

I recommend the following resources for learning more about and actually doing this:

Scenario 2: Classification

If you already know which groups your documents should be categorized into and you have example documents for each of those groups, this becomes a supervised learning (classification) problem. The idea is to train a model to predict classes (groups). Binary classification such as spam email detection (spam vs ham), toxic comment detection (toxic vs not), and sentiment analysis (positive vs negative) can be extended to more than two classes (e.g. which genre of music a song belongs to, which job category a job posting is for).
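
Here is a minimal sketch of a binary (spam vs ham) classifier, using a TF-IDF plus logistic regression pipeline from scikit-learn; the labeled examples are made up, and a real model would need far more training data:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Made-up labeled examples (1 = spam, 0 = ham):
  train_docs = [
      "win a free prize now", "cheap pills limited offer",
      "meeting notes attached", "are we still on for lunch tomorrow",
  ]
  train_labels = [1, 1, 0, 0]

  # Vectorize the text and fit the classifier in one pipeline:
  model = make_pipeline(TfidfVectorizer(), LogisticRegression())
  model.fit(train_docs, train_labels)

  print(model.predict(["free offer just for you"]))        # predicted class
  print(model.predict_proba(["free offer just for you"]))  # class probabilities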

I recommend the following resources for learning more about and actually doing this:

ML as a service

You could stop there if your project was just a one-off categorization exercise, but you may also be interested in categorizing documents on a regular basis. So, once you have a model you're satisfied with (acceptable accuracy, reasonable runtime for performing predictions, doesn't require too many resources), you can make it available as an API: you pass the data (e.g. documents) to an endpoint (local or hosted remotely), the application processes the received data and passes it to the model, the model outputs predictions, and the application responds to the web request with those predictions.
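
Here is a minimal sketch of such an endpoint using Flask; the model file name ("model.joblib", e.g. the classification pipeline above saved with joblib.dump) and the request/response shapes are assumptions for illustration:

  import joblib
  from flask import Flask, jsonify, request

  app = Flask(__name__)

  # Load a previously trained and serialized model:
  model = joblib.load("model.joblib")

  @app.route("/predict", methods=["POST"])
  def predict():
      # Expects a JSON body like {"documents": ["some text", ...]}:
      documents = request.get_json()["documents"]
      predictions = model.predict(documents)
      return jsonify({"predictions": predictions.tolist()})

  if __name__ == "__main__":
      app.run()

A client would then POST its documents as JSON to /predict and receive the predicted categories back as JSON.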

I recommend the following resources:

Making a predictive model available as an API is part of productionizing the model, but there are many other aspects (scalability, latency, dealing with concept drift, dealing with bias, ease of redeployment) involved in running ML in production. I recommend the following resources for learning about it:

Working with data

I recommend the following resources for learning how to work with data: