Introduction
During live broadcasting on Twitch, many streamers and esports organizations struggle to extract interesting moments from the massive amount of stream data. For large esports organizations, maintaining staff to spot highlights during live broadcasts is expensive as well. A related problem is that live broadcasts usually go on for quite a long time and the streamer sometimes needs to take a break; at these moments, part of the audience leaves the stream. Showing interesting highlights during these breaks is a potential way to keep the audience watching.
Highlight detection is an active area of work. Many solutions have been proposed recently, and several large companies and startups have tried to offer ready-made products as well. Unfortunately, most of these services are not open source. To date, there is no ready-made solution that can be launched and tested out of the box, and several startups working on a similar problem have been forced to close. Perhaps the root cause of these failures is the complexity of the task: even a person cannot accurately formulate the definition of an interesting moment. This work is intended, among other things, to formulate that definition more precisely and to try to improve on the existing results of research and development.
Machine Learning module development
To solve the problem of finding interesting moments in live streams, I decided to split it into smaller pieces and combine several machine learning algorithms. First, I cut the live stream into 10-second clips; for each clip, I extracted the following information from the chat, audio, and video data:
- Assessment of the clip by the sentiment and number of chat messages at that moment in time;
- Assessment of the clip by the sound volume and the sound events detected with the PANN CNN for Audio Tagging;
- Assessment of the clip by the amount of movement from frame to frame, using Dense Optical Flow and a Convolutional Neural Network.
These features were then used by the metamodel to score each clip independently. Here is a rough architecture of the Machine Learning workflow:

Now let’s dive into each component of the Machine Learning module independently.
1. Twitch Chat Features
To extract data from Twitch chat, the choice was made to use the following chat features:
- Total number of messages for the 10-second clip
- Total number of messages with positive sentiment for the 10-second clip
- Total number of messages with negative sentiment for the 10-second clip
For this purpose, I used the brilliant transformers library, where sentiment analysis can be done in a few lines of code.
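To give an idea of what this looks like, here is a minimal sketch of how the chat features for one 10-second clip could be computed. The default sentiment pipeline and the `chat_features` helper name are my own illustration, not the project's exact code.

```python
from transformers import pipeline

# Sentiment pipeline from the transformers library; the default English model
# returns a label ("POSITIVE"/"NEGATIVE") and a score for each message.
sentiment = pipeline("sentiment-analysis")

def chat_features(messages):
    """Compute the three chat features for the messages of one 10-second clip.

    `messages` is a list of strings (the chat messages that fell into the clip).
    """
    if not messages:
        return {"num_messages": 0, "num_positive": 0, "num_negative": 0}

    results = sentiment(messages)
    return {
        "num_messages": len(messages),
        "num_positive": sum(1 for r in results if r["label"] == "POSITIVE"),
        "num_negative": sum(1 for r in results if r["label"] == "NEGATIVE"),
    }

print(chat_features(["LUL that was insane", "boring stream today"]))
```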
2. Motion features
With motion estimation, it gets more interesting. To understand the Optical Flow concept better, you may want to check out my post on using Optical Flow for Workout movement counting.
That said, the motion estimation model is based on the Optical Flow algorithm, which colors each pixel of the image so that pixels moving in the same direction get the same color. Optical Flow can also be configured to group pixels and highlight entire areas; thanks to this trick, frames with a lot of movement take on the following look:

Frames with little movement take on the following look:

As you can see, it is very easy to distinguish images of one type from the other. To make this work in the app, a simple Convolutional Neural Network (CNN) is used. The task of the network is to classify the Optical Flow-encoded pictures as either pictures with a lot of movement (class 1) or pictures with little movement (class 0). For such an easy task, even a CNN with the simplest architecture performs very well.
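As a rough illustration of this component, here is a sketch of how two consecutive frames could be encoded with dense (Farneback) Optical Flow into the color images shown above, plus a deliberately tiny CNN on top. The exact network used in the project may differ; the layers below are just a plausible minimal setup.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def flow_to_image(prev_frame, frame):
    """Encode the motion between two consecutive BGR frames as a color image.

    Direction of movement is mapped to hue and magnitude to brightness, which is
    the usual way of visualizing dense (Farneback) optical flow.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    hsv = np.zeros_like(prev_frame)
    hsv[..., 0] = angle * 180 / np.pi / 2                                   # hue = direction
    hsv[..., 1] = 255                                                       # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)   # value = amount of motion
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

class MotionCNN(nn.Module):
    """A minimal CNN: flow image in, probability of "a lot of movement" (class 1) out."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(x))
```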
3. Sound features
Several approaches were used together to extract data from the audio. First, the maximum sound volume was extracted for each clip. Second, the PANN CNN for Audio Tagging was used to find specific sounds in the audio, from which the necessary features were then extracted. Here is the list of sound features extracted by the model:

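For context, extracting these sound features could look roughly like the sketch below. It assumes the `panns_inference` package for PANN Audio Tagging and uses the peak amplitude as the loudness value; the helper name and the subset of class names are illustrative, not the project's exact code.

```python
import librosa
import numpy as np
# PANNs via the panns_inference package; exact import names may differ between versions.
from panns_inference import AudioTagging, labels

def sound_features(wav_path, classes=("Laughter", "Music", "Speech", "Gunshot, gunfire")):
    """Loudness + AudioSet class probabilities for one 10-second clip (illustrative)."""
    # PANNs expect 32 kHz mono audio.
    audio, _ = librosa.load(wav_path, sr=32000, mono=True)

    features = {"sound_loudness": float(np.max(np.abs(audio)))}   # peak amplitude as "volume"

    tagger = AudioTagging(checkpoint_path=None, device="cpu")     # downloads the default checkpoint
    clipwise_output, _ = tagger.inference(audio[None, :])         # (1, 527) AudioSet class scores

    for name in classes:
        features[name] = float(clipwise_output[0, labels.index(name)])
    return features
```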
4. The Metamodel!
Because it is difficult even for a person to find interesting moments, and the concept of an “interesting moment” is vague, it was decided to create a metamodel that could learn from the data which features are important for identifying interesting moments.
The metamodel takes all the data about a Twitch broadcast extracted by the audio, motion, and chat components, and predicts the probability that a moment is an interesting one based on the provided features.
After extracting all the data from the live broadcast, the following set of features was fed to the input of the metamodel:

The classic supervised learning approach was used to train the metamodel. For this, a sample of live broadcasts was collected and labeled. To label each broadcast I used the supervise.ly platform, which turned out to be very handy for data labeling!

To label the live broadcasts, the ready-made codebase from my application was reused: all the features for the selected live broadcasts were exported, and each broadcast was cut into consecutive 10-second clips. For each broadcast, the following datasheet was obtained:

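As a rough sketch of how such a datasheet could be assembled, the snippet below cuts a broadcast into consecutive 10-second clips and collects one row per clip. The column names and the `extract_features` / `get_label` callbacks are placeholders of mine, not the project's actual schema.

```python
import pandas as pd

CLIP_SECONDS = 10

def build_datasheet(broadcast_length_s, extract_features, get_label):
    """Cut a broadcast into consecutive 10-second clips and collect one row per clip.

    `extract_features(start, end)` would call the chat/motion/sound components above;
    `get_label(start, end)` returns the manual 0/1 annotation. Both are placeholders here.
    """
    rows = []
    for start in range(0, broadcast_length_s, CLIP_SECONDS):
        end = min(start + CLIP_SECONDS, broadcast_length_s)
        row = {"clip_start_s": start, "clip_end_s": end}
        row.update(extract_features(start, end))    # e.g. num_messages, sound_loudness, motion score...
        row["is_highlight"] = get_label(start, end) # 1 = interesting moment, 0 = not
        rows.append(row)
    return pd.DataFrame(rows)
```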
After all the class labels were assigned, training of the metamodel began. Predicting one of two classes is a common binary classification problem, so I used models from the scikit-learn library. The best model (in terms of interpretability, speed, and F1 score) turned out to be Logistic Regression, with an F1 score of 0.44. Here are the feature importances of the Logistic Regression model (its coefficients):

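For reference, training such a metamodel with scikit-learn and reading off the coefficients could look roughly like this; the file name, column names, and the `class_weight` setting are assumptions on my part, not the project's exact configuration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# "datasheet.csv" is a stand-in for the labeled per-clip feature table described above.
data = pd.read_csv("datasheet.csv")
X = data.drop(columns=["is_highlight"])
y = data["is_highlight"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))

# Feature importances = signed coefficients: positive values push towards "interesting moment".
importances = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(importances)
```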
After training the metamodel and visualizing the importance of the features, one can see several interesting patterns that the model was able to extract from the data:
The most important criteria for determining the class “interesting moment”:
- sound volume (sound_loudness)
- the sound of bursts of fire (Machine gun)
- laughter
The most important criteria for determining the class “uninteresting moment”:
- music in the background (Music)
- speech (Speech, Male speech)
- single shooting (Gunshot, gunfire)
Features that do not affect the result of the model:
- the sounds of steps
- the number of positive messages in the chat
- the sound of a woman’s voice
An interesting detail is that the model classified the sound of a female voice as unimportant, although this is not true. Male voices simply predominated in the training sample, so the model concluded that the female voice does not matter. This is a classic example of the “Bias in Artificial Intelligence” problem.
Implementation details

The app consists of two main parts:
• Web application, where the user interacts with the main interface
• Server, where video stream processing and highlight detection happens
In the frontend part of the application, the user submits a link to the live broadcast in which they want to find highlights; the link, along with the user’s metadata, is then sent to the server.

On the server, the live broadcast and its comments are continuously loaded into local memory, while at the same time interesting moments are being detected in the live broadcast. As soon as a new interesting moment is found, it is saved in Google Cloud Storage, and the link to the video containing the interesting moment is sent back to the web application.
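A minimal sketch of that upload step, assuming the official google-cloud-storage client and a hypothetical `publish_highlight` helper, could look like this:

```python
from google.cloud import storage

def publish_highlight(local_path, bucket_name, blob_name):
    """Upload a detected highlight clip to Google Cloud Storage and return a link to it."""
    client = storage.Client()              # uses the server's default GCP credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    return blob.public_url                 # this link is what gets sent back to the web application
```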
Both parts of the application (the web application and the server) are deployed on the Google Cloud Platform and, when launched, receive a public IP address that you can connect to.
Here is a quick showcase of the app:
Conclusion
I hope this post helped you to understand my pipeline of highlight detection better. I released all the code on my GitHub profile, so feel free to take a look and use it for your own projects!
You may like to check other posts on my Medium profile and don’t forget to subscribe 🙂