How I achieved 1st place in Kaggle OSIC Pulmonary Fibrosis Progression competition


In the OSIC Pulmonary Fibrosis Progression competition, competitors were asked to predict the severity of a patient's decline in lung function based on a CT scan of their lungs and a few additional tabular data fields. The challenge was to use machine learning techniques to make predictions from the image, the metadata, and the baseline FVC (forced vital capacity) measurement. The task wasn't simple: with so little data available, traditional computer vision approaches struggled to model the dependency between CT scans and patient FVC values. Moreover, the public leaderboard was computed on only 15% of the test data and didn't correlate with local validation at all, which made selecting the best models for the final submission really hard.

In general, it is hard to cover all the subtle points of the competition here, so I recommend visiting the competition page on Kaggle and reading about it yourself if you are really interested!

Inputs and outputs of the model

As described above, the input data consisted of chest CT scans in DICOM format, with the slices for each patient forming a 3D volume, plus some additional metadata fields describing each patient in general. Here is an example slice from one of the CT scans.
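As a rough illustration of how such per-slice DICOM files become a 3D volume: the slices are ordered by their InstanceNumber tag and stacked. The pydicom calls in the comments (`dcmread`, `pixel_array`) are the standard way to read DICOM pixel data; the directory layout is an assumption, and lightweight stand-in objects are used below so the sketch runs without real scan files.

```python
from types import SimpleNamespace

def sort_slices(slices):
    # Order DICOM datasets by their InstanceNumber tag so that
    # their pixel arrays stack into a coherent 3D volume.
    return sorted(slices, key=lambda s: int(s.InstanceNumber))

# With pydicom installed, a patient's scan could be assembled roughly as:
#   import glob, pydicom, numpy as np
#   slices = [pydicom.dcmread(f) for f in glob.glob(f"{scan_dir}/*.dcm")]
#   volume = np.stack([s.pixel_array for s in sort_slices(slices)])

# Stand-ins illustrating only the ordering step:
fake = [SimpleNamespace(InstanceNumber=n) for n in (3, 1, 2)]
print([s.InstanceNumber for s in sort_slices(fake)])  # [1, 2, 3]
```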

[Image: an example CT scan slice]

And here is a sample of the additional metadata fields for the patient.

[Image: sample patient metadata fields]

For each patient, an initial CT scan was given, corresponding to the first week in the patient's metadata, along with additional weekly records describing how the patient's FVC changed over time.

For each patient in the test set, only the first-week data was given, along with the initial CT scan. The task was not only to predict the "FVC" values for the following weeks but also to report a "Confidence" score for each prediction.
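Concretely, each submission row was keyed by a patient-and-week identifier and carried both the FVC prediction and its Confidence. A minimal sketch of that shape, using a made-up patient ID and toy numbers:

```python
import pandas as pd

# Made-up patient ID and toy predictions, purely for illustration
rows = [
    {"Patient_Week": f"ID001_{week}", "FVC": fvc, "Confidence": sigma}
    for week, fvc, sigma in [(10, 2200, 250), (11, 2190, 255), (12, 2180, 260)]
]
submission = pd.DataFrame(rows, columns=["Patient_Week", "FVC", "Confidence"])
print(submission.to_string(index=False))
```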

Evaluation metric

Speaking about the metric, a Laplace log-likelihood score was used to score all the submissions:

σ_clipped = max(σ, 70)
Δ = min(|FVC_true − FVC_pred|, 1000)
metric = −√2 · Δ / σ_clipped − ln(√2 · σ_clipped)

Here is my python implementation of the metric.

import numpy as np

def loglikelihood(real, pred, sigmas):
    # sigma is clipped from below at 70
    sigmasClipped = np.maximum(sigmas, 70)
    # the absolute error is capped at 1000
    delta = np.abs(real - pred)
    deltaClipped = np.minimum(delta, 1000)
    metric = -np.sqrt(2) * deltaClipped / sigmasClipped - np.log(np.sqrt(2) * sigmasClipped)
    return np.mean(metric)
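A quick sanity check of the two clipping rules, with the metric definition repeated so the snippet is self-contained: any σ below 70 scores exactly like σ = 70, and errors beyond 1000 mL stop adding penalty.

```python
import numpy as np

def loglikelihood(real, pred, sigmas):
    sigmasClipped = np.maximum(sigmas, 70)
    deltaClipped = np.minimum(np.abs(real - pred), 1000)
    return np.mean(-np.sqrt(2) * deltaClipped / sigmasClipped
                   - np.log(np.sqrt(2) * sigmasClipped))

real, pred = np.array([2500.0]), np.array([2400.0])

# sigma is clipped from below at 70: sigma=10 scores the same as sigma=70
assert loglikelihood(real, pred, np.array([10.0])) == \
       loglikelihood(real, pred, np.array([70.0]))

# the error is capped at 1000: being off by 1500 or by 2500 scores the same
sigma = np.array([200.0])
assert loglikelihood(real, np.array([4000.0]), sigma) == \
       loglikelihood(real, np.array([5000.0]), sigma)
```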

Validation technique

Speaking about validation, I tried to make it as close as possible to the scoring method used by the organizers. They scored only the last three predictions (the last three weeks for each patient), so I developed a similar validation framework: each validation fold included only patients absent from the training fold, and only each patient's last three weeks were scored. Even so, this validation scheme did not correlate well with the leaderboard.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=4444)
foldMetrics = []
patients = trainData['PatientID'].unique()
for trIdsIdx, valIdsIdx in kf.split(patients):
    # split patients (not rows) so no patient appears in both folds
    trIds, valIds = patients[trIdsIdx], patients[valIdsIdx]
    tr = trainData[trainData['PatientID'].isin(trIds)]
    val = trainData[trainData['PatientID'].isin(valIds)].reset_index(drop=True)

    # keep only the last 3 target weeks of each validation patient for scoring
    valIdx = []
    for idx in val.groupby('PatientID')['target_week'].apply(
            lambda x: np.array(x.index[np.in1d(np.array(x), np.array(sorted(x)[-3:]))])):
        valIdx.extend(idx)
    val = val.iloc[valIdx]

    # ... train on `tr`, then predict `val_pred` and `sigmas` for `val` ...
    foldMetrics.append(loglikelihood(val['target_FVC'].values, val_pred, sigmas))

Best model

My best model turned out to be a blend of two models, both of which had been shared by other Kagglers before the competition ended: an EfficientNet B5 and a Quantile Regression dense neural network. Whereas the EfficientNet used CT scan slices along with the tabular data, the Quantile Regression model relied on tabular data only. Because the two models were built quite differently, blending them was a good way to get a high score from the diversity of their predictions.

Here are the exact steps I took to reach 1st place on the private leaderboard.

  • Trained both models from scratch: 30 epochs for the EfficientNet B5 and 600 epochs for the Quantile Regression model.
  • Did some feature filtering by removing the precomputed "Percent" feature, which had been making the predictions worse.
  • For blending, I simply gave a higher weight to the Quantile Regression model, because in my view it was the more reliable of the two.
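The blending step can be sketched as a weighted average of the two models' outputs. The 0.6/0.4 split below is my assumption for illustration; the write-up only says the Quantile Regression model received the larger weight.

```python
import numpy as np

# Illustrative weights only -- the exact split is an assumption,
# with the Quantile Regression model weighted higher as described.
w_qreg, w_effnet = 0.6, 0.4

# Toy FVC predictions from the two models for two test rows
fvc_qreg = np.array([2400.0, 2300.0])
fvc_effnet = np.array([2500.0, 2250.0])

fvc_blend = w_qreg * fvc_qreg + w_effnet * fvc_effnet
print(fvc_blend)  # [2440. 2280.]
```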

In terms of training time, the EfficientNet B5 fine-tuned in about 5 minutes on a single Nvidia Titan RTX GPU. The Quantile Regression model trained directly on the Kaggle machine during inference and took no more than 30 seconds. Overall, the whole inference process took only 3 minutes on Kaggle machines.

I would also like to share a small tip on how I select submissions and verify prediction correctness; in general, it helps me a lot. Whenever I train a new model and obtain a submission file, I draw distribution plots of the prediction values, along with plots of the predictions themselves. Here is what the prediction plots look like:

[Image: test-set "Confidence" predictions for a subset of models]

These are plots of the test-set "Confidence" for a subset of my models. Sometimes, just by looking at such plots, you can spot strange model behavior and find a bug. In general, I always analyze predictions really carefully and build a lot of graphs before submitting anything.
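One hedged way to automate part of this eyeballing is to count how often each (rounded) predicted value occurs: a single value claiming a large share of rows is exactly the kind of spike visible in these plots. The `find_spikes` helper and the 5% threshold are my own illustration, not part of the original pipeline.

```python
import numpy as np

def find_spikes(preds, frac=0.05):
    # Flag any rounded prediction value that accounts for more
    # than `frac` of all rows -- a crude spike detector.
    vals, counts = np.unique(np.round(preds), return_counts=True)
    return vals[counts > frac * len(preds)]

# Toy predictions: a smooth spread plus one suspiciously repeated value
preds = np.concatenate([np.linspace(200, 300, 90), np.full(10, 250.0)])
print(find_spikes(preds))  # [250.]
```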

What didn’t work

I tried a ton of things, but almost all of them performed badly on both the leaderboard and my CV. Here are a few:

  • Calculated lung volume with methods from the public notebooks and passed it as a feature to both models.
  • Tested other models on the tabular data, such as XGBoost and logistic regression. Thanks to my CV, it quickly became clear that tree-based models did not work here, so I abandoned them early in the competition.
  • Since I was testing simple models anyway, my second selected submission was a very simple logistic regression model, which, by the way, landed in the bronze zone.
  • Augmentations for the CT scans worked poorly; maybe I should have spent more time testing them.
  • Histogram features of the image didn't work either.
  • If you have analyzed the model outputs, you might have noticed the spikes (in both Confidence and FVC). It made total sense to remove them, but doing so didn't help on my validation, so I left them as is. I am still confused about why removing them didn't work on the private test set either.
[Image: Confidence spikes]

Final words

This is my first gold medal on Kaggle, and I am really happy about it. I would like to thank the whole Kaggle community for making so many notebooks public and being active on the forums! Without that, I wouldn't have learned as much during this competition and all the previous ones.

Below are links to my final submission notebooks, along with the Kaggle write-up and the GitHub repository.

1st Place notebook

Bronze zone very simple solution

Kaggle writeup

Github repository