This benchmark is not meant for iterative testing sessions or parameter tweaking. Each algorithm should only be submitted once its development is finished (or in a state that you want to reference in a publication). We have limited the total number of submissions to three. You cannot edit/update your submission after it has been accepted. Do not waste submissions to tweak your parameters or training data.
To be listed on the public leaderboard, please follow these steps:
Get the data. Create an account to get access to the download links.
The download packages contain additional technical submission details.
Compute your results.
Identical parameter settings must be used for all frames.
Upload and submit.
Log in, upload your results, add a brief description of your method, and submit for evaluation. To support double-blind review processes, author details may initially be anonymized and updated later upon request.
Check your results.
Your submission will be evaluated automatically.
You will be notified via email as soon as we approve the evaluation results.
Frequently Asked Questions
wd141-wd155 are weird/broken!
Yes indeed! These are negative test cases, cases where we expect the algorithm to fail. These out-of-scope cases are only evaluated in the category negative within the benchmark. They have no influence on the other scores.
How do you evaluate negative test cases?
Any pixels with void labels are considered correct as well as an additional best case GT. For example, the upside-down image allows either void labels or an upside-down version of the regular GT as correct. Likewise, the correct result for the black-and-white image can either be void or the respective labels from a colored image showing the same scene. This is done for each pixel so mixtures of void and valid labels will be evaluated fairly. In future versions of the benchmark (after the CVPR ROB 2018 challenge) we will only count 'unlabeled' (id 0) as a correct label (in addition to the best case GT).
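The per-pixel rule above can be sketched as follows. This is a hedged illustration, not the benchmark's actual evaluation code: the function name and the default `void_ids` set are assumptions (here only id 0 is treated as void, matching the planned v2 behavior).

```python
import numpy as np

def negative_case_accuracy(pred, best_case_gt, void_ids=(0,)):
    """Fraction of correct pixels for a negative test case.

    A pixel counts as correct if the prediction carries a void label
    OR matches the best-case GT; each pixel is judged independently,
    so mixtures of void and valid labels are handled fairly.
    """
    pred = np.asarray(pred)
    best_case_gt = np.asarray(best_case_gt)
    is_void = np.isin(pred, void_ids)      # void predictions are accepted
    matches_gt = pred == best_case_gt      # best-case GT matches are accepted
    return (is_void | matches_gt).mean()
```
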
For instance segmentation, the mixing of different silhouettes would skew results. Thus, here we only exchange empty *.txt to the respective best case GT.
The classic cityscapes trainIds 0-18 do not contain any void labels. How do I fix this?
We do not encourage solutions that lack a negative/void class. One quick-and-dirty hack to introduce void pixels is a post-processing step that maps every pixel whose argmax probability falls below a threshold to id 0.
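A minimal sketch of that hack, assuming your network outputs a per-class probability map whose class axis is indexed by label id; the function name and threshold value are hypothetical, and the threshold must be tuned on validation data:

```python
import numpy as np

def apply_void_fallback(prob, threshold=0.5):
    """Map low-confidence pixels to id 0 ('unlabeled').

    prob: (H, W, C) array of per-class probabilities.
    Pixels whose maximum class probability is below `threshold`
    are assigned label id 0 instead of the argmax class.
    """
    label_ids = np.argmax(prob, axis=-1)
    max_prob = np.max(prob, axis=-1)
    label_ids[max_prob < threshold] = 0
    return label_ids
```

Remember that the same threshold and post-processing must be applied to every frame of the dataset (see below).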
Whatever steps and mechanisms you choose: you have to handle all frames of the dataset in the same way. If you apply some post-processing to negative test frames only, then this is considered cheating! In the end, you have to find a balance between producing good quality output while failing gracefully in the event of out-of-scope situations.
The negative test cases are unfair!
WildDash tries to focus on algorithm robustness rather than benchmarking the best-case performance. We apply the same metric to all submissions. In the CVPR challenge, negative test cases only affect a single ranking out of more than 30 that will be used to calculate an aggregated rank. It will on average have less than 5% impact on your algorithm's total rank.
wd150 shows an indoor scene and ROB 2018's ScanNet has labels for indoors. Should I do something special here?
No, WildDash evaluates labels compatible with the cityscapes label policy. For our dataset, the indoor scene is out-of-scope (see above about how negative tests are evaluated). All ids > 33 are mapped to 0 internally so you can submit algorithm results with a mixture of cityscapes label id [0-33] and unknown ids above 33 (e.g. ScanNet). Please see the ROB website for ROB specific questions.
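The internal remapping described above can be sketched in a few lines; the function name is an assumption, but the behavior follows the rule stated here (any submitted id above 33 is treated as id 0):

```python
import numpy as np

def remap_to_cityscapes_ids(label_ids, max_valid_id=33):
    """Treat any label id above the Cityscapes range [0-33] as id 0."""
    label_ids = np.asarray(label_ids).copy()
    label_ids[label_ids > max_valid_id] = 0
    return label_ids
```
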
How do you calculate your metrics? Which label ids are relevant?
Version one of WildDash (wd_val_01, wd_bench_01, wd_both_01) is evaluated with the cityscapes evaluation scripts.
The same label policy, metrics, and weighting are used. See the official cityscapes website for details.
Thus, only 19 labels having the cityscapes trainIds (ignoreInEval == False) are evaluated.
In addition, see above about the handling of void labels for negative test cases.
All evaluations except negative ignore regions with void labels in the GT, but otherwise count void labels in algorithm results as bad pixels.
In the future (after ROB 2018 has finished), v2 will evaluate all labels that are also present in the training data.
I want to download/submit, but get the error message "Due to data protection legislation, we need to manually approve each account before granting access to download the data."
The error message says it all: we have to approve each of our users manually before they can access the privacy-relevant data in our dataset. This is done periodically but may take up to a week. Please use your academic email address to speed up the process, and register well ahead of paper/challenge deadlines. We are sorry for the delays, but this is necessary to fulfill our data protection obligations.
Which parts of the dataset may I use during training?
You can use all of the validation frames from wd_val_01 and the associated GT during training in any way you see fit. The use of benchmarking frames from wd_bench_01 during training (e.g. by creating GT yourself or by unsupervised learning) is considered cheating. We remove cheating submissions from the leaderboard and may invoke temporary or permanent bans of cheating users or institutes.
The ROB 2018 rules allow the use of benchmarking frames during training. How do I handle WildDash's restrictions here?
The use of other data sources is not restricted for WildDash. Just skip WildDash's own benchmarking frames (and frames from the same source sequences). All other data (including benchmarking frames from, e.g., Cityscapes or KITTI) is fine.
Can I use/apply FancyPreprocessingSteps/FancyPostprocessingSteps?
You can use any kind and mixture of classifiers and pre/post-processing effects as long as there are no manual steps to distinguish parts of the benchmarking data (e.g. defining that a specific subset are negative test cases and processing them differently). We accept solutions that use the image content (rather than the image's file name or other meta-data) to automatically detect negative test examples and handle them differently.
Is there an archived version of the ROB 2018 results?
Yes! You can find the overall results here. The WildDash specific results are archived here: semantic / instance segmentation.
Why are only evaluation results for validation frames visualized on the website? Where are the benchmarking frames?
The GT for benchmarking frames must remain hidden to allow a fair evaluation. Visualized comparisons against the benchmarking GT would allow our GT to be reverse-engineered.
The interface shows all submissions but the total number is less than expected!
We set the status of results for algorithms that have been stable for more than a few months to "archived". This frees up new slots for you to submit new algorithms, so previously participating teams are not at a disadvantage.
My own validation results differ slightly from the numbers on the benchmark!
The WildDash validation and benchmarking frames should have similar composition and difficulty, but some divergence is still to be expected. Additionally, WildDash computes metrics per frame and then averages these per-frame metrics over all frames of a given subset. This differs slightly from averaging over all pixels of all frames, but better reflects our interpretation of frames as individual test cases.
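The difference between the two averaging schemes can be made concrete with a small sketch; the helper names are hypothetical and this single-class IoU is only an illustration, not the benchmark's evaluation code:

```python
import numpy as np

def iou(pred, gt, class_id):
    """Intersection-over-union for one class; NaN if the class is absent."""
    inter = np.logical_and(pred == class_id, gt == class_id).sum()
    union = np.logical_or(pred == class_id, gt == class_id).sum()
    return inter / union if union else float("nan")

def per_frame_mean_iou(preds, gts, class_id):
    """WildDash-style: score each frame, then average the frame scores."""
    return np.nanmean([iou(p, g, class_id) for p, g in zip(preds, gts)])

def global_iou(preds, gts, class_id):
    """Pixel-pooled alternative: pool all pixels of all frames first."""
    return iou(np.concatenate([np.ravel(p) for p in preds]),
               np.concatenate([np.ravel(g) for g in gts]), class_id)
```

With per-frame averaging, a small, hard frame carries the same weight as a large, easy one, which is why your locally computed pixel-pooled numbers can differ slightly from the leaderboard.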
Which licenses apply to the datasets and benchmarks?
These are available in the download itself and here:
WildDash dataset and benchmark license
RailSem19 dataset license
The WildDash Benchmark is part of the semantic and instance segmentation challenges of the
Robust Vision Challenge 2018.
If you want to participate, follow the instructions above and add _ROB as a postfix to your method name.
Please note that you must use the same model / parameter setup to compute your results for all benchmarks of the respective challenge.