Par4Sem |
Dataset
Learning_to_Rank Dataset collected in each iterations are found here
File formats: It is a TAB separated format where:
- First column is HIT ID (so all sentences with the same HIT ID were assigned to one worker at MTurk)
- Second column is the iteration number (at which Iteration this data is collected),
- Third, fourth and fifth columns are the sentence, the target unit in the sentence, and the possible candidate suggestion respectively.
- The sixth and the seventh columns are the offsets of the target unit in the sentence.
- The last column is the number of workers who choose the candidate suggestion for the given target unit.
Example:
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling appointments 56 64 0
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling blending 56 64 1
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling clashing 56 64 1
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling dealing 56 64 1
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling interacting 56 64 2
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling interruptions 56 64 1
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling involvement 56 64 3
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling meddlesome 56 64 1
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling opponents 56 64 0
3A520CCNWN1XLHTKZR9HJXTKN5BAEO 3 Others oppose Assad and accuse Damascus of heavy-handed meddling in Lebanese politics. meddling upheaval 56 64 0
Difficult unit identification dataset: The difficult unit identification dataset that are used to instantiate the classification model are based on the CWI dataset from here.
The training data are in the following format:
<ID> Both China and the Philippines flexed their muscles on Wednesday. 31 51 flexed their muscles 10 10 3 2 1 0.25
<ID> Both China and the Philippines flexed their muscles on Wednesday. 31 37 flexed 10 10 2 6 1 0.4
<ID> Both China and the Philippines flexed their muscles on Wednesday. 44 51 muscles 10 10 0 0 0 0.0
Each line represents a sentence with one complex word annotation and relevant information, each separated by a TAB character.
- The first column shows the HIT ID of the sentence. All sentences with the same ID belong to the same HIT.
- The second column shows the actual sentence where there exists a complex phrase annotation.
- The third and fourth columns display the start and end offsets of the target word in this sentence.
- The fifth column represents the target word.
- The sixth and seventh columns show the number of native annotators and the number of non-native annotators who saw the sentence.
- The eighth and ninth columns show the number of native annotators and the number of non-native annotators who marked the target word as difficult.
-
The tenth and eleventh columns show the gold-standard label for the binary and probabilistic classification tasks. The labels in the binary classification task were assigned in the following manner:
- 0: simple word (none of the annotators marked the word as difficult)
- 1: complex word (at least one annotator marked the word as difficult)
The labels in the probabilistic classification task were assigned as the number of annotators who marked the word as difficult/the total number of annotators.
Details about the dataset is available in the CWI Shared Task 2018 dataset section.