Glioma Grading Clinical and Mutation Features Dataset

Donated on 12/14/2022

Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on histological/imaging criteria. Clinical and molecular/mutation factors are also crucial for the grading process. Molecular tests help diagnose glioma patients accurately, but they are expensive. This dataset covers the 20 most frequently mutated genes and 3 clinical features from the TCGA-LGG and TCGA-GBM brain glioma projects. The prediction task is to determine whether a patient's glioma is LGG or GBM given the clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading process, in order to improve performance and reduce costs.

Dataset Characteristics

Tabular, Multivariate

Subject Area

Life Science

Associated Tasks

Classification, Other

Attribute Type

Real, Categorical, Integer

# Instances

839

# Attributes

23

Information

Who funded the creation of the dataset?

The Cancer Genome Atlas (TCGA) Project – NCI

What do the instances in this dataset represent?

This dataset covers the 20 most frequently mutated genes and 3 clinical features from the TCGA-LGG and TCGA-GBM brain glioma projects. The preprocessed and organized CSV dataset file consists of twenty-four fields per record (23 features plus the Grade class label). Fields are separated by commas and records by newlines. Gender, Age_at_diagnosis, and Race are clinical factors; the remaining 20 molecular features are IDH1, TP53, ATRX, PTEN, EGFR, CIC, MUC16, PIK3CA, NF1, PIK3R1, FUBP1, RB1, NOTCH1, BCOR, CSMD3, SMARCA4, GRIN2A, IDH2, FAT4, and PDGFRA. Each molecular feature is either mutated or not_mutated (wildtype) for a given TCGA Case_ID.

Complete attribute documentation for the preprocessed dataset file is as follows:

1. Gender : Gender (0 = male; 1 = female)
2. Age_at_diagnosis : Age at diagnosis in years, with the number of days added as a fractional part (floating-point)
3. Race : Race (0 = white; 1 = black or african american; 2 = asian; 3 = american indian or alaska native)
4. IDH1 : isocitrate dehydrogenase (NADP(+)) 1 (0 = NOT_MUTATED; 1 = MUTATED)
5. TP53 : tumor protein p53 (0 = NOT_MUTATED; 1 = MUTATED)
6. ATRX : ATRX chromatin remodeler (0 = NOT_MUTATED; 1 = MUTATED)
7. PTEN : phosphatase and tensin homolog (0 = NOT_MUTATED; 1 = MUTATED)
8. EGFR : epidermal growth factor receptor (0 = NOT_MUTATED; 1 = MUTATED)
9. CIC : capicua transcriptional repressor (0 = NOT_MUTATED; 1 = MUTATED)
10. MUC16 : mucin 16, cell surface associated (0 = NOT_MUTATED; 1 = MUTATED)
11. PIK3CA : phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha (0 = NOT_MUTATED; 1 = MUTATED)
12. NF1 : neurofibromin 1 (0 = NOT_MUTATED; 1 = MUTATED)
13. PIK3R1 : phosphoinositide-3-kinase regulatory subunit 1 (0 = NOT_MUTATED; 1 = MUTATED)
14. FUBP1 : far upstream element binding protein 1 (0 = NOT_MUTATED; 1 = MUTATED)
15. RB1 : RB transcriptional corepressor 1 (0 = NOT_MUTATED; 1 = MUTATED)
16. NOTCH1 : notch receptor 1 (0 = NOT_MUTATED; 1 = MUTATED)
17. BCOR : BCL6 corepressor (0 = NOT_MUTATED; 1 = MUTATED)
18. CSMD3 : CUB and Sushi multiple domains 3 (0 = NOT_MUTATED; 1 = MUTATED)
19. SMARCA4 : SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4 (0 = NOT_MUTATED; 1 = MUTATED)
20. GRIN2A : glutamate ionotropic receptor NMDA type subunit 2A (0 = NOT_MUTATED; 1 = MUTATED)
21. IDH2 : isocitrate dehydrogenase (NADP(+)) 2 (0 = NOT_MUTATED; 1 = MUTATED)
22. FAT4 : FAT atypical cadherin 4 (0 = NOT_MUTATED; 1 = MUTATED)
23. PDGFRA : platelet-derived growth factor receptor alpha (0 = NOT_MUTATED; 1 = MUTATED)

The class label information is given as follows:

• Grade : Glioma grade class information (0 = LGG; 1 = GBM)

Additional Information: In the original dataset file, 23 instances had Gender, Age_at_diagnosis, or Race values of ‘--’ or ‘not reported’. These instances were removed, along with the Project, Case_ID, and Primary_Diagnosis columns, to construct the preprocessed dataset file. Age_at_diagnosis values were converted from strings to continuous floating-point values by adding the day information to the corresponding year information during preprocessing. Both the processed and unprocessed files are included in this directory.

Additional columns of the original dataset file:

• Project : the corresponding TCGA-LGG or TCGA-GBM project name
• Case_ID : the Case_ID of the case within the related project
• Primary_Diagnosis : the type of primary diagnosis
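The coding scheme above can be checked mechanically when the preprocessed file is loaded. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row whose names match the attribute names listed above; the function name `read_records` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

CLINICAL = ["Gender", "Age_at_diagnosis", "Race"]
GENES = ["IDH1", "TP53", "ATRX", "PTEN", "EGFR", "CIC", "MUC16", "PIK3CA",
         "NF1", "PIK3R1", "FUBP1", "RB1", "NOTCH1", "BCOR", "CSMD3",
         "SMARCA4", "GRIN2A", "IDH2", "FAT4", "PDGFRA"]
TARGET = "Grade"
RACE_CODES = {0: "white", 1: "black or african american",
              2: "asian", 3: "american indian or alaska native"}


def read_records(handle):
    """Parse CSV text and check every record against the documented coding."""
    reader = csv.DictReader(handle)
    expected = {TARGET, *CLINICAL, *GENES}  # 24 fields in total
    if set(reader.fieldnames or []) != expected:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    records = []
    for row in reader:
        # Every field except the age is an integer code.
        rec = {name: int(row[name]) for name in (TARGET, "Gender", "Race", *GENES)}
        rec["Age_at_diagnosis"] = float(row["Age_at_diagnosis"])
        if any(rec[name] not in (0, 1) for name in (TARGET, "Gender", *GENES)):
            raise ValueError(f"non-binary code in record: {row}")
        if rec["Race"] not in RACE_CODES:
            raise ValueError(f"unknown Race code in record: {row}")
        records.append(rec)
    return records


# Two made-up records (not real patients), only to exercise the parser.
header = [TARGET, *CLINICAL, *GENES]
row_lgg = [0, 0, 51.3, 0] + [1] + [0] * 19   # Grade=LGG, IDH1 mutated
row_gbm = [1, 1, 62.85, 1] + [0] * 20        # Grade=GBM, all genes wildtype
sample = "\n".join(",".join(map(str, r)) for r in (header, row_lgg, row_gbm))

records = read_records(io.StringIO(sample))
print(len(records), len(records[0]))
```

For the real file, the same function could be called on an opened file handle, for example `open(path, newline="")`, where `path` is wherever the CSV was saved.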

Are there recommended data splits?

No. We suggest 10-fold cross-validation for feature selection, classification etc.
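As an illustration of the suggested protocol, the sketch below builds stratified 10-fold train/test index splits with the standard library only. The function names and the toy label list are assumptions made for this example. Stratification is a choice made here and is not specified in the source. If feature selection is part of the pipeline, it is generally safer to run it inside each training fold so that the held-out fold does not influence the selected features.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds with similar class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin; the offset keeps fold sizes
        # balanced when a class size is not a multiple of k.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as test set."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Toy labels (0 = LGG, 1 = GBM); the counts are made up, not the real ones.
labels = [0] * 60 + [1] * 40
for train_idx, test_idx in cross_validation_splits(labels, k=10, seed=0):
    gbm_in_test = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), gbm_in_test)
```

Libraries such as scikit-learn provide an equivalent splitter (`StratifiedKFold`), which is usually preferable to a hand-rolled version in real experiments.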

Does the dataset contain data that might be considered sensitive in any way?

There is race information in this dataset.

Was there any data preprocessing performed?

Yes. In the original dataset file, 23 instances had Gender, Age_at_diagnosis, or Race values of ‘--’ or ‘not reported’. These instances were removed, along with the Project, Case_ID, and Primary_Diagnosis columns, to construct the preprocessed dataset file. Age_at_diagnosis values were converted from strings to continuous floating-point values by adding the day information to the corresponding year information. Both the processed and unprocessed files are included in this directory.
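The preprocessing steps described above can be sketched roughly as follows. This is not the authors' code. It assumes the raw age strings look like "51 years 108 days" and that days are converted with a 365-day year; neither detail is stated in the source. The sample rows and the helper names `age_to_years` and `preprocess` are hypothetical.

```python
import re

MISSING = {"--", "not reported"}
DROPPED_COLUMNS = ("Project", "Case_ID", "Primary_Diagnosis")
CHECKED_COLUMNS = ("Gender", "Age_at_diagnosis", "Race")
# Assumed raw format: "<N> years" optionally followed by "<M> days".
AGE_PATTERN = re.compile(r"(\d+)\s+years?(?:\s+(\d+)\s+days?)?")


def age_to_years(text):
    """Convert an age string to a float number of years (365-day year assumed)."""
    match = AGE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized age string: {text!r}")
    years = int(match.group(1))
    days = int(match.group(2) or 0)
    return years + days / 365.0


def preprocess(rows):
    """Drop incomplete records and bookkeeping columns; make age numeric."""
    cleaned = []
    for row in rows:
        if any(row[col].strip().lower() in MISSING for col in CHECKED_COLUMNS):
            continue  # record lacks a usable clinical value
        kept = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        kept["Age_at_diagnosis"] = age_to_years(row["Age_at_diagnosis"])
        cleaned.append(kept)
    return cleaned


# Hypothetical raw rows, shaped like the original (unprocessed) file.
raw = [
    {"Project": "TCGA-LGG", "Case_ID": "case-1", "Gender": "Male",
     "Age_at_diagnosis": "51 years 108 days", "Primary_Diagnosis": "example",
     "Race": "white", "IDH1": "MUTATED"},
    {"Project": "TCGA-GBM", "Case_ID": "case-2", "Gender": "--",
     "Age_at_diagnosis": "--", "Primary_Diagnosis": "example",
     "Race": "not reported", "IDH1": "NOT_MUTATED"},
]
cleaned = preprocess(raw)
print(cleaned)
```

Mapping the remaining string values to the integer codes listed in the attribute documentation would be a further step, omitted here.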

Has the dataset been used for any tasks already?

Feature selection, classification etc.

Citation Requests/Acknowledgements

Tasci, E., Zhuge, Y., Kaur, H., Camphausen, K., & Krauze, A. V. (2022). Hierarchical Voting-Based Feature Selection and Ensemble Learning Model Scheme for Glioma Grading with Clinical and Molecular Characteristics. International Journal of Molecular Sciences, 23(22), 14155.

Features

Attribute Name | Role | Type | Description | Units | Missing Values
--- | --- | --- | --- | --- | ---
Grade | Target | Categorical | Grade label | N/A | false
Gender | Feature | Categorical | Gender | N/A | false
Age_at_diagnosis | Feature | Numerical - Continuous | Age at diagnosis with the calculated number of days | years | false
Race | Feature | Categorical | Race (0 = white; 1 = black or african american; 2 = asian; 3 = american indian or alaska native) | N/A | false
IDH1 | Feature | Categorical | isocitrate dehydrogenase (NADP(+)) 1 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false
TP53 | Feature | Categorical | tumor protein p53 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false
ATRX | Feature | Categorical | ATRX chromatin remodeler (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false
PTEN | Feature | Categorical | phosphatase and tensin homolog (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false
EGFR | Feature | Categorical | epidermal growth factor receptor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false
CIC | Feature | Categorical | capicua transcriptional repressor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | false

Rows 1 to 10 of 24 shown.

Keywords

Brain tumor, Glioma, Mutation, Tumor grading, Clinical features, Molecular features

Creators

Erdal Tasci

erdal.tasci@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Kevin Camphausen

camphauk@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Andra Valentina Krauze

andra.krauze@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Ying Zhuge

zhugey@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
