This data was donated by researchers of the University of Wisconsin and includes the measurements from digitized images of fine-needle aspirate of a breast mass.
You can find the dataset at https://github.com/dataspelunking/MLwR/blob/master/Machine%20Learning%20with%20R%20(2nd%20Ed.)/Chapter%2003/wisc_bc_data.csv.
The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis and 30 are numeric-valued laboratory measurements. The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign.
The other 30 numeric measurements comprise the mean, standard error and worst (i.e. largest) value for 10 different characteristics of the digitized cell nuclei, which are as follows:-
Creators:
Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology University Medical Center
Ljubljana, Yugoslavia
Donors:
Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. (See also lymphography and primary-tumor.)
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. In Proceedings of the Fifth National Conference on Artificial Intelligence, 1041-1045, Philadelphia, PA: Morgan Kaufmann.
Clark,P. & Niblett,T. (1987). Induction in Noisy Domains. In Progress in Machine Learning (from the Proceedings of the 2nd European Working Session on Learning), 11-30, Bled, Yugoslavia: Sigma Press.
Tan, M., & Eshelman, L. (1988). Using weighted networks to represent classification knowledge in noisy domains. Proceedings of the Fifth International Conference on Machine Learning, 121-134, Ann Arbor, MI.
Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko & N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.
Creators:
1. Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792wolberg '@' eagle.surgery.wisc.edu
2. W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706street '@' cs.wisc.edu 608-262-6619
3. Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706olvi '@' cs.wisc.edu
Donor:
Nick Street
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at
Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.
The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:ftp ftp.cs.wisc.educd math-prog/cpo-dataset/machine-learn/WDBC/
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
First Usage:
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995. Medical literature:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792*796, 1995. See also:
Please refer to the Machine Learning Repository's citation policy. [1] Papers were automatically harvested and associated with this data set, in collaborationwith
This dataset of breast cancer patients was obtained from the 2017 November update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset involved female patients with infiltrating duct and lobular carcinoma breast cancer (SEER primary cites recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumor size, examined regional LNs, regional positive LNs, and patients whose survival months were less than 1 month were excluded; thus, 4024 patients were ultimately included.
The United States Cancer Statistics (USCS) online databases in WONDER provide cancer incidence and mortality data for the United States for the years since 1999, by year, state and metropolitan areas (MSA), age group, race, ethnicity, gender, childhood cancer classifications and cancer site. Report case counts, deaths, crude and age-adjusted incidence and death rates, and 95% confidence intervals for rates. The USCS data are the official federal statistics on cancer incidence from registries having high-quality data and cancer mortality statistics for 50 states and the District of Columbia. USCS are produced by the Centers for Disease Control and Prevention (CDC) and the National Cancer Institute (NCI), in collaboration with the North American Association of Central Cancer Registries (NAACCR). Mortality data are provided by the Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS), National Vital Statistics System (NVSS).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) lung cancer dataset was collected in the above-mentioned specialist hospitals over a period of three months in fall 2019. It includes CT scans of patients diagnosed with lung cancer in different stages, as well as healthy subjects. IQ-OTH/NCCD slides were marked by oncologists and radiologists in these two centers. The dataset contains a total of 1190 images representing CT scan slices of 110 cases (see Figure 1). These cases are grouped into three classes: normal, benign, and malignant. of these, 40 cases are diagnosed as malignant; 15 cases diagnosed with benign; and 55 cases classified as normal cases. The CT scans were originally collected in DICOM format. The scanner used is SOMATOM from Siemens. CT protocol includes: 120 kV, slice thickness of 1 mm, with window width ranging from 350 to 1200 HU and window center from 50 to 600 were used for reading. with breath hold at full inspiration. All images were de-identified before performing analysis. Written consent was waived by the oversight review board. The study was approved by the institutional review board of participating medical centers. Each scan contains several slices. The number of these slices range from 80 to 200 slices, each of them represents an image of the human chest with different sides and angles. The 110 cases vary in gender, age, educational attainment, area of residence and living status. Some of them are employees of the Iraqi ministries of Transport and Oil, others are farmers and gainers. Most of them come from places in the middle region of Iraq, particularly, the provinces of Baghdad, Wasit, Diyala, Salahuddin, and Babylon.
Cancer Rates for Lake County Illinois. Explanation of field attributes: Colorectal Cancer - Cancer that develops in the colon (the longest part of the large intestine) and/or the rectum (the last several inches of the large intestine). This is a rate per 100,000. Lung Cancer – Cancer that forms in tissues of the lung, usually in the cells lining air passages. This is a rate per 100,000. Breast Cancer – Cancer that forms in tissues of the breast. This is a rate per 100,000. Prostate Cancer – Cancer that forms in tissues of the prostate. This is a rate per 100,000. Urinary System Cancer – Cancer that forms in the organs of the body that produce and discharge urine. These include the kidneys, ureters, bladder, and urethra. This is a rate per 100,000. All Cancer – All cancers including, but not limited to: colorectal cancer, lung cancer, breast cancer, prostate cancer, and cancer of the urinary system. This is a rate per 100,000.
The dataset is maintained by VISION AND IMAGE PROCESSING LAB, University of Waterloo. The images of the dataset were extracted from the public databases DermIS and DermQuest, along with manual segmentations of the lesions.
The dataset was used in the following journal publication. [1] Glaister, J., A. Wong, and D. A. Clausi, "Automatic segmentation of skin lesions from dermatological photographs using a joint probabilistic texture distinctiveness approach", IEEE Transactions on Biomedical Engineering [2] Amelard, R., J. Glaister, A. Wong, and D. A. Clausi, "High-level intuitive features (HLIFs) for intuitive skin lesion descriptionpdf", IEEE Transactions on Biomedical Engineering, vol. 62, issue 3, pp. 820-831, October, 2015. [3] Glaister, J., R. Amelard, A. Wong, and D. A. Clausi, "MSIM: Multi-Stage Illumination Modeling of Dermatological Photographs for Illumination-Corrected Skin Lesion Analysis", IEEE Transactions on Biomedical Engineering, vol. 60, issue 7, pp. 1873 - 1883, November, 2013.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Total no. of attributes:16 No .of instances:284 Attribute information: 1. Gender: M(male), F(female) 2. Age: Age of the patient 3. Smoking: YES=2 , NO=1. 4. Yellow fingers: YES=2 , NO=1. 5. Anxiety: YES=2 , NO=1. 6. Peer_pressure: YES=2 , NO=1. 7. Chronic Disease: YES=2 , NO=1. 8. Fatigue: YES=2 , NO=1. 9. Allergy: YES=2 , NO=1. 10. Wheezing: YES=2 , NO=1. 11. Alcohol: YES=2 , NO=1. 12. Coughing: YES=2 , NO=1. 13. Shortness of Breath: YES=2 , NO=1. 14. Swallowing Difficulty: YES=2 , NO=1. 15. Chest pain: YES=2 , NO=1. 16. Lung Cancer: YES , NO.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Lung Cancer
Dataset Summary
The effectiveness of cancer prediction system helps the people to know their cancer risk with low cost and it also helps the people to take the appropriate decision based on their cancer risk status. The data is collected from the website online lung cancer prediction system .
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/lung-cancer.
A training set for academic machine learning can be created using the dataset, which comprises of 10015 dermatoscopic images. All significant diagnostic categories for pigmented lesions are represented in the cases in a representative manner:
- Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec
),
- basal cell carcinoma (bcc
),
- benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl
),
- dermatofibroma (df
),
- melanoma (mel
),
- melanocytic nevi (nv
) and
- vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc
).
Histopathology (histo) is used to confirm more than 50% of lesions; in the remaining cases, follow-up exams, expert consensus, or in-vivo confocal microscopy confirmation are used as the gold standard (confocal).
Dataset Collected from: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
https://www.apache.org/licenses/LICENSE-2.0https://www.apache.org/licenses/LICENSE-2.0
Cervical cancer is one of the leading causes of cancer-related deaths among women worldwide. Early detection and accurate prediction of cervical cancer can significantly improve the chances of successful treatment and save lives. This dataset help to develop a predictive model using machine learning techniques to identify individuals at high risk of cervical cancer, allowing for timely intervention and medical care.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data comes from two different sources. Population-based lung cancer incidence rates for the period 2010-2014 (most updated data) were abstracted from National Cancer Institute state cancer profiles (Schwartz et al. 1996).This national county-level database of cancer data is collected by state public health surveillance systems. The domain specific county level environmental quality index (EQI) data for the period 2000-2005 were abstracted from United States Environmental Protection Agency (USEPA) profile. Complete descriptions of the datasets used in the EQI are provided in Lobdell’s paper (Lobdell 2011). Data were merged based on the Federal Information Processing Standards (FIPS) code. Out of 3144 counties in United States this study has available information for 2602 counties: Data was not available for four states namely Kansas, Michigan, Minnesota and Nevada due to state legislation and regulations which prohibit the release of county-level data to outside entities, county whose lung cancer mortality information is missing were omitted from the data set, the Union county, Florida is an outlier in terms of mortality information which was deleted from the data set, in the process of local control analysis this study experiences two (cluster 28 and 29) non-informative clusters (non-informative cluster is one for which either treatment or control group information is missing). For analysis, non-informative clusters information was deleted from the data set. Three types of variables are used in this study: (i) lung cancer mortality as an outcome variable (ii) binary treatment indicator is the PM2.5 high (greater than 10.59 mg/m3) vs. low (less than 10.59 mg/m3) (iii) three potential X confounder for clustering namely land EQI, sociodemographic EQI and built EQI. For each index, higher values correspond to poorer environmental quality (Jagai et al. 2017). As PM2.5 is one of the indicators for measuring air EQI, that is why we do not consider the air EQI to avoid confounding effects.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Users can access data about cancer statistics in the United States including but not limited to searches by type of cancer and race, sex, ethnicity, age at diagnosis, and age at death. Background Surveillance Epidemiology and End Results (SEER) database’s mission is to provide information on cancer statistics to help reduce the burden of disease in the U.S. population. The SEER database is a project to the National Cancer Institute. The SEER database collects information on incidence, prevalence, and survival from specific geographic areas representing 28 percent of the United States population. User functionality Users can access a variety of reso urces. Cancer Stat Fact Sheets allow users to look at summaries of statistics by major cancer type. Cancer Statistic Reviews are available from 1975-2008 in table format. Users are also able to build their own tables and graphs using Fast Stats. The Cancer Query system provides more flexibility and a larger set of cancer statistics than F ast Stats but requires more input from the user. State Cancer Profiles include dynamic maps and graphs enabling the investigation of cancer trends at the county, state, and national levels. SEER research data files and SEER*Stat software are available to download through your Internet connection (SEER*Stat’s client-server mode) or via discs shipped directly to you. A signed data agreement form is required to access the SEER data Data Notes Data is available in different formats depending on which type of data is accessed. Some data is available in table, PDF, and html formats. Detailed information about the data is available under “Data Documentation and Variable Recodes”.
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Cancer diagnoses and age-standardised incidence rates for all types of cancer by age and sex including breast, prostate, lung and colorectal cancer.
Signatures of Oncogenic Pathway Deregulation in Human Cancers. The ability to define cancer subtypes, recurrence of disease, and response to specific therapies using DNA microarray-based gene expression signatures has been demonstrated in multiple studies. Such data is also of substantial importance to the analysis of cellular signaling pathways central to the oncogenic process. With this focus, we have developed a series of gene expression signatures that reliably reflect the activation status of several oncogenic pathways. When evaluated in several large collections of human cancers, these gene expression signatures identify patterns of pathway deregulation in tumors, and clinically relevant associations with disease outcomes. Combining signature-based predictions across several pathways identifies coordinated patterns of pathway deregulation that distinguish between specific cancers and tumor sub-types. Clustering tumors based on pathway signatures further defines prognosis in respective patient subsets, demonstrating that patterns of oncogenic pathway deregulation underlie the development of the oncogenic phenotype and reflect the biology and outcome of specific cancers. Furthermore, predictions of pathway deregulation in cancer cell lines are shown to coincide with sensitivity to therapeutic agents that target components of the pathway, underscoring the potential for such pathway prediction to guide the use of targeted therapeutics. Keywords: other Overall design: RNA was extracted from frozen tissue of primary breast tumors for gene array analysis.
The Hallmarks of Cancer (*HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus.
The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 patients using different magnifying factors (40X, 100X, 200X, and 400X). It contains 2,480 benign and 5,429 malignant samples (700X460 pixels, 3-channel RGB, 8-bit depth in each channel, PNG format). This database has been built in collaboration with the P&D Laboratory - Pathological Anatomy and Cytopathology, Parana, Brazil.
Paper: F. A. Spanhol, L. S. Oliveira, C. Petitjean and L. Heutte, "A Dataset for Breast Cancer Histopathological Image Classification," in IEEE Transactions on Biomedical Engineering, vol. 63, no. 7, pp. 1455-1462, July 2016, doi: 10.1109/TBME.2015.2496264
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information of 213 cancer patients undergoing clinical or surgical treatment characterized on sociodemographic and clinical data as well as data from the Care Transition Measure (CTM 15-Brazil). Data collection was carried out 7 to 30 days after their discharge from hospital from June to August 2019. Understanding these data can contribute to improving quality of care transitions and avoiding hospital readmissions. To this end, this dataset contains a broad array of variables:
*gender
*age group
*place of residence
*race
*marital status
*schooling
*paid work activity
*type of treatment
*cancer staging
*metastasis
*comorbidities
*main complaint
*continue use medication
*diagnosis
*cancer type
*diagnostic year
*oncology treatment
*first hospitalization
*readmission in the last 30 days
*number of hospitalizations in the last 30 days
*readmission in the last 6 months
*number of hospitalizations in the last 6 months
*readmission in the last year
*number of hospitalizations in the last year
*questions 1-15 from CTM 15-Brazil
The data are presented as a single Excel XLSX file: cancer patient´s care transitions dataset.xlsx.
The analyses of the present dataset have the potential to generate hospital readmission prevention strategies to be implemented by the hospital team. Researchers who are interested in CTs of cancer patients can extensively explore the variables described here.
The project from which these data were extracted was approved by the institution’s research ethics committee (approval n. 3.266.259/2019) at Associação Hospital de Caridade Ijuí, Rio Grande do Sul, Brazil.
Breast cancer (BC) has become the greatest threat to women’s health worldwide. Clinically, identification of axillary lymph node (ALN) metastasis and other tumor clinical characteristics such as ER, PR, and so on, are important for evaluating the prognosis and guiding the treatment for BC patients.
Several studies intended to predict the ALN status and other tumor clinical characteristics by clinicopathological data and genetic testing score. However, due to the relatively poor predictive values and high genetic testing costs, these methods are often limited. Recently, deep learning (DL) has enabled rapid advances in computational pathology, DL can perform high-throughput feature extraction on medical images and analyze the correlation between primary tumor features and above status. So far, there is no relevant research on preoperatively predicting ALN metastasis and other tumor clinical characteristics based on WSIs of primary BC samples.
Our paper has introduced a new dataset of Early Breast Cancer Core-Needle Biopsy WSI (BCNB), which includes core-needle biopsy whole slide images (WSIs) of early breast cancer patients and the corresponding clinical data. The WSIs have been examined and annotated by two independent and experienced pathologists blinded to all patient-related information.
Based on this dataset, we have studied the deep learning algorithm for predicting the metastatic status of ALN preoperatively by using multiple instance learning (MIL), and have achieved the best AUC of 0.831 in the independent test cohort. For more details, please review our paper.
There are WSIs of 1058 patients, and only part of tumor regions are annotated in WSIs. Except for the WSIs, we have also provided the clinical characteristics of each patient, which includes age, tumor size, tumor type, ER, PR, HER2, HER2 expression, histological grading, surgical, Ki67, molecular subtype, number of lymph node metastases, and the metastatic status of axillary lymph node (ALN). The dataset has been desensitized, and not contained the privacy information of patients.
Based on this dataset, we have studied the prediction of the metastatic status of axillary lymph node (ALN) in our paper, which is a weakly supervised classification task. However, other researches based on our dataset are also feasible, such as the prediction of histological grading, molecular subtype, HER2, ER, and PR. We do not limit the specific content for your research, and any research based on our dataset is welcome.
Please note that the dataset is only used for education and research, and the usage for commercial and clinical applications is not allowed. The usage of this dataset must follow the license.
This data was donated by researchers of the University of Wisconsin and includes the measurements from digitized images of fine-needle aspirate of a breast mass.
You can find the dataset at https://github.com/dataspelunking/MLwR/blob/master/Machine%20Learning%20with%20R%20(2nd%20Ed.)/Chapter%2003/wisc_bc_data.csv.
The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis and 30 are numeric-valued laboratory measurements. The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign.
The other 30 numeric measurements comprise the mean, standard error and worst (i.e. largest) value for 10 different characteristics of the digitized cell nuclei, which are as follows:-