Breast Cancer Histopathological Database (BreaKHis)

The Breast Cancer Histopathological Image Classification (BreaKHis) database is composed of 7,909 microscopic images of breast tumor tissue collected from 82 patients using different magnifying factors (40X, 100X, 200X, and 400X). To date, it contains 2,480 benign and 5,429 malignant samples (700X460 pixels, 3-channel RGB, 8-bit depth in each channel, PNG format). This database has been built in collaboration with the P&D Laboratory – Pathological Anatomy and Cytopathology, Parana, Brazil (http://www.prevencaoediagnose.com.br). We believe that researchers will find this database a useful tool since it makes future benchmarking and evaluation possible.

Characteristics

The BreaKHis dataset is divided into two main groups: benign tumors and malignant tumors. Histologically, benign refers to a lesion that does not meet any criterion of malignancy, e.g., marked cellular atypia, mitosis, disruption of basement membranes, or metastasis. Benign tumors are normally relatively "innocent": they grow slowly and remain localized. Malignant tumor is a synonym for cancer: the lesion can invade and destroy adjacent structures (locally invasive) and spread to distant sites (metastasize), which can cause death.

In the current version, the samples in the dataset were collected by the SOB method, also named partial mastectomy or excisional biopsy. Compared with needle biopsy methods, this type of procedure removes a larger tissue sample and is done in a hospital under general anesthesia.

BreaKHis 1.0 is structured as follows:

Magnification      Benign   Malignant   Total
40X                625      1,370       1,995
100X               644      1,437       2,081
200X               623      1,390       2,013
400X               588      1,232       1,820
Total of Images    2,480    5,429       7,909
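
The counts in the table can be cross-checked with a few lines of Python. The sketch below is only an illustration: the names COUNTS, row_summary, and column_totals are hypothetical, and the 40X benign count is taken as 625, the value consistent with both the 40X row total (1,995) and the benign column total (2,480).

```python
# Image counts per magnification as (benign, malignant).
# Assumption: 625 benign images at 40X, which matches the stated totals.
COUNTS = {
    "40X": (625, 1370),
    "100X": (644, 1437),
    "200X": (623, 1390),
    "400X": (588, 1232),
}


def row_summary(counts):
    """Return the total and the malignant share for each magnification."""
    summary = {}
    for magnification, (benign, malignant) in counts.items():
        total = benign + malignant
        summary[magnification] = {
            "total": total,
            "malignant_share": malignant / total,
        }
    return summary


def column_totals(counts):
    """Return (benign, malignant, all images) summed over magnifications."""
    benign = sum(b for b, _ in counts.values())
    malignant = sum(m for _, m in counts.values())
    return benign, malignant, benign + malignant


if __name__ == "__main__":
    for magnification, row in row_summary(COUNTS).items():
        print(magnification, row["total"], f"{row['malignant_share']:.1%}")
    print("totals:", column_totals(COUNTS))
```

With these counts, malignant images make up a little over two thirds of the dataset at every magnification, so a classifier evaluated on it should probably be compared against a majority-class baseline rather than against 50% accuracy.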

Both benign and malignant breast tumors can be sorted into different types based on how the tumor cells look under the microscope. The various types and subtypes of breast tumors can have different prognoses and treatment implications. The dataset currently contains four histologically distinct types of benign breast tumors: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA); and four malignant tumors (breast cancer): ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC).

Each image filename stores information about the image itself: biopsy procedure, tumor class, tumor type, patient identification, and magnification factor. For example, SOB_B_TA-14-4659-40-001.png is image 1, at magnification factor 40X, of a benign tumor of type tubular adenoma, originating from slide 14-4659, which was collected by the SOB procedure. More formally, the format of the image filename is given by the following BNF notation:

<BIOPSY_PROCEDURE>_<TUMOR_CLASS>_<TUMOR_TYPE>-<YEAR>-<SLIDE_ID>-<MAG>-<SEQ>
<BIOPSY_PROCEDURE>::=SOB
<TUMOR_CLASS>::=M|B
<TUMOR_TYPE>::=<BENIGN_TYPE>|<MALIGNANT_TYPE>
<BENIGN_TYPE>::=A|F|PT|TA
<MALIGNANT_TYPE>::=DC|LC|MC|PC
<YEAR>::=<DIGIT><DIGIT>
<SLIDE_ID>::=<NUMBER><SEC>|<NUMBER>
<SEQ>::=<NUMBER>
<MAG>::=40|100|200|400
<NUMBER>::=<NUMBER><DIGIT>|<DIGIT>
<SEC>::=<SEC><LETTER>|<LETTER>
<DIGIT>::=0|1|…|9
<LETTER>::=A|B|…|Z
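
The grammar above translates fairly directly into a regular expression. The following is a minimal sketch, not part of the dataset tooling: FILENAME_RE, ImageInfo, and parse_filename are hypothetical names. It assumes that the letter suffix of the slide identifier is optional (the example slide 14-4659 has none) and that a trailing .png extension may or may not be present.

```python
import re
from typing import NamedTuple

# Tumor type codes per class, as listed in the dataset description.
BENIGN_TYPES = {"A": "adenosis", "F": "fibroadenoma",
                "PT": "phyllodes tumor", "TA": "tubular adenoma"}
MALIGNANT_TYPES = {"DC": "ductal carcinoma", "LC": "lobular carcinoma",
                   "MC": "mucinous carcinoma", "PC": "papillary carcinoma"}

# Assumptions: slide letters are optional, and ".png" is optional.
FILENAME_RE = re.compile(
    r"(?P<procedure>SOB)_(?P<tumor_class>[BM])"
    r"_(?P<tumor_type>A|F|PT|TA|DC|LC|MC|PC)"
    r"-(?P<year>\d{2})-(?P<slide_id>\d+[A-Z]*)"
    r"-(?P<mag>40|100|200|400)-(?P<seq>\d+)(?:\.png)?"
)


class ImageInfo(NamedTuple):
    procedure: str
    tumor_class: str
    tumor_type: str
    year: str
    slide_id: str
    magnification: int
    sequence: int


def parse_filename(name: str) -> ImageInfo:
    """Split a BreaKHis-style file name into its fields."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized file name: {name!r}")
    # The grammar alone allows any type with any class; reject mismatches.
    allowed = BENIGN_TYPES if match["tumor_class"] == "B" else MALIGNANT_TYPES
    if match["tumor_type"] not in allowed:
        raise ValueError(f"tumor type does not match tumor class: {name!r}")
    return ImageInfo(
        procedure=match["procedure"],
        tumor_class=match["tumor_class"],
        tumor_type=match["tumor_type"],
        year=match["year"],
        slide_id=match["slide_id"],
        magnification=int(match["mag"]),
        sequence=int(match["seq"]),
    )


if __name__ == "__main__":
    print(parse_filename("SOB_B_TA-14-4659-40-001.png"))
```

For the example file name, the parsed fields are procedure SOB, class B, type TA, year 14, slide 4659, magnification 40, and sequence 1. Grouping images by the (year, slide_id) pair is one way to keep all images from the same slide together when building custom splits.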

A slide of a malignant breast tumor (stained with HE) seen at different magnification factors: (a) 40X, (b) 100X, (c) 200X, and (d) 400X.

How to obtain access to the images

The BreaKHis database may be used for non-commercial research provided you acknowledge the source of the images by citing the following paper in publications about your research:

[1] Spanhol, F., Oliveira, L. S., Petitjean, C., Heutte, L., A Dataset for Breast Cancer Histopathological Image Classification, IEEE Transactions on Biomedical Engineering (TBME), 63(7):1455-1462, 2016.

You may download the BreaKHis database using this link:

http://www.inf.ufpr.br/vri/databases/BreaKHis_v1.tar.gz

We kindly ask you to fill out the following form after downloading the dataset.

If you want to use the same 5-fold structure we used in [1], you can download this Python script. Then follow these steps:
  1. decompress the file mkfold.tar.gz
  2. copy the file BreaKHis_v1.tar.gz into the mkfold directory
  3. decompress the BreaKHis_v1.tar.gz file
  4. run the script <python mkfold.py>

It will create five directories inside the mkfold directory containing the structure used in [1].
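
The four steps above can also be scripted. The following is a rough sketch under stated assumptions, not an official tool: prepare_mkfold is a hypothetical helper, both archive paths are supplied by the caller, the script archive is assumed to unpack into a directory named mkfold containing mkfold.py, and the script is assumed to run under the same Python interpreter that runs this helper (the page does not state which Python version mkfold.py expects).

```python
import shutil
import subprocess
import sys
import tarfile
from pathlib import Path


def prepare_mkfold(mkfold_archive, dataset_archive, workdir, run_script=True):
    """Carry out the four setup steps and return the mkfold directory."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Step 1: decompress the archive that contains the fold script.
    with tarfile.open(mkfold_archive) as archive:
        archive.extractall(workdir)
    mkfold_dir = workdir / "mkfold"  # assumed top-level directory name
    if not mkfold_dir.is_dir():
        raise FileNotFoundError(f"expected directory not found: {mkfold_dir}")

    # Step 2: copy the dataset archive into the mkfold directory.
    copied_archive = Path(shutil.copy(dataset_archive, mkfold_dir))

    # Step 3: decompress the dataset archive in place.
    with tarfile.open(copied_archive) as archive:
        archive.extractall(mkfold_dir)

    # Step 4: run the script from inside the mkfold directory.
    if run_script:
        subprocess.run([sys.executable, "mkfold.py"], cwd=mkfold_dir, check=True)

    return mkfold_dir
```

A call might look like prepare_mkfold(script_archive_path, dataset_archive_path, "work"), where both paths point to the files downloaded beforehand. Passing run_script=False stops after extraction, which can be useful for inspecting mkfold.py before executing it.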

Some Statistics (updated Nov 30, 2022)

This dataset has been downloaded 7,726 times from 141 different countries.

Creative Commons License
BreaKHis – Breast Cancer Histopathological Database by Spanhol, F., Oliveira, L. S., Petitjean, C. and Heutte, L. is licensed under a Creative Commons Attribution 4.0 International License.