Foundation models are trained on massive amounts of data to distinguish complex patterns and can be adapted to a wide range of downstream tasks with minimal computational resources. Here, we develop a foundation model for prostate cancer digital pathology called HistoEncoder by pre-training on 48 million prostate tissue tile images. We demonstrate that HistoEncoder features extracted from tile images with similar histological patterns map closely together in the feature space. HistoEncoder outperforms models pre-trained with natural images, even without fine-tuning or with 1000 times less training data. We describe two use cases that leverage the capabilities of HistoEncoder by fine-tuning the model with a limited amount of data and computational resources. First, we show how HistoEncoder can be used to automatically annotate large-scale datasets with high accuracy. Second, we combine histomics with commonly used clinical nomograms, significantly improving survival models for prostate cancer-specific death. Foundation models such as HistoEncoder can allow organizations with limited resources to build effective clinical software tools without needing extensive datasets or significant computing resources.
Purpose: The use of MRI-targeted biopsies has led to lower detection of Gleason Grade Group 1 (GG1) prostate cancer and increased detection of GG2 disease. Although this finding is generally attributed to improved sensitivity and specificity of MRI for aggressive cancers, it might also be explained by grade inflation. Our objective was to determine the likelihood of definitive treatment and the risk of post-treatment recurrence for patients with GG2 cancer diagnosed using targeted biopsies relative to men with GG1 cancer diagnosed using systematic biopsies. Materials and Methods: We performed a retrospective study using a large tertiary center registry (HUS Acamedic Datalake) to retrieve data on prostate cancer diagnosis, treatment, and cancer recurrence. We included patients diagnosed from 1993 to 2019 with either GG1 cancer on systematic biopsies (3317 men) or GG2 cancer on targeted biopsies (554 men). We assessed the likelihood of curative treatment and the risk of recurrence after treatment. Kaplan-Meier survival curves were computed to assess treatment-free and recurrence-free survival. Cox proportional hazards regression analysis was performed to assess the risk of post-treatment recurrence. Results: Patients with systematic biopsy-detected GG1 cancer had a significantly longer median time-to-treatment (31 months) than those with targeted biopsy-detected GG2 cancer (4 months, p<0.0001). Risk of recurrence after curative treatment was similar between groups, with the upper bound of the 95% CI excluding an important difference (HR: 0.94, 95% CI [0.71-1.25], p=0.7). Conclusions: GG2 cancers detected by MRI-targeted biopsy are treated more aggressively than GG1 cancers detected by systematic biopsy, despite having similar oncologic risk. To prevent further overtreatment related to the MRI pathway, treatment guidelines from the pre-MRI era need to be updated to consider changes in the diagnostic pathway.
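As an illustration of how a median time-to-treatment is read off a Kaplan-Meier curve, the following is a minimal Python sketch of the product-limit estimator. The function names and the toy follow-up data are hypothetical and do not come from the study, whose software is not stated here; it is a sketch under those assumptions, not the authors' analysis.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up durations (e.g. months from diagnosis)
    events: 1 if the event (e.g. start of treatment) was observed,
            0 if the patient was censored at that time
    Returns a list of (time, survival probability) at each event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    observed = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)          # events plus censorings at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Conditional probability of remaining event-free past time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]         # everyone seen at t leaves the risk set
    return curve


def median_time(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None                       # median not reached during follow-up


# Hypothetical toy data: 7 patients, follow-up in months.
toy_times = [2, 3, 3, 5, 8, 8, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_median = median_time(toy_curve)
```

Censored patients contribute to the risk set until their last follow-up but never trigger a drop in the curve, which is why the estimator handles registry data with unequal follow-up.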
The presence of detailed clinical information in electronic health record (EHR) systems presents promising prospects for enhancing patient care through automated retrieval techniques. Nevertheless, it is widely acknowledged that accessing data within EHRs is hindered by various methodological challenges. Specifically, the clinical notes stored in EHRs are written in narrative form, making them prone to ambiguous formulations and highly unstructured data presentations, while structured reports commonly suffer from missing and/or erroneous data entries. This inherent complexity poses significant challenges for automated large-scale medical knowledge extraction, necessitating the application of advanced tools, such as natural language processing (NLP), as well as data audit techniques. This work aims to address these obstacles by creating and validating a novel pipeline designed to extract relevant data pertaining to prostate cancer patients. The objective is to exploit the inherent redundancies between the structured and unstructured data entries in EHRs in order to generate comprehensive and reliable medical databases, ready to be used in advanced research studies. Additionally, the study explores potential opportunities arising from these data, offering valuable prospects for advancing research in prostate cancer.
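The idea of exploiting redundancy between structured fields and narrative notes can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the regular expression, the function names, the status labels, and the choice of Gleason pattern as the extracted variable are hypothetical and are not taken from the pipeline described above.

```python
import re

# Hypothetical pattern: matches e.g. "Gleason 3+4" or "Gleason score 4 + 3 = 7".
GLEASON_RE = re.compile(
    r"gleason(?:\s+score)?\s*:?\s*([1-5])\s*\+\s*([1-5])", re.IGNORECASE
)


def gleason_from_note(note):
    """Return (primary, secondary) Gleason patterns found in free text, or None."""
    match = GLEASON_RE.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reconcile(structured, note):
    """Cross-check a structured entry against the narrative note.

    structured: (primary, secondary) tuple from the structured report, or None
    note:       free-text clinical note
    Returns (value, status); conflicts are flagged for audit, not resolved.
    """
    extracted = gleason_from_note(note)
    if structured is None and extracted is None:
        return None, "missing"
    if structured is None:
        return extracted, "filled_from_note"
    if extracted is None:
        return structured, "structured_only"
    if extracted == structured:
        return structured, "confirmed"
    return None, "conflict"
```

For example, `reconcile(None, "Biopsy: Gleason score 3 + 4 = 7 in two cores")` fills the missing structured value from the note, while a structured `(3, 4)` paired with a note stating "Gleason 4+3" is returned as a conflict for manual review.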