De-identification
De-identification is the process of removing or obscuring identifiers from data so that an individual cannot reasonably be identified. HIPAA recognizes two methods: Safe Harbor and Expert Determination.
Safe Harbor vs Expert Determination
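Under the HIPAA Privacy Rule, protected health information (PHI) can be de-identified by either of two methods. Safe Harbor (45 CFR 164.514(b)(2)) requires removing 18 categories of identifiers of the individual and of the individual's relatives, employers, and household members: names; geographic subdivisions smaller than a state (the first three ZIP code digits may be kept when the area they cover holds more than 20,000 people); all elements of dates other than the year, with ages over 89 aggregated into a single "90 or older" category; telephone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate or license numbers; vehicle identifiers and serial numbers; device identifiers and serial numbers; web URLs; IP addresses; biometric identifiers; full-face photographs and comparable images; and any other unique identifying number, characteristic, or code. In addition, the covered entity must have no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify an individual.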
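As a rough illustration of what Safe Harbor processing does to a single record, the sketch below drops direct identifiers, truncates ZIP codes to three digits, reduces dates to the year, and top-codes ages of 90 and above. The field names, the DIRECT_IDENTIFIERS set, and the LOW_POPULATION_ZIP3 set are assumptions made for this example, not part of the regulation; a real pipeline must cover all 18 categories, including identifiers buried in free text, and should derive the low-population ZIP prefixes (those covering 20,000 or fewer people, which are replaced with "000") from current Census data.

```python
from datetime import date

# Hypothetical column names; a real schema will differ and must cover all 18 categories.
DIRECT_IDENTIFIERS = {
    "name", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_number", "ip_address", "url",
}

# Illustrative placeholder: 3-digit ZIP prefixes covering 20,000 or fewer people.
# Derive the real set from current Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}


def safe_harbor_sketch(record: dict) -> dict:
    """Return a copy of `record` with a few Safe Harbor style transformations applied."""
    # Drop fields that are direct identifiers.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Keep only the first three ZIP digits, or "000" for low-population prefixes.
    if "zip" in out:
        zip3 = str(out.pop("zip"))[:3]
        out["zip3"] = "000" if zip3 in LOW_POPULATION_ZIP3 else zip3

    # Reduce dates to the year only.
    for field in ("admission_date", "birth_date"):
        if field in out:
            out[field.replace("_date", "_year")] = out.pop(field).year

    # Aggregate ages of 90 and above, and drop a birth year that would reveal such an age.
    if isinstance(out.get("age"), int) and out["age"] >= 90:
        out["age"] = "90+"
        out.pop("birth_year", None)

    return out


example = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip": "03601",
    "admission_date": date(2021, 3, 14),
    "age": 93,
    "diagnosis": "I10",
}
print(safe_harbor_sketch(example))
```

Passing such a transformation does not by itself satisfy Safe Harbor: the actual-knowledge condition still applies to whatever remains.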
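Expert Determination (45 CFR 164.514(b)(1)) requires a person with appropriate knowledge of and experience with generally accepted statistical and scientific methods (in practice often a statistician) to determine that the risk is "very small" that an anticipated recipient could identify an individual from the information, alone or in combination with other reasonably available information. The expert must also document the methods and results that justify the determination. Expert Determination can allow retention of data that Safe Harbor would strip (e.g., full ZIP codes, exact dates) when the expert documents that the risk remains very small, for example because of controls on who receives the data.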
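The regulation does not prescribe a particular risk metric. One measure that often appears in this kind of analysis is the size of the smallest group of records sharing the same quasi-identifier values (the "k" in k-anonymity). The sketch below computes it for a toy dataset; the function name, column names, and rows are assumptions for illustration, and an actual determination weighs far more than this single number.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Toy data with hypothetical columns.
rows = [
    {"zip3": "021", "birth_year": 1980},
    {"zip3": "021", "birth_year": 1980},
    {"zip3": "021", "birth_year": 1980},
    {"zip3": "100", "birth_year": 1975},
    {"zip3": "100", "birth_year": 1975},
]

k = smallest_group_size(rows, ["zip3", "birth_year"])
# With k = 2, a recipient who knows someone's ZIP prefix and birth year
# narrows that person down to at most a 1-in-2 guess within this dataset.
print(k, 1 / k)
```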
De-identification for AI training
AI vendors that train on healthcare-adjacent data face a specific tension: Safe Harbor strips the date and geographic precision that is often essential for clinical AI utility. Many vendors therefore rely on Expert Determination for their training datasets and describe the methodology in their compliance materials. Ask vendors which method they used, who the expert was, and when the determination was last refreshed, since re-identification risk changes as new external data becomes available.
Limits of de-identification
De-identification is not a guarantee of anonymity, and re-identification attacks continue to be published in the academic literature. A 2019 study (Rocher et al., Nature Communications) estimated that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. For high-risk AI deployments, treat de-identified data as still potentially sensitive and layer access controls accordingly.
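The intuition behind such results is that each additional attribute splits records into smaller groups until most people are alone in theirs. The toy sketch below measures the fraction of records that are unique on a growing set of attributes. The function name, columns, and rows are invented for illustration, and this simple count covers uniqueness within one sample only; it is not the population-level statistical model used by Rocher et al.

```python
from collections import Counter


def unique_fraction(rows, attributes):
    """Return the fraction of rows whose combination of `attributes` appears exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Toy data with hypothetical columns.
rows = [
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1980, "sex": "M"},
    {"zip3": "021", "birth_year": 1975, "sex": "F"},
    {"zip3": "100", "birth_year": 1980, "sex": "F"},
]

for attrs in (["zip3"], ["zip3", "birth_year"], ["zip3", "birth_year", "sex"]):
    print(attrs, unique_fraction(rows, attrs))
# Uniqueness climbs from 0.25 to 0.5 to 1.0 as attributes are added.
```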