Abstract
The passage of SB-1421 and SB-16 made publicly available, for the first time in California’s history, police records related to uses of force and misconduct. Before SB-1421 had even gone into effect, some agencies began shredding records to avoid disclosing them. It was imperative for the accountability community to collect, preserve, and make searchable these documents, which pierced a decades-long veil of secrecy around police records and internal investigative practices. But with over 700 agencies statewide potentially holding disclosable records, advocates had to move quickly to secure the records and make them available to the public. The Community Law Enforcement Accountability Network (CLEAN) brought together journalists, data scientists, defense attorneys, civil rights advocates, freedom of information advocates, and academic researchers. The group made thousands of Public Records Act (PRA) requests and followed up on them throughout the year, filing dozens of lawsuits to compel recalcitrant agencies to produce records.
The resulting deluge of data – over 20 terabytes of scanned PDFs, body-worn camera footage, photographs, and audio recordings of witness interviews – could be put to use for accountability purposes only with expertise in police investigations and sophisticated data processing tools.
CLEAN’s data science team, working out of the Berkeley Institute for Data Science, leveraged generative AI to organize records into cases, redact sensitive information, and create powerful search indexes over the data, and built interfaces to view and annotate this historic collection.
Our project was only possible because of the existing open-source ecosystem, and we are now working to open-source our data processing and annotation tools, with the goal of extending this collaborative model of independent police oversight and accountability to other states.

Tarak Shah | BIDS / Community Law Enforcement Accountability Network
Tarak Shah is a data scientist with the Human Rights Data Analysis Group (HRDAG), and he serves as program manager for the Community Law Enforcement Accountability Network (CLEAN). He has over seven years of experience applying quantitative analysis in the service of human rights, including work with the Innocence Project of New Orleans, the Invisible Institute, the San Francisco Public Defenders’ Office, the UN Office of the High Commissioner for Human Rights, la Comisión de la Verdad (the Truth Commission) in Colombia, and the ACLU. He further provides ongoing analytical support to community organizations working for justice and accountability including Berkeley Copwatch, the Chicago Torture Justice Center, and Kilómetro Cero (Puerto Rico). Tarak specializes in working with large collections of unstructured data, statistical analysis of racial disparities in criminal-legal outcomes, database deduplication, and connecting machine learning tools with community-based expertise to illuminate under-documented forms of violence.