Your datasets, under your control: Introducing the Mozilla Data Collective

Sunday 2:00 PM–2:30 PM in Ballroom 1

You're probably familiar with Mozilla's Common Voice project - a crowd-sourced, collective effort that's provided speech data in over 300 languages to help democratise voice technologies and make speech recognition work better for more people.

Now, we're delighted to introduce a sister platform to Common Voice - the Mozilla Data Collective.

The Problem

AI has a data crisis. We're running out of quality training data because the entire web has already been harvested by crawlers to train AI models, leading to the "Token Crisis". What’s left? Synthetic data generated en masse that’s bland, generic and unrepresentative of the world’s diversity. Training models on this data is also problematic, as it can lead to model collapse. Meanwhile, quality datasets from diverse contributors sit unused in silos.

The Solution

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on and to make of what you will. It works by letting you share your data, retain ownership of it, and control who uses it. Upload datasets from research, community collections or specialised corpora. Set your terms - who uses it, for what purpose, and what you get in return. Keep control by tracking who’s using your datasets.

Why It Matters

Our vision is to encourage the creation of safe, responsible AI that works for everyone, by helping communities to share authentic, ethical and diverse data. That stands in stark contrast to models built by indiscriminately scraping the web and reproducing or synthesising its Anglocentric, white, male biases.

We want people and organisations - like you - to be able to create, curate and control your data, rather than have it harvested and scraped without your knowledge or consent.

Join Kathy Reid as she walks you through why better AI requires better data - and why better data requires collective, collaborative, co-created approaches: Mozilla Data Collective.

Kathy Reid she/her • @kathyreid@aus.social

Kathy Reid works at the intersection of open source, emerging technologies and technical communities.

Over the last 20 years, she has held several technical leadership positions, including Digital Platforms and Operations Manager at Deakin University, where she managed platforms such as WordPress, Drupal, Squiz Matrix and Atlassian Confluence, and technical lead on projects involving digital signage and videoconferencing. She has also worked as a web and application developer.

More recently, she has run her own technical consulting micro-business, and has been engaged on a variety of projects involving data visualisation, certification applications and emerging technologies workshops.

She was previously Director of Developer Relations at Mycroft.AI, an open source voice assistant startup, and President of Linux Australia, Inc, a not-for-profit organisation which advocates for the use of open source technologies and runs technical events such as Linux Conference Australia. She brought GovHack – the open data hackathon – to Geelong in 2015 and 2016, and in 2011 ran Geelong’s first unconference – BarCampGeelong. Most recently, she worked as a voice open source specialist for Mozilla.

Kathy holds Arts and Science undergraduate degrees from Deakin University, an MBA (Computing) from Charles Sturt University and a Master in Applied Cybernetics (MAppCyber) from the Australian National University, as well as several ITIL qualifications.

In 2019, she was one of 16 people from across the world chosen to undertake a Master's program in a brand new branch of engineering at the Australian National University's 3A Institute. She is now a PhD candidate there, researching voice data and ways to prevent and respond to bias in machine learning systems that use voice and speech, such as speech recognition.

Kathy currently works with the Common Voice team at Mozilla Foundation as an engineer.