The State of Fair Synthetic Data in 2021

Written by

Michael Platzer

Last year MOSTLY AI has introduced and demonstrated the groundbreaking idea of generating fair synthetic data. I.e., data, that is representative of the real world, but that has unwanted biases, unwanted relations, surgically removed from it at the same time. Machine learning models that are then trained on fair synthetic data will be fair by design. It’s a thought-provoking paradigm shift, that will allow organizations to govern not only Privacy but also Fairness within AI at its source, that is the AI training data itself.

Fast forward to today, we are excited to see many things happening around fairness:

The broad public interest in fairness and AI bias has drastically picked up, resulting in media coverage, documentaries, books, public debates, analyst reports, etc.
The regulators are becoming active, with most notably the European Commission proposing an AI regulation, that explicitly demands that training data shall be fair & representative. US regulators are expected to follow suit, particularly within high-risk domains, like finance and health care.
Leading AI conferences, like the ICLR, expand beyond accuracy and dedicate workshops to ethics, like Responsible AI or Synthetic Data for Privacy.

Speaking of ICLR, we had the honor to present our work on fair synthetic data at this year’s conference. This is another recognition of our work, which was already featured by Andrew Ng, IEEE Spectrum, Forbes, Slate, and many more. While the corresponding research paper is now available on arxiv.org, and the Fair Synthetic Data poster is accessible here, we summarize the key take aways once more:

MOSTLY AI’s technology allows to generate Synthetic Data that is both statistical representative as well as fair - in the sense that it adheres to provided fairness constraints
The trade-off between representativeness and fairness can be explicitly controlled for.
Hidden proxy variables (e.g., body height serving as proxy for gender, or ZIP codes for race), which pose a significant challenge in combating biases, are successfully controlled for.
And fairness can be established while only marginally impacting the downstream machine learning utility. One thus does not need to compromise significantly on accuracy in the pursuit of Ethical AI. Find out how to create fair synthetic data from our earlier blog post!

Figure 1. MOSTLY AI's presented empirical results on fair US census data

Last but not least, and as further testimony to the validity of the approach, Amazon just published a paper on fast fair synthetic data as well. And as their study leverages the same US census data as our paper, it allows for a direct comparison of results. We always knew that the quality of our synthetic data is unparalleled, but even we were amazed to see the effectiveness of our approach, as we excel on every single available dimension of realism, accuracy, as well as fairness:

Figure 2. MOSTLY AI's approach to fair synthetic data compared to previous work

So, if Ethical AI is a priority for your organization or you deploy AI algorithms that directly impact the lives of individuals, then talk to us, and let’s discuss how we can get you started with fair synthetic data today. If you would like to learn more about fair synthetic data, read our Fairness Series!

What is data bias? Data bias is the systematic error introduced into data workflows and machine learning (ML) models due to inaccurate, missing, or incorrect data points which fail to accurately represent the population. Data bias in AI systems can lead to poor decision-making, costly compliance issues as well as drastic societal consequences. Amazon’s gender-biased […]

Ready to get started?

Get started for free or get in touch with our sales team for a demo.

Get started free Request a demo

Name	Borlabs Cookie
Provider	Owner of this website, Imprint
Purpose	Saves the visitors preferences selected in the Cookie Box of Borlabs Cookie.
Cookie Name	borlabs-cookie
Cookie Expiry	1 Year

Name	Google Tag Manager
Provider	Google Ireland Limited, Gordon House, Barrow Street, Dublin 4, Ireland
Purpose	Used to control advanced script and event handling.
Privacy Policy	https://policies.google.com/privacy?hl=en
Cookie Name	-

Accept	Google Analytics
Name	Google Analytics
Provider	Google Ireland Limited, Gordon House, Barrow Street, Dublin 4, Ireland
Purpose	Cookie by Google used for website analytics. Generates statistical data on how the visitor uses the website.
Privacy Policy	https://policies.google.com/privacy?hl=en
Cookie Name	_ga,_gat,_gid
Cookie Expiry	2 Years

Accept	LinkedIn Insight Tag (LinkedIn Pixel)
Name	LinkedIn Insight Tag (LinkedIn Pixel)
Provider	LinkedIn Corporation
Purpose	The LinkedIn Insight Tag is a lightweight JavaScript tag that powers conversion tracking, website audiences, and website demographics for LinkedIn ad campaigns.
Privacy Policy	https://www.linkedin.com/legal/privacy-policy
Cookie Expiry	2 Years

Accept	Hotjar
Name	Hotjar
Provider	Hotjar Ltd., Dragonara Business Centre, 5th Floor, Dragonara Road, Paceville St Julian's STJ 3141 Malta
Purpose	Hotjar is an user behavior analytic tool by Hotjar Ltd.. We use Hotjar to understand how users interact with our website.
Privacy Policy	https://www.hotjar.com/legal/policies/privacy/
Host(s)	*.hotjar.com
Cookie Name	_hjClosedSurveyInvites, _hjDonePolls, _hjMinimizedPolls, _hjDoneTestersWidgets, _hjIncludedInSample, _hjShownFeedbackMessage, _hjid, _hjRecordingLastActivity, hjTLDTest, _hjUserAttributesHash, _hjCachedUserAttributes, _hjLocalStorageTest, _hjptid
Cookie Expiry	Session / 1 Year

Accept	Mixpanel
Name	Mixpanel
Provider	Mixpanel S.L
Purpose	We utilize Mixpanel cookies to gather data regarding your usage of our website. This information enables us to comprehend your interests, enhance our products and services, and deliver an improved user experience on our website.
Privacy Policy	https://mixpanel.com/legal/privacy-policy
Host(s)	*.mostly.ai, mostly.ai
Cookie Name	_mp

Accept	HubSpot
Name	HubSpot
Provider	HubSpot Inc., 25 First Street, 2nd Floor, Cambridge, MA 02141, USA
Purpose	HubSpot is a user database management service provided by HubSpot, Inc. We use HubSpot on this website for our online marketing activities.
Privacy Policy	https://legal.hubspot.com/privacy-policy
Host(s)	*.hubspot.com, hubspot-avatars.s3.amazonaws.com, hubspot-realtime.ably.io, hubspot-rest.ably.io, js.hs-scripts.com
Cookie Name	__hs_opt_out, __hs_d_not_track, hs_ab_test, hs-messages-is-open, hs-messages-hide-welcome-message, __hstc, hubspotutk, __hssc, __hssrc, messagesUtk
Cookie Expiry	Session / 30 Minutes / 1 Day / 1 Year / 13 Months

Accept	VWO
Name	VWO
Provider	VWO, Wingify Software Pvt., Heidenkampsweg 58, Hamburg, 20097, Germany
Purpose	VWO allows website owners to conduct A/B testing, create heatmaps, and track user behavior to optimize their website's performance and user experience. We use VWO on this website for our online marketing activities.
Privacy Policy	https://vwo.com/privacy-policy/
Host(s)	mostly.ai
Cookie Name	_vis_opt_exp_#_goal_#, _vis_opt_test_cookie, _vis_opt_exp_#_combi, _vis_opt_exp_#_exclude, _vis_opt_exp_#_split, _vis_opt_s, _vis_opt_out, _vwo_uuid, _vwo_uuid_#, _vwo_ds, _vwo_sn, _vwo_uuid_v2, _vis_opt_exp_#_combi_choose, _vwo_referrer, _vwo, wingify_push_db_status, wingify_push_subscription_id, wingify_push_subscription_endpoint, pushcrew_opt_out, wingify_push_do_not_show_notification_popup, pshcrw_update_subId, wingify_push_subscription_status, wingify_push_subscriber_lang, wingify_donot_track_actions, wingify_do_not_show_chicklet, _wingify_pc_uuid, wingifyEcomData-, wingify_push_gcm_id, wingifyRetrySegment-, wingifySegment-*, pshcrw_v_k, wingify_push_subscriber_id, _vwo_global_opt_out, _vwo_ssm

The State of Fair Synthetic Data in 2021

Related posts

Ready to get started?