Hellina Hailu Nigatu always loved math and physics – really any field that let her calculate things. Then she found computer science. By the time she started her PhD at UC Berkeley, she had seen how computing skills could address a range of issues, from healthcare to women’s rights.

In her classes, she learned about concepts in Natural Language Processing and large language models developed in English. But Nigatu, who is Ethiopian, found that most of the methods she learned wouldn’t work in the other languages she knew. The data didn’t exist on the internet to build chatbots for the Tigrinya language like the ones already changing how English speakers live and work. And the safeguards blocking some harmful content online weren’t effective for Amharic-speaking users.

“This is what happens when you have diversity in computer science,” said Nigatu, who is starting the fourth year of her doctorate program. “Almost all of my projects are inspired by personal experience.”

“I could be an English speaker who improves machine translation, speech recognition and whatever other technology in Amharic by looking at performance on some evaluation metric,” she said. “But if I did not speak this language, if I was not from Ethiopia, if I was not impacted by this, I wouldn’t have the context to understand the nuanced problems that go beyond automatic metrics.”

The artificial intelligence boom is rapidly transforming modern life. But with a lack of diversity in who informs and develops these technologies and the disparities in the existing data being used, these tools could cement and exacerbate global inequalities.

Nigatu is an up-and-coming expert combating that risk through research and mentorship. She hopes to develop tools that are informed by and useful for communities, including her family and friends, who speak languages that have little available data online.

Preserving languages and cultures that have less data online

Nigatu graduated from Addis Ababa University with a Bachelor of Science in electrical and computer engineering. She earned a Master of Science in computer science from Berkeley and is advised for her PhD by Sarah Chasins and John Canny of Berkeley’s Department of Electrical Engineering and Computer Sciences.

The barriers Nigatu has faced in applying Natural Language Processing tools and techniques to languages with less data available – or low-resourced languages – inspired one of her latest research papers. Early in her doctorate program, she decided to build language models in Amharic and Tigrinya. But when she went to Wikipedia, a common source of data for language processing work, she found there weren’t enough entries in either language to develop those models. The entries that did exist often weren’t of high enough quality to use, either.

Her recent paper from the ACM Conference on Human Factors in Computing Systems – “Low-Resourced Languages and Online Knowledge Repositories: A Need-Finding Study” – looked for an answer to why that hurdle existed. Nigatu and co-authors analyzed Wikipedia forum entries by experienced contributors and conducted interviews with novice contributors to Wikipedia for three languages – Amharic, Tigrinya, and Afan Oromo.

They found several intersecting challenges that made it difficult for people to add entries to the platform in these languages. Wikipedia contributors struggled with the design of the platform and the lack of language-support tools, like keyboards for languages not written in the Latin script. They also ran into challenges stemming from socio-political issues, such as the lack of freely accessible scholarly work on the internet that they could cite in their entries.

The dearth of data on established online knowledge repositories like Wikipedia presents major risks, Nigatu said. These platforms are being used to develop next-generation tools that will be foundational to how we live our lives moving forward. The platforms are already preserving and sharing information and cultural values in higher-resourced languages like English, but aren’t offering equally useful or accessible data for other languages, she said.

“We should be building from the ground up with community values in mind, so that we preserve these languages and these identities without expecting them to adapt to the default,” said Nigatu. These platforms must “serve their needs” and not exploit them or their data, she said.

Improving the safety of the online experience

Nigatu has also found research questions through her own lived experiences. She noticed that when she searched for benign terms in Amharic on YouTube, she’d receive policy-violating, pornographic videos as results. When she dug deeper, she found this was a broader pattern for Amharic search results on the platform. She turned her discovery of this issue into a paper.

Nigatu and co-author Inioluwa Deborah Raji collected data from, and conducted interviews with, users of the platform, focusing on YouTube results for Amharic searches. In their paper “‘I Searched for a Religious Song in Amharic and Got Sexual Content Instead’: Investigating Online Harm in Low-Resourced Languages on YouTube,” they found that content moderation was a major problem: very few human content moderators focus on any non-English language, and automated content moderation doesn’t work well for low-resourced languages.

They also found that people posting on YouTube used techniques to evade moderation, like including “doctor” or other medical terms in their channel names to appear to be offering health advice. And by analyzing the comments, they found that migrant workers in Middle Eastern countries seeking medical advice were often tricked into clicking on pornographic content. Amharic-speaking users interviewed for the paper said they felt disempowered and devalued.

“This project showed that when we ignore a huge population in how we design and how we build technologies, the result is that these populations are disproportionately burdened with the harms,” said Nigatu of the paper, published in the Proceedings of the ACM Conference on Fairness, Accountability, and Transparency.

But for Nigatu, this project also highlighted something else. As she was working on this paper, she had to overcome significant pushback and questions from others about whether this was a problem and whether there was real harm being done here. It made her question whether to tackle the paper at all. She credits Raji for encouraging her to continue.

“When I say ‘online harm,’ people are like, ‘Oh, did it lead to genocide? Did people die from it?’ And I'm like, ‘I mean, harm doesn't have to get to that stage for it to be harmful,’” said Nigatu. “The standards that we have for what's acceptable to certain communities and what's not is very, very different.”

Expanding the field moving forward

Nigatu is already onto her next project. In June, she received a diversity, equity and inclusion scholarship from the ACM Conference on Fairness, Accountability, and Transparency. 

With this scholarship, she will work to improve machine translation for low-resourced Ethiopian languages. She will also fund and mentor female students from Ethiopia to work on that research with her. This builds on past mentorship she’s offered to Ethiopian women in the computer science field.

“I know how hard it was for me to try to do research when I was there, so I do whatever I can do to try to bridge that gap,” Nigatu said.

While technology is rapidly advancing, diversifying who creates computing tools will take time. In the meantime, she hopes researchers will speak with members of underrepresented communities to hear their experiences and insights.

She feels the responsibility of being an Ethiopian female researcher every day. When she interviewed with Chasins to be her PhD advisor at Berkeley years ago, Chasins asked what would get Nigatu out of bed on the mornings she was too exhausted to do the work. Nigatu’s answer was simple.

“I don't have the luxury of saying, ‘Oh, I can’t do it today,’ because it's practically me. I am the person who's at the end going to be using these technologies. It's my friend. It's my sisters. It's everyone that I'm close to,” she said. 

“I pull a lot of the motivation for my work from my personal experience and from the experiences of those around me,” Nigatu said. “It keeps me going.”