ForGood
Karya

Data Curation Intern

Level: Entry
Posted: 10h ago
Engineering & Technology

Quick answer

Karya is hiring a Data Curation Intern to assist in building high-quality, multilingual AI datasets through data engineering and linguistic analysis.

Role: Data Curation Intern
Organization: Karya
Level: Entry
Category: Engineering & Technology

The role

Karya is seeking a Data Curation Intern to help build high-quality datasets for AI/ML models with a focus on Indian languages. This role involves auditing, cleaning, and structuring large open-source datasets to ensure they meet the rigorous requirements of modern machine learning pipelines. Interns will progress from text data curation to speech and voice model data preparation while working on real-world projects. This position offers a unique opportunity to gain hands-on experience in computational linguistics and data engineering.

What you'll do

  • Audit and profile open-source datasets to assess quality and noise levels.
  • Implement data cleaning pipelines including deduplication and noise removal.
  • Apply metadata tagging schemas to categorize text data.
  • Develop validation checklists and quality scorecards for dataset readiness.
  • Prepare text passages and metadata standards for speech and voice model training.

What it takes

  • Strong attention to detail.
  • Proficiency in Python for data processing (pandas, regex, spaCy, NLTK).
  • Familiarity with text data formats such as CSV, JSONL, and Parquet.
  • Ability to work independently and document processes clearly.
  • Curiosity about AI/ML or computational linguistics.

What you'll bring

  • Python
  • Data Cleaning
  • Pandas
  • Regex
  • NLP Libraries (spaCy/NLTK)
  • Data Structuring

Frequently asked questions

What is the focus of this internship role?

This role focuses on data curation for AI/ML models, specifically cleaning and preparing multilingual Indian language datasets for both text and voice training.

What technical skills are required for this position?

Applicants should be comfortable using Python for data processing, including pandas, regular expressions, and NLP libraries such as spaCy or NLTK.

Is compensation provided for this internship?

The provided job description does not explicitly specify the compensation for this role.

How to apply

Apply directly on Karya's site. We link straight through — no resume parsing, no profile to fill out.


This listing is aggregated from a third-party source and its summary may be auto-generated, so details can be inaccurate or out of date. ForGood is not the employer and is not liable for the content — please verify everything on Karya's official posting before applying.