Turkish NLP - Tutorial

mervenoyan · February 23, 2021, 12:46pm

Introduction to NLP in Turkish

Turkish in Perspective

The Turkish alphabet consists of 29 letters, 8 of which are vowels. It is based on the Latin alphabet, with the additional letters “ç”, “ö”, “ü”, “ğ”, “ı”, “ş”. The basic word order of a Turkish sentence is subject-object-verb. Turkish is an agglutinative language: suffixes are attached to stems to create new words or to inflect existing ones. Unlike many Germanic languages, such as German, Turkish has no grammatical gender; a single pronoun, “o”, refers to all genders and to objects. Native Turkish has no suffixes that mark gender in job titles, and it does not place articles such as “the” in front of nouns. For formal address, the second-person plural pronoun “siz” is used.
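To make the agglutination point concrete, here is a small sketch that stacks suffixes on the stem “ev” (house). The suffix forms are hard-coded for this one stem as an illustration; real Turkish morphology also involves vowel harmony and consonant changes, which this toy example does not model.

```python
# Build Turkish word forms by stacking suffixes on the stem "ev" (house).
# The suffixes are hard-coded for this stem; vowel harmony and consonant
# alternations, which real morphology needs, are deliberately ignored.
stem = "ev"
suffixes = [
    ("ler", "plural"),                 # evler: houses
    ("im", "1st person possessive"),   # evlerim: my houses
    ("de", "locative"),                # evlerimde: in my houses
]

word = stem
forms = [word]
for suffix, _role in suffixes:
    word += suffix
    forms.append(word)

print(" -> ".join(forms))  # ev -> evler -> evlerim -> evlerimde
```

A single surface word like “evlerimde” corresponds to a whole English phrase, which is one reason subword tokenization tends to matter for Turkish models.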

Turkish NLP with Hugging Face

At the time of writing, there are 48 models that can make predictions on Turkish text and 41 datasets that include Turkish examples. One of the most widely used models is bert-base-turkish-cased by the Munich Digitization Center, a popular base language model for Turkish NLP tasks.

Loading a Dataset

Let’s dive into a dataset! Loading a dataset is easy enough with the datasets library.

We will load the “turkish_product_reviews” dataset and fine-tune “savasy/bert-base-turkish-sentiment-cased” on it.

# uncomment and install datasets, if not installed

# !pip install datasets

from datasets import load_dataset

dataset = load_dataset('turkish_product_reviews', split = "train")

Let’s examine the dataset:

In [1]: dataset

Out[1]: Dataset({

    features: ['sentence', 'sentiment'],

    num_rows: 23516

})

In [2]: dataset[0]["sentence"]

Out[2]: 'beklentimin altında bir ürün kaliteli değil'

(Roughly: “a product below my expectations, not good quality”.)

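This dataset does not contain a separate test set, so we split the training set twice: once to hold out a test set, and once more to hold out a validation set.

from sklearn.model_selection import train_test_split

texts, labels = dataset["sentence"], dataset["sentiment"]  # raw columns

train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=.2)

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)  # validation fraction of .2 is an assumption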
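The two-stage split can also be sketched with the standard library alone, which makes the resulting proportions easy to see. This is an illustration of the idea, not scikit-learn's implementation; the function name `three_way_split`, the validation fraction, and the seed are assumptions made for the example.

```python
import random


def three_way_split(items, test_size=0.2, val_size=0.2, seed=42):
    """Shuffle items and split them into train, validation, and test lists.

    val_size is a fraction of what remains after the test split,
    mirroring "split the training set twice".
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # first split: hold out the test set
    n_test = int(len(shuffled) * test_size)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # second split: hold out the validation set from the remainder
    n_val = int(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 64 16 20
```

To keep texts and labels aligned, one option is to pass `list(zip(texts, labels))` and unzip each split afterwards.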

Using a Pre-Trained Model

The model below is a Turkish BERT that has already been fine-tuned for sentiment analysis; we will fine-tune it further on the dataset above.
You can see the model here: model page

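from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'savasy/bert-base-turkish-sentiment-cased'
# load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load the model with its classification head, which sentiment fine-tuning needs
model = AutoModelForSequenceClassification.from_pretrained(model_name)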
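Before fine-tuning, the review texts have to be turned into model inputs. The helper below is a minimal sketch of that step, assuming `tokenizer` behaves like a Hugging Face tokenizer (callable on a list of strings). The name `encode_texts` and the `max_length=128` default are illustrative choices, not part of the original tutorial.

```python
def encode_texts(tokenizer, texts, max_length=128):
    """Tokenize a list of strings into padded, truncated model inputs.

    `tokenizer` is expected to be callable on a list of strings, as
    Hugging Face tokenizers are. The helper name and the max_length
    default are assumptions made for this example.
    """
    return tokenizer(
        list(texts),
        truncation=True,        # cut reviews longer than max_length tokens
        padding=True,           # pad shorter reviews to the longest one
        max_length=max_length,
    )


# Usage with the names from the tutorial (needs the downloaded tokenizer):
# train_encodings = encode_texts(tokenizer, train_texts)
# test_encodings = encode_texts(tokenizer, test_texts)
```

Wrapping the call in one function keeps the truncation and padding settings identical across the training and test sets.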

Check out the example notebook for a full application of the process above.
To-do:

Add Turkish translation
Add comments in Turkish to notebook
Add native Tensorflow implementation

yavuzkomecoglu · July 21, 2021, 12:48pm

Due to a file extension error, the ProductReviews dataset used here was being parsed with every example labeled as positive. This bug has been fixed, and the dataset now parses negative and positive labels as it should.

Upgrade your datasets library to version 1.9.0 to get the fix.

https://github.com/huggingface/datasets/pull/2530
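A quick way to check whether a loaded copy is affected is to look at the label distribution. The sketch below uses a toy list in place of `dataset["sentiment"]`; the helper name `label_distribution` and the encoding 1 = positive, 0 = negative are assumptions made for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label value as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for dataset["sentiment"]
# (1 = positive, 0 = negative is assumed here).
toy_labels = [1, 1, 0, 1, 0, 1, 1, 1]
shares = label_distribution(toy_labels)
print(shares)

# A single label value would suggest the parsing problem described above.
if len(shares) < 2:
    print("Warning: only one label value found")
```

With the real dataset, `label_distribution(dataset["sentiment"])` should report both classes once the fix is installed.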

bozden · November 16, 2021, 9:28pm

Hi @mervenoyan, I know it can be flexible, but is there a “decided” alphabet for Turkish?
