Turkish NLP - Tutorial

Introduction to NLP in Turkish

Turkish in Perspective

The Turkish alphabet consists of 29 letters, 8 of which are vowels. It is written in the Latin script with the additional letters “ç”, “ğ”, “ı”, “ö”, “ş” and “ü”. The basic word order of a Turkish sentence is subject-object-verb. Turkish is an agglutinative language: suffixes are attached to stems to create new words or to conjugate existing ones. Unlike many Germanic languages, Turkish has no grammatical gender; a single pronoun, “o”, refers to people of any gender and to objects. Native Turkish has no suffixes that mark gender in job titles and no articles in front of nouns. For formal address, the second-person plural pronoun “siz” is used.
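To make the alphabet and agglutination points concrete, here is a small Python sketch. The word chain (ev, evler, evlerimiz, evlerimizde) and the helper name `attach_suffixes` are illustrative choices, not part of any library. The helper only concatenates strings; it does not model vowel harmony or any other morphological rule.

```python
import unicodedata

# The 29 letters of the Turkish alphabet, lowercase, normalized to
# precomposed (NFC) form so that each letter is a single code point.
TURKISH_ALPHABET = unicodedata.normalize("NFC", "abcçdefgğhıijklmnoöprsştuüvyz")
TURKISH_VOWELS = unicodedata.normalize("NFC", "aeıioöuü")

# Letters that are not part of the basic 26-letter Latin alphabet.
extra_letters = [ch for ch in TURKISH_ALPHABET if not ch.isascii()]


def attach_suffixes(stem, suffixes):
    """Return every intermediate word formed by appending suffixes in order.

    Hypothetical helper for illustration: plain string concatenation only.
    """
    words = [stem]
    for suffix in suffixes:
        words.append(words[-1] + suffix)
    return words


# ev (house) -> evler (houses) -> evlerimiz (our houses) -> evlerimizde (in our houses)
chain = attach_suffixes("ev", ["ler", "imiz", "de"])

print(len(TURKISH_ALPHABET), len(TURKISH_VOWELS), extra_letters)
print(chain)
```

Each suffix adds one piece of meaning (plural, possessive, locative), which is why a single Turkish word can correspond to a whole English phrase and why subword tokenization matters for Turkish models.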

Turkish NLP with Hugging Face

At the time of writing, there are 48 models that can make predictions on Turkish text and 41 datasets that include Turkish examples. One of the most widely used models is bert-base-turkish-cased by the Munich Digitization Center, a popular base language model for Turkish NLP tasks.

Loading a Dataset

Let’s dive into a dataset! Loading a dataset is easy enough with the datasets library.

We will load the dataset “turkish_product_reviews” and fine-tune “savasy/bert-base-turkish-sentiment-cased” with it.

# uncomment to install datasets, if it is not installed yet
# !pip install datasets

from datasets import load_dataset

dataset = load_dataset('turkish_product_reviews', split="train")

Let’s examine the dataset:

In [1]: dataset
Out[1]: Dataset({
    features: ['sentence', 'sentiment'],
    num_rows: 23516
})

In [2]: dataset[0]["sentence"]
Out[2]: 'beklentimin altında bir ürün kaliteli değil'

The first review translates roughly to “a product below my expectations, not good quality”.
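Before splitting, it can be worth checking how the labels are distributed, since a heavily skewed or single-valued label column makes fine-tuning meaningless. The sketch below is a minimal way to do that; the helper name `label_distribution` is hypothetical, and the toy labels assume a 0/1 encoding of the `sentiment` column, which the dataset summary above does not confirm.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Toy labels standing in for dataset["sentiment"]; with the real dataset you
# would call label_distribution(dataset["sentiment"]) instead.
toy_labels = [1, 1, 0, 1, 0, 1, 1, 1]
distribution = label_distribution(toy_labels)
for label, (count, share) in distribution.items():
    print(f"label {label}: {count} examples ({share:.1%})")
```

If only one label shows up in the output, something is wrong with the data and it is better to stop before training.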

This dataset does not contain a separate test set, so we split the training set twice: once to hold out a test set and once to hold out a validation set.

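from sklearn.model_selection import train_test_split

# texts and labels come from the two dataset columns shown above
texts, labels = dataset["sentence"], dataset["sentiment"]
# hold out 20% of the examples as the test set
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=.2)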
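The split above produces the test set; the validation set comes from splitting the remaining training data a second time. A minimal sketch of both steps together is shown below. The function name `three_way_split`, the 20% validation share and the fixed `seed` are assumptions made for illustration, and toy lists stand in for the real review texts and labels.

```python
from sklearn.model_selection import train_test_split


def three_way_split(texts, labels, test_size=0.2, val_size=0.2, seed=42):
    """Split into train, validation and test sets with two successive splits."""
    # First split: hold out the test set.
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, labels, test_size=test_size, random_state=seed
    )
    # Second split: hold out a validation set from what remains.
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        train_texts, train_labels, test_size=val_size, random_state=seed
    )
    return (train_texts, train_labels), (val_texts, val_labels), (test_texts, test_labels)


# Toy data standing in for dataset["sentence"] and dataset["sentiment"].
toy_texts = [f"review {i}" for i in range(100)]
toy_labels = [i % 2 for i in range(100)]
train, val, test = three_way_split(toy_texts, toy_labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Note that `val_size` is a fraction of the data left after the test split, not of the full dataset. Fixing `random_state` makes the split reproducible between runs.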

Using a Pre-Trained Model

The model below is a Turkish BERT already fine-tuned for sentiment analysis; we will fine-tune it further on the dataset above.
You can find it on its model page under the name savasy/bert-base-turkish-sentiment-cased.

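from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'savasy/bert-base-turkish-sentiment-cased'
# pass the variable itself, not the literal string 'model_name'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load the model with its sequence classification head, which sentiment fine-tuning needs
model = AutoModelForSequenceClassification.from_pretrained(model_name)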
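Before fine-tuning, the review texts have to be turned into model inputs with the tokenizer. The following is a minimal sketch of that step, not the code from the notebook: the helper name `encode_texts` and the `max_length` of 128 are assumptions. The `truncation`, `padding` and `max_length` arguments are standard options of Hugging Face tokenizers.

```python
def encode_texts(tokenizer, texts, max_length=128):
    """Tokenize a list of strings into padded, truncated model inputs."""
    return tokenizer(
        list(texts),
        truncation=True,  # cut reviews longer than max_length tokens
        padding=True,     # pad shorter reviews to the longest one in the list
        max_length=max_length,
    )


# With a loaded tokenizer and the split texts, usage would look like:
#   train_encodings = encode_texts(tokenizer, train_texts)
#   train_encodings["input_ids"][0]  # token ids of the first review
```

The same helper can be applied to the validation and test texts so that all three splits are encoded consistently.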

Check out the example notebook for a full application of the process above.
To-do:

  • Add Turkish translation
  • Add comments in Turkish to notebook
  • Add native TensorFlow implementation

The ProductReviews dataset used here was labeling every example as positive when parsed, due to a file extension error. This bug has been fixed, and the loader now parses negative and positive examples as it should.

Upgrade your datasets library to version 1.9.0.

https://github.com/huggingface/datasets/pull/2530

Hi @mervenoyan, I know it can be flexible, but is there a “decided” alphabet for Turkish?