Objective:
The goal is to build a language detection model based on the Europarl parallel corpus, using classical Natural Language Processing and Machine Learning techniques.
I will then use this model to detect the language of new sentences taken from Wikipedia.
This language detection task is essentially a text classification problem.
Table of contents
- Data
- Data analysis
- Data pre-processing
- Preparing Datasets
- Variables definition
- Data vectorization
- Model
- Model evaluation
- Inference
- Conclusion
Importing libraries
import os
import time
import string
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import feature_extraction
from sklearn import pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import itertools
from tqdm import tqdm
import dataframe_image as dfi
import pickle
Global data path
data_path = "data/Europarl/"
corpus_path = data_path + "/Europarl_parallel_corpus"
1. Data
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
For a detailed description of this corpus, please read Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, Pdf.
Get the original data here.
I will use this corpus for language detection although it is originally intended for translation.
Below is a small table with some basic information about the data:
sentence_aligned_corpora_df = pd.read_csv(data_path + "Europarl_parallel_corpus_description.csv")
dfi.export(sentence_aligned_corpora_df, "img/language_detection_Europarl_parallel_corpus_description_df.png")
Each folder of the parallel corpus contains two files (example for the Bulgarian folder “bg-en”):
- the text in the considered language: “europarl-v7.bg-en.bg”
- the same text translated into English: “europarl-v7.bg-en.en”
For language detection, I keep only the text in the considered language from each folder. For English, I will use the English translation of the French text (the “.en” file of the “fr-en” folder).
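As a quick check of this layout (using the paths defined above, and the Bulgarian folder purely as an example), one can list the contents of a single folder:
# Quick look at the two files of one parallel folder (Bulgarian-English used as an example)
print(sorted(os.listdir(os.path.join(corpus_path, "bg-en"))))
# should show something like: ['europarl-v7.bg-en.bg', 'europarl-v7.bg-en.en']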
2. Data analysis
Let’s load the Czech data from the “cs-en” folder
# Setting column width display option
pd.set_option('display.max_colwidth', 150)
czech_df = pd.read_csv(
    corpus_path + "/cs-en/europarl-v7.cs-en.cs",
    sep="\x07",  # separator that never appears in the text, so each full line is read as one sentence
    names=["Czech Sentences"], engine='python')
dfi.export(czech_df.head(20), "img/language_detection_czech_df.png")
We can see in the first rows of this Czech dataset that some Greek words appear (row 11), as well as some Italian words (row 9).
It is reasonable to think that many other such imprecisions exist in the dataframe.
We can also observe some sentences that contain only numbers. These will be handled during the pre-processing phase.
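As a rough check (not part of the original analysis), we can count how many rows consist only of digits, whitespace and simple punctuation; str.fullmatch is available in recent pandas versions:
# Rough check: rows that contain only digits, whitespace and simple punctuation
numeric_only = czech_df["Czech Sentences"].str.fullmatch(r"[\d\s.,%()-]+", na=False)
print("Rows containing only numbers/punctuation:", int(numeric_only.sum()))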
Czech Data Exploration
Getting counts of sentences, words and characters:
total_num_words = 0
line_num_words = 0
num_char = 0
words_by_line = []
charDict = {}
for i, row in tqdm(czech_df.iterrows(), total=czech_df.shape[0]):
line = row['Czech Sentences']
line_num_words = len(line.split(" "))
# words count
total_num_words += line_num_words
words_by_line.append(line_num_words)
# characters count
for char in line:
num_char += 1
if char in charDict.keys():
charDict[char] += 1
else:
charDict[char] = 1
100%|██████████| 646526/646526 [01:55<00:00, 5613.49it/s]
print("Number of sentences:", czech_df.shape[0])
print("Number of words:", total_num_words)
print("Minimum number of words by sentence:", min(words_by_line))
print("Maximum number of words by sentence:", max(words_by_line))
Number of sentences: 646526
Number of words: 12999948
Minimum number of words by sentence: 1
Maximum number of words by sentence: 227
print("Total number of characters:", num_char)
print(f'Number of different characters: {len(charDict)}')
Total number of characters: 87212619
Number of different characters: 269
Let’s check these characters and their occurrences in detail
charDict = dict(sorted(charDict.items(),key= lambda x:x[1], reverse=True))
print(charDict)
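Since printing the whole dictionary is hard to read, one can show only the most frequent characters, for instance with itertools.islice (itertools is already imported above):
# Show only the 20 most frequent characters (charDict is already sorted by decreasing count)
print(dict(itertools.islice(charDict.items(), 20)))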
And visualize the ordered occurrence of these characters
#setting plots background color
plt.rcParams.update({'axes.facecolor':'#f8fafc'})
plt.figure(figsize=(20, 8))
plt.bar(range(len(charDict)), list(charDict.values()), align='center', edgecolor='black', color='darkorange')
plt.xticks(range(len(charDict)), list(charDict.keys()))
plt.tick_params(labelsize = 14)
plt.xlabel('Characters', fontsize = 15)
plt.ylabel('Number of occurrences', fontsize = 15)
plt.title('Czech Data: Characters occurrences (first 50)', fontsize = 18)
plt.xlim([-1, 50])
plt.show()
One can observe the occurrence of some typical Czech characters such as ž ě ř č š ů.
Digits and punctuation should be removed, as well as special characters. Capital letters will be lowercased.
Now let’s see how words and sentences are distributed
plt.figure(figsize=(14, 8))
# BOXPLOT
plt.subplot(211)
flierprops = dict(marker='o', markerfacecolor='darkorange', markersize=10)
plt.boxplot(words_by_line, vert=False, patch_artist=True, meanline=True, showmeans=True, flierprops=flierprops)
plt.xlabel('Number of words', fontsize = 12)
plt.title('Czech Data: Distribution of the number of words of sentences', fontsize = 15)
plt.xticks(np.arange(0, 230, 20))
plt.xlim([0,230])
plt.yticks([1,], ["Czech"], fontsize = 12)
# HISTOGRAM
plt.subplot(212)
plt.hist(words_by_line, bins=300, color='purple')
plt.xlabel('Number of words', fontsize = 12)
plt.ylabel('Number of sentences', fontsize = 12)
plt.title('Czech Data: Distribution of sentences by the number of words', fontsize = 15)
plt.xticks(np.arange(0, 230, 20))
plt.xlim([0,230])
plt.tight_layout()
plt.show()
We can observe many outliers: sentences with a much larger number of words.
On the other hand, we see that half of the sentences have fewer than 20 words.
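To put numbers on this observation (a quick check using the words_by_line list computed above):
# Quartiles of the number of words per sentence
q1, median, q3 = np.percentile(words_by_line, [25, 50, 75])
print(f"25th percentile: {q1:.0f} words | median: {median:.0f} words | 75th percentile: {q3:.0f} words")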
3. Data pre-processing
Pre-processing steps:
- remove string begin/end spaces
- lower capital letters
- divide compound words (vice-president -> vice president)
- remove punctuation marks
- remove digits
- remove remaining special symbols (non-word characters)
- remove extra spaces
Let’s see the result of each pre-processing step on a simple example
text = """ PRE-Pross ! "# & '()*./:;+ ,- =>$$? @ [\]^_ `{|} ~) ěï ū Щ %Ъ < ЫЭ §± »€ ≤• 223787° ý ł """
# Defining dictionary for punctuation removal
for char in string.punctuation:
print(char, end = ' ')
punctuation_marks = dict((ord(char), "") for char in string.punctuation)
! " # $$ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
print(text, "\n")
text = text.strip() # remove string begin/end spaces
print(text, "\n")
text = text.lower() # lower capital letters
print(text, "\n")
text = text.replace("-", " ") # divide compound words
print(text, "\n")
text = text.translate(punctuation_marks) # remove punctuation marks
print(text, "\n")
text = re.sub(r'\d+', '', text) # remove digits
print(text, "\n")
text = re.sub(r'\W', ' ', text) # remove remaining special symbols (non-word characters)
print(text, "\n")
text = re.sub(' +', ' ', text) # remove extra spaces
print(text, "\n")
PRE-Pross ! "# & '()*./:;+ ,- =>$? @ [\]^_ `{|} ~) ěï ū Щ %Ъ < ЫЭ §± »€ ≤• 223787° ý ł
PRE-Pross ! "# & '()*./:;+ ,- =>$? @ [\]^_ `{|} ~) ěï ū Щ %Ъ < ЫЭ §± »€ ≤• 223787° ý ł
pre-pross ! "# & '()*./:;+ ,- =>$? @ [\]^_ `{|} ~) ěï ū щ %ъ < ыэ §± »€ ≤• 223787° ý ł
pre pross ! "# & '()*./:;+ , =>$? @ [\]^_ `{|} ~) ěï ū щ %ъ < ыэ §± »€ ≤• 223787° ý ł
pre pross ěï ū щ ъ ыэ §± »€ ≤• 223787° ý ł
pre pross ěï ū щ ъ ыэ §± »€ ≤• ° ý ł
pre pross ěï ū щ ъ ыэ ý ł
pre pross ěï ū щ ъ ыэ ý ł
Additional pre-processing step:
I decided to rearrange the sentences during the data pre-processing phase so that each one contains the same specified number of words.
In this way, the machine learning model will be trained on a dataset of standardized-length sentences.
I arbitrarily chose to set the length of sentences to 50 words:
- each line which has fewer than 50 words is concatenated with the next one
- each line which has 50 or more words is sliced in order to respect the given length of 50 words per sentence
Incidentally, this pre-processing can split sentences from other languages or merge them with valid ones, which should dilute the imprecision in the text data.
def dataPreprocessing(dataframe, NB_WORDS_BY_ROW=50):
    """Pre-process the text data of a one-column dataframe.

    Each row of the returned dataframe contains 'NB_WORDS_BY_ROW' words (50 by default).
    """
    stack_line = ""
    non_word = re.compile(r'\W')  # compile the regular expression once for better performance
    for i, row in tqdm(dataframe.iterrows(), total=dataframe.shape[0]):
        line = row[dataframe.columns[0]]          # get text data (type str) from dataframe
        line = (stack_line + " " + line).strip()  # prepend stacked words, remove begin/end spaces
        line = line.lower()                       # lower capital letters
        line = line.replace("-", " ")             # divide compound words (vice-president)
        line = line.translate(punctuation_marks)  # remove punctuation marks
        line = re.sub(r'\d+', '', line)           # remove digits
        line = non_word.sub(' ', line)            # remove remaining special symbols (non-word characters)
        line = re.sub(' +', ' ', line)            # remove extra spaces
        line_list = line.split()
        line_num_words = len(line_list)
        if line_num_words < NB_WORDS_BY_ROW:
            stack_line = line                     # stack words of the current line
            dataframe.iloc[i] = ""                # clear dataframe row
        else:                                     # the NB_WORDS_BY_ROW criterion is reached
            # write the sentence with the specified number of words into the dataframe
            dataframe.iloc[i] = ' '.join(line_list[0:NB_WORDS_BY_ROW])
            # and stack the remaining words
            stack_line = ' '.join(line_list[NB_WORDS_BY_ROW:])
    # remove empty rows and shuffle the rows of the dataframe
    drop_indexes = dataframe[dataframe[dataframe.columns[0]] == ""].index
    dataframe.drop(drop_indexes, inplace=True)
    dataframe = dataframe.sample(frac=1)
    dataframe.reset_index(drop=True, inplace=True)
    return dataframe
As a verification, let’s apply the pre-processing to the English data
english_df = pd.read_csv(
    corpus_path + "/en-en/europarl-v7.fr-en.en",
    sep="\x07",  # separator that never appears in the text, so each full line is read as one sentence
    names=["English Sentences"], engine='python')
english_df_preproc = dataPreprocessing(english_df)
dfi.export(english_df_preproc.head(10), "img/language_detection_english_df_preproc.png")
100%|██████████| 2005688/2005688 [10:22<00:00, 3220.82it/s]
Now let’s check the characters that make up the words and sentences of English data after pre-processing
charDict = {}
for i, row in tqdm(english_df_preproc.iterrows(), total=english_df_preproc.shape[0]):
line = row['English Sentences']
for char in line:
if char in charDict.keys():
charDict[char] += 1
else:
charDict[char] = 1
100%|██████████| 999912/999912 [03:12<00:00, 5194.18it/s]
print(f'Number of different characters: {len(charDict)}')
Number of different characters: 167
Check these characters and their occurrence in detail
charDict = dict(sorted(charDict.items(),key= lambda x:x[1], reverse=True))
print(charDict)
After pre-processing, we can still observe some characters that belong to other languages (ß έ й ε ã …), and other special characters (¼ º ½ ¾) that have not been removed during pre-processing. This confirms what we saw during the data analysis: depending on the results and performance of the language detection model, we may have to clean the data a little more thoroughly.
And visualize the ordered occurrence of these characters
plt.figure(figsize=(20, 8))
plt.bar(range(len(charDict)), list(charDict.values()), align='center', edgecolor='black', color='darkorange')
plt.xticks(range(len(charDict)), list(charDict.keys()))
plt.tick_params(labelsize = 14)
plt.xlabel('Characters', fontsize = 15)
plt.ylabel('Number of occurrences', fontsize = 15)
plt.title('English Data: Characters occurrences (first 50)', fontsize = 18)
plt.xlim([-1, 50])
plt.show()
However, we observe here that the characters with a significant number of occurrences are those which compose ordinary English words and sentences! The appearance of characters belonging to other languages and of remaining special characters seems anecdotal.
So let’s keep the data as it is for the moment, and do some more precise cleaning later if necessary.
Corpus pre-processing
Now let’s pre-process the sentences for all languages of the corpus
Note: for the English data, I chose the file from the “fr-en” folder, which is the English translation of the French text.
langDict = {
'bg': 'Bulgarian',
'cs': 'Czech',
'da': 'Danish',
'de': 'German',
'el': 'Greek',
'en': 'English',
'es': 'Spanish',
'et': 'Estonian',
'fi': 'Finnish',
'fr': 'French',
'hu': 'Hungarian',
'it': 'Italian',
'lt': 'Lithuanian',
'lv': 'Latvian',
'nl': 'Dutch',
'pl': 'Polish',
'pt': 'Portuguese',
'ro': 'Romanian',
'sk': 'Slovak',
'sl': 'Slovenian',
'sv': 'Swedish'
}
To keep computation manageable and to prevent the memory issues that occur with this large corpus, only 100,000 sentences will be randomly selected for each language.
languages = []
d = {}
for key, language in langDict.items():
    print(language)
    languages.append(language)
    folder = key + '-en'                                # folder names look like "bg-en"
    files_path = os.path.join(corpus_path, folder)
    for file in os.listdir(files_path):
        if not file.endswith("." + key):                # keep only the file in the considered language
            continue
        d[language] = pd.read_csv(
            os.path.join(files_path, file),
            sep="\x07",  # separator that never appears in the text: each line becomes one row
            names=["Sentences"], engine='python')
    d[language] = d[language].sample(frac=1)            # shuffle sentences
    d[language].reset_index(drop=True, inplace=True)
    d[language] = d[language].iloc[0:100000].copy()     # select 100000 sentences
    d[language] = dataPreprocessing(d[language])        # pre-process sentences
    d[language]['Language'] = language                  # add a column that indicates the language
Bulgarian
100%|██████████| 100000/100000 [00:32<00:00, 3045.56it/s]
Czech
100%|██████████| 100000/100000 [00:32<00:00, 3052.08it/s]
Danish
100%|██████████| 100000/100000 [00:32<00:00, 3084.68it/s]
German
100%|██████████| 100000/100000 [00:33<00:00, 2997.50it/s]
Greek
100%|██████████| 100000/100000 [00:34<00:00, 2910.47it/s]
English
100%|██████████| 100000/100000 [00:29<00:00, 3345.92it/s]
Spanish
100%|██████████| 100000/100000 [00:33<00:00, 2998.63it/s]
Estonian
100%|██████████| 100000/100000 [00:32<00:00, 3090.61it/s]
Finnish
100%|██████████| 100000/100000 [00:33<00:00, 3012.31it/s]
French
100%|██████████| 100000/100000 [00:33<00:00, 2970.94it/s]
Hungarian
100%|██████████| 100000/100000 [00:33<00:00, 2963.75it/s]
Italian
100%|██████████| 100000/100000 [00:33<00:00, 3007.12it/s]
Lithuanian
100%|██████████| 100000/100000 [00:32<00:00, 3056.49it/s]
Latvian
100%|██████████| 100000/100000 [00:32<00:00, 3062.43it/s]
Dutch
100%|██████████| 100000/100000 [00:31<00:00, 3220.98it/s]
Polish
100%|██████████| 100000/100000 [00:33<00:00, 2983.80it/s]
Portuguese
100%|██████████| 100000/100000 [00:33<00:00, 3013.53it/s]
Romanian
100%|██████████| 100000/100000 [00:33<00:00, 2992.63it/s]
Slovak
100%|██████████| 100000/100000 [00:32<00:00, 3050.19it/s]
Slovenian
100%|██████████| 100000/100000 [00:32<00:00, 3109.38it/s]
Swedish
100%|██████████| 100000/100000 [00:32<00:00, 3056.71it/s]
4. Preparing Datasets
Now we are ready to create our Train and Test sets by concatenating dataframes of each language:
- 80% of the data dedicated to Train set
- 20% of the data dedicated to Test set
train_df = pd.DataFrame()
test_df = pd.DataFrame()
for language in languages:
ratio = int(d[language].shape[0] * 0.8)
train_df = pd.concat([train_df, d[language][0:ratio]], axis=0)
test_df = pd.concat([test_df, d[language][ratio:]], axis=0)
# Saving Train and Test datasets as csv files
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)
train_df.to_csv(data_path + '/Europarl_csv/train_df.csv', index = False)
test_df.to_csv(data_path + '/Europarl_csv/test_df.csv', index = False)
print(train_df.shape)
print(test_df.shape)
(741158, 2)
(185303, 2)
dfi.export(train_df.tail(), "img/language_detection_train_df.png")
dfi.export(test_df.tail(), "img/language_detection_test_df.png")
Visualizing the number of sentences for each language
plt.figure(figsize=(20, 12))
# TRAIN DATASET
plt.subplot(211)
sns.countplot(data=train_df, x='Language')
plt.xlabel('Languages', fontsize = 12)
plt.ylabel('Number of sentences', fontsize = 12)
plt.title('Train Dataset: Number of sentences by language', fontsize = 15)
# TEST DATASET
plt.subplot(212)
sns.countplot(data=test_df, x='Language')
plt.xlabel('Languages', fontsize = 12)
plt.ylabel('Number of sentences', fontsize = 12)
plt.title('Test Dataset: Number of sentences by language', fontsize = 15)
plt.show()
Since the original sentences have variable lengths, the pre-processing yields a different number of standardized-length sentences for each language.
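The exact counts behind these plots can also be printed directly (a quick check on the dataframes built above):
# Exact number of pre-processed sentences per language in the Train and Test sets
print(train_df['Language'].value_counts())
print(test_df['Language'].value_counts())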
5. Variables definition
Creating variables for model input
train_df = pd.read_csv(data_path + '/Europarl_csv/train_df.csv', index_col = False)
X_train = train_df.iloc[:,0] # Independent Variable
y_train = train_df.iloc[:,1] # Dependent Variable
test_df = pd.read_csv(data_path + '/Europarl_csv/test_df.csv', index_col = False)
X_test = test_df.iloc[:,0] # Independent Variable
y_test = test_df.iloc[:,1] # Dependent Variable
6. Data vectorization
Applying TF-IDF Vectorizer with Character Tokenization
I chose character tokenization because characters are the most relevant entities for recognizing a language.
One can find on Wikipedia a Language recognition chart: “This language recognition chart presents a variety of clues one can use to help determine the language in which a text is written."
Below is an extract from the article:
[…]
Furthermore, character tokenization splits a sentence into a set of characters, so it handles “out of vocabulary” words coherently while preserving word-level information, and it limits the size of the vocabulary.
I use the TfidfVectorizer from the Scikit-Learn library to convert raw sentences into a matrix of TF-IDF features, with the ‘char_wb’ option selected to create character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces (a whitespace can appear at the beginning or end of an n-gram but not in the middle).
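To make the ‘char_wb’ behaviour concrete, here is a small sketch of the n-grams produced for a single word, using a throwaway vectorizer (demo_vectorizer and analyze are illustration names only; the actual vectorizer used in the rest of the notebook is defined just after):
# Minimal sketch: inspect the character n-grams produced by the 'char_wb' analyzer
demo_vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 3), analyzer='char_wb')
analyze = demo_vectorizer.build_analyzer()
print(analyze("ciao"))
# e.g. [' ', 'c', 'i', 'a', 'o', ' ', ' c', 'ci', 'ia', 'ao', 'o ', ' ci', 'cia', 'iao', 'ao ']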
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1,3), analyzer='char_wb')
Let’s test this vectorizer on the Italian data:
Italian_Sentences = d["Italian"].iloc[:,0]
print(Italian_Sentences.shape)
(50207,)
print(Italian_Sentences[0])
è tuttora caratterizzata da una mancanza di parità tra i generi vorrei ringraziarvi per linteresse dimostrato verso lordine del giorno del prossimo consiglio europeo e sarò lieto di raccogliere i vostri stimolanti commenti e pareri in occasione della prossima discussione ho votato a favore della relazione gröner sulla lotta contro
Italian_vect = vectorizer.fit_transform(Italian_Sentences)
Italian_vect = pd.DataFrame(Italian_vect.toarray(), columns= vectorizer.get_feature_names())
Italian_vect.shape
(50207, 9398)
As we can see, the number of features (for the Italian data) is equal to 9398.
dfi.export(Italian_vect.head(10), "img/language_detection_italian_vect.png", max_cols=20)
And, as expected, we can find some features composed of characters from other languages.
7. Model
Model definition
As a first approach to classification, I selected the logistic regression model from the Scikit-Learn library.
The Vectorizer and the Classifier are then combined in a pipeline:
Language_detector = pipeline.Pipeline([
('vectorizer', vectorizer),
('classifier', LogisticRegression(max_iter=1000))
])
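Before committing to the full (and long) training run, one could optionally sanity-check the pipeline with cross-validation on a small subsample. This is a hypothetical sketch, not part of the original run; the names sample and scores are illustration only, and cross_val_score comes from scikit-learn’s model-selection module:
# Hypothetical quick check: 3-fold cross-validation of the pipeline on a small subsample
from sklearn.model_selection import cross_val_score

sample = train_df.sample(n=20000, random_state=0)
scores = cross_val_score(Language_detector, sample.iloc[:, 0], sample.iloc[:, 1], cv=3)
print("Cross-validated accuracy on the subsample:", scores.mean())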
Model training
start_time = time.time()
Language_detector.fit(X_train, y_train)
end_time = time.time()
print("\nProcessing time = {:0.2f} min".format((end_time - start_time)/60))
Processing time = 43.20 min
8. Model evaluation
Accuracy
y_test_pred = Language_detector.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred) * 100
print(f"Test Accuracy = {accuracy:.2f}%")
Test Accuracy = 100.00%
Awesome! Even with special characters and characters from other languages, 100% accuracy (rounded to two decimal places) is achieved on the test data.
Classification report
report = classification_report(y_test, y_test_pred)
print(report)
precision recall f1-score support
Bulgarian 1.00 1.00 1.00 9180
Czech 1.00 1.00 1.00 7948
Danish 1.00 1.00 1.00 9034
Dutch 1.00 1.00 1.00 10101
English 1.00 1.00 1.00 9990
Estonian 1.00 1.00 1.00 6804
Finnish 1.00 1.00 1.00 6649
French 1.00 1.00 1.00 10541
German 1.00 1.00 1.00 9259
Greek 1.00 1.00 1.00 10290
Hungarian 1.00 1.00 1.00 7905
Italian 1.00 1.00 1.00 10042
Latvian 1.00 1.00 1.00 7404
Lithuanian 1.00 1.00 1.00 7073
Polish 1.00 1.00 1.00 8005
Portuguese 1.00 1.00 1.00 10184
Romanian 1.00 1.00 1.00 9623
Slovak 1.00 1.00 1.00 7987
Slovenian 1.00 1.00 1.00 7933
Spanish 1.00 1.00 1.00 10472
Swedish 1.00 1.00 1.00 8879
accuracy 1.00 185303
macro avg 1.00 1.00 1.00 185303
weighted avg 1.00 1.00 1.00 185303
Precision, recall and f1-score all reach their maximum value!
Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
labels = sorted(languages)  # confusion_matrix orders the classes alphabetically
fig = plt.figure(figsize=(16, 10))
sns.heatmap(cm, annot=True, cmap="Blues", fmt='.0f', xticklabels=labels, yticklabels=labels)
plt.ylabel('True labels', size=15)
plt.xlabel('Predicted labels', size=15)
plt.xticks(rotation=45)
plt.title('Confusion Matrix', size=18, y=1.02)
plt.show()
The model shows a near-perfect language classification and detection capability on the test data!
It is reasonable to think that the few misclassifications observed here come from the remaining sentences in other languages.
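One way to check this hypothesis is to look at the misclassified test sentences directly; a quick inspection sketch (mask and errors are illustration names), using the variables from the evaluation above:
# Quick look at the misclassified test sentences
mask = (y_test != y_test_pred)
errors = test_df[mask].copy()
errors['Predicted language'] = y_test_pred[mask.values]
print("Number of misclassified sentences:", errors.shape[0])
print(errors.head(10))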
Saving model
pickle.dump(Language_detector, open(data_path + "Language_detector.model", "wb"))
Saving the model gives us the possibility to re-use this pre-trained model for deployment (in a web application, for example).
9. Inference
Importing the model
Language_detector = pickle.load(open(data_path + "Language_detector.model", "rb"))
Inference on new sentences
Let’s pick a few sentences in different languages from Wikipedia and test our model’s language detection capability:
inference_sentences = {
'Bulgarian': "Статистиката е дисциплина, която се занимава със събирането, организирането, анализа, интерпретацията и представянето на данни.",
'Czech': "Statistika je vědní obor, který se zabývá sběrem, organizací, analýzou, interpretací a prezentací empirických dat za účelem prohloubení znalostí určité oblasti, obvykle hromadného jevu.",
'Danish': "Statistik er en videnskabelig metode, hvormed man effektivt anvender numeriske data, som fx kan komme fra eksperimenter, spørgeskemaer eller registre.",
'German': "Statistik „ist die Lehre von Methoden zum Umgang mit quantitativen Informationen“ (Daten).[1] Sie ist eine Möglichkeit, „eine systematische Verbindung zwischen Erfahrung (Empirie) und Theorie herzustellen“.",
'Greek': "Η Στατιστική είναι μία μεθοδική μαθηματική, παλαιότερα τεχνική και σήμερα επιστήμη που επιχειρεί να εξαγάγει έγκυρη γνώση χρησιμοποιώντας εμπειρικά δεδομένα παρατήρησης ή και πειράματος.",
'English': "Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.",
'Spanish': "La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, hombre de Estado),1 es la rama de las matemáticas que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad.",
'Estonian': "Statistika ehk arvustikuteadus on teadus, mis käsitleb andmete kogumist, töötlemist ja analüüsi. Statistika on teadus massnähtuste kvantitatiivse uurimise meetoditest.",
'Finnish': "Tilastotiede on todennäköisyyslaskentaan perustuva tieteenala, joka tutkii tilastollisten aineistojen keräämistä, käsittelyä ja tältä pohjalta tehtävää päättelyä.",
'French': "La statistique est la discipline qui étudie des phénomènes à travers la collecte de données, leur traitement, leur analyse, l'interprétation des résultats et leur présentation afin de rendre ces données compréhensibles par tous.",
'Hungarian': "A statisztika avagy számhasonlítás[1] a valóság számszerű információinak megfigyelésére, összegzésére, elemzésére és modellezésére irányuló gyakorlati tevékenység és tudomány. ",
'Italian': "La statistica è una disciplina che ha come fine lo studio quantitativo e qualitativo di un particolare fenomeno collettivo in condizioni di incertezza o non determinismo, cioè di non completa conoscenza di esso o di una sua parte.",
'Lithuanian': "Statistika – tikslusis mokslas, kuriame efektyviai panaudojami duomenys iš gautų bandymų ir eksperimentų. Į tai įeina ne tik duomenų rinkimas, jų analizė ir interpretavimas, tačiau ir duomenų prognozavimas iš apklausų ir eksperimentų rezultatų.",
'Latvian': "Statistika ir zinātne, kas nodarbojas ar datu iegūšanu, apstrādi, analīzi un to izskaidrošanu.[1] Statistika tiek plaši izmantota dažādās jomās, piemēram, valsts iestādēm statistiskie dati, kurus tās iegūst no tautas skaitīšanas, ir nepieciešami, lai noskaidrotu sabiedrības dzīves un tautsaimniecības kvantitatīvās likumsakarības.",
'Dutch': "Statistiek is de wetenschap en de techniek van het verzamelen, bewerken, interpreteren en presenteren van gegevens.",
'Polish': "Statystyka (niem. Statistik, „badanie faktów i osób publicznych”, z łac. [now.] statisticus, „polityczny, dot. polityki”, od status, „państwo, stan”) – nauka, której przedmiotem zainteresowania są metody pozyskiwania i prezentacji, a przede wszystkim analizy danych opisujących zjawiska, w tym masowe. ",
'Portuguese': "Estatística é a ciência que utiliza as teorias probabilísticas para explicar a frequência da ocorrência de eventos,[1] tanto em estudos observacionais quanto em experimentos para modelar a aleatoriedade e a incerteza de forma a estimar ou possibilitar a previsão de fenômenos futuros, conforme o caso. ",
'Romanian': "Statistica este o știință care, folosind calculul probabilităților, studiază fenomenele și procesele de tip colectiv (din societate, natură etc.) din punct de vedere cantitativ[1][2] în scopul descrierii acestora și al descoperirii legilor care guvernează manifestarea lor.",
'Slovak': "Štatistika je veda založená na využívaní empirických (t. j. založených na skúsenostiach) dát.",
'Slovenian': "Statístika je znanost in veščina o razvoju znanja z uporabo izkustvenih podatkov. Njena osnova je matematična statistika, ki je veja uporabne matematike.",
'Swedish': "Statistik är en gren inom tillämpad matematik som sysslar med insamling, utvärdering, analys och presentation av data eller information.",
}
# Setting column width display option
pd.set_option('display.max_colwidth', 100)
data = []
for language, sentence in inference_sentences.items():
sentence_array = np.array(sentence).reshape(-1)
predicted_language = Language_detector.predict(sentence_array)
data.append([sentence, language, predicted_language[0]])
inference_df = pd.DataFrame(data, columns=["Sentence", "Language", "Predicted language"])
dfi.export(inference_df, "img/language_detection_inference_df.png")
Thus I can build a language detection parser which takes a sentence as input and returns the detected language.
Language detection parser
sentence = input('Sentence: \n')
sentence_array = np.array(sentence).reshape(-1)
predicted_language = Language_detector.predict(sentence_array)
print("\n")
print(f'Language detected: {predicted_language[0]}')
Sentence:
Automatisch leren, machinaal leren of machine learning is een breed onderzoeksveld binnen kunstmatige intelligentie, dat zich bezighoudt met de ontwikkeling van algoritmes en technieken waarmee computers kunnen leren.
Language detected: Dutch
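Since logistic regression provides class probabilities, the parser can also report how confident the detection is. A minimal sketch (not part of the original parser; probas and top3 are illustration names), assuming the pipeline loaded above and the sentence_array from the parser cell:
# Report the three most probable languages with their probabilities
probas = Language_detector.predict_proba(sentence_array)[0]
top3 = sorted(zip(Language_detector.classes_, probas), key=lambda pair: pair[1], reverse=True)[:3]
for lang, proba in top3:
    print(f"{lang}: {proba:.3f}")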
10. Conclusion
The Europarl parallel corpus - extracted from the proceedings of the European Parliament - includes versions in 21 European languages.
I used this corpus, originally intended for machine translation, to build a language detection model.
The analysis of the data revealed some imprecisions: the data for each language contains a few sentences from other languages, and this holds for all languages:
- Analysing the characters, the low occurrence of letters from other alphabets shows that these appearances are anecdotal.
- On the other hand, it is harder to estimate the number of sentences from other languages whose alphabet is shared with the considered language.
Despite this, I made the initial hypothesis that these sentences are rare enough to be insignificant, leaving more precise data cleaning for a second step if necessary.
Then I applied the following basic pre-processing techniques:
- remove string begin/end spaces
- lower capital letters
- divide compound words
- remove punctuation marks
- remove digits
- remove remaining special symbols (non-word characters)
- remove extra spaces
- set the length of sentences to 50 words
I arbitrarily chose to set the length of sentences to 50 words, in order to train the machine learning model on a dataset of standardized-length sentences. Sentences have been concatenated or sliced to fit this given word length.
This pre-processing is possible because, for language detection, I do not analyse the semantic content of sentences but only their words and characters.
Incidentally, this pre-processing can split sentences from other languages or merge them with valid ones, which should dilute the imprecision in the text data.
To keep computation manageable and to prevent the memory issues that occur with this large corpus, I randomly selected only 100,000 sentences for each language.
For the Natural Language Processing part, I chose character tokenization because characters and character n-grams are the most relevant entities for recognizing a language; it handles “out of vocabulary” words coherently while preserving word-level information, and it limits the size of the vocabulary.
I built a Machine Learning model with a pipeline combining a Vectorizer and a Classifier:
- Vectorizer: the TfidfVectorizer from the Scikit-Learn library, which converts raw sentences into a matrix of TF-IDF features, with the ‘char_wb’ option to create unigram, bigram and trigram character features only from text inside word boundaries.
- Classifier: the logistic regression model from the Scikit-Learn library.
The model’s training converged in about 43 minutes, with sufficient but borderline memory usage!
The model evaluation shows:
- 100% accuracy (rounded) on the test data, even with special characters and characters from other languages.
- Precision, recall and f1-score all reaching their maximum value.
- A near-perfect language classification and detection capability on the test data.
We now have a language detection model that can accurately predict the language (among the 21 Europarl corpus languages) of any new sentence, and that can be deployed in a web application, for example.