Wednesday, September 12, 2018

Does POS tagging improve text clustering?

Answer: No, at least not for the Swedish language.

Rosell, M., 2009. Part of speech tagging for text clustering in Swedish. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009) (pp. 150-157).

How to evaluate text clustering? 
Answer:
Normalized Mutual Information (NMI), which is similar to information gain
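
As a rough illustration of the measure named above, here is a minimal sketch of NMI in plain Python. It assumes the arithmetic-mean normalization, NMI = 2·I(T;P) / (H(T) + H(P)); other normalizations exist and give different numbers. The function names and the toy labels are my own, not from any particular library or paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(truth, predicted):
    """Mutual information between two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    t_counts = Counter(truth)
    p_counts = Counter(predicted)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) )
        mi += (c / n) * math.log(c * n / (t_counts[t] * p_counts[p]))
    return mi


def nmi(truth, predicted):
    """NMI with arithmetic-mean normalization (an assumed choice)."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("labelings must be non-empty and of equal length")
    h_t, h_p = entropy(truth), entropy(predicted)
    if h_t + h_p == 0:
        # Both labelings put everything in one group: treat as identical.
        return 1.0
    return 2 * mutual_information(truth, predicted) / (h_t + h_p)


if __name__ == "__main__":
    gold = ["sport", "sport", "politics", "politics"]  # toy class labels
    clusters = [1, 1, 0, 0]                            # toy cluster ids
    print(nmi(gold, clusters))
```

The cluster ids do not have to match the class names; NMI only looks at how the two groupings overlap, which is why it suits clustering evaluation.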

Is there any dataset available for testing?
Answer:
http://ana.cachopo.org/datasets-for-single-label-text-categorization

Friday, April 6, 2018

GDPR will be tough on technology companies

The General Data Protection Regulation (GDPR) is the data protection law that takes effect in the European Union on May 25th, 2018. The law is expected to give users more control over their personal data than the existing data privacy laws do. However, it will be tough for technology companies to comply with it.

The core requirement of the law is that whenever technology companies ask users for their data, the request must be unambiguous. Users should know about all the data held about them, both processed and unprocessed, and should have the right to control or delete it. Currently, technology companies let users download the data they have uploaded but give no information about the data derived from processing it. Moreover, the data should be portable, so that users can move their data from one technology company to another.

The GDPR tells technology companies how they should protect users' privacy. It also details the fines that apply if they do not comply. The purpose of the penalties is to force the top management of technology companies to think about users' privacy as well.

Many organizations are expected to outsource their data-related problems to third parties as a result of the law. These third parties will take care of all the privacy issues of the users’ data in the form of a global privacy infrastructure. However, it will be tough for small technology companies either to comply with the law or to outsource their data privacy issues, for financial reasons.

Critics of the law argue that it will stifle innovative businesses in the European Union that depend on data and artificial intelligence. Moreover, the law will also affect technology companies operating worldwide. The critics also argue that the GDPR might not help users much. Many EU users are already stuck in legal cases with technology companies, and the companies are unable to help them get their processed data. The companies will have to restructure their data inventories in order to comply with the law, which will hit them hard financially. The biggest challenge for technology companies will be to comply with a single standard format for storing users’ data for portability. Changing the format of data storage will certainly affect their working model. How much they will be affected, only time will tell.


Tuesday, April 3, 2018

The Mathematical Corporation: Where Machine Intelligence and Human Ingenuity Achieve the Impossible [Notes from the book]

Book by
Josh Sullivan and Angela Zutavern

---------------------------Summary of what I read on April 3rd, 2018-------------------------------------

In the industrial era, we used switches, and the phrase “flip the switch”, to perform different tasks. In the current era, the largest switch is machine intelligence. However, modern advancement is due not only to technology (machine learning) but also to leadership. Leadership combined with technology makes up the successful organization that we refer to as a “mathematical corporation”.

Mathematical corporations are driven by data and algorithms. Data and algorithms have made these corporations forward-looking and experiment-oriented. Being forward-looking is a guide to future power for organizations. No organization is yet mature in using data to look forward, but their leaders understand the critical pieces needed to guide the direction of their organizations.

Instead of using big data and artificial intelligence to answer old, known questions about the organization, the leaders of mathematical corporations use them to answer unknown questions that no one is asking today. Smart machines, together with the intelligent imagination of leaders, therefore make up the “big minds” that are the driving force in mathematical corporations.

Leaders in the industry are all convinced that the big mind is disrupting current business. The power of big minds lies in predicting a future that can be valuable to their customers. The power to predict the unknown was never available to the leaders of the past.

Why were we restricted in the past? We were restricted by our inability to predict. The magic of prediction is in viewing the details. In the past, we did not have all the data needed to understand the wellbeing of organizations. With data, we are looking at things in detail and predicting what we don’t know. But how should we use the data? The answer lies in letting the machine learn the patterns. To teach machines to find unknown patterns, we also need the thinking skills to work with the machines, and this book is a guide to acquiring those skills.


Recently, many leaders have been turning their traditional organizations into mathematical corporations. Mark Fields, CEO of Ford, is among those leaders. In one of the organization’s experiments, he allowed his employees to drive cars fitted with hundreds of sensors in order to gather and analyze data to better serve customer needs. Traditionally, surveys are used to gather data from customers. However, surveys fall short of sensor data when it comes to understanding needs that even the customers themselves don’t know they have. According to Fields, organizations work in two parallel worlds: real and digital. The digital world helps make the real world better by predicting unknown requirements. Ford is not alone in gathering data and serving customers based on it.

There are many other examples as well. Gathering data from unconventional sources, such as social media, has also helped many organizations serve their customers better. For example, gathering data from social media helped GlaxoSmithKline (GSK) recall one of its products and improve its reputation among customers. GSK succeeded thanks to new data collection tools that were not available previously. Therefore, using the ever-growing set of data gathering tools, mathematical corporations will keep on disrupting business in new ways.

Tuesday, March 27, 2018

Social media and Surveillance capitalism

We all use social media, such as Facebook and Twitter, and these platforms have become part of our daily life. I often ask my data mining students why these platforms are free of cost in a world where nothing is free. A number of students answer that these companies earn through advertisements. However, these students get stuck when I ask a follow-up question: why do you not get annoyed by advertisements on social media? We get really annoyed watching advertisements on TV and change the channel during them, but we hardly ever leave social media. Why? And if the advertisements on social media are so few, how do these companies earn? The answer lies in the term “surveillance capitalism” used by Harvard Business School professor Shoshana Zuboff.

Most people are aware of two big buzzwords in data science: big data and personalized recommender systems. The business model of social media companies such as Facebook, Twitter, and Google revolves around these two. Using personalized recommender systems, social media companies target each advertisement at a user who is most probably looking for that kind of product. Therefore, when we see an advertisement on social media, we hardly get annoyed. However, to understand the specific, individual requirements of users, social media platforms need huge amounts of personal data. To gather that data, the platform is provided free of cost. Users are profiled, compared with other users, and categorized. After categorization, advertisements are targeted at specific people rather than shown at random. This is the good side of recommender systems. However, if we look closely, all surveillance systems are built on the same model, and this is what Shoshana Zuboff points out.
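
To make the profile, categorize, and target steps concrete for my data mining students, here is a toy sketch in Python. Everything in it is hypothetical: the topics, the segment names, the ads, and the choice of cosine similarity are my own illustrative assumptions, and real platforms certainly use far richer data and models.

```python
import math

# Hypothetical topics; a user profile is a count of interactions per topic.
TOPICS = ["sports", "cooking", "travel"]

# Hypothetical audience segments, each described by a reference profile.
SEGMENTS = {
    "sports_fans": [1.0, 0.0, 0.0],
    "home_cooks": [0.0, 1.0, 0.0],
    "travellers": [0.0, 0.0, 1.0],
}

# Hypothetical ad assigned to each segment.
ADS = {
    "sports_fans": "running shoes",
    "home_cooks": "cookware set",
    "travellers": "flight deals",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(profile):
    """Assign a user to the segment whose reference profile is most similar."""
    return max(SEGMENTS, key=lambda name: cosine(profile, SEGMENTS[name]))


def pick_ad(profile):
    """Target the ad of the user's segment instead of a random one."""
    return ADS[categorize(profile)]


if __name__ == "__main__":
    user = [9, 1, 2]  # mostly sports-related activity
    print(categorize(user), "->", pick_ad(user))
```

The point of the sketch is that the more activity data goes into the profile, the more confident the categorization becomes, which is exactly why the platforms want as much of our data as possible.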

OK, now we know social media is a surveillance system, but why are we not able to leave social media platforms easily? The reason is that we are being used as guinea pigs: social media companies run a number of psychological and behavior-prediction tests on us. We are being studied in order to keep us connected to these platforms and to gather more data from us. As a result of these tests, social media is engineered to be as habit-forming as cocaine. I agree with Julian Assange that we are the product in the business model of the social media companies.

Social media companies are not restricted to surveillance for advertising. Nothing is stopping them from playing dirty. Most importantly, these platforms are being used to brainwash people for different agendas. The Cambridge Analytica scandal has brought this dirty game into the limelight, showing that these platforms can even affect election results. Anyone who thinks that he or she cannot be brainwashed should look at the smartphone in his or her hand. The business model of social media platforms requires a constant input of data. Therefore, use cases for smartphones have been devised. We are brainwashed into buying smartphones so that we constantly provide personal data to these companies. Even older mobile phone companies, such as Nokia, failed to counter the force of the social media companies. We have bought the bugging devices of these companies of our own free will. Interesting, isn’t it? Have you ever wondered why smartphone batteries drain so quickly even when you are not using the phone? Guess for yourself.

Apart from what we like or dislike, smartphones have helped social media companies learn where we work, where our homes are, what type of jobs we do, whom we talk to, and who our colleagues are. For example, if someone wants to identify Army officers of Pakistan in a certain city and their office locations, these companies can provide all the details much better than any insider.

Seeing the power of the online surveillance used by social media companies, many countries are planning to profile their citizens and rate them. These countries will allow or restrict people’s access to things such as travel or online purchasing based on their rating. On the surface, it looks good that governments will be able to track each individual citizen, and this will help in curbing problems such as terrorism. However, the same power will also help authoritarian governments control their citizens. They will be able to brainwash their citizens, suppress freedom of speech, and gain more power.

In the real world, we make mistakes, learn from them, and leave those mistakes in the past. However, in the systems where we are profiled (social media and mobile phones) and will be profiled (citizen rating systems run by governments), we will not be able to get rid of our past mistakes. They will always remain as black spots on our profile. Imagine how you will feel when, during a job interview, the interviewer tells you that you were booked by the traffic police for doing a wheelie on a bike 20 years ago, or that you expressed a negative opinion about someone on your Twitter account 10 years ago, so you have a lower rating than the other candidate.


Would you like to live like a guinea pig that is constantly being observed and documented? I don’t.