But what you're testing for is not the official MBTI: it hasn't been verified by an outside source, and its definitions aren't standardized.
There is likely bias: the results might say more about which types are popular here than about anyone's actual personality.
You are trusting that people here have entered the correct MBTI type, and there are several issues with this:
- Many people here are not typed properly.
- Many people have purposely entered a different type than the one they actually believe they are.
- Most people here type by functions, NOT dichotomies, even though dichotomies are how the official MBTI actually types.
- In addition, there are many differences of opinion about the definitions of the functions, if that is what was used to decide a type.
- A point Vendrah brought up: where you draw text from is significant. The text in someone's diary differs drastically from the text they write in an academic paper. People here use the forum in different ways: some Thinkers write diary-style blog posts, while some Feelers don't blog deeply about their emotions for everyone here to read.
You've created and tested this in the same place, with no outside verification that it measures what you say it measures. I truly wish you would provide that verification, because it is exciting that you may have created something that can do what you say. I just think you're putting the cart before the horse by claiming it works without verifying that it does.
You've done a great job creating it. Just please test it more so that you can progress further in the direction you want to take it in.
I do have some experience in developing tests. I created the one on the forum and ran some experiments in an attempt to improve the questions. The key thing I found through that exercise is that questions related to cognitive functions are not very good at predicting personality type. I know functions are the foundation of the system, but in my experience I could not get them to work very well. The reason is that it is extremely common for someone to have two functions, say Fe and Ni, closely correlated in their ranking, perhaps in reverse order, so you often get the type wrong. People also don't reliably test for function order in general, so function-ordering tests aren't very good at discerning type.

The official MBTI test uses dichotomies, and I understand why: they simply work better. The worst prediction accuracy in the forum test is the last letter, P vs. J. It fails because I'm relying on function preferences and order from the other three letters to determine it. If I changed it to a dichotomy evaluation, I'd get much more accurate results. I know what is accurate because I ask people whether they have taken an assessment before, what their type is, and how sure they are that it is accurate. I should fix the last-letter inaccuracy, but it would require a significant change in logic.
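To make the dichotomy idea concrete, here is a minimal sketch of dichotomy-style scoring. This is not the forum test's actual logic; the questions, weights, and the -2..+2 agreement scale are all assumptions for illustration. Each question is tagged with a dichotomy and a direction, answers are summed per dichotomy, and the sign of the total picks the letter.

```python
# Hypothetical questions: (dichotomy, weight toward the first letter).
# Answers are on a -2..+2 agreement scale. Both are illustrative
# assumptions, not the forum test's real items or weights.
QUESTIONS = [
    ("EI", +1),  # agreeing leans E
    ("EI", -1),  # agreeing leans I
    ("SN", +1),  # agreeing leans S
    ("TF", -1),  # agreeing leans F
    ("JP", +1),  # agreeing leans J
]
LETTERS = {"EI": ("E", "I"), "SN": ("S", "N"), "TF": ("T", "F"), "JP": ("J", "P")}

def score(answers):
    """Sum weighted answers per dichotomy; the sign of each total picks the letter."""
    totals = {d: 0 for d in LETTERS}
    for (dichotomy, weight), answer in zip(QUESTIONS, answers):
        totals[dichotomy] += weight * answer
    # Ties fall to the second letter here, an arbitrary choice.
    return "".join(first if totals[d] > 0 else second
                   for d, (first, second) in LETTERS.items())

result = score([2, -1, 1, 2, -2])  # one answer per question above
```

With these made-up answers the EI total is positive and the TF and JP totals are negative, so `result` comes out "ESFP". The point is that each letter is decided independently by its own questions, rather than inferred from a function ordering.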
As to machine learning, the way it works is that you start with a dataset whose labels you know to be reasonably accurate. You train a machine-learning classifier on a set of examples with those labels, giving you a classifier that can predict the result. Then you run it against a held-out test dataset and compare the predicted labels against the actual labels. You can try different classifiers, such as Naive Bayes, BERT, or Support Vector Machines, to see which works better. Then you can tune the parameters you feed into the classifier, and the number of epochs, to reduce the loss. Your test set validates the result. There might be bias, but there are proven methods to measure these things, as described here, which we used in our analysis.

There is also the issue that the most common types on the forum are wildly different from those in the general population. We had to adjust for that, or the classifier would be biased towards the types that are more common on the forum. We balanced the data two ways through oversampling: first with an even distribution of each type, and second with a distribution matching the types in the general population. The even distribution seemed to work fine.
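The train/evaluate loop and the "even distribution" oversampling can be sketched in a few lines with scikit-learn. The posts and labels below are tiny made-up examples, not the real dataset, and a Naive Bayes model stands in for whichever classifier you'd actually pick; a real run would use thousands of labeled rows and a proper train/test split.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

random.seed(0)

# Toy labeled rows: (post text, self-reported type). Deliberately imbalanced.
data = [
    ("i love planning every detail of my week", "INTJ"),
    ("schedules and systems make everything efficient", "INTJ"),
    ("long term strategy is what i enjoy most", "INTJ"),
    ("i just go with the flow and improvise", "ENFP"),
    ("spontaneous road trips are the best", "ENFP"),
]

# Oversample: duplicate minority-class rows until every label appears
# as often as the most common one (the "even distribution" approach).
counts = Counter(label for _, label in data)
target = max(counts.values())
balanced = list(data)
for label, n in counts.items():
    rows = [row for row in data if row[1] == label]
    balanced += random.choices(rows, k=target - n)

# Train on the balanced set, then score predictions on held-out examples.
texts = [text for text, _ in balanced]
labels = [label for _, label in balanced]
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(texts), labels)

test_texts = ["i like to improvise on road trips", "i enjoy planning a strategy"]
test_labels = ["ENFP", "INTJ"]
predictions = clf.predict(vec.transform(test_texts))
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
```

Swapping `MultinomialNB` for an SVM or a fine-tuned BERT model changes only the classifier line; the label-accuracy comparison at the end is the same in every case.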
You are absolutely right that where you take text from makes a real difference. One of the recommendations in our paper is to find more diverse data than what we used. There is a problem, though: machine learning requires a certain volume of data. Other comparable studies used datasets of 8,500 rows. The biggest one used 30,000, combining Reddit, PersC and Twitter posts. We trained on over a million rows, which helps make it more accurate. It's possible there is a dataset out there that is better than a personality forum, but I'm not sure where you would get it. Again, the idea isn't to obtain perfection. You're going to have mislabeled data in any dataset you find; errors are expected. The goal is to create something that is practically useful in its predictions, and 80% is a good result.

I did have an automated test on the forum for a few years that predicted Big 5 type through machine learning. It ran against Facebook, Tumblr, Twitter and forum posts, using an API from IBM that ran the classifier against the input text. You could select how many posts you wanted to feed into it. The Facebook and Twitter results were always pretty poor, but the forum and Tumblr seemed to work a lot better, so if I were to add more diverse data, I'd probably consider Tumblr and Reddit rather than Twitter.
I should mention that we trained classifiers in two ways. The first was to predict the 4-letter type directly. The second was four binary predictions, one for each dichotomy. You could also make four binary predictions for cognitive functions; I wonder how that would turn out. It would be interesting.
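The four-binary-predictions approach can be sketched as one classifier per letter position, with the four letter predictions concatenated into a type. Again the posts, labels, and choice of Naive Bayes are illustrative assumptions, not our actual setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy rows chosen so every letter position has both of its letters present.
data = [
    ("i recharge alone and plan long term strategy", "INTJ"),
    ("spontaneous parties and new ideas energize me", "ENFP"),
    ("i act fast and love hands on competition", "ESTP"),
    ("quiet routines and caring for others suit me", "ISFJ"),
]
texts = [text for text, _ in data]
types = [label for _, label in data]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# One binary classifier per letter position: E/I, S/N, T/F, J/P.
classifiers = []
for pos in range(4):
    clf = MultinomialNB()
    clf.fit(X, [t[pos] for t in types])
    classifiers.append(clf)

def predict_type(text):
    """Concatenate the four binary predictions into a 4-letter type."""
    x = vec.transform([text])
    return "".join(clf.predict(x)[0] for clf in classifiers)

predicted = predict_type("i improvise and love meeting new people")
```

The same scaffolding would work for functions: replace the letter labels at each position with binary has/lacks labels for Ni, Fe, and so on, which is what makes that variant easy to try.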
I have realized we made a mistake in our classifier and need to make an adjustment. The end result will be less accurate, but based on what we have so far, I think it is still an interesting result.