In the AI Era, is going to college worth it to become a programmer?

In practical terms, you do not need a university degree to become a programmer because, unlike law and medicine, software engineering is not a regulated profession. You just need the skills. This concern is even bigger in the AI era. People wonder if going to college is still worth it. “I can build a React web app using ChatGPT without knowing tons of JavaScript and data structures,” they say. Some youngsters have been programming since they were 8 or 10 years old.

I am here to say: “Yes, it is worth it!” But maybe not for the reasons you might think. I will discuss AI too.

Yes, go to college!

In general, going to college to become a software developer/engineer or the like is a good idea. However, it is not because of the course content. You do not need to formally enroll in an “Introduction to Programming in Python” course at the university to learn how to program in Python. This would be beneficial, but it is not strictly necessary, because there is a wealth of free online material for you to study. I am not talking about a YouTuber’s tutorial on Python. I am talking about top-tier university courses taught by some of the best professors in the world: MIT OpenCourseWare, Stanford Online, and complete courses on YouTube, such as Harvard CS50 – Full Computer Science University Course and CMU Intro to Database Systems.

Now, some of the reasons why going to college is worth it are the following:

  1. Getting into a good university is difficult. If you overcome this challenge, it will prepare you for other things.
  2. The university gives you various opportunities outside the classroom, such as contact with professionals and researchers, all sorts of events, internships, exchange programs, and first experiences with teaching (as a teaching assistant), research (bachelor research projects), science outreach, etc.
  3. Your classmates may become valuable connections in the future. While you study together, you are all in the same position. This is like buying shares while they are low: they may go up later.
  4. The whole process of completing a STEM degree is not easy. Overcoming this barrier makes you better prepared for other things in life.
  5. At the university, you have opportunities to acquire and train soft skills in a controlled environment, such as when managing a group project, presenting assignments, and interacting with your classmates and teachers. If you mess up, you will not be putting your job at risk.

In summary, the university opens up a world of opportunities and multiple learning environments that will definitely be beneficial for almost anyone who wants to become a programmer, even if you are already a self-taught coder.

When is it not worth it?

In some particular cases, going directly to college may not be feasible. I am referring to the unfortunate situation in which a person does not have enough money or time. Think of an adult who has to work 8 hours a day to support a partner and a kid, but who wants to shift from their current job to software engineering. Because this person faces urgency, money, and time constraints, it may be better to learn practical skills that can lead to a junior position as soon as possible, such as JavaScript, HTML, CSS, and some web development frameworks. I am not saying this is the best path for a person in this situation. I am just saying this person has to deal with far more issues than an 18-year-old who is supported by their parents.

But AI will code for me…

Current AI systems cannot be much better than an average programmer because many important codebases are not open-source (that is, not available online to be web-scraped), and a huge share of the open-source code that is available consists of repetitive, ordinary Python and JavaScript projects and code snippets. In other words, the training data is not so incredible… Moreover, the so-called “hallucinations” (unexpected, undesired token outputs) are unavoidable by design. Even if they were rare (and they are not), this alone would be enough to raise concerns about employing LLMs in a pipeline executing numerous operations per minute without human supervision. Error-handling techniques and data validation can mitigate this, but I do not know whether they can solve the problem. In some use cases, this might not be a major issue. However, for sensitive applications, it easily becomes a serious one. Imagine an LLM deleting entries and updating databases. Even if it makes 1 error out of 1000 prompts, the problem quickly scales in a large system handling thousands of operations per minute.
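The scaling argument is just arithmetic. A back-of-the-envelope sketch in Python; the 1-in-1000 error rate is taken from the example above, and the 2000 operations per minute is an illustrative assumption of mine:

```python
# Illustrative assumptions: a "rare" 1-in-1000 LLM error rate and a pipeline
# running 2000 unsupervised operations per minute.
error_rate = 1 / 1000
ops_per_minute = 2000

errors_per_hour = error_rate * ops_per_minute * 60   # expected errors per hour
errors_per_day = errors_per_hour * 24                # expected errors per day

# Probability that a full day passes without a single error:
p_clean_day = (1 - error_rate) ** (ops_per_minute * 60 * 24)

print(errors_per_hour, errors_per_day, p_clean_day)  # 120.0 2880.0 0.0
```

Even at a 0.1% failure rate, such a pipeline would be expected to produce thousands of bad operations per day, and the chance of getting through a single day cleanly is effectively zero. That is why unsupervised database writes are the scary case.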

In the video below (in Portuguese), Lucas Montano, a senior programmer, gave Devin root access to his VPS to work on his small hobby project. It is very entertaining and revealing about the programming capabilities of AI systems. Asked to execute simple instructions, Devin made catastrophic errors, deleting important files and committing thousands of lines out of the blue. The funny thing is that, after being questioned, Devin lied and said it understood… Lucas, as an experienced programmer, spotted all the issues and rolled back the codebase while unsuccessfully trying to make Devin do things correctly. Devin’s behavior is not a surprise. It simply does not understand anything. It does not treat texts as carriers of meaning, to be understood and used to guide future actions, but as sequences of tokens. Devin’s job is to guess the next tokens using probability and linear algebra. Our interaction with Devin is not communication, but a wider context window (more tokens) for it to try its luck at guessing the rest. There will always be absurd errors like those. Incremental progress is expected, but major advances would require novel AI paradigms that guarantee the consistency and quality of the system’s outputs.

Trying out LLMs to disprove mathematical conjectures with luck

Some unsolved problems in Number Theory are surprisingly not so hard to understand, even for a non-mathematician like myself. The Collatz conjecture is a famous example, but there are others:

  • Beal conjecture: If A^{x} + B^{y} = C^{z}, where A, B, C, x, y, and z are positive integers and x, y, z ≥ 3, do A, B, and C have a common prime factor?
  • Feit–Thompson conjecture¹: There are no distinct prime numbers p and q such that \frac{p^{q} - 1}{p - 1} divides \frac{q^{p} - 1}{q - 1}

If these conjectures turn out to be false, computational methods may be able to find counterexamples. In the case of the Beal conjecture, we just need to find positive integers A, B, C, x, y, and z, with x, y, z ≥ 3, such that A^{x} + B^{y} = C^{z} and A, B, and C share no common prime factor. A counterexample to the Feit–Thompson conjecture is simply a pair of distinct primes p and q such that \frac{p^{q} - 1}{p - 1} divides \frac{q^{p} - 1}{q - 1}. If we get lucky, we can find counterexamples by massively testing cases.
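Checking a single candidate pair is cheap precisely because Python integers have arbitrary precision. A minimal sketch (the function name is my own):

```python
from math import gcd

def feit_thompson_candidate(p: int, q: int):
    """For distinct primes p and q, test whether (p^q - 1)/(p - 1)
    divides (q^p - 1)/(q - 1), and report the gcd of the two quantities."""
    a = (p**q - 1) // (p - 1)
    b = (q**p - 1) // (q - 1)
    return b % a == 0, gcd(a, b)

# Stephens' famous pair: the two quantities share the factor
# 112643 = 2*17*3313 + 1, yet the divisibility itself still fails,
# so the conjecture remains open.
divides, common = feit_thompson_candidate(17, 3313)
print(divides, common)  # False 112643
```

Note that 17^3313 has about four thousand decimal digits, and Python handles it without complaint, which is exactly the kind of thing that trips up naive C code.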

What if we could write fast C programs to run these tests? I am not familiar with the C programming language. However, thanks to LLMs, we can now write such programs without knowing much about the subject or the language. In my experiments, I asked ChatGPT, Copilot, and Gemini for C code to test these and other conjectures. The results were not great. Although I could compile and run the code, the outputs were almost always wrong. Part of the problem is obvious: C’s fixed-width integer types cannot hold the numbers once the code tests more or larger values. I was expecting that. Still, if the conjectures were false, there was a chance I could find relatively small counterexamples. But how did I know that some outputs were wrong? Because it is much easier to check whether a single example satisfies or disproves a conjecture than to find that example (if it exists). I asked the LLMs for Python code to do the checking (at least I can understand Python code). In some cases, we can even check manually. I was aware that some conjectures have already been tested up to very large numbers, which is why I tried to select what I believed to be underexplored conjectures. No progress in mathematics has been made, though. 😂
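For what it is worth, the same brute-force idea can be sketched in Python instead of C; slower, but arbitrary-precision integers remove the overflow problem entirely. The function and its bounds are my own illustrative choices:

```python
from itertools import product
from math import gcd

def beal_search(max_base: int, max_exp: int, require_coprime: bool = True):
    """Search for A^x + B^y = C^z with x, y, z >= 3 within the given bounds.
    With require_coprime=True, a hit would be a Beal counterexample."""
    # Precompute all C^z values so each sum is checked in O(1).
    # Collisions (64 = 2^6 = 4^3) keep the last representation; equal values
    # share the same prime factors, so the coprimality test is unaffected.
    powers = {}
    for c, z in product(range(2, max_base + 1), range(3, max_exp + 1)):
        powers[c**z] = (c, z)
    for a, x in product(range(1, max_base + 1), range(3, max_exp + 1)):
        for b, y in product(range(1, max_base + 1), range(3, max_exp + 1)):
            hit = powers.get(a**x + b**y)
            if hit and (not require_coprime or gcd(gcd(a, b), hit[0]) == 1):
                return (a, x, b, y, *hit)
    return None

print(beal_search(30, 6))                         # None: no counterexample here
print(beal_search(10, 6, require_coprime=False))  # (2, 3, 2, 3, 2, 4), i.e. 2^3 + 2^3 = 2^4
```

Dropping the coprimality filter shows the search machinery works: it immediately finds non-coprime solutions like 2^3 + 2^3 = 2^4, which satisfy the equation but are exactly what the conjecture predicts (the bases share a prime factor).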

For a moment, I kind of felt like I was doing mathematical research to some extent but, at the same time, it all looked so fake, since I had very poor knowledge of the subject (Number Theory and C programming). One risk of using LLMs to speed up tasks is remaining ignorant of the matter while still getting some things done. Instead of studying C and thinking about how to use it to attack certain mathematical problems, I could just write prompts and run some tests. As fun as it was, it is clear to me that I need much more than prompts to LLMs to do research in STEM.

  1. For p=17 and q=3313, N. M. Stephens showed in 1971 that the two quantities share the common factor 112643, disproving the stronger conjecture that they are always coprime; the divisibility itself still does not hold, so the Feit–Thompson conjecture remains open. No other such pairs exist with both values less than 400000. Source here. ↩︎

PhD Conclusion: Another Milestone

On June 28th, 2024, I successfully concluded my PhD in Computer Science at the Free University of Bozen-Bolzano. Many things have happened since I was accepted into this PhD program, including the COVID-19 pandemic. I feel like I have sacrificed a lot to achieve this. In exchange, I accomplished something I did not even imagine was possible for someone like me. The whole experience of living abroad, first in Italy, now in the Netherlands, has been indescribable. I could not have done much had I not crossed paths with several people. Since the beginning, I have been very lucky. So, I expressed my thanks in the acknowledgements section of my PhD thesis.

Although I now hold a PhD in Computer Science, I still feel like I must prove to myself that I deserve it by acquiring solid foundations in the field. I have become very specialized in ontology, conceptual modeling, risk and security modeling, the Unified Foundational Ontology, OntoUML, and related topics. However, there is a lot of basic knowledge in Computer Science that I want to master: databases, data structures, mathematics, computer architecture, etc. While I am alive, I will be learning, I guess.

Tips for Researchers

As I drastically shifted from practicing law in Brazil to a research career in computer science in Europe, I have tried to understand what works and what does not for developing my research career and work. I have collected several tips that I want to share here. Of course, you may disagree with them, since they reflect my personal experience, but I hope they give you something to consider.

Collaboration is essential

I was used to doing things by myself and, in the end, delivering what I was supposed to deliver. This is a mistake! You should always collaborate with your colleagues and other researchers. They will share their experience and knowledge with you, noticing things you have not seen. A simple conversation with other researchers can uncover your blind spots.

In the same context of collaboration, you should always communicate your progress to your colleagues and bosses. And do it periodically, instead of waiting for weeks or months. Do not work alone for long periods. The quicker you share results and the course of your work with your colleagues, the faster you will receive feedback, and find mistakes and misunderstandings. For example, if you are writing a paper aiming at a given conference deadline, do not wait until the last week to share your progress with your colleagues. Otherwise, you will not be able to fix the problems on time.

To explore ideas, reduce their scope and implement them

Sometimes, we have ideas and think they are great. We want to share them immediately with our colleagues. But ideas are just ideas; by themselves, they are worth nothing. Try working on a smaller, limited version of your idea, implement something concrete that can be evaluated, and then share your preliminary results. This will be much more productive, because you will get feedback based on something concrete instead of a vague conversation.

In this same context, you should not wait for solutions from others. Always try to bring something concrete yourself. It is going to be more fruitful to discuss not only ideas but also solutions and implementations. For instance, if you think mapping one language to another will bring major benefits, start by mapping toy versions of those languages, and see whether it is possible, how it is possible, what you gain from it, and how you can extend the solution to the real case.

Improvisation and patches accumulate debts

If you are in a hurry to meet deadlines, you may be tempted to improvise and submit under-developed work. You will make excuses for yourself, saying you could have done better work under more favorable conditions (say, if you had had more time). The problem with this way of working is that it is harder to fix a bad product, which accumulates debt for you as time goes on. You may meet the deadline, but now you will have to rework the same thing for another deadline, delaying your other tasks and messing up your whole plan. It is easier to further develop a carefully made product.

Being a low-profile researcher is bad nowadays

Some people value their privacy to the point that they minimize their public online presence. They have a few inactive social media accounts. They do not produce any online content. They rarely speak publicly, unless they have to. Although this is a valid way of living, it is not good for your research career. Whether you like it or not, researchers are public individuals whose work should be available and explained to the scientific community and society. As a researcher, you get (often public) funding for a greater cause, not for sustaining your way of life.

You should publicize your research as much as possible with a presence on ResearchGate, arXiv, YouTube, Google Scholar, Twitter, LinkedIn, etc. You should smooth the way for stakeholders to reach your research outputs. You can do this by documenting your research on the right media beyond the published article (for instance, by publishing additional material on GitHub). In particular, I try to build a GitHub repository for each “project” (say, a proposed artifact), which can be associated with multiple publications, always adding permanent identifiers, such as DOIs generated by Zenodo, PURL, or w3id.org. The project naturally evolves with theoretical and practical developments, and you can track them by having a proper website (for example, like this).

You should make clear the scientific relevance and business value of your research.

Moreover, people will want to get to know you and your work. You should have a personal website! This is very easy nowadays with services such as Wix, Google Sites, and GitHub Pages. Creating content about yourself and your activities is the best way to shape the public image you would like, instead of letting other people build your image for you (because they will do it anyway, even if you keep a low profile!).

Lasting impact

Lasting impacts depend on timing and chance. This is in part out of your control because there are too many external factors involved.

But there is one thing you can do: progressively develop research on top of previous results, either yours or someone else’s. This way, you can go further and do innovative things. If you work on separate solutions to very narrow problems, you may gain some skills but will not be able to leverage your previous work to do something that was not possible before. If, instead, you build an information artifact, a method, or a mathematical theory, you can then build something new on top of those outcomes, and no one else will be able to do the same, because this option was simply not available until now.

You may think you became a scientist because of your love for knowledge. However, in science, we only care about interesting knowledge: perhaps because it opens novel questions or challenges current knowledge; maybe because it connects with our current theories; sometimes because it is more useful than current solutions. You are not paid to satisfy your curiosity.

In this context, do not miss the “forest”, the big picture of your research work, despite often working on “trees”.

Teaching and Management

Lastly, do not forget that part of your work as a scientist involves teaching and management activities (for instance, organizing events).

Bonus

What a journey! Brazil, Italy, and the Netherlands

After three years of living in Bolzano, Italy, I have just moved to Enschede, in the Netherlands. Now, I am a Guest Researcher at the University of Twente (UT) as part of my Period Abroad as a Ph.D. student in Computer Science at the Free University of Bozen-Bolzano (Unibz). I am a member of the Semantics, Cybersecurity & Services group (SCS), led by my supervisor Prof. Giancarlo Guizzardi.

It has been quite a journey for me, something I had never thought would be possible when I was in law school. When I look back on my origin, struggles, and resilience, it is really amazing to realize all the things I have done. I feel so grateful and happy to be here. I have to say along the way I was lucky enough to find good people who helped me with every step. I am sure that without them I would have never achieved the same things.

Let me say a few words about UT: walking around its campus, you feel you are part of something important in the development of society! It is like breathing science and technology! Some buildings are true works of art. I am impressed by how good the staff is at solving problems. Everything seems well-designed for efficiency, including the information on the website. I am very excited to work at this place. I expect to live in the Netherlands for the next few years. Let’s see where this goes.

Basic Personal Cybersecurity Measures

My Ph.D. research concerns security modeling and analysis through an ontological approach. Because of that, I have been studying risk management in general and cybersecurity in particular. In the process, I realized how vulnerable I was, since I was not applying basic security mechanisms to protect my accounts and devices. Here are some practical recommendations for anyone.

TURN ON YOUR FIREWALL

A firewall is a “network security system that monitors and controls incoming and outgoing network traffic based on predetermined security rules”. By default, you want to deny incoming and allow outgoing connections. Here is a tutorial on UFW, a firewall available by default in several Linux distributions (Ubuntu, Debian). You can check whether it is active with sudo ufw status. Most likely, you just need to activate it with the following command: sudo ufw enable. On macOS, you can simply activate the firewall in System Settings > Networking. I am not using Windows anymore, so I do not know exactly how to handle this there. However, I believe Windows takes care of this by default through Windows Defender.

USE A PASSWORD MANAGER

Thanks to a suggestion from my colleague Pedro Paulo, I started using a password manager. You should use one too! Two good, well-known options are Bitwarden (which has a free tier) and 1Password (paid). A password manager not only protects your accounts but also makes your life easier when you have to log in. You have a single, very long master password (ideally, a passphrase longer than 25 characters). You must not forget this one! Then, use the password manager to generate a different random password, at least 25 characters long with upper- and lower-case letters, numbers, and special characters, for every account you have (something like this: fZ6^8AHT^ciqEx$$J!#*ig4pK).
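What the generator inside a password manager does is conceptually simple. A sketch using Python’s `secrets` module (a cryptographically secure RNG); the character set is my own illustrative choice:

```python
import secrets
import string

# Letters, digits, and a handful of symbols; real managers let you tune this.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"

def random_password(length: int = 25) -> str:
    """Draw each character independently from a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(random_password())  # something in the style of fZ6^8AHT^ciqEx$$J!#*ig4pK
```

The point of `secrets` (rather than `random`) is that its output is not predictable from previous outputs, which is the property a password generator actually needs.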

You can easily fill in the login and password fields by simply clicking the password manager’s options. You can also save your bank account data, cards, and important notes inside the password manager’s vault. Do not rely on browser password managers (Google Chrome, Edge, Firefox, etc.). They are not as good as a specialized password manager, such as the ones mentioned above.

Recently, LastPass, a password manager company, suffered an attack in which criminals gained access to the encrypted vaults, so I do not recommend this password manager. 1Password wrote about the case on its blog, highlighting the importance of long master passwords and additional security measures. However, any password manager with cloud infrastructure can be targeted by this kind of attack. These companies are very attractive targets for criminals because they are troves of sensitive data, even if encrypted. To make your data less vulnerable, you can self-host your vault (check this and this, in the case of Bitwarden). By doing this, your data becomes a much less attractive target, because it would require a lot of effort from an attacker to get the data of a single person. Nevertheless, only self-host your vault if you know exactly what you are doing. Otherwise, simply rely on the security of the password manager of your choice.

USE TWO-FACTOR AUTHENTICATION (2FA)

Use 2FA for every account that offers it, including the account of your password manager itself. There are several good free options. Adding this second login step makes stealing your data much harder, particularly if you already have a long random password as the first step.

USE A VIRTUAL PRIVATE NETWORK (VPN)

A VPN is not about changing your location to access entertainment content from other countries. It is mostly about making your internet traffic more secure by encrypting it, which matters most when you access public Wi-Fi networks. Remember that a VPN provider is a private company, so you need to know its reputation and services in order to choose a good one. Here is a ranking that considers many important characteristics. By default, do not trust free VPNs: sometimes they make money by selling your data (logs). Check the company’s logging policy.

You can also combine a VPN with Tor, making your traffic anonymous even to your VPN provider. But note that Tor usually makes your traffic very slow. Moreover, there is a chance your firewall interferes with the VPN or Tor services, so check this too.

There are, of course, other things you can do to protect yourself and your data against possible attacks. But I would say that properly using a good password manager combined with 2FA is the minimum everyone should do to manage their accounts in our increasingly digital world.

The Distinction Between Representation and Reasoning

Types and Individuals. Source.

Knowledge representation and reasoning is the field of artificial intelligence dedicated to representing information about the world in a form that a computer system can use to solve complex tasks, such as diagnosing a medical condition, holding a dialog in natural language, or scheduling. Here I want to highlight the distinction between the task of representation, as modeling a domain of knowledge, and the task of reasoning. Each task requires different kinds of support. While this difference may seem obvious, it turns out that researchers in different communities often conflate the two, in the following sense: the interest in representation is taken to lie in what you can automatically reason from formalized knowledge. In other words, once we have sound reasoning algorithms, people should be capable of representing reality according to their needs. This emphasis on reasoning implicitly assumes that the task of representation is an easy one, or at least that it depends on personal needs that require no technical support.

One problem with this priority inversion is that reasoning algorithms are useless if the representation is badly designed. Knowledge-based systems can only make the right inferences for the solutions they are supposed to support if the knowledge base is adequate. While reasoning algorithms can help debug a model (knowledge base) by finding inconsistencies, they offer little assistance on how to design a knowledge base that is adequate to the domain. The means of representation is very often a logical language, such as First-Order Logic (FOL) or OWL, which, again, offers no clue about knowledge engineering. Modeling errors are far more common than we think: researchers have found a large number of modeling mistakes in the Wikidata knowledge graph associated with the failure to apply the distinction between types and individuals, that is, incorrect uses of instantiation, a relation between an individual and a type, and specialization (or subtyping), a relation between two types.

We should keep in mind that we primarily represent the world for ourselves, that is, for human understanding and communication, mainly by creating diagrams and drawings as abstractions of reality. This general modeling task is the central concern of the field of Conceptual Modeling. I claim that several other tasks that look more straightforward depend on this one: (a) machine-readable representation, (b) automated reasoning, (c) database design, and (d) problem-solving, among others. This is so because human understanding and communication come first; based on them, we move on to doing many different things that require different optimizations.

Let’s consider this formal theory describing what events are and the entities and relations involved. Such a theory is important because it accounts for change, a key aspect in representing many domains. The mereology of events contains a has-part relation between events that is a strict partial order, stated by three different axioms: (A1) irreflexivity, (A2) asymmetry, and (A3) transitivity. However, a transitive relation is asymmetric if, and only if, it is irreflexive. This means having both A1 and A2 is redundant: these axioms are not independent in this theory, since one can be deduced from the other (assuming A3). The question, then, is: should we minimize the theory to the point where it has only independent axioms? The answer depends on what the theory is for. As a model of reality that helps us understand the nature of events, the theory is better with all those axioms, because they are informative and communicative; they help us understand what the has-part relation means. For automated reasoning purposes, however, such as deducing logical implications, the optimal theory should have the fewest possible axioms, because the more axioms there are, the harder it is to compute satisfiability, which is the basis of reasoning services. There is no problem in having two versions of the theory, since each serves different goals. This example makes clear the different requirements of the task of representation and the task of reasoning.
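The entailment can be written out explicitly. A sketch in my own notation, with < standing for the has-part relation between events:

```latex
\begin{align}
  &\forall x\; \neg (x < x) && \text{(A1: irreflexivity)} \\
  &\forall x, y\; (x < y \rightarrow \neg (y < x)) && \text{(A2: asymmetry)} \\
  &\forall x, y, z\; (x < y \wedge y < z \rightarrow x < z) && \text{(A3: transitivity)}
\end{align}
% A1 and A3 entail A2: if x < y and y < x both held, A3 would give x < x,
% contradicting A1. Conversely, A2 entails A1 (instantiate y := x).
```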

Now, consider three concepts: being red, being a person, and being a student. In FOL, all of them can be represented as unary predicates. Nevertheless, this ignores important aspects of them, their ontological differences (according to a given worldview): for example, (a) being red is a property that inheres in an object; it is existentially dependent on the object that bears it; (b) whatever is a person is necessarily a person: it cannot cease to be a person without ceasing to exist; (c) being a student, on the other hand, is not only changeable, but also depends on a relationship with an educational institution, an enrollment relation. Although you can express all this in FOL, you get no guidance for thinking about the representation, because FOL syntax ultimately encodes a tiny ontology of sets and membership relations.
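These differences can be sketched side by side in a modal extension of FOL; the predicate names are my own illustrative choices:

```latex
% In plain FOL, all three concepts flatten into unary predicates:
%   Red(x), Person(x), Student(x).
% A modal sketch of their distinct ontological profiles:
\begin{align}
  &\forall x\,(\mathit{Person}(x) \rightarrow \Box\,\mathit{Person}(x))
    && \text{rigidity: a person is necessarily a person} \\
  &\forall x\,(\mathit{Student}(x) \rightarrow \Diamond\,\neg\mathit{Student}(x))
    && \text{anti-rigidity: being a student is changeable} \\
  &\forall x\,(\mathit{Student}(x) \rightarrow \exists y\,(\mathit{EduInstitution}(y) \wedge \mathit{enrolledIn}(x,y)))
    && \text{relational dependence on an enrollment}
\end{align}
```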

What if you could use a visual language for representation that offers you a series of ontological distinctions embodied in its own constructs? What if you could use a language that provides a number of modeling patterns that help you create more consistent and complex models (as diagrams)? A language that is formalized, in the sense that its models correspond to a FOL axiomatization? This language is OntoUML, a general-purpose conceptual modeling language ontologically grounded in the Unified Foundational Ontology.

Research Career and Love

It is well known that a research career presents several challenges due to its instability: (a) salaries are generally lower than in industry; (b) most contracts are temporary (one, two, or three years, though renewable); (c) in any case, to advance their career, researchers are expected to work in different universities and research groups, often having to move to another country, which brings a series of difficulties of its own (housing, language, culture, etc). Having a durable love story is among these challenges.

When I moved to Bolzano (Italy) for my PhD in Computer Science, I had a girlfriend in Brazil. I usually say a relationship is as important as it is part of our own history: if we live meaningful moments of our lives with someone, witnessing each other’s successes, failures, despairs, and happiness, then these shared periods build what we are, and the partner becomes one of the building blocks of our history. If, instead, we mostly share some fun moments with a partner, without much consequence, then the relationship is not that important to our history. My relationship was of the first kind: she was with me when I was desperate after receiving a death threat in my own home; she was with me while I pursued a career abroad; she was with me when I moved to Rio de Janeiro to work; I was with her when she graduated in civil engineering; I was with her when she was hired, when she was fired, and when her grandmother died… just to cite a few situations. We supported each other as much as we could, given the circumstances. We learned a lot from each other. Unfortunately, due to the distance combined with pandemic times, we decided to break up after almost two years of a long-distance relationship (she was then living in Israel, while I was in Italy). At the time, we had no prospect of living together again. We were both goal-driven, focused on our careers, so, when we started dating, I predicted our separation. Now I plan to move to the Netherlands for my research period abroad, and then who knows where I will be.

Indeed, I am fine by myself. But I have noticed that I have become so demanding toward a possible partner, and so easily bored by other people, that I have no clue about my chances of finding someone who catches my attention. Love is like a gambling game of hurting and being hurt, and luck lies in the middle.

My 4 Guesses About Our Future

Altered Carbon, a cyberpunk Netflix TV series based on the novel by Richard K. Morgan.

Let me give you four of my guesses about the future of our knowledge and society. How far in the future? Let’s keep it open, but let’s consider at least 20 years ahead. To make it more interesting I am going to write my level of confidence for each of my predictions.

1. An Even More Data-Driven World: High Confidence

Yes, our society is already data-driven in almost every aspect, but I have some specific guesses about what the world will look like as it becomes more and more data-driven.

Smart gadgets, like today’s smartphones and smartwatches, will become even more omnipresent, tracking information from every point of our lives: heartbeats, movements, blood sugar levels, several other measures of our bodies’ biochemistry, including our DNA and even thoughts (electrical signals in the brain). This means future smart gadgets will be more and more integrated into our bodies, making us cyborg creatures.

(Related to this, I might mention groundbreaking advances in, and applications of, nanotechnology.)

All these data will be integrated into government and corporate services, such as health systems (enabling highly personalized medicine), social security, education, and advertising. The power of the big players to control people will be even greater as well: the power to shape people’s minds and behaviors. What about privacy laws? They will always be an issue, but they will not prevent these outcomes, because (a) people are generally willing to trade their privacy for seemingly free benefits (Facebook users are a good example of that), and (b) big players have numerous ways to bypass privacy laws – fish like us cannot escape the aquarium in which we live.

2. Shifts in Artificial Intelligence Research: High Confidence

Currently, AI research is mostly about machine learning techniques, or at least that is what appears most frequently in the media. Roughly, this approach is based on statistical patterns found in datasets, and it has produced impressive results over the last 20 years (see Applications). The problem is that we are reaching a plateau in this area, since even absurdly large language models based on deep learning, such as GPT-3, show basic limitations. This line of research keeps betting on increasing the size of the models, and GPT-4 is under preparation: it is expected to have 100 trillion parameters, about 500 times the size of GPT-3. Fascinating as all this is, the cost of such technology is becoming more and more unsustainable, while the returns cannot progress at the same pace.

This is why I believe some qualitative advance must happen in AI research in the next decades, something that dramatically improves the efficiency of our AI techniques in terms of computational cost, energy, and the amount of data required, as well as the expected benefits, including accuracy, generalizability, semantic understanding, and interpretability. What could such an advance be? I don’t really know… I believe highly data-driven approaches and applications are here to stay, but they must somehow be optimized qualitatively, not only with faster chips or larger training datasets.

Recently I attended a lecture by Professor Ute Schmid, who presented striking capabilities of Inductive Programming: the trained algorithm was not only able to recognize images, but also to highlight the regions of the image with the characteristics relevant to the recognition. In Inductive Logic Programming (ILP), the system is trained with positive and negative examples, just as in traditional machine learning, but it also includes background knowledge, from which it induces a hypothesized logic program that entails all the positive examples and none of the negative ones. Because of that, the amount of data needed is much smaller than in traditional machine learning. I cannot say this approach will be the future, but it is certainly interesting.
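To make the ILP idea concrete, here is a minimal, hypothetical sketch in Python (not from the lecture, and far simpler than real ILP systems such as Aleph or Metagol): given background knowledge about a parent relation, plus positive and negative examples of a grandparent relation, the learner keeps only the candidate rule that entails all positives and no negatives.

```python
# Toy illustration of the ILP setting: background knowledge + positive and
# negative examples -> a logic rule covering all positives and no negatives.
# All names and the candidate-rule set are made up for this sketch.

background = {
    ("parent", "alice", "bob"),
    ("parent", "bob", "carol"),
    ("parent", "bob", "dave"),
    ("parent", "eve", "frank"),
}

positives = {("alice", "carol"), ("alice", "dave")}   # grandparent(X, Z) holds
negatives = {("bob", "carol"), ("eve", "frank")}      # grandparent(X, Z) fails

# All individuals mentioned in the background knowledge.
individuals = {t for (_, a, b) in background for t in (a, b)}

def parent(x, y):
    return ("parent", x, y) in background

# A tiny hypothesis space of candidate rules for grandparent(X, Z).
candidates = {
    "grandparent(X,Z) :- parent(X,Z)": lambda x, z: parent(x, z),
    "grandparent(X,Z) :- parent(Z,X)": lambda x, z: parent(z, x),
    "grandparent(X,Z) :- parent(X,Y), parent(Y,Z)":
        lambda x, z: any(parent(x, y) and parent(y, z) for y in individuals),
}

def consistent(hypothesis):
    # Accept a rule iff it entails every positive example and no negative one.
    return (all(hypothesis(x, z) for (x, z) in positives)
            and not any(hypothesis(x, z) for (x, z) in negatives))

learned = [rule for rule, h in candidates.items() if consistent(h)]
print(learned)  # only the two-step parent chain survives
```

Real ILP systems search a vastly larger hypothesis space and use the background knowledge to prune it, which is part of why they can get away with so few training examples.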

Or maybe someone will present a constructive proof of P=NP… (extremely unlikely)

3. Traces of Life Outside Earth: High Confidence

I truly believe extraterrestrial life is much more common than we can currently see. I am talking about microbial life or its fossils. We just haven’t found it… yet. Maybe NASA’s Mars 2020 mission will find some traces of life; I have some expectation that it will. In any case, it seems very likely we will find something in the next decades.

What about intelligent life outside Earth? I have no clue. Maybe it exists but is simply too far away from us; maybe there is none.

4. Fusion Power: Moderate Confidence

If we were able to produce energy efficiently via nuclear fusion, we might be able to address both CO₂ and other greenhouse gas emissions and our ever-growing demand for electricity. I am not sure whether we will achieve this in the next decades, but there is a good possibility.

A Descriptive Data Analysis about the Chess Grandmasters

Judit Polgár, the real ‘Queen’s Gambit’. Source: DW PT. English version DW.

Recently, the Indian-American Abhimanyu Mishra became the youngest Grandmaster (GM) in chess history, earning the title at the age of 12 years, 4 months, and 25 days, whereas the Venezuelan Salvador Diaz Carias earned the FIDE Master (FM) title at the age of 88. Motivated by this news, which I learned about through the Brazilian YouTube channel Xadrez Brasil, I decided to do a descriptive data analysis and visualization of the GMs based on the Wikipedia "List of chess grandmasters", using Google Colab to create the Jupyter Notebook.

You can view and download the complete Jupyter Notebook HERE or in my GitHub repository. Alternatively, to access the notebook directly in Google Colab, click HERE. The notebook is written as a data analysis report.

In this post, I skip the Python scripts and methodological details. I focus on the results, showing some nice visualizations about GM statistics.

I was curious about questions like:

  1. What is the distribution of GM titles since 1950 (when the title was introduced)?
  2. What is the relationship, if any, between age and obtaining the GM title?
  3. How are GM titles distributed across countries and sexes?

The findings are not really surprising. In summary: we notice a great increase in GM titles over the last decades; the older you are, the lower your chances of becoming a GM – indeed, obtaining the GM title is almost always a game for young people (say, under 35); countries with a chess tradition, such as the post-Soviet states, have higher numbers of GMs; and women represent a small fraction of GMs.

According to that list, there are 1939 registered GMs, male and female, living and deceased. The graph below shows the distribution of all of them according to the year of title acquisition and the age of the player at that time.

Note: “TitleAge” means the age of the player when he/she obtained the GM title; “TitleYear” means the year when this happened.
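Since this post skips the notebook’s scripts, here is a hypothetical, minimal sketch of the kind of summary behind these charts, using pandas on a tiny made-up sample instead of the full Wikipedia table (only the column names TitleYear and TitleAge match the real data):

```python
# Sketch of the summary statistics behind the charts. The sample data below is
# invented for illustration; the real notebook works on the full table of 1939 GMs.
import pandas as pd

gms = pd.DataFrame({
    "TitleYear": [1950, 1978, 1991, 1997, 2002, 2010, 2016, 2021],
    "TitleAge":  [85, 24, 22, 31, 14, 27, 19, 12],
})

# Central tendency and spread of the age at which the title was earned.
print("Mean title age:", gms["TitleAge"].mean())
print("75th percentile:", gms["TitleAge"].quantile(0.75))

# Titles per decade, to see the post-1990 surge.
per_decade = gms.groupby(gms["TitleYear"] // 10 * 10).size()
print(per_decade)
```

On the real table, the same calls yield the figures discussed in this section (a mean title age around 27, with 75% of players under 31).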

Age and GM Title

Interactive scatter plot (with regression line) showing the distribution of title age of the GM and the year of the title. Made with Flourish.

Immediately we observe two things:

(a) First, since 1990 we observe many more GM titles. Most likely this is because chess became much more popular and accessible, particularly after the spread of home computers and the internet: the more people interested in chess, the more GMs.

(b) Second, most players become GM before their 30s. Indeed, the mean title age is about 27, and 75% of all players were under 31 when they received the title.

The oldest GM title recipient was about 88 years old. But this does not tell the whole story: among the 10 oldest recipients (all at least 77 years old), only Jacques Mieses (1865–1954) was an active player. He received his title at the age of 85 in 1950, when the GM title was inaugurated, though it is said his chess strength was no longer what it had once been. The other 9 players received honorary titles.

There are only 19 players who got the title at 70 or older and, as we saw, at least half of them received an honorary title. The box plot highlights that whoever receives the GM title at the age of 45 or older is already an outlier among the GMs. This is even more pronounced when the honorary titles are taken into account.
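For the curious, the outlier rule a box plot applies is the standard 1.5 × IQR fence. A small sketch with made-up ages (not the notebook’s code) shows how title ages of 45 and above end up flagged:

```python
# Box-plot outlier fence: anything above Q3 + 1.5*IQR is an outlier.
# The ages below are invented for illustration, roughly matching the post's
# picture (most titles in the 20s, a long tail of older recipients).
import numpy as np

ages = np.array([14, 17, 19, 22, 23, 24, 26, 27, 28, 29, 30, 31, 33, 36, 45, 85])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

outliers = ages[ages > upper_fence]
print("Upper fence:", upper_fence)
print("Outliers:", outliers)
```

With most title ages clustered in the 20s, the fence lands in the mid-40s, which is consistent with the box plot flagging 45+ recipients as outliers.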

Box plot of the age at which the GMs received the title.

The same information can be seen in the charts below:

Country and Sex of GMs

Among the 1939 GMs, 1901 are male and just 38 are female, which represents about 1.96% of the total. Indeed, notable professional chess players, such as Garry Kasparov and Bobby Fischer, have expressed sexist opinions about women in chess. Later, after losing a rapid game against Judit Polgár in 2002, Kasparov admitted he was wrong and changed his opinion: “The Polgárs showed that there are no inherent limitations to their aptitude – an idea that many male players refused to accept until they had unceremoniously been crushed by a twelve-year-old with a ponytail.”

So, historically, women have not been seen as serious professional chess players, and this has certainly affected the presence of women in the sport. The real story of the Polgár sisters and the Netflix miniseries “The Queen’s Gambit” (2020) have helped to erode sexism in the chess world.

Concerning the distribution of GMs per country, unsurprisingly Hungary, Germany, India, Russia, the USSR, China, Ukraine, and the USA have the greatest numbers of GMs. Notably, the current dominant World Champion, Magnus Carlsen, comes from a country with comparatively little chess tradition: Norway has only 16 registered GMs.

You can interact with the visualization below to see more; “M” means male, and “F” means female.

Distributions of GMs per sex and country displayed with Flourish.

I have really enjoyed doing this study. 🙂
