Wednesday, March 23, 2022

Machine Learning with Amelia

by Joe Cerniglia

Incredibly, it has been more than two years since the last post was published here on Amelia Earhart Archaeology. It is time to revive the search for Amelia!

But before I do, I need to insert a little background on what I've been up to, and then tie that back to the Earhart search. Lately, I have been trying to learn new skills. One of those is Machine Learning. I have been studying this subject independently for about four months now, and I will be taking my first Machine Learning course with eCornell, the online division of Cornell University, next week. Next year I will be working on the Machine Learning certificate program offered by eCornell. 

One of our assignments in the course I am currently taking was to locate a data set of interest, load the data into a Jupyter notebook, and carry out some analyses. A Jupyter notebook is a tool that lets coders combine their code, the results of that code, and explanatory text in a single document. While this may not seem all that revolutionary, research papers that transparently show all of the methodology behind their research are all too uncommon. It is often very difficult for those who wish to follow up on research to see the work behind it and exactly how a result was derived. The tools and languages used are often disclosed, but the exact code and data remain elusive.

There is a name for this phenomenon, and I think it's a good one. It's called the reproducibility crisis.

The Alan Turing Institute in the U.K. has recognized the crisis and is urging the adoption of uniform standards of reproducibility in academic research. They identify one source of the problem as follows:

Issues in reproducible research predominantly stem from academic incentives that encourage competition between research teams working on similar questions. The system means that an individual research team - behaving rationally - is likely not to share their data, code, protocols nor experimental design expertise with researchers working in the same area as themselves. The outcomes will likely result in siloed knowledge and a lack of transparency in research methods.

Find their article here: Turing Response to Reproducibility and Research Integrity Inquiry 

Predictably, and unfortunately, this phenomenon is not limited to academia. Obscuration seems to be popular everywhere, even in such things as product packaging. 

In my own small way, then, I want to help address the reproducibility crisis in Amelia Earhart research by making available a research effort of my own.

While it has been two years since this blog had a post, it has been about 10 years since news of the famous freckle cream jar, which just possibly may have belonged to Amelia Earhart, first hit the airwaves.

Our joint paper on this topic, A Freckle in Time, first released in October 2013, stated that the jar had an unusual chemistry that was closer to that of window glass than that of cosmetic containers. We had some data in books that informed us of this fact, but we lacked real data that we could use to verify this for ourselves.

Casting about for topics to incorporate into my coursework, I came upon a glass dataset from the University of California, Irvine (UCI) Machine Learning Repository.

Recalling our work on glass chemistry in our paper, I thought that this would be an excellent opportunity to put our finding of 'unusual chemistry' to the test.

My goal was to use the UCI glass database to build a machine learning model, and then to apply that model to both the artifact jar found on Nikumaroro and to the clear facsimile jar in the same size and shape (but not color) found on eBay. The model would then determine which of several types of glass would best categorize these two samples. 
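For readers who want a feel for what "applying a model to a sample" means, here is a minimal sketch in Python. It is not the code from my notebook, and it is far simpler than a real machine learning model: it averages the chemistry of each known glass type and assigns an unknown sample to whichever average it sits closest to. The function names, the three-element feature lists, and every number in the training rows are invented placeholders for illustration; none of them come from the UCI dataset or from measurements of either jar.

```python
from math import sqrt
from statistics import mean

# Placeholder training rows: (label, [feature_1, feature_2, feature_3]).
# ASSUMPTION: these values are invented for illustration only. They are NOT
# taken from the UCI glass dataset or from any measurement of the jars.
TRAINING = [
    ("window", [13.5, 3.5, 8.5]),
    ("window", [13.0, 3.6, 8.8]),
    ("window", [13.3, 3.4, 8.6]),
    ("container", [12.8, 0.5, 11.0]),
    ("container", [12.9, 0.3, 11.3]),
    ("container", [12.7, 0.4, 10.9]),
]


def centroids(rows):
    """Return the per-feature average (centroid) for each label."""
    by_label = {}
    for label, features in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*group)]
        for label, group in by_label.items()
    }


def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(sample, rows):
    """Assign the sample to the label whose centroid is nearest."""
    cents = centroids(rows)
    return min(cents, key=lambda label: distance(sample, cents[label]))


if __name__ == "__main__":
    # A hypothetical sample that resembles the placeholder containers.
    print(classify([12.8, 0.4, 11.1], TRAINING))
    # A hypothetical sample unlike either group still receives a label.
    print(classify([20.0, 10.0, 1.0], TRAINING))
```

One point worth noticing in the sketch: a classifier of this kind always answers with one of the labels it was trained on. Even a sample that resembles none of the known groups is assigned to whichever group happens to be nearest.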

What I learned was that a prudently tweaked ML model can indeed correctly identify the clear facsimile as a container. Also, as we had suspected nine years ago, the artifact jar of reputed freckle fame was identified by the model as a 'window non-float,' a variety of glass that is most often found in churches.

Usually in Machine Learning, a misidentification is considered a failure, but in this case, I would consider the misidentification a success. It showed, in a reproducible and much more rigorously scientific way than we had previously shown, that the artifact jar is indeed unusual and original in its chemistry. I further speculate that the inability of the model to characterize it is not the result of a less than fully robust sample dataset, but rather the result of a lack of real-world siblings to be found, even in the 1930s.

While admittedly some of this IS speculation, the absence of evidence I observe is drawn from years of searching for an exact twin of the artifact jar, so far without result.

My research, transparently presented with the original Python source code and research result, may be found here on the Binder website.

A word of instruction on using Binder:

When you click on the link above, you will be taken to Binder, a website that hosts Jupyter notebooks for interactive sessions. The site may take 2 to 5 minutes to load and, if the site is busy, it may not load at all. Binder is a free service, paid for by a small number of corporations committed to open source work. The website states in its documentation:

We are still working on defining what the exact goals for uptime and reliability should be.

Understandably, they are still working out the kinks. I find, however, that when I shut down my browser completely, and then re-click on the link, the service then becomes available.

When you reach the workspace, double-click on the first document in the left-hand pane: DatasetPlayground_interactive_SMOTE.ipynb. When this document has opened, click on the >> tool button at the top of the page to execute the code. Inside the document, you will find instructions for using the two interactive exhibits it contains, so that you may explore the data and verify the findings independently.

The conclusions I draw will probably not be controversial, nor will they be greeted with the fanfare of the initial reports of the freckle cream jar, but they do constitute progress.

I have many more articles to post here, but first I need to write them. This will take time, but they are forthcoming. I appreciate the loyal readers and followers of this blog, and most of all, I appreciate your patience!

June 15, 2022 Update: I have written a much-expanded version without interactive links that may be viewed here:

viewer link