Friday, November 4, 2022

Machine Learning With Amelia Finds An Audience in the U.K.

by Joe Cerniglia

Photograph credit: ID 98648556 © Mickem | Dreamstime.com

The Alan Turing Institute in the United Kingdom has published a reformulated version of the article I posted back in March of 2022. Turing Data Stories is a blog of the Institute that is dedicated to writing and publishing examples of open-sourced research, meaning that all the various threads of the author's work (programming code, methodology, and, most important, the original source data) are available to all to experiment with, critique, build upon, and understand.

The importance of having open-sourced research became obvious to me as I began to study academic articles on machine learning. Often, the data from which these studies are drawn are not publicly available, so readers have no way to duplicate the work and confirm that the methods described in the articles are sound. The problem is common in a significant percentage of published academic research. While such omissions may have been explicable, if not justifiable, in the pre-internet era, the age of the internet seems to have weakened, if not demolished altogether, any rationale that once existed for them. Nevertheless, these omissions persist. This seemed to me a problem that someone should be working on. It was then that I discovered The Alan Turing Institute, and learned to my delight that, in the United Kingdom at least, this problem is taken very seriously at the highest levels.

I then proposed to the Turing Data Stories group that perhaps my article could be expanded as a data story to try to model some of these ideals. Working with them over the course of these past few months, and having them peer-review the work, I have been impressed by their dedication to building an A.I. infrastructure in the U.K. that will propel the country to enjoy the many benefits of a broad computer literacy. We need something similar here in the United States.

Here, then, is the latest Turing Data Story.

The connection between Amelia Earhart and the United Kingdom may seem tenuous, but in fact it is rather significant. Earhart was whisked by air to London in a driving rainstorm after her solo trans-Atlantic flight in 1932, the first made by a woman and only the second made by anyone, after Lindbergh's. That event so inscribed itself upon the memory of Londoners that Walter Sickert thought it worth commemorating, in a painting that now hangs in the permanent collection of London's Tate Gallery.

Additionally, the U.K. played an important part in the early research that has led some to believe Nikumaroro Island may have been the place where Earhart's world flight ended. In 1940, British subjects working as coconut planters on Nikumaroro, then Gardner Island, discovered a skull on the southeast corner of the island. Their subsequent searches in this area of the island led to the discovery of additional bones, along with a sextant box, a woman's shoe, a Benedictine bottle, and the remnants of a campsite, which to them seemed evidence of a castaway. They reported this discovery to British authorities in their chain of command, and sent the bones and artifacts to the Western Pacific High Commission in Suva, Fiji (then a British colonial possession) for further analysis. It was precisely in this area of the island where the bones were found that researchers from TIGHAR discovered the glass cosmetic jar, which is the subject of my data story, almost exactly 70 years later.

While the commanding officers in Fiji ultimately doubted they had received the remains of Amelia Earhart, their contribution to the investigation and to the various lines of evidence is noteworthy and relevant to the data story itself.

Thus, it is also noteworthy that the U.K.'s self-described national institute for data science and artificial intelligence has taken an interest in the story, both for its ability to illustrate good data science practices and simply because it is a great story. 

Enjoy!





Wednesday, March 23, 2022

Machine Learning with Amelia

by Joe Cerniglia


Incredibly, it has been more than two years since the last post was published here on Amelia Earhart Archaeology. It is time to revive the search for Amelia!

But before I do, I need to insert a little background on what I've been up to, and then tie that back to the Earhart search. Lately, I have been trying to learn new skills. One of those is Machine Learning. I have been studying this subject independently for about four months now, and I will be taking my first Machine Learning course with eCornell, the online division of Cornell University, next week. Next year I will be working on the Machine Learning certificate program offered by eCornell. 

One of our assignments in the current course I am taking was to locate a data set of interest, load the data into a Jupyter notebook and carry out some analyses. A Jupyter notebook is a tool that lets coders combine their code, the results of that code, and explanatory text in a single document. While this may not seem all that revolutionary, research papers that transparently show all of the methodology behind their research are all too uncommon. It is often very difficult for those who wish to follow up on research to see the work behind it and exactly how a result was derived. The tools and languages used are often disclosed, but the exact code and data remain elusive.

There is a name for this phenomenon, and I think it's a good one. It's called the reproducibility crisis.

The Alan Turing Institute in the U.K. has recognized the crisis and is urging the adoption of uniform standards of reproducibility in academic research. They identify one source of the problem as follows:

Issues in reproducible research predominantly stem from academic incentives that encourage competition between research teams working on similar questions. The system means that an individual research team - behaving rationally - is likely not to share their data, code, protocols nor experimental design expertise with researchers working in the same area as themselves. The outcomes will likely result in siloed knowledge and a lack of transparency in research methods.

Find their article here: Turing Response to Reproducibility and Research Integrity Inquiry 

Predictably, and unfortunately, this phenomenon is not limited to academia. Obscuration seems to be popular everywhere, even in such things as product packaging. 

In my own small way, then, I want to help address the reproducibility crisis in Amelia Earhart research by making available a research effort of my own.

While it has been two years since this blog had a post, it has been about 10 years since news of the famous freckle cream jar, which just possibly may have belonged to Amelia Earhart, first hit the airwaves.

Our joint paper on this topic, A Freckle in Time, first released in October 2013, stated that the jar had an unusual chemistry that was closer to that of window glass than that of cosmetic containers. We had some data in books that informed us of this fact, but we lacked real data that we could use to verify this for ourselves.

Casting about for topics to incorporate into my coursework, I came upon a glass dataset from the University of California at Irvine Machine Learning Repository: 

https://archive.ics.uci.edu/ml/datasets/glass+identification

Recalling our work on glass chemistry in our paper, I thought that this would be an excellent opportunity to put our finding of 'unusual chemistry' to the test.

My goal was to use the UCI glass database to build a machine learning model, and then to apply that model to both the artifact jar found on Nikumaroro and to the clear facsimile jar in the same size and shape (but not color) found on eBay. The model would then determine which of several types of glass would best categorize these two samples. 
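For readers who would like a feel for this workflow before opening the notebook, here is a minimal sketch in Python. It is not the code from my notebook, and it makes several assumptions: a simple nearest-neighbor vote stands in for whatever model one might actually choose, the helper names (load_rows, knn_predict) are made up for illustration, and the handful of rows in TOY_CSV are hypothetical values laid out in the UCI file's column order, not real measurements of any glass. Only the column layout and the class codes follow the UCI documentation.

```python
import csv
import io
import math
from collections import Counter

# Column order documented for the UCI glass identification file.
COLUMNS = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]

# Class codes documented by UCI (code 4 has no rows in the database).
GLASS_TYPES = {
    1: "building windows, float processed",
    2: "building windows, non-float processed",
    3: "vehicle windows, float processed",
    4: "vehicle windows, non-float processed",
    5: "containers",
    6: "tableware",
    7: "headlamps",
}

# HYPOTHETICAL rows in the file's format. These are NOT real measurements;
# they exist only so the sketch runs without downloading anything.
TOY_CSV = """\
1,1.517,13.0,3.5,1.4,72.8,0.6,8.3,0.0,0.0,2
2,1.516,13.2,3.6,1.3,72.9,0.5,8.2,0.0,0.1,2
3,1.518,12.9,3.4,1.5,72.7,0.6,8.4,0.0,0.0,2
4,1.519,12.8,0.8,2.0,72.4,1.5,10.1,0.2,0.0,5
5,1.520,12.9,0.5,2.1,72.3,1.4,10.3,0.3,0.0,5
6,1.518,12.7,1.0,1.9,72.5,1.6,10.0,0.1,0.0,5
"""


def load_rows(text):
    """Parse CSV text into (features, label) pairs, dropping the Id column."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        values = dict(zip(COLUMNS, record))
        features = [float(values[name]) for name in COLUMNS[1:-1]]
        rows.append((features, int(values["Type"])))
    return rows


def knn_predict(train, sample, k=3):
    """Return the majority label among the k training rows nearest to sample."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = load_rows(TOY_CSV)

# Two HYPOTHETICAL unknown samples (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe).
container_like = [1.519, 12.8, 0.7, 2.0, 72.4, 1.5, 10.2, 0.2, 0.0]
window_like = [1.517, 13.1, 3.5, 1.4, 72.8, 0.6, 8.3, 0.0, 0.0]

for name, sample in [("container_like", container_like),
                     ("window_like", window_like)]:
    print(name, "->", GLASS_TYPES[knn_predict(train, sample)])
```

With real data, one would replace TOY_CSV with the contents of the UCI file and the two made-up samples with measured chemistry. Features on very different scales would also normally be standardized before distances are compared, a step this sketch leaves out.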

What I learned was that a prudently tweaked ML model can indeed correctly identify the clear facsimile as a container. Also, as we had suspected nine years ago, the artifact jar of reputed freckle fame was identified by the model as a 'window non-float,' a variety of glass that is most often found in churches.

Usually in Machine Learning, a misidentification is considered a failure, but in this case, I would consider the misidentification a success. It showed, in a reproducible and much more rigorously scientific way than we had previously shown, that the artifact jar is indeed unusual and original in its chemistry. I further speculate that the inability of the model to characterize it is not the result of a less than fully robust sample dataset, but rather the result of a lack of real-world siblings to be found, even in the 1930s.

While admittedly some of this IS speculation, the absence of evidence I observe is drawn from years of searching for an exact twin of the artifact jar, so far without result.

My research, transparently presented with the original Python source code and research result, may be found here on the Binder website.

A word of instruction on using Binder:

When you click on the link above, you will be brought to Binder, a website that hosts Jupyter notebooks for interactive sessions. The site may take 2 to 5 minutes to load and, if the site is busy, it may not load at all. Binder is a free service, paid for by a small number of corporations committed to open source work. The website states in its documentation:

We are still working on defining what the exact goals for uptime and reliability should be.

Understandably, they are still working out the kinks. I find, however, that when I shut down my browser completely, and then re-click on the link, the service then becomes available.

When you reach the workspace, double-click on the first document in the left-hand pane: DatasetPlayground_interactive_SMOTE.ipynb. When this document has opened, click on the >> tool button at the top of the page to execute the code. Inside the document, you will be instructed in how to use the two interactive exhibits it contains, so that you may explore the data and verify the findings independently.

The conclusions I draw will probably not be controversial, nor will they be greeted with the fanfare of the initial reports of freckle creme, but they do constitute progress.

I have many more articles to post here, but first I need to write them. This will take time, but they are forthcoming. I appreciate the loyal readers and followers of this blog, and most of all, I appreciate your patience!


June 15, 2022 Update: I have written a much-expanded version without interactive links that may be viewed here:

viewer link