The unseen species problem arises in ecology and concerns estimating the number of species present in an ecosystem that were not observed in samples. More specifically, it asks how many new species would be discovered if more samples were taken in the ecosystem. Study of the problem began in the early 1940s with Alexander Steven Corbet, who spent two years in British Malaya trapping butterflies and was curious how many new species he would discover if he spent another two years trapping. Many estimation methods have since been developed to determine how many new species would be discovered given more samples. The problem also applies more broadly, as the estimators can be used to estimate the number of new elements of any set not previously found in samples. An example is estimating how many words William Shakespeare knew based on all of his written works.
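The unseen species problem can be stated mathematically as follows. If $n$ independent samples are taken, $X^n \triangleq X_1, \ldots, X_n$, and $m$ more independent samples are then taken, the number of unseen species discovered by the additional samples is

$$U \triangleq U\left(X^n, X_{n+1}^{n+m}\right) \triangleq \left| \left\{ X_{n+1}^{n+m} \right\} \setminus \left\{ X^n \right\} \right|,$$

where $X_{n+1}^{n+m} \triangleq X_{n+1}, \ldots, X_{n+m}$ is the second set of $m$ samples.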
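The definition of $U$ is a set difference between the species seen in the second batch of samples and those seen in the first. A minimal sketch, assuming each sample is recorded as a species label (the function name and the toy data are hypothetical, chosen only for illustration):

```python
def count_newly_seen(first_sample, second_sample):
    """Return U: the number of distinct species that occur in
    second_sample but never occur in first_sample."""
    return len(set(second_sample) - set(first_sample))


# Toy data (not from the source): n = 4 past samples, m = 4 new samples.
first = ["a", "b", "a", "c"]
second = ["c", "d", "d", "e"]

# Species "d" and "e" are new, so U = 2.
print(count_newly_seen(first, second))
```

In practice the second batch is not available, which is why $U$ has to be estimated from the first batch alone.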
In the early 1940s Alexander Steven Corbet spent two years in British Malaya trapping butterflies.[1] He kept track of how many species he observed and how many members of each species were captured. For example, there were 74 different species for which he captured exactly 2 individual butterflies.
When Corbet returned to the United Kingdom, he approached biostatistician Ronald Fisher and asked how many new species of butterflies he could expect to catch if he went trapping for another two years;[2] in essence, Corbet was asking how many species he observed zero times.
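Fisher responded with a simple estimation: for an additional 2 years of trapping, Corbet could expect to capture 75 new species. He did this using a simple alternating summation (data provided by Orlitsky[2] in the table from the Example below):

$$U = \sum_{i=1}^{n} (-1)^{i+1} \varphi_i = 118 - 74 + 44 - 24 + \cdots - 12 + 6 = 75.$$

Here $\varphi_i$ is the number of species that were observed exactly $i$ times.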
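Fisher's figure can be checked directly from the counts listed in the Example table. The following is a small sketch (the variable and function names are illustrative, not from the source) that evaluates the alternating sum on those 15 values:

```python
# Number of species observed exactly i times, for i = 1..15,
# as listed in the article's Example table.
CORBET_PHI = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def fisher_alternating_sum(phi):
    """Sum phi_1 - phi_2 + phi_3 - ... over the given counts."""
    return sum((-1) ** (i + 1) * count for i, count in enumerate(phi, start=1))


print(fisher_alternating_sum(CORBET_PHI))  # 75
```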
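To estimate the number of unseen species, let $t \triangleq m/n$ be the number of future samples ($m$) divided by the number of past samples ($n$), so that $m = tn$. Let $\varphi_i$ be the number of species observed exactly $i$ times. For example, if 74 species of butterflies each had 2 observed members across the samples, then $\varphi_2 = 74$.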
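The counts $\varphi_i$ can be obtained from raw observations by counting twice: first the number of times each species was seen, then the number of species sharing each count. A minimal sketch, assuming the sample is a list of species labels (the names and toy data are hypothetical):

```python
from collections import Counter


def prevalences(sample):
    """Map i -> phi_i, the number of species seen exactly i times."""
    times_seen = Counter(sample)          # species -> number of sightings
    return Counter(times_seen.values())   # sightings count -> number of species


# Toy sample: "a" and "c" are seen twice, "b" and "d" once.
sample = ["a", "a", "b", "c", "c", "d"]
phi = prevalences(sample)
print(phi[1], phi[2])  # 2 2
```

A `Counter` returns 0 for missing keys, so `phi[i]` is well defined for every `i`.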
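The Good–Toulmin (GT) estimator was developed by Good and Toulmin in 1953.[3] The estimate of the unseen species based on the Good–Toulmin estimator is given by

$$U^{GT} \triangleq U^{GT}(X^n, t) \triangleq -\sum_{i=1}^{\infty} (-t)^i \varphi_i.$$

The Good–Toulmin estimator has been shown to be a good estimate for values of $t \leq 1.$ In that range, $U^{GT}$ estimates $U$ to within about $\sqrt{n} \cdot t,$ that is, its mean squared error is on the order of at most $nt^2,$ as long as $t \leq 1.$

However, for $t > 1,$ the Good–Toulmin estimator fails to give accurate results. When $t > 1,$ $U^{GT}$ changes by $(-t)^i \varphi_i$ for every $i$ with $\varphi_i > 0,$ so if any $\varphi_i > 0,$ $U^{GT}$ grows super-linearly in $t,$ whereas $U$ can grow at most linearly in $t.$ Therefore, when $t > 1,$ $U^{GT}$ grows faster than $U$ and does not approximate the true value.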
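The behaviour described above can be seen numerically. The sketch below evaluates the Good–Toulmin series on the 15 counts from the article's Example table; the function name is illustrative, and the series is simply cut off where the listed data ends, which is an assumption of this example rather than part of the estimator's definition.

```python
# Number of species observed exactly i times, i = 1..15 (Example table).
CORBET_PHI = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def good_toulmin(phi, t):
    """Good-Toulmin estimate: -sum_i (-t)^i * phi_i over the given counts."""
    return -sum((-t) ** i * count for i, count in enumerate(phi, start=1))


print(good_toulmin(CORBET_PHI, 1))   # 75, matching Fisher's sum
print(sum(CORBET_PHI))               # species in the table, for scale
print(good_toulmin(CORBET_PHI, 2))   # far larger in magnitude than the line above
```

At $t = 1$ the powers of $-t$ reduce to alternating signs, so the estimate coincides with Fisher's sum. At $t = 2$ the terms $t^i \varphi_i$ for large $i$ dominate and the result is no longer a plausible species count.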
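To compensate for this, Efron and Thisted in 1976[4] showed that a truncated Euler transform can also be a usable estimate (the "ET" estimate):

$$U^{ET} \triangleq \sum_{i=1}^{n} h^{ET}(i) \cdot \varphi_i,$$

with

$$h^{ET}(i) \triangleq -(-t)^i \cdot \mathbb{P}(X \geq i),$$

where

$$X \sim \operatorname{Bin}\left(k, \frac{1}{1+t}\right),$$

and $k$ is the location at which the Euler transform is truncated.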
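As an illustration of how the binomial tail damps the high-order terms, the sketch below evaluates the ET weights $-(-t)^i \, \mathbb{P}(X \geq i)$ with $X \sim \operatorname{Bin}(k, 1/(1+t))$. The function names and the choice $k = 10$ are assumptions made for this example only; the source does not say how $k$ should be chosen.

```python
from math import comb

# Number of species observed exactly i times, i = 1..15 (Example table).
CORBET_PHI = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def binomial_tail(k, p, i):
    """P(X >= i) for X ~ Bin(k, p); equals 0 when i > k."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(i, k + 1))


def efron_thisted(phi, t, k):
    """ET estimate: sum_i -(-t)^i * P(Bin(k, 1/(1+t)) >= i) * phi_i."""
    p = 1.0 / (1.0 + t)
    return sum(
        -((-t) ** i) * binomial_tail(k, p, i) * count
        for i, count in enumerate(phi, start=1)
    )


# k = 10 is an arbitrary illustrative truncation point.
print(efron_thisted(CORBET_PHI, 2.0, k=10))
```

Because the tail probability is zero for $i > k$, terms beyond the truncation point drop out entirely, and terms just below it are strongly down-weighted.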
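Similar to the approach by Efron and Thisted, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu developed the smooth Good–Toulmin estimator. They realized that the Good–Toulmin estimator failed because of the exponential growth of its terms, and not because of its bias. Therefore, they estimated the number of unseen species by truncating the series:

$$U^{\ell} \triangleq -\sum_{i=1}^{\ell} (-t)^i \varphi_i.$$

Orlitsky, Suresh, and Wu also noted that for $t > 1$ the dominant term in the truncated sum is the $\ell$-th term, regardless of which value of $\ell$ is chosen. To address this, they selected a random nonnegative integer $L$, truncated the series at $L$, and took the average over the distribution of $L$. The resulting estimator is

$$U^{L} = \mathbb{E}_L\left[ -\sum_{i=1}^{L} (-t)^i \varphi_i \right].$$

This method was chosen because the bias of $U^{\ell}$ changes sign with $\ell$ due to the $(-t)^i$ coefficient, so averaging over a distribution of $L$ reduces the bias. The estimator can therefore be written as a linear combination of the prevalences:

$$U^{L} = \mathbb{E}_L\left[ -\sum_{i \geq 1} (-t)^i \, \mathbf{1}_{i \leq L} \, \varphi_i \right] = -\sum_{i \geq 1} (-t)^i \, \mathbb{P}(L \geq i) \, \varphi_i.$$

The results vary with the distribution chosen for $L$. With this method, estimates can be made for $t \propto \ln n.$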
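The linear-combination form lends itself to a short implementation in which the distribution of $L$ enters only through its tail probabilities $\mathbb{P}(L \geq i)$. The sketch below is one possible reading: the Poisson choice for $L$ and the mean $r = 3$ are assumptions made purely for illustration, and the function names are hypothetical.

```python
from math import exp, factorial

# Number of species observed exactly i times, i = 1..15 (Example table).
CORBET_PHI = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def poisson_tail(r, i):
    """P(L >= i) for L ~ Poisson(r), clamped at 0 against rounding error."""
    below = sum(exp(-r) * r**j / factorial(j) for j in range(i))
    return max(0.0, 1.0 - below)


def smoothed_good_toulmin(phi, t, tail):
    """Smoothed estimate: -sum_i (-t)^i * P(L >= i) * phi_i,
    where tail(i) returns P(L >= i) for the chosen distribution of L."""
    return -sum(
        (-t) ** i * tail(i) * count for i, count in enumerate(phi, start=1)
    )


# Illustrative call: Poisson smoothing with an arbitrary mean r = 3.
print(smoothed_good_toulmin(CORBET_PHI, 2.0, lambda i: poisson_tail(3.0, i)))

# With P(L >= i) = 1 for every i, the plain Good-Toulmin sum is recovered.
print(smoothed_good_toulmin(CORBET_PHI, 1, lambda i: 1.0))  # 75.0
```

Passing the tail as a function keeps the estimator independent of the smoothing distribution, so a binomial tail can be substituted without changing the rest of the code.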
The species discovery curve can also be used. This curve gives the number of species found in an area as a function of time. Such curves can also be created by using estimators (such as the Good–Toulmin estimator) and plotting the number of unseen species at each value of $t$.
A species discovery curve never decreases, as no sample can reduce the number of discovered species. The curve is also decelerating: the more samples are taken, the fewer unseen species are expected to be discovered. It is further assumed that the curve never levels off completely, since the discovery rate may become arbitrarily slow but never actually stops. Two common models for a species discovery curve are the logarithmic and the exponential function.
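As an example, consider the data Corbet provided Fisher in the 1940s. Using the Good–Toulmin model, the number of unseen species is found using

$$U = -\sum_{i=1}^{\infty} (-t)^i \varphi_i.$$

This can then be used to create a relationship between $t$ and $U$.

Number of observed members, $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Number of species, $\varphi_i$ | 118 | 74 | 44 | 24 | 29 | 22 | 20 | 19 | 20 | 15 | 12 | 14 | 6 | 12 | 6 |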
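From the plot, it is seen that at $t = 1$, which was the value of $t$ that Corbet brought to Fisher, the resulting estimate of $U$ is 75, matching what Fisher found. The plot also acts as a species discovery curve for this ecosystem, showing how many new species are expected to be discovered as $t$ increases and more samples are taken.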
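The relationship between $t$ and $U$ can be tabulated from the counts given in this example. The following sketch evaluates the Good–Toulmin sum on a small grid of $t$ values in $[0, 1]$; the grid and the names are assumptions made for illustration, and the series is cut off at the 15 listed counts.

```python
# Number of species observed exactly i times, i = 1..15 (Example table).
CORBET_PHI = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def good_toulmin(phi, t):
    """Good-Toulmin estimate: -sum_i (-t)^i * phi_i over the given counts."""
    return -sum((-t) ** i * count for i, count in enumerate(phi, start=1))


# Points on the estimated discovery curve for t between 0 and 1.
curve = [(t, good_toulmin(CORBET_PHI, t)) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

for t, u in curve:
    print(f"t = {t:.2f}  ->  estimated new species U = {u:.1f}")
```

The endpoint $t = 1$ reproduces the value 75; the grid stops there because the estimator is only described as reliable for $t \leq 1$.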
The predictive algorithm has numerous uses. Because the estimators are accurate for $t \leq 1$, they allow scientists to extrapolate the results of polling people by a factor of 2: the number of unique answers that further respondents would give can be predicted from how many people have so far given each answer. The method can also be used to estimate the extent of someone's knowledge.
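Based on research on Shakespeare's known works by Thisted and Efron, the works contain 884,647 words in total. The research also found a total of $N = 864$ different words that appear more than 100 times, and 31,534 different words overall. Applying the Good–Toulmin model, if an equal amount of new writing by Shakespeare were discovered, it is estimated that $U^{\text{words}} \approx 11{,}460$ previously unseen words would be found. The goal is then to derive $U^{\text{words}}$ for $t = \infty$. Thisted and Efron estimate that $U^{\text{words}}(t \to \infty) \approx 35{,}000$, which suggests that Shakespeare knew more than twice as many words as he used in his writings.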