List of datasets for machine-learning research explained

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2] [3] [4] [5]

Many organizations, including governments, publish and share their datasets. Based on their licenses, the datasets are classified as open data or non-open data.

The datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. They are made available sorted by various types and subtypes.

List of sorting used for datasets

Type | Subtypes
Specific category | Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope | Supranational Union, National, Subnational, Municipality, Urban, Rural
Language | Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Type | Tabular, Graph, Text, Image, Sound, Video
Usage | Training, validating, and testing
File-Formats | CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses | Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated | Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size | Minimum, Maximum, Range
Status | Verified, In-Preparation, Deactivated (or Deprecated)
Number of records | 100s, 1000s, 10000s, 100000s, Millions
Number of variables | Less than 10, 10s, 100s, 1000s, 10000s
Services | Individual, Aggregation

Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.
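The "Usage" row in the table above distinguishes data used for training, validating, and testing. As a rough illustration only, the sketch below shuffles a collection of records and cuts it into three parts using the Python standard library. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions made for this example and are not prescribed by any of the portals listed here.

```python
import random


def split_dataset(records, train=0.8, validation=0.1, seed=0):
    """Shuffle records and return (train, validation, test) lists.

    The fractions are illustrative defaults; whatever remains after the
    training and validation shares becomes the test set.
    """
    if not 0 < train < 1 or not 0 <= validation < 1 or train + validation >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Many datasets in the tables below already ship with fixed splits; in that case the published split should normally be preferred over a random one so that results remain comparable.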

List of open data portals

See also: Open data portal.

Portal-name | License | List of installations of the portal | Typical usages
Comprehensive Knowledge Archive Network (CKAN) | AGPL | https://ckan.github.io/ckan-instances/ ; https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse | Apache | https://dataverse.org/installations ; https://dataverse.org/metrics | Data Management Solution for Research Institutes
DSpace | BSD | https://registry.lyrasis.org/ | Data Management Solution for Research Institutes
OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data Management Solution to share datasets, algorithms, and experiment results through APIs.
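Portals of this kind usually expose their catalogue through a web API. As a minimal sketch, the code below builds a search request for a CKAN portal's `package_search` action and extracts the downloadable resources from a response. The portal address `https://demo.example.org` and the sample response are invented for illustration; a real portal's address and result fields should be checked against its own documentation.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal_url, query, rows=10):
    """Build the URL of a CKAN package_search request."""
    base = portal_url.rstrip("/")
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


def list_resources(response_text):
    """Return (dataset title, file format, download URL) for each resource."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("portal reported an unsuccessful request")
    found = []
    for package in payload["result"]["results"]:
        for resource in package.get("resources", []):
            found.append(
                (package.get("title"), resource.get("format"), resource.get("url"))
            )
    return found


# Hypothetical portal address and hand-written response, for illustration only.
EXAMPLE_PORTAL = "https://demo.example.org"
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{
            "title": "Air quality measurements",
            "resources": [
                {"format": "CSV", "url": "https://demo.example.org/air.csv"},
                {"format": "JSON", "url": "https://demo.example.org/air.json"},
            ],
        }],
    },
})

print(package_search_url(EXAMPLE_PORTAL, "air quality", rows=5))
for title, file_format, url in list_resources(sample_response):
    print(title, file_format, url)
```

Against a live portal, the text passed to `list_resources` would come from an HTTP GET of the generated URL, for example with `urllib.request.urlopen`.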

List of portals suitable for multiple types of applications

See also: machine learning. Data portals sometimes list a wide variety of dataset subtypes pertaining to many machine learning applications.

Academic Torrents | https://academictorrents.com
Amazon Datasets | https://registry.opendata.aws/
Awesome Public Datasets Collection | https://github.com/awesomedata/awesome-public-datasets
data.world | https://data.world/datasets/machine-learning
Datahub – Core Datasets | https://datahub.io/docs/core-data
DataONE | https://www.dataone.org/
DataPortals | https://dataportals.org/
Datasetlist.com | https://www.datasetlist.com
https://index.okfn.org/
Google Dataset Search | https://datasetsearch.research.google.com/
Hugging Face | https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange | https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data | https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle | https://www.kaggle.com/datasets
Machine learning datasets | https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data | https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets | https://msropendata.com/datasets
Open Data Inception | https://opendatainception.io/
Opendatasoft | https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR | https://v2.sherpa.ac.uk/opendoar/
OpenML | https://www.openml.org/search?type=data
Papers with Code | https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks | https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs | https://github.com/public-apis/public-apis
Registry of Open Access Repositories | http://roar.eprints.org/
REgistry of REsearch Data REpositories | https://www.re3data.org/
UCI Machine Learning Repository | http://mlr.cs.umass.edu/ml/
Speech Dataset | https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery | https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

See also: Machine learning. Data portals suitable for a specific subtype of machine learning application are listed in the following sections.

Image data

See main article: List of datasets in computer vision and image processing.

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Amazon reviews | US product reviews from Amazon.com. | None. | 233.1 million | Text | Classification, sentiment analysis | 2015 (2018) | [6] [7] | McAuley et al.
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [8] [9] | K. Ganesan et al.
MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [10] | GroupLens Research
Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [11] [12] | Yahoo!
Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [13] [14] | M. Bohanec
YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [15] [16] | Google
Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [17] | Q. Nguyen
Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [18] [19] | W. Loh et al.
Vietnamese Students’ Feedback Corpus (UIT-VSFC) | Students’ Feedback. | Comments | 16,000 | Text | Classification | 1997 | [20] | Nguyen et al.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC) | Users’ Facebook Comments. | Comments | 6,927 | Text | Classification | 1997 | [21] | Nguyen et al.
Vietnamese Open-domain Complaint Detection dataset (ViOCD) | Customer product reviews | Comments | 5,485 | Text | Classification | 2021 | [22] | Nguyen et al.
ViHOS: Hate Speech Spans Detection for Vietnamese | Social Media Texts | Comments | Containing 26k spans on 11k comments | Text | Span Detection | 2021 | [23] | Hoang et al.

News articles

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [24] | Dermouche, M. et al.
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [25] | Reuters
The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [26] | Reuters
Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [27] | T. Rose et al.
Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [28] | M. Alhagri
RE3D (Relationship and Entity Extraction Evaluation Dataset) | Entity and Relation marked data from various news and government sources. Sponsored by Dstl | Filtered, categorisation using Baleen types | not known | JSON | Classification, Entity and Relation recognition | 2017 | [29] | Dstl
Examiner Spam Clickbait Catalogue | Clickbait, spam, crowd-sourced headlines from 2010 to 2015 | Publish date and headlines | 3,089,781 | CSV | Clustering, Events, Sentiment | 2016 | [30] | R. Kulkarni
ABC Australia News Corpus | Entire news corpus of ABC Australia from 2003 to 2019 | Publish date and headlines | 1,186,018 | CSV | Clustering, Events, Sentiment | 2020 | [31] | R. Kulkarni
Worldwide News – Aggregate of 20K Feeds | One week snapshot of all online headlines in 20+ languages | Publish time, URL and headlines | 1,398,431 | CSV | Clustering, Events, Language Detection | 2018 | [32] | R. Kulkarni
Reuters News Wire Headline | 11 Years of timestamped events published on the news-wire | Publish time, Headline Text | 16,121,310 | CSV | NLP, Computational Linguistics, Events | 2018 |  | R. Kulkarni
The Irish Times Ireland News Corpus | 24 Years of Ireland News from 1996 to 2019 | Publish time, Headline Category and Text | 1,484,340 | CSV | NLP, Computational Linguistics, Events | 2020 | [33] | R. Kulkarni
News Headlines Dataset for Sarcasm Detection | High quality dataset with Sarcastic and Non-sarcastic news headlines. | Clean, normalized text | 26,709 | JSON | NLP, Classification, Linguistics | 2018 | [34] | Rishabh Misra

Messages

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [35] [36] | Klimt, B. and Y. Yang
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus, depending on whether or not a lemmatiser or stop-list was enabled. | 2,412 Ham, 481 Spam | Text | Classification | 2000 | [37] [38] | Androutsopoulos, J. et al.
SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5,574 | Text | Classification | 2011 | [39] [40] | T. Almeida et al.
Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [41] | T. Mitchell et al.
Spambase Dataset | Spam emails. | Many text features extracted. | 4,601 | Text | Spam detection, classification | 1999 | [42] | M. Hopkins et al.

Twitter and tweets

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
MovieTweetings | Movie rating dataset based on public and well-structured tweets |  | ~710,000 | Text | Classification, regression | 2018 | [43] | S. Dooms
Twitter100k | Pairs of images and tweets |  | 100,000 | Text and Images | Cross-media retrieval | 2017 | [44] [45] | Y. Hu, et al.
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [46] [47] | A. Go et al.
ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [48] [49] | R. Zafarani et al.
SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [50] [51] | J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [52] [53] | N. Abdulla
Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, Classification | 2013 | [54] [55] | F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, Classification | 2015 | [56] [57] | Xu et al.
Geoparse Twitter benchmark dataset | This dataset contains tweets during different news events in different countries. Manually labeled location mentions. | Location annotations added to JSON metadata | 6,386 | Tweets, JSON | Classification, Information Extraction | 2014 | [58] [59] | S.E. Middleton et al.
Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS) | Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples |  | 30,000 | Tweet IDs, CSV | Classification | 2020 | [60] [61] | B. Shmueli et al.
Dutch Social media collection | This dataset contains COVID-19 tweets made by Dutch speakers or users from Netherlands. The data has been machine labeled | Classified for sentiment, tweet text & user description translated to English. Industry mentions are extracted | 271,342 | JSONL | Sentiment, multi-label classification, machine translation | 2020 | [62] [63] [64] | Aaaksh Gupta, CoronaWhy
ReactionGIF dataset | A dataset of 30K tweets and their GIF reactions | Classified for sentiment, reaction, and emotion | 30,000 | Tweet IDs, JSONL | Classified for sentiment, reaction, and emotion | 2021 | [65] | B. Shmueli et al.
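The Sentiment140 entry above mentions labels obtained by distant supervision from the presence of an emoticon in a tweet. The sketch below illustrates that general idea and is not the dataset authors' code: the emoticon lists, the function name `label_by_emoticon`, and the choice to drop tweets with no emoticon or with mixed emoticons are assumptions made for this example.

```python
# Illustrative emoticon lists; a real pipeline may use different ones.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def label_by_emoticon(tweet):
    """Return (cleaned_text, label), or None if the tweet gives no clear signal.

    The emoticon acts as a noisy label and is removed from the text so that a
    classifier trained on the result cannot simply memorise it.
    """
    has_positive = any(emoticon in tweet for emoticon in POSITIVE)
    has_negative = any(emoticon in tweet for emoticon in NEGATIVE)
    if has_positive == has_negative:
        return None  # no emoticon at all, or both kinds present
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
    return cleaned, ("positive" if has_positive else "negative")


for text in ["great game :)", "missed the bus :(", "plain text", "happy :) sad :("]:
    print(repr(text), "->", label_by_emoticon(text))
```

Labels produced this way are noisy, which is why such datasets are typically described as machine-labeled rather than hand-labeled.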

Dialogues

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NPS Chat Corpus | Posts from age-specific online chat rooms. | Hand privacy masked, tagged for part of speech and dialogue-act. | ~ 500,000 | XML | NLP, programming, linguistics | 2007 | [66] | Forsyth, E., Lin, J., & Martell, C.
Twitter Triple Corpus | A-B-A triples extracted from Twitter. |  | 4,232 | Text | NLP | 2016 | [67] | Sordoni, A. et al.
UseNet Corpus | UseNet forum postings. | Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. | 7 billion | Text |  | 2011 | [68] | Shaoul, C., & Westbury C.
NUS SMS Corpus | SMS messages collected between two users, with timing analysis. |  | ~ 10,000 | XML | NLP | 2011 | [69] | KAN, M
Reddit All Comments Corpus | All Reddit comments (as of 2015). |  | ~ 1.7 billion | JSON | NLP, research | 2015 | [70] | Stuck_In_the_Matrix
Ubuntu Dialogue Corpus | Dialogues extracted from Ubuntu chat stream on IRC. |  | 930 thousand dialogues, 7.1 million utterances | CSV | Dialogue Systems Research | 2015 | [71] | Lowe, R. et al.
Dialog State Tracking Challenge | The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. | Transcription of spoken dialogs with labelling | DSTC2 contains ~3.2k calls; DSTC3 contains ~2.3k calls | Json | Dialogue state tracking | 2014 | [72] (April 2016) | Henderson, Matthew and Thomson, Blaise and Williams, Jason D
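The UseNet Corpus entry above describes omitting documents shorter than 500 words, longer than 500,000 words, or less than 90% English. The sketch below shows one way such a filter could look. The source does not say how the share of English was measured, so the word-list lookup, the punctuation stripping, and the function name `keep_document` are all assumptions made for this example.

```python
def keep_document(text, english_words, min_words=500, max_words=500_000,
                  min_english=0.9):
    """Return True if a document passes the length and language thresholds.

    english_words is a set of lower-case words standing in for a dictionary;
    the fraction of tokens found in it approximates "percent English".
    """
    words = text.lower().split()
    if not words or not (min_words <= len(words) <= max_words):
        return False
    known = sum(1 for word in words if word.strip(".,;:!?\"'()") in english_words)
    return known / len(words) >= min_english


# Tiny hypothetical vocabulary, for illustration only.
vocabulary = {"the", "cat", "sat"}
long_english = "The cat sat. " * 200               # 600 words, all in the vocabulary
too_short = "the cat " * 10                        # 20 words
mixed = "the cat sat " * 150 + "xyzzy " * 150      # 600 words, 75% in the vocabulary

print(keep_document(long_english, vocabulary))  # True
print(keep_document(too_short, vocabulary))     # False
print(keep_document(mixed, vocabulary))         # False
```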

Legal

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
FreeLaw | Filtered data from Court Listener, part of the FreeLaw project. | Cleaned and normalized text | 4,940,710 | Json | NLP, linguistics | 2020 |  | T. Hoppe
Pile of Law | Corpus of legal and administrative data | Cleaned, normalized, and privatized | ~50,000,000 | Json | NLP, linguistics, sentiment | 2022 | [73] [74] | L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project | All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States. | Cleaned and normalized text | ~10,000 | Json | NLP, linguistics | 2022 | [75] | A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.

Other text

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Web of Science Dataset | Hierarchical Datasets for Text Classification | None. | 46,985 | Text | Classification, Categorization | 2017 | [76] [77] | K. Kowsari et al.
Legal Case Reports | Federal Court of Australia cases from 2006 to 2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [78] [79] | F. Galgani et al.
Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [80] [81] | J. Schler et al.
Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [82] [83] | A. Traud et al.
Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [84] [85] | M. Richardson et al.
The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [86] [87] | M. Marcus et al.
DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [88] | Reuters
Google Books N-grams | N-grams from a very large corpus of books | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [89] [90] | Google
Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [91] [92] | K. Luyckx et al.
PushShift | Archives of social media websites, including Reddit, Twitter, and Hackernews. | Text extracted and normalized from WARCs | ~100,000,000 posts | Json | NLP, sentiment, linguistics | 2022 | [93] [94] | J. Baumgartner
SEC Filings | EDGAR Company Filings | Text extracted. |  | csv | NLP |  |  | 
CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [95] [96] | P. Ciarelli et al.
Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [97] [98] | D. Kotzias
BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [99] [100] | K. Buza
PubMed Central | PubMed® comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books. | None | 35 Million | Text | NLP |  |  | 
USPTO | The United States Patent and Trademark Office |  |  | Text | NLP |  |  | 
PhilPapers | Open access collection of philosophy publications |  |  | Text | NLP |  |  | 
Book Corpus | A popular large-scale text corpus. | None |  | Text | NLP | 2015 | [101] | Zhu, Yukun, et al.
Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [102] | S. Bowman et al.
DSL Corpus Collection (DSLCC) | A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. | None | 294,000 phrases | Text | Discriminating between similar languages | 2017 | [103] | Tan, Liling et al.
Urban Dictionary Dataset | Corpus of words, votes and definitions | User names anonymised | 2,580,925 | CSV | NLP, Machine comprehension | 2016 May | [104] | Anonymous
T-REx | Wikipedia abstracts aligned with Wikidata entities | Alignment of Wikidata triples with Wikipedia abstracts | 11M aligned triples | JSON and NIF https://hadyelsahar.github.io/t-rex/ | NLP, Relation Extraction | 2018 | [105] | H. Elsahar et al.
General Language Understanding Evaluation (GLUE) | Benchmark of nine tasks | Various | ~1M sentences and sentence pairs |  | NLU | 2018 | [106] [107] [108] | Wang et al.
Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK)) | Dataset of legal contracts with rich expert annotations |  | ~13,000 labels | CSV and PDF | Natural language processing, QnA | 2021 |  | The Atticus Project
Vietnamese Image Captioning Dataset (UIT-ViIC) | Vietnamese Image Captioning Dataset |  | 19,250 captions for 3,850 images | CSV and PDF | Natural language processing, Computer vision | 2020 | [109] | Lam et al.
Vietnamese Names annotated with Genders (UIT-ViNames) | Vietnamese Names annotated with Genders |  | 26,850 Vietnamese full names annotated with genders | CSV | Natural language processing | 2020 | [110] | To et al.
Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD) | Vietnamese Constructive and Toxic Speech Detection Dataset |  | 10,000 Vietnamese users' comments on online newspapers on 10 domains | CSV | Natural Language Processing | 2021 | [111] | Nguyen et al.
PG-19 | A set of books extracted from the Project Gutenberg books library |  |  | Text | Natural Language Processing | 2019 |  | Jack W et al.
Deepmind Mathematics | Mathematical question and answer pairs. |  |  | Text | Natural Language Processing | 2018 | [112] | D Saxton et al.
Anna's Archive | A comprehensive archive of published books and papers | None | 100,356,641 | Text, epub, PDF | Natural Language Processing | 2024 |  | 

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Zero Resource Speech Challenge 2015 | Spontaneous speech (English), Read speech (Xitsonga). | None, raw WAV files. | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | WAV (audio only) | Unsupervised discovery of speech features/subword units/word units | 2015 | [113] [114] | Versteegh et al.
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale. | 1,040 | Text | Classification, regression | 2013 | [115] [116] | B. E. Sakar et al.
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [117] [118] | M. Bedda et al.
ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [119] [120] | R. Cole et al.
Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [121] [122] | M. Kudo et al.
Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [123] [124] | A. Tsanas et al.
TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification. | 1986 | [125] [126] | J. Garofolo et al.
Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. | 2016 | [127] | N. Halabi
Common Voice | A public domain database of crowdsourced data across a wide range of dialects. | Validation by other users. | English: 1,118 hours | MP3 with corresponding text files | Speech recognition | 2017 June (2019 December) | [128] | Mozilla
LJSpeech | A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks. | Quality check, normalized transcription alongside the original. | 13,100 | CSV, WAV | Speech synthesis | 2017 | [129] | Keith Ito, Linda Johnson
Arabic Speech Commands Dataset | Collected from 30 contributors and grouped into 40 keywords. | Raw WAV files | 12,000 | WAV, CSV | Speech recognition, keyword spotting | 2021 | [130] | Abdulkader Ghandoura

Music

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Geographic Origin of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographic classification, clustering | 2014 | [131] [132] | F. Zhou et al.
Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [133] [134] | T. Bertin-Mahieux et al.
MUSDB18 | Multi-track popular music recordings | Raw audio | 150 | MP4, WAV | Source Separation | 2017 | [135] | Z. Rafii et al.
Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [136] | M. Defferrard et al.
Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [137] [138] | D. Radicioni et al.

Other sounds

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | [139] [140] | J. Salamon et al.
AudioSet | 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. | 128-d PCA'd VGG-ish features every 1 second. | 2,084,320 | Text (CSV) and TensorFlow Record files | Classification | 2017 | [141] | J. Gemmeke et al., Google
Bird Audio Detection challenge | Audio from environmental monitoring stations, plus crowdsourced recordings |  | 17,000+ |  | Classification | 2016 (2018) | [142] [143] | Queen Mary University and IEEE Signal Processing Society
WSJ0 Hipster Ambient Mixtures | Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area | Noise clips matched to WSJ0 clips | 28,000 | Sound (WAV) | Audio source separation | 2019 | [144] | Wichern, G., et al., Whisper and MERL
Clotho | 4,981 audio samples of 15 to 30 seconds long, each audio sample having five different captions of eight to 20 words long. |  | 24,905 | Sound (WAV) and text (CSV) | Automated audio captioning | 2020 | [145] [146] | K. Drossos, S. Lipping, and T. Virtanen

Signal data

Datasets containing electric signal information requiring some sort of signal processing for further analysis.

Electrical

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [147] [148] | Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [149] [150] | M. Kachuee et al.
Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [151] [152] | A. Vergara
Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [153] [154] | K. Ullrich
UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [155] [156] | D. Rambla et al.
Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [157] [158] | M. Bator

Motion-tracking

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [159] [160] | Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [161] [162] | R. Madeo et al.
Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [163] [164] | T. Theodoridis
Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [165] [166] | B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [167] [168] | J. Reyes-Ortiz et al.
Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [169] [170] | M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [171] [172] | W. Ugulino et al.
sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [173] [174] | C. Sapsanis et al.
REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [175] | O. Banos et al.
Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [176] [177] | A. Stisen et al.
Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [178] [179] | D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [180] | A. Reiss
OPPORTUNITY Activity Recognition Dataset | Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [181] [182] | D. Roggen et al.
Real World Activity Recognition Dataset | Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [183] | T. Sztyler et al.
Toronto Rehab Stroke Pose Dataset | 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. | None. | 10 healthy persons and 9 stroke survivors (3500–6000 frames per person) | CSV | Classification | 2017 | [184] [185] [186] | E. Dolatabadi et al.
Corpus of Social Touch (CoST) | 7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm. | Touch gestures performed are segmented and labeled. | 7805 gesture captures | CSV | Classification | 2016 | [187] [188] | M. Jung et al.

Other signals

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Wine DatasetChemical analysis of wines grown in the same region in Italy but derived from three different cultivars.13 properties of each wine are given178TextClassification, regression1991[189] [190] M. Forina et al.
Combined Cycle Power Plant Data SetData from various sensors within a power plant running for 6 years.None9568TextRegression2014[191] [192] P. Tufekci et al.

Physical data

Datasets from physical systems.

High-energy physics

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
HIGGS DatasetMonte Carlo simulations of particle accelerator collisions.28 features of each collision are given.11MTextClassification2014[193] [194] [195] D. Whiteson
HEPMASS DatasetMonte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise.28 features of each collision are given.10,500,000TextClassification2016[196] D. Whiteson

Systems

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Yacht Hydrodynamics DatasetYacht performance based on dimensions.Six features are given for each yacht.308TextRegression2013[197] [198] R. Lopez
Robot Execution Failures Dataset5 data sets that center around robotic failure to execute common tasks.Integer valued features such as torque and other sensor measurements.463TextClassification1999[199] L. Seabra et al.
Pittsburgh Bridges DatasetDesign description is given in terms of several properties of various bridges.Various bridge features are given.108TextClassification1990[200] [201] Y. Reich et al.
Automobile DatasetData about automobiles, their insurance risk, and their normalized losses.Car features extracted.205TextRegression1987[202] [203] J. Schlimmer et al.
Auto MPG DatasetMPG data for cars.Eight features of each car given.398TextRegression1993[204] Carnegie Mellon University
Energy Efficiency DatasetHeating and cooling requirements given as a function of building parameters.Building parameters given.768TextClassification, regression2012[205] [206] A. Xifara et al.
Airfoil Self-Noise DatasetA series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections.Data about frequency, angle of attack, etc., are given.1503TextRegression2014[207] R. Lopez
Challenger USA Space Shuttle O-Ring DatasetAttempt to predict O-ring problems given past Challenger data.Several features of each flight, such as launch temperature, are given.23TextRegression1993[208] [209] D. Draper et al.
Statlog (Shuttle) DatasetNASA space shuttle datasets.Nine features given.58,000TextClassification2002[210] NASA

Astronomy

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Volcanoes on Venus – JARtool experiment DatasetVenus images returned by the Magellan spacecraft.Images are labeled by humans.not givenImagesClassification1991[211] [212] M. Burl
MAGIC Gamma Telescope DatasetMonte Carlo generated high-energy gamma particle events.Numerous features extracted from the simulations.19,020TextClassification2007[213] R. Bock
Solar Flare DatasetMeasurements of the number of certain types of solar flare events occurring in a 24-hour period.Many solar flare-specific features are given.1389TextRegression, classification1989[214] G. Bradshaw
CAMELS Multifield Dataset2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range in the value of the cosmological and astrophysical parametersEach map and grid has 6 cosmological and astrophysical parameters associated to it405,000 2D maps and 405,000 3D grids2D maps and 3D gridsRegression2021[215] Francisco Villaescusa-Navarro et al.

Earth science

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Volcanoes of the WorldVolcanic eruption data for all known volcanic events on earth.Details such as region, subregion, tectonic setting, dominant rock type are given.1535TextRegression, classification2013[216] E. Venzke et al.
Seismic-bumps DatasetSeismic activities from a coal mine.Seismic activity was classified as hazardous or not.2584TextClassification2013[217] [218] M. Sikora et al.
CAMELS-USCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference671CSV, Text, ShapefileRegression2017[219] [220] N. Addor et al. / A. Newman et al.
CAMELS-ChileCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference516CSV, Text, ShapefileRegression2018[221] C. Alvarez-Garreton et al.
CAMELS-BrazilCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference897CSV, Text, ShapefileRegression2020[222] V. Chagas et al.
CAMELS-GBCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference671CSV, Text, ShapefileRegression2020[223] G. Coxon et al.
CAMELS-AustraliaCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference222CSV, Text, ShapefileRegression2021[224] K. Fowler et al.
LamaH-CECatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference859CSV, Text, ShapefileRegression2021[225] C. Klingler et al.

Other physical

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Concrete Compressive Strength DatasetDataset of concrete properties and compressive strength.Nine features are given for each sample.1030TextRegression2007[226] [227] I. Yeh
Concrete Slump Test DatasetConcrete slump flow given in terms of properties.Features of concrete given such as fly ash, water, etc.103TextRegression2009[228] [229] I. Yeh
Musk DatasetPredict if a molecule, given the features, will be a musk or a non-musk.168 features given for each molecule.6598TextClassification1994[230] Arris Pharmaceutical Corp.
Steel Plates Faults DatasetSteel plates of 7 different types.27 features given for each sample.1941TextClassification2010[231] Semeion Research Center
Noble Metal Monometallic Nanoparticles DatasetsProcessing and structural features of monometallic nanoparticles, labels being formation energy.85-182 features given for each sample.425 to 4000CSVRegression2017 to 2023[232] [233] [234] [235] [236] [237] A. Barnard and G. Opletal
Noble Metal Bimetallic Nanoparticles DatasetsProcessing and structural features of bimetallic nanoparticles, labels being formation energy.922 features given for each sample.138147 to 162770CSVRegression2023[238] [239] [240] [241] [242] [243] [244] [245] [246] [247] [248] [249] J. Ting et al.
AuPdPt Trimetallic Nanoparticles DatasetProcessing and structural features of AuPdPt nanoparticles, labels being formation energy.1958 features given for each sample.48136CSVRegression2023[250] K. Lu et al.

Biological data

Datasets from biological systems.

Human

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Age DatasetA structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain.A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project.1,223,009TextRegression, Classification2022Paper[251] Dataset[252] Amoradnejad et al.
Synthetic Fundus Dataset[253] Photorealistic retinal images and vessel segmentations. Public domain.2500 images with 1500*1152 pixels useful for segmentation and classification of veins and arteries on a single background.2500ImagesClassification, Segmentation2020[254] C. Valenti et al.
EEG DatabaseStudy to examine EEG correlates of genetic predisposition to alcoholism.Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.122TextClassification1999[255] H. Begleiter
P300 Interface DatasetData from nine subjects collected using P300-based brain-computer interface for disabled subjects.Split into four sessions for each subject. MATLAB code given.1,224TextClassification2008[256] [257] U. Hoffmann et al.
Heart Disease Data SetAttributes of patients with and without heart disease.75 attributes given for each patient with some missing values.303TextClassification1988[258] [259] A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) DatasetDataset of features of breast masses. Diagnosis by physician is given.10 features for each sample are given.569TextClassification1995[260] [261] W. Wolberg et al.
National Survey on Drug Use and HealthLarge scale survey on health and drug use in the United States.None.55,268TextClassification, regression2012[262] United States Department of Health and Human Services
Lung Cancer DatasetLung cancer dataset without attribute definitions56 features are given for each case32TextClassification1992[263] [264] Z. Hong et al.
Arrhythmia DatasetData for a group of patients, of which some have cardiac arrhythmia.276 features for each instance.452TextClassification1998[265] [266] H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset9 years of readmission data across 130 US hospitals for patients with diabetes.Many features of each readmission are given.100,000TextClassification, clustering2014[267] [268] J. Clore et al.
Diabetic Retinopathy Debrecen DatasetFeatures extracted from images of eyes with and without diabetic retinopathy.Features extracted and conditions diagnosed.1151TextClassification2014[269] [270] B. Antal et al.
Diabetic Retinopathy Messidor DatasetMethods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR)Features retinopathy grade and risk of macular edema1200Images, TextClassification, Segmentation2008[271] [272] Messidor Project
Liver Disorders DatasetData for people with liver disorders.Seven biological features given for each patient.345TextClassification1990[273] [274] Bupa Medical Research Ltd.
Thyroid Disease Dataset10 databases of thyroid disease patient data.None.7200TextClassification1987[275] [276] R. Quinlan
Mesothelioma DatasetMesothelioma patient data.Large number of features, including asbestos exposure, are given.324TextClassification2016[277] [278] A. Tanrikulu et al.
Parkinson's Vision-Based Pose Estimation Dataset2D human pose estimates of Parkinson's patients performing a variety of tasks. Camera shake has been removed from trajectories.134TextClassification, regression2017[279] [280] [281] M. Li et al.
KEGG Metabolic Reaction Network (Undirected) DatasetNetwork of metabolic pathways. A reaction network and a relation network are given.Detailed features for each network node and pathway are given.65,554TextClassification, clustering, regression2011[282] M. Naeem et al.
Modified Human Sperm Morphology Analysis Dataset (MHSMA)Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail.Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created.1,540.npy filesClassification2019[283] [284] S. Javadi and S.A. Mirroshandel
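
The EEG Database row above gives a sampling rate of 256 Hz, a 3.9 ms epoch, 64 electrodes, and a 1-second recording. The short sketch below only checks how those figures relate to one another; the variable names are illustrative and are not taken from the dataset documentation.

```python
# Figures quoted in the EEG Database row of the table above.
SAMPLING_RATE_HZ = 256   # samples per second, per electrode
N_ELECTRODES = 64        # electrodes placed on the scalp
DURATION_S = 1.0         # length of one recording, in seconds

# Time between consecutive samples, in milliseconds (the "3.9 ms epoch").
sample_interval_ms = 1000.0 / SAMPLING_RATE_HZ

# Number of samples each electrode contributes to one recording.
samples_per_channel = int(SAMPLING_RATE_HZ * DURATION_S)

# Total measurements in one recording across all electrodes.
values_per_trial = N_ELECTRODES * samples_per_channel

print(f"{sample_interval_ms:.2f} ms between samples")
print(f"{samples_per_channel} samples per electrode")
print(f"{values_per_trial} values per recording")
```

Under these figures, one recording is a 64 x 256 array of measurements.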

Animal

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Abalone DatasetPhysical measurements of Abalone. Weather patterns and location are also given.None.4177TextRegression1995[285] Marine Research Laboratories – Taroona
Zoo DatasetArtificial dataset covering 7 classes of animals.Animals are classed into 7 categories and features are given for each.101TextClassification1990[286] R. Forsyth
Demospongiae DatasetData about marine sponges.503 sponges in the Demosponge class are described by various features.503TextClassification2010[287] E. Armengol et al.
Farm animals dataPLF data inventory (cows, pigs; location, acceleration, etc.).Labeled datasets.List is constantly updatedTextClassification2020[288] V. Bloch
Splice-junction Gene Sequences DatasetPrimate splice-junction gene sequences (DNA) with associated imperfect domain theory.None.3190TextClassification1992G. Towell et al.
Mice Protein Expression DatasetExpression levels of 77 proteins measured in the cerebral cortex of mice.None.1080TextClassification, Clustering2015[289] [290] C. Higuera et al.

Fungi

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
UCI Mushroom DatasetMushroom attributes and classification.Many properties of each mushroom are given.8124TextClassification1987[291] J. Schlimmer
Secondary Mushroom DatasetMushroom attributes and classificationSimulated data from larger and more realistic primary mushroom entries. Fully reproducible.61069TextClassification2020[292] [293] D. Wagner et al.

Plant

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Forest Fires DatasetForest fires and their properties.13 features of each fire are extracted.517TextRegression2008[294] [295] P. Cortez et al.
Iris DatasetThree types of iris plants are described by 4 different attributes.None.150TextClassification1936[296] [297] R. Fisher
Plant Species Leaves DatasetSixteen samples of leaf each of one-hundred plant species.Shape descriptor, fine-scale margin, and texture histograms are given.1600TextClassification2012[298] [299] J. Cope et al.
Soybean DatasetDatabase of diseased soybean plants.35 features for each plant are given. Plants are classified into 19 categories.307TextClassification1988[300] R. Michalski et al.
Seeds DatasetMeasurements of geometrical properties of kernels belonging to three different varieties of wheat.None.210TextClassification, clustering2012[301] [302] Charytanowicz et al.
Covertype DatasetData for predicting forest cover type strictly from cartographic variables.Many geographical features given.581,012TextClassification1998[303] [304] J. Blackard et al.
Abscisic Acid Signaling Network DatasetData for a plant signaling network. Goal is to determine set of rules that governs the network.None.300TextCausal-discovery2008[305] J. Jenkens et al.
Folio Dataset20 photos of leaves for each of 32 species.None.637Images, textClassification, clustering2015[306] [307] T. Munisami et al.
Oxford Flower Dataset17 category dataset of flowers.Train/test splits, labeled images.1360Images, textClassification2006[308] [309] M-E Nilsback et al.
Plant Seedlings Dataset12 category dataset of plant seedlings.Labelled images, segmented images.5544ImagesClassification, detection2017[310] Giselsson et al.
Fruits-360Database with images of 131 fruits and vegetables.100x100 pixels, white background.90483Images (jpg)Classification2017–2024[311] Mihai Oltean

Microbe

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Ecoli DatasetProtein localization sites.Various features of the protein localizations sites are given.336TextClassification1996[312] [313] K. Nakai et al.
MicroMass DatasetIdentification of microorganisms from mass-spectrometry data.Various mass spectrometer features.931TextClassification2013[314] [315] P. Mahe et al.
Yeast DatasetPredictions of Cellular localization sites of proteins.Eight features given per instance.1484TextClassification1996[316] [317] K. Nakai et al.

Drug discovery

Anomaly data

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Numenta Anomaly Benchmark (NAB)Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.None50+ filesCSVAnomaly detection2016 (continually updated)[319] Numenta
Skoltech Anomaly Benchmark (SKAB)Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed.There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems30+ files (v0.9)CSVAnomaly detection2020 (continually updated)[320] [321] Iurii D. Katser and Vyacheslav O. Kozitsin
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical StudyMost data files are adapted from UCI Machine Learning Repository data, some are collected from the literature.treated for missing values, numerical attributes only, different percentages of anomalies, labels1000+ filesARFFAnomaly detection2016 (possibly updated with new datasets and/or results)[322] Campos et al.
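
The benchmarks above distribute ordered, timestamped, single-valued series as CSV files. As a rough illustration of working with data in that shape, the sketch below parses a small inline CSV and flags points that lie far from the mean. The column names `timestamp` and `value`, the sample rows, and the z-score rule are all assumptions made for this example; this is not the scoring method used by any of the benchmarks.

```python
import csv
import io
from statistics import mean, stdev

# Hypothetical sample in a "timestamp,value" layout (made-up values).
SAMPLE_CSV = """timestamp,value
2014-04-01 00:00:00,20.0
2014-04-01 00:05:00,21.0
2014-04-01 00:10:00,19.5
2014-04-01 00:15:00,20.5
2014-04-01 00:20:00,80.0
2014-04-01 00:25:00,20.2
"""


def load_series(text):
    """Parse CSV text into a list of (timestamp, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["timestamp"], float(row["value"])) for row in reader]


def flag_outliers(series, threshold=1.5):
    """Return timestamps whose value is more than `threshold`
    sample standard deviations away from the series mean."""
    values = [v for _, v in series]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [ts for ts, v in series if abs(v - mu) / sigma > threshold]


series = load_series(SAMPLE_CSV)
anomalies = flag_outliers(series)
print(anomalies)
```

A global z-score is only a baseline: it ignores temporal structure, which is what changepoint-style benchmarks are designed to test.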

Question answering data

This section includes datasets that deal with structured data.

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
DBpedia Neural Question Answering (DBNQA) DatasetA large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.894,499Question-query pairsQuestion Answering2018[323] [324] Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset (UIT-ViQuAD)A large collection of Vietnamese questions for evaluating MRC models.This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.23,074Question-answer pairsQuestion Answering2020[325] Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus(ViMMRC)A collection of Vietnamese multiple-choice questions for evaluating MRC models.This corpus includes 2,783 Vietnamese multiple-choice questions.2,783Question-answer pairsQuestion Answering/Machine Reading Comprehension2020[326] Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question RewritingAn end-to-end open-domain question answering.This dataset includes 14,000 conversations with 81,000 question-answer pairs. Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_sourceFurther details are provided in the project's GitHub repository and respective Hugging Face dataset card.Question Answering2021[327] Anantha and Vakulenko et al.
UnifiedQAQuestion-answer dataProcessed datasetQuestion Answering2020[328] Khashabi et al.

Dialog or instruction prompted data

This section includes datasets that ...

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
Taskmaster"The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports). Taskmaster-3: 23,757 movie ticketing dialogs.Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id. Taskmaster-3: conversation id, utterances, vertical, scenario, instructions. For further details check the project's GitHub repository or the Hugging Face dataset cards (taskmaster-1, taskmaster-2, taskmaster-3).Dialog/Instruction prompted2019[329] Byrne and Krishnamoorthi et al.
DrRepairA labeled dataset for program repair. Pre-processed dataCheck format details in the project's worksheet.Dialog/Instruction prompted2020[330] Michihiro et al.
Natural Instructions v2Large dataset that covers a wider range of reasoning abilitiesEach task consists of input/output and a task definition. Further information is provided in the GitHub repository of the project and the Hugging Face data card.Input/Output and task definition2022[331] Wang et al.
LAMBADA"LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."Information about this dataset's format is available in the HuggingFace dataset card and the project's website.The dataset can be downloaded here, and the rejected data here.2016[332] Paperno et al.
FLANA re-preprocessed version of the FLAN dataset, with updates since the original FLAN dataset was released, is available in Hugging Face (test data, train data, validation data). The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan. Another FLAN GitHub repo was created as well. This is the one associated with the dataset card in Hugging Face.2021[333] Wei et al.

Cybersecurity

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
MITRE ATTACKThe ATT&CK is a globally-accessible knowledge base of adversary tactics and techniques. Data can be downloaded from these two GitHub repositories: version 2.1 and version 2.0[334] MITRE ATTACK
CAPECCommon Attack Pattern Enumeration and ClassificationData can be downloaded from CAPEC's website:Mechanisms of AttackDomains of Attack[335] CAPEC
CVE CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. Data can be downloaded from: Allitems[336] CVE
CWECommon Weakness Enumeration data.Data can be downloaded from:Software DevelopmentHardware DesignResearch Concepts[337] CWE
MalwareTextDBAnnotated database of malware texts.The GitHub repository of the project contains the data to download.[338] Kiat et al.
USENIX Security Symposium proceedingsCollection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022.This data is not pre-processed.1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022.[339] USENIX Security Symposium
APTNotesCollection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data.This data is not pre-processed.The GitHub repository of the project contains a file with links to the data stored in box.Data files can also be downloaded here.[340] APT Notes
arXiv Cryptography and Security papersCollection of articles about cybersecurityThis data is not pre-processed.All articles available here.[341] arXiv
Security eBooks for freeSmall collection of security eBooks, and security presentations publicly available.This data is not pre-processed.[342] [343] [344] [345] [346] [347] [348] [349] [350] [351] [352] [353]
National Cyber Security strategy repositoryRepository of worldwide strategy documents about cybersecurity.This data is not pre-processed.[354]
Cyber Security Natural Language ProcessingData about cybersecurity strategies from more than 75 countries.Tokenization, removal of frequent but uninformative words.Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin
APT Reports collectionSample of APT reports, malware, technology, and intelligence collectionRaw and tokenized data available.All data is available in this GitHub repository.blackorbird
Offensive Language Identification Dataset (OLID)Data available in the project's website.Data is also available here.[355] Zampieri et al.
Cyber reports from the National Cyber Security CentreThis data is not pre-processed.Threat reports, reports and advisory, news, blog-posts, speeches.Alternate list of reports.[356]
APT reports by KasperskyThis data is not pre-processed.[357]
The cyberwireThis data is not pre-processed.Newsletters, podcasts, and stories.[358]
Databreaches newsThis data is not pre-processed.News, list of news from Aug 2022 to Feb 2023[359]
CybernewsThis data is not pre-processed.News, curated list of news[360]
BleepingcomputerThis data is not pre-processed.News[361]
TherecordThis data is not pre-processed.Cybercrime news[362]
HackreadThis data is not pre-processed.Hacking news[363]
SecurelistThis data is not pre-processed.APT reports, archive, DDOS reports, incidents, Kaspersky security bulletin, industrial threats, malware-reports, opinions, publications, research, and SAS.[364]
Stucco projectThe Stucco project collects data not typically integrated into security systems.This data is not pre-processedProject's website with data informationReviewed source with links to data sources[365]
FarsightsecurityWebsite with technical information, reports, and more about security topics.This data is not pre-processedTechnical information, research, reports.[366]
SchneierWebsite with academic papers about security topics.This data is not pre-processedPapers per category, papers archive by date.[367]
TrendmicroWebsite with research, news, and perspectives about security topics.This data is not pre-processedReviewed list of Trendmicro research, news, and perspectives.[368]
The Hacker NewsNews about cybersecurity topics.This data is not pre-processeddata breaches, cyberattacks, vulnerabilities, malware news.[369]
KrebsonsecuritySecurity news and investigationThis data is not pre-processedcurated list of news[370]
Mitre DefendMatrix of Defend artifactsjson files[371]
Mitre AtlasMitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations.This data is not pre-processed[372]
Mitre EngageMITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals.This data is not pre-processed[373]
Hacking TutorialsThis data is not pre-processed[374]

Climate and sustainability

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
TCFD reportsDatabase of company reports that include TCFD-related disclosures. This data is not pre-processedDirect link to reportsCurated list of reports[375] TCFD Knowledge Hub
Corporate Social Responsibility ReportsA listing of responsibility reports on the internet.This data is not pre-processedCurated list of reports[376] ResponsibilityReports
The Intergovernmental Panel on Climate Change (IPCC)A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response optionsThis data is not pre-processedReportsCurated list of reports[377] IPCC
Alliance for Research on Corporate SustainabilityThis data is not pre-processedCurated list of blog posts[378] ARCS
ESG corpus: Knowledge Hub of the Accounting for SustainabilityThis data is not pre-processedGuides, case studies, blogs, and reports & surveys.[379] Mehra et al.
CLIMATE-FEVERA dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change collected on the internet.Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim totalling in 7,675 claim-evidence pairs.Dataset HF card, and project's GitHub repository.[380] Diggelmann et al.
Climate News datasetA dataset for NLP and climate change media researchersThe dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database)Climate news DB, Project's GitHub repository[381] ADGEfficiency
ClimatextClimatext is a dataset for sentence-based climate change topic detection.HF dataset[382] University of Zurich
GreenBizCollection of articles and news about climate and sustainabilityThis data is not pre-processedCurated list of climate articlesCurated list of sustainability articles[383]
Top research pre-prints in climate and sustainabilityList of pre-prints from researchers in the Reuters Hot ListThis data is not pre-processedCurated list of pre-prints[384] Maurice Tamman
ARCSThis data is not pre-processedCurated list of corporate sustainability blogs[385]
GreenBizWebsite with articles about climate and sustainabilityThis data is not pre-processed[386] GreenBiz
CSRWIREThis data is not pre-processedCurated list of articles[387] CSRWIRE
CDPArticles about climate, water, and forestsThis data is not pre-processed[388] CDP
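
The CLIMATE-FEVER row above reports 1,535 claims, each paired with five annotated evidence sentences, for 7,675 claim-evidence pairs in total. The sketch below checks that arithmetic and shows one way such records could be flattened into pairs. The field names (`claim`, `evidences`, `text`, `label`), the label string, and the toy record are hypothetical placeholders, not the dataset's actual schema.

```python
# Counts quoted in the CLIMATE-FEVER row of the table above.
N_CLAIMS = 1535
EVIDENCE_PER_CLAIM = 5

# Total claim-evidence pairs implied by those counts.
expected_pairs = N_CLAIMS * EVIDENCE_PER_CLAIM


def to_pairs(claims):
    """Flatten claim records into (claim, evidence, label) tuples.

    Each record is assumed to look like
    {"claim": str, "evidences": [{"text": str, "label": str}, ...]}.
    """
    return [
        (record["claim"], evidence["text"], evidence["label"])
        for record in claims
        for evidence in record["evidences"]
    ]


# One placeholder record with five placeholder evidence sentences.
toy_claims = [
    {
        "claim": "Placeholder claim text.",
        "evidences": [
            {"text": f"Placeholder evidence {i}.", "label": "NOT_ENOUGH_INFO"}
            for i in range(EVIDENCE_PER_CLAIM)
        ],
    }
]

pairs = to_pairs(toy_claims)
print(expected_pairs, len(pairs))
```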

Code data

Dataset NameBrief descriptionPreprocessingInstancesFormatDefault TaskCreated (updated)ReferenceCreator
The StackA 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages.Filtered through license detection and deduplication.6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages.ParquetLanguage modeling, autocompletion, program synthesis.2022[389] [390] D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries
GitHub repositoriesThis data is not pre-processedCurated list of repositories from GitHub: 61 62 63 64 65 66 67 68 69 70 71, 72, 73, 74, 75, 76, 77 101
IBM Public GitHub repositoriesThis data is not pre-processedCurated list of repositories from GitHub
RedHat Public GitHub repositoriesThis data is not pre-processedCurated list of repositories from GitHub
StackExchange Public Archive.org filesThis data is not pre-processedCurated list of files from Archive.org
Gitlab Public repositoriesThis data is not pre-processedCurated list of repositories from Gitlab: 1 2
Ansible Collections public repositoriesThis data is not pre-processedCurated list of repositories from GitHub.
CodeParrot GitHub Code Dataset This data is not pre-processedCurated list of repositories from Hugging Face: 1 2 3 4 5 6 7 8 9 10
OKDThe Community Distribution of Kubernetes that powers Red Hat OpenShiftThis data is not pre-processedList of GitHub repositories of the project
OpenShiftThe developer and operations friendly Kubernetes distroList of GitHub repositories of the project
KubernetesThis data is not pre-processedList of GitHub repositories of the project
Red Hat DeveloperGitHub home of the Red Hat Developer programThis data is not pre-processedList of GitHub repositories of the project
Red Hat WorkshopsThis data is not pre-processedList of GitHub repositories of the project
Kubernetes SIGsThis data is not pre-processedList of GitHub repositories of the project
KonveyorThis data is not pre-processedList of GitHub repositories of the project
RedHat MarketplaceThis data is not pre-processedList of GitHub repositories of the project
Redhat blogThis data is not pre-processed[391]
Kubernetes ioThis data is not pre-processed[392]
Docs OpenshiftThis data is not pre-processed[393]
cncf ioThis data is not pre-processed[394]
Kubernetes presentationsList of publicly available Kubernetes presentationsThis data is not pre-processeddata link
Red Hat Open Innovation LabsThis data is not pre-processedList of GitHub repositories of the project
Red Hat DemosThis data is not pre-processedList of GitHub repositories of the project
Red Hat OpenShift OnlineThis data is not pre-processedList of GitHub repositories of the project
Software CollectionsThis data is not pre-processedList of GitHub repositories of the project
Red Hat InsightsThis data is not pre-processedList of GitHub repositories of the project
Red Hat GovernmentThis data is not pre-processedList of GitHub repositories of the project
Red Hat ConsultingThis data is not pre-processedList of GitHub repositories of the project
Red Hat Communities of PracticeThis data is not pre-processedList of GitHub repositories of the project
Red Hat Partner TechThis data is not pre-processedList of GitHub repositories of the project
Red Hat DocumentationThis data is not pre-processedList of GitHub repositories of the project
IBMThis data is not pre-processedList of GitHub repositories of the project
IBM Cloud This data is not pre-processedList of GitHub repositories of the project
Build Lab Team This data is not pre-processedList of GitHub repositories of the project
Terraform IBM ModulesThis data is not pre-processedList of GitHub repositories of the project
Cloud SchematicsThis data is not pre-processedList of GitHub repositories of the project
OCP Power DemosThis data is not pre-processedList of GitHub repositories of the project
IBM App Modernization This data is not pre-processedList of GitHub repositories of the project
Kubernetes OperatorHub This data is not pre-processedList of GitHub repositories of the project
Cloud Native Computing Foundation (CNCF) This data is not pre-processedList of GitHub repositories of the project
Operator FrameworkThis data is not pre-processedList of GitHub repositories of the project
GitHub repositories referenced in artifacthub.ioThis data is not pre-processedList of GitHub repositories in artifacthub.io
Red Hat Communities of PracticeThis data is not pre-processedList of GitHub repositories of the project
Red Hat partnerThis data is not pre-processedList of GitHub repositories of the project
IBM RepositoriesThis data is not pre-processedList of GitHub repositories for the project
Build Lab TeamThis data is not pre-processedList of GitHub repositories for the project
Operator FrameworkThis data is not pre-processedList of GitHub repositories for the project
GitHub repositoriesThis data is not pre-processedList of GitHub repositories for the project
Red HatThis data is not pre-processedList of GitHub repositories of the project
Kubernetes PatternsThis data is not pre-processedList of GitHub repositories of the project
Kubernetes Deployment & Security PatternsThis data is not pre-processedList of GitHub repositories of the project
Kubernetes for Full-Stack DevelopersThis data is not pre-processedList of GitHub repositories of the project
Load Balancer Cloudwatch MetricsThis data is not pre-processedGitHub repository of the project
DynatraceThis data is not pre-processedhttps://docs.dynatrace.com/docs/observe-and-explore/metrics/built-in-metrics
AIOps Challenge 2020 DataThis data is not pre-processedGitHub repository of the project
LoghubThis data is not pre-processedList of repositories
HTML PagesThis data is not pre-processedList of HTML pages
OpenShift ebooksThis data is not pre-processed[395]
Kubernetes ebooksThis data is not pre-processedKubernetes Patterns, Kubernetes Deployment, Kubernetes for Full-Stack Developers
Kubernetes for Full-Stack DevelopersThis data is not pre-processedKubernetes for Full-Stack Developers
List of public and licensed GitHub repositoriesThis data is not pre-processedList of repositories

Multivariate data

Financial

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma separated values | Classification, regression, time series | 2014 | [396] [397] | M. Brown et al.
Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [398] [399] | R. Quinlan
eBay auction data | Auction data from various eBay.com objects over auctions of various lengths. | Contains all bids, bidderID, bid times, and opening prices. | ~ 550 | Text | Regression, classification | 2012 | [400] [401] | G. Shmueli et al.
Statlog (German Credit Data) | Binary credit classification into "good" or "bad" with many features. | Various financial features of each person are given. | 1,000 | Text | Classification | 1994 | [402] | H. Hofmann
Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed to the bank is also given. | 45,211 | Text | Classification | 2012 | [403] [404] | S. Moro et al.
Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [405] [406] | O. Akbilgic
Default of Credit Card Clients | Credit default data for Taiwanese credit card clients. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [407] [408] | I. Yeh
StockNet | Stock movement prediction from tweets and historical stock prices. | None |  | Text | NLP | 2018 | [409] | Yumo Xu and Shay B. Cohen
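
The Dow Jones Index row mentions calculated values such as percentage change and lags. The following is a minimal sketch of how such derived columns can be computed from a price series. The function names and the sample prices are hypothetical and are not taken from the dataset or its documentation.

```python
from typing import List, Optional, Sequence


def pct_change(values: Sequence[float]) -> List[Optional[float]]:
    """Percentage change from the previous observation; None where undefined."""
    out: List[Optional[float]] = [None]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev * 100.0)
    return out[: len(values)]


def lag(values: Sequence[float], k: int = 1) -> List[Optional[float]]:
    """Shift the series by k steps so that row t holds the value from row t - k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    shifted: List[Optional[float]] = [None] * k + list(values)
    return shifted[: len(values)]


# Hypothetical weekly closing prices (illustration only).
weekly_close = [100.0, 110.0, 99.0, 99.0]
changes = pct_change(weekly_close)
lagged = lag(weekly_close, 1)

for price, change, previous in zip(weekly_close, changes, lagged):
    print(price, change, previous)
```

The first row of each derived column is left as None because no earlier observation exists; a real pipeline would have to decide whether to drop or impute such rows.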

Weather

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [410] | P. Collard
El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [411] | Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [412] | D. Lucas
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [413] | Mauna Loa Observatory
Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [414] | Johns Hopkins University
Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [415] [416] | K. Zhang et al.

Census

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [417] | United States Census Bureau
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [418] [419] | United States Census Bureau
IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None | 256,932 | Text | Classification, regression | 1999 | [420] | IPUMS
US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [421] | United States Census Bureau

Transit

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [422] [423] | H. Fanaee-T
New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [424] | New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [425] [426] | M. Ferreira et al.
METR-LA | Speed from loop detectors on the highways of Los Angeles County. | Average speed in 5-minute timesteps. | 7,094,304 from 207 sensors and 34,272 timesteps | Comma separated values | Regression, forecasting | 2014 | [427] | Jagadish et al.
PeMS | Speed, flow, occupancy and other metrics from loop detectors and other sensors on the freeways of the State of California, U.S.A. | Metrics usually aggregated by averaging into 5-minute timesteps. | 39,000 individual detectors, each containing years of time series | Comma separated values | Regression, forecasting, nowcasting, interpolation | (updated in real time) | [428] | California Department of Transportation
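
The METR-LA and PeMS rows describe sensor readings averaged into 5-minute timesteps. The sketch below shows one way such an aggregation can be done with the Python standard library. The timestamps and speeds are invented for illustration, and the helper names are assumptions rather than part of either dataset's tooling.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

BIN_WIDTH = timedelta(minutes=5)


def floor_to_bin(ts: datetime, width: timedelta = BIN_WIDTH) -> datetime:
    """Round a timestamp down to the start of its aggregation window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // width) * width


def average_by_bin(readings: Iterable[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Average all readings that fall into the same 5-minute window."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [running total, count]
    for ts, speed in readings:
        acc = sums[floor_to_bin(ts)]
        acc[0] += speed
        acc[1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


# Hypothetical raw readings from a single detector (illustration only).
readings = [
    (datetime(2020, 1, 1, 0, 0, 30), 60.0),
    (datetime(2020, 1, 1, 0, 3, 0), 64.0),
    (datetime(2020, 1, 1, 0, 7, 15), 50.0),
]
averages = average_by_bin(readings)
for start, mean_speed in averages.items():
    print(start.isoformat(), mean_speed)

# The instance count listed for METR-LA is sensors multiplied by timesteps.
total_readings = 207 * 34_272
print(total_readings)
```

The last two lines only check the arithmetic in the table: 207 sensors times 34,272 timesteps gives the 7,094,304 instances listed.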

Internet

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [429] | V. Granville
Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [430] [431] | N. Kushmerick
Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [432] | D. Cook
URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [433] [434] | J. Ma
Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [435] | R. Mustafa et al.
Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [436] | D. Chen
Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [437] [438] | Freebase
Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [439] [440] | C. Masterharm et al.
The Pile | An assembly of several large datasets of diverse and unstructured texts. | Various (removing HTML and JavaScript from websites, removing duplicated sentences) | 825 GiB English text | JSON Lines [441] [442] | Natural Language Processing, Text Prediction | 2021 | [443] | Gao et al.
OSCAR | Large collection of monolingual corpora extracted from web data (Common Crawl dumps) covering 150+ languages. | Various (filtering, language classification, adult-content detection and other labelling) | 3.4 TB English text, 1.4 TB Chinese text, 1.1 TB Russian text, 595 MB German text, 431 MB French text, and data for 150+ languages (figures for version 23.01) | JSON Lines [444] | Natural Language Processing, Text Prediction | 2021 | [445] [446] | Ortiz Suarez, Abadji, Sagot et al.
OpenWebText | An open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. | Extracted non-HTML content, deduplicated, and tokenized. | 8,013,769 documents, 38 GB | Text | Natural Language Processing, Text Prediction | 2019 | [447] [448] | A. Gokaslan, V. Cohen
ROOTS | A well-documented and representative multilingual dataset with the explicit goal of doing good for and by the people whose data was collected. | Extracted non-HTML content, cleaned out UI and ads, deduplicated, removed PII, and tokenized. | 1.6 TB, 59 languages. | Parquet | Natural Language Processing, Text Prediction | 2022 | [449] [450] | H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao
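
Several corpora in this table are distributed as JSON Lines and list deduplication among their preprocessing steps. The sketch below reads JSON Lines records and drops exact duplicates by hashing the text. The field name "text" and the sample records are assumptions made for illustration, and exact-match hashing is a simplified stand-in, not the deduplication method documented for any of these corpora.

```python
import hashlib
import io
import json
from typing import Iterable, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def deduplicate(records: Iterable[dict], field: str = "text") -> Iterator[dict]:
    """Keep the first record for each distinct value of `field` (exact match only)."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(record[field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Hypothetical in-memory sample standing in for a JSON Lines file.
sample = io.StringIO(
    '{"text": "Hello world"}\n'
    '{"text": "Another document"}\n'
    "\n"
    '{"text": "Hello world"}\n'
)
unique_docs = list(deduplicate(read_jsonl(sample)))
print(len(unique_docs))
```

Because both functions are generators, records are processed one at a time, which matters for files far larger than memory; only the set of digests grows with the number of distinct documents.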

Games

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Poker Hand Dataset | 5-card hands from a standard 52-card deck. | Attributes of each hand are given, including the poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [451] | R. Cattral
Connect-4 Dataset | Contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [452] | J. Tromp
Chess (King-Rook vs. King) Dataset | Endgame database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [453] [454] | M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [455] | R. Holte
Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [456] | D. Aha
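
The 958 instances listed for the Tic-Tac-Toe Endgame Dataset can be related to the game itself. The sketch below enumerates every distinct final board reachable when play stops at the first win or at a full board. It assumes that "x" moves first; the cell symbols "x", "o" and "b" (blank) and the function names are choices made for this illustration.

```python
from typing import Optional, Set, Tuple

Board = Tuple[str, ...]  # nine cells in row-major order

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: Board) -> Optional[str]:
    """Return "x" or "o" if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards() -> Set[Board]:
    """Collect all distinct boards at which a game ends, with "x" moving first."""
    finished: Set[Board] = set()
    visited: Set[Board] = set()

    def play(board: Board, player: str) -> None:
        if board in visited:
            return
        visited.add(board)
        if winner(board) is not None or "b" not in board:
            finished.add(board)
            return
        next_player = "o" if player == "x" else "x"
        for i, cell in enumerate(board):
            if cell == "b":
                play(board[:i] + (player,) + board[i + 1:], next_player)

    play(("b",) * 9, "x")
    return finished


boards = terminal_boards()
x_wins = sum(1 for b in boards if winner(b) == "x")
print(len(boards))  # 958
print(x_wins)       # 626
```

Under these assumptions the enumeration yields 958 final boards, matching the instance count in the table, of which 626 are wins for "x".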

Other multivariate

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [457] | D. Harrison et al.
The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [458] | Getty Center
Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [459] [460] | Chu et al.
British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [461] | British Oceanographic Data Centre
Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [462] | J. Schlimmer
Entree Chicago Recommendation Dataset | Record of user interactions with the Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [463] | R. Burke
Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [464] [465] | P. van der Putten
Nursery Dataset | Data from applicants to nursery schools. | Data about the applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [466] [467] | V. Rajkovic et al.
University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [468] | S. Sounders et al.
Blood Transfusion Service Center Dataset | Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [469] [470] | I. Yeh
Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [471] [472] | University of Mainz
Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [473] [474] | Nomao Labs
Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [475] | G. Wiederhold
Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | [476] [477] | J. Kuzilek et al.
Mobile phone records | Telecommunications activity and interactions. | Aggregation per geographical grid cell and every 15 minutes. | large | Text | Classification, clustering, regression | 2015 | [478] | G. Barlacchi et al.
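
The Record Linkage Comparison Patterns row mentions a blocking procedure that selects only certain record pairs for comparison. The sketch below illustrates the general idea: records are grouped by a cheap key, and only records sharing a key are paired. The key (first letter of the surname plus birth year), the field names and the sample records are hypothetical and do not describe the procedure used to build that dataset.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, List, Sequence, Tuple


def blocking_key(record: Dict[str, object]) -> Hashable:
    """Hypothetical blocking key: surname initial plus year of birth."""
    return (str(record["surname"])[:1].upper(), record["birth_year"])


def candidate_pairs(records: Sequence[Dict[str, object]]) -> List[Tuple[int, int]]:
    """Return index pairs of records that share a blocking key."""
    blocks: Dict[Hashable, List[int]] = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs: List[Tuple[int, int]] = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


# Invented records (illustration only).
records = [
    {"surname": "Meyer", "birth_year": 1970},
    {"surname": "Meier", "birth_year": 1970},
    {"surname": "Schmidt", "birth_year": 1982},
    {"surname": "Schmitt", "birth_year": 1982},
    {"surname": "Meyer", "birth_year": 1982},
]
pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs, "of", all_pairs, "possible pairs")
```

Blocking trades recall for speed: pairs whose key fields disagree are never compared, so the choice of key decides which true matches can still be found.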

Curated repositories of datasets

Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating and standardizing dataset formats to make them easier to use for machine learning research.

  • OpenML:[479] Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
  • PMLB:[480] A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
  • Metatext NLP: https://metatext.io/datasets A community-maintained web repository containing nearly 1000 benchmark datasets and counting. Covers many tasks, from classification to QA, and various languages, from English and Portuguese to Arabic.
  • Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.[481] [482]

Notes and References

  1. Web site: Datasets Over Algorithms. Edge.com. 8 January 2016. Wissner-Gross. A..
  2. Weiss . G. M. . Provost . F. . Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction . Journal of Artificial Intelligence Research . AI Access Foundation . 19 . September 1, 2003 . 1076-9757 . 10.1613/jair.1199 . 315–354 . 2344521 .
  3. Turney . Peter . Types of cost in inductive concept learning . 2000 . cs/0212034.
  4. Book: Abney, Steven. Semisupervised Learning for Computational Linguistics. September 17, 2007. CRC Press. 978-1-4200-1080-0.
  5. Book: Žliobaitė . Indrė . Bifet . Albert . Pfahringer . Bernhard . Holmes . Geoff . Lecture Notes in Computer Science . 6913 . Machine Learning and Knowledge Discovery in Databases . Active Learning with Evolving Streaming Data . Springer Berlin Heidelberg . Berlin, Heidelberg . 2011 . 978-3-642-23807-9 . 0302-9743 . 10.1007/978-3-642-23808-6_39 . 597–612.
  6. 1506.04757 . McAuley . Julian . Targett . Christopher . Shi . Qinfeng . Anton van den Hengel . Image-based Recommendations on Styles and Substitutes . 2015 . cs.CV .
  7. Web site: Amazon review data. 2021-10-08. nijianmo.github.io.
  8. Ganesan . Kavita . Zhai . Chengxiang . 2012 . Opinion-based entity ranking . Information Retrieval . 15 . 2. 116–150 . 10.1007/s10791-011-9174-8. 2142/15252 . 16258727 . free .
  9. Lv, Yuanhua, Dimitrios Lymberopoulos, and Qiang Wu. "An exploration of ranking heuristics in mobile local search." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012.
  10. Harper . F. Maxwell . Konstan . Joseph A. . 2015 . The MovieLens Datasets: History and Context . ACM Transactions on Interactive Intelligent Systems . 5 . 4. 19 . 10.1145/2827872 . 16619709 .
  11. Koenigstein, Noam, Gideon Dror, and Yehuda Koren. "Yahoo! music recommendations: modeling music ratings with temporal dynamics and item taxonomy." Proceedings of the fifth ACM conference on Recommender systems. ACM, 2011.
  12. McFee, Brian, et al. "The million song dataset challenge." Proceedings of the 21st international conference companion on World Wide Web. ACM, 2012.
  13. Bohanec, Marko, and Vladislav Rajkovic. "Knowledge acquisition and explanation for multi-attribute decision making." 8th Intl Workshop on Expert Systems and their Applications. 1988.
  14. Tan, Peter J., and David L. Dowe. "MML inference of decision graphs with multi-way joins." Australian Joint Conference on Artificial Intelligence. 2002.
  15. Web site: Quantifying comedy on YouTube: why the number of o's in your LOL matter. Metatext NLP Database. 2020-10-26.
  16. Book: https://link.springer.com/chapter/10.1007/978-3-642-32692-9_63. 10.1007/978-3-642-32692-9_63. A Classifier for Big Data. Convergence and Hybrid Information Technology. Communications in Computer and Information Science. 2012. Kim. Byung Joo. 310. 505–512. 978-3-642-32691-2.
  17. Pérezgonzález . Jose D. . Gilbey . Andrew . 2011 . Predicting Skytrax airport rankings from customer reviews . Journal of Airport Management . 5 . 4. 335–339 . 10.69554/RFZC4321 .
  18. Loh, Wei-Yin, and Yu-Shan Shih. "Split selection methods for classification trees." Statistica Sinica (1997): 815–840.
  19. Lim . Tjen-Sien . Loh . Wei-Yin . Shih . Yu-Shan . 2000 . A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms . Machine Learning . 40 . 3. 203–228 . 10.1023/a:1007608224229. 17030953 .
  20. Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, Ngan Luu-Thuy Nguyen. "UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis".
  21. Book: https://link.springer.com/chapter/10.1007/978-981-15-6168-9_27. 10.1007/978-981-15-6168-9_27. Emotion Recognition for Vietnamese Social Media Text. Computational Linguistics. Communications in Computer and Information Science. 2020. Ho. Vong Anh. Nguyen. Duong Huynh-Cong. Nguyen. Danh Hoang. Pham. Linh Thi-Van. Nguyen. Duc-Vu. Nguyen. Kiet Van. Nguyen. Ngan Luu-Thuy. 1215. 319–333. 1911.09339. 978-981-15-6167-2. 208202333.
  22. Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. 2104.11969 . Vietnamese Open-domain Complaint Detection in E-Commerce Websites. 24 April 2021. cs.CL .
  23. Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. 2301.10186 . ViHOS: Hate Speech Spans Detection for Vietnamese. 26 January 2023. cs.CL .
  24. Dermouche . Mohamed . Velcin . Julien . Khouas . Leila . Loudcher . Sabine . 2014 IEEE International Conference on Data Mining . A Joint Model for Topic-Sentiment Evolution over Time . IEEE . 2014 . 773–778 . 978-1-4799-4302-9 . 10.1109/icdm.2014.82 .
  25. Rose . Tony . Stevenson . Mark . Whitehead . Miles . The Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's Language Resources . https://web.archive.org/web/20190806015015/https://pdfs.semanticscholar.org/3e4b/dc7f8904c58f8fce199389299ec1ed8e1226.pdf . dead . 2019-08-06 . LREC . 2 . 2002 . 9239414 .
  26. Amini . Massih R. . Usunier . Nicolas . Goutte . Cyril . Learning from Multiple Partially Observed Views – an Application to Multilingual Text Categorization . 2009 . 28–36 . Advances in Neural Information Processing Systems. 22 .
  27. Liu . Ming . etal . VRCA: a clustering algorithm for massive amount of texts . Proceedings of the 24th International Conference on Artificial Intelligence . AAAI Press . 2015 . 6 August 2019 . 5 November 2021 . https://web.archive.org/web/20211105004605/https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/download/10903/10990 . dead .
  28. Al-Harbi . S . Almuhareb . A . Al-Thubaity . A . Khorsheed . M. S. . Al-Rajeh . A . 2008 . Automatic Arabic Text Classification . Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data, Lyon, France.
  29. Web site: Relationship and Entity Extraction Evaluation Dataset: Dstl/re3d. GitHub. 2018-12-17.
  30. Web site: The Examiner – SpamClickBait Catalogue.
  31. Web site: A Million News Headlines.
  32. Web site: One Week of Global News Feeds.
  33. Web site: IrishTimes – the Waxy-Wany News.
  34. Web site: News Headlines Dataset For Sarcasm Detection. kaggle.com. 2019-04-27.
  35. Klimt, Bryan, and Yiming Yang. "Introducing the Enron Corpus." CEAS. 2004.
  36. 0806.3201 . Kossinets . Gueorgi . Kleinberg . Jon . Watts . Duncan . The Structure of Information Pathways in a Social Communication Network . 2008 . physics.soc-ph .
  37. cs/0006013 . Androutsopoulos . Ion . Koutsias . John . Chandrinos . Konstantinos V. . Paliouras . George . Spyropoulos . Constantine D. . 2000 . An evaluation of Naive Bayesian anti-spam filtering . Proceedings of the Workshop on Machine Learning in the New Information Age . 11th European Conference on Machine Learning, Barcelona, Spain . G. . Potamias . V. . Moustakis . M. . van Someren . 11 . 9–17 . 2000cs........6013A.
  38. Bratko . Andrej . et al . 2006 . Spam filtering using statistical data compression models . The Journal of Machine Learning Research . 7 . 2673–2698 .
  39. Almeida, Tiago A., José María G. Hidalgo, and Akebo Yamakami. "Contributions to the study of SMS spam filtering: new collection and results." Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011.
  40. Delany . Jane . Sarah . Buckley . Mark . Greene . Derek . 2012 . SMS spam filtering: methods and data . Expert Systems with Applications . 39 . 10. 9899–9908 . 10.1016/j.eswa.2012.02.053. 15546924 .
  41. Joachims, Thorsten. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. No. CMU-CS-96-118. Carnegie Mellon University, Pittsburgh, PA, Department of Computer Science, 1996.
  42. Dimitrakakis, Christos, and Samy Bengio. Online Policy Adaptation for Ensemble Algorithms. No. EPFL-REPORT-82788. IDIAP, 2002.
  43. Dooms, S. et al. "Movietweetings: a movie rating dataset collected from twitter, 2013. Available from https://github.com/sidooms/MovieTweetings."
  44. Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval. 1703.06618. RoyChowdhury. Aruni. Lin. Tsung-Yu. Maji. Subhransu. Learned-Miller. Erik. cs.CV. 2017.
  45. Web site: huyt16/Twitter100k. GitHub. en. 2018-03-26.
  46. Go . Alec . Bhayani . Richa . Huang . Lei . 2009 . Twitter sentiment classification using distant supervision . CS224N Project Report, Stanford . 1 . 12 .
  47. Chikersal, Prerna, Soujanya Poria, and Erik Cambria. "SeNTU: sentiment analysis of tweets by combining a rule-based classifier with supervised learning." Proceedings of the International Workshop on Semantic Evaluation, SemEval. 2015.
  48. Zafarani, Reza, and Huan Liu. "Social computing data repository at ASU." School of Computing, Informatics and Decision Systems Engineering, Arizona State University (2009).
  49. Data Science Course by DataTrained Education "IBM Certified Data Science Course." IBM Certified Online Data Science Course
  50. McAuley . Julian J. . Leskovec . Jure . Learning to Discover Social Circles in Ego Networks . NIPS . 2012 . 2012 .
  51. Šubelj . Lovro . Fiala . Dalibor . Bajec . Marko . Network-based statistical comparison of citation topology of bibliographic databases . Scientific Reports . 4 . 6496. 6496 . 2014 . 10.1038/srep06496. 25263231 . 4178292 . 1502.05061 . 2014NatSR...4E6496S .
  52. Abdulla, N., et al. "Arabic sentiment analysis: Corpus-based and lexicon-based." Proceedings of the IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT). 2013.
  53. Abooraig, Raddad, et al. "On the automatic categorization of Arabic articles based on their political orientation." Third International Conference on Informatics Engineering and Information Science (ICIEIS2014). 2014.
  54. Kawala, François, et al. "Prédictions d'activité dans les réseaux sociaux en ligne." 4ième conférence sur les modèles et l'analyse des réseaux: Approches mathématiques et informatiques. 2013.
  55. 1601.00024. Sabharwal. Ashish. Selecting Near-Optimal Learners via Incremental Data Allocation. Samulowitz. Horst. Tesauro. Gerald. cs.LG. 2015.
  56. Xu et al. "SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)" Proceedings of the 9th International Workshop on Semantic Evaluation. 2015.
  57. Xu et al. "Extracting Lexically Divergent Paraphrases from Twitter" Transactions of the Association for Computational Linguistics (TACL). 2014.
  58. 10.1109/MIS.2013.126. Real-Time Crisis Mapping of Natural Disasters Using Social Media. IEEE Intelligent Systems. 29. 2. 9–17. 2014. Middleton. Stuart E. Middleton. Lee. Modafferi. Stefano. 15139204.
  59. Web site: geoparsepy. 2016. Python PyPI library
  60. Book: Shmueli . Boaz . Ku . Lun-Wei . Ray . Soumya . Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) . Reactive Supervision: A New Method for Collecting Sarcasm Data . 2020 . https://aclanthology.org/2020.emnlp-main.201/ . Association for Computational Linguistics . 2553–2559 . 10.18653/v1/2020.emnlp-main.201. 221970454 .
  61. Web site: Shmueli . Boaz . SPIRS Sarcasm Dataset . GitHub.
  62. Web site: Dutch social media collection. Gupta, Aakash. COVID-19 Data Hub. 2020. 11 November 2023. 10.5072/FK2/MTPTL7.
  63. Web site: Streamlit. 2020-12-18. huggingface.co.
  64. Web site: Dutch Social media collection. 2020-12-18. kaggle.com. en.
  65. Book: Shmueli . Boaz . Ray . Soumya . Ku . Lun-Wei . Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) . Happy Dance, Slow Clap: Using Reaction GIFs to Predict Induced Affect on Twitter . 2021 . Association for Computational Linguistics . 395–401 . 10.18653/v1/2021.acl-short.50. 235125510 .
  66. Forsyth, E., Lin, J., & Martell, C. (2008, June 25). The NPS Chat Corpus. Retrieved from http://faculty.nps.edu/cmartell/NPSChat.htm
  67. 1506.06714 . Sordoni . Alessandro . Galley . Michel . Auli . Michael . Brockett . Chris . Ji . Yangfeng . Mitchell . Margaret . Nie . Jian-Yun . Gao . Jianfeng . Dolan . Bill . A Neural Network Approach to Context-Sensitive Generation of Conversational Responses . 2015 . cs.CL .
  68. Shaoul, C. & Westbury C. (2013) A reduced redundancy USENET corpus (2005–2011) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
  69. KAN, M. (2011, January). NUS Short Message Service (SMS) Corpus. Retrieved from http://www.comp.nus.edu.sg/entrepreneurship/innovation/osr/corpus/
  70. Stuck_In_the_Matrix. (2015, July 3). I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB compressed. Any interest in this? [Original post]. Message posted to https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
  71. 1506.08909 . Lowe . Ryan . Pow . Nissan . Serban . Iulian . Pineau . Joelle . The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems . 2015 . cs.CL .
  72. Jason Williams, Antoine Raux, Matthew Henderson. "The Dialog State Tracking Challenge Series: A Review". Dialogue & Discourse. https://www.microsoft.com/en-us/research/publication/the-dialog-state-tracking-challenge-series-a-review/
  73. Book: Zheng . Lucia . Guha . Neel . Anderson . Brandon R. . Henderson . Peter . Ho . Daniel E. . Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law . When does pretraining help? . 2021-06-21 . http://dx.doi.org/10.1145/3462757.3466088 . 159–168 . New York, NY, USA . ACM . 10.1145/3462757.3466088. 9781450385268 . 233296302 .
  74. Web site: pile-of-law/pile-of-law · Datasets at Hugging Face . 2023-01-11 . huggingface.co. 4 July 2022 .
  75. Web site: About Caselaw Access Project . 2023-01-11 . case.law . en.
  76. K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification", 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371.
  77. K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "Web of Science Dataset",
  78. Galgani, Filippo, Paul Compton, and Achim Hoffmann. "Combining different summarization techniques for legal text." Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 2012.
  79. Nagwani . N. K. . 2015 . Summarizing large text collection using topic modeling and clustering based on MapReduce framework . Journal of Big Data . 2 . 1. 1–18 . 10.1186/s40537-015-0020-5. free .
  80. Schler . Jonathan . et al . Effects of Age and Gender on Blogging . AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs . 6 . 2006 . 6 August 2019 . 14 November 2020 . https://web.archive.org/web/20201114000329/https://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-039.pdf . dead .
  81. Anand, Pranav, et al. "Believe Me-We Can Do This! Annotating Persuasive Acts in Blog Text." Computational Models of Natural Argument. 2011.
  82. Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. "Social structure of Facebook networks." Physica A: Statistical Mechanics and its Applications 391.16 (2012): 4165–4180.
  83. 1206.6474. Richard. Emile. Estimation of Simultaneously Sparse and Low Rank Matrices. Savalle. Pierre-Andre. Vayatis. Nicolas. cs.DS. 2012.
  84. Richardson . Matthew . Burges . Christopher JC . Renshaw . Erin . MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text . EMNLP . 1 . 2013 .
  85. 1502.05698. Weston. Jason. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. Bordes. Antoine. Chopra. Sumit. Rush. Alexander M.. Bart van Merriënboer. Joulin. Armand. Mikolov. Tomas. cs.AI. 2015.
  86. Marcus . Mitchell P. . Ann Marcinkiewicz . Mary . Santorini . Beatrice . 1993 . Building a large annotated corpus of English: The Penn Treebank . Computational Linguistics . 19 . 2. 313–330 .
  87. Collins . Michael . 2003 . Head-driven statistical models for natural language parsing . Computational Linguistics . 29 . 4. 589–637 . 10.1162/089120103322753356. free .
  88. Guyon, Isabelle, et al., eds. Feature extraction: foundations and applications. Vol. 207. Springer, 2008.
  89. Lin, Yuri, et al. "Syntactic annotations for the google books ngram corpus." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.
  90. Krishnamoorthy, Niveda; et al. (2013). "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge". AAAI. 1. Archived from the original (https://web.archive.org/web/20190806022756/https://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/download/6454/7204) on 6 August 2019. Retrieved 6 August 2019.
  91. Luyckx, Kim, and Walter Daelemans. "Personae: a Corpus for Author and Personality Prediction from Text." LREC. 2008.
  92. Solorio, Thamar, Ragib Hasan, and Mainul Mizan. "A case study of sockpuppet detection in wikipedia." Workshop on Language Analysis in Social Media (LASM) at NAACL HLT. 2013.
  93. "Pushshift Files". files.pushshift.io. Archived from the original (https://web.archive.org/web/20230112015822/https://files.pushshift.io/) on 12 January 2023. Retrieved 2023-01-12.
  94. Baumgartner, Jason; Zannettou, Savvas; Keegan, Brian; Squire, Megan; Blackburn, Jeremy (2020-01-23). "The Pushshift Reddit Dataset". arXiv:2001.08435 [cs.SI].
  95. Ciarelli, Patrick Marques, and Elias Oliveira. "Agglomeration and elimination of terms for dimensionality reduction." Intelligent Systems Design and Applications, 2009. ISDA'09. Ninth International Conference on. IEEE, 2009.
  96. Zhou, Mingyuan, Oscar Hernan Madrid Padilla, and James G. Scott. "Priors for random count matrices derived from a family of negative binomial processes." Journal of the American Statistical Association just-accepted (2015): 00–00.
  97. Kotzias, Dimitrios, et al. "From group to individual labels using deep features." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
  98. Ning, Yue; Muthiah, Sathappan; Rangwala, Huzefa; Ramakrishnan, Naren (2016). "Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning". arXiv:1602.08033 [cs.SI].
  99. Buza, Krisztian. "Feedback prediction for blogs." Data analysis, machine learning and knowledge discovery. Springer International Publishing, 2014. 145–152.
  100. Soysal . Ömer M . 2015 . Association rule mining with mostly associated sequential patterns . Expert Systems with Applications . 42 . 5. 2582–2592 . 10.1016/j.eswa.2014.10.049.
  101. Zhu, Yukun, et al. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books." Proceedings of the IEEE international conference on computer vision. 2015.
  102. Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D. (2015). "A large annotated corpus for learning natural language inference". arXiv:1508.05326 [cs.CL].
  103. Web site: DSL Corpus Collection. ttg.uni-saarland.de. 2017-09-22.
  104. Web site: Urban Dictionary Words and Definitions.
  105. H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, E. Simperl, "T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples", Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
  106. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
  107. Computers Are Learning to Read—But They're Still Not So Smart . 29 December 2019 . Wired . en.
  108. Web site: GLUE Benchmark. gluebenchmark.com. en. 2019-02-25.
  109. Web site: Quan . Hoang Lam . Quang . Duy Le . Van Kiet . Nguyen . Ngan . Luu-Thuy Nguyen. . UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning.
  110. Book: To . Quoc Huy . Nguyen . Van Kiet . Nguyen . Luu Thuy Ngan . Nguyen . Gia Tuan Anh . Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques . 2020 . 55–60 . 10.1145/3443279.3443309 . 2010.10852 . 9781450377607 . 224814110 .
  111. Book: Nguyen. Luan Thanh. Van Nguyen. Kiet. Nguyen. Ngan Luu-Thuy. 2021-03-18. Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices. Constructive and Toxic Speech Detection for Open-Domain Social Media Comments in Vietnamese. Lecture Notes in Computer Science. 12798. 572–583. 10.1007/978-3-030-79457-6_49. 2103.10069. 978-3-030-79456-9. 232269671.
  112. Saxton, David, et al. "Analysing Mathematical Reasoning Abilities of Neural Models." International Conference on Learning Representations. 2018.
  113. M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux (2015). "The Zero Resource Speech Challenge 2015," in INTERSPEECH-2015.
  114. M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, (2016). "The Zero Resource Speech Challenge 2015: Proposed Approaches and Results," in SLTU-2016.
  115. Sakar . Betul Erdogdu . et al . 2013 . Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings . IEEE Journal of Biomedical and Health Informatics. 17 . 4. 828–834 . 10.1109/jbhi.2013.2245674. 25055311 . 15491516 .
  116. Zhao, Shunan, et al. "Automatic detection of expressed emotion in Parkinson's disease." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
  117. Used in: Hammami, Nacereddine, and Mouldi Bedda. "Improved tree model for Arabic speech recognition." Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on. Vol. 5. IEEE, 2010.
  118. Maaten, Laurens. "Learning discriminative fisher kernels." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
  119. Cole, Ronald, and Mark Fanty. "Spoken letter recognition." Proc. Third DARPA Speech and Natural Language Workshop. 1990.
  120. Chapelle . Olivier . Sindhwani . Vikas . Keerthi . Sathiya S. . 2008 . Optimization techniques for semi-supervised support vector machines . The Journal of Machine Learning Research . 9 . 203–233 .
  121. Kudo . Mineichi . Toyama . Jun . Shimbo . Masaru . 1999 . Multidimensional curve classification using passing-through regions . Pattern Recognition Letters . 20 . 11. 1103–1111 . 10.1016/s0167-8655(99)00077-x. 1999PaReL..20.1103K . 10.1.1.46.2515 .
  122. Jaeger . Herbert . et al . 2007 . Optimization and applications of echo state networks with leaky-integrator neurons . Neural Networks . 20 . 3. 335–352 . 10.1016/j.neunet.2007.04.016. 17517495 .
  123. Tsanas . Athanasios . et al . 2010 . Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests . IEEE Transactions on Biomedical Engineering. 57 . 4. 884–893 . 10.1109/tbme.2009.2036000. 19932995 . 7382779 . Submitted manuscript .
  124. Clifford . Gari D. . Clifton . David . 2012 . Wireless technology in disease management and medicine . Annual Review of Medicine . 63 . 479–492 . 10.1146/annurev-med-051210-114650. 22053737 .
  125. Zue . Victor . Seneff . Stephanie . Glass . James . 1990 . Speech database development at MIT: TIMIT and beyond . Speech Communication . 9 . 4. 351–356 . 10.1016/0167-6393(90)90010-7.
  126. Kapadia, Sadik, Valtcho Valtchev, and S. J. Young. "MMI training for continuous phoneme recognition on the TIMIT database." Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on. Vol. 2. IEEE, 1993.
  127. Halabi . Nawar . 2016 . Modern Standard Arabic Phonetics for Speech Synthesis . PhD Thesis . University of Southampton, School of Electronics and Computer Science.
  128. Ardila, Rosana; Branson, Megan; Davis, Kelly; Henretty, Michael; Kohler, Michael; Meyer, Josh; Morais, Reuben; Saunders, Lindsay; Tyers, Francis M.; Weber, Gregor (Dec 13, 2019). "Common Voice: A Massively-Multilingual Speech Corpus". arXiv:1912.06670v2 [cs.CL].
  129. Web site: The LJ Speech Dataset . 2022-04-13 . keithito.com.
  130. Ghandoura . Abdulkader . Hjabo . Farouk . Al Dakkak . Oumayma . June 2021 . Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting . Engineering Applications of Artificial Intelligence . 102 . 104267 . 10.1016/j.engappai.2021.104267 . 235637809 . 0952-1976.
  131. Zhou, Fang, Q. Claire, and Ross D. King. "Predicting the geographical origin of music." Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014.
  132. Saccenti . Edoardo . Camacho . José . 2015 . On the use of the observation-wise k-fold operation in PCA cross-validation . Journal of Chemometrics . 29 . 8. 467–478 . 10.1002/cem.2726. 10481/55302 . 62248957 . free .
  133. Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, 24–28 October 2011, Miami, Florida. University of Miami, 2011.
  134. Henaff . Mikael . et al . Unsupervised learning of sparse features for scalable audio classification . ISMIR . 11 . 2011 .
  135. Book: 10.5281/zenodo.1117372. MUSDB18 – a corpus for music separation . 2017 . Rafii . Zafar . Music .
  136. Defferrard, Michaël; Benzi, Kirell; Vandergheynst, Pierre; Bresson, Xavier (6 December 2016). "FMA: A Dataset For Music Analysis". arXiv:1612.01840 [cs.SD].
  137. Esposito . Roberto . Radicioni . Daniele P. . 2009 . Carpediem: Optimizing the viterbi algorithm and applications to supervised sequential learning . The Journal of Machine Learning Research . 10 . 1851–1880 .
  138. Sourati . Jamshid . et al . 2016 . Classification Active Learning Based on Mutual Information . Entropy . 18 . 2. 51 . 10.3390/e18020051. 2016Entrp..18...51S . free .
  139. Salamon, Justin; Jacoby, Christopher; Bello, Juan Pablo. "A dataset and taxonomy for urban sound research." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
  140. Lagrange, Mathieu; Lafay, Grégoire; Rossignol, Mathias; Benetos, Emmanouil; Roebel, Axel (2015). "An evaluation framework for event detection using a morphological model of acoustic scenes". arXiv:1502.00141 [stat.ML].
  141. Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2017.
  142. News: Watch out, birders: Artificial intelligence has learned to spot birds from their songs . 22 July 2018 . Science AAAS . 18 July 2018 . en.
  143. Web site: Bird Audio Detection challenge . . 22 July 2018 . 3 May 2016.
  144. Wichern, Gordon; Antognini, Joe; Flynn, Michael; Licheng Richard Zhu; McQuinn, Emmett; Crow, Dwight; Manilow, Ethan; Jonathan Le Roux (2019). "WHAM!: Extending Speech Separation to Noisy Environments". arXiv:1907.01160 [cs.SD].
  145. Drossos, K., Lipping, S., and Virtanen, T. "Clotho: An Audio Captioning Dataset" IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2020.
  146. Drossos, K., Lipping, S., and Virtanen, T. (2019). Clotho dataset (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3490684
  147. The CAIDA UCSD Dataset on the Witty Worm – 19–24 March 2004, http://www.caida.org/data/passive/witty_worm_dataset.xml
  148. Chen, Zesheng, and Chuanyi Ji. "Optimal worm-scanning method using vulnerable-host distributions." International Journal of Security and Networks 2.1–2 (2007): 71–80.
  149. Kachuee, Mohamad, et al. "Cuff-less high-accuracy calibration-free blood pressure estimation using pulse transit time." Circuits and Systems (ISCAS), 2015 IEEE International Symposium on. IEEE, 2015.
  150. PhysioBank, PhysioToolkit. "PhysioNet: components of a new research resource for complex physiologic signals." Circulation. v101 i23. e215-e220.
  151. Vergara . Alexander . et al . 2012 . Chemical gas sensor drift compensation using classifier ensembles . Sensors and Actuators B: Chemical . 166 . 320–329 . 10.1016/j.snb.2012.01.074. 2012SeAcB.166..320V .
  152. Korotcenkov . G. . Cho . B. K. . 2014 . Engineering approaches to improvement of conductometric gas sensor parameters. Part 2: Decrease of dissipated (consumable) power and improvement stability and reliability . Sensors and Actuators B: Chemical . 198 . 316–341 . 10.1016/j.snb.2014.03.069. 2014SeAcB.198..316K .
  153. Quinlan . John R . Learning with continuous classes . 5th Australian Joint Conference on Artificial Intelligence . 92 . 1992 .
  154. Merz . Christopher J. . Pazzani . Michael J. . 1999 . A principal components approach to combining regression estimates . Machine Learning . 36 . 1–2. 9–32 . 10.1023/a:1007507221352. free .
  155. Torres-Sospedra, Joaquin, et al. "UJIIndoorLoc-Mag: A new database for magnetic field-based localization problems." Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on. IEEE, 2015.
  156. Berkvens, Rafael, Maarten Weyn, and Herbert Peremans. "Mean Mutual Information of Probabilistic Wi-Fi Localization." Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on. Banff, Canada: IPIN. 2015.
  157. Paschke, Fabian, et al. "Sensorlose Zustandsüberwachung an Synchronmotoren." Proceedings. 23. Workshop Computational Intelligence, Dortmund, 5.-6. Dezember 2013. KIT Scientific Publishing, 2013.
  158. Lessmeier, Christian, et al. "Data Acquisition and Signal Analysis from Measured Motor Currents for Defect Detection in Electromechanical Drive Systems."
  159. Ugulino, Wallace, et al. "Wearable computing: Accelerometers’ data classification of body postures and movements." Advances in Artificial Intelligence-SBIA 2012. Springer Berlin Heidelberg, 2012. 52–61.
  160. Schneider . Jan . et al . 2015 . Augmenting the senses: a review on sensor-based learning support . Sensors . 15 . 2. 4097–4133 . 10.3390/s150204097. 25679313 . 4367401 . 2015Senso..15.4097S . free .
  161. Madeo, Renata CB, Clodoaldo AM Lima, and Sarajane M. Peres. "Gesture unit segmentation using support vector machines: segmenting gestures from rest positions." Proceedings of the 28th Annual ACM Symposium on Applied Computing. ACM, 2013.
  162. Lun . Roanna . Zhao . Wenbing . 2015 . A survey of applications and human motion recognition with Microsoft Kinect . International Journal of Pattern Recognition and Artificial Intelligence . 29 . 5. 1555008 . 10.1142/s0218001415550083.
  163. Theodoridis, Theodoros, and Huosheng Hu. "Action classification of 3d human models using dynamic ANNs for mobile robot surveillance." Robotics and Biomimetics, 2007. ROBIO 2007. IEEE International Conference on. IEEE, 2007.
  164. Etemad, Seyed Ali, and Ali Arya. "3D human action recognition and style transformation using resilient backpropagation neural networks." Intelligent Computing and Intelligent Systems, 2009. ICIS 2009. IEEE International Conference on. Vol. 4. IEEE, 2009.
  165. Altun . Kerem . Barshan . Billur . Tunçel . Orkun . 2010 . Comparative study on classifying human activities with miniature inertial and magnetic sensors . Pattern Recognition . 43 . 10. 3605–3620 . 10.1016/j.patcog.2010.04.019. 2010PatRe..43.3605A . 11693/11947 . free .
  166. Nathan . Ran . et al . 2012 . Using tri-axial acceleration data to identify behavioral modes of free-ranging animals: general concepts and tools illustrated for griffon vultures . The Journal of Experimental Biology . 215 . 6. 986–996 . 10.1242/jeb.058602. 22357592 . 3284320 .
  167. Anguita, Davide, et al. "Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine." Ambient assisted living and home care. Springer Berlin Heidelberg, 2012. 216–223.
  168. Su . Xing . Tong . Hanghang . Ji . Ping . 2014 . Activity recognition with smartphone sensors . Tsinghua Science and Technology . 19 . 3. 235–249 . 10.1109/tst.2014.6838194. 62751498 .
  169. Kadous, Mohammed Waleed. Temporal classification: Extending the classification paradigm to multivariate time series. Diss. The University of New South Wales, 2002.
  170. Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
  171. Velloso, Eduardo, et al. "Qualitative activity recognition of weight lifting exercises." Proceedings of the 4th Augmented Human International Conference. ACM, 2013.
  172. Mortazavi, Bobak Jack, et al. "Determining the single best axis for exercise repetition recognition and counting on smartwatches." Wearable and Implantable Body Sensor Networks (BSN), 2014 11th International Conference on. IEEE, 2014.
  173. Sapsanis, Christos, et al. "Improving EMG based Classification of basic hand movements using EMD." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.
  174. Andrianesis . Konstantinos . Tzes . Anthony . 2015 . Development and control of a multifunctional prosthetic hand with shape memory alloy actuators . Journal of Intelligent & Robotic Systems . 78 . 2. 257–289 . 10.1007/s10846-014-0061-6. 207174078 .
  175. Banos . Oresti . et al . 2014 . Dealing with the effects of sensor displacement in wearable activity recognition . Sensors . 14 . 6. 9995–10023 . 10.3390/s140609995. 24915181 . 4118358. 2014Senso..14.9995B . free .
  176. Stisen, Allan, et al. "Smart Devices are Different: Assessing and Mitigating Mobile Sensing Heterogeneities for Activity Recognition." Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015.
  177. Bhattacharya, Sourav, and Nicholas D. Lane. "From Smart to Deep: Robust Activity Recognition on Smartwatches using Deep Learning."
  178. Bacciu . Davide . et al . 2014 . An experimental characterization of reservoir computing in ambient assisted living applications . Neural Computing and Applications . 24 . 6. 1451–1464 . 10.1007/s00521-013-1364-4. 11568/237959 . 14124013 . free .
  179. Palumbo, Filippo; Barsocchi, Paolo; Gallicchio, Claudio; Chessa, Stefano; Micheli, Alessio (2013). "Multisensor Data Fusion for Activity Recognition Based on Reservoir Computing". Evaluating AAL Systems Through Competitive Benchmarking. Communications in Computer and Information Science. Vol. 386. pp. 24–35. doi:10.1007/978-3-642-41043-7_3. ISBN 978-3-642-41042-0. https://link.springer.com/chapter/10.1007/978-3-642-41043-7_3
  180. Reiss, Attila, and Didier Stricker. "Introducing a new benchmarked dataset for activity monitoring." Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012.
  181. Roggen, Daniel, et al. "OPPORTUNITY: Towards opportunistic activity and context recognition systems." World of Wireless, Mobile and Multimedia Networks & Workshops, 2009. WoWMoM 2009. IEEE International Symposium on a. IEEE, 2009.
  182. Kurz, Marc, et al. "Dynamic quantification of activity recognition capabilities in opportunistic systems." Vehicular Technology Conference (VTC Spring), 2011 IEEE 73rd. IEEE, 2011.
  183. Sztyler, Timo, and Heiner Stuckenschmidt. "On-body localization of wearable devices: an investigation of position-aware activity recognition." Pervasive Computing and Communications (PerCom), 2016 IEEE International Conference on. IEEE, 2016.
  184. Zhi. Ying Xuan. Lukasik. Michelle. Li. Michael H.. Dolatabadi. Elham. Wang. Rosalie H.. Taati. Babak. 2018. Automatic Detection of Compensation During Robotic Stroke Rehabilitation Therapy. IEEE Journal of Translational Engineering in Health and Medicine. 6. 2100107. 10.1109/JTEHM.2017.2780836. 2168-2372. 5788403. 29404226.
  185. Book: Dolatabadi. Elham. Zhi. Ying Xuan. Ye. Bing. Coahran. Marge. Lupinacci. Giorgia. Mihailidis. Alex. Wang. Rosalie. Taati. Babak. Proceedings of the 11th EAI International Conference on Pervasive Computing Technologies for Healthcare . The toronto rehab stroke pose dataset to detect compensation during stroke rehabilitation therapy . 2017-05-23. ACM. 375–381. 10.1145/3154862.3154925. 9781450363631. 24581930.
  186. Web site: Toronto Rehab Stroke Pose Dataset.
  187. Jung. Merel M.. Poel. Mannes. Poppe. Ronald. Heylen. Dirk K. J.. 2017-03-01. Automatic recognition of touch gestures in the corpus of social touch. Journal on Multimodal User Interfaces. en. 11. 1. 81–96. 10.1007/s12193-016-0232-9. 1802116. 1783-8738.
  188. Jung, M.M. (Merel) (2016-06-01). "Corpus of Social Touch (CoST)". University of Twente. doi:10.4121/uuid:5ef62345-3b3e-479c-8e1d-c922748c9b29.
  189. Aeberhard, S., D. Coomans, and O. De Vel. "Comparison of classifiers in high dimensional settings." Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep 92-02 (1992).
  190. Basu, Sugato. "Semi-supervised clustering with limited background knowledge." AAAI. 2004.
  191. Tüfekci . Pınar . 2014 . Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods . International Journal of Electrical Power & Energy Systems . 60 . 126–140 . 10.1016/j.ijepes.2014.02.027. 2014IJEPE..60..126T .
  192. Kaya, Heysem, Pınar Tüfekci, and Fikret S. Gürgen. "Local and global learning methods for predicting power of a combined gas & steam turbine." International conference on emerging trends in computer and electronics engineering (ICETCEE'2012), Dubai. 2012.
  193. Baldi . Pierre . Sadowski . Peter . Whiteson . Daniel . 2014. Searching for exotic particles in high-energy physics with deep learning . Nature Communications . 5 . 2014 . 2014NatCo...5.4308B . 10.1038/ncomms5308 . 24986233 . 1402.4735 . 195953 .
  194. Baldi . Pierre . Sadowski . Peter . Whiteson . Daniel . 2015 . Enhanced Higgs Boson to τ+ τ− Search with Deep Learning . Physical Review Letters . 114 . 11. 111801 . 10.1103/physrevlett.114.111801. 25839260 . 2015PhRvL.114k1801B . 1410.3469 . 2339142 .
  195. Adam-Bourdarios, C.; Cowan, G.; Germain-Renaud, C.; Guyon, I.; Kégl, B.; Rousseau, D. (2015). "The Higgs Machine Learning Challenge". Journal of Physics: Conference Series. 664 (7): 072015. Bibcode:2015JPhCS.664g2015A. doi:10.1088/1742-6596/664/7/072015.
  196. 1601.07913 . 10.1140/epjc/s10052-016-4099-4 . Parameterized neural networks for high-energy physics . 2016 . Baldi . Pierre . Cranmer . Kyle . Faucett . Taylor . Sadowski . Peter . Whiteson . Daniel . The European Physical Journal C . 76 . 5 . 235 . 2016EPJC...76..235B . 254108545 .
  197. Ortigosa . I. . Lopez . R. . Garcia . J. . A neural networks approach to residuary resistance of sailing yachts prediction . Proceedings of the International Conference on Marine Engineering MARINE . 2007 .
  198. Gerritsma, J., R. Onnink, and A. Versluis. Geometry, resistance and stability of the delft systematic yacht hull series. Delft University of Technology, 1981.
  199. Liu, Huan, and Hiroshi Motoda. Feature extraction, construction and selection: A data mining perspective. Springer Science & Business Media, 1998.
  200. Reich, Yoram. Converging to Ideal Design Knowledge by Learning. [Carnegie Mellon University], Engineering Design Research Center, 1989.
  201. Todorovski, Ljupčo; Džeroski, Sašo (1999). "Experiments in Meta-level Learning with ILP". Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. pp. 98–106. doi:10.1007/978-3-540-48247-5_11. ISBN 978-3-540-66490-1. S2CID 39382993. https://link.springer.com/chapter/10.1007/978-3-540-48247-5_11
  202. Wang, Yong. A new approach to fitting linear models in high dimensional spaces. Diss. The University of Waikato, 2000.
  203. Kibler . Dennis . Aha . David W. . Albert . Marc K. . 1989 . Instance-based prediction of real-valued attributes . Computational Intelligence . 5 . 2. 51–57 . 10.1111/j.1467-8640.1989.tb00315.x. 40800413 .
  204. Palmer, Christopher R., and Christos Faloutsos. "Electricity based external similarity of categorical attributes." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2003. 486–500.
  205. Tsanas . Athanasios . Xifara . Angeliki . 2012 . Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools . Energy and Buildings . 49 . 560–567 . 10.1016/j.enbuild.2012.03.003. 2012EneBu..49..560T .
  206. De Wilde . Pieter . 2014 . The gap between predicted and measured energy performance of buildings: A framework for investigation . Automation in Construction . 41 . 40–49 . 10.1016/j.autcon.2014.02.009.
  207. Brooks, Thomas F., D. Stuart Pope, and Michael A. Marcolini. Airfoil self-noise and prediction. Vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division, 1989.
  208. Draper, David. "Assessment and propagation of model uncertainty." Journal of the Royal Statistical Society, Series B (Methodological) (1995): 45–97.
  209. Lavine . Michael . 1991 . Problems in extrapolation illustrated with space shuttle O-ring data . Journal of the American Statistical Association . 86 . 416. 919–921 . 10.1080/01621459.1991.10475132.
  210. Wang, Jun, Bei Yu, and Les Gasser. "Concept tree based clustering visualization with shaded similarity matrices." Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
  211. Magellan: Radar Performance and Data Products . 10.1126/science.252.5003.260 . 1991 . Pettengill . Gordon H. . Ford . Peter G. . Johnson . William T. K. . Raney . R. Keith . Soderblom . Laurence A. . Science . 252 . 5003 . 260–265 . 17769272 . 1991Sci...252..260P . 43398343 .
  212. Aharonian . F. . et al . 2008 . Energy spectrum of cosmic-ray electrons at TeV energies . Physical Review Letters . 101 . 26. 261104 . 2008PhRvL.101z1104A . 10.1103/PhysRevLett.101.261104 . 19437632 . 0811.3894 . 2440/51450 . 41850528 .
  213. Bock . R. K. . et al . 2004 . Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope . Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment . 516 . 2. 511–528 . 10.1016/j.nima.2003.08.157. 2004NIMPA.516..511B .
  214. Li . Jinyan . et al . 2004 . Deeps: A new instance-based lazy discovery and classification system . Machine Learning . 54 . 2. 99–124 . 10.1023/b:mach.0000011804.08528.7d. free .
  215. Villaescusa-Navarro. Francisco. et al. The CAMELS Multifield Data Set: Learning the Universe's Fundamental Parameters with Artificial Intelligence. The Astrophysical Journal Supplement Series. 2022. 259. 2. 61. 10.3847/1538-4365/ac5ab0. 2109.10915. 2022ApJS..259...61V. 237604997 . free .
  216. Siebert, Lee, and Tom Simkin. "Volcanoes of the world: an illustrated catalog of Holocene volcanoes and their eruptions." (2014).
  217. Sikora . Marek . Wróbel . Łukasz . 2010 . Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines . Archives of Mining Sciences . 55 . 1. 91–114 .
  218. Sikora, Marek, and Beata Sikora. "Rough natural hazards monitoring." Rough Sets: Selected Methods and Applications in Management and Engineering. Springer London, 2012. 163–179.
  219. Addor. Nans. Newman. Andrew J.. Mizukami. Naoki. Clark. Martyn P.. 2017-10-20. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences. en. 21. 10. 5293–5313. 10.5194/hess-21-5293-2017. 2017HESS...21.5293A. 1607-7938 . free .
  220. Newman. A. J.. Clark. M. P.. Sampson. K.. Wood. A.. Hay. L. E.. Bock. A.. Viger. R. J.. Blodgett. D.. Brekke. L.. Arnold. J. R.. Hopson. T.. 2015-01-14. Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences. en. 19. 1. 209–223. 10.5194/hess-19-209-2015. 2015HESS...19..209N. 1607-7938 . free .
  221. Alvarez-Garreton. Camila. Mendoza. Pablo A.. Boisier. Juan Pablo. Addor. Nans. Galleguillos. Mauricio. Zambrano-Bigiarini. Mauricio. Lara. Antonio. Puelma. Cristóbal. Cortes. Gonzalo. Garreaud. Rene. McPhee. James. 2018-11-13. The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies – Chile dataset. Hydrology and Earth System Sciences. en. 22. 11. 5817–5846. 10.5194/hess-22-5817-2018. 2018HESS...22.5817A. 133955609. 1607-7938 . free .
  222. Chagas. Vinícius B. P.. Chaffe. Pedro L. B.. Addor. Nans. Fan. Fernando M.. Fleischmann. Ayan S.. Paiva. Rodrigo C. D.. Siqueira. Vinícius A.. 2020-09-08. CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil. Earth System Science Data. en. 12. 3. 2075–2096. 10.5194/essd-12-2075-2020. 2020ESSD...12.2075C. 234737197. 1866-3516 . free .
  223. Coxon. Gemma. Addor. Nans. Bloomfield. John P.. Freer. Jim. Fry. Matt. Hannaford. Jamie. Howden. Nicholas J. K.. Lane. Rosanna. Lewis. Melinda. Robinson. Emma L.. Wagener. Thorsten. 2020-10-12. CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain. Earth System Science Data. en. 12. 4. 2459–2483. 10.5194/essd-12-2459-2020. 2020ESSD...12.2459C. 226192657. 1866-3516 . free .
  224. Fowler. Keirnan J. A.. Acharya. Suwash Chandra. Addor. Nans. Chou. Chihchung. Peel. Murray C.. 2021-08-06. CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia. Earth System Science Data. en. 13. 8. 3847–3867. 10.5194/essd-13-3847-2021. 2021ESSD...13.3847F. 238796784. 1866-3516 . free .
  225. Klingler. Christoph. Schulz. Karsten. Herrnegger. Mathew. 2021-09-16. LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe. Earth System Science Data. en. 13. 9. 4529–4565. 10.5194/essd-13-4529-2021. 2021ESSD...13.4529K. 240533508. 1866-3516 . free .
  226. Yeh . I–C . 1998 . Modeling of strength of high-performance concrete using artificial neural networks . Cement and Concrete Research . 28 . 12. 1797–1808 . 10.1016/s0008-8846(98)00165-3.
  227. Zarandi . MH Fazel . et al . 2008 . Fuzzy polynomial neural networks for approximation of the compressive strength of concrete . Applied Soft Computing . 8 . 1. 488–498 . 10.1016/j.asoc.2007.02.010. 2008ApSoC...8...79S .
  228. Yeh, I. "Modeling slump of concrete with fly ash and superplasticizer." Computers and Concrete 5.6 (2008): 559–572.
  229. Gencel . Osman . et al . 2011 . Comparison of artificial neural networks and general linear model approaches for the analysis of abrasive wear of concrete . Construction and Building Materials . 25 . 8. 3486–3494 . 10.1016/j.conbuildmat.2011.03.040.
  230. Dietterich, Thomas G., et al. "A comparison of dynamic reposing and tangent distance for drug activity prediction." Advances in Neural Information Processing Systems (1994): 216–216.
  231. Buscema, Massimo, William J. Tastle, and Stefano Terzi. "Meta net: A new meta-classifier family." Data Mining Applications Using Artificial Adaptive Systems. Springer New York, 2013. 141–182.
  232. Barnard, Amanda; Sun, Baichuan; Motevalli Soumehsaraei, Ben; & Opletal, George (2019): Silver Nanoparticle Data Set. v3. CSIRO. Data Collection. https://doi.org/10.25919/5d22d20bc543e
  233. Barnard, Amanda; Sun, Baichuan; & Opletal, George (2019): Platinum Nanoparticle Data Set. v2. CSIRO. Data Collection. https://doi.org/10.25919/5d3958d9bf5f7
  234. Barnard, Amanda; & Opletal, George (2019): Gold Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/5d395ef9a4291
  235. Barnard, Amanda; & Opletal, George (2019): Ruthenium Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/5e30b8fa67484
  236. Barnard, Amanda; & Opletal, George (2019): Copper Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/5e30ba386311f
  237. Barnard, Amanda; & Opletal, George (2023): Palladium Nanoparticle Data Set. v2. CSIRO. Data Collection. https://doi.org/10.25919/epxd-8p61
  238. Ting, Jonathan; Barnard, Amanda; Opletal, George (2023): AuCo Nanoparticle Data Set. v2. CSIRO. Data Collection. https://doi.org/10.25919/7h3x-1343
  239. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtCo Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/jzh8-rd31
  240. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtAu Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/tdnv-jp30
  241. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdPt Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/qced-2e85
  242. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdCo Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/az9t-vr97
  243. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoPt Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/0bs4-sn79
  244. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoPd Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/em3a-9a89
  245. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoAu Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/991j-hg07
  246. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPt Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/7zh9-3f67
  247. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtPd Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/9sz9-3a85
  248. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdAu Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/6ajg-1275
  249. Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPd Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/v0r5-sw08
  250. Lu, Kaihan; Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPdPt Nanoparticle Data Set. v1. CSIRO. Data Collection. https://doi.org/10.25919/psvw-am47
  251. Amoradnejad . Issa . Amoradnejad . Rahimberdi . et al . 2022 . Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people . Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM) . 3 . 1–4 . ICWSM . 10.36190/2022.82 . 249668669 .
  252. Web site: Age Dataset . . 7 June 2022 .
  253. Web site: Synthetic Fundus Dataset . 22 February 2023 . 29 November 2021 . https://web.archive.org/web/20211129155047/http://math.unipa.it/cvalenti/fundus/ . dead .
  254. Lo Castro . Dario . et al . 2020 . A visual framework to create photorealistic retinal vessels for diagnosis purposes . Journal of Biomedical Informatics . 108 . 103490 . 10.1016/j.jbi.2020.103490 . 32640292 . 220429697 .
  255. Ingber . Lester . 1997 . Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography . Physical Review E . 55 . 4. 4578–4593. 1997PhRvE..55.4578I. 10.1103/PhysRevE.55.4578. physics/0001052. 6390999 .
  256. Hoffmann . Ulrich . Vesin . Jean-Marc . Ebrahimi . Touradj . Diserens . Karin . 2008 . An efficient P300-based brain–computer interface for disabled subjects . Journal of Neuroscience Methods . 167 . 1. 115–125 . 10.1016/j.jneumeth.2007.03.005. 17445904 . 10.1.1.352.4630 . 9648828 .
  257. Donchin . Emanuel . Kevin M. . Spencer . Ranjith . Wijesinghe . The mental prosthesis: assessing the speed of a P300-based brain-computer interface . IEEE Transactions on Rehabilitation Engineering . 8 . 2 . 2000 . 174–179 . 10896179 . 10.1109/86.847808. 84043 .
  258. Detrano . Robert . et al . 1989 . International application of a new probability algorithm for the diagnosis of coronary artery disease . The American Journal of Cardiology . 64 . 5. 304–310 . 10.1016/0002-9149(89)90524-9. 2756873 .
  259. Bradley . Andrew P . 1997 . The use of the area under the ROC curve in the evaluation of machine learning algorithms . Pattern Recognition . 30 . 7. 1145–1159 . 10.1016/s0031-3203(96)00142-2. 1997PatRe..30.1145B . 13806304 .
  260. Book: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1905/0000/Nuclear-feature-extraction-for-breast-tumor-diagnosis/10.1117/12.148698.short. 10.1117/12.148698. Nuclear feature extraction for breast tumor diagnosis. Biomedical Image Processing and Biomedical Visualization. 1993. Acharya. Raj S. Street. W. N.. Wolberg. W. H.. Mangasarian. O. L.. 1905. 861–870. 14922543. Dmitry B. Goldgof.
  261. Demir, Cigdem, and Bülent Yener. "Automated cancer diagnosis based on histopathological images: a systematic survey." Rensselaer Polytechnic Institute, Tech. Rep (2005).
  262. Abuse, Substance. "Mental Health Services Administration, Results from the 2010 National Survey on Drug Use and Health: Summary of National Findings, NSDUH Series H-41, HHS Publication No.(SMA) 11-4658." Rockville, MD: Substance Abuse and Mental Health Services Administration 201 (2011).
  263. Hong . Zi-Quan . Yang . Jing-Yu . 1991 . Optimal discriminant plane for a small number of samples and design method of classifier on the plane . Pattern Recognition . 24 . 4. 317–324 . 10.1016/0031-3203(91)90074-f. 1991PatRe..24..317H .
  264. Li, Jinyan, and Limsoon Wong. "Using rules to analyse bio-medical data: a comparison between C4.5 and PCL." Advances in Web-Age Information Management. Springer Berlin Heidelberg, 2003. 254–265.
  265. Güvenir, H. Altay, et al. "A supervised machine learning algorithm for arrhythmia analysis." Computers in Cardiology 1997. IEEE, 1997.
  266. Lagus, Krista, et al. "Independent variable group analysis in learning compact representations for data." Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), T. Honkela, V. Könönen, M. Pöllä, and O. Simula, Eds., Espoo, Finland. 2005.
  267. Strack, Beata, et al. "Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records." BioMed Research International 2014; 2014
  268. Rubin . Daniel J . 2015 . Hospital readmission of patients with diabetes . Current Diabetes Reports . 15 . 4. 1–9 . 10.1007/s11892-015-0584-7. 25712258 . 3908599 .
  269. Antal . Bálint . Hajdu . András . 2014 . An ensemble-based system for automatic screening of diabetic retinopathy . Knowledge-Based Systems . 60 . 2014. 20–27 . 10.1016/j.knosys.2013.12.023. 1410.8576 . 2014arXiv1410.8576A . 13984326 .
  270. 1505.04424. Haloi. Mrinal. Improved Microaneurysm Detection using Deep Neural Networks. cs.CV. 2015.
  271. Web site: ADCIS Download Third Party: Messidor Database. ELIE. Guillaume PATRY, Gervais GAUTHIER, Bruno LAY, Julien ROGER, Damien. adcis.net. en. 2018-02-25.
  272. Decencière. Etienne. Zhang. Xiwei. Cazuguel. Guy. Lay. Bruno. Cochener. Béatrice. Trone. Caroline. Gain. Philippe. Ordonez. Richard. Massin. Pascale. 2014-08-26. Image Analysis & Stereology. en. 33. 3. 231–234. 10.5566/ias.1155. 1854-5165. Feedback on a Publicly Distributed Image Database: The Messidor Database. free.
  273. Bagirov . A. M. . et al . 2003 . Unsupervised and supervised data classification via nonsmooth and global optimization . Top . 11 . 1. 1–75 . 10.1007/bf02578945. 10.1.1.1.6429 . 14165678 .
  274. Fung, Glenn, et al. "A fast iterative algorithm for fisher discriminant using heterogeneous kernels." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
  275. Quinlan, John Ross, et al. "Inductive knowledge acquisition: a case study." Proceedings of the Second Australian Conference on Applications of expert systems. Addison-Wesley Longman Publishing Co., Inc., 1987.
  276. Zhou . Zhi-Hua . Jiang . Yuan . 2004 . NeC4.5: neural ensemble based C4.5 . IEEE Transactions on Knowledge and Data Engineering. 16 . 6. 770–773 . 10.1109/tkde.2004.11. 10.1.1.1.8430 . 1024861 .
  277. Er . Orhan . et al . 2012 . An approach based on probabilistic neural network for diagnosis of Mesothelioma's disease . Computers & Electrical Engineering . 38 . 1. 75–81 . 10.1016/j.compeleceng.2011.09.001.
  278. Er, Orhan, A. Çetin Tanrikulu, and Abdurrahman Abakay. "Use of artificial intelligence techniques for diagnosis of malignant pleural mesothelioma." Dicle Tıp Dergisi 42.1 (2015).
  279. Li. Michael H.. Mestre. Tiago A.. Fox. Susan H.. Taati. Babak. 2017-07-25. Vision-Based Assessment of Parkinsonism and Levodopa-Induced Dyskinesia with Deep Learning Pose Estimation. Journal of Neuroengineering and Rehabilitation. 15. 1. 97. 1707.09416. 10.1186/s12984-018-0446-z. 30400914. 6219082. 2017arXiv170709416L . free .
  280. Li. Michael H.. Mestre. Tiago A.. Fox. Susan H.. Taati. Babak. May 2018. Automated assessment of levodopa-induced dyskinesia: Evaluating the responsiveness of video-based features. Parkinsonism & Related Disorders. 53. 42–45. 10.1016/j.parkreldis.2018.04.036. 29748112. 13666294. 1353-8020.
  281. Web site: Parkinson's Vision-Based Pose Estimation Dataset Kaggle. kaggle.com. 2018-08-22.
  282. Shannon. Paul. et al. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research . 13 . 11 . 2498–2504 . 10.1101/gr.1239303 . 14597658 . 403769.
  283. Javadi. Soroush. Mirroshandel. Seyed Abolghasem. 2019. A novel deep learning method for automatic assessment of human sperm images. Computers in Biology and Medicine. 109. 182–194. 0010-4825. 10.1016/j.compbiomed.2019.04.030. 31059902. 146809768.
  284. Web site: soroushj/mhsma-dataset: MHSMA: The Modified Human Sperm Morphology Analysis Dataset. github.com. 2019-05-03.
  285. Clark, David, Zoltan Schreter, and Anthony Adams. "A quantitative comparison of dystal and backpropagation." Proceedings of 1996 Australian Conference on Neural Networks. 1996.
  286. Jiang, Yuan, and Zhi-Hua Zhou. "Editing training data for kNN classifiers with neural network ensemble." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 356–361.
  287. Ontañón, Santiago, and Enric Plaza. "On similarity measures based on a refinement lattice." Case-Based Reasoning Research and Development. Springer Berlin Heidelberg, 2009. 240–255.
  288. Web site: PLF data inventory. GitHub. 5 November 2021.
  289. Higuera . Clara . Gardiner . Katheleen J. . Cios . Krzysztof J. . 2015 . Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome . PLOS ONE . 10 . 6. e0129126 . 10.1371/journal.pone.0129126. 26111164 . 4482027 . 2015PLoSO..1029126H . free .
  290. Ahmed . Md Mahiuddin . et al . 2015 . Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome . PLOS ONE . 10 . 3. e0119491 . 10.1371/journal.pone.0119491. 25793384 . 4368539 . 2015PLoSO..1019491A . free .
  291. Langley. Pat. 2014. Trading off simplicity and coverage in incremental concept learning. Machine Learning Proceedings. 1988. 73. 6 August 2019. 6 August 2019. https://web.archive.org/web/20190806184005/https://www.westmont.edu/~iba/pubs/hillary-paper.pdf. dead.
  292. Web site: Mushroom Data Set 2020. 2021-04-06. mushroom.mathematik.uni-marburg.de.
  293. Wagner. Dennis. Heider. Dominik. Hattab. Georges. 2021-04-14. Mushroom data creation, curation, and simulation to support classification tasks. Scientific Reports. en. 11. 1. 8134. 10.1038/s41598-021-87602-3. 33854157. 8046754. 2021NatSR..11.8134W. 2045-2322.
  294. Cortez, Paulo, and Aníbal de Jesus Raimundo Morais. "A data mining approach to predict forest fires using meteorological data." (2007).
  295. Farquad . M. A. H. . Ravi . V. . Raju . S. Bapi . 2010 . Support vector regression based hybrid rule extraction methods for forecasting . Expert Systems with Applications . 37 . 8. 5577–5589 . 10.1016/j.eswa.2010.02.055.
  296. Fisher . Ronald A . 1936 . The use of multiple measurements in taxonomic problems . Annals of Eugenics . 7 . 2. 179–188 . 10.1111/j.1469-1809.1936.tb02137.x. 2440/15227 . free .
  297. Ghahramani, Zoubin, and Michael I. Jordan. "Supervised learning from incomplete data via an EM approach." Advances in neural information processing systems 6. 1994.
  298. Mallah . Charles . Cope . James . Orwell . James . 2013 . Plant leaf classification using probabilistic integration of shape, texture and margin features . Signal Processing, Pattern Recognition and Applications . 5 . 1 .
  299. Yahiaoui, Itheri, Olfa Mzoughi, and Nozha Boujemaa. "Leaf shape descriptor for tree species identification." Multimedia and Expo (ICME), 2012 IEEE International Conference on. IEEE, 2012.
  300. Tan, Ming, and Larry Eshelman. "Using weighted networks to represent classification knowledge in noisy domains." Proceedings of the Fifth International Conference on Machine Learning. 2014.
  301. Charytanowicz, Małgorzata, et al. "Complete gradient clustering algorithm for features analysis of x-ray images." Information technologies in biomedicine. Springer Berlin Heidelberg, 2010. 15–24.
  302. Sanchez . Mauricio A. . et al . 2014 . Fuzzy granular gravitational clustering algorithm for multivariate data . Information Sciences . 279 . 498–511 . 10.1016/j.ins.2014.04.005.
  303. Blackard . Jock A. . Dean . Denis J. . 1999 . Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables . Computers and Electronics in Agriculture . 24 . 3. 131–151 . 10.1016/s0168-1699(99)00046-0. 1999CEAgr..24..131B . 10.1.1.128.2475 . 13985407 .
  304. Fürnkranz, Johannes. "Round robin rule learning." Proceedings of the 18th International Conference on Machine Learning (ICML-01): 146–153. 2001.
  305. Li . Song . Assmann . Sarah M. . Albert . Réka . 2006 . Predicting essential components of signal transduction networks: a dynamic model of guard cell abscisic acid signaling . PLOS Biol . 4 . 10. e312 . 10.1371/journal.pbio.0040312. 16968132 . 1564158 . 2006q.bio....10012L . q-bio/0610012 . free .
  306. Munisami . Trishen . et al . 2015 . Plant Leaf Recognition Using Shape Features and Colour Histogram with K-nearest Neighbour Classifiers . Procedia Computer Science . 58 . 740–747 . 10.1016/j.procs.2015.08.095. free .
  307. Li . Bai . 2016 . Atomic potential matching: An evolutionary target recognition approach based on edge features . Optik . 127 . 5. 3162–3168 . 10.1016/j.ijleo.2015.11.186. 2016Optik.127.3162L .
  308. Razavian, Ali, et al. "CNN features off-the-shelf: an astounding baseline for recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014.
  309. Nilsback, Maria-Elena, and Andrew Zisserman. "A visual vocabulary for flower classification." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
  310. Giselsson . Thomas M. . et al . 2017 . A Public Image Database for Benchmark of Plant Seedling Classification Algorithms . 1711.05458 . cs.CV .
  311. Web site: Oltean. Mihai . 2017 . Fruits-360 dataset. .
  312. Nakai . Kenta . Kanehisa . Minoru . 1991 . Expert system for predicting protein localization sites in gram-negative bacteria . Proteins: Structure, Function, and Bioinformatics . 11 . 2. 95–110 . 10.1002/prot.340110203. 1946347 . 27606447 .
  313. Ling, Charles X., et al. "Decision trees with minimal costs." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
  314. Mahé, Pierre, et al. "Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum." Bioinformatics (2014): btu022.
  315. Barbano . Duane . et al . 2015 . Rapid characterization of microalgae and microalgae mixtures using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) . PLOS ONE . 10 . 8. e0135337 . 10.1371/journal.pone.0135337. 26271045 . 4536233 . 2015PLoSO..1035337B . free .
  316. Horton . Paul . Nakai . Kenta . 1996 . A probabilistic classification system for predicting the cellular localization sites of proteins . ISMB-96 Proceedings . 4 . 109–15 . 8877510 . 6 August 2019 . 4 November 2021 . https://web.archive.org/web/20211104042943/https://www.aaai.org/Papers/ISMB/1996/ISMB96-012.pdf . dead .
  317. Allwein . Erin L. . Schapire . Robert E. . Singer . Yoram . 2001 . Reducing multiclass to binary: A unifying approach for margin classifiers . The Journal of Machine Learning Research . 1 . 113–141 .
  318. Mayr . Andreas . Klambauer . Guenter . Unterthiner . Thomas . Hochreiter . Sepp . 2016 . DeepTox: Toxicity Prediction Using Deep Learning . Frontiers in Environmental Science . 3 . 80 . 10.3389/fenvs.2015.00080. free .
  319. Book: Lavin . Alexander . Ahmad . Subutai . 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) . Evaluating Real-Time Anomaly Detection Algorithms -- the Numenta Anomaly Benchmark . 1510.03336 . 38–44 . 12 October 2015 . 10.1109/ICMLA.2015.141 . 978-1-5090-0287-0 . 6842305 .
  320. Web site: Iurii D. Katser . Vyacheslav O. Kozitsin . SKAB GitHub repository . . 12 January 2021.
  321. Iurii D. Katser . Vyacheslav O. Kozitsin . Skoltech Anomaly Benchmark (SKAB) . Kaggle . 2020 . 10.34740/KAGGLE/DSV/1693952 . 17 March 2024 . 12 January 2021.
  322. Campos. Guilherme O.. Zimek. Arthur. Sander. Jörg. Campello. Ricardo J. G. B.. Micenková. Barbora. Schubert. Erich. Assent. Ira. Houle. Michael E.. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery. 30. 4. 891. 2016. 1384-5810. 10.1007/s10618-015-0444-8. 1952214.
  323. Ann-Kathrin Hartmann, Tommaso Soru, Edgard Marx. Generating a Large Dataset for Neural Question Answering over the DBpedia Knowledge Base. 2018.
  324. Tommaso Soru, Edgard Marx, Diego Moussallem, Andre Valdestilhas, Diego Esteves, Ciro Baron. SPARQL as a Foreign Language. 2018.
  325. Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. A Vietnamese Dataset for Evaluating Machine Reading Comprehension. COLING 2020.
  326. Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. Enhancing Lexical-Based Approach With External Knowledge for Vietnamese Multiple-Choice Machine Reading Comprehension. IEEE Access. 2020.
  327. 2010.04898 . Anantha . Raviteja . Vakulenko . Svitlana . Tu . Zhucheng . Longpre . Shayne . Pulman . Stephen . Chappidi . Srinivas . Open-Domain Question Answering Goes Conversational via Question Rewriting . 2020 . cs.IR .
  328. Khashabi . Daniel . Min . Sewon . Khot . Tushar . Sabharwal . Ashish . Tafjord . Oyvind . Clark . Peter . Hajishirzi . Hannaneh . November 2020 . UNIFIEDQA: Crossing Format Boundaries with a Single QA System . Findings of the Association for Computational Linguistics: EMNLP 2020 . Online . Association for Computational Linguistics . 1896–1907 . 10.18653/v1/2020.findings-emnlp.171. 2005.00700 . 218487109 .
  329. Byrne . Bill . Krishnamoorthi . Karthik . Sankar . Chinnadhurai . Neelakantan . Arvind . Duckworth . Daniel . Yavuz . Semih . Goodrich . Ben . Dubey . Amit . Cedilnik . Andy . Kim . Kyu-Young . 2019-09-01 . Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset . cs.CL . 1909.05358 .
  330. Yasunaga . Michihiro . Liang . Percy . 2020-11-21 . Graph-based, Self-Supervised Program Repair from Diagnostic Feedback . International Conference on Machine Learning . en . PMLR . 10799–10808. 2005.10636 .
  331. Wang . Yizhong . Mishra . Swaroop . Alipoormolabashi . Pegah . Kordi . Yeganeh . Mirzaei . Amirreza . Arunkumar . Anjana . Ashok . Arjun . Dhanasekaran . Arut Selvan . Naik . Atharva . Stap . David . Pathak . Eshaan . Karamanolakis . Giannis . Lai . Haizhi Gary . Purohit . Ishan . Mondal . Ishani . 2022-10-24 . Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks . cs.CL . 2204.07705 .
  332. Paperno . Denis . Kruszewski . Germán . Lazaridou . Angeliki . Pham . Ngoc Quan . Bernardi . Raffaella . Pezzelle . Sandro . Baroni . Marco . Boleda . Gemma . Fernández . Raquel . August 2016 . The LAMBADA dataset: Word prediction requiring a broad discourse context . Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Berlin, Germany . Association for Computational Linguistics . 1525–1534 . 10.18653/v1/P16-1144. 10230/32702 . 2381275 .
  333. Wei . Jason . Bosma . Maarten . Zhao . Vincent . Guu . Kelvin . Yu . Adams Wei . Lester . Brian . Du . Nan . Dai . Andrew M. . Le . Quoc V. . 2022-02-10 . Finetuned Language Models are Zero-Shot Learners . 2109.01652 . en.
  334. Web site: Working with ATT&CK MITRE ATT&CK® . 2023-01-14 . attack.mitre.org.
  335. Web site: CAPEC - Common Attack Pattern Enumeration and Classification (CAPEC™) . 2023-01-14 . capec.mitre.org.
  336. Web site: CVE - Home . 2023-01-14 . cve.mitre.org.
  337. Web site: CWE - Common Weakness Enumeration . 2023-01-14 . cwe.mitre.org.
  338. Lim . Swee Kiat . Muis . Aldrian Obaja . Lu . Wei . Ong . Chen Hui . July 2017 . MalwareTextDB: A Database for Annotated Malware Articles . Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Vancouver, Canada . Association for Computational Linguistics . 1557–1567 . 10.18653/v1/P17-1143. 7816596 .
  339. Web site: USENIX . 2023-01-19 . USENIX . en.
  340. Web site: APTnotes Read the Docs . 2023-01-19 . readthedocs.org.
  341. Web site: Cryptography and Security authors/titles recent submissions . 2023-01-19 . arxiv.org . en.
  342. Web site: Holistic Info-Sec for Web Developers - Fascicle 0 . 2023-01-20 . f0.holisticinfosecforwebdevelopers.com.
  343. Web site: Holistic Info-Sec for Web Developers - Fascicle 1 . 2023-01-20 . f1.holisticinfosecforwebdevelopers.com.
  344. Web site: Vincent . Adam . Web Services Hacking and Hardening . owasp.org.
  345. Web site: McCray . Joe . Advanced SQL Injection . defcon.org.
  346. Web site: Shah . Shreeraj . Blind SQL injection discovery & exploitation technique . blueinfy.com.
  347. Web site: Palcer . C. C. . Ethical hacking . textfiles.
  348. Web site: Hacking Secrets Revealed - Information and Instructional Guide .
  349. Web site: Park . Alexis . Hack any website .
  350. Web site: Cerrudo . Cesar . Martinez Fayo . Esteban . Hacking Databases for Owning your Data . blackhat.
  351. Web site: O'Connor . Tj. . Violent Python-A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers . Github.
  352. Web site: Grand . Joe . Hardware Reverse Engineering: Access, Analyze, & Defeat . blackhat.
  353. Web site: Chang . Jason V. . Computer Hacking: Making the Case for National Reporting Requirement . cyber.harvard.edu.
  354. Web site: National Cybersecurity Strategies Repository . 2023-01-20 . ITU . en-US.
  355. Zampieri . Marcos . Malmasi . Shervin . Nakov . Preslav . Rosenthal . Sara . Farra . Noura . Kumar . Ritesh . 2019-04-16 . Predicting the Type and Target of Offensive Posts in Social Media . cs.CL . 1902.09666 .
  356. Web site: Threat reports . 2023-01-20 . www.ncsc.gov.uk . en.
  357. Web site: Category: APT reports Securelist . 2023-01-23 . securelist.com.
  358. Web site: Your Cybersecurity News Connection - Cyber News CyberWire . 2023-01-23 . The CyberWire.
  359. Web site: News . 21 August 2016 . 2023-01-23 . en-US.
  360. Web site: Cybernews . Cybernews.
  361. Web site: BleepingComputer . 2023-01-23 . BleepingComputer . en-us.
  362. Web site: Homepage . 2023-01-23 . The Record from Recorded Future News . en.
  363. Web site: 2022-01-08 . HackRead Latest Cyber Crime - InfoSec- Tech - Hacking News . 2023-01-23 . en-US.
  364. Web site: Securelist Kaspersky's threat research and reports . 2023-01-31 . securelist.com.
  365. Book: Harshaw . Christopher R. . Bridges . Robert A. . Iannacone . Michael D. . Reed . Joel W. . Goodall . John R. . Proceedings of the 11th Annual Cyber and Information Security Research Conference . GraphPrints . 2016-04-05 . https://doi.org/10.1145/2897795.2897806 . CISRC '16 . New York, NY, USA . Association for Computing Machinery . 1–4 . 10.1145/2897795.2897806 . 978-1-4503-3752-6.
  366. Web site: Farsight Security, cyber security intelligence solutions . 2023-02-13 . Farsight Security . en.
  367. Web site: Schneier on Security . 2023-02-13 . www.schneier.com . en-US.
  368. Web site: #1 in Cloud Security & Endpoint Cybersecurity . 2023-02-13 . Trend Micro . en-US.
  369. Web site: The Hacker News #1 Trusted Cybersecurity News Site . 2023-02-13 . The Hacker News . en.
  370. Web site: Krebs on Security – In-depth security news and investigation . 2023-02-25 . en-US.
  371. Web site: MITRE D3FEND Knowledge Graph . 2023-03-31 . d3fend.mitre.org . en.
  372. Web site: MITRE ATLAS™ . 2023-03-31 . atlas.mitre.org.
  373. Web site: MITRE Engage™ An Adversary Engagement Framework from MITRE . 2023-04-01 . en-US.
  374. Web site: Hacking Tutorials - The best Step-by-Step Hacking Tutorials . 2023-04-01 . Hacking Tutorials . en-US.
  375. Web site: TCFD Knowledge Hub . 2023-02-03 . TCFD Knowledge Hub . en.
  376. Web site: ResponsibilityReports.com . 2023-02-03 . www.responsibilityreports.com.
  377. Web site: About — IPCC . 2023-02-20.
  378. Web site: Alliance for Research on Corporate Sustainability ARCS serves as a vehicle for advancing rigorous academic research on corporate sustainability issues. . 2023-03-02 . corporate-sustainability.org.
  379. Mehra . Srishti . Louka . Robert . Zhang . Yixun . 2022-03-26 . ESGBERT: Language Model to Help with Classification Tasks Related to Companies Environmental, Social, and Governance Practices . 2203.16788 . Embedded Systems and Applications . 183–190 . 10.5121/csit.2022.120616. 9781925953657 . 247825524 .
  380. Diggelmann . Thomas . Boyd-Graber . Jordan . Bulian . Jannis . Ciaramita . Massimiliano . Leippold . Markus . 2021-01-02 . CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims . cs.CL . 2012.00614.
  381. Web site: climate-news-db . 2023-02-03 . www.climate-news-db.com.
  382. Web site: Climatext . 2023-02-19 . www.sustainablefinance.uzh.ch . en.
  383. Web site: Greenbiz . 2023-03-02 . www.greenbiz.com.
  384. News: Explore the @Reuters Hot List of 1,000 top climate scientists . en . Reuters . 2023-03-22.
  385. Web site: Blogs Alliance for Research on Corporate Sustainability . 2023-03-27 . corporate-sustainability.org.
  386. Web site: Greenbiz . 2023-03-29 . www.greenbiz.com.
  387. Web site: CSR News . 2023-03-29 . www.csrwire.com . en.
  388. Web site: CDP Homepage . 2023-03-29 . www.cdp.net . en.
  389. de Vries . Harm . The Stack: 3 TB of permissively licensed source code . 2022 . cs.CL . 2211.15533 .
  390. Web site: The Stack Dedup . Huggingface . 29 August 2023.
  391. Web site: Hybrid cloud blog . 2023-04-09 . content.cloud.redhat.com . en-us.
  392. Web site: Production-Grade Container Orchestration . 2023-04-09 . Kubernetes . en.
  393. Web site: Home Official Red Hat OpenShift Documentation . 2023-04-09 . docs.openshift.com.
  394. Web site: Cloud Native Computing Foundation . 2023-04-09 . Cloud Native Computing Foundation . en-US.
  395. Web site: Red Hat - We make open source technologies for the enterprise . 2023-05-01 . www.redhat.com . en.
  396. Brown, Michael Scott, Michael J. Pelosi, and Henry Dirska. "Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks." Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2013. 27–41.
  397. Shen . Kao-Yi . Tzeng . Gwo-Hshiung . 2015 . Fuzzy Inference-Enhanced VC-DRSA Model for Technical Analysis: Investment Decision Aid . International Journal of Fuzzy Systems . 17 . 3. 375–389 . 10.1007/s40815-015-0058-8. 68241024 .
  398. Quinlan . J. Ross . 1987 . Simplifying decision trees . International Journal of Man-Machine Studies . 27 . 3. 221–234 . 10.1016/s0020-7373(87)80053-6. 10.1.1.18.4267 .
  399. Hamers . Bart . Suykens . Johan AK . De Moor . Bart . 2003 . Coupled transductive ensemble learning of kernel models . Journal of Machine Learning Research . 1 . 1–48 .
  400. Shmueli, Galit
  401. Peng, Jie, and Hans-Georg Müller. "Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions." The Annals of Applied Statistics (2008): 1056–1077.
  402. Eggermont, Jeroen, Joost N. Kok, and Walter A. Kosters. "Genetic programming for data classification: Partitioning the search space." Proceedings of the 2004 ACM symposium on Applied computing. ACM, 2004.
  403. Moro . Sérgio . Cortez . Paulo . Rita . Paulo . 2014 . A data-driven approach to predict the success of bank telemarketing . Decision Support Systems . 62 . 22–31 . 10.1016/j.dss.2014.03.001. 10071/9499 . 14181100 . free .
  404. 1411.5653. Payne. Richard D.. Bayesian Big Data Classification: A Review with Complements. Mallick. Bani K.. stat.ME. 2014.
  405. Akbilgic . Oguz . Bozdogan . Hamparsum . Balaban . M. Erdal . 2014 . A novel Hybrid RBF Neural Networks model as a forecaster . Statistics and Computing . 24 . 3. 365–375 . 10.1007/s11222-013-9375-7. 17764829 .
  406. Jabin, Suraiya. "Stock market prediction using feed-forward artificial neural network." Int. J. Comput. Appl. (IJCA) 99.9 (2014).
  407. Yeh . I-Cheng . Che-hui . Lien . 2009 . The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients . Expert Systems with Applications . 36 . 2. 2473–2480 . 10.1016/j.eswa.2007.12.020. 15696161 .
  408. Lin . Shu Ling . 2009 . A new two-stage hybrid approach of credit risk in banking industry . Expert Systems with Applications . 36 . 4. 8333–8341 . 10.1016/j.eswa.2008.10.015.
  409. Yumo Xu and Shay B. Cohen. 2018. Stock Movement Prediction from Tweets and Historical Prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1979, Melbourne, Australia. Association for Computational Linguistics.
  410. Pelckmans . Kristiaan . et al . 2005 . The differogram: Non-parametric noise variance estimation and its use for model selection . Neurocomputing . 69 . 1. 100–122 . 10.1016/j.neucom.2005.02.015.
  411. Bay . Stephen D. . et al . 2000 . The UCI KDD archive of large data sets for data mining research and experimentation . ACM SIGKDD Explorations Newsletter . 2 . 2. 81–85 . 10.1145/380995.381030. 10.1.1.15.9776 . 534881 .
  412. Lucas . D. D. . et al . 2015 . Designing optimal greenhouse gas observing networks that consider performance and cost . Geoscientific Instrumentation, Methods and Data Systems . 4 . 1. 121 . 10.5194/gi-4-121-2015. 2015GI......4..121L . free .
  413. Pales . Jack C. . Keeling . Charles D. . 1965 . The concentration of atmospheric carbon dioxide in Hawaii . Journal of Geophysical Research . 70 . 24. 6053–6076 . 10.1029/jz070i024p06053 . 1965JGR....70.6053P.
  414. Sigillito, Vincent G., et al. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10.3 (1989): 262–266.
  415. Zhang, Kun, and Wei Fan. "Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond." Knowledge and Information Systems 14.3 (2008): 299–326.
  416. Reich, Brian J., Montserrat Fuentes, and David B. Dunson. "Bayesian spatial quantile regression." Journal of the American Statistical Association (2012).
  417. Kohavi . Ron . Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid . KDD . 96 . 1996 .
  418. Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.
  419. Bay . Stephen D . 2001 . Multivariate discretization for set mining . Knowledge and Information Systems . 3 . 4. 491–512 . 10.1007/pl00011680. 10.1.1.217.921 . 10945544 .
  420. Ruggles . Steven . 1995 . Sample designs and sampling errors . Historical Methods. 28 . 1. 40–46 . 10.1080/01615440.1995.9955312.
  421. Meek, Christopher, Bo Thiesson, and David Heckerman. "The Learning Curve Method Applied to Clustering." AISTATS. 2001.
  422. Fanaee-T . Hadi . Gama . Joao . 2013. Event labeling combining ensemble detectors and background knowledge . Progress in Artificial Intelligence . 2 . 2–3. 113–127 . 10.1007/s13748-013-0040-3 . 3345087 .
  423. Giot, Romain, and Raphaël Cherrier. "Predicting bikeshare system usage up to one day ahead." Computational intelligence in vehicles and transportation systems (CIVTS), 2014 IEEE symposium on. IEEE, 2014.
  424. Zhan . Xianyuan . et al . 2013 . Urban link travel time estimation using large-scale taxi data with partial information . Transportation Research Part C: Emerging Technologies . 33 . 37–49 . 10.1016/j.trc.2013.04.001. 2013TRPC...33...37Z .
  425. Moreira-Matias . Luis . et al . 2013 . Predicting taxi–passenger demand using streaming data . IEEE Transactions on Intelligent Transportation Systems. 14 . 3. 1393–1402 . 10.1109/tits.2013.2262376. 14764358 .
  426. Hwang . Ren-Hung . Hsueh . Yu-Ling . Chen . Yu-Ting . 2015 . An effective taxi recommender system based on a spatio-temporal factor analysis model . Information Sciences . 314 . 28–40 . 10.1016/j.ins.2015.03.068.
  427. H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Commun. ACM, 57(7):86–94, July 2014.
  428. http://pems.dot.ca.gov/ Caltrans PeMS
  429. Meusel, Robert, et al. "The Graph Structure in the Web—Analyzed on Different Aggregation Levels." The Journal of Web Science 1.1 (2015).
  430. Kushmerick, Nicholas. "Learning to remove internet advertisements." Proceedings of the third annual conference on Autonomous Agents. ACM, 1999.
  431. Fradkin, Dmitriy, and David Madigan. "Experiments with random projections for machine learning." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
  432. This data was used in the American Statistical Association Statistical Graphics and Computing Sections 1999 Data Exposition.
  433. Ma, Justin, et al. "Identifying suspicious URLs: an application of large-scale online learning." Proceedings of the 26th annual international conference on machine learning. ACM, 2009.
  434. Levchenko, Kirill, et al. "Click trajectories: End-to-end analysis of the spam value chain." Security and Privacy (SP), 2011 IEEE Symposium on. IEEE, 2011.
  435. Mohammad, Rami M., Fadi Thabtah, and Lee McCluskey. "An assessment of features related to phishing websites using an automated technique." Internet Technology And Secured Transactions, 2012 International Conference for. IEEE, 2012.
  436. Singh, Ashishkumar, et al. "Clustering Experiments on Big Transaction Data for Market Segmentation." Proceedings of the 2014 International Conference on Big Data Science and Computing. ACM, 2014.
  437. Bollacker, Kurt, et al. "Freebase: a collaboratively created graph database for structuring human knowledge." Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.
  438. Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
  439. Mesterharm, Chris, and Michael J. Pazzani. "Active learning using on-line algorithms." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
  440. Wang . Shusen . Zhang . Zhihua . 2013 . Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling . The Journal of Machine Learning Research . 14 . 1. 2729–2769 . 1303.4207 . 2013arXiv1303.4207W .
  441. Web site: The Pile . 2022-04-14 . pile.eleuther.ai.
  442. Web site: JSON Lines . 2022-04-14 . jsonlines.org.
  443. 2101.00027 . cs.CL . Gao . Leo . Biderman . Stella . The Pile: An 800GB Dataset of Diverse Text for Language Modeling . 2020-12-31 . Black . Sid . Golding . Laurence . Hoppe . Travis . Foster . Charles . Phang . Jason . He . Horace . Thite . Anish . Nabeshima . Noa . Presser . Shawn.
  444. Web site: OSCAR . 2023-08-12 . oscar-project.org.
  445. Ortiz Suarez, Pedro, et al. "Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures." CMLC-7, 2019. https://inria.hal.science/hal-02148693v1/file/Asynchronous_Pipeline_for_Processing_Huge_Corpora_on_Medium_to_Low_Resource_Infrastructures.pdf
  446. Abadji, Julien, et al. "Towards a Cleaner Document-Oriented Multilingual Crawled Corpus." LREC, 2022. https://aclanthology.org/2022.lrec-1.463.pdf
  447. Web site: Cohen . Vanya . OpenWebTextCorpus . 2023-01-09 . OpenWebTextCorpus . en.
  448. Web site: openwebtext · Datasets at Hugging Face . 2023-01-09 . huggingface.co. 16 November 2022 .
  449. Saulnier . Lucile . The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset . 2023 . cs.CL . 2303.03915 . en.
  450. Web site: BigScience Data · Datasets at Hugging Face . 2023-08-29 . huggingface.co. 29 August 2023 .
  451. Cattral . Robert . Oppacher . Franz . Deugo . Dwight . Evolutionary data mining with automatic rule generalization . https://web.archive.org/web/20190806015013/https://pdfs.semanticscholar.org/c068/ea7807367573f4b5f98c0681fca665e9ef74.pdf . dead . 2019-08-06 . Recent Advances in Computers, Computing and Communications . 2002 . 296–300. 18625415 .
  452. Burton . Ariel N. . Kelly . Paul H.J. . Performance prediction of paging workloads using lightweight tracing . Future Generation Computer Systems . Elsevier BV . 22 . 7 . 2006 . 0167-739X . 10.1016/j.future.2006.02.003 . 784–793.
  453. Bain . Michael . Muggleton . Stephen . Learning optimal chess strategies . Machine Intelligence . 13 . Oxford University Press, Inc. . 1994. 291–309 . 10.1093/oso/9780198538509.003.0012 . 978-0-19-853850-9 .
  454. Book: Quinlan, J.R. Learning Efficient Classification Procedures and Their Application to Chess End Games. Machine Learning: An Artificial Intelligence Approach. 1. 463–482 . 1983. 10.1007/978-3-662-12405-5_15. 978-3-662-12407-9.
  455. Book: Shapiro, Alen D. . Structured induction in expert systems . Addison-Wesley Longman Publishing Co., Inc. . 1987.
  456. Matheus . Christopher J. . Rendell . Larry A. . Constructive Induction on Decision Trees . IJCAI . 89 . 1989 .
  457. Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.
  458. Ruotsalo . Tuukka . Aroyo . Lora . Schreiber . Guus . 2009 . Knowledge-based linguistic annotation of digital cultural heritage collections . IEEE Intelligent Systems . 24 . 2 . 64–75 . 10.1109/MIS.2009.32 . 1871.1/9f6091aa-9596-46a9-9251-f11edeeb28b7 . 6667472 . 6 December 2018 . 16 August 2017 . https://web.archive.org/web/20170816023938/http://dare.ubvu.vu.nl/bitstream/handle/1871/24407/243319.pdf?sequence=3 . dead .
  459. Book: 1003.5956 . 10.1145/1935826.1935878 . Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms . Proceedings of the fourth ACM international conference on Web search and data mining . 2011 . Li . Lihong . Chu . Wei . Langford . John . Wang . Xuanhui . 297–306 . 9781450304931 . 744200 .
  460. Yeung, Kam Fung, and Yanyan Yang. "A proactive personalized mobile news recommendation system." Developments in E-systems Engineering (DESE), 2010. IEEE, 2010.
  461. Gass . Susan E. . Roberts . J. Murray . 2006 . The occurrence of the cold-water coral Lophelia pertusa (Scleractinia) on oil and gas platforms in the North Sea: colony growth, recruitment and environmental controls on distribution . Marine Pollution Bulletin . 52 . 5. 549–559 . 10.1016/j.marpolbul.2005.10.002. 16300800 . 2006MarPB..52..549G .
  462. Gionis . Aristides . Mannila . Heikki . Tsaparas . Panayiotis . 2007 . Clustering aggregation . ACM Transactions on Knowledge Discovery from Data . 1 . 1. 4 . 10.1145/1217299.1217303. 10.1.1.709.528 . 433708 .
  463. Obradovic, Zoran, and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Technical Report, Center for Information Science and Technology, Temple University, 2004.
  464. Van Der Putten . Peter . van Someren . Maarten . 2000 . CoIL challenge 2000: The insurance company case . Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report . 9 . 1–43 .
  465. Mao . K. Z. . 2002 . RBF neural network center selection based on Fisher ratio class separability measure . IEEE Transactions on Neural Networks. 13 . 5. 1211–1217 . 10.1109/tnn.2002.1031953. 18244518 .
  466. Olave . Manuel . Rajkovic . Vladislav . Bohanec . Marko . 1989 . An application for admission in public school systems . Expert Systems in Public Administration . 1 . 145–160 .
  467. 1212.2472 . Lizotte . Daniel J. . Madani . Omid . Greiner . Russell . Budgeted Learning of Naive-Bayes Classifiers . 2012 . cs.LG .
  468. Lebowitz . Michael . 1986 . Concept learning in a rich input domain: Generalization-based memory . Machine Learning: An Artificial Intelligence Approach . 2 . 193–214 . 9780934613002 .
  469. Yeh . I-Cheng . Yang . King-Jang . Ting . Tao-Ming . 2009 . Knowledge discovery on RFM model using Bernoulli sequence . Expert Systems with Applications . 36 . 3. 5866–5871 . 10.1016/j.eswa.2008.07.018.
  470. Lee . Wen-Chen . Cheng . Bor-Wen . 2011 . An intelligent system for improving performance of blood donation . Journal of Quality . 18 . 2. 173 .
  471. Schmidtmann, Irene, et al. "Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage." Abschlußbericht vom 11 (2009).
  472. Sariyar . Murat . Borg . Andreas . Pommerening . Klaus . 2011 . Controlling false match rates in record linkage using extreme value theory . Journal of Biomedical Informatics . 44 . 4. 648–654 . 10.1016/j.jbi.2011.02.008. 21352952 .
  473. Candillier, Laurent, and Vincent Lemaire. "Design and Analysis of the Nomao challenge Active Learning in the Real-World." Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD. 2012.
  474. Marquez, Ivan Garrido. "A Domain Adaptation Method for Text Classification based on Self-adjusted Training Approach." (2013).
  475. Nagesh, Harsha S., Sanjay Goil, and Alok N. Choudhary. "Adaptive Grids for Clustering Massive Data Sets." SDM. 2001.
  476. Kuzilek, Jakub, et al. "OU Analyse: analysing at-risk students at The Open University." Learning Analytics Review (2015): 1–16.
  477. Siemens, George, et al. Open Learning Analytics: an integrated & modularized platform. Diss. Open University Press, 2011.
  478. Barlacchi. Gianni. De Nadai. Marco. Larcher. Roberto. Casella. Antonio. Chitic. Cristiana. Torrisi. Giovanni. Antonelli. Fabrizio. Vespignani. Alessandro. Pentland. Alex. Lepri. Bruno. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Scientific Data. 2. 2015. 150055. 2052-4463. 10.1038/sdata.2015.55. 26528394. 4622222. 2015NatSD...250055B.
  479. Vanschoren J, van Rijn JN, Bischl B, Torgo L . 2013 . OpenML: networked science in machine learning . SIGKDD Explorations . 15 . 2 . 49–60 . 10.1145/2641190.2641198 . 1407.7722 . 4977460 .
  480. Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH . 2017 . PMLB: a large benchmark suite for machine learning evaluation and comparison . BioData Mining . 10 . 1 . 36 . 10.1186/s13040-017-0154-4 . 29238404 . 5725843 . 2017arXiv170300512O . 1703.00512 . free .
  481. Web site: Off The Shelf Datasets . appen.com . 30 December 2020.
  482. Web site: Open Source Datasets . appen.com . 30 December 2020.