I ran into Henry Lieberman (MIT) at the AAAI15 conference on Artificial Intelligence/Robotics/Cognitive Science (AI/Robotics/CogSci). Below you can find some code to explore MIT ConceptNet5, with an application to O*NET, the Department of Labor's database of occupations (nearly 1000), each broken down into tasks and many other useful items. Just download the data sets, which takes 10-15 minutes, and use the code below, or email me and I can send it to you…
See the very bottom for the goal of this exercise:
print("Congratulations! You have program access to O*NET occupations and tasks dictionaries!")
Henry suggested that I check out MIT’s ConceptNet5 (http://conceptnet5.media.mit.edu/), which can be thought of as a large collection of facts. The facts have been collected from many sources, some more reliable than others, so rather than call them facts, we should really call them assertions that people have made about the world. A lot of the assertions are simple commonsense – for example, “an apple is a fruit.” However, to make all these assertions useful to artificial intelligence programmers and others, the assertions have been encoded in ConceptNet5 in a special computer readable format; and so one would write something like [/r/IsA, /c/en/apple, /c/en/fruit], where /r/ says what comes next is a relation, and /c/en/ says what comes next is an English concept.
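To make the format concrete, here is a minimal sketch that splits one bracketed assertion into its relation and two concepts. The names parse_assertion and is_english are my own, made up for illustration, and the trailing-slash form /a/[/r/IsA/,/c/en/apple/,/c/en/fruit/] is an assumption about how the assertion appears in the CSV files; none of this is an official ConceptNet API.

```python
def parse_assertion(uri):
    """Split a bracketed assertion into (relation, start, end).

    Returns None when the bracketed part does not hold exactly three items.
    """
    left = uri.find("[")
    right = uri.rfind("]")
    if left < 0 or right < left:
        return None
    parts = uri[left + 1:right].split(",")
    if len(parts) != 3:
        return None
    # Tolerate optional spaces after commas and optional trailing slashes.
    relation, start, end = (part.strip().rstrip("/") for part in parts)
    return relation, start, end


def is_english(concept):
    """True when the concept sits in the English namespace /c/en/."""
    return concept.startswith("/c/en/")


triple = parse_assertion("/a/[/r/IsA/,/c/en/apple/,/c/en/fruit/]")
print(triple)
print(all(is_english(concept) for concept in triple[1:]))
```

A concept that itself contains a comma would not survive the simple split, so treat this as a starting point rather than a complete parser.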
ConceptNet5 contains over 10 million assertions, and about 7 million of those are in English. There are 50 relations and 2,798,486 English concepts. As a graph, the English part alone has about 3 million nodes and 7 million edges (arcs) connecting them. That is a pretty big graph or network.
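To picture what "nodes and edges" means here, the sketch below treats each assertion as one labeled edge between two concept nodes. The three triples are made-up sample data, and build_graph and graph_size are hypothetical helper names, not part of ConceptNet5.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency map: start concept -> list of (relation, end concept)."""
    graph = defaultdict(list)
    for relation, start, end in triples:
        graph[start].append((relation, end))
    return graph


def graph_size(triples):
    """Return (number of distinct nodes, number of edges) for the triples."""
    nodes = set()
    for _relation, start, end in triples:
        nodes.add(start)
        nodes.add(end)
    return len(nodes), len(triples)


# Made-up sample assertions as (relation, start, end) triples.
sample = [
    ("IsA", "apple", "fruit"),
    ("IsA", "pear", "fruit"),
    ("PartOf", "core", "apple"),
]
graph = build_graph(sample)
print(graph["apple"])
print(graph_size(sample))
```

At the scale of the real data, a plain dictionary like this may need a lot of memory, so it is only meant to illustrate the idea.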
Anyone who wants the latest version of ConceptNet5 can download it from here: http://conceptnet5.media.mit.edu/downloads/v5.3/
I downloaded the CSV version, and wrote a simple Python program to help me peer inside.
Here is what I found:
Here are the 50 relations that I found, and how frequently they are used to connect to English concepts:
[[‘IsA’, 4762459], [‘RelatedTo’, 1114866], [‘PartOf’, 613269], [‘AtLocation’, 321706], [‘EtymologicallyDerivedFrom’, 192612], [‘DerivedFrom’, 93861], [‘Synonym’, 84240], [‘UsedFor’, 40596], [‘CapableOf’, 33082], [‘HasSubevent’, 23730], [‘Antonym’, 23173], [‘HasPrerequisite’, 21821], [‘HasProperty’, 17446], [‘Causes’, 16279], [‘MotivatedByGoal’, 14079], [‘MemberOf’, 10770], [‘SimilarTo’, 10121], [‘ReceivesAction’, 9354], [‘HasA’, 8615], [‘HasContext’, 7862], [‘InstanceOf’, 7584], [‘DefinedAs’, 5641], [‘CausesDesire’, 4515], [‘Desires’, 4471], [‘NotDesires’, 3755], [‘HasFirstSubevent’, 3727], [‘wordnet/adjectivePertainsTo’, 3309], [‘TranslationOf’, 2781], [‘HasLastSubevent’, 2690], [‘NotCapableOf’, 2542], [‘wordnet/adverbPertainsTo’, 2520], [‘MadeOf’, 1906], [‘NotHasProperty’, 1003], [‘Attribute’, 560], [‘CreatedBy’, 444], [‘NotIsA’, 420], [‘NotHasA’, 366], [‘Entails’, 356], [‘DesireOf’, 237], [‘InheritsFrom’, 156], [‘SymbolOf’, 144], [‘LocatedNear’, 83], [‘wordnet/participleOf’, 53], [‘LocationOfAction’, 40], [‘NotMadeOf’, 21], [‘SimilarSize’, 8], [‘CompoundDerivedFrom’, 2], [‘NotCauses’, 1], [‘HasPainIntensity’, 1], [‘HasPainCharacter’, 1]]
Here are the top 400 English concepts I found:
[[‘place’, 458623], [‘person’, 372535], [‘populate_place’, 314554], [‘settlement’, 282695], [‘specie’, 160865], [‘eukaryote’, 157531], [‘organisation’, 147159], [‘athlete’, 135201], [‘animal’, 115268], [‘musical_work’, 92150], [‘architectural_structure’, 91854], [‘album’, 90849], [‘village’, 82999], [‘soccer_player’, 58769], [‘build’, 56700], [‘artist’, 53704], [‘film’, 52730], [‘infrastructure’, 46068], [‘insect’, 44132], [‘plant’, 39688], [‘natural_place’, 39221], [‘company’, 35486], [‘town’, 33524], [‘educational_institution’, 32907], [‘single’, 31946], [‘mean_of_transportation’, 31604], [‘write_work’, 30422], [‘administrative_region’, 27698], [‘musical_artist’, 27183], [‘body_of_water’, 26173], [‘office_holder’, 23408], [‘band’, 22847], [‘school’, 22698], [‘software’, 22422], [‘book’, 22403], [‘mollusca’, 21282], [‘broadcaster’, 21169], [‘ship’, 20790], [‘politician’, 20651], [‘event’, 20319], [‘route_of_transportation’, 20076], [‘river’, 19185], [‘stream’, 18402], [‘television_show’, 17129], [‘military_person’, 17021], [‘populate_area’, 15714], [‘station’, 15424], [‘business’, 15284], [‘unite_state’, 15121], [‘video_game’, 14993], [‘radio_station’, 14910], [‘road’, 14699], [‘city’, 13956], [‘baseball_player’, 13709], [‘fish’, 13337], [‘unincorporated_area’, 13231], [‘sport_team’, 12992], [‘actor’, 12893], [‘soccer_club’, 12705], [‘university’, 12412], [‘bird’, 12239], [‘gridiron_football_player’, 11465], [‘planet’, 11404], [‘mountain’, 11187], [‘writer’, 10916], [‘protein’, 10795], [‘military_unit’, 10352], [‘site’, 10312], [‘scientist’, 10246], [‘device’, 10230], [‘airport’, 9858], [‘organization’, 9776], [‘club’, 9386], [‘movie’, 9302], [‘musical_composition’, 9278], [‘periodical_literature’, 9274], [‘flower_plant’, 9222], [‘live_thing’, 9018], [‘fowl’, 8859], [‘greenery’, 8851], [‘township’, 8831], [‘computer_game’, 8821], [‘educational_organization’, 8820], [‘watercourse’, 8818], [‘military_personnel’, 8806], [‘tv_show’, 8754], [‘musical_performer’, 8746], 
[‘underspecified_location’, 8734], [‘software_object’, 8730], [‘music_single’, 8726], [‘soccer_manager’, 8689], [‘ice_hockey_player’, 8686], [‘self-powered_vehicle’, 8671], [‘single-broadcast_tv_show’, 8464], [‘lake’, 8436], [‘military_conflict’, 8389], [‘mammal’, 8318], [‘troop’, 8224], [‘car_engine’, 8174], [‘street’, 8168], [‘football_player’, 7790], [‘facility’, 7768], [‘mollusk’, 7663], [‘masovian_voivodeship’, 7563], [‘privately_hold_company’, 7548], [‘cricketer’, 7546], [‘fungus’, 7462], [‘soccer_coach’, 7407], [‘fictitious_character’, 7265], [‘conflict’, 7255], [‘fictional_character’, 7055], [‘rugby_player’, 6897], [‘state_school’, 6772], [‘drug’, 6758], [‘car’, 6753], [‘chemical_compound’, 6744], [‘chemical_substance’, 6398], [‘amphibian’, 6277], [‘aircraft’, 6273], [‘census-designated_place’, 6263], [‘american_football_player’, 6245], [‘television_episode’, 6244], [‘reptile’, 6149], [‘california’, 6131], [‘stadium’, 6050], [‘disease’, 5777], [‘television_station’, 5679], [‘cleric’, 5623], [‘language’, 5461], [‘non-‘, 5339], [‘-ly’, 5310], [‘british_royalty’, 5143], [‘protect_area’, 5045], [‘great_poland_voivodeship’, 5036], [‘model_airplane’, 4961], [‘public_company’, 4789], [‘clergyman’, 4783], [‘historic_place’, 4730], [‘newspaper’, 4658], [‘song’, 4592], [‘un-‘, 4569], [‘sport_event’, 4567], [‘cartoon_character’, 4457], [‘weapon’, 4415], [‘member_of_parliament’, 4306], [‘illness’, 4289], [‘comic_character’, 4285], [‘college_coach’, 4250], [‘geographical_region’, 4250], [‘\\xc5\\x82\\xc3\\xb3d\\xc5\\xba_voivodeship’, 4187], [‘township/n/united_states’, 4087], [‘private_school’, 3858], [‘-ness’, 3843], [‘automobile’, 3843], [‘new_york’, 3841], [‘khuzestan_province’, 3787], [‘basketball_player’, 3746], [‘country’, 3725], [‘lublin_voivodeship’, 3706], [‘mineral’, 3691], [‘anatomical_structure’, 3642], [‘royal_family’, 3616], [‘warmian-masurian_voivodeship’, 3603], [‘podlaskie_voivodeship’, 3588], [‘canada’, 3550], [‘body_part’, 3505], 
[‘musical_composition_song’, 3496], [‘election’, 3433], [‘public_university’, 3432], [‘village_development_committee’, 3389], [‘wisconsin’, 3334], [‘razavi_khorasan_province’, 3322], [‘ohio’, 3315], [‘human_settlement’, 3231], [‘manager’, 3228], [‘unite_kingdom’, 3225], [‘academic_journal’, 3179], [‘someone’, 3164], [‘pomeranian_voivodeship’, 3145], [‘pennsylvania’, 3139], [‘west_virginia’, 3080], [‘england’, 3071], [‘minnesota’, 3034], [‘west_pomeranian_voivodeship’, 3033], [‘kuyavian-pomeranian_voivodeship’, 2979], [‘magazine’, 2975], [‘bridge’, 2954], [‘island’, 2935], [‘extend_play’, 2933], [‘civil_township’, 2916], [‘christian_bishop’, 2901], [‘indiana’, 2900], [‘illinois’, 2883], [‘studio_album’, 2862], [‘historic_build’, 2837], [‘texas’, 2827], [‘something’, 2806], [‘sale_contract’, 2797], [‘florida’, 2757], [‘china’, 2724], [‘-ite’, 2712], [‘you’, 2670], [‘india’, 2659], [‘airline’, 2658], [‘germany’, 2635], [‘cyclist’, 2618], [‘arachnid’, 2606], [‘museum’, 2590], [‘lorestan_province’, 2587], [‘commune_of_romania’, 2584], [‘oxygen’, 2577], [‘-er’, 2553], [‘saint’, 2553], [‘michigan’, 2505], [‘low_silesian_voivodeship’, 2503], [‘human’, 2487], [‘political_party’, 2485], [‘skyscraper’, 2483], [‘name’, 2458], [‘new_york_city’, 2453], [‘live_album’, 2449], [‘london’, 2443], [‘water’, 2440], [‘food’, 2377], [‘kerman_province’, 2368], [‘governor’, 2360], [‘president’, 2303], [‘australian_rule_football_player’, 2295], [‘game’, 2266], [‘wrestler’, 2254], [‘community_group’, 2223], [‘-like’, 2211], [‘bishop’, 2184], [‘website’, 2168], [‘\\xc5\\x9bwi\\xc4\\x99tokrzyskie_voivodeship’, 2147], [‘activity’, 2135], [‘crustacean’, 2091], [‘play’, 2086], [‘hospital’, 2086], [‘locomotive’, 2083], [‘private_university’, 2081], [‘action’, 2074], [‘martial_artist’, 2066], [‘public_transit_system’, 2056], [‘non-profit_organisation’, 2052], [‘village/n/united_states’, 2047], [‘monarch’, 2039], [‘mixed-sex_education’, 2034], [‘-less’, 2023], [‘gaelic_game_player’, 2022], 
[‘municipality_of_brazil’, 2021], [‘municipality_of_switzerland’, 2006], [‘give_name’, 1991], [‘government_agency’, 1990], [‘australia’, 1979], [‘moon’, 1941], [‘manga’, 1932], [‘comic_creator’, 1925], [‘gmina’, 1914], [‘less_poland_voivodeship’, 1911], [‘supreme_court_of_unite_state_case’, 1903], [‘legal_case’, 1897], [‘subsidiary’, 1892], [‘japan’, 1892], [‘france’, 1889], [‘record_label’, 1876], [‘president_of_organization’, 1872], [‘high_school’, 1860], [‘reservoir’, 1847], [‘boxer’, 1845], [‘shop_mall’, 1834], [‘compilation_album’, 1831], [‘golf_player’, 1830], [‘child’, 1816], [‘hydrogen’, 1811], [‘kentucky’, 1810], [‘ontario’, 1784], [‘tennis_player’, 1782], [‘congressman’, 1782], [‘virginia’, 1775], [‘journalist’, 1762], [‘horse’, 1755], [‘tamil_nadu’, 1733], [‘prime_minister’, 1733], [‘judge’, 1710], [‘hormozgan_province’, 1708], [‘castile_and_le\\xc3\\xb3n’, 1708], [‘dog’, 1707], [‘unite_state_representative’, 1690], [‘podkarpackie_voivodeship’, 1664], [‘iowa’, 1643], [‘time’, 1641], [‘mountain_range’, 1632], [‘work’, 1624], [‘south_khorasan_province’, 1617], [‘figure_skater’, 1615], [‘styria/n/slovenia’, 1608], [‘shop_center’, 1585], [‘district_of_peru’, 1580], [‘bone’, 1576], [‘park’, 1568], [‘low_carniola’, 1563], [‘music’, 1559], [‘house’, 1558], [‘anti-‘, 1550], [‘new_jersey’, 1545], [‘criminal’, 1539], [‘railway_line’, 1524], [‘tree’, 1524], [‘-able’, 1502], [‘oregon’, 1499], [‘kansa’, 1499], [‘spain’, 1497], [‘sound’, 1478], [‘central_province_sri_lanka’, 1475], [‘iron’, 1473], [‘man’, 1469], [‘-eth’, 1467], [‘protein_molecule’, 1465], [‘this’, 1463], [‘-est’, 1443], [‘municipality_of_spain’, 1443], [‘italy’, 1435], [‘wood’, 1419], [‘write’, 1413], [‘missouri’, 1409], [‘body’, 1406], [‘metal’, 1404], [‘cell’, 1403], [‘salt’, 1382], [‘dance’, 1381], [‘adult_actor’, 1350], [‘karnataka’, 1342], [‘light’, 1338], [‘crater’, 1332], [‘computer’, 1331], [‘woman’, 1329], [‘football_match’, 1325], [‘lubusz_voivodeship’, 1324], [‘paint’, 1322], [‘-y’, 1322], 
[‘eye’, 1313], [‘it’, 1306], [‘head’, 1297], [‘republika_srpska’, 1287], [‘lunar_crater’, 1278], [‘west_bengal’, 1277], [‘trade_union’, 1274], [‘silesian_voivodeship’, 1264], [‘i’, 1254], [‘non-profit_organization’, 1252], [‘money’, 1243], [‘color’, 1239], [‘municipality’, 1239], [‘cat’, 1236], [‘eat’, 1233], [‘model’, 1230], [‘wind’, 1223], [‘north_carolina’, 1221], [‘genus’, 1220], [‘philosopher’, 1208], [‘home’, 1201], [‘andhra_pradesh’, 1200], [‘flower’, 1200], [‘like’, 1199], [‘power_station’, 1196], [‘markazi_province’, 1186], [‘drink’, 1183], [‘musical’, 1179], [‘paper’, 1178], [‘good’, 1174], [‘product’, 1173], [‘los_angeles’, 1171], [‘opole_voivodeship’, 1167], [‘norway’, 1162], [‘municipality_of_mexico’, 1138], [‘race_car_driver’, 1131], [‘family’, 1131], [‘award’, 1128], [‘toronto’, 1125], [‘earth’, 1124], [‘sport_league’, 1116], [‘hand’, 1109], [‘washington_d._c’, 1101], [‘oklahoma’, 1100], [‘lighthouse’, 1092], [‘arkansas’, 1090], [‘point’, 1090], [‘massachusetts’, 1088], [‘township/n/pennsylvania’, 1088], [‘party’, 1079], [‘line’, 1075], [‘northeast_region_brazil’, 1073], [‘infantry’, 1070], [‘state’, 1068], [‘new_england_town’, 1059], [‘hotel’, 1054], [‘bed’, 1049], [‘mayor’, 1046], [‘plate’, 1045]]
If you want to do the same, here are some tips:
Just create a file named “foo.py” in your Mac’s “Documents” folder to contain this:
# === Step 1 begin
# File "foo.py" should contain this code
# Google("python comments") – https://docs.python.org/2/tutorial/introduction.html
# Comments are # for one-line, triple quotes for multi-line
print("")
print("Step 1 begin")
# Install Python – in Applications folder on Mac
# Google("python mac install") – https://www.python.org/downloads/mac-osx/
# Download and install latest version – Latest Python 3 Release – Python 3.4.2
# Start IDLE – in Applications folder on Mac, under Python folder double-click IDLE
# In IDLE, type "import foo"; recall that "foo.py" is in your Mac Documents folder
import os
import string
print("")
print("Step 1 complete")
print("Congratulations! You have Python installed, launched IDLE, and imported foo.py")
# === Step 1 complete
# === Step 2 begin
# Building a dictionary of relationships and concepts
# MIT ConceptNet5 was downloaded as CSV files
print("")
print("Step 2 begin")
# Module-level names; ac_step2 fills in the dictionaries and lists
path2assertions = "/Users/username/Documents/MyPythonFoo/MITConceptNet5_assertions"
def ac_step2(p):
    "AC = Assertion Curate function"
    global dictrelations, dictconcepts, listrelations, listconcepts
    dictrelations = {}
    dictconcepts = {}
    listrelations = []
    listconcepts = []
    files = 0
    Total_assertions = 0
    English_assertions = 0
    for i in os.listdir(p): # loop through CSV files in MITConceptNet5_assertion path p
        y = p + "/" + i
        if not(os.path.isfile(y)):
            print("Need to curate assertions",y)
            continue # skip anything that is not a regular file
        files = files + 1
        ff = open(y, mode = 'rb')
        try:
            while True:
                x = ff.readline()
                if x != b'':
                    Total_assertions = Total_assertions + 1 # count only real lines, not end-of-file
                    x = repr(x) # bytes to text; non-ASCII bytes show up as \x.. escapes
                    y1 = "["
                    y2 = "]"
                    x1 = x.find(y1)
                    x2 = x.find(y2)
                    z = x[x1:x2+1] # the bracketed assertion [relation,concept1,concept2]
                    z1 = z.find("/c/en/")
                    z2 = z.rfind("/c/en/")
                    if z1 > 0:
                        if not z1 == z2: # "/c/en/" occurs twice, so both concepts are English
                            English_assertions = English_assertions + 1
                            w1 = z.find(",")
                            w2 = z.rfind(",")
                            r = z[1:w1]
                            r = r[3:len(r)-1] # drop the leading "/r/" and the final character
                            e1 = z[w1+1:w2]
                            e1 = e1[6:len(e1)-1] # drop the leading "/c/en/" and the final character
                            e2 = z[w2+1:len(z)-1]
                            e2 = e2[6:len(e2)-1]
                            b = [r,e1,e2]
                            # print(b)
                            dictrelations[r] = dictrelations.get(r, 0) + 1
                            dictconcepts[e1] = dictconcepts.get(e1, 0) + 1
                            dictconcepts[e2] = dictconcepts.get(e2, 0) + 1
                else:
                    break
        except UnicodeDecodeError:
            print("UnicodeDecodeError:", y)
        print("Files:",files,"Total Assertions:", Total_assertions, "English_assertions", English_assertions)
        ff.close()
    # create list from dictionary, and then sort the list
    for i in dictrelations:
        listrelations.append([i,dictrelations[i]])
    listrelations.sort(key=lambda n: n[1], reverse=True)
    print("")
    print("Number of unique relations =", len(listrelations))
    print("Top 50 relations =", listrelations[0:50])
    for i in dictconcepts:
        listconcepts.append([i,dictconcepts[i]])
    listconcepts.sort(key=lambda n: n[1], reverse=True)
    print("")
    print("Number of unique concepts =", len(listconcepts))
    print("Top 400 concepts =", listconcepts[0:400])
ac_step2(path2assertions)
print("")
print("Step 2 complete")
print("Congratulations! The MIT ConceptNet5 CSV files look good")
print("Checked Folder: MyPythonFoo/MITConceptNet5_assertions")
# === Step 2 complete
When you run "foo.py" in Python IDLE, it will generate the information I described above, showing the relations and the English concepts.
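The counting-and-sorting part of Step 2 can also be sketched with collections.Counter from the Python standard library. This is an alternative, not what foo.py does; the three sample triples are made up for illustration and count_triples is a hypothetical helper name.

```python
from collections import Counter


def count_triples(triples):
    """Count relation and concept frequencies over (relation, start, end) triples."""
    relations = Counter()
    concepts = Counter()
    for relation, start, end in triples:
        relations[relation] += 1
        concepts[start] += 1
        concepts[end] += 1
    return relations, concepts


# Made-up sample data, standing in for parsed English assertions.
sample = [
    ("IsA", "apple", "fruit"),
    ("IsA", "pear", "fruit"),
    ("PartOf", "core", "apple"),
]
relations, concepts = count_triples(sample)

# most_common(n) returns the n most frequent (item, count) pairs, highest first.
print(relations.most_common(50))
print(concepts.most_common(2))
```

Counter does the "add one or start at one" bookkeeping itself, and most_common replaces the manual list building and sorting.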
Have fun!!!
O*NET analysis code:
# Step 8 below
# === Step 8 begin
# Build occupation and task dictionary (BOTD)
# This is processing some O*NET data from http://www.onetcenter.org/database.html?p=2
# Download the data as a zip file that expands to a folder called db_19_0
# Rename the folder onet_db_19_0 and place it somewhere where you can process it
# The list of 900+ occupations is in a file called "Occupation Data.txt"
# The list of 10 or so task descriptions for each occupation is in a file called "Task Statements.txt"
# Read in the two files and create a dictionary with all the information
# First test that both files exist
#
# When ready to test this code…
# Under the Run menu, select Check Module and then Run Module
print(" ")
print("Step 8 begin")
# import os (operating system functions)
import os
# set x to the current working directory
# x = os.getcwd()
x = '/Users/jamesspohrer/Documents'
# x should be something like '/Users/jamesspohrer/Documents'
y = '/MyPythonRebuild/JimSpohrerOtherDataSets/onet_db_19_0'
z = x + y
# make sure the path exists to the O*NET directory
if os.path.exists(z):
    print("O*NET directory onet_db_19_0 exists!")
else:
    print("Error: O*NET directory onet_db_19_0 not found!")
# make sure the two files, occupations and tasks exist
f1 = z + '/Occupation Data.txt'
f2 = z + '/Task Statements.txt'
if os.path.isfile(f1):
    print("O*NET file /onet_db_19_0/Occupation Data.txt exists!")
else:
    print("Error: O*NET file /onet_db_19_0/Occupation Data.txt does not exist!")
if os.path.isfile(f2):
    print("O*NET file /onet_db_19_0/Task Statements.txt exists!")
else:
    print("Error: O*NET file /onet_db_19_0/Task Statements.txt does not exist!")
# Module-level names; botd_step8 fills in dictoccupation, dicttask and listoccupation
path2onet = z
def botd_step8(p):
    "botd_step8 = Build occupation and task dictionaries"
    global dictoccupation,dicttask,listoccupation
    dictoccupation = {}
    dicttask = {}
    listoccupation = []
    f1 = p + '/Occupation Data.txt'
    f2 = p + '/Task Statements.txt'
    ff1 = open(f1, mode = 'r', encoding='utf-8')
    x = ff1.readline() # skip header line
    y = 0
    c1 = "\t"
    occupation_code = 0
    occupation_description = ""
    try:
        while True:
            x = ff1.readline()
            # if y > 20:
            #     print("Line", x)
            #     print("dictoccupation length", len(dictoccupation))
            #     break
            y = y + 1
            if x != "":
                # print("Line:",y,repr(x))
                # first three tab-separated columns: code, title, description
                occupation_code, occupation_title, occupation_description = (x.rstrip("\r\n").split(c1) + ["", ""])[:3]
                # print(occupation_code)
                # print(occupation_title)
                # print(occupation_description)
                if occupation_code in dictoccupation:
                    z = dictoccupation[occupation_code]
                    z[0] = z[0] + 1
                else:
                    dictoccupation[occupation_code] = [1,occupation_title,occupation_description]
            else:
                break
    except UnicodeDecodeError:
        print("UnicodeDecodeError:", y)
    print("Done with Occupation Data dictionary, number of occupations",y-1)
    ff1.close()
    # create list from dictionary, and then sort the list
    for i in dictoccupation:
        listoccupation.append([i,dictoccupation[i][0]])
    listoccupation.sort(key=lambda n: n[1], reverse=True)
    ff2 = open(f2, mode = 'r', encoding='utf-8')
    x = ff2.readline() # skip header line
    y = 0
    c1 = "\t"
    occupation_code = 0
    task_code = 0
    task_description = ""
    try:
        while True:
            x = ff2.readline()
            # if y > 19494:
            #     print("Line", x)
            #     print("dicttask length", len(dicttask))
            #     break
            y = y + 1
            if x != "":
                # print("Line:",y,repr(x))
                # first three tab-separated columns: occupation code, task code, task description
                occupation_code, task_code, task_description = (x.rstrip("\r\n").split(c1) + ["", ""])[:3]
                # print(occupation_code)
                # print(task_code)
                # print(task_description)
                if task_code in dicttask:
                    z = dicttask[task_code]
                    z[0] = z[0] + 1
                    z[1] = z[1] + [occupation_code]
                else:
                    dicttask[task_code] = [1,[occupation_code],task_description]
            else:
                break
    except UnicodeDecodeError:
        print("UnicodeDecodeError:", y)
    print("Done with Task Statements dictionary, number of tasks",y-1)
    ff2.close()
botd_step8(path2onet)
print("")
print("Step 8 complete")
print("Congratulations! You have program access to O*NET occupations and tasks dictionaries!")
# === Step 8 completed
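Once the dictionaries exist, a natural next step is to look up the tasks for one occupation. The sketch below assumes the shapes built in Step 8, dictoccupation[code] = [count, title, description] and dicttask[task_code] = [count, [occupation codes], task description]. The entries are tiny made-up placeholders, not real O*NET rows, and tasks_for_occupation is a hypothetical helper name.

```python
def tasks_for_occupation(occupation_code, dicttask):
    """Return the task descriptions whose occupation list contains occupation_code."""
    return [
        description
        for _count, occupation_codes, description in dicttask.values()
        if occupation_code in occupation_codes
    ]


# Made-up entries in the same shape that botd_step8 builds.
dictoccupation = {
    "00-0000.00": [1, "Example Occupation", "Placeholder description."],
}
dicttask = {
    "1": [1, ["00-0000.00"], "Placeholder task one."],
    "2": [1, ["00-0000.00"], "Placeholder task two."],
    "3": [1, ["99-9999.00"], "Task for a different occupation."],
}

code = "00-0000.00"
print(dictoccupation[code][1])
for task in tasks_for_occupation(code, dicttask):
    print(" -", task)
```

This scans every task on each call, which should be fine for a few tens of thousands of rows; if lookups become frequent, building an occupation-to-tasks dictionary once would be the obvious refinement.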