
9 Biggest Banks’ Derivative Exposure – $228.72 Trillion

These days we hear about billions spent here and trillions in debt there. But few people can actually visualize the magnitude of these figures or how they fit into the big picture. I have found that infographics, when done right, can present complex and dry information in a simple and interesting way.

With this particular infographic you don’t need a finance degree to understand how extremely serious the world derivative situation has become. No government in the world has this kind of money; it is roughly three times the size of the entire world economy. This unregulated market presents a massive financial risk, and the corruption and immorality of the banks push the world economy ever closer to the financial cliff.

Click here to see the entire eye-opening infographic from Demonocracy.info



Blog Post

  • Post to blog by 8am on Wednesday 10/24 a link to an example of a powerful graphical representation of data. Say what data is being represented, and why you think it’s done in a powerful way.


  • Review what you’ve produced
  • Theorize additional possibilities



The transformation and innovation of the internet and its ever-growing databases can seem like a very complicated shift. But it can be explained simply by looking at the transformations our world has gone through and is going through, and by trying to plan for and predict future ones. The technological advances we have experienced are sometimes taken for granted or not even noticed, but that’s not the point I am trying to make. The point is that technological innovations drive our world and, more specifically, our daily lives. As you can see from the reading, more and more people are working on moving their work online. The benefits of being online go far beyond what anyone could have imagined: work published online can be accessed by anyone across the world, so people everywhere can contribute input, corrections, and additions. It also means the work will never disappear; once it is online, especially in multiple places, it is practically impossible for it to vanish. Of course, there are downsides to being online as well.

My group picked a few topics to focus on. One of them is the effect that the internet and its tools, like Twitter, Facebook, and other social websites, have on presidential elections. We could look at status updates from various people, including the presidential and vice-presidential candidates and other informed commentators, and see how the discussion on those posts develops in favor of a candidate. Also, since most younger people use these social websites, campaigns can target them there and educate them about the elections. Danny Hayes points out that the way candidates are portrayed via social media affects whether people vote for them; he also shows that the way things are phrased makes a difference to the public. This mainly concerns younger people, by which I mean the generation that actually uses the internet. As the internet becomes the primary (that is, main) source people get their information from, it also divides the voters, since some do not use the internet at all. That is perfectly fine, as long as the source is credible and you double-check the information presented to you on TV, in newspapers, or on the radio.



With all the debates that the presidential candidates hold before the election, there usually ends up being too much data for any one person to process and analyze. For our group, Contra, to examine the correlation between the War on Drugs and presidential elections, we would use data mining software to weed out information that is not pertinent to our project when we analyze presidential debates.

As the War on Drugs is not a widely discussed topic in presidential elections, data mining software will help us find out exactly how often phrases such as “drugs,” “cartels,” and “anti-(fill in whatever drug)” come up during the debates. We would then use this information to draw conclusions about how much emphasis is put on the War on Drugs during presidential elections, and hopefully answer some of the questions our group posed for the project.
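As a rough illustration of how such a phrase count might work, here is a minimal Python sketch; the transcript excerpt and the phrase list below are invented for the example, not taken from an actual debate:

```python
from collections import Counter
import re


def count_phrases(text, phrases):
    """Count case-insensitive occurrences of each phrase in a transcript."""
    lowered = text.lower()
    return {p: len(re.findall(re.escape(p.lower()), lowered)) for p in phrases}


# Hypothetical transcript excerpt, for illustration only.
transcript = (
    "We must confront the cartels at the border. "
    "Drugs are destroying communities, and the war on drugs "
    "needs a new strategy to stop the cartels."
)

# Note: the count for "drugs" also includes the word inside "war on drugs".
counts = count_phrases(transcript, ["drugs", "cartels", "war on drugs"])
print(counts)
```

Running the same function over full debate transcripts would give the raw frequencies the post describes.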


Secondary Sources for Group 1:



This document goes into great detail about how the Internet became a factor during Clinton’s presidency, how Bush Jr. was the first president to establish a Twitter account, and how social media and the Internet became an integral part of the election process from that point on.

2.) Does the Messenger Matter? Candidate-Media Agenda Convergence and Its Effects on Voter

Issue Salience

Using data from the early stages of the 2006 Texas midterm election, Danny Hayes shows how media coverage of candidates affects voters’ willingness to vote for them. Hayes’s findings show the value, for candidates, of enlisting new media to help pass along their message and strengthen their influence as political candidates.

3.) Campaign Politics and the Digital Divide: Constituency Characteristics, Strategic Considerations, and Candidate Internet Use in State Legislative Elections

In this piece, the authors discuss how the internet divides voters. Just as the elderly, the less well educated, and some minorities are less likely to use the Internet than other Americans, candidates for lower-level offices are less likely to use it than presidential and congressional candidates.



Text mining will be an essential component in visualizing the answers to the questions Contra will present. Since presidential debate transcripts are so widely available, we can use data mining to find out exactly how many times the War on Drugs was mentioned by the candidates. This gives us a statistical basis for comparing those mentions against the funding pumped into the War on Drugs in the subsequent presidential term. In the current election we see this domestic war scarcely mentioned, if at all, by either candidate. I hypothesize that a correlation could emerge showing that the war continues to receive increased funding even as it becomes less of a platform for campaigns to run on. Highlighting this information will open up further questions, such as: why? We could also illustrate whether states support an increase or decrease in the war’s budget, by seeking out secondary sources in which different representatives give their personal opinions on the campaign.


Being in the midst of the debates, as we are today, it would be hard to read any newspaper or watch the nightly news without hearing of the presidential and vice-presidential debates. Pundits and analysts are quick to jump on every word, interruption, and perceived mistake in order to determine who “won.” With this in mind, our group, The Instigators, aims to analyze all available information to see if the debates truly make a difference. Many believe that at this point in the campaign cycle most voters have already decided well in advance who they plan to vote for; however, for those still on the fence, could the debates truly sway them in either direction?


Given these circumstances, data mining will serve as an invaluable tool for examining the wealth of information released in response to the debates. Finding correlations among live, real-time reactions from online sources such as Twitter, Facebook, and various RSS feeds will shed light on the question of voter impact. It is imperative to approach the question of the debates’ importance from many different angles in order to provide a nuanced response to a complex question. While many individual sources will claim to name the “winner” of each debate, the data our group gathers may not reduce to anything as simple as yes/no or Democrat/Republican.


As a sub-focus, it may also be important to mine data regarding third party candidates and their lack of inclusion in all of the debates.



One of the main questions my group is considering basing our project on is: how often do presidents and opposing candidates break or keep the promises they make while in office or running for office? The idea of text mining is to systematically read through articles and other written research and find patterns in it. These patterns consist of repeated statements or words that a person (in this case, a president or opposing candidate) uses. This is a very useful approach for our research, because when candidates run for president, or any type of office, they usually use specific phrases to get the public’s attention and repeat those phrases constantly whenever they are in public.

As my group member Tatsiana stated in her post, we will attempt to compare what the president did during his term with what he said while campaigning. Unfortunately, the class only runs until December, so we will not be able to follow the entire four-year term of the president elected this time, but hopefully we will get an idea of what he will do up until the project’s due date. Along with the current election, we will also look at past presidencies and make the same comparison. I mention this because both professors use the current election as a topic for postings and for class discussion, so it seems only appropriate to integrate this election into our project.


Throughout the process of gathering information and data to support our argument about the effect of debates on voter behavior, text mining will be pertinent. Going through large amounts of numbers and words to extract and portray what is important and essential for the project is the basis of text mining. Many broadcast companies now have a system in which they give a room full of undecided voters a device that records their reactions to what the candidates are saying. Word by word, topic by topic, we can see which ideas affect this group of people: the response is positive, negative, or neutral, as when the moderator is speaking. Digging through these numbers could help us convey our argument. Text mining could let us draw a line from what a candidate says, to how people react, to the eventual numbers at the polls. It would be important to understand how many people are actually studied with these devices and how well they represent the general public, or undecided voters as a whole. I believe these studies produce large amounts of data that can help us draw direct correlations to election-day results. I am a little unsure how we will actually obtain these numbers and what programs are used to extract the pertinent data.


As Ted Underwood mentioned in the reading, some of the biggest obstacles in text mining are not only finding the data needed but also acquiring the skills to collect the correct data.

One reason is that our topic revolves around social media, which can be traced back not just to MySpace but to early social networking services such as email, chat services, and other early internet social structures. Also, for more recent history, text mining becomes easier, because we have more resources as sites, blogs, and social applications become more accessible and popular.

After our group uncovers more secondary documents, we can feed them into a Wordle-like application and look for common themes such as “undecided,” “voting,” and the different feelings that come with being a first-time voter. These similarities can help us decide which aspects of the sources deserve our attention and help us refine our final historical question.
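A Wordle-style cloud is, underneath, just a picture of word frequencies, so the computation it relies on can be sketched in a few lines of Python; the sample sentence and stopword list below are invented for illustration:

```python
from collections import Counter
import re

# A tiny, made-up stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in",
             "i", "as", "am", "about"}


def word_frequencies(text):
    """Tokenize the text, drop stopwords, and count the remaining words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


# Hypothetical blog excerpt from a first-time voter, for illustration only.
sample = "As an undecided first-time voter, I am undecided about voting."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

A word-cloud tool then simply draws each word at a size proportional to its count.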

In the case of secondary sources, my group may find itself in the same predicament Underwood found himself in during his own research.

“A lot of excitement about digital humanities is premised on the notion that we already have large collections of digitized sources waiting to be used. But it’s not true, because page images are not the same thing as clean, machine-readable text.” – Underwood

However, many of our social media sources can serve as primary sources, with interviews and blogs to mine and various social networks to comb through by means of Twitter hashtags, trending topics, and blogging categories.


One of the potential questions our group is considering researching is: “How common is it for a president to break promises made during the presidential campaign?” To answer this question and draw parallels between current and historical elections, we would have to process quite a large amount of text and find just the information we need to address our historical question. Text mining will be an essential tool in our analysis.

We will use lexical analysis, based on searching for key words in candidates’ speeches, to identify the major promises they made during their presidential campaigns. We will also look at the frequency of these promises: how often, in their debates, interviews, speeches, and other public appearances, they repeat them. We may even identify certain patterns in their speeches related to their promises.
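One simple way to surface such repeated phrases is to count n-grams (runs of n consecutive words) and keep the ones that recur; here is a minimal Python sketch, with an invented stump-speech excerpt standing in for a real transcript:

```python
from collections import Counter
import re


def repeated_ngrams(text, n=3, min_count=2):
    """Return every n-word phrase that occurs at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return {g: c for g, c in counts.items() if c >= min_count}


# Hypothetical stump-speech excerpt, for illustration only.
speech = (
    "I will cut taxes for families. When I am elected, "
    "I will cut taxes and I will cut spending."
)
print(repeated_ngrams(speech, n=3))
```

Phrases that keep resurfacing across many speeches are good candidates for the “promises” to track against the record in office.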

Then we will compare the information we gathered about campaign promises with the actions these candidates actually took once elected to the presidency. In doing so, our goal is to find out whether it is common in politics for presidential candidates to make false promises to voters, and whether voters can trust these candidates.


Text mining involves a program analyzing large volumes of unstructured data in order to extract specific words and key phrases.

Since both of our historical questions proposed so far involve social media, we will need to use as many social media websites as we can because larger amounts of data will be better for comparison and analysis.

Unlike Ted Underwood, who needed literary works for his project, we can obtain the necessary information straight from the social media websites.

As for the necessity of learning how to program, I am not sure whether it will be necessary for our project. The public toolsets for text mining, given as examples on Professor Underwood’s website, seem sufficient for the job.

Text mining will help us divide and categorize information, thereby revealing patterns.

In our case, text mining will be used to determine how the names of the presidential candidates, the phrase “Presidential Election 2012,” and popular political issues are being used by young and first-time voters. This election is arguably the first to be so immersed in social media, which makes it perfect for this project.

I am not sure if my response is adequate for the posted question. Perhaps if I had come to class last Wednesday, it would have been better. Unfortunately, the train tracks between my house and Baruch were broken at Prospect Park station.



Our group can use text mining to answer the historical question we have proposed: whether and how the outcome of presidential debates determined who won the election. Text mining would allow us to see whether key words or phrases used by candidates during the debates had a positive or negative effect on voters and, as a result, attracted or deterred them. Text mining will also help us determine whether aspects apart from analytics played a role in deciding the outcome of elections based on a candidate’s debate performance. During debates, candidates present various types of data to make their case to voters: statistical data, such as their track record while serving in their current governmental post; and conditional data, such as what they expect to accomplish if chosen as president. Because debating involves not only the factual data candidates present but also the manner in which they convey it (their behavioral disposition, body language, tone of voice, eye contact, etc.), data mining will help capture the effects of these different factors and the role they played in steering the outcome of the election. However, we keep in mind that our analysis rests on the premise that the election process is very complex, and trying to hold all variables stable poses multifaceted challenges.


By Monday, Oct. 15, at 8:00am:

  • Complete Reading:
    • Richard White, “What is Spatial History?” Spatial History Lab: Working paper; Submitted February 1, 2010.
    • Explore Hypercities.com. Come with a question about historical maps for our guest speaker.
  • Blog Post(s):
    • Each member of your group
      • In 200-300 words answer the following questions: How could your group use text mining to answer the historical question(s) you’ve proposed thus far?
    • One member of the group:
      • post 3-5 secondary sources your group will be reading to provide background information.
      • For secondary sources, you might look at JSTOR, search the library catalog, or consult a librarian. Comment on this post if you have any questions that you think we can help you with.



James Grossman, “‘Big Data’: An Opportunity for Historians?” March 2012.
  • Big Data
  • “And because we [historians] look for stories—for ways of synthesizing diverse strands into narrative themes—we usually look for interactions among variables that to other eyes might not seem related.”
  • Importance of collaboration: e.g., joining “the historian’s facility with sifting and contextualizing information to the computer scientist’s (or marketing professional’s) ability to generate and process data.”

Ted Underwood, “Where to start with text mining,” The Stone and the Shell, August 14, 2012

  • “Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory.”
    • quantitative v. qualitative?
  • Close reading v. distant reading
  • OCR challenges with primary sources
  • Wordle
  • Tools?  Some programming needed.
  • “you can build complex arguments on a very simple foundation”
  • What can we do?
    • Categorize documents
    • Contrast the vocabulary of different corpora
    • Trace the history of particular features (words or phrases) over time (e.g. ngram viewer, Bookworm)
    • Cluster features that tend to be associated in a given corpus of documents (aka topic modeling)
    • Entity extraction
    • Visualization (e.g. geographically, network graph)
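The “trace the history of particular features over time” item in the notes above can be sketched in a few lines of Python; the toy corpus below stands in for a real set of dated debate transcripts:

```python
import re


def term_frequency_by_year(docs_by_year, term):
    """Occurrences of a term per year, normalized per 1,000 words."""
    trend = {}
    for year, text in sorted(docs_by_year.items()):
        words = re.findall(r"[a-z']+", text.lower())
        hits = len(re.findall(re.escape(term.lower()), text.lower()))
        trend[year] = round(1000 * hits / max(len(words), 1), 2)
    return trend


# Invented toy corpus, standing in for real transcripts by year.
corpus = {
    1988: "the war on drugs demands action drugs drugs",
    2012: "the economy demands action",
}
print(term_frequency_by_year(corpus, "drugs"))
```

Normalizing per 1,000 words matters because transcripts differ in length; this is the same idea the ngram viewer applies at book scale.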

Group Projects

Group 1
Caroline, Anton, Eli, Cameron, Leanardo

Group 2
Estevan, Tatsiana, Phillip, Jordan Burgos

Group 3 – Instigator
Felipe, Jordan Smith, Robert, Pablo

Group 4 – Contra
Guang, Cary, William, Stephen, Shaif


With the emotional weight of the most memorialized date in the world, there are a few things I would like to point out.

First of all, as a person who was not in the United States during the 9/11 attack, realizing that it may be the most thoroughly documented event in history is kind of amazing. We are living through history that will definitely be referred to in years to come. Because it was so precisely documented, it shows how history is becoming a more precise, reliable, credible science.

No one can doubt that it happened.

Furthermore, I would like to point out the number of casualties: roughly 2,700. That is a massive number of people. There are towns in this world smaller than that, perhaps even cities. Taking out an entire town is a big deal, at least in my book. Thus we saw all the changes that happened in 2003: security tightening up and all sorts of regulations to make sure it doesn’t happen again.

The analytical question I was thinking about while doing the reading is why something had to happen before we realized that the possibility of such an event, however minute, still existed. Who would do such a thing? What would be the consequences of such a disaster? And, most importantly, why?

For future reference, to prevent attacks we shouldn’t wait for them to happen; we should plan ahead, and to do so we can look at what other countries have already thought of. This is not cheating, it’s saving lives. Take Israel, for example: it has such tight security because of the hard experience it has had with terror attacks. There are metal detectors and checkpoints that you must pass through almost everywhere you go, especially in public locations. Another concern I have for NYC is that bridges, tunnels, and malls are not highly regulated from a security standpoint, yet courts have metal detectors.




This is just some data mining from the reading: “The Professor Who Fooled Wikipedia Got Caught by Reddit.”

There are always bad actors, and thus vandalism happens everywhere, especially on the web, because that is where it is most likely to be seen.

Vandalism is done to influence people, so a great place for it is the WWW.

That’s why Wikipedia is hard to rely on: it is edited by whoever desires to, and among us, unfortunately, there are bad actors. All you can do is hope they either disappear (unlikely) or stay away.