As I Told You, I Had a Meeting with Eneko and Kepa Today to Discuss the Thesis

Hi Philip,

As I told you, I had a meeting with Eneko and Kepa today to discuss the thesis.

First of all, as you said, the classes are really time-consuming, so it is true that the pace of work on the thesis is not ideal, and I have to take that into account when planning the thesis. But the most time-consuming year is the first one, and I have already got through it. Now that I will be teaching the same material again, my priority has to be the thesis.

As for the thesis itself, there are three possibilities: to continue at UMCP with you; to continue at the University of the Basque Country with Eneko; or to continue at the University of the Basque Country with two co-directors, Eneko and you.

As you can imagine, I would like to finish it at UMCP, but this is not only my decision but also yours and UMCP's. I do not want to push you in any way, so I won't insist or add anything to what I have already said.

As for the work done and the sketch for the thesis, I am sending you the sketch we (you and I) planned the last time I was there, and in this mail I describe what has been done on top of that sketch.

I guess the requirements differ depending on where I do the thesis. I have been speaking with Eneko and Kepa, and I will try to describe briefly what the University of the Basque Country would consider a thesis; you will have to tell me what counts as a thesis at Maryland. The time frame should be no more than two years from now.

Goal: Automatic Acquisition of Subcategorization Information for Basque verbs

1. Introduction and state of the art

Study and analysis of different systems

Brent, Manning, Carroll and Briscoe, Korhonen, Schulte im Walde, Maragoudakis et al., etc.

2. Standard approach: the monolingual way

Basic algorithm:

Monolingual corpus + statistical measures (χ², MI…) → subcategorization frame candidates + filters (for English, the possible subcategorization frames are taken from dictionaries) → real subcategorization frames

We tried this approach. The evaluation was difficult to carry out (see the evaluation below), and the results are difficult to compare with other work (see the ACL workshop article I presented).
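
To make the basic algorithm concrete, here is a minimal sketch in Python of the statistical step only, using pointwise mutual information as the measure. It is an illustration, not the code we actually used. The function name `score_frames`, the representation of a frame as a tuple of case labels, the `min_count` and `mi_threshold` parameters, and the toy observations are all assumptions made for the example.

```python
import math
from collections import Counter


def score_frames(observations, min_count=2, mi_threshold=0.0):
    """Keep (verb, frame) pairs whose pointwise mutual information is high.

    observations: iterable of (verb, frame) pairs read off a parsed corpus,
    where a frame is a tuple of case labels (a hypothetical representation).
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    verb_counts = Counter(verb for verb, _ in observations)
    frame_counts = Counter(frame for _, frame in observations)
    total = len(observations)

    accepted = {}
    for (verb, frame), n in pair_counts.items():
        if n < min_count:  # too rare to trust the statistic
            continue
        # PMI = log2( P(verb, frame) / (P(verb) * P(frame)) )
        mi = math.log2(n * total / (verb_counts[verb] * frame_counts[frame]))
        if mi > mi_threshold:
            accepted.setdefault(verb, {})[frame] = mi
    return accepted


# Toy, made-up observations (not real corpus counts).
toy = (
    [("eman", ("erg", "abs", "dat"))] * 4
    + [("eman", ("erg", "abs"))]
    + [("etorri", ("abs",))] * 4
    + [("etorri", ("erg", "abs"))]
)
candidates = score_frames(toy)
```

Where a dictionary-derived list of possible frames exists, the filtering step would then amount to keeping only the candidates that also appear in that list. The choice of `min_count` and `mi_threshold` is exactly the kind of threshold that is hard to fix in practice.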

Shortcomings and problems of this approach for Basque:

  1. We lack a high-quality Basque parser, so the parsed corpus is noisy.
  2. The corpus is small for applying statistics.
  3. The argument/adjunct distinction is slippery and loose, even for the human taggers.
  4. It is difficult to establish a threshold that statistically tells arguments apart from adjuncts.
  5. There is no pre-established list of possible Basque Subcat. Frames coming from a dictionary, as LDOCE provides for English, so there is no way to filter the candidates obtained statistically.
  6. Verb sense disambiguation is needed, since the Subcat. Frames differ depending on the sense.

So what to do? As we thought in the Sketch, we can try to address some of these shortcomings by using bilingual information.

3. Bilingual way

In the first Sketch, which I wrote the last time I was in Maryland, we thought of two ways to proceed bilingually.

1. Use comparable corpora to get better parses.

2. Use parallel corpora (English-Basque, Spanish-Basque) to transfer the English or Spanish subcategorization information to Basque.

We have been exploring the first.

Comparable Corpora

  1. Use comparable corpora to improve the Basque parses: transfer attachment information from English to Basque (LREC 2004), plus heuristics based on linguistic knowledge.

We have already done some experiments on this task, and information transfer is possible (I am sending you attached the paper we sent to COLING). The experiments we performed are at the level of simple attachment decisions over sentences with just two verbs and with no further linguistic information. Now we are extending the experiment to capture sub-sentence boundaries.

  2. Increase the Basque corpus.
  3. Run the monolingual acquisition process again, using the new parses obtained from comparable corpora instead of the old monolingual ones.
  4. Evaluation: This part is tricky. It is very difficult even for humans to tell arguments apart from adjuncts. That is why we thought of two different ways of evaluating the quality of the obtained Subcat. Frames.
     1. Against subcategorization frames obtained by human introspection. We do not have these yet, but we will work on it.
     2. Against a treebank, to see whether the obtained subcategorization frames would allow us to make the right attachment decisions. We now have a manually tagged 50,000-word treebank for Basque (I have also been supervising this work). So, for example, if we have a sentence with two verbs, the idea would be to see whether the subcategorization frames we got allow us to get the right subsentence for each verb.
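
As a rough illustration of the first kind of evaluation, acquired frames could be scored against a reference set with precision, recall and F-score. The sketch below is only an assumption about how this might be done: the function name `evaluate_frames`, the dictionary layout (verb mapped to a set of frames) and the decision to score only verbs present in the reference are hypothetical choices, and no real results are implied.

```python
def evaluate_frames(acquired, gold):
    """Score acquired frames against a reference, verb by verb.

    acquired, gold: dicts mapping a verb to a collection of frames
    (hypothetical layout). Only verbs listed in `gold` are scored.
    Returns (precision, recall, f1).
    """
    tp = fp = fn = 0
    for verb, reference in gold.items():
        predicted = set(acquired.get(verb, ()))
        reference = set(reference)
        tp += len(predicted & reference)   # frames found and correct
        fp += len(predicted - reference)   # frames found but not in reference
        fn += len(reference - predicted)   # reference frames that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Running the same function over the old monolingual frames and over the frames obtained from the improved parses, with the same reference, would give directly comparable figures.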

We should run the evaluation again, over exactly the same data, on the Subcat. Frames we got in the old monolingual experiment, and compare those results with the ones obtained for the new monolingual Subcat. Frames after improving the Basque parses.

By the end of the summer we would like to have steps 1, 2, 3 and 4 done, so as to get a grasp of the difficulties we may find in exploiting comparable corpora to get Subcat. information, and to see how to continue improving the transfer of information from comparable corpora, since several aspects have to be dealt with, such as how to restrict or disambiguate the senses of the verbs and nouns. Perhaps only then will we know how the thesis is going.

Exploiting the transfer of syntactic information from English to Basque using comparable corpora, in order to obtain monolingual subcategorization information, would be a thesis in its own right at the University of the Basque Country. On the way, perhaps (but this depends on the time), we could also exploit some semantic information, such as transferring and using selectional preferences, and use semantic domains to work with the senses of the verbs, since the subcategorization frames differ depending on the sense.

In the Sketch I presented to you a long time ago (which I attach to this message), parallel corpora were also included among the ways to explore in the thesis. My goal is to defend the thesis there rather than here, so it is for you as the director to decide whether to leave this out or not. Within this parallel way, two options were planned:

  1. To use an English-Basque parallel corpus.
  2. To use a Spanish-Basque parallel corpus (because in principle more parallel text is available). But you told me to ask whether there were subcategorization frames for Spanish, and it seems that there are not, or at least not as such. So pursuing this option does not seem promising.

I would like to know your opinion, what you would add or change in what I have told you, your comments on the paper, etc. It would probably be a good idea to discuss your comments on the phone.

Waiting for your response, and thank you for everything.